Selecting Matcher

py_entitymatching.select_matcher(matchers, x=None, y=None, table=None, exclude_attrs=None, target_attr=None, metric_to_select_matcher='precision', metrics_to_display=['precision', 'recall', 'f1'], k=5, n_jobs=-1, random_state=None)

This function selects a matcher from a given list of matchers based on a given metric.

Specifically, this function uses scikit-learn’s cross-validation internally to select a matcher. It can be called in two ways. The first is an interface similar to scikit-learn’s, where the feature vectors and the target attribute are given as projected DataFrames (x and y). The second is to give the full DataFrame (table) and explicitly specify the feature vectors (by listing the attributes to be excluded in exclude_attrs) and the target attribute (target_attr).

Note that the data-related input parameters (x, y, table, exclude_attrs, and target_attr) all default to None. This is done to support both interfaces in a single function.
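The two calling styles are meant to describe the same training data. The sketch below illustrates that equivalence with a hypothetical helper, `split_features_and_target`, which is not part of py_entitymatching. Plain dictionaries stand in for DataFrame rows so the example runs without dependencies, and the feature names `name_jac` and `zip_exm` are made up for illustration. It assumes the target attribute is always left out of the feature vectors, even when it is not listed in `exclude_attrs`.

```python
def split_features_and_target(rows, exclude_attrs, target_attr):
    """Hypothetical helper: derive (x, y) from a labeled table.

    rows          -- list of dicts, one per candidate pair (stand-in for a DataFrame)
    exclude_attrs -- column names that are not features (ids and similar)
    target_attr   -- column name holding the match / non-match label
    """
    # Assumption: the target column is never treated as a feature.
    dropped = set(exclude_attrs) | {target_attr}
    x = [{k: v for k, v in row.items() if k not in dropped} for row in rows]
    y = [row[target_attr] for row in rows]
    return x, y


# Toy labeled table; the values are invented for illustration only.
train = [
    {"_id": 0, "ltable_id": "a1", "rtable_id": "b1",
     "name_jac": 0.9, "zip_exm": 1, "gold_labels": 1},
    {"_id": 1, "ltable_id": "a2", "rtable_id": "b7",
     "name_jac": 0.2, "zip_exm": 0, "gold_labels": 0},
]

x, y = split_features_and_target(
    train,
    exclude_attrs=["_id", "ltable_id", "rtable_id"],
    target_attr="gold_labels",
)
print(x)  # feature columns only
print(y)  # labels only
```

Passing `table`, `exclude_attrs`, and `target_attr` corresponds to the second calling style; passing already-projected `x` and `y` corresponds to the first.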

Parameters
  • matchers (list) – List of ML matchers (MLMatcher objects) to select from.

  • x (DataFrame) – Input feature vectors given as a pandas DataFrame (defaults to None).

  • y (DataFrame) – Input target attribute given as a pandas DataFrame with a single column (defaults to None).

  • table (DataFrame) – Input pandas DataFrame containing feature vectors and target attribute (defaults to None).

  • exclude_attrs (list) – The list of attributes that should be excluded from the input table to get the feature vectors (defaults to None).

  • target_attr (string) – The target attribute in the input table (defaults to None).

  • metric_to_select_matcher (string) – The metric based on which the matchers must be selected. The string can be one of ‘precision’, ‘recall’, ‘f1’ (defaults to ‘precision’).

  • metrics_to_display (list) – The metrics that will be displayed to the user. It should be a list of any of the strings ‘precision’, ‘recall’, or ‘f1’ (defaults to [‘precision’, ‘recall’, ‘f1’]).

  • k (int) – The k value for cross-validation (defaults to 5).

  • n_jobs (int) – The number of CPUs to use for the computation. -1 means ‘all CPUs’ (defaults to -1).

  • random_state (object) – Pseudo random number generator that should be used for splitting the data into folds (defaults to None).
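For reference, the three metric names accepted by metric_to_select_matcher and metrics_to_display follow the usual definitions for binary classification. The sketch below computes them by hand for one set of predictions; `precision_recall_f1` is a hypothetical helper written for this page, not a py_entitymatching function, and it assumes labels are encoded as 1 (match) and 0 (non-match).

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = match)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy labels, invented for illustration: 3 true positives,
# 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
scores = precision_recall_f1(y_true, y_pred)
print(scores)
```

In cross-validation, a score like this would be computed once per fold for each matcher.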

Returns

A dictionary containing three keys: selected_matcher, cv_stats, and drill_down_cv_stats.

The value of selected_matcher is a matcher (MLMatcher) object, cv_stats is a DataFrame containing the average metrics for each matcher, and drill_down_cv_stats is a dictionary containing, for each metric the user wants to display, a table with the score of each matcher on each fold.
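To make the relationship between the three entries concrete, here is a minimal sketch of how per-fold scores could be averaged and used to pick a matcher. It is not the library's implementation: `summarize_and_select` is a hypothetical helper, plain dictionaries replace DataFrames, matchers are represented by name strings, and the sketch assumes the matcher with the highest average value of the selection metric wins. The fold scores are invented toy numbers.

```python
from statistics import mean


def summarize_and_select(fold_scores, metric_to_select_matcher="precision"):
    """Average per-fold scores and pick the best matcher on one metric.

    fold_scores -- {metric: {matcher_name: [score for each fold]}}
    """
    matcher_names = list(next(iter(fold_scores.values())))

    # Average each metric over the folds, per matcher.
    cv_stats = {
        name: {metric: mean(per_matcher[name])
               for metric, per_matcher in fold_scores.items()}
        for name in matcher_names
    }

    # Assumption: the highest average on the chosen metric is selected.
    best = max(matcher_names,
               key=lambda name: cv_stats[name][metric_to_select_matcher])

    return {
        "selected_matcher": best,
        "cv_stats": cv_stats,
        "drill_down_cv_stats": fold_scores,
    }


# Toy 5-fold scores for two matchers (made-up values).
fold_scores = {
    "precision": {"DecisionTree": [0.9, 0.8, 1.0, 0.9, 0.9],
                  "RandomForest": [1.0, 0.9, 1.0, 0.9, 1.0]},
    "recall": {"DecisionTree": [0.9, 0.9, 1.0, 0.9, 0.8],
               "RandomForest": [0.8, 0.9, 0.9, 0.8, 0.9]},
}

result = summarize_and_select(fold_scores, metric_to_select_matcher="precision")
print(result["selected_matcher"])
print(result["cv_stats"])
```

Note that the choice depends on the selection metric: with these toy numbers, selecting on ‘recall’ instead of ‘precision’ would favor the other matcher.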

Raises

AssertionError – If metric_to_select_matcher is not one of ‘precision’, ‘recall’, or ‘f1’.

AssertionError – If any item in the list metrics_to_display is not one of ‘precision’, ‘recall’, or ‘f1’.

Examples

>>> import py_entitymatching as em
>>> dt = em.DTMatcher()
>>> rf = em.RFMatcher()
>>> # train is the table of feature vectors containing user labels
>>> result = em.select_matcher(matchers=[dt, rf], table=train, exclude_attrs=['_id', 'ltable_id', 'rtable_id'], target_attr='gold_labels', k=5)