Supported Matchers

ML Matchers

class py_entitymatching.DTMatcher(*args, **kwargs)

Decision Tree matcher.

Parameters
  • *args,**kwargs – The arguments to scikit-learn’s Decision Tree classifier.

  • name (string) – The name of this matcher (defaults to None). If the matcher name is None, the class automatically generates a string and assigns it as the name.

Notes

For more details please see

fit(x=None, y=None, table=None, exclude_attrs=None, target_attr=None)

Fit interface for the matcher.

The fit method can be called in two ways. The first is similar to scikit-learn: the feature vectors and the target attribute are given as projected DataFrames (x and y). The second passes the full DataFrame (table) and explicitly specifies the feature vectors (by listing the attributes to be excluded) and the target attribute.

All input parameters default to None so that a single function can support both interfaces.

Parameters
  • x (DataFrame) – The input feature vectors given as pandas DataFrame (defaults to None).

  • y (DataFrame) – The input target attribute given as a pandas DataFrame with a single column (defaults to None).

  • table (DataFrame) – The input pandas DataFrame containing feature vectors and target attribute (defaults to None).

  • exclude_attrs (list) – The list of attributes that should be excluded from the input table to get the feature vectors.

  • target_attr (string) – The target attribute in the input table.
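The two calling conventions described above can be sketched without any dependencies. This is only an illustration, not the library's implementation: `resolve_fit_inputs` is a hypothetical helper, rows are plain dicts standing in for DataFrame rows, and dropping the target attribute from the features is an assumption of the sketch.

```python
def resolve_fit_inputs(x=None, y=None, table=None, exclude_attrs=None, target_attr=None):
    """Hypothetical helper: return (features, labels) from either convention.

    Rows are plain dicts standing in for DataFrame rows (an assumption
    made to keep this sketch dependency-free).
    """
    if x is not None and y is not None:
        # Convention 1: the caller already projected features and labels.
        return list(x), list(y)
    if table is not None and exclude_attrs is not None and target_attr is not None:
        # Convention 2: derive the features by dropping the excluded
        # attributes (and, in this sketch, the target attribute) from each row.
        dropped = set(exclude_attrs) | {target_attr}
        features = [{k: v for k, v in row.items() if k not in dropped} for row in table]
        labels = [row[target_attr] for row in table]
        return features, labels
    raise ValueError("Pass either (x, y) or (table, exclude_attrs, target_attr).")


table = [
    {"_id": 0, "name_sim": 0.9, "zip_sim": 1.0, "label": 1},
    {"_id": 1, "name_sim": 0.2, "zip_sim": 0.0, "label": 0},
]
features, labels = resolve_fit_inputs(table=table, exclude_attrs=["_id"], target_attr="label")
print(features)
print(labels)
```

Because every parameter defaults to None, the helper can tell which convention the caller chose simply by checking which arguments were supplied.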

predict(x=None, table=None, exclude_attrs=None, target_attr=None, append=False, return_probs=False, probs_attr=None, inplace=True)

Predict interface for the matcher.

The predict method can be called in two ways. The first is similar to scikit-learn: the feature vectors are given as a projected DataFrame (x). The second passes the full DataFrame (table) and explicitly specifies the feature vectors (by listing the attributes to be excluded).

The DataFrame and attribute parameters (x, table, exclude_attrs, target_attr, probs_attr) default to None so that a single function can support both interfaces.

Parameters
  • x (DataFrame) – The input pandas DataFrame containing only feature vectors (defaults to None).

  • table (DataFrame) – The input pandas DataFrame containing feature vectors and possibly other attributes (defaults to None).

  • exclude_attrs (list) – A list of attributes to be excluded from the input table to get the feature vectors (defaults to None).

  • target_attr (string) – The attribute name where the predictions need to be stored in the input table (defaults to None).

  • probs_attr (string) – The attribute name where the prediction probabilities need to be stored in the input table (defaults to None).

  • append (boolean) – A flag to indicate whether the predictions need to be appended in the input DataFrame (defaults to False).

  • return_probs (boolean) – A flag to indicate whether the prediction probabilities need to be returned (defaults to False). If set to True, the probability that each pair is a match is returned.

  • inplace (boolean) – A flag to indicate whether the append needs to be done inplace (defaults to True).

Returns

An array of predictions or a DataFrame with predictions updated.
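How the append and inplace flags might combine to produce the two return shapes can be sketched as follows. This is a dependency-free illustration under stated assumptions, not the library's code: `attach_predictions` is a hypothetical helper and rows are plain dicts standing in for DataFrame rows.

```python
import copy


def attach_predictions(table, predictions, target_attr, append=False, inplace=True):
    """Hypothetical helper: return predictions alone or stored in the table."""
    if not append:
        # No append requested: hand back just the predictions.
        return list(predictions)
    # Append requested: write into the input itself or into a copy of it.
    out = table if inplace else copy.deepcopy(table)
    for row, pred in zip(out, predictions):
        row[target_attr] = pred
    return out


rows = [{"name_sim": 0.9}, {"name_sim": 0.2}]
preds = [1, 0]

only_preds = attach_predictions(rows, preds, "predicted")
copied = attach_predictions(rows, preds, "predicted", append=True, inplace=False)
untouched = "predicted" not in rows[0]  # the input has not been modified yet
same = attach_predictions(rows, preds, "predicted", append=True, inplace=True)
```

With append=False the caller gets only the predictions; with append=True the result is a table, and inplace decides whether that table is the input object or a copy.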

class py_entitymatching.RFMatcher(*args, **kwargs)

Random Forest matcher.

Parameters
  • *args,**kwargs – The arguments to scikit-learn’s Random Forest classifier.

  • name (string) – The name of this matcher (defaults to None). If the matcher name is None, the class automatically generates a string and assigns it as the name.

fit(x=None, y=None, table=None, exclude_attrs=None, target_attr=None)

Fit interface for the matcher.

The fit method can be called in two ways. The first is similar to scikit-learn: the feature vectors and the target attribute are given as projected DataFrames (x and y). The second passes the full DataFrame (table) and explicitly specifies the feature vectors (by listing the attributes to be excluded) and the target attribute.

All input parameters default to None so that a single function can support both interfaces.

Parameters
  • x (DataFrame) – The input feature vectors given as pandas DataFrame (defaults to None).

  • y (DataFrame) – The input target attribute given as a pandas DataFrame with a single column (defaults to None).

  • table (DataFrame) – The input pandas DataFrame containing feature vectors and target attribute (defaults to None).

  • exclude_attrs (list) – The list of attributes that should be excluded from the input table to get the feature vectors.

  • target_attr (string) – The target attribute in the input table.

predict(x=None, table=None, exclude_attrs=None, target_attr=None, append=False, return_probs=False, probs_attr=None, inplace=True)

Predict interface for the matcher.

The predict method can be called in two ways. The first is similar to scikit-learn: the feature vectors are given as a projected DataFrame (x). The second passes the full DataFrame (table) and explicitly specifies the feature vectors (by listing the attributes to be excluded).

The DataFrame and attribute parameters (x, table, exclude_attrs, target_attr, probs_attr) default to None so that a single function can support both interfaces.

Parameters
  • x (DataFrame) – The input pandas DataFrame containing only feature vectors (defaults to None).

  • table (DataFrame) – The input pandas DataFrame containing feature vectors and possibly other attributes (defaults to None).

  • exclude_attrs (list) – A list of attributes to be excluded from the input table to get the feature vectors (defaults to None).

  • target_attr (string) – The attribute name where the predictions need to be stored in the input table (defaults to None).

  • probs_attr (string) – The attribute name where the prediction probabilities need to be stored in the input table (defaults to None).

  • append (boolean) – A flag to indicate whether the predictions need to be appended in the input DataFrame (defaults to False).

  • return_probs (boolean) – A flag to indicate whether the prediction probabilities need to be returned (defaults to False). If set to True, the probability that each pair is a match is returned.

  • inplace (boolean) – A flag to indicate whether the append needs to be done inplace (defaults to True).

Returns

An array of predictions or a DataFrame with predictions updated.
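The return_probs option refers to the probability that a pair is a match. One way to picture this, assuming a classifier that emits one probability per class for each pair, is to select the column belonging to the match label. The helper name, the label values 0 and 1, and the class ordering below are assumptions of this sketch, not details confirmed by the library.

```python
def match_probabilities(class_probs, classes=(0, 1), match_label=1):
    """Hypothetical helper: pick the match-class probability for each pair."""
    # Locate the column that corresponds to the match label.
    match_idx = list(classes).index(match_label)
    return [row[match_idx] for row in class_probs]


# One row per candidate pair: [P(non-match), P(match)].
class_probs = [[0.8, 0.2], [0.1, 0.9]]
probs = match_probabilities(class_probs)
print(probs)
```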

class py_entitymatching.SVMMatcher(*args, **kwargs)

SVM matcher.

Parameters
  • *args,**kwargs – The arguments to scikit-learn’s SVM classifier.

  • name (string) – The name of this matcher (defaults to None). If the matcher name is None, the class automatically generates a string and assigns it as the name.

fit(x=None, y=None, table=None, exclude_attrs=None, target_attr=None)

Fit interface for the matcher.

The fit method can be called in two ways. The first is similar to scikit-learn: the feature vectors and the target attribute are given as projected DataFrames (x and y). The second passes the full DataFrame (table) and explicitly specifies the feature vectors (by listing the attributes to be excluded) and the target attribute.

All input parameters default to None so that a single function can support both interfaces.

Parameters
  • x (DataFrame) – The input feature vectors given as pandas DataFrame (defaults to None).

  • y (DataFrame) – The input target attribute given as a pandas DataFrame with a single column (defaults to None).

  • table (DataFrame) – The input pandas DataFrame containing feature vectors and target attribute (defaults to None).

  • exclude_attrs (list) – The list of attributes that should be excluded from the input table to get the feature vectors.

  • target_attr (string) – The target attribute in the input table.

predict(x=None, table=None, exclude_attrs=None, target_attr=None, append=False, return_probs=False, probs_attr=None, inplace=True)

Predict interface for the matcher.

The predict method can be called in two ways. The first is similar to scikit-learn: the feature vectors are given as a projected DataFrame (x). The second passes the full DataFrame (table) and explicitly specifies the feature vectors (by listing the attributes to be excluded).

The DataFrame and attribute parameters (x, table, exclude_attrs, target_attr, probs_attr) default to None so that a single function can support both interfaces.

Parameters
  • x (DataFrame) – The input pandas DataFrame containing only feature vectors (defaults to None).

  • table (DataFrame) – The input pandas DataFrame containing feature vectors and possibly other attributes (defaults to None).

  • exclude_attrs (list) – A list of attributes to be excluded from the input table to get the feature vectors (defaults to None).

  • target_attr (string) – The attribute name where the predictions need to be stored in the input table (defaults to None).

  • probs_attr (string) – The attribute name where the prediction probabilities need to be stored in the input table (defaults to None).

  • append (boolean) – A flag to indicate whether the predictions need to be appended in the input DataFrame (defaults to False).

  • return_probs (boolean) – A flag to indicate whether the prediction probabilities need to be returned (defaults to False). If set to True, the probability that each pair is a match is returned.

  • inplace (boolean) – A flag to indicate whether the append needs to be done inplace (defaults to True).

Returns

An array of predictions or a DataFrame with predictions updated.

class py_entitymatching.NBMatcher(*args, **kwargs)

Naive Bayes matcher.

Parameters
  • *args,**kwargs – The arguments to scikit-learn’s Naive Bayes classifier.

  • name (string) – The name of this matcher (defaults to None). If the matcher name is None, the class automatically generates a string and assigns it as the name.

fit(x=None, y=None, table=None, exclude_attrs=None, target_attr=None)

Fit interface for the matcher.

The fit method can be called in two ways. The first is similar to scikit-learn: the feature vectors and the target attribute are given as projected DataFrames (x and y). The second passes the full DataFrame (table) and explicitly specifies the feature vectors (by listing the attributes to be excluded) and the target attribute.

All input parameters default to None so that a single function can support both interfaces.

Parameters
  • x (DataFrame) – The input feature vectors given as pandas DataFrame (defaults to None).

  • y (DataFrame) – The input target attribute given as a pandas DataFrame with a single column (defaults to None).

  • table (DataFrame) – The input pandas DataFrame containing feature vectors and target attribute (defaults to None).

  • exclude_attrs (list) – The list of attributes that should be excluded from the input table to get the feature vectors.

  • target_attr (string) – The target attribute in the input table.

predict(x=None, table=None, exclude_attrs=None, target_attr=None, append=False, return_probs=False, probs_attr=None, inplace=True)

Predict interface for the matcher.

The predict method can be called in two ways. The first is similar to scikit-learn: the feature vectors are given as a projected DataFrame (x). The second passes the full DataFrame (table) and explicitly specifies the feature vectors (by listing the attributes to be excluded).

The DataFrame and attribute parameters (x, table, exclude_attrs, target_attr, probs_attr) default to None so that a single function can support both interfaces.

Parameters
  • x (DataFrame) – The input pandas DataFrame containing only feature vectors (defaults to None).

  • table (DataFrame) – The input pandas DataFrame containing feature vectors and possibly other attributes (defaults to None).

  • exclude_attrs (list) – A list of attributes to be excluded from the input table to get the feature vectors (defaults to None).

  • target_attr (string) – The attribute name where the predictions need to be stored in the input table (defaults to None).

  • probs_attr (string) – The attribute name where the prediction probabilities need to be stored in the input table (defaults to None).

  • append (boolean) – A flag to indicate whether the predictions need to be appended in the input DataFrame (defaults to False).

  • return_probs (boolean) – A flag to indicate whether the prediction probabilities need to be returned (defaults to False). If set to True, the probability that each pair is a match is returned.

  • inplace (boolean) – A flag to indicate whether the append needs to be done inplace (defaults to True).

Returns

An array of predictions or a DataFrame with predictions updated.

class py_entitymatching.LinRegMatcher(*args, **kwargs)

Linear regression matcher.

Parameters
  • *args,**kwargs – The arguments to scikit-learn’s Linear Regression model.

  • name (string) – Name that should be given to this matcher.

fit(x=None, y=None, table=None, exclude_attrs=None, target_attr=None)

Fit interface for the matcher.

The fit method can be called in two ways. The first is similar to scikit-learn: the feature vectors and the target attribute are given as projected DataFrames (x and y). The second passes the full DataFrame (table) and explicitly specifies the feature vectors (by listing the attributes to be excluded) and the target attribute.

All input parameters default to None so that a single function can support both interfaces.

Parameters
  • x (DataFrame) – The input feature vectors given as pandas DataFrame (defaults to None).

  • y (DataFrame) – The input target attribute given as a pandas DataFrame with a single column (defaults to None).

  • table (DataFrame) – The input pandas DataFrame containing feature vectors and target attribute (defaults to None).

  • exclude_attrs (list) – The list of attributes that should be excluded from the input table to get the feature vectors.

  • target_attr (string) – The target attribute in the input table.

predict(x=None, table=None, exclude_attrs=None, target_attr=None, append=False, return_probs=False, probs_attr=None, inplace=True)

Predict interface for the matcher.

The predict method can be called in two ways. The first is similar to scikit-learn: the feature vectors are given as a projected DataFrame (x). The second passes the full DataFrame (table) and explicitly specifies the feature vectors (by listing the attributes to be excluded).

The DataFrame and attribute parameters (x, table, exclude_attrs, target_attr, probs_attr) default to None so that a single function can support both interfaces.

Parameters
  • x (DataFrame) – The input pandas DataFrame containing only feature vectors (defaults to None).

  • table (DataFrame) – The input pandas DataFrame containing feature vectors and possibly other attributes (defaults to None).

  • exclude_attrs (list) – A list of attributes to be excluded from the input table to get the feature vectors (defaults to None).

  • target_attr (string) – The attribute name where the predictions need to be stored in the input table (defaults to None).

  • probs_attr (string) – The attribute name where the prediction probabilities need to be stored in the input table (defaults to None).

  • append (boolean) – A flag to indicate whether the predictions need to be appended in the input DataFrame (defaults to False).

  • return_probs (boolean) – A flag to indicate whether the prediction probabilities need to be returned (defaults to False). If set to True, the probability that each pair is a match is returned.

  • inplace (boolean) – A flag to indicate whether the append needs to be done inplace (defaults to True).

Returns

An array of predictions or a DataFrame with predictions updated.

class py_entitymatching.LogRegMatcher(*args, **kwargs)

Logistic Regression matcher.

Parameters
  • *args,**kwargs – The arguments to scikit-learn’s Logistic Regression classifier.

  • name (string) – The name of this matcher (defaults to None). If the matcher name is None, the class automatically generates a string and assigns it as the name.

fit(x=None, y=None, table=None, exclude_attrs=None, target_attr=None)

Fit interface for the matcher.

The fit method can be called in two ways. The first is similar to scikit-learn: the feature vectors and the target attribute are given as projected DataFrames (x and y). The second passes the full DataFrame (table) and explicitly specifies the feature vectors (by listing the attributes to be excluded) and the target attribute.

All input parameters default to None so that a single function can support both interfaces.

Parameters
  • x (DataFrame) – The input feature vectors given as pandas DataFrame (defaults to None).

  • y (DataFrame) – The input target attribute given as a pandas DataFrame with a single column (defaults to None).

  • table (DataFrame) – The input pandas DataFrame containing feature vectors and target attribute (defaults to None).

  • exclude_attrs (list) – The list of attributes that should be excluded from the input table to get the feature vectors.

  • target_attr (string) – The target attribute in the input table.

predict(x=None, table=None, exclude_attrs=None, target_attr=None, append=False, return_probs=False, probs_attr=None, inplace=True)

Predict interface for the matcher.

The predict method can be called in two ways. The first is similar to scikit-learn: the feature vectors are given as a projected DataFrame (x). The second passes the full DataFrame (table) and explicitly specifies the feature vectors (by listing the attributes to be excluded).

The DataFrame and attribute parameters (x, table, exclude_attrs, target_attr, probs_attr) default to None so that a single function can support both interfaces.

Parameters
  • x (DataFrame) – The input pandas DataFrame containing only feature vectors (defaults to None).

  • table (DataFrame) – The input pandas DataFrame containing feature vectors and possibly other attributes (defaults to None).

  • exclude_attrs (list) – A list of attributes to be excluded from the input table to get the feature vectors (defaults to None).

  • target_attr (string) – The attribute name where the predictions need to be stored in the input table (defaults to None).

  • probs_attr (string) – The attribute name where the prediction probabilities need to be stored in the input table (defaults to None).

  • append (boolean) – A flag to indicate whether the predictions need to be appended in the input DataFrame (defaults to False).

  • return_probs (boolean) – A flag to indicate whether the prediction probabilities need to be returned (defaults to False). If set to True, the probability that each pair is a match is returned.

  • inplace (boolean) – A flag to indicate whether the append needs to be done inplace (defaults to True).

Returns

An array of predictions or a DataFrame with predictions updated.

class py_entitymatching.XGBoostMatcher(*args, **kwargs)

XGBoost matcher.

Parameters
  • *args,**kwargs – The arguments to the XGBoost classifier.

  • name (string) – The name of this matcher (defaults to None). If the matcher name is None, the class automatically generates a string and assigns it as the name.

fit(x=None, y=None, table=None, exclude_attrs=None, target_attr=None)

Fit interface for the matcher.

The fit method can be called in two ways. The first is similar to scikit-learn: the feature vectors and the target attribute are given as projected DataFrames (x and y). The second passes the full DataFrame (table) and explicitly specifies the feature vectors (by listing the attributes to be excluded) and the target attribute.

All input parameters default to None so that a single function can support both interfaces.

Parameters
  • x (DataFrame) – The input feature vectors given as pandas DataFrame (defaults to None).

  • y (DataFrame) – The input target attribute given as a pandas DataFrame with a single column (defaults to None).

  • table (DataFrame) – The input pandas DataFrame containing feature vectors and target attribute (defaults to None).

  • exclude_attrs (list) – The list of attributes that should be excluded from the input table to get the feature vectors.

  • target_attr (string) – The target attribute in the input table.

predict(x=None, table=None, exclude_attrs=None, target_attr=None, append=False, return_probs=False, probs_attr=None, inplace=True)

Predict interface for the matcher.

The predict method can be called in two ways. The first is similar to scikit-learn: the feature vectors are given as a projected DataFrame (x). The second passes the full DataFrame (table) and explicitly specifies the feature vectors (by listing the attributes to be excluded).

The DataFrame and attribute parameters (x, table, exclude_attrs, target_attr, probs_attr) default to None so that a single function can support both interfaces.

Parameters
  • x (DataFrame) – The input pandas DataFrame containing only feature vectors (defaults to None).

  • table (DataFrame) – The input pandas DataFrame containing feature vectors and possibly other attributes (defaults to None).

  • exclude_attrs (list) – A list of attributes to be excluded from the input table to get the feature vectors (defaults to None).

  • target_attr (string) – The attribute name where the predictions need to be stored in the input table (defaults to None).

  • probs_attr (string) – The attribute name where the prediction probabilities need to be stored in the input table (defaults to None).

  • append (boolean) – A flag to indicate whether the predictions need to be appended in the input DataFrame (defaults to False).

  • return_probs (boolean) – A flag to indicate whether the prediction probabilities need to be returned (defaults to False). If set to True, the probability that each pair is a match is returned.

  • inplace (boolean) – A flag to indicate whether the append needs to be done inplace (defaults to True).

Returns

An array of predictions or a DataFrame with predictions updated.

Rule-Based Matcher
