Labeling

py_entitymatching.label_table(table, label_column_name, verbose=False)

Label a pandas DataFrame (for supervised learning purposes).

This function labels a DataFrame, typically for supervised learning. It expects an input DataFrame that has the metadata of a candidate set (such as key, fk_ltable, fk_rtable, ltable, rtable). The function creates a copy of the input DataFrame, adds a label column at the end of the copy, fills that column with 0, invokes a GUI in which the user enters labels (0: non-match, 1: match), and finally returns the labeled DataFrame. It also copies the properties from the input DataFrame to the output DataFrame.
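The non-interactive part of these steps (copy the table, append the label column, fill it with 0) can be sketched with plain pandas. This is only an illustration under stated assumptions: `prepare_for_labeling` is a hypothetical helper, not part of py_entitymatching, the column names in the sample table are made up, the GUI step is replaced by a manual assignment, and the catalog properties are not copied here.

```python
import pandas as pd


def prepare_for_labeling(table, label_column_name):
    """Hypothetical helper: mimic the pre-GUI steps described above."""
    # Work on a copy so the input DataFrame is left untouched.
    labeled = table.copy()
    # Assigning a new column appends it at the end; every row starts as 0 (non-match).
    labeled[label_column_name] = 0
    return labeled


# A made-up candidate set with a key and two foreign-key columns.
candidates = pd.DataFrame({
    "_id": [0, 1, 2],
    "ltable_ID": ["a1", "a2", "a3"],
    "rtable_ID": ["b7", "b2", "b9"],
})

prepared = prepare_for_labeling(candidates, "label")

# Stand-in for the GUI: mark the second pair as a match.
prepared.loc[1, "label"] = 1
```

In the real function the user supplies the 0/1 values through the GUI instead of the manual assignment shown in the last line.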

Parameters
  • table (DataFrame) – The input DataFrame to be labeled. Specifically, a DataFrame whose candidate-set metadata (such as key, fk_ltable, fk_rtable, ltable, rtable) is present in the catalog.

  • label_column_name (string) – The name of the column that will hold the labels entered by the user.

  • verbose (boolean) – A flag to indicate whether more detailed information about the execution steps should be printed out (default value is False).

Returns

A new DataFrame with the labels entered by the user. The output DataFrame’s properties are set to match those of the input DataFrame.

Raises
  • AssertionError – If table is not of type pandas DataFrame.

  • AssertionError – If label_column_name is not of type string.

  • AssertionError – If label_column_name is already present in the input table.
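The three conditions listed above can be checked before calling the function, which avoids opening the GUI on bad input. The following is a minimal sketch, not the library's own validation code: `check_label_table_inputs` is a hypothetical helper, and the assertion messages are illustrative.

```python
import pandas as pd


def check_label_table_inputs(table, label_column_name):
    """Hypothetical pre-check mirroring the documented AssertionError cases."""
    # Case 1: the table must be a pandas DataFrame.
    assert isinstance(table, pd.DataFrame), "table is not a pandas DataFrame"
    # Case 2: the label column name must be a string.
    assert isinstance(label_column_name, str), "label_column_name is not a string"
    # Case 3: the label column must not already exist in the table.
    assert label_column_name not in table.columns, (
        "label_column_name is already present in the input table"
    )


# A made-up table that already contains a 'label' column.
sample = pd.DataFrame({"_id": [0, 1], "label": [0, 1]})

check_label_table_inputs(sample, "gold")  # passes silently: 'gold' is a new column name
```

Calling the helper with `"label"` as the column name on this sample table would raise an AssertionError, because that column already exists.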

Examples

>>> import py_entitymatching as em
>>> G = em.label_table(S, label_column_name='label') # S is the (sampled) table that has to be labeled.