Extracting Feature Vectors

py_entitymatching.extract_feature_vecs(candset, attrs_before=None, feature_table=None, attrs_after=None, verbose=False, show_progress=True, n_jobs=1)

This function extracts feature vectors from a DataFrame (typically a labeled candidate set).

Specifically, this function uses the feature table together with the ltable and rtable (both present in the candset’s metadata) to extract feature vectors.

Parameters
  • candset (DataFrame) – The input candidate set for which the feature vectors should be extracted.

  • attrs_before (list) – The list of attributes from the input candset that should be added before the feature vectors (defaults to None).

  • feature_table (DataFrame) – A DataFrame containing the features that should be used to compute the feature vectors (defaults to None, but must be supplied; passing None raises an AssertionError).

  • attrs_after (list) – The list of attributes from the input candset that should be added after the feature vectors (defaults to None).

  • verbose (boolean) – A flag to indicate whether the debug information should be displayed (defaults to False).

  • show_progress (boolean) – A flag to indicate whether the progress of extracting feature vectors must be displayed (defaults to True).

  • n_jobs (int) – The number of parallel jobs to use for extracting feature vectors (defaults to 1).

Returns

A pandas DataFrame containing feature vectors.

The DataFrame will have metadata ltable and rtable, pointing to the same ltable and rtable as the input candset.

In addition, the output DataFrame contains three columns copied from the input candset: the key, the foreign key to ltable, and the foreign key to rtable. These three columns precede the columns listed in attrs_before.
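
The column layout described above can be sketched in plain Python. This is only an illustration of the ordering stated in this section, not the library's implementation: the helper output_column_order and the column names '_id', 'ltable_ID', 'rtable_ID', and 'title_title_lev_sim' are hypothetical, and the sketch does not cover what happens when attrs_before or attrs_after repeat one of the key columns.

```python
def output_column_order(key, fk_ltable, fk_rtable, feature_names,
                        attrs_before=None, attrs_after=None):
    """Return the output column order described in the Returns section.

    Hypothetical helper for illustration; not part of py_entitymatching.
    """
    # The key and the two foreign keys always come first.
    columns = [key, fk_ltable, fk_rtable]
    # Attributes requested via attrs_before precede the feature columns.
    columns.extend(attrs_before or [])
    # One column per feature in the feature table.
    columns.extend(feature_names)
    # Attributes requested via attrs_after follow the feature columns.
    columns.extend(attrs_after or [])
    return columns


# All column and feature names below are assumed for the example.
order = output_column_order(
    key='_id',
    fk_ltable='ltable_ID',
    fk_rtable='rtable_ID',
    feature_names=['title_title_lev_sim'],
    attrs_before=['title'],
    attrs_after=['gold_labels'],
)
print(order)
```

Placing the label column in attrs_after keeps it as the last column, which is convenient when the feature vectors are later split into features and labels.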

Raises
  • AssertionError – If candset is not of type pandas DataFrame.

  • AssertionError – If attrs_before has attributes that are not present in the input candset.

  • AssertionError – If attrs_after has attributes that are not present in the input candset.

  • AssertionError – If feature_table is set to None.
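
The attribute and feature-table conditions listed above can be checked before calling the function, so that a problem surfaces with a clear message. The following is a minimal sketch of such a pre-check under stated assumptions: check_extract_args is a hypothetical helper, it works on a plain list of column names rather than on a DataFrame, and it omits the DataFrame type check. It is not the library's own validation code.

```python
def check_extract_args(candset_columns, feature_table,
                       attrs_before=None, attrs_after=None):
    """Mirror the documented AssertionError conditions (hypothetical helper)."""
    # feature_table defaults to None in the signature but must be supplied.
    if feature_table is None:
        raise AssertionError('feature_table must not be None')
    # Every requested attribute must be a column of the candidate set.
    for name, attrs in (('attrs_before', attrs_before),
                        ('attrs_after', attrs_after)):
        missing = [a for a in (attrs or []) if a not in candset_columns]
        if missing:
            raise AssertionError(
                '{} has attributes not in candset: {}'.format(name, missing))


# Column names are assumed for the example; 'label' is deliberately absent.
cols = ['_id', 'ltable_ID', 'rtable_ID', 'title', 'gold_labels']
placeholder_features = ['stand-in for a feature table']

try:
    check_extract_args(cols, placeholder_features, attrs_after=['label'])
    message = ''
except AssertionError as err:
    message = str(err)
print(message)
```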

Examples

>>> import py_entitymatching as em
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> match_f = em.get_features_for_matching(A, B)
>>> # G is the labeled dataframe which should be converted into feature vectors
>>> H = em.extract_feature_vecs(G, feature_table=match_f, attrs_before=['title'], attrs_after=['gold_labels'])