Extracting Feature Vectors
py_entitymatching.extract_feature_vecs(candset, attrs_before=None, feature_table=None, attrs_after=None, verbose=False, show_progress=True, n_jobs=1)
- This function extracts feature vectors from a DataFrame (typically a labeled candidate set).
- Specifically, this function uses the feature table, together with the ltable and rtable recorded in the candset's metadata, to extract feature vectors.
- Parameters:
- candset (DataFrame) – The input candidate set for which the feature vectors should be extracted.
- attrs_before (list) – The list of attributes from the input candset that should be added before the feature vectors (defaults to None).
- feature_table (DataFrame) – A DataFrame containing the list of features that should be used to compute the feature vectors (defaults to None).
- attrs_after (list) – The list of attributes from the input candset that should be added after the feature vectors (defaults to None).
- verbose (boolean) – A flag to indicate whether the debug information should be displayed (defaults to False).
- show_progress (boolean) – A flag to indicate whether the progress of extracting feature vectors must be displayed (defaults to True).
- Returns:
- A pandas DataFrame containing feature vectors.
- The DataFrame will have metadata ltable and rtable, pointing to the same ltable and rtable as the input candset.
- Also, the output DataFrame will have three columns (key, foreign key to ltable, foreign key to rtable) copied from the input candset. These three columns precede the columns mentioned in attrs_before.
- Raises:
- AssertionError – If candset is not of type pandas DataFrame.
- AssertionError – If attrs_before has attributes that are not present in the input candset.
- AssertionError – If attrs_after has attributes that are not present in the input candset.
- AssertionError – If feature_table is set to None.
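The column layout and attribute checks described above can be pictured with a small standalone sketch. It is not the library's code: the helper output_columns, the column names ('_id', 'ltable_ID', 'rtable_ID') and the feature names are all hypothetical, chosen only to illustrate the documented ordering (key and foreign keys first, then attrs_before, then the feature columns, then attrs_after).

```python
def output_columns(candset_columns, key, fk_ltable, fk_rtable,
                   feature_names, attrs_before=None, attrs_after=None):
    """Return the documented column order (illustrative helper, not library code)."""
    attrs_before = list(attrs_before or [])
    attrs_after = list(attrs_after or [])

    # Mirror the documented checks: every requested attribute
    # must be present in the input candidate set.
    for attr in attrs_before:
        assert attr in candset_columns, f"attrs_before: {attr!r} not in candset"
    for attr in attrs_after:
        assert attr in candset_columns, f"attrs_after: {attr!r} not in candset"

    # key and the two foreign keys come first, then attrs_before,
    # then the feature columns, then attrs_after.
    return ([key, fk_ltable, fk_rtable] + attrs_before
            + list(feature_names) + attrs_after)


# Hypothetical column and feature names, for illustration only.
cols = output_columns(
    candset_columns=['_id', 'ltable_ID', 'rtable_ID', 'title', 'gold_labels'],
    key='_id', fk_ltable='ltable_ID', fk_rtable='rtable_ID',
    feature_names=['title_title_jac', 'title_title_lev'],
    attrs_before=['title'], attrs_after=['gold_labels'],
)
print(cols)
```

Requesting an attribute that the candidate set does not contain trips the assertion in this sketch, in the same spirit as the AssertionError cases listed above.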
- Examples
>>> import py_entitymatching as em
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> match_f = em.get_features_for_matching(A, B)
>>> # G is the labeled dataframe which should be converted into feature vectors
>>> H = em.extract_feature_vecs(G, feature_table=match_f, attrs_before=['title'], attrs_after=['gold_labels'])
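To make the role of the feature table, ltable and rtable more concrete, here is a minimal pure-Python sketch of the idea: for each pair in the candidate set, look up the two referenced tuples and apply every feature function to them. This is only a conceptual illustration under stated assumptions, not how py_entitymatching is implemented. The tables are plain dictionaries, the feature table is a list of (name, left attribute, right attribute, function) tuples, and all names such as extract_sketch, jaccard_tokens and '_id' are hypothetical.

```python
def jaccard_tokens(left, right):
    """Jaccard similarity over lowercase whitespace tokens."""
    a, b = set(left.lower().split()), set(right.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0


def exact_match(left, right):
    """1 if the two values are equal, else 0."""
    return 1 if left == right else 0


# Hypothetical stand-ins for ltable and rtable, keyed by their ID.
ltable = {1: {'title': 'data integration systems', 'year': 2001},
          2: {'title': 'entity matching', 'year': 2005}}
rtable = {10: {'title': 'data integration system', 'year': 2001},
          11: {'title': 'query optimization', 'year': 1999}}

# Hypothetical labeled candidate set: one row per (ltable, rtable) pair.
candset = [{'_id': 0, 'ltable_ID': 1, 'rtable_ID': 10, 'gold_labels': 1},
           {'_id': 1, 'ltable_ID': 2, 'rtable_ID': 11, 'gold_labels': 0}]

# Stand-in for a feature table: (feature name, left attr, right attr, function).
feature_table = [('title_jac', 'title', 'title', jaccard_tokens),
                 ('year_exm', 'year', 'year', exact_match)]


def extract_sketch(candset, feature_table, ltable, rtable, attrs_after=()):
    """Build one feature vector (a dict) per candidate pair."""
    rows = []
    for pair in candset:
        left = ltable[pair['ltable_ID']]    # tuple referenced by the ltable foreign key
        right = rtable[pair['rtable_ID']]   # tuple referenced by the rtable foreign key
        row = {'_id': pair['_id'],
               'ltable_ID': pair['ltable_ID'],
               'rtable_ID': pair['rtable_ID']}
        for name, l_attr, r_attr, func in feature_table:
            row[name] = func(left[l_attr], right[r_attr])
        for attr in attrs_after:            # copied through from the candset
            row[attr] = pair[attr]
        rows.append(row)
    return rows


vecs = extract_sketch(candset, feature_table, ltable, rtable,
                      attrs_after=['gold_labels'])
for v in vecs:
    print(v)
```

Each resulting row carries the key and foreign keys, one value per feature, and the label column copied through at the end, which is the shape a downstream matcher would typically be trained on.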