Creating the Features Automatically
- py_entitymatching.get_features_for_blocking(ltable, rtable, validate_inferred_attr_types=True)
This function automatically generates features that can be used for blocking purposes.
- Parameters
ltable (DataFrame) – The left input table (a pandas DataFrame) for which the features are to be generated.
rtable (DataFrame) – The right input table (a pandas DataFrame) for which the features are to be generated.
validate_inferred_attr_types (boolean) – A flag to indicate whether to show the user the inferred attribute types and the features chosen for those types.
- Returns
A pandas DataFrame containing automatically generated features.
Specifically, the DataFrame contains the following attributes: ‘feature_name’, ‘left_attribute’, ‘right_attribute’, ‘left_attr_tokenizer’, ‘right_attr_tokenizer’, ‘simfunction’, ‘function’, ‘function_source’, and ‘is_auto_generated’.
This function also sets the following global variables: _block_t, _block_s, _atypes1, _atypes2, and _block_c.
The variable _block_t contains the tokenizers used, and _block_s contains the similarity functions used for creating features.
The variables _atypes1 and _atypes2 contain the attribute types for ltable and rtable, respectively. The variable _block_c contains the attribute correspondences between the two input tables.
- Raises
AssertionError – If ltable is not of type pandas DataFrame.
AssertionError – If rtable is not of type pandas DataFrame.
AssertionError – If validate_inferred_attr_types is not of type boolean.
Examples
>>> import py_entitymatching as em
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> block_f = em.get_features_for_blocking(A, B)
Note
In the output DataFrame, two attributes need some explanation: (1) function and (2) is_auto_generated. The function attribute points to the actual Python function that implements the feature. Specifically, the function takes two tuples (one from each input table) and returns a numeric value. The is_auto_generated attribute contains either True or False; it is True only if the feature was generated automatically by py_entitymatching. This matters because the flag is used to make assumptions about the semantics of the similarity function and to use that information for scaling.
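To make the shape of a feature function concrete, here is a minimal sketch of a callable that takes two tuples and returns a numeric value. It is an illustration only, not the code that py_entitymatching generates: the function name name_jaccard_ws, the attribute 'name', and the use of plain dicts as stand-ins for table rows are all assumptions.

```python
def name_jaccard_ws(ltuple, rtuple):
    """Hypothetical feature: Jaccard similarity of whitespace tokens of 'name'."""
    # Tokenize each side on whitespace; lowercase so the comparison ignores case.
    left_tokens = set(str(ltuple["name"]).lower().split())
    right_tokens = set(str(rtuple["name"]).lower().split())
    union = left_tokens | right_tokens
    if not union:
        # Both values are empty: return 0.0 rather than dividing by zero.
        return 0.0
    return len(left_tokens & right_tokens) / len(union)


# Plain dicts stand in for one row from each input table (assumption).
left_row = {"ID": "a1", "name": "Kevin Smith"}
right_row = {"ID": "b1", "name": "Kevin J Smith"}

score = name_jaccard_ws(left_row, right_row)
print(score)
```

The two rows share two of three distinct tokens, so the score is 2/3. Any callable with this two-argument shape and a numeric return value matches the description of the function attribute.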
- py_entitymatching.get_features_for_matching(ltable, rtable, validate_inferred_attr_types=True)
This function automatically generates features that can be used for matching purposes.
- Parameters
ltable (DataFrame) – The left input table (a pandas DataFrame) for which the features are to be generated.
rtable (DataFrame) – The right input table (a pandas DataFrame) for which the features are to be generated.
validate_inferred_attr_types (boolean) – A flag to indicate whether to show the user the inferred attribute types and the features chosen for those types.
- Returns
A pandas DataFrame containing automatically generated features.
Specifically, the DataFrame contains the following attributes: ‘feature_name’, ‘left_attribute’, ‘right_attribute’, ‘left_attr_tokenizer’, ‘right_attr_tokenizer’, ‘simfunction’, ‘function’, ‘function_source’, and ‘is_auto_generated’.
This function also sets the following global variables: _match_t, _match_s, _atypes1, _atypes2, and _match_c.
The variable _match_t contains the tokenizers used, and _match_s contains the similarity functions used for creating features.
The variables _atypes1 and _atypes2 contain the attribute types for ltable and rtable, respectively. The variable _match_c contains the attribute correspondences between the two input tables.
- Raises
AssertionError – If ltable is not of type pandas DataFrame.
AssertionError – If rtable is not of type pandas DataFrame.
AssertionError – If validate_inferred_attr_types is not of type boolean.
Examples
>>> import py_entitymatching as em
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> match_f = em.get_features_for_matching(A, B)
Note
In the output DataFrame, two attributes need some explanation: (1) function and (2) is_auto_generated. The function attribute points to the actual Python function that implements the feature. Specifically, the function takes two tuples (one from each input table) and returns a numeric value. The is_auto_generated attribute contains either True or False; it is True only if the feature was generated automatically by py_entitymatching. This matters because the flag is used to make assumptions about the semantics of the similarity function and to use that information for scaling.
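The sketch below shows one way the function and is_auto_generated attributes might be read together when turning a pair of tuples into a feature vector. It uses a plain list of dicts as a stand-in for the feature table, with only three of the documented columns; the feature names, the helper factories exact_match and abs_diff, and the sample rows are hypothetical and are not produced by py_entitymatching.

```python
def exact_match(attr):
    """Build a hypothetical feature: 1.0 if both tuples agree on attr, else 0.0."""
    def feature(ltuple, rtuple):
        return 1.0 if ltuple[attr] == rtuple[attr] else 0.0
    return feature


def abs_diff(attr):
    """Build a hypothetical feature: absolute difference of a numeric attr."""
    def feature(ltuple, rtuple):
        return float(abs(ltuple[attr] - rtuple[attr]))
    return feature


# Stand-in for the feature table: one dict per feature (assumption).
feature_rows = [
    {"feature_name": "zip_exact", "function": exact_match("zip"),
     "is_auto_generated": True},
    {"feature_name": "age_abs_diff", "function": abs_diff("age"),
     "is_auto_generated": False},  # pretend this one was written by hand
]

# Plain dicts stand in for one row from each input table (assumption).
left_row = {"zip": "53703", "age": 31}
right_row = {"zip": "53703", "age": 34}

# Call each feature's function on the pair to get one numeric value per feature.
feature_vector = {
    row["feature_name"]: row["function"](left_row, right_row)
    for row in feature_rows
}

# Only features flagged as auto-generated carry known similarity semantics.
auto_names = [r["feature_name"] for r in feature_rows if r["is_auto_generated"]]

print(feature_vector)
print(auto_names)
```

Keeping the flag next to the callable lets downstream code treat hand-written features conservatively, since nothing can be assumed about the range of their return values.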