Creating the Features Automatically

py_entitymatching.get_features_for_blocking(ltable, rtable, validate_inferred_attr_types=True)

This function automatically generates features that can be used for blocking purposes.

Parameters

ltable, rtable (DataFrame) – The pandas DataFrames for which features are to be generated.
validate_inferred_attr_types (boolean) – A flag to indicate whether to show the user the inferred attribute types and the features chosen for those types.

Returns

A pandas DataFrame containing automatically generated features.

Specifically, the DataFrame contains the following attributes: ‘feature_name’, ‘left_attribute’, ‘right_attribute’, ‘left_attr_tokenizer’, ‘right_attr_tokenizer’, ‘simfunction’, ‘function’, ‘function_source’, and ‘is_auto_generated’.

Further, this function also sets the following global variables: _block_t, _block_s, _atypes1, _atypes2, and _block_c.

The variable _block_t contains the tokenizers used and _block_s contains the similarity functions used for creating features.

The variables _atypes1 and _atypes2 contain the attribute types for ltable and rtable, respectively. The variable _block_c contains the attribute correspondences between the two input tables.

Raises

AssertionError – If ltable is not of type pandas DataFrame.
AssertionError – If rtable is not of type pandas DataFrame.
AssertionError – If validate_inferred_attr_types is not of type boolean.

Examples

>>> import py_entitymatching as em
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> block_f = em.get_features_for_blocking(A, B)

Note

In the output DataFrame, two attributes need some explanation: (1) function and (2) is_auto_generated. The function attribute points to the actual Python function that implements the feature. Specifically, that function takes two tuples (one from each input table) and returns a numeric value. The is_auto_generated attribute contains either True or False; it is True only if the feature was automatically generated by py_entitymatching. This matters because the flag is used to make assumptions about the semantics of the similarity function, and that information is then used for scaling purposes.
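
The contract described in this note (two tuples in, one number out) can be illustrated with a small hand-written function. This is only a sketch under stated assumptions: the name name_name_jac_ws, the helper whitespace_tokens, the 'name' attribute, and the use of plain dicts as tuples are all hypothetical, and the code is not the implementation that py_entitymatching generates.

```python
def whitespace_tokens(value):
    """Lowercase a value and split it on whitespace into a set of tokens."""
    return set(str(value).lower().split())


def name_name_jac_ws(ltuple, rtuple):
    """Hypothetical feature: Jaccard similarity over the 'name' attribute.

    Takes one tuple from each input table and returns a numeric value,
    which is the shape of function the note describes.
    """
    left = whitespace_tokens(ltuple["name"])
    right = whitespace_tokens(rtuple["name"])
    if not left and not right:
        return 0.0  # illustrative convention when both values are empty
    return len(left & right) / len(left | right)


# One tuple from each input table, represented here as plain dicts.
l_row = {"ID": "a1", "name": "Dave Smith"}
r_row = {"ID": "b1", "name": "David Smith"}

score = name_name_jac_ws(l_row, r_row)
print(score)
```

Here the two names share one token out of three distinct tokens, so the score is one third. Any callable with this two-argument, numeric-return shape matches the description of the function attribute.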

py_entitymatching.get_features_for_matching(ltable, rtable, validate_inferred_attr_types=True)

This function automatically generates features that can be used for matching purposes.

Parameters

ltable, rtable (DataFrame) – The pandas DataFrames for which features are to be generated.
validate_inferred_attr_types (boolean) – A flag to indicate whether to show the user the inferred attribute types and the features chosen for those types.

Returns

A pandas DataFrame containing automatically generated features.

Specifically, the DataFrame contains the following attributes: ‘feature_name’, ‘left_attribute’, ‘right_attribute’, ‘left_attr_tokenizer’, ‘right_attr_tokenizer’, ‘simfunction’, ‘function’, ‘function_source’, and ‘is_auto_generated’.

Further, this function also sets the following global variables: _match_t, _match_s, _atypes1, _atypes2, and _match_c.

The variable _match_t contains the tokenizers used and _match_s contains the similarity functions used for creating features.

The variables _atypes1 and _atypes2 contain the attribute types for ltable and rtable, respectively. The variable _match_c contains the attribute correspondences between the two input tables.

Raises

AssertionError – If ltable is not of type pandas DataFrame.
AssertionError – If rtable is not of type pandas DataFrame.
AssertionError – If validate_inferred_attr_types is not of type boolean.

Examples

>>> import py_entitymatching as em
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> match_f = em.get_features_for_matching(A, B)

Note

In the output DataFrame, two attributes need some explanation: (1) function and (2) is_auto_generated. The function attribute points to the actual Python function that implements the feature. Specifically, that function takes two tuples (one from each input table) and returns a numeric value. The is_auto_generated attribute contains either True or False; it is True only if the feature was automatically generated by py_entitymatching. This matters because the flag is used to make assumptions about the semantics of the similarity function, and that information is then used for scaling purposes.
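
To make the roles of the function and is_auto_generated attributes concrete, the sketch below mimics a few rows of a feature table with a plain list of dicts and applies each row's function to one pair of tuples. Everything here is an assumption made for illustration: the list of dicts stands in for the real DataFrame, only some of the listed attributes are shown, and the feature names and functions (zipcode_exact_match, name_length_diff, feature_vector) are hypothetical rather than what py_entitymatching produces.

```python
def zipcode_exact_match(ltuple, rtuple):
    """Hypothetical feature: 1 if the zipcodes are identical, else 0."""
    return 1 if ltuple["zipcode"] == rtuple["zipcode"] else 0


def name_length_diff(ltuple, rtuple):
    """Hypothetical user-defined feature: absolute difference in name length."""
    return abs(len(ltuple["name"]) - len(rtuple["name"]))


# Stand-in for a feature table: one dict per feature, using a subset of the
# attribute names listed in the documentation above.
feature_table = [
    {
        "feature_name": "zipcode_zipcode_exm",
        "left_attribute": "zipcode",
        "right_attribute": "zipcode",
        "function": zipcode_exact_match,
        "is_auto_generated": True,
    },
    {
        "feature_name": "name_length_diff",
        "left_attribute": "name",
        "right_attribute": "name",
        "function": name_length_diff,
        "is_auto_generated": False,  # added by hand, so the flag is False
    },
]


def feature_vector(features, ltuple, rtuple):
    """Apply every feature function to one pair of tuples."""
    return {f["feature_name"]: f["function"](ltuple, rtuple) for f in features}


l_row = {"name": "Dave Smith", "zipcode": "53703"}
r_row = {"name": "David Smith", "zipcode": "53703"}

vector = feature_vector(feature_table, l_row, r_row)

# Features whose semantics are known because they were generated automatically.
auto_names = [f["feature_name"] for f in feature_table if f["is_auto_generated"]]

print(vector)
print(auto_names)
```

The filter at the end shows why the flag is useful: downstream code can treat automatically generated features differently from hand-written ones, whose value ranges it cannot assume.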
