Creating the Features Automatically

py_entitymatching.get_features_for_blocking(ltable, rtable, validate_inferred_attr_types=True)

This function automatically generates features that can be used for blocking purposes.

Parameters
  • ltable (DataFrame) – The left input table (a pandas DataFrame) for which the features are to be generated.

  • rtable (DataFrame) – The right input table (a pandas DataFrame) for which the features are to be generated.

  • validate_inferred_attr_types (boolean) – A flag indicating whether to show the user the inferred attribute types and the features chosen for those types (defaults to True).

Returns

A pandas DataFrame containing automatically generated features.

Specifically, the DataFrame contains the following attributes: ‘feature_name’, ‘left_attribute’, ‘right_attribute’, ‘left_attr_tokenizer’, ‘right_attr_tokenizer’, ‘simfunction’, ‘function’, ‘function_source’, and ‘is_auto_generated’.

This function also sets the following global variables: _block_t, _block_s, _atypes1, _atypes2, and _block_c.

The variable _block_t contains the tokenizers used and _block_s contains the similarity functions used for creating features.

The variables _atypes1 and _atypes2 contain the attribute types for ltable and rtable, respectively. The variable _block_c contains the attribute correspondences between the two input tables.
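To make the idea of attribute correspondences concrete, here is a minimal sketch that pairs attributes sharing the same name in both tables. It is an illustration only: the column lists and the helper same_name_correspondences are hypothetical, and the structure actually stored in _block_c by py_entitymatching may differ.

```python
# Hypothetical column lists standing in for ltable.columns and rtable.columns.
ltable_columns = ["ID", "name", "birth_year", "zipcode"]
rtable_columns = ["ID", "name", "zipcode", "hourly_wage"]


def same_name_correspondences(left_columns, right_columns):
    """Pair attributes that appear in both tables under the same name.

    This is an assumed, simplified rule used for illustration; it is not
    taken from the py_entitymatching source.
    """
    right_names = set(right_columns)
    # Preserve the left table's column order in the result.
    return [(col, col) for col in left_columns if col in right_names]


corres = same_name_correspondences(ltable_columns, rtable_columns)
print(corres)  # [('ID', 'ID'), ('name', 'name'), ('zipcode', 'zipcode')]
```

Under this assumption, attributes that exist in only one table (birth_year, hourly_wage) get no correspondence and would not contribute features.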

Raises
  • AssertionError – If ltable is not of type pandas DataFrame.

  • AssertionError – If rtable is not of type pandas DataFrame.

  • AssertionError – If validate_inferred_attr_types is not of type boolean.

Examples

>>> import py_entitymatching as em
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> block_f = em.get_features_for_blocking(A, B)

Note

In the output DataFrame, two attributes need some explanation: (1) function and (2) is_auto_generated. The function attribute points to the actual Python function that implements the feature. Specifically, the function takes two tuples (one from each input table) and returns a numeric value. The is_auto_generated attribute is either True or False; it is True only if the feature was generated automatically by py_entitymatching. This matters because the flag is used to make assumptions about the semantics of the similarity function used, and that information is used for scaling purposes.
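The contract described above (two tuples in, one numeric value out) can be sketched without the library. The example below is a hand-written stand-in, not a function generated by py_entitymatching: the name name_name_jaccard, the whitespace tokenizer, and the use of plain dicts as tuples are all assumptions made for illustration.

```python
def whitespace_tokens(value):
    """Lowercase a value and split it on whitespace into a set of tokens."""
    return set(str(value).lower().split())


def name_name_jaccard(ltuple, rtuple):
    """Hypothetical feature: Jaccard similarity of the 'name' attribute.

    Takes one tuple from each table and returns a float in [0, 1].
    """
    left = whitespace_tokens(ltuple["name"])
    right = whitespace_tokens(rtuple["name"])
    if not left and not right:
        # Design choice for this sketch: two empty values score 0.0.
        return 0.0
    return len(left & right) / len(left | right)


# Plain dicts stand in for rows of ltable and rtable.
ltuple = {"ID": "a1", "name": "Kevin Smith"}
rtuple = {"ID": "b1", "name": "Kevin J Smith"}

score = name_name_jaccard(ltuple, rtuple)
print(round(score, 3))  # 0.667 (2 shared tokens out of 3 distinct tokens)
```

Because a Jaccard score is already bounded between 0 and 1, a feature like this would need no rescaling, which is the kind of assumption a flag such as is_auto_generated makes it safe to rely on.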

py_entitymatching.get_features_for_matching(ltable, rtable, validate_inferred_attr_types=True)

This function automatically generates features that can be used for matching purposes.

Parameters
  • ltable (DataFrame) – The left input table (a pandas DataFrame) for which the features are to be generated.

  • rtable (DataFrame) – The right input table (a pandas DataFrame) for which the features are to be generated.

  • validate_inferred_attr_types (boolean) – A flag indicating whether to show the user the inferred attribute types and the features chosen for those types (defaults to True).

Returns

A pandas DataFrame containing automatically generated features.

Specifically, the DataFrame contains the following attributes: ‘feature_name’, ‘left_attribute’, ‘right_attribute’, ‘left_attr_tokenizer’, ‘right_attr_tokenizer’, ‘simfunction’, ‘function’, ‘function_source’, and ‘is_auto_generated’.

This function also sets the following global variables: _match_t, _match_s, _atypes1, _atypes2, and _match_c.

The variable _match_t contains the tokenizers used and _match_s contains the similarity functions used for creating features.

The variables _atypes1 and _atypes2 contain the attribute types for ltable and rtable, respectively. The variable _match_c contains the attribute correspondences between the two input tables.

Raises
  • AssertionError – If ltable is not of type pandas DataFrame.

  • AssertionError – If rtable is not of type pandas DataFrame.

  • AssertionError – If validate_inferred_attr_types is not of type boolean.

Examples

>>> import py_entitymatching as em
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> match_f = em.get_features_for_matching(A, B)

Note

In the output DataFrame, two attributes need some explanation: (1) function and (2) is_auto_generated. The function attribute points to the actual Python function that implements the feature. Specifically, the function takes two tuples (one from each input table) and returns a numeric value. The is_auto_generated attribute is either True or False; it is True only if the feature was generated automatically by py_entitymatching. This matters because the flag is used to make assumptions about the semantics of the similarity function used, and that information is used for scaling purposes.
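As a rough picture of how such a feature table might be consumed for matching, the sketch below models each row as a dict carrying a few of the documented attributes and applies every row's function to one tuple pair to build a feature vector. The feature names, the exact_match helper, and the list-of-dicts layout are hypothetical; the real output is a pandas DataFrame whose contents are chosen by py_entitymatching.

```python
def exact_match(attr):
    """Build a hypothetical feature function comparing one attribute exactly."""
    def feature(ltuple, rtuple):
        return 1.0 if ltuple[attr] == rtuple[attr] else 0.0
    return feature


# A list of dicts stands in for rows of the feature table (assumed layout).
feature_rows = [
    {
        "feature_name": "zipcode_zipcode_exact",
        "left_attribute": "zipcode",
        "right_attribute": "zipcode",
        "function": exact_match("zipcode"),
        "is_auto_generated": True,
    },
    {
        "feature_name": "name_name_exact",
        "left_attribute": "name",
        "right_attribute": "name",
        "function": exact_match("name"),
        "is_auto_generated": False,  # pretend this one was added by hand
    },
]

ltuple = {"ID": "a1", "name": "Kevin Smith", "zipcode": "94107"}
rtuple = {"ID": "b1", "name": "Kevin J Smith", "zipcode": "94107"}

# Apply each feature's function to the tuple pair.
feature_vector = {
    row["feature_name"]: row["function"](ltuple, rtuple) for row in feature_rows
}

# Select only the features flagged as automatically generated.
auto_features = [
    row["feature_name"] for row in feature_rows if row["is_auto_generated"]
]

print(feature_vector)
print(auto_features)
```

Keeping the callable and the is_auto_generated flag side by side in each row lets downstream code decide, per feature, whether the semantics of the similarity function are known well enough to skip or apply scaling.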