Adding Features to Feature Table

py_entitymatching.get_feature_fn(feature_string, tokenizers, similarity_functions)

This function creates a feature in a declarative manner.

Specifically, this function parses the feature string and compiles it into a function using the given tokenizers and similarity functions. The compiled function takes two tuples and returns a feature value (typically a number).

Parameters

feature_string (string) – A feature expression to be converted into a function.
tokenizers (dictionary) – A Python dictionary containing tokenizers. Specifically, the dictionary contains tokenizer names as keys and tokenizer functions as values. The tokenizer function typically takes in a string and returns a list of tokens.
similarity_functions (dictionary) – A Python dictionary containing similarity functions. Specifically, the dictionary contains similarity function names as keys and similarity functions as values. The similarity function typically takes in two strings or two lists of tokens and returns a number.

Returns

This function returns a Python dictionary which contains sufficient information (such as attributes, tokenizers, function code) to be added to the feature table.

Specifically the Python dictionary contains the following keys: ‘left_attribute’, ‘right_attribute’, ‘left_attr_tokenizer’, ‘right_attr_tokenizer’, ‘simfunction’, ‘function’, and ‘function_source’.

For every key except ‘function’ and ‘function_source’, the value is either a valid string (if the input feature string was parsed successfully) or PARSE_EXP (if parsing failed). The ‘function’ key holds a valid Python function, and ‘function_source’ holds that function’s source code as a string.

The created function is self-contained: the tokenizers and similarity functions that it calls are bundled with the returned function code.

Raises

AssertionError – If feature_string is not of type string.
AssertionError – If the input tokenizers is not of type dictionary.
AssertionError – If the input similarity_functions is not of type dictionary.

Examples

>>> import py_entitymatching as em
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')

>>> block_t = em.get_tokenizers_for_blocking()
>>> block_s = em.get_sim_funs_for_blocking()
>>> block_f = em.get_features_for_blocking(A, B)
>>> r = em.get_feature_fn('jaccard(qgm_3(ltuple.name), qgm_3(rtuple.name))', block_t, block_s)
>>> em.add_feature(block_f, 'name_name_jac_qgm3_qgm3', r)

>>> match_t = em.get_tokenizers_for_matching()
>>> match_s = em.get_sim_funs_for_matching()
>>> match_f = em.get_features_for_matching(A, B)
>>> r = em.get_feature_fn('jaccard(qgm_3(ltuple.name), qgm_3(rtuple.name))', match_t, match_s)
>>> em.add_feature(match_f, 'name_name_jac_qgm3_qgm3', r)
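To make the idea of compiling a feature string concrete, here is a minimal pure-Python sketch that does not use py_entitymatching at all. The names qgm_3, jaccard, compile_feature and Row are hypothetical stand-ins written for this illustration, and the use of eval is an assumption made for brevity, not a description of how the library builds its function.

```python
from collections import namedtuple


def qgm_3(text):
    """Hypothetical 3-gram tokenizer: pad the string, then slide a width-3 window."""
    padded = "##" + text + "$$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def jaccard(tokens_a, tokens_b):
    """Jaccard similarity of two token lists: |A & B| / |A | B|."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Name -> function dictionaries, mirroring the shape of the two dictionary
# parameters described above.
tokenizers = {"qgm_3": qgm_3}
similarity_functions = {"jaccard": jaccard}


def compile_feature(feature_string, tokenizers, similarity_functions):
    """Illustrative only: turn a feature expression into a two-argument function.

    eval is used here for brevity; it must never be applied to untrusted input.
    """
    namespace = {**tokenizers, **similarity_functions}
    return eval("lambda ltuple, rtuple: " + feature_string, namespace)


# Rows with attribute access, so that ltuple.name works in the expression.
Row = namedtuple("Row", ["name"])

feature = compile_feature(
    "jaccard(qgm_3(ltuple.name), qgm_3(rtuple.name))",
    tokenizers,
    similarity_functions,
)
same = feature(Row("dave smith"), Row("dave smith"))
different = feature(Row("dave smith"), Row("xyz"))
```

The resulting function carries its tokenizer and similarity function with it through the namespace it was compiled in, which is the sense in which a compiled feature can be called on any pair of tuples without further setup.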

py_entitymatching.add_feature(feature_table, feature_name, feature_dict)

Adds a feature to the feature table.

Specifically, this function is used together with get_feature_fn(): first create a feature dictionary with get_feature_fn(), then pass that dictionary to this function as feature_dict to add it to the feature table.

Parameters

feature_table (DataFrame) – A DataFrame containing features.
feature_name (string) – The name that should be given to the feature.
feature_dict (dictionary) – A Python dictionary that is typically returned by executing get_feature_fn().

Returns

A Boolean value of True is returned if the addition was successful.

Raises

AssertionError – If the input feature_table is not of type pandas DataFrame.
AssertionError – If feature_name is not of type string.
AssertionError – If feature_dict is not of type Python dictionary.
AssertionError – If the feature_table does not have the necessary columns, such as ‘feature_name’, ‘left_attribute’, ‘right_attribute’, ‘left_attr_tokenizer’, ‘right_attr_tokenizer’, ‘simfunction’, ‘function’, and ‘function_source’.
AssertionError – If the feature_name is already present in the feature table.

Examples

>>> import py_entitymatching as em
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')

>>> block_t = em.get_tokenizers_for_blocking()
>>> block_s = em.get_sim_funs_for_blocking()
>>> block_f = em.get_features_for_blocking(A, B)
>>> r = em.get_feature_fn('jaccard(qgm_3(ltuple.name), qgm_3(rtuple.name))', block_t, block_s)
>>> em.add_feature(block_f, 'name_name_jac_qgm3_qgm3', r)

>>> match_t = em.get_tokenizers_for_matching()
>>> match_s = em.get_sim_funs_for_matching()
>>> match_f = em.get_features_for_matching(A, B)
>>> r = em.get_feature_fn('jaccard(qgm_3(ltuple.name), qgm_3(rtuple.name))', match_t, match_s)
>>> em.add_feature(match_f, 'name_name_jac_qgm3_qgm3', r)
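The type and duplicate-name checks listed under Raises can be sketched without the library. The snippet below is a simplified stand-in, not py_entitymatching's implementation. It models the feature table as a plain list of row dictionaries instead of a DataFrame, and, as a further simplification, it validates the keys of feature_dict rather than the table's columns. REQUIRED_KEYS, add_feature_sketch and hand_made are hypothetical names.

```python
# Keys that the feature dictionary is documented to contain; the feature
# table stores them alongside the feature name.
REQUIRED_KEYS = [
    "left_attribute", "right_attribute",
    "left_attr_tokenizer", "right_attr_tokenizer",
    "simfunction", "function", "function_source",
]


def add_feature_sketch(feature_rows, feature_name, feature_dict):
    """Append feature_dict to feature_rows under feature_name (illustrative only)."""
    assert isinstance(feature_rows, list), "feature table must be a list of rows"
    assert isinstance(feature_name, str), "feature_name must be a string"
    assert isinstance(feature_dict, dict), "feature_dict must be a dictionary"

    missing = [key for key in REQUIRED_KEYS if key not in feature_dict]
    assert not missing, "feature_dict is missing keys: %s" % missing

    existing = {row["feature_name"] for row in feature_rows}
    assert feature_name not in existing, "feature name already present"

    feature_rows.append({"feature_name": feature_name, **feature_dict})
    return True


# A hand-built dictionary with the documented keys and placeholder values.
hand_made = {
    "left_attribute": "name", "right_attribute": "name",
    "left_attr_tokenizer": "qgm_3", "right_attr_tokenizer": "qgm_3",
    "simfunction": "jaccard",
    "function": lambda ltuple, rtuple: 0.0,  # placeholder feature function
    "function_source": "lambda ltuple, rtuple: 0.0",
}

rows = []
added = add_feature_sketch(rows, "name_name_jac_qgm3_qgm3", hand_made)

# Adding the same feature name a second time is rejected.
try:
    add_feature_sketch(rows, "name_name_jac_qgm3_qgm3", hand_made)
    duplicate_rejected = False
except AssertionError:
    duplicate_rejected = True
```

Rejecting a duplicate name keeps feature names unique within the table, so each name identifies exactly one feature.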

py_entitymatching.add_blackbox_feature(feature_table, feature_name, feature_function, **kwargs)

Adds a black box feature to the feature table.

Parameters

feature_table (DataFrame) – The input DataFrame (typically a feature table) to which the feature must be added.
feature_name (string) – The name that should be given to the feature.
feature_function (Python function) – A Python function for the black box feature.

Returns

A Boolean value of True is returned if the addition was successful.

Raises

AssertionError – If the input feature_table is not of type DataFrame.
AssertionError – If the input feature_name is not of type string.
AssertionError – If the feature_table does not have the necessary columns, such as ‘feature_name’, ‘left_attribute’, ‘right_attribute’, ‘left_attr_tokenizer’, ‘right_attr_tokenizer’, ‘simfunction’, ‘function’, and ‘function_source’.
AssertionError – If the feature_name is already present in the feature table.

Examples

>>> import py_entitymatching as em
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> block_f = em.get_features_for_blocking(A, B)
>>> def age_diff(ltuple, rtuple):
...     # Assume that both tuples have an 'age' attribute with valid numeric values.
...     return ltuple['age'] - rtuple['age']
>>> status = em.add_blackbox_feature(block_f, 'age_difference', age_diff)
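The example above assumes that both tuples carry valid numeric ages. When that cannot be guaranteed, a more defensive black box function may be preferable. The following is a minimal sketch under that assumption; age_diff_safe and _is_missing are hypothetical names, and returning NaN for missing values is one possible convention rather than something the library prescribes.

```python
import math


def _is_missing(value):
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def age_diff_safe(ltuple, rtuple):
    """Signed age difference, or NaN when either age is missing."""
    left, right = ltuple['age'], rtuple['age']
    if _is_missing(left) or _is_missing(right):
        return float('nan')
    return left - right


# Plain dictionaries stand in for tuples here; anything indexable by
# attribute name works the same way.
diff = age_diff_safe({'age': 30}, {'age': 25})
missing = age_diff_safe({'age': None}, {'age': 25})
```

A function like this has the same two-argument shape as age_diff, so it could be passed as feature_function in the same way.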