Adding Features to Feature Table
- py_entitymatching.get_feature_fn(feature_string, tokenizers, similarity_functions)
This function creates a feature in a declarative manner.
Specifically, this function parses the feature string and compiles it into a function using the given tokenizers and similarity functions. The compiled function takes two tuples and returns a feature value (typically a number).
- Parameters
feature_string (string) – A feature expression to be converted into a function.
tokenizers (dictionary) – A Python dictionary containing tokenizers. Specifically, the dictionary contains tokenizer names as keys and tokenizer functions as values. The tokenizer function typically takes in a string and returns a list of tokens.
similarity_functions (dictionary) – A Python dictionary containing similarity functions. Specifically, the dictionary contains similarity function names as keys and similarity functions as values. The similarity function typically takes in two strings or two lists of tokens and returns a number.
- Returns
This function returns a Python dictionary which contains sufficient information (such as attributes, tokenizers, function code) to be added to the feature table.
Specifically, the Python dictionary contains the following keys: ‘left_attribute’, ‘right_attribute’, ‘left_attr_tokenizer’, ‘right_attr_tokenizer’, ‘simfunction’, ‘function’, and ‘function_source’.
For all keys except ‘function’ and ‘function_source’, the value is either a valid string (if the input feature string was parsed correctly) or PARSE_EXP (if parsing was not successful). The value of ‘function’ is a valid Python function, and the value of ‘function_source’ is that function’s source code as a string.
The created function is self-contained: the tokenizers and similarity functions that it calls are bundled with the returned function code.
- Raises
AssertionError – If feature_string is not of type string.
AssertionError – If the input tokenizers is not of type dictionary.
AssertionError – If the input similarity_functions is not of type dictionary.
Examples
>>> import py_entitymatching as em
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> # Create a feature for blocking and add it to the blocking feature table.
>>> block_t = em.get_tokenizers_for_blocking()
>>> block_s = em.get_sim_funs_for_blocking()
>>> block_f = em.get_features_for_blocking(A, B)
>>> r = em.get_feature_fn('jaccard(qgm_3(ltuple.name), qgm_3(rtuple.name))', block_t, block_s)
>>> em.add_feature(block_f, 'name_name_jac_qgm3_qgm3', r)
>>> # Create the same feature for matching and add it to the matching feature table.
>>> match_t = em.get_tokenizers_for_matching()
>>> match_s = em.get_sim_funs_for_matching()
>>> match_f = em.get_features_for_matching(A, B)
>>> r = em.get_feature_fn('jaccard(qgm_3(ltuple.name), qgm_3(rtuple.name))', match_t, match_s)
>>> em.add_feature(match_f, 'name_name_jac_qgm3_qgm3', r)
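To make the feature expression used in the examples above more concrete, the following is a minimal pure-Python sketch of what a feature such as jaccard(qgm_3(ltuple.name), qgm_3(rtuple.name)) conceptually computes. The helper names qgm_3_sketch, jaccard_sketch, and name_feature_sketch are hypothetical and not part of py_entitymatching. The library's own tokenizers and similarity functions may behave differently (for example, in how they pad strings or treat repeated tokens), so this illustrates the idea rather than the library's implementation.

```python
def qgm_3_sketch(text):
    """Split a string into overlapping 3-character tokens (no padding)."""
    text = str(text)
    if len(text) < 3:
        return []
    return [text[i:i + 3] for i in range(len(text) - 2)]


def jaccard_sketch(tokens_a, tokens_b):
    """Jaccard similarity of two token lists, treated as sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        # Assumption of this sketch: two empty token sets count as identical.
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def name_feature_sketch(ltuple, rtuple):
    """Take two tuples (here, plain dicts) and return one feature value."""
    return jaccard_sketch(qgm_3_sketch(ltuple['name']),
                          qgm_3_sketch(rtuple['name']))


# Hypothetical tuples standing in for rows of tables A and B.
left = {'ID': 'a1', 'name': 'abcd'}
right = {'ID': 'b1', 'name': 'bcde'}
score = name_feature_sketch(left, right)
```

Here 'abcd' and 'bcde' share one of three distinct 3-grams, so the sketch yields a score of one third. Like the compiled function described above, name_feature_sketch takes two tuples and returns a number.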
- py_entitymatching.add_feature(feature_table, feature_name, feature_dict)
Adds a feature to the feature table.
Specifically, this function is used in combination with get_feature_fn(). First, the user creates a dictionary using get_feature_fn(); then the user passes that dictionary to this function as feature_dict to add it to the feature table.
- Parameters
feature_table (DataFrame) – A DataFrame containing features.
feature_name (string) – The name that should be given to the feature.
feature_dict (dictionary) – A Python dictionary, typically the one returned by get_feature_fn().
- Returns
A Boolean value of True is returned if the addition was successful.
- Raises
AssertionError – If the input feature_table is not of type pandas DataFrame.
AssertionError – If feature_name is not of type string.
AssertionError – If feature_dict is not of type Python dictionary.
AssertionError – If the feature_table does not have necessary columns such as ‘feature_name’, ‘left_attribute’, ‘right_attribute’, ‘left_attr_tokenizer’, ‘right_attr_tokenizer’, ‘simfunction’, ‘function’, and ‘function_source’ in the DataFrame.
AssertionError – If the feature_name is already present in the feature table.
Examples
>>> import py_entitymatching as em
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> # Create a feature for blocking and add it to the blocking feature table.
>>> block_t = em.get_tokenizers_for_blocking()
>>> block_s = em.get_sim_funs_for_blocking()
>>> block_f = em.get_features_for_blocking(A, B)
>>> r = em.get_feature_fn('jaccard(qgm_3(ltuple.name), qgm_3(rtuple.name))', block_t, block_s)
>>> em.add_feature(block_f, 'name_name_jac_qgm3_qgm3', r)
>>> # Create the same feature for matching and add it to the matching feature table.
>>> match_t = em.get_tokenizers_for_matching()
>>> match_s = em.get_sim_funs_for_matching()
>>> match_f = em.get_features_for_matching(A, B)
>>> r = em.get_feature_fn('jaccard(qgm_3(ltuple.name), qgm_3(rtuple.name))', match_t, match_s)
>>> em.add_feature(match_f, 'name_name_jac_qgm3_qgm3', r)
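The documented contract of add_feature (type checks, a uniqueness check on the feature name, and a True return value on success) can be sketched without the library. The sketch below is an illustration only: add_feature_sketch is a hypothetical name, a plain list of dictionaries stands in for the pandas DataFrame that the real function expects, and the requirement that feature_dict carry every key documented for get_feature_fn is an assumption of this sketch.

```python
REQUIRED_KEYS = ('left_attribute', 'right_attribute',
                 'left_attr_tokenizer', 'right_attr_tokenizer',
                 'simfunction', 'function', 'function_source')


def add_feature_sketch(feature_rows, feature_name, feature_dict):
    """Append one feature row, mirroring the documented checks."""
    # Stand-in type checks (the real function expects a pandas DataFrame).
    assert isinstance(feature_rows, list), 'feature_rows must be a list'
    assert isinstance(feature_name, str), 'feature_name must be a string'
    assert isinstance(feature_dict, dict), 'feature_dict must be a dictionary'
    # Assumption: the feature dictionary carries every documented key.
    missing = [key for key in REQUIRED_KEYS if key not in feature_dict]
    assert not missing, 'missing keys: %s' % missing
    # The feature name must not already be present in the table.
    existing = [row['feature_name'] for row in feature_rows]
    assert feature_name not in existing, 'feature name already present'
    row = dict(feature_dict)
    row['feature_name'] = feature_name
    feature_rows.append(row)
    return True


def exact_match(ltuple, rtuple):
    """Hypothetical feature function: 1 if the names are equal, else 0."""
    return int(ltuple['name'] == rtuple['name'])


# A hand-built dictionary shaped like the one get_feature_fn is documented to return.
feature = {
    'left_attribute': 'name',
    'right_attribute': 'name',
    'left_attr_tokenizer': 'none',
    'right_attr_tokenizer': 'none',
    'simfunction': 'exact_match',
    'function': exact_match,
    'function_source': 'def exact_match(ltuple, rtuple): ...',
}

table = []
added = add_feature_sketch(table, 'name_name_exm', feature)

# Adding the same feature name a second time is rejected.
try:
    add_feature_sketch(table, 'name_name_exm', feature)
    duplicate_rejected = False
except AssertionError:
    duplicate_rejected = True
```

In this sketch the first call returns True and the second raises AssertionError, matching the Returns and Raises descriptions above.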
- py_entitymatching.add_blackbox_feature(feature_table, feature_name, feature_function, **kwargs)
Adds a black box feature to the feature table.
- Parameters
feature_table (DataFrame) – The input DataFrame (typically a feature table) to which the feature must be added.
feature_name (string) – The name that should be given to the feature.
feature_function (Python function) – A Python function for the black box feature.
- Returns
A Boolean value of True is returned if the addition was successful.
- Raises
AssertionError – If the input feature_table is not of type DataFrame.
AssertionError – If the input feature_name is not of type string.
AssertionError – If the feature_table does not have necessary columns such as ‘feature_name’, ‘left_attribute’, ‘right_attribute’, ‘left_attr_tokenizer’, ‘right_attr_tokenizer’, ‘simfunction’, ‘function’, and ‘function_source’ in the DataFrame.
AssertionError – If the feature_name is already present in the feature table.
Examples
>>> import py_entitymatching as em
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> block_f = em.get_features_for_blocking(A, B)
>>> def age_diff(ltuple, rtuple):
...     # Assume that both tuples have an 'age' attribute with valid numeric values.
...     return ltuple['age'] - rtuple['age']
...
>>> status = em.add_blackbox_feature(block_f, 'age_difference', age_diff)
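The age_diff example above assumes that both tuples carry a valid numeric age. Where that may not hold, a black box function can guard against missing values itself. The following is a minimal sketch under that assumption: safe_age_diff and _is_missing are hypothetical names, plain dictionaries stand in for the tuples, and returning NaN for a missing value is one possible convention rather than something the library prescribes.

```python
import math


def _is_missing(value):
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def safe_age_diff(ltuple, rtuple):
    """Return the age difference, or NaN if either age is missing."""
    l_age = ltuple.get('age')
    r_age = rtuple.get('age')
    if _is_missing(l_age) or _is_missing(r_age):
        return float('nan')
    return l_age - r_age


# Hypothetical tuples; one has no usable age.
diff_ok = safe_age_diff({'age': 30}, {'age': 25})
diff_missing = safe_age_diff({'age': None}, {'age': 25})
```

A function like this could then be passed as feature_function to add_blackbox_feature in the same way as age_diff.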