Adding Features to Feature Table

py_entitymatching.get_feature_fn(feature_string, tokenizers, similarity_functions)
This function creates a feature in a declarative manner.

Specifically, this function parses the feature string and compiles it into a function using the given tokenizers and similarity functions. The compiled function takes in two tuples and returns a feature value (typically a number).

Parameters:
- feature_string (string) – A feature expression to be converted into a function.
- tokenizers (dictionary) – A Python dictionary containing tokenizers. Specifically, the dictionary contains tokenizer names as keys and tokenizer functions as values. The tokenizer function typically takes in a string and returns a list of tokens.
- similarity_functions (dictionary) – A Python dictionary containing similarity functions. Specifically, the dictionary contains similarity function names as keys and similarity functions as values. A similarity function typically takes in two strings or two lists of tokens and returns a number.

Returns:
This function returns a Python dictionary which contains sufficient information (such as attributes, tokenizers, function code) to be added to the feature table.

Specifically, the Python dictionary contains the following keys: ‘left_attribute’, ‘right_attribute’, ‘left_attr_tokenizer’, ‘right_attr_tokenizer’, ‘simfunction’, ‘function’, and ‘function_source’.

For all the keys except ‘function’ and ‘function_source’, the value will be either a valid string (if the input feature string is parsed correctly) or PARSE_EXP (if the parsing was not successful). The ‘function’ key will have a valid Python function as its value, and ‘function_source’ will have the Python function’s source in string format.

The created function is self-contained, which means that the tokenizers and similarity functions that it calls are bundled along with the returned function code.

Raises:
- AssertionError – If feature_string is not of type string.
- AssertionError – If the input tokenizers is not of type dictionary.
- AssertionError – If the input similarity_functions is not of type dictionary.

Examples
>>> import py_entitymatching as em
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')

>>> block_t = em.get_tokenizers_for_blocking()
>>> block_s = em.get_sim_funs_for_blocking()
>>> block_f = em.get_features_for_blocking(A, B)
>>> r = em.get_feature_fn('jaccard(qgm_3(ltuple.name), qgm_3(rtuple.name))', block_t, block_s)
>>> em.add_feature(block_f, 'name_name_jac_qgm3_qgm3', r)

>>> match_t = em.get_tokenizers_for_matching()
>>> match_s = em.get_sim_funs_for_matching()
>>> match_f = em.get_features_for_matching(A, B)
>>> r = em.get_feature_fn('jaccard(qgm_3(ltuple.name), qgm_3(rtuple.name))', match_t, match_s)
>>> em.add_feature(match_f, 'name_name_jac_qgm3_qgm3', r)
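
To make the idea of compiling a feature string more concrete, here is a minimal, library-free sketch. It is not how py_entitymatching implements get_feature_fn(): the names compile_feature, qgm_3, and jaccard below are hypothetical stand-ins written for illustration, the tokenizer uses no padding (the library's own tokenizers may behave differently), and the "parser" is just a regular expression followed by eval.

```python
import re


def qgm_3(value):
    """Hypothetical tokenizer: overlapping 3-grams of a string, no padding."""
    text = str(value)
    return [text[i:i + 3] for i in range(len(text) - 2)]


def jaccard(left_tokens, right_tokens):
    """Jaccard similarity of two token lists: |intersection| / |union|."""
    left, right = set(left_tokens), set(right_tokens)
    union = left | right
    return len(left & right) / len(union) if union else 0.0


def compile_feature(feature_string, tokenizers, similarity_functions):
    """Toy stand-in for get_feature_fn(): build a callable from a feature string."""
    # Rewrite ltuple.attr / rtuple.attr into dictionary-style lookups.
    expression = re.sub(r"\b([lr]tuple)\.(\w+)", r'\1["\2"]', feature_string)
    code = compile(expression, "<feature>", "eval")

    # Only the supplied tokenizers and similarity functions are visible by name.
    namespace = {"__builtins__": {}}
    namespace.update(tokenizers)
    namespace.update(similarity_functions)

    def feature(ltuple, rtuple):
        return eval(code, namespace, {"ltuple": ltuple, "rtuple": rtuple})

    return feature


name_jaccard = compile_feature(
    "jaccard(qgm_3(ltuple.name), qgm_3(rtuple.name))",
    {"qgm_3": qgm_3},
    {"jaccard": jaccard},
)

# 'david' -> dav, avi, vid; 'davis' -> dav, avi, vis; 2 shared out of 4 distinct.
score = name_jaccard({"name": "david"}, {"name": "davis"})
print(score)  # 0.5
```

In this sketch a feature string with unbalanced parentheses fails at compile time with a SyntaxError, which is one reason to double-check the expression before passing it on.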

py_entitymatching.add_feature(feature_table, feature_name, feature_dict)
Adds a feature to the feature table.

Specifically, this function is used in combination with get_feature_fn(). First the user creates a dictionary using get_feature_fn(), then the user calls this function to add that dictionary (feature_dict) to the feature table.

Parameters:
- feature_table (DataFrame) – A DataFrame containing features.
- feature_name (string) – The name that should be given to the feature.
- feature_dict (dictionary) – A Python dictionary that is typically returned by executing get_feature_fn().

Returns:
A Boolean value of True is returned if the addition was successful.

Raises:
- AssertionError – If the input feature_table is not of type pandas DataFrame.
- AssertionError – If feature_name is not of type string.
- AssertionError – If feature_dict is not of type Python dictionary.
- AssertionError – If the feature_table does not have necessary columns such as ‘feature_name’, ‘left_attribute’, ‘right_attribute’, ‘left_attr_tokenizer’, ‘right_attr_tokenizer’, ‘simfunction’, ‘function’, and ‘function_source’ in the DataFrame.
- AssertionError – If the feature_name is already present in the feature table.

Examples
>>> import py_entitymatching as em
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')

>>> block_t = em.get_tokenizers_for_blocking()
>>> block_s = em.get_sim_funs_for_blocking()
>>> block_f = em.get_features_for_blocking(A, B)
>>> r = em.get_feature_fn('jaccard(qgm_3(ltuple.name), qgm_3(rtuple.name))', block_t, block_s)
>>> em.add_feature(block_f, 'name_name_jac_qgm3_qgm3', r)

>>> match_t = em.get_tokenizers_for_matching()
>>> match_s = em.get_sim_funs_for_matching()
>>> match_f = em.get_features_for_matching(A, B)
>>> r = em.get_feature_fn('jaccard(qgm_3(ltuple.name), qgm_3(rtuple.name))', match_t, match_s)
>>> em.add_feature(match_f, 'name_name_jac_qgm3_qgm3', r)
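
The checks listed under Raises can be pictured with a small, library-free sketch. It is only an approximation for illustration: add_feature_sketch is a hypothetical name, the feature table is modelled as a plain list of row dictionaries rather than a pandas DataFrame, and the hand-written feature dictionary merely imitates the keys that get_feature_fn() is documented to return.

```python
# Keys that the documentation says a feature dictionary carries.
REQUIRED_KEYS = [
    "left_attribute", "right_attribute",
    "left_attr_tokenizer", "right_attr_tokenizer",
    "simfunction", "function", "function_source",
]


def add_feature_sketch(feature_rows, feature_name, feature_dict):
    """Toy stand-in for add_feature(); feature_rows is a list of dicts (assumption)."""
    assert isinstance(feature_rows, list), "feature_rows must be a list"
    assert isinstance(feature_name, str), "feature_name must be a string"
    assert isinstance(feature_dict, dict), "feature_dict must be a dictionary"

    # Reject a name that is already present in the table.
    existing_names = {row["feature_name"] for row in feature_rows}
    assert feature_name not in existing_names, "feature_name is already present"

    # Copy the documented keys into a new row and append it.
    row = {"feature_name": feature_name}
    for key in REQUIRED_KEYS:
        row[key] = feature_dict[key]
    feature_rows.append(row)
    return True


def exact_name_match(ltuple, rtuple):
    """Placeholder feature function used only to fill the 'function' key."""
    return 1.0 if ltuple["name"] == rtuple["name"] else 0.0


feature_dict = {
    "left_attribute": "name",
    "right_attribute": "name",
    "left_attr_tokenizer": "qgm_3",
    "right_attr_tokenizer": "qgm_3",
    "simfunction": "jaccard",
    "function": exact_name_match,
    "function_source": "placeholder source text",
}

table = []
added = add_feature_sketch(table, "name_name_jac_qgm3_qgm3", feature_dict)

# Adding the same feature name a second time is rejected.
try:
    add_feature_sketch(table, "name_name_jac_qgm3_qgm3", feature_dict)
    duplicate_rejected = False
except AssertionError:
    duplicate_rejected = True
```

Because a duplicate feature name raises an AssertionError, choosing a distinct feature_name for each variant of a feature avoids the error.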

py_entitymatching.add_blackbox_feature(feature_table, feature_name, feature_function)
Adds a black box feature to the feature table.

Parameters:
- feature_table (DataFrame) – The input DataFrame (typically a feature table) to which the feature must be added.
- feature_name (string) – The name that should be given to the feature.
- feature_function (Python function) – A Python function for the black box feature.

Returns:
A Boolean value of True is returned if the addition was successful.

Raises:
- AssertionError – If the input feature_table is not of type DataFrame.
- AssertionError – If the input feature_name is not of type string.
- AssertionError – If the feature_table does not have necessary columns such as ‘feature_name’, ‘left_attribute’, ‘right_attribute’, ‘left_attr_tokenizer’, ‘right_attr_tokenizer’, ‘simfunction’, ‘function’, and ‘function_source’ in the DataFrame.
- AssertionError – If the feature_name is already present in the feature table.

Examples
>>> import py_entitymatching as em
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> block_f = em.get_features_for_blocking(A, B)
>>> def age_diff(ltuple, rtuple):
...     # Assume that both tuples have an 'age' attribute and that the values are valid numbers.
...     return ltuple['age'] - rtuple['age']
>>> status = em.add_blackbox_feature(block_f, 'age_difference', age_diff)
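
The age_diff example assumes that every tuple has a valid numeric age. When that assumption may not hold, a more defensive black box function could look like the sketch below. The name age_diff_safe and the choice of returning NaN for unusable values are illustrative assumptions, not library behavior, and the function is exercised here on plain dictionaries rather than on rows of a real table.

```python
import math


def age_diff_safe(ltuple, rtuple):
    """Return ltuple['age'] - rtuple['age'], or NaN if either age is unusable."""
    try:
        left_age = float(ltuple["age"])
        right_age = float(rtuple["age"])
    except (KeyError, TypeError, ValueError):
        # Missing attribute, None, or a non-numeric string.
        return math.nan
    return left_age - right_age


print(age_diff_safe({"age": 34}, {"age": 30}))                 # 4.0
print(math.isnan(age_diff_safe({"age": None}, {"age": 30})))   # True
```

A function of this shape takes the same two arguments (ltuple, rtuple) as age_diff, so it could be passed as feature_function in the same way.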