Creating the Features Manually¶
- py_entitymatching.get_features(ltable, rtable, l_attr_types, r_attr_types, attr_corres, tok_funcs, sim_funcs)[source]¶
This function will automatically generate a set of features based on the attributes of the input tables.
Specifically, this function iterates through the attribute correspondences between the input tables. For each correspondence, it examines the types of the involved attributes, then applies the appropriate tokenizers and similarity functions to generate all appropriate features for that correspondence.
- Parameters
ltable (DataFrame) – The left input table for which the features must be generated.
rtable (DataFrame) – The right input table for which the features must be generated.
l_attr_types (dictionary) – The attribute types of ltable. Typically this is generated using the function ‘get_attr_types’.
r_attr_types (dictionary) – The attribute types of rtable. Typically this is generated using the function ‘get_attr_types’.
attr_corres (dictionary) – The attribute correspondences between the input DataFrames.
tok_funcs (dictionary) – A Python dictionary containing tokenizer functions.
sim_funcs (dictionary) – A Python dictionary containing similarity functions.
- Returns
A pandas DataFrame containing automatically generated features. Specifically, the DataFrame contains the following attributes: ‘feature_name’, ‘left_attribute’, ‘right_attribute’, ‘left_attr_tokenizer’, ‘right_attr_tokenizer’, ‘simfunction’, ‘function’, ‘function_source’, ‘is_auto_generated’.
- Raises
AssertionError – If ltable is not of type pandas DataFrame.
AssertionError – If rtable is not of type pandas DataFrame.
AssertionError – If l_attr_types is not of type python dictionary.
AssertionError – If r_attr_types is not of type python dictionary.
AssertionError – If attr_corres is not of type python dictionary.
AssertionError – If sim_funcs is not of type python dictionary.
AssertionError – If tok_funcs is not of type python dictionary.
AssertionError – If the order of ltable and rtable does not match the order mentioned in l_attr_types/r_attr_types and attr_corres.
Examples
>>> import py_entitymatching as em
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> match_t = em.get_tokenizers_for_matching()
>>> match_s = em.get_sim_funs_for_matching()
>>> atypes1 = em.get_attr_types(A) # not needed if atypes1 exists from the blocking step
>>> atypes2 = em.get_attr_types(B) # not needed if atypes2 exists from the blocking step
>>> match_c = em.get_attr_corres(A, B)
>>> match_f = em.get_features(A, B, atypes1, atypes2, match_c, match_t, match_s)
See also
py_entitymatching.get_attr_corres(),
py_entitymatching.get_attr_types(),
py_entitymatching.get_sim_funs_for_blocking(),
py_entitymatching.get_tokenizers_for_blocking(),
py_entitymatching.get_sim_funs_for_matching(),
py_entitymatching.get_tokenizers_for_matching()
Note
In the output DataFrame, two attributes demand some explanation: (1) function, and (2) is_auto_generated. The function attribute points to the actual Python function that implements the feature. Specifically, the function takes in two tuples (one from each input table) and returns a numeric value. The attribute is_auto_generated contains either True or False; the flag is True only if the feature was automatically generated by py_entitymatching. This is important because the flag is used to make assumptions about the semantics of the similarity function used, and that information is used for scaling purposes.
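To make the note above concrete, here is a hedged sketch of the shape an auto-generated feature function takes: it receives one tuple (row) from each table and returns a numeric value. The attribute name 'name', the function name, and the Jaccard-over-whitespace-tokens combination are illustrative assumptions, not the source that py_entitymatching actually generates.

```python
def tok_wspace(s):
    """Tokenize a string on whitespace."""
    return s.split()

def jaccard(l_tokens, r_tokens):
    """Jaccard similarity between two token lists."""
    l, r = set(l_tokens), set(r_tokens)
    if not l and not r:
        return 1.0
    return len(l & r) / len(l | r)

def name_name_jac_tok_wspace_tok_wspace(ltuple, rtuple):
    """Sketch of a feature function: Jaccard over whitespace tokens of 'name'."""
    return jaccard(tok_wspace(ltuple['name']), tok_wspace(rtuple['name']))

score = name_name_jac_tok_wspace_tok_wspace(
    {'name': 'Apple iPhone 12'}, {'name': 'Apple iPhone 12 Pro'})
# 3 shared tokens out of 4 distinct tokens -> 0.75
```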
- py_entitymatching.get_attr_corres(ltable, rtable)[source]¶
This function gets the attribute correspondences between the attributes of ltable and rtable.
The user may need to get the correspondences so that he/she can generate features based on those correspondences.
- Parameters
ltable (DataFrame) – The left input DataFrame for which the attribute correspondences must be obtained.
rtable (DataFrame) – The right input DataFrame for which the attribute correspondences must be obtained.
- Returns
A Python dictionary is returned containing the attribute correspondences.
Specifically, this returns a dictionary with the following key-value pairs:
corres: points to the list of correspondences. Each correspondence is a tuple with two attribute names: one from ltable and the other from rtable.
ltable: points to ltable.
rtable: points to rtable.
Currently, ‘corres’ contains only pairs of attributes whose names match exactly in ltable and rtable.
- Raises
AssertionError – If ltable is not of type pandas DataFrame.
AssertionError – If rtable is not of type pandas DataFrame.
Examples
>>> import py_entitymatching as em
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> match_c = em.get_attr_corres(A, B)
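Since ‘corres’ currently pairs only attributes with identical names, the pairing logic can be sketched roughly as below, using plain lists of column names instead of DataFrames. The column names are illustrative, not from the library.

```python
def exact_name_corres(l_columns, r_columns):
    """Pair up attributes whose names appear in both tables, keeping ltable order."""
    r_set = set(r_columns)
    return [(c, c) for c in l_columns if c in r_set]

corres = exact_name_corres(['ID', 'name', 'price', 'city'],
                           ['ID', 'name', 'zipcode'])
# only 'ID' and 'name' occur in both column lists
```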
- py_entitymatching.get_attr_types(data_frame)[source]¶
This function gets the attribute types for a DataFrame.
Specifically this function gets the attribute types based on the statistics of the attributes. These attribute types can be str_eq_1w, str_bt_1w_5w, str_bt_5w_10w, str_gt_10w, boolean or numeric.
The types roughly capture whether the attribute is of type string, boolean, or numeric. Further, within the string type, the subtypes capture the average number of tokens in the column values. For example, str_bt_1w_5w means the average number of tokens in that column is greater than one word but less than five words.
- Parameters
data_frame (DataFrame) – The input DataFrame for which types of attributes must be determined.
- Returns
A Python dictionary is returned containing the attribute types.
Specifically, in the dictionary each key is an attribute name and the corresponding value is the type of that attribute.
Further, the dictionary contains a key _table, whose value is a pointer to the input DataFrame.
- Raises
AssertionError – If data_frame is not of type pandas DataFrame.
Examples
>>> import py_entitymatching as em
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> atypes1 = em.get_attr_types(A)
>>> atypes2 = em.get_attr_types(B)
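The string subtypes can be illustrated with a rough sketch (not the library's exact statistics) that buckets a string column by its average whitespace-token count, mirroring the type names str_eq_1w, str_bt_1w_5w, str_bt_5w_10w, and str_gt_10w; the threshold boundaries here are assumptions.

```python
def string_attr_type(values):
    """Classify a list of string values by average number of whitespace tokens."""
    avg = sum(len(v.split()) for v in values) / len(values)
    if avg == 1:
        return 'str_eq_1w'
    elif avg < 5:
        return 'str_bt_1w_5w'
    elif avg < 10:
        return 'str_bt_5w_10w'
    return 'str_gt_10w'

# average of 3 and 4 tokens is 3.5, which falls in the 1-5 word bucket
t = string_attr_type(['Apple iPhone 12', 'Samsung Galaxy S21 Ultra'])
```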
- py_entitymatching.get_sim_funs_for_blocking()[source]¶
This function returns the similarity functions that can be used for blocking purposes.
- Returns
A Python dictionary containing the similarity functions.
Specifically, the key is the similarity function name and the value is the actual similarity function.
Examples
>>> import py_entitymatching as em
>>> block_s = em.get_sim_funs_for_blocking()
- py_entitymatching.get_sim_funs_for_matching()[source]¶
This function returns the similarity functions that can be used for matching purposes.
- Returns
A Python dictionary containing the similarity functions.
Specifically, the key is the similarity function name and the value is the actual similarity function.
Examples
>>> import py_entitymatching as em
>>> match_s = em.get_sim_funs_for_matching()
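The shape of the returned dictionary can be sketched as follows: names map to callables. The two toy functions below are illustrative stand-ins, not the library's actual similarity measures.

```python
def exact_match(a, b):
    """Return 1.0 if the two values are equal, else 0.0."""
    return 1.0 if a == b else 0.0

def abs_norm(a, b):
    """Absolute-difference similarity for numbers, normalized to [0, 1]."""
    m = max(abs(a), abs(b))
    return 1.0 - abs(a - b) / m if m else 1.0

# the key is the similarity function name; the value is the function itself
sim_funcs = {'exact_match': exact_match, 'abs_norm': abs_norm}
value = sim_funcs['exact_match']('IBM', 'IBM')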
- py_entitymatching.get_tokenizers_for_blocking(q=[2, 3], dlm_char=[' '])[source]¶
This function returns the single argument tokenizers that can be used for blocking purposes (typically in rule-based blocking).
- Parameters
q (list) – The list of integers (i.e., q values) for which the q-gram tokenizers must be generated (defaults to [2, 3]).
dlm_char (list) – The list of characters (i.e., delimiter characters) for which the delimiter tokenizers must be generated (defaults to [' ']).
- Returns
A Python dictionary with tokenizer name as the key and tokenizer function as the value.
- Raises
AssertionError – If both q and dlm_char are set to None.
Examples
>>> import py_entitymatching as em
>>> block_t = em.get_tokenizers_for_blocking()
>>> block_t = em.get_tokenizers_for_blocking(q=[3], dlm_char=None)
>>> block_t = em.get_tokenizers_for_blocking(q=None, dlm_char=[' '])
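A minimal sketch, under assumed semantics, of how a dictionary of single-argument tokenizers could be built from the q values and delimiter characters described above. The key-naming scheme is an illustrative assumption, not the library's actual keys.

```python
def make_qgram_tokenizer(q):
    """Build a tokenizer that splits a string into overlapping q-grams."""
    def tokenize(s):
        return [s[i:i + q] for i in range(len(s) - q + 1)]
    return tokenize

def make_delim_tokenizer(d):
    """Build a tokenizer that splits a string on the delimiter character d."""
    def tokenize(s):
        return s.split(d)
    return tokenize

def get_tokenizers(q=[2, 3], dlm_char=[' ']):
    """Return a dict mapping a tokenizer name to a single-argument tokenizer."""
    toks = {}
    for qv in (q or []):
        toks['qgm_%d' % qv] = make_qgram_tokenizer(qv)
    for d in (dlm_char or []):
        toks['dlm_dc%d' % ord(d)] = make_delim_tokenizer(d)
    return toks

toks = get_tokenizers()
grams = toks['qgm_2']('data')  # overlapping 2-grams of 'data'
```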
- py_entitymatching.get_tokenizers_for_matching(q=[2, 3], dlm_char=[' '])[source]¶
This function returns the single argument tokenizers that can be used for matching purposes.
- Parameters
q (list) – The list of integers (i.e., q values) for which the q-gram tokenizers must be generated (defaults to [2, 3]).
dlm_char (list) – The list of characters (i.e., delimiter characters) for which the delimiter tokenizers must be generated (defaults to [' ']).
- Returns
A Python dictionary with tokenizer name as the key and tokenizer function as the value.
- Raises
AssertionError – If both q and dlm_char are set to None.
Examples
>>> import py_entitymatching as em
>>> match_t = em.get_tokenizers_for_matching()
>>> match_t = em.get_tokenizers_for_matching(q=[3], dlm_char=None)
>>> match_t = em.get_tokenizers_for_matching(q=None, dlm_char=[' '])