Combining Blocker Outputs
- py_entitymatching.combine_blocker_outputs_via_union(blocker_output_list, l_prefix='ltable_', r_prefix='rtable_', verbose=False)
Combines multiple blocker outputs by doing a union of their tuple pair ids (foreign key ltable, foreign key rtable).
Specifically, this function takes in a list of DataFrames (candidate sets, typically the output from blockers) and returns a consolidated DataFrame. The output DataFrame contains the union of tuple pair ids (foreign key ltable, foreign key rtable) and other attributes from the input list of DataFrames.
This function makes some assumptions about the input DataFrames. First, each DataFrame is expected to contain the following metadata in the catalog: key, fk_ltable, fk_rtable, ltable, and rtable. Second, all the DataFrames must be the result of blocking the same underlying tables. Concretely, the ltable and rtable properties must refer to the same DataFrames across all the input tables. Third, all the input DataFrames must have the same fk_ltable and fk_rtable properties. Finally, in each input DataFrame, the names of attributes taken from the ltable or rtable must be prefixed with the l_prefix or r_prefix, respectively, passed to this function.
The input DataFrames may contain different attribute lists, which raises the question of how to combine them. Currently py_entitymatching takes a union of the attribute names that have the prefix l_prefix or r_prefix across the input tables. After taking the union, for each tuple id pair included in the output, the attribute values (for the union-ed attribute names) are probed from the ltable/rtable and included in the output.
A subtle point to note here is that if an input DataFrame has a column added by the user (say, label), then that column will not be present in the output. The reason is that the same column may not be present in the other candidate sets, so it is not clear how to combine them. One possibility is to include label in the output for all tuple id pairs and set it to NaN where no value is available. Currently py_entitymatching does not include such columns; addressing this will be part of future work.
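The combining behavior described above can be illustrated with a small plain-Python stand-in. This is only a sketch of the semantics, not the library's implementation: the tables are modeled as dictionaries keyed by id, the candidate sets as lists of dictionary rows, and the function name combine_via_union_sketch and all sample data are hypothetical. Catalog metadata (including the candidate set key) is left out.

```python
# Hypothetical sample tables, keyed by their id attribute.
ltable = {
    "a1": {"name": "Ann", "zipcode": "53703"},
    "a2": {"name": "Bob", "zipcode": "53706"},
}
rtable = {
    "b1": {"name": "Anne", "zipcode": "53703"},
    "b2": {"name": "Rob", "zipcode": "53706"},
}

# Candidate set with name attributes and a user-added "label" column.
cand_c = [
    {"ltable_id": "a1", "rtable_id": "b1",
     "ltable_name": "Ann", "rtable_name": "Anne", "label": 1},
]
# Candidate set with zipcode attributes and one extra tuple pair.
cand_e = [
    {"ltable_id": "a1", "rtable_id": "b1",
     "ltable_zipcode": "53703", "rtable_zipcode": "53703"},
    {"ltable_id": "a2", "rtable_id": "b2",
     "ltable_zipcode": "53706", "rtable_zipcode": "53706"},
]


def combine_via_union_sketch(candsets, ltable, rtable, fk_ltable, fk_rtable,
                             l_prefix="ltable_", r_prefix="rtable_"):
    """Illustrative stand-in for the union semantics (not the library code)."""
    # Step 1: union of tuple pair ids across all candidate sets.
    pairs = sorted({(row[fk_ltable], row[fk_rtable])
                    for candset in candsets for row in candset})

    # Step 2: union of prefixed attribute names. Columns without either
    # prefix (such as "label") are ignored and therefore dropped.
    l_attrs, r_attrs = set(), set()
    for candset in candsets:
        for row in candset:
            for col in row:
                if col in (fk_ltable, fk_rtable):
                    continue
                if col.startswith(l_prefix):
                    l_attrs.add(col[len(l_prefix):])
                elif col.startswith(r_prefix):
                    r_attrs.add(col[len(r_prefix):])

    # Step 3: probe the attribute values from ltable/rtable for every pair.
    combined = []
    for l_id, r_id in pairs:
        out_row = {fk_ltable: l_id, fk_rtable: r_id}
        for attr in sorted(l_attrs):
            out_row[l_prefix + attr] = ltable[l_id][attr]
        for attr in sorted(r_attrs):
            out_row[r_prefix + attr] = rtable[r_id][attr]
        combined.append(out_row)
    return combined


combined = combine_via_union_sketch(
    [cand_c, cand_e], ltable, rtable, "ltable_id", "rtable_id")
for row in combined:
    print(row)
```

In this sketch the pair (a2, b2) appears only in the second candidate set, yet its name attributes are still filled in because the values are looked up in the base tables rather than copied from the candidate sets. The label column does not appear in any output row.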
- Parameters
blocker_output_list (list of DataFrames) – The list of DataFrames that should be combined.
l_prefix (string) – The prefix given to the attributes from the ltable.
r_prefix (string) – The prefix given to the attributes from the rtable.
verbose (boolean) – A flag to indicate whether more detailed information about the execution steps should be printed out (default value is False).
- Returns
A new DataFrame with the combined tuple pairs and other attributes from all the input blocker outputs.
- Raises
AssertionError – If l_prefix is not of type string.
AssertionError – If r_prefix is not of type string.
AssertionError – If the length of the input DataFrame list is 0.
AssertionError – If blocker_output_list is not a list of DataFrames.
AssertionError – If the ltables are different across the input list of DataFrames.
AssertionError – If the rtables are different across the input list of DataFrames.
AssertionError – If the fk_ltable values are different across the input list of DataFrames.
AssertionError – If the fk_rtable values are different across the input list of DataFrames.
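The consistency conditions behind these errors can be sketched in plain Python. The snippet below is a hypothetical illustration only: it models each candidate set's catalog metadata as a dictionary with the keys ltable, rtable, fk_ltable, and fk_rtable, and the helper names _require and check_inputs_sketch are not part of py_entitymatching. It does not reproduce the library's actual checks, their order, or their messages.

```python
def _require(condition, message):
    # Raise explicitly so the check still runs under "python -O".
    if not condition:
        raise AssertionError(message)


def check_inputs_sketch(metadata_list, l_prefix="ltable_", r_prefix="rtable_"):
    """Hypothetical stand-in for the input validation described above."""
    _require(isinstance(l_prefix, str), "l_prefix is not of type string")
    _require(isinstance(r_prefix, str), "r_prefix is not of type string")
    _require(isinstance(metadata_list, list), "input is not a list")
    _require(len(metadata_list) > 0, "input list is empty")

    first = metadata_list[0]
    for meta in metadata_list[1:]:
        # The base tables must be the very same objects, not just equal copies.
        _require(meta["ltable"] is first["ltable"], "ltables differ")
        _require(meta["rtable"] is first["rtable"], "rtables differ")
        # The foreign key column names must match exactly.
        _require(meta["fk_ltable"] == first["fk_ltable"], "fk_ltable values differ")
        _require(meta["fk_rtable"] == first["fk_rtable"], "fk_rtable values differ")
    return True


# Stand-ins for the two base tables; any objects work for the identity check.
A, B = object(), object()
meta_c = {"ltable": A, "rtable": B, "fk_ltable": "ltable_id", "fk_rtable": "rtable_id"}
meta_e = {"ltable": A, "rtable": B, "fk_ltable": "ltable_id", "fk_rtable": "rtable_id"}
meta_bad = {"ltable": object(), "rtable": B, "fk_ltable": "ltable_id", "fk_rtable": "rtable_id"}

consistent = check_inputs_sketch([meta_c, meta_e])
try:
    check_inputs_sketch([meta_c, meta_bad])
    mismatch_detected = False
except AssertionError:
    mismatch_detected = True
```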
Examples
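>>> import py_entitymatching as em
>>> # A and B are the input tables, assumed to be loaded already
>>> # First candidate set: attribute equivalence blocking on zipcode
>>> ab = em.AttrEquivalenceBlocker()
>>> C = ab.block_tables(A, B, 'zipcode', 'zipcode')
>>> # Refine C with an overlap blocker on address
>>> ob = em.OverlapBlocker()
>>> D = ob.block_candset(C, 'address', 'address')
>>> # Second candidate set: rule-based blocking on A and B
>>> block_f = em.get_features_for_blocking(A, B)
>>> rb = em.RuleBasedBlocker()
>>> rule = ['address_address_lev(ltuple, rtuple) > 6']
>>> rb.add_rule(rule, block_f)
>>> E = rb.block_tables(A, B)
>>> # Union of the two blocker outputs
>>> F = em.combine_blocker_outputs_via_union([C, E])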