Blocking

class py_entitymatching.AttrEquivalenceBlocker

Blocks based on the equivalence of attribute values.

block_candset(candset, l_block_attr, r_block_attr, allow_missing=False, verbose=False, show_progress=True, n_jobs=1)

Blocks an input candidate set of tuple pairs based on attribute equivalence.

Finds tuple pairs from an input candidate set of tuple pairs such that the value of attribute l_block_attr of the left tuple in a tuple pair exactly matches the value of attribute r_block_attr of the right tuple in the tuple pair.

Parameters

candset (DataFrame) – The input candidate set of tuple pairs.
l_block_attr (string) – The blocking attribute in left table.
r_block_attr (string) – The blocking attribute in right table.
allow_missing (boolean) – A flag to indicate whether tuple pairs with missing value in at least one of the blocking attributes should be included in the output candidate set (defaults to False). If this flag is set to True, a tuple pair with missing value in either blocking attribute will be retained in the output candidate set.
verbose (boolean) – A flag to indicate whether the debug information should be logged (defaults to False).
show_progress (boolean) – A flag to indicate whether progress should be displayed to the user (defaults to True).
n_jobs (int) – The number of parallel jobs to be used for computation (defaults to 1). If -1 all CPUs are used. If 0 or 1, no parallel computation is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus, for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) is less than 1, then no parallel computation is used (i.e., equivalent to the default).
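
The n_jobs rule described above can be sketched in plain Python. This is only an illustration of the documented arithmetic, not the library's code; the helper name effective_workers and its n_cpus argument are assumptions made for the example.

```python
def effective_workers(n_jobs, n_cpus):
    """Return the number of workers implied by n_jobs on a machine with n_cpus CPUs."""
    if n_jobs >= 0:
        # 0 and 1 both mean "no parallel computation", i.e. a single worker.
        return max(n_jobs, 1)
    # Negative values count back from the total: -1 -> all CPUs, -2 -> all but one.
    workers = n_cpus + 1 + n_jobs
    # If the formula drops below 1, fall back to the non-parallel default.
    return max(workers, 1)


for n_jobs in (1, 0, -1, -2, -20):
    print(n_jobs, effective_workers(n_jobs, n_cpus=8))
```

On a hypothetical 8-CPU machine, n_jobs=-2 resolves to 7 workers and n_jobs=-20 falls back to 1.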

Returns

A candidate set of tuple pairs that survived blocking (DataFrame).

Raises

AssertionError – If candset is not of type pandas DataFrame.
AssertionError – If l_block_attr is not of type string.
AssertionError – If r_block_attr is not of type string.
AssertionError – If verbose is not of type boolean.
AssertionError – If n_jobs is not of type int.
AssertionError – If l_block_attr is not in the ltable columns.
AssertionError – If r_block_attr is not in the rtable columns.

Examples

>>> import py_entitymatching as em
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> ab = em.AttrEquivalenceBlocker()
>>> C = ab.block_tables(A, B, 'zipcode', 'zipcode', l_output_attrs=['name'], r_output_attrs=['name'])

>>> D1 = ab.block_candset(C, 'age', 'age')
# Include all possible tuple pairs with missing values
>>> D2 = ab.block_candset(C, 'age', 'age', allow_missing=True)
# Execute blocking using multiple cores
>>> D3 = ab.block_candset(C, 'age', 'age', n_jobs=-1)

block_tables(ltable, rtable, l_block_attr, r_block_attr, l_output_attrs=None, r_output_attrs=None, l_output_prefix='ltable_', r_output_prefix='rtable_', allow_missing=False, verbose=False, n_jobs=1)

Blocks two tables based on attribute equivalence.

Conceptually, this checks l_block_attr = r_block_attr for each tuple pair from the Cartesian product of tables ltable and rtable, and outputs a pandas DataFrame with the tuple pairs that satisfy the equality condition. The DataFrame will include the attributes ‘_id’, the key attribute from ltable, and the key attribute from rtable, followed by the lists l_output_attrs and r_output_attrs if they are specified. Each of these output and key attributes will be prefixed with the given l_output_prefix and r_output_prefix. If allow_missing is set to True, then all tuple pairs with a missing value in the blocking attribute of at least one of the tuples will be included in the output DataFrame. Further, this will update the following metadata in the catalog for the output table: (1) key, (2) ltable, (3) rtable, (4) fk_ltable, and (5) fk_rtable.
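
The conceptual behaviour described above can be sketched with plain Python dictionaries standing in for table rows. This is a naive illustration of the Cartesian-product check, not how the library computes the result; the function name block_tables_sketch, the explicit l_key and r_key arguments, and the sample rows are all assumptions made for the example, and the catalog metadata update is not modelled.

```python
from itertools import product


def is_missing(value):
    # Treat None and NaN (the only value not equal to itself) as missing.
    return value is None or value != value


def block_tables_sketch(ltable, rtable, l_key, r_key, l_block_attr, r_block_attr,
                        l_output_attrs=(), r_output_attrs=(),
                        l_output_prefix='ltable_', r_output_prefix='rtable_',
                        allow_missing=False):
    rows = []
    for l_row, r_row in product(ltable, rtable):
        l_val, r_val = l_row[l_block_attr], r_row[r_block_attr]
        if is_missing(l_val) or is_missing(r_val):
            keep = allow_missing      # pairs with a missing value survive only on request
        else:
            keep = l_val == r_val     # the equality condition
        if not keep:
            continue
        # Output layout: _id, prefixed keys, then the prefixed output attributes.
        out = {'_id': len(rows),
               l_output_prefix + l_key: l_row[l_key],
               r_output_prefix + r_key: r_row[r_key]}
        for attr in l_output_attrs:
            out[l_output_prefix + attr] = l_row[attr]
        for attr in r_output_attrs:
            out[r_output_prefix + attr] = r_row[attr]
        rows.append(out)
    return rows


A = [{'ID': 'a1', 'name': 'Ann', 'zipcode': '53703'},
     {'ID': 'a2', 'name': 'Bob', 'zipcode': None}]
B = [{'ID': 'b1', 'name': 'Anne', 'zipcode': '53703'},
     {'ID': 'b2', 'name': 'Rob', 'zipcode': '94107'}]

C1 = block_tables_sketch(A, B, 'ID', 'ID', 'zipcode', 'zipcode', ['name'], ['name'])
C2 = block_tables_sketch(A, B, 'ID', 'ID', 'zipcode', 'zipcode', ['name'], ['name'],
                         allow_missing=True)
print(len(C1), len(C2))
```

With allow_missing=True, the row whose zipcode is missing is paired with every row of the other table, which is why C2 is larger than C1.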

Parameters

ltable (DataFrame) – The left input table.
rtable (DataFrame) – The right input table.
l_block_attr (string) – The blocking attribute in left table.
r_block_attr (string) – The blocking attribute in right table.
l_output_attrs (list) – A list of attribute names from the left table to be included in the output candidate set (defaults to None).
r_output_attrs (list) – A list of attribute names from the right table to be included in the output candidate set (defaults to None).
l_output_prefix (string) – The prefix to be used for the attribute names coming from the left table in the output candidate set (defaults to ‘ltable_’).
r_output_prefix (string) – The prefix to be used for the attribute names coming from the right table in the output candidate set (defaults to ‘rtable_’).
allow_missing (boolean) – A flag to indicate whether tuple pairs with missing value in at least one of the blocking attributes should be included in the output candidate set (defaults to False). If this flag is set to True, a tuple in ltable with missing value in the blocking attribute will be matched with every tuple in rtable and vice versa.
verbose (boolean) – A flag to indicate whether the debug information should be logged (defaults to False).
n_jobs (int) – The number of parallel jobs to be used for computation (defaults to 1). If -1 all CPUs are used. If 0 or 1, no parallel computation is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus, for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) is less than 1, then no parallel computation is used (i.e., equivalent to the default).

Returns

A candidate set of tuple pairs that survived blocking (DataFrame).

Raises

AssertionError – If ltable is not of type pandas DataFrame.
AssertionError – If rtable is not of type pandas DataFrame.
AssertionError – If l_block_attr is not of type string.
AssertionError – If r_block_attr is not of type string.
AssertionError – If l_output_attrs is not of type list.
AssertionError – If r_output_attrs is not of type list.
AssertionError – If the values in l_output_attrs are not of type string.
AssertionError – If the values in r_output_attrs are not of type string.
AssertionError – If l_output_prefix is not of type string.
AssertionError – If r_output_prefix is not of type string.
AssertionError – If verbose is not of type boolean.
AssertionError – If allow_missing is not of type boolean.
AssertionError – If n_jobs is not of type int.
AssertionError – If l_block_attr is not in the ltable columns.
AssertionError – If r_block_attr is not in the rtable columns.
AssertionError – If l_output_attrs are not in the ltable.
AssertionError – If r_output_attrs are not in the rtable.

Examples

>>> import py_entitymatching as em
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> ab = em.AttrEquivalenceBlocker()
>>> C1 = ab.block_tables(A, B, 'zipcode', 'zipcode', l_output_attrs=['name'], r_output_attrs=['name'])
# Include all possible tuple pairs with missing values
>>> C2 = ab.block_tables(A, B, 'zipcode', 'zipcode', l_output_attrs=['name'], r_output_attrs=['name'], allow_missing=True)

block_tuples(ltuple, rtuple, l_block_attr, r_block_attr, allow_missing=False)

Blocks a tuple pair based on attribute equivalence.

Parameters

ltuple (Series) – The input left tuple.
rtuple (Series) – The input right tuple.
l_block_attr (string) – The blocking attribute in left tuple.
r_block_attr (string) – The blocking attribute in right tuple.
allow_missing (boolean) – A flag to indicate whether a tuple pair with a missing value in at least one of the blocking attributes should be kept rather than blocked (defaults to False). If this flag is set to True, the pair will be kept if either ltuple has a missing value in l_block_attr or rtuple has a missing value in r_block_attr, or both.

Returns

A status indicating if the tuple pair is blocked, i.e., the values of l_block_attr in ltuple and r_block_attr in rtuple are different (boolean).
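
Note the polarity of the returned status: True means the pair is blocked (dropped), not that it matched. A minimal sketch of that contract follows, using dictionaries in place of pandas Series; the function name block_tuples_sketch is an assumption, and treating a missing value as "blocked" when allow_missing is False is inferred from the parameter description rather than confirmed against the library.

```python
def block_tuples_sketch(ltuple, rtuple, l_block_attr, r_block_attr, allow_missing=False):
    l_val, r_val = ltuple[l_block_attr], rtuple[r_block_attr]
    # None and NaN (which is not equal to itself) count as missing.
    if any(v is None or v != v for v in (l_val, r_val)):
        # allow_missing=True keeps the pair, so it is not blocked.
        return not allow_missing
    # Blocked exactly when the two attribute values differ.
    return l_val != r_val


print(block_tuples_sketch({'zipcode': '53703'}, {'zipcode': '53703'}, 'zipcode', 'zipcode'))
print(block_tuples_sketch({'zipcode': '53703'}, {'zipcode': '94107'}, 'zipcode', 'zipcode'))
```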

Examples

>>> import py_entitymatching as em
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> ab = em.AttrEquivalenceBlocker()
>>> status = ab.block_tuples(A.iloc[0], B.iloc[0], 'zipcode', 'zipcode')

class py_entitymatching.OverlapBlocker

Blocks based on the overlap of token sets of attribute values.

block_candset(candset, l_overlap_attr, r_overlap_attr, rem_stop_words=False, q_val=None, word_level=True, overlap_size=1, allow_missing=False, verbose=False, show_progress=True, n_jobs=1)

Blocks an input candidate set of tuple pairs based on the overlap of token sets of attribute values.

Finds tuple pairs from an input candidate set of tuple pairs such that the overlap between (a) the set of tokens obtained by tokenizing the value of attribute l_overlap_attr of the left tuple in a tuple pair, and (b) the set of tokens obtained by tokenizing the value of attribute r_overlap_attr of the right tuple in the tuple pair, is above a certain threshold.
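
As a rough illustration of the word-level case, the sketch below tokenizes two attribute values on whitespace and keeps the pair when the token sets share enough tokens. It is not the library's tokenizer: the helper names are hypothetical, the stop-word list is only the three examples mentioned in this section, and reading the threshold as "at least overlap_size shared tokens" is an assumption based on the overlap_size parameter description.

```python
# Illustrative stop-word list: just the examples named in the text.
STOP_WORDS = {'a', 'an', 'the'}


def word_tokens(value, rem_stop_words=False):
    # Word-level tokenization: split on whitespace and keep the distinct tokens.
    tokens = set(value.split())
    if rem_stop_words:
        tokens -= STOP_WORDS
    return tokens


def survives_overlap(l_value, r_value, overlap_size=1, rem_stop_words=False):
    l_tokens = word_tokens(l_value, rem_stop_words)
    r_tokens = word_tokens(r_value, rem_stop_words)
    # The pair survives when the two token sets share enough tokens.
    return len(l_tokens & r_tokens) >= overlap_size


print(survives_overlap('12 Main St', '12 Main Street', overlap_size=2))
print(survives_overlap('the office', 'the studio', rem_stop_words=True))
```

The second call shows why rem_stop_words matters: without it, the shared token "the" alone would be enough to keep the pair.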

Parameters

candset (DataFrame) – The input candidate set of tuple pairs.
l_overlap_attr (string) – The overlap attribute in left table.
r_overlap_attr (string) – The overlap attribute in right table.
rem_stop_words (boolean) – A flag to indicate whether stop words (e.g., a, an, the) should be removed from the token sets of the overlap attribute values (defaults to False).
q_val (int) – The value of q to use if the overlap attribute values are to be tokenized as qgrams (defaults to None).
word_level (boolean) – A flag to indicate whether the overlap attributes should be tokenized as words (i.e., using whitespace as the delimiter) (defaults to True).
overlap_size (int) – The minimum number of tokens that must overlap (defaults to 1).
allow_missing (boolean) – A flag to indicate whether tuple pairs with missing value in at least one of the blocking attributes should be included in the output candidate set (defaults to False). If this flag is set to True, a tuple pair with missing value in either blocking attribute will be retained in the output candidate set.
verbose (boolean) – A flag to indicate whether the debug information should be logged (defaults to False).
show_progress (boolean) – A flag to indicate whether progress should be displayed to the user (defaults to True).
n_jobs (int) – The number of parallel jobs to be used for computation (defaults to 1). If -1 all CPUs are used. If 0 or 1, no parallel computation is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus, for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) is less than 1, then no parallel computation is used (i.e., equivalent to the default).

Returns

A candidate set of tuple pairs that survived blocking (DataFrame).

Raises

AssertionError – If candset is not of type pandas DataFrame.
AssertionError – If l_overlap_attr is not of type string.
AssertionError – If r_overlap_attr is not of type string.
AssertionError – If q_val is not of type int.
AssertionError – If word_level is not of type boolean.
AssertionError – If overlap_size is not of type int.
AssertionError – If verbose is not of type boolean.
AssertionError – If allow_missing is not of type boolean.
AssertionError – If show_progress is not of type boolean.
AssertionError – If n_jobs is not of type int.
AssertionError – If l_overlap_attr is not in the ltable columns.
AssertionError – If r_overlap_attr is not in the rtable columns.
SyntaxError – If q_val is set to a valid value and word_level is set to True.
SyntaxError – If q_val is set to None and word_level is set to False.
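
The two SyntaxError conditions say that exactly one tokenization mode must be selected. A small sketch of that check is shown below; the function name and the error messages are hypothetical, and only the two conditions themselves come from the list above.

```python
def check_tokenizer_options(q_val, word_level):
    """Return the selected tokenization mode, or raise if the options conflict."""
    if q_val is not None and word_level:
        # Both modes requested at once.
        raise SyntaxError('q_val and word_level=True cannot be combined')
    if q_val is None and not word_level:
        # Neither mode requested.
        raise SyntaxError('set q_val or leave word_level=True')
    return 'word' if word_level else 'qgram'


print(check_tokenizer_options(q_val=None, word_level=True))
print(check_tokenizer_options(q_val=2, word_level=False))
```

This is why the q-gram examples in this section pass word_level=False together with q_val.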

Examples

>>> import py_entitymatching as em
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> ob = em.OverlapBlocker()
>>> C = ob.block_tables(A, B, 'address', 'address', l_output_attrs=['name'], r_output_attrs=['name'])

>>> D1 = ob.block_candset(C, 'name', 'name')
# Include all possible tuple pairs with missing values
>>> D2 = ob.block_candset(C, 'name', 'name', allow_missing=True)
# Execute blocking using multiple cores
>>> D3 = ob.block_candset(C, 'name', 'name', n_jobs=-1)
# Use q-gram tokenizer
>>> D4 = ob.block_candset(C, 'name', 'name', word_level=False, q_val=2)

block_tables(ltable, rtable, l_overlap_attr, r_overlap_attr, rem_stop_words=False, q_val=None, word_level=True, overlap_size=1, l_output_attrs=None, r_output_attrs=None, l_output_prefix='ltable_', r_output_prefix='rtable_', allow_missing=False, verbose=False, show_progress=True, n_jobs=1)

Blocks two tables based on the overlap of token sets of attribute values.

Finds tuple pairs from left and right tables such that the overlap between (a) the set of tokens obtained by tokenizing the value of attribute l_overlap_attr of a tuple from the left table, and (b) the set of tokens obtained by tokenizing the value of attribute r_overlap_attr of a tuple from the right table, is above a certain threshold.
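
For the q-gram case (word_level=False with q_val set), the token sets are sets of character q-grams rather than words. The sketch below uses plain, unpadded q-grams; that is an assumption made for illustration, and the library's tokenizer may pad or normalize values differently. The helper names are hypothetical.

```python
def qgrams(value, q):
    # All distinct substrings of length q; values shorter than q yield an empty set.
    return {value[i:i + q] for i in range(len(value) - q + 1)}


def qgram_overlap(l_value, r_value, q_val):
    # Number of q-grams shared by the two values.
    return len(qgrams(l_value, q_val) & qgrams(r_value, q_val))


print(sorted(qgrams('main', 2)))
print(qgram_overlap('main', 'maine', q_val=2))
```

Because q-grams tolerate small spelling differences, a q-gram tokenizer keeps pairs such as "main" and "maine" that share no whole word.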

Parameters

ltable (DataFrame) – The left input table.
rtable (DataFrame) – The right input table.
l_overlap_attr (string) – The overlap attribute in left table.
r_overlap_attr (string) – The overlap attribute in right table.
rem_stop_words (boolean) – A flag to indicate whether stop words (e.g., a, an, the) should be removed from the token sets of the overlap attribute values (defaults to False).
q_val (int) – The value of q to use if the overlap attribute values are to be tokenized as qgrams (defaults to None).
word_level (boolean) – A flag to indicate whether the overlap attributes should be tokenized as words (i.e., using whitespace as the delimiter) (defaults to True).
overlap_size (int) – The minimum number of tokens that must overlap (defaults to 1).
l_output_attrs (list) – A list of attribute names from the left table to be included in the output candidate set (defaults to None).
r_output_attrs (list) – A list of attribute names from the right table to be included in the output candidate set (defaults to None).
l_output_prefix (string) – The prefix to be used for the attribute names coming from the left table in the output candidate set (defaults to ‘ltable_’).
r_output_prefix (string) – The prefix to be used for the attribute names coming from the right table in the output candidate set (defaults to ‘rtable_’).
allow_missing (boolean) – A flag to indicate whether tuple pairs with missing value in at least one of the blocking attributes should be included in the output candidate set (defaults to False). If this flag is set to True, a tuple in ltable with missing value in the blocking attribute will be matched with every tuple in rtable and vice versa.
verbose (boolean) – A flag to indicate whether the debug information should be logged (defaults to False).
show_progress (boolean) – A flag to indicate whether progress should be displayed to the user (defaults to True).
n_jobs (int) – The number of parallel jobs to be used for computation (defaults to 1). If -1 all CPUs are used. If 0 or 1, no parallel computation is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus, for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) is less than 1, then no parallel computation is used (i.e., equivalent to the default).

Returns

A candidate set of tuple pairs that survived blocking (DataFrame).

Raises

AssertionError – If ltable is not of type pandas DataFrame.
AssertionError – If rtable is not of type pandas DataFrame.
AssertionError – If l_overlap_attr is not of type string.
AssertionError – If r_overlap_attr is not of type string.
AssertionError – If l_output_attrs is not of type list.
AssertionError – If r_output_attrs is not of type list.
AssertionError – If the values in l_output_attrs are not of type string.
AssertionError – If the values in r_output_attrs are not of type string.
AssertionError – If l_output_prefix is not of type string.
AssertionError – If r_output_prefix is not of type string.
AssertionError – If q_val is not of type int.
AssertionError – If word_level is not of type boolean.
AssertionError – If overlap_size is not of type int.
AssertionError – If verbose is not of type boolean.
AssertionError – If allow_missing is not of type boolean.
AssertionError – If show_progress is not of type boolean.
AssertionError – If n_jobs is not of type int.
AssertionError – If l_overlap_attr is not in the ltable columns.
AssertionError – If r_overlap_attr is not in the rtable columns.
AssertionError – If l_output_attrs are not in the ltable.
AssertionError – If r_output_attrs are not in the rtable.
SyntaxError – If q_val is set to a valid value and word_level is set to True.
SyntaxError – If q_val is set to None and word_level is set to False.

Examples

>>> import py_entitymatching as em
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> ob = em.OverlapBlocker()
# Use word-level tokenizer
>>> C1 = ob.block_tables(A, B, 'address', 'address', l_output_attrs=['name'], r_output_attrs=['name'], word_level=True, overlap_size=1)
# Use q-gram tokenizer
>>> C2 = ob.block_tables(A, B, 'address', 'address', l_output_attrs=['name'], r_output_attrs=['name'], word_level=False, q_val=2)
# Include all possible tuple pairs with missing values
>>> C3 = ob.block_tables(A, B, 'address', 'address', l_output_attrs=['name'], r_output_attrs=['name'], allow_missing=True)
# Use all the cores in the machine
>>> C4 = ob.block_tables(A, B, 'address', 'address', l_output_attrs=['name'], r_output_attrs=['name'], n_jobs=-1)

block_tuples(ltuple, rtuple, l_overlap_attr, r_overlap_attr, rem_stop_words=False, q_val=None, word_level=True, overlap_size=1, allow_missing=False)

Blocks a tuple pair based on the overlap of token sets of attribute values.

Parameters

ltuple (Series) – The input left tuple.
rtuple (Series) – The input right tuple.
l_overlap_attr (string) – The overlap attribute in left tuple.
r_overlap_attr (string) – The overlap attribute in right tuple.
rem_stop_words (boolean) – A flag to indicate whether stop words (e.g., a, an, the) should be removed from the token sets of the overlap attribute values (defaults to False).
q_val (int) – The value of q to use if the overlap attribute values are to be tokenized as qgrams (defaults to None).
word_level (boolean) – A flag to indicate whether the overlap attributes should be tokenized as words (i.e., using whitespace as the delimiter) (defaults to True).
overlap_size (int) – The minimum number of tokens that must overlap (defaults to 1).
allow_missing (boolean) – A flag to indicate whether a tuple pair with a missing value in at least one of the overlap attributes should be kept rather than blocked (defaults to False). If this flag is set to True, the pair will be kept if either ltuple has a missing value in l_overlap_attr or rtuple has a missing value in r_overlap_attr, or both.

Returns

A status indicating if the tuple pair is blocked (boolean).

Examples

>>> import py_entitymatching as em
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> ob = em.OverlapBlocker()
>>> status = ob.block_tuples(A.iloc[0], B.iloc[0], 'address', 'address')

class py_entitymatching.RuleBasedBlocker(*args, **kwargs)

Blocks based on a sequence of blocking rules supplied by the user.

add_rule(conjunct_list, feature_table=None, rule_name=None)

Adds a rule to the rule-based blocker.

Parameters

conjunct_list (list) – A list of conjuncts specifying the rule.
feature_table (DataFrame) – A DataFrame containing all the features that are being referenced by the rule (defaults to None). If the feature_table is not supplied here, then it must have been specified during the creation of the rule-based blocker or using set_feature_table function. Otherwise an AssertionError will be raised and the rule will not be added to the rule-based blocker.
rule_name (string) – A string specifying the name of the rule to be added (defaults to None). If the rule_name is not specified then a name will be automatically chosen. If there is already a rule with the specified rule_name, then an AssertionError will be raised and the rule will not be added to the rule-based blocker.

Returns

The name of the rule added (string).

Raises

AssertionError – If rule_name already exists.
AssertionError – If feature_table is not a valid parameter value.
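
The add_rule contract described above (a feature table must be available, names must be unique, and a name is generated when none is given) can be mimicked with a small stand-in class. Everything here is hypothetical: the class name, the storage layout, and in particular the generated-name scheme, which is invented for the example and may not resemble the names the library chooses.

```python
class RuleRegistrySketch:
    """Hypothetical stand-in that mimics the documented add_rule contract."""

    def __init__(self, feature_table=None):
        self.feature_table = feature_table
        self.rules = {}
        self._next_id = 1

    def set_feature_table(self, feature_table):
        self.feature_table = feature_table

    def add_rule(self, conjunct_list, feature_table=None, rule_name=None):
        # Use the table passed here, else the one set earlier on the registry.
        table = feature_table if feature_table is not None else self.feature_table
        assert table is not None, 'a feature table must be supplied'
        if rule_name is None:
            # Invented naming scheme, for illustration only.
            rule_name = '_rule_%d' % self._next_id
            self._next_id += 1
        assert rule_name not in self.rules, 'rule name already exists'
        self.rules[rule_name] = list(conjunct_list)
        return rule_name


registry = RuleRegistrySketch()
registry.set_feature_table(['name_name_lev'])  # placeholder for a real feature table
print(registry.add_rule(['name_name_lev(ltuple, rtuple) > 3'], rule_name='rule1'))
```

In both failure cases the assertion fires before the rule is stored, matching the statement that the rule will not be added.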

Examples

>>> import py_entitymatching as em
>>> rb = em.RuleBasedBlocker()
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='id')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='id')
>>> block_f = em.get_features_for_blocking(A, B)
>>> rule = ['name_name_lev(ltuple, rtuple) > 3']
>>> rb.add_rule(rule, feature_table=block_f, rule_name='rule1')

block_candset(candset, verbose=False, show_progress=True, n_jobs=1)

Blocks an input candidate set of tuple pairs based on a sequence of blocking rules supplied by the user.

Finds tuple pairs from an input candidate set of tuple pairs that survive the sequence of blocking rules. A tuple pair survives the sequence of blocking rules if none of the rules in the sequence returns True for that pair. If any of the rules returns True, then the pair is blocked (dropped).
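
The survival condition can be sketched as follows: a pair is kept only if no rule fires. The rule below mirrors the 'name_name_lev(ltuple, rtuple) > 3' rule used in the examples of this section, with a hand-written Levenshtein distance standing in for the library's generated feature. The function names and sample pairs are assumptions made for the illustration.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def name_too_different(ltuple, rtuple):
    # Returns True (block the pair) when the names are more than 3 edits apart.
    return levenshtein(ltuple['name'], rtuple['name']) > 3


def survives_rules(ltuple, rtuple, rules):
    # A pair survives only if none of the rules returns True for it.
    return not any(rule(ltuple, rtuple) for rule in rules)


pairs = [({'name': 'Ann'}, {'name': 'Anne'}),
         ({'name': 'Ann'}, {'name': 'Robert'})]
kept = [(l, r) for l, r in pairs if survives_rules(l, r, [name_too_different])]
print([(l['name'], r['name']) for l, r in kept])
```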

Parameters

candset (DataFrame) – The input candidate set of tuple pairs.
verbose (boolean) – A flag to indicate whether the debug information should be logged (defaults to False).
show_progress (boolean) – A flag to indicate whether progress should be displayed to the user (defaults to True).
n_jobs (int) – The number of parallel jobs to be used for computation (defaults to 1). If -1 all CPUs are used. If 0 or 1, no parallel computation is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus, for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) is less than 1, then no parallel computation is used (i.e., equivalent to the default).

Returns

A candidate set of tuple pairs that survived blocking (DataFrame).

Raises

AssertionError – If candset is not of type pandas DataFrame.
AssertionError – If verbose is not of type boolean.
AssertionError – If n_jobs is not of type int.
AssertionError – If show_progress is not of type boolean.
AssertionError – If l_block_attr is not in the ltable columns.
AssertionError – If r_block_attr is not in the rtable columns.
AssertionError – If there are no rules to apply.

Examples

>>> import py_entitymatching as em
>>> rb = em.RuleBasedBlocker()
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='id')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='id')
>>> block_f = em.get_features_for_blocking(A, B)
>>> rule = ['name_name_lev(ltuple, rtuple) > 3']
>>> rb.add_rule(rule, feature_table=block_f)
>>> D = rb.block_candset(C) # C is the candidate set.

block_tables(ltable, rtable, l_output_attrs=None, r_output_attrs=None, l_output_prefix='ltable_', r_output_prefix='rtable_', verbose=False, show_progress=True, n_jobs=1)

Blocks two tables based on the sequence of rules supplied by the user.

Finds tuple pairs from left and right tables that survive the sequence of blocking rules. A tuple pair survives the sequence of blocking rules if none of the rules in the sequence returns True for that pair. If any of the rules returns True, then the pair is blocked.

Parameters

ltable (DataFrame) – The left input table.
rtable (DataFrame) – The right input table.
l_output_attrs (list) – A list of attribute names from the left table to be included in the output candidate set (defaults to None).
r_output_attrs (list) – A list of attribute names from the right table to be included in the output candidate set (defaults to None).
l_output_prefix (string) – The prefix to be used for the attribute names coming from the left table in the output candidate set (defaults to ‘ltable_’).
r_output_prefix (string) – The prefix to be used for the attribute names coming from the right table in the output candidate set (defaults to ‘rtable_’).
verbose (boolean) – A flag to indicate whether the debug information should be logged (defaults to False).
show_progress (boolean) – A flag to indicate whether progress should be displayed to the user (defaults to True).
n_jobs (int) – The number of parallel jobs to be used for computation (defaults to 1). If -1 all CPUs are used. If 0 or 1, no parallel computation is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus, for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) is less than 1, then no parallel computation is used (i.e., equivalent to the default).

Returns

A candidate set of tuple pairs that survived the sequence of blocking rules (DataFrame).

Raises

AssertionError – If ltable is not of type pandas DataFrame.
AssertionError – If rtable is not of type pandas DataFrame.
AssertionError – If l_output_attrs is not of type list.
AssertionError – If r_output_attrs is not of type list.
AssertionError – If the values in l_output_attrs are not of type string.
AssertionError – If the values in r_output_attrs are not of type string.
AssertionError – If the input l_output_prefix is not of type string.
AssertionError – If the input r_output_prefix is not of type string.
AssertionError – If verbose is not of type boolean.
AssertionError – If show_progress is not of type boolean.
AssertionError – If n_jobs is not of type int.
AssertionError – If l_output_attrs are not in the ltable.
AssertionError – If r_output_attrs are not in the rtable.
AssertionError – If there are no rules to apply.

Examples

>>> import py_entitymatching as em
>>> rb = em.RuleBasedBlocker()
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='id')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='id')
>>> block_f = em.get_features_for_blocking(A, B)
>>> rule = ['name_name_lev(ltuple, rtuple) > 3']
>>> rb.add_rule(rule, feature_table=block_f)
>>> C = rb.block_tables(A, B)

block_tuples(ltuple, rtuple)

Blocks a tuple pair based on a sequence of blocking rules supplied by the user.

Parameters

ltuple (Series) – The input left tuple.
rtuple (Series) – The input right tuple.

Returns

A status indicating if the tuple pair is blocked by applying the sequence of blocking rules (boolean).

Examples

>>> import py_entitymatching as em
>>> rb = em.RuleBasedBlocker()
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='id')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='id')
>>> block_f = em.get_features_for_blocking(A, B)
>>> rule = ['name_name_lev(ltuple, rtuple) > 3']
>>> rb.add_rule(rule, feature_table=block_f)
>>> D = rb.block_tuples(A.iloc[0], B.iloc[1])

delete_rule(rule_name)

Deletes a rule from the rule-based blocker.

Parameters: rule_name (string) – Name of the rule to be deleted.

Examples

>>> import py_entitymatching as em
>>> rb = em.RuleBasedBlocker()
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='id')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='id')
>>> block_f = em.get_features_for_blocking(A, B)
>>> rule = ['name_name_lev(ltuple, rtuple) > 3']
>>> rb.add_rule(rule, block_f, rule_name='rule_1')
>>> rb.delete_rule('rule_1')

get_rule(rule_name)

Returns the function corresponding to a rule.

Parameters: rule_name (string) – Name of the rule.
Returns: A function object corresponding to the specified rule.

Examples

>>> import py_entitymatching as em
>>> rb = em.RuleBasedBlocker()
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='id')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='id')
>>> block_f = em.get_features_for_blocking(A, B)
>>> rule = ['name_name_lev(ltuple, rtuple) > 3']
>>> rb.add_rule(rule, feature_table=block_f, rule_name='rule_1')
>>> rb.get_rule('rule_1')

get_rule_names()

Returns the names of all the rules in the rule-based blocker.

Returns: A list of names of all the rules in the rule-based blocker (list).

Examples

>>> import py_entitymatching as em
>>> rb = em.RuleBasedBlocker()
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='id')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='id')
>>> block_f = em.get_features_for_blocking(A, B)
>>> rule = ['name_name_lev(ltuple, rtuple) > 3']
>>> rb.add_rule(rule, block_f, rule_name='rule_1')
>>> rb.get_rule_names()

set_feature_table(feature_table)

Sets feature table for the rule-based blocker.

Parameters: feature_table (DataFrame) – A DataFrame containing features.

Examples

>>> import py_entitymatching as em
>>> rb = em.RuleBasedBlocker()
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='id')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='id')
>>> block_f = em.get_features_for_blocking(A, B)
>>> rb.set_feature_table(block_f)

view_rule(rule_name)

Prints the source code of the function corresponding to a rule.

Parameters: rule_name (string) – Name of the rule to be viewed.

Examples

>>> import py_entitymatching as em
>>> rb = em.RuleBasedBlocker()
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='id')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='id')
>>> block_f = em.get_features_for_blocking(A, B)
>>> rule = ['name_name_lev(ltuple, rtuple) > 3']
>>> rb.add_rule(rule, block_f, rule_name='rule_1')
>>> rb.view_rule('rule_1')

class py_entitymatching.BlackBoxBlocker(*args, **kwargs)

Blocks based on a black box function specified by the user.

block_candset(candset, verbose=True, show_progress=True, n_jobs=1)

Blocks an input candidate set of tuple pairs based on a black box blocking function specified by the user.

Finds tuple pairs from an input candidate set of tuple pairs that survive the black box function. A tuple pair survives the black box blocking function if the function returns False for that pair, otherwise the tuple pair is dropped.
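
The convention is easy to get backwards: the black box function returns True to drop a pair and False to keep it. A minimal sketch with dictionaries in place of tuple pairs is given below; the helper names and sample data are assumptions made for the illustration, and the function assumes every name has two words.

```python
def last_names_differ(ltuple, rtuple):
    # Black box function: True means "block this pair".
    return ltuple['name'].split()[1] != rtuple['name'].split()[1]


def block_candset_sketch(pairs, black_box_function):
    # Keep only the pairs for which the black box function returns False.
    return [(l, r) for l, r in pairs if not black_box_function(l, r)]


pairs = [({'name': 'Ann Smith'}, {'name': 'Anne Smith'}),
         ({'name': 'Ann Smith'}, {'name': 'Ann Jones'})]
kept = block_candset_sketch(pairs, last_names_differ)
print([(l['name'], r['name']) for l, r in kept])
```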

Parameters

candset (DataFrame) – The input candidate set of tuple pairs.
verbose (boolean) – A flag to indicate whether the debug information should be logged (defaults to True).
show_progress (boolean) – A flag to indicate whether progress should be displayed to the user (defaults to True).
n_jobs (int) – The number of parallel jobs to be used for computation (defaults to 1). If -1 all CPUs are used. If 0 or 1, no parallel computation is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus, for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) is less than 1, then no parallel computation is used (i.e., equivalent to the default).

Returns

A candidate set of tuple pairs that survived blocking (DataFrame).

Raises

AssertionError – If candset is not of type pandas DataFrame.
AssertionError – If verbose is not of type boolean.
AssertionError – If n_jobs is not of type int.
AssertionError – If show_progress is not of type boolean.
AssertionError – If l_block_attr is not in the ltable columns.
AssertionError – If r_block_attr is not in the rtable columns.

Examples

>>> def match_last_name(ltuple, rtuple):
...     # assume that there is a 'name' attribute in the input tables
...     # and each value in it has two words
...     l_last_name = ltuple['name'].split()[1]
...     r_last_name = rtuple['name'].split()[1]
...     # return True to block (drop) the pair when the last names differ
...     return l_last_name != r_last_name
>>> import py_entitymatching as em
>>> bb = em.BlackBoxBlocker()
>>> bb.set_black_box_function(match_last_name)
>>> D = bb.block_candset(C) # C is an output from block_tables

block_tables(ltable, rtable, l_output_attrs=None, r_output_attrs=None, l_output_prefix='ltable_', r_output_prefix='rtable_', verbose=False, show_progress=True, n_jobs=1)

Blocks two tables based on a black box blocking function specified by the user.

Finds tuple pairs from the left and right tables that survive the black box function. A tuple pair survives the black box blocking function if the function returns False for that pair; otherwise, the tuple pair is dropped.

Parameters

ltable (DataFrame) – The left input table.
rtable (DataFrame) – The right input table.
l_output_attrs (list) – A list of attribute names from the left table to be included in the output candidate set (defaults to None).
r_output_attrs (list) – A list of attribute names from the right table to be included in the output candidate set (defaults to None).
l_output_prefix (string) – The prefix to be used for the attribute names coming from the left table in the output candidate set (defaults to ‘ltable_’).
r_output_prefix (string) – The prefix to be used for the attribute names coming from the right table in the output candidate set (defaults to ‘rtable_’).
verbose (boolean) – A flag to indicate whether the debug information should be logged (defaults to False).
show_progress (boolean) – A flag to indicate whether progress should be displayed to the user (defaults to True).
n_jobs (int) – The number of parallel jobs to be used for computation (defaults to 1). If -1 all CPUs are used. If 0 or 1, no parallel computation is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus, for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) is less than 1, then no parallel computation is used (i.e., equivalent to the default).

Returns

A candidate set of tuple pairs that survived blocking (DataFrame).

Raises

AssertionError – If ltable is not of type pandas DataFrame.
AssertionError – If rtable is not of type pandas DataFrame.
AssertionError – If l_output_attrs is not of type list.
AssertionError – If r_output_attrs is not of type list.
AssertionError – If the values in l_output_attrs are not of type string.
AssertionError – If the values in r_output_attrs are not of type string.
AssertionError – If l_output_prefix is not of type string.
AssertionError – If r_output_prefix is not of type string.
AssertionError – If verbose is not of type boolean.
AssertionError – If show_progress is not of type boolean.
AssertionError – If n_jobs is not of type int.
AssertionError – If the attributes in l_output_attrs are not in the ltable.
AssertionError – If the attributes in r_output_attrs are not in the rtable.

Examples

>>> def match_last_name(ltuple, rtuple):
    # assume that there is a 'name' attribute in the input tables
    # and each value in it has two words
    l_last_name = ltuple['name'].split()[1]
    r_last_name = rtuple['name'].split()[1]
    if l_last_name != r_last_name:
        return True
    else:
        return False
>>> import py_entitymatching as em
>>> bb = em.BlackBoxBlocker()
>>> bb.set_black_box_function(match_last_name)

>>> C = bb.block_tables(A, B, l_output_attrs=['name'], r_output_attrs=['name'])
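
The effect of the output attributes and prefixes can be sketched without the library. The code below is an illustration under stated assumptions, not the library's implementation: tables are lists of dicts, each record has an 'id' key (an assumption made here so that pairs can be identified), and block_tables_sketch is a hypothetical name.

```python
from itertools import product


def match_last_name(ltuple, rtuple):
    # Return True to DROP the pair (last names differ), False to KEEP it.
    return ltuple['name'].split()[1] != rtuple['name'].split()[1]


def block_tables_sketch(ltable, rtable, black_box_function,
                        l_output_attrs=None, r_output_attrs=None,
                        l_output_prefix='ltable_', r_output_prefix='rtable_'):
    """Apply the black box function to every left/right pair and keep survivors."""
    rows = []
    for ltuple, rtuple in product(ltable, rtable):
        if black_box_function(ltuple, rtuple):
            continue  # the function returned True, so the pair is dropped
        # Assumption: 'id' identifies a record in each table.
        row = {l_output_prefix + 'id': ltuple['id'],
               r_output_prefix + 'id': rtuple['id']}
        for attr in l_output_attrs or []:
            row[l_output_prefix + attr] = ltuple[attr]
        for attr in r_output_attrs or []:
            row[r_output_prefix + attr] = rtuple[attr]
        rows.append(row)
    return rows


A = [{'id': 'a1', 'name': 'Ann Smith'}, {'id': 'a2', 'name': 'Bob Jones'}]
B = [{'id': 'b1', 'name': 'Anna Smith'}]
C = block_tables_sketch(A, B, match_last_name,
                        l_output_attrs=['name'], r_output_attrs=['name'])
print(C)
```

Note that this sketch evaluates the function on the full cross product of the two tables, so its cost grows with the product of the table sizes.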

block_tuples(ltuple, rtuple)¶

Blocks a tuple pair based on a black box blocking function specified by the user.

Takes a tuple pair as input, applies the black box blocking function to it, and returns True (if the intention is to drop the pair) or False (if the intention is to keep the tuple pair).

Parameters

ltuple (Series) – The input left tuple.
rtuple (Series) – The input right tuple.

Returns

A status indicating if the tuple pair should be dropped or kept, based on the black box blocking function (boolean).

Examples

>>> def match_last_name(ltuple, rtuple):
    # assume that there is a 'name' attribute in the input tables
    # and each value in it has two words
    l_last_name = ltuple['name'].split()[1]
    r_last_name = rtuple['name'].split()[1]
    if l_last_name != r_last_name:
        return True
    else:
        return False
>>> import py_entitymatching as em
>>> bb = em.BlackBoxBlocker()
>>> bb.set_black_box_function(match_last_name)
>>> status = bb.block_tuples(A.iloc[0], B.iloc[0]) # A, B are input tables; .ix was removed in pandas 1.0.

set_black_box_function(function)¶

Sets black box function to be used for blocking.

Parameters: function (function) – The black box function to be used for blocking.

class py_entitymatching.SortedNeighborhoodBlocker¶

WARNING: THIS IS AN EXPERIMENTAL CLASS. THIS CLASS IS NOT TESTED. USE AT YOUR OWN RISK.

Blocks based on the sorted neighborhood blocking method.

static block_candset(*args, **kwargs)¶: block_candset does not apply to the sorted neighborhood blocker and is unimplemented.

block_tables(ltable, rtable, l_block_attr, r_block_attr, window_size=2, l_output_attrs=None, r_output_attrs=None, l_output_prefix='ltable_', r_output_prefix='rtable_', allow_missing=False, verbose=False, n_jobs=1)¶

WARNING: THIS IS AN EXPERIMENTAL COMMAND. THIS COMMAND IS NOT TESTED. USE AT YOUR OWN RISK.

Blocks two tables based on sorted neighborhood.

Finds tuple pairs from the left and right tables such that, when each table is sorted on a blocking attribute, the tuples in a pair are within a distance w of each other, where w is the window size (window_size). The blocking attribute must be created before calling this function.
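
As a rough illustration of the general sorted neighborhood idea, the sketch below merges the records of both tables, sorts them on the blocking attribute, and slides a window over the sorted sequence. It is a minimal sketch under stated assumptions, not the library's implementation: tables are lists of dicts, pairs are reported as (left index, right index), and sorted_neighborhood_pairs is a hypothetical name.

```python
def sorted_neighborhood_pairs(left, right, key, window_size=2):
    """Return (left_index, right_index) pairs that fall within one window."""
    # Tag every record with its blocking value, source table, and position.
    tagged = [(rec[key], 'l', i) for i, rec in enumerate(left)]
    tagged += [(rec[key], 'r', j) for j, rec in enumerate(right)]
    tagged.sort(key=lambda t: t[0])  # sort on the blocking value only

    pairs = set()
    for pos, (_, side, idx) in enumerate(tagged):
        # Compare with the next (window_size - 1) records in sorted order.
        for _, other_side, other_idx in tagged[pos + 1:pos + window_size]:
            if side != other_side:  # only pair records from different tables
                if side == 'l':
                    pairs.add((idx, other_idx))
                else:
                    pairs.add((other_idx, idx))
    return sorted(pairs)


left = [{'zip': '10001'}, {'zip': '53703'}]
right = [{'zip': '10002'}, {'zip': '53704'}, {'zip': '94105'}]
print(sorted_neighborhood_pairs(left, right, 'zip', window_size=2))
```

A larger window_size admits more candidate pairs, trading a bigger candidate set for a lower chance of missing a true match.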

Parameters

ltable (DataFrame) – The left input table.
rtable (DataFrame) – The right input table.
l_block_attr (string) – The blocking attribute for left table.
r_block_attr (string) – The blocking attribute for right table.
window_size (int) – The size of the sliding window (defaults to 2).
l_output_attrs (list) – A list of attribute names from the left table to be included in the output candidate set (defaults to None).
r_output_attrs (list) – A list of attribute names from the right table to be included in the output candidate set (defaults to None).
l_output_prefix (string) – The prefix to be used for the attribute names coming from the left table in the output candidate set (defaults to ‘ltable_’).
r_output_prefix (string) – The prefix to be used for the attribute names coming from the right table in the output candidate set (defaults to ‘rtable_’).
allow_missing (boolean) – A flag to indicate whether tuple pairs with missing value in at least one of the blocking attributes should be included in the output candidate set (defaults to False). If this flag is set to True, a tuple in ltable with missing value in the blocking attribute will be matched with every tuple in rtable and vice versa.
verbose (boolean) – A flag to indicate whether the debug information should be logged (defaults to False).
n_jobs (int) – The number of parallel jobs to be used for computation (defaults to 1). If -1 all CPUs are used. If 0 or 1, no parallel computation is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus, for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) is less than 1, then no parallel computation is used (i.e., equivalent to the default).

Returns

A candidate set of tuple pairs that survived blocking (DataFrame).

Raises

AssertionError – If ltable is not of type pandas DataFrame.
AssertionError – If rtable is not of type pandas DataFrame.
AssertionError – If l_block_attr is not of type string.
AssertionError – If r_block_attr is not of type string.
AssertionError – If window_size is not of type int or if window_size < 2.
AssertionError – If the values in l_output_attrs are not of type string.
AssertionError – If the values in r_output_attrs are not of type string.
AssertionError – If l_output_prefix is not of type string.
AssertionError – If r_output_prefix is not of type string.
AssertionError – If verbose is not of type boolean.
AssertionError – If allow_missing is not of type boolean.
AssertionError – If n_jobs is not of type int.
AssertionError – If l_block_attr is not in the ltable columns.
AssertionError – If r_block_attr is not in the rtable columns.
AssertionError – If the attributes in l_output_attrs are not in the ltable.
AssertionError – If the attributes in r_output_attrs are not in the rtable.

static block_tuples(*args, **kwargs)¶: block_tuples does not apply to the sorted neighborhood blocker and is unimplemented.

static validate_block_attrs(ltable, rtable, l_block_attr, r_block_attr)¶: Validates the blocking attributes.

static validate_types_block_attrs(l_block_attr, r_block_attr)¶: Validates the data types of the blocking attributes.
