Commands Implemented Using Dask¶

Downsampling¶

py_entitymatching.dask.dask_down_sample.dask_down_sample(ltable, rtable, size, y_param, show_progress=True, verbose=False, seed=None, rem_stop_words=True, rem_puncs=True, n_ltable_chunks=1, n_sample_rtable_chunks=1)¶

WARNING THIS COMMAND IS EXPERIMENTAL AND NOT TESTED. USE AT YOUR OWN RISK.

This command down samples two tables A and B into smaller tables A’ and B’ respectively. Specifically, first it randomly selects size tuples from the table B to be table B’. Next, it builds an inverted index I (token, tuple_id) on table A. For each tuple x ∈ B’, the algorithm finds a set P of k/2 tuples from I that match x, and a set Q of k/2 tuples randomly selected from A - P. Here k corresponds to y_param, which is why the down sampled table A has close to size * y_param tuples. The idea is for A’ and B’ to share some matches yet be as representative of A and B as possible.

Parameters

ltable (DataFrame) – The left input table, i.e., table A.
rtable (DataFrame) – The right input table, i.e., table B.
size (int) – The size that table B should be down sampled to.
y_param (int) – The parameter to control the down sample size of table A. Specifically, the down sampled size of table A should be close to size * y_param.
show_progress (boolean) – A flag to indicate whether a progress bar should be displayed (defaults to True).
verbose (boolean) – A flag to indicate whether the debug information should be displayed (defaults to False).
seed (int) – The seed for the pseudo random number generator to select the tuples from A and B (defaults to None).
rem_stop_words (boolean) – A flag to indicate whether a default set of stop words must be removed (defaults to True).
rem_puncs (boolean) – A flag to indicate whether punctuation must be removed from the strings (defaults to True).
n_ltable_chunks (int) – The number of partitions for ltable (defaults to 1). If it is set to -1, the number of partitions will be set to the number of cores in the machine.
n_sample_rtable_chunks (int) – The number of partitions for the sampled rtable (defaults to 1).

Returns

Down sampled tables A and B as pandas DataFrames.

Raises

AssertionError – If any of the input tables (ltable, rtable) are empty or not a DataFrame.
AssertionError – If size or y_param is empty or 0 or not a valid integer value.
AssertionError – If seed is not a valid integer value.
AssertionError – If verbose is not of type bool.
AssertionError – If show_progress is not of type bool.
AssertionError – If n_ltable_chunks is not of type int.
AssertionError – If n_sample_rtable_chunks is not of type int.

Examples

>>> import py_entitymatching as em
>>> from py_entitymatching.dask.dask_down_sample import dask_down_sample
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> sample_A, sample_B = dask_down_sample(A, B, 500, 1, n_ltable_chunks=-1, n_sample_rtable_chunks=-1)
# Example with seed = 0. This means the same sample data set will be returned
# each time this function is run.
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> sample_A, sample_B = dask_down_sample(A, B, 500, 1, seed=0, n_ltable_chunks=-1, n_sample_rtable_chunks=-1)
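
The sampling procedure described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: each tuple is reduced to a single string, "match" is taken to mean "shares the most tokens", and the helper names (tokenize, down_sample_sketch) and the rounding of y_param / 2 are assumptions made for the example.

```python
import random
import string
from collections import defaultdict


def tokenize(text):
    """Lowercase, drop punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())


def down_sample_sketch(table_a, table_b, size, y_param, seed=None):
    """Return (ids kept from A, ids kept from B); each tuple is one string."""
    rng = random.Random(seed)
    # Step 1: randomly select `size` tuples of B to form B'.
    b_ids = rng.sample(range(len(table_b)), min(size, len(table_b)))

    # Step 2: build an inverted index I mapping token -> ids of tuples in A.
    index = defaultdict(set)
    for a_id, text in enumerate(table_a):
        for token in tokenize(text):
            index[token].add(a_id)

    half = max(1, y_param // 2)  # assumed rounding for k/2
    a_ids = set()
    for b_id in b_ids:
        # Count how many tokens each tuple of A shares with this tuple of B'.
        shared = defaultdict(int)
        for token in tokenize(table_b[b_id]):
            for a_id in index.get(token, ()):
                shared[a_id] += 1
        # P: the tuples of A that share the most tokens with x.
        p = set(sorted(shared, key=lambda i: (-shared[i], i))[:half])
        # Q: tuples drawn at random from A - P.
        rest = [i for i in range(len(table_a)) if i not in p]
        q = rng.sample(rest, min(half, len(rest)))
        a_ids |= p
        a_ids.update(q)
    return sorted(a_ids), sorted(b_ids)


A_rows = ["Dave Smith, Madison", "Ann Lee, Chicago", "Bob Ray, Austin", "Dave Smyth, Madison WI"]
B_rows = ["Dave Smith, Madison WI", "Zed Quill, Boise"]
first = down_sample_sketch(A_rows, B_rows, size=1, y_param=2, seed=0)
second = down_sample_sketch(A_rows, B_rows, size=1, y_param=2, seed=0)
```

Because the random generator is seeded, first and second hold the same sample, which mirrors the behaviour of the seed argument in the examples above.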

Blocking¶

class py_entitymatching.dask.dask_attr_equiv_blocker.DaskAttrEquivalenceBlocker(*args, **kwargs)¶

WARNING THIS BLOCKER IS EXPERIMENTAL AND NOT TESTED. USE AT YOUR OWN RISK.

Blocks based on the equivalence of attribute values.

block_candset(candset, l_block_attr, r_block_attr, allow_missing=False, verbose=False, show_progress=True, n_chunks=1)¶

WARNING THIS COMMAND IS EXPERIMENTAL AND NOT TESTED. USE AT YOUR OWN RISK.

Blocks an input candidate set of tuple pairs based on attribute equivalence. Finds tuple pairs from an input candidate set of tuple pairs such that the value of attribute l_block_attr of the left tuple in a tuple pair exactly matches the value of attribute r_block_attr of the right tuple in the tuple pair.

Parameters

candset (DataFrame) – The input candidate set of tuple pairs.
l_block_attr (string) – The blocking attribute in left table.
r_block_attr (string) – The blocking attribute in right table.
allow_missing (boolean) – A flag to indicate whether tuple pairs with missing value in at least one of the blocking attributes should be included in the output candidate set (defaults to False). If this flag is set to True, a tuple pair with missing value in either blocking attribute will be retained in the output candidate set.
verbose (boolean) – A flag to indicate whether the debug information should be logged (defaults to False).
show_progress (boolean) – A flag to indicate whether progress should be displayed to the user (defaults to True).
n_chunks (int) – The number of partitions to split the candidate set (defaults to 1). If it is set to -1, the number of partitions will be set to the number of cores in the machine.

Returns

A candidate set of tuple pairs that survived blocking (DataFrame).

Raises

AssertionError – If candset is not of type pandas DataFrame.
AssertionError – If l_block_attr is not of type string.
AssertionError – If r_block_attr is not of type string.
AssertionError – If verbose is not of type boolean.
AssertionError – If n_chunks is not of type int.
AssertionError – If l_block_attr is not in the ltable columns.
AssertionError – If r_block_attr is not in the rtable columns.

Examples

>>> import py_entitymatching as em
>>> from py_entitymatching.dask.dask_attr_equiv_blocker import DaskAttrEquivalenceBlocker
>>> ab = DaskAttrEquivalenceBlocker()
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> C = ab.block_tables(A, B, 'zipcode', 'zipcode', l_output_attrs=['name'], r_output_attrs=['name'])
>>> D1 = ab.block_candset(C, 'age', 'age')
# Include all possible tuple pairs with missing values
>>> D2 = ab.block_candset(C, 'age', 'age', allow_missing=True)
# Execute blocking using multiple cores
>>> D3 = ab.block_candset(C, 'age', 'age', n_chunks=-1)
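
To make the effect of allow_missing concrete, here is a small stand-alone sketch of the filtering rule described for block_candset. It works on plain dictionaries rather than DataFrames, and the function names (is_missing, filter_candset) are hypothetical helpers for this illustration, not part of the library.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def filter_candset(pairs, l_block_attr, r_block_attr, allow_missing=False):
    """Keep pairs whose blocking attributes are equal (or missing, if allowed)."""
    survivors = []
    for ltuple, rtuple in pairs:
        l_val, r_val = ltuple.get(l_block_attr), rtuple.get(r_block_attr)
        if is_missing(l_val) or is_missing(r_val):
            # A pair with a missing value survives only when allow_missing is True.
            if allow_missing:
                survivors.append((ltuple, rtuple))
            continue
        if l_val == r_val:
            survivors.append((ltuple, rtuple))
    return survivors


pairs = [
    ({"ID": "a1", "age": 30}, {"ID": "b1", "age": 30}),
    ({"ID": "a2", "age": 41}, {"ID": "b2", "age": 35}),
    ({"ID": "a3", "age": None}, {"ID": "b3", "age": 52}),
]
strict = filter_candset(pairs, "age", "age")
lenient = filter_candset(pairs, "age", "age", allow_missing=True)
```

Here strict keeps only the a1/b1 pair, while lenient additionally retains the a3/b3 pair because one of its age values is missing.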

block_tables(ltable, rtable, l_block_attr, r_block_attr, l_output_attrs=None, r_output_attrs=None, l_output_prefix='ltable_', r_output_prefix='rtable_', allow_missing=False, verbose=False, n_ltable_chunks=1, n_rtable_chunks=1)¶

WARNING THIS COMMAND IS EXPERIMENTAL AND NOT TESTED. USE AT YOUR OWN RISK.

Blocks two tables based on attribute equivalence. Conceptually, this will check l_block_attr=r_block_attr for each tuple pair from the Cartesian product of tables ltable and rtable. It outputs a pandas DataFrame with the tuple pairs that satisfy the equality condition. The DataFrame will include the attribute ‘_id’, the key attribute from ltable, and the key attribute from rtable, followed by the attributes listed in l_output_attrs and r_output_attrs if they are specified. Each of these output and key attributes will be prefixed with the given l_output_prefix or r_output_prefix. If allow_missing is set to True, then all tuple pairs with a missing value in at least one of the tuples will be included in the output DataFrame. Further, this will update the following metadata in the catalog for the output table: (1) key, (2) ltable, (3) rtable, (4) fk_ltable, and (5) fk_rtable.

Parameters

ltable (DataFrame) – The left input table.
rtable (DataFrame) – The right input table.
l_block_attr (string) – The blocking attribute in left table.
r_block_attr (string) – The blocking attribute in right table.
l_output_attrs (list) – A list of attribute names from the left table to be included in the output candidate set (defaults to None).
r_output_attrs (list) – A list of attribute names from the right table to be included in the output candidate set (defaults to None).
l_output_prefix (string) – The prefix to be used for the attribute names coming from the left table in the output candidate set (defaults to ‘ltable_’).
r_output_prefix (string) – The prefix to be used for the attribute names coming from the right table in the output candidate set (defaults to ‘rtable_’).
allow_missing (boolean) – A flag to indicate whether tuple pairs with missing value in at least one of the blocking attributes should be included in the output candidate set (defaults to False). If this flag is set to True, a tuple in ltable with missing value in the blocking attribute will be matched with every tuple in rtable and vice versa.
verbose (boolean) – A flag to indicate whether the debug information should be logged (defaults to False).
n_ltable_chunks (int) – The number of partitions to split the left table (defaults to 1). If it is set to -1, then the number of partitions is set to the number of cores in the machine.
n_rtable_chunks (int) – The number of partitions to split the right table (defaults to 1). If it is set to -1, then the number of partitions is set to the number of cores in the machine.

Returns

A candidate set of tuple pairs that survived blocking (DataFrame).

Raises

AssertionError – If ltable is not of type pandas DataFrame.
AssertionError – If rtable is not of type pandas DataFrame.
AssertionError – If l_block_attr is not of type string.
AssertionError – If r_block_attr is not of type string.
AssertionError – If l_output_attrs is not of type list.
AssertionError – If r_output_attrs is not of type list.
AssertionError – If the values in l_output_attrs are not of type string.
AssertionError – If the values in r_output_attrs are not of type string.
AssertionError – If l_output_prefix is not of type string.
AssertionError – If r_output_prefix is not of type string.
AssertionError – If verbose is not of type boolean.
AssertionError – If allow_missing is not of type boolean.
AssertionError – If n_ltable_chunks is not of type int.
AssertionError – If n_rtable_chunks is not of type int.
AssertionError – If l_block_attr is not in the ltable columns.
AssertionError – If r_block_attr is not in the rtable columns.
AssertionError – If l_output_attrs are not in the ltable.
AssertionError – If r_output_attrs are not in the rtable.

Examples

>>> import py_entitymatching as em
>>> from py_entitymatching.dask.dask_attr_equiv_blocker import DaskAttrEquivalenceBlocker
>>> ab = DaskAttrEquivalenceBlocker()
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> C1 = ab.block_tables(A, B, 'zipcode', 'zipcode', l_output_attrs=['name'], r_output_attrs=['name'])
# Include all possible tuple pairs with missing values
>>> C2 = ab.block_tables(A, B, 'zipcode', 'zipcode', l_output_attrs=['name'], r_output_attrs=['name'], allow_missing=True)
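
The layout of the output table (the ‘_id’ column, the prefixed keys, then the prefixed output attributes) can be illustrated with a hash-join sketch over lists of dictionaries. This is an assumption-laden illustration rather than the library's code: the key name is passed explicitly instead of being read from the catalog, allow_missing is left out, and block_tables_sketch is a hypothetical name.

```python
from collections import defaultdict


def block_tables_sketch(ltable, rtable, key, l_block_attr, r_block_attr,
                        l_output_attrs=None, r_output_attrs=None,
                        l_output_prefix="ltable_", r_output_prefix="rtable_"):
    """Pair up rows whose blocking attributes are equal (missing values never match)."""
    # Group right rows by blocking value to avoid scanning the Cartesian product.
    buckets = defaultdict(list)
    for r_row in rtable:
        if r_row.get(r_block_attr) is not None:
            buckets[r_row[r_block_attr]].append(r_row)

    rows = []
    for l_row in ltable:
        for r_row in buckets.get(l_row.get(l_block_attr), []):
            row = {
                "_id": len(rows),
                l_output_prefix + key: l_row[key],
                r_output_prefix + key: r_row[key],
            }
            for attr in l_output_attrs or []:
                row[l_output_prefix + attr] = l_row[attr]
            for attr in r_output_attrs or []:
                row[r_output_prefix + attr] = r_row[attr]
            rows.append(row)
    return rows


A_rows = [
    {"ID": "a1", "name": "Ann", "zipcode": "53703"},
    {"ID": "a2", "name": "Bob", "zipcode": "60601"},
    {"ID": "a3", "name": "Cy", "zipcode": None},
]
B_rows = [
    {"ID": "b1", "name": "Anne", "zipcode": "53703"},
    {"ID": "b2", "name": "Dee", "zipcode": "53703"},
    {"ID": "b3", "name": "Bobby", "zipcode": "60601"},
]
candset = block_tables_sketch(A_rows, B_rows, "ID", "zipcode", "zipcode",
                              l_output_attrs=["name"], r_output_attrs=["name"])
columns = list(candset[0])
```

Grouping one side by the blocking value gives the same pairs as checking equality over the Cartesian product, but only touches rows that can actually match.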

block_tuples(ltuple, rtuple, l_block_attr, r_block_attr, allow_missing=False)¶

Blocks a tuple pair based on attribute equivalence.

Parameters

ltuple (Series) – The input left tuple.
rtuple (Series) – The input right tuple.
l_block_attr (string) – The blocking attribute in left tuple.
r_block_attr (string) – The blocking attribute in right tuple.
allow_missing (boolean) – A flag to indicate whether a tuple pair with missing value in at least one of the blocking attributes should be blocked (defaults to False). If this flag is set to True, the pair will be kept if either ltuple has missing value in l_block_attr or rtuple has missing value in r_block_attr or both.

Returns

A status indicating if the tuple pair is blocked, i.e., the values of l_block_attr in ltuple and r_block_attr in rtuple are different (boolean).

Examples

>>> import py_entitymatching as em
>>> from py_entitymatching.dask.dask_attr_equiv_blocker import DaskAttrEquivalenceBlocker
>>> ab = DaskAttrEquivalenceBlocker()
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> status = ab.block_tuples(A.iloc[0], B.iloc[0], 'zipcode', 'zipcode')

class py_entitymatching.dask.dask_overlap_blocker.DaskOverlapBlocker¶

block_candset(candset, l_overlap_attr, r_overlap_attr, rem_stop_words=False, q_val=None, word_level=True, overlap_size=1, allow_missing=False, verbose=False, show_progress=True, n_chunks=-1)¶

WARNING THIS COMMAND IS EXPERIMENTAL AND NOT TESTED. USE AT YOUR OWN RISK.

Blocks an input candidate set of tuple pairs based on the overlap of token sets of attribute values. Finds tuple pairs from an input candidate set of tuple pairs such that the overlap between (a) the set of tokens obtained by tokenizing the value of attribute l_overlap_attr of the left tuple in a tuple pair, and (b) the set of tokens obtained by tokenizing the value of attribute r_overlap_attr of the right tuple in the tuple pair, is above a certain threshold.

Parameters

candset (DataFrame) – The input candidate set of tuple pairs.
l_overlap_attr (string) – The overlap attribute in left table.
r_overlap_attr (string) – The overlap attribute in right table.
rem_stop_words (boolean) – A flag to indicate whether stop words (e.g., a, an, the) should be removed from the token sets of the overlap attribute values (defaults to False).
q_val (int) – The value of q to use if the overlap attribute values are to be tokenized as q-grams (defaults to None).
word_level (boolean) – A flag to indicate whether the overlap attributes should be tokenized as words (i.e., using whitespace as delimiter) (defaults to True).
overlap_size (int) – The minimum number of tokens that must overlap (defaults to 1).
allow_missing (boolean) – A flag to indicate whether tuple pairs with missing value in at least one of the blocking attributes should be included in the output candidate set (defaults to False). If this flag is set to True, a tuple pair with missing value in either blocking attribute will be retained in the output candidate set.
verbose (boolean) – A flag to indicate whether the debug information should be logged (defaults to False).
show_progress (boolean) – A flag to indicate whether progress should be displayed to the user (defaults to True).
n_chunks (int) – The number of partitions to split the candidate set (defaults to -1). If it is set to -1, the number of partitions will be set to the number of cores in the machine.

Returns

A candidate set of tuple pairs that survived blocking (DataFrame).

Raises

AssertionError – If candset is not of type pandas DataFrame.
AssertionError – If l_overlap_attr is not of type string.
AssertionError – If r_overlap_attr is not of type string.
AssertionError – If q_val is not of type int.
AssertionError – If word_level is not of type boolean.
AssertionError – If overlap_size is not of type int.
AssertionError – If verbose is not of type boolean.
AssertionError – If allow_missing is not of type boolean.
AssertionError – If show_progress is not of type boolean.
AssertionError – If n_chunks is not of type int.
AssertionError – If l_overlap_attr is not in the ltable columns.
AssertionError – If r_overlap_attr is not in the rtable columns.
SyntaxError – If q_val is set to a valid value and word_level is set to True.
SyntaxError – If q_val is set to None and word_level is set to False.

Examples

>>> import py_entitymatching as em
>>> from py_entitymatching.dask.dask_overlap_blocker import DaskOverlapBlocker
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> ob = DaskOverlapBlocker()
>>> C = ob.block_tables(A, B, 'address', 'address', l_output_attrs=['name'], r_output_attrs=['name'])

>>> D1 = ob.block_candset(C, 'name', 'name')
# Include all possible tuple pairs with missing values
>>> D2 = ob.block_candset(C, 'name', 'name', allow_missing=True)
# Execute blocking using multiple cores
>>> D3 = ob.block_candset(C, 'name', 'name', n_chunks=-1)
# Use q-gram tokenizer
>>> D4 = ob.block_candset(C, 'name', 'name', word_level=False, q_val=2)
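
The interplay of word_level, q_val, and overlap_size can be sketched as follows. This is a simplified illustration under stated assumptions: q-grams are unpadded, no stop-word removal or case folding is applied, and the helper names (qgrams, tokens, survives) are hypothetical. The library's tokenizers may differ in these details.

```python
def qgrams(text, q):
    """Unpadded character q-grams of `text` (padding is omitted in this sketch)."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def tokens(value, word_level=True, q_val=None):
    """Tokenize as words or as q-grams, rejecting contradictory settings."""
    if word_level and q_val is not None:
        raise SyntaxError("q_val is set, so word_level must be False")
    if not word_level and q_val is None:
        raise SyntaxError("word_level is False, so q_val must be set")
    return set(value.split()) if word_level else qgrams(value, q_val)


def survives(l_value, r_value, overlap_size=1, word_level=True, q_val=None):
    """A pair survives when at least `overlap_size` tokens are shared."""
    l_tokens = tokens(l_value, word_level, q_val)
    r_tokens = tokens(r_value, word_level, q_val)
    return len(l_tokens & r_tokens) >= overlap_size


word_match = survives("12 main st", "45 main ave")
qgram_match = survives("smith", "smyth", word_level=False, q_val=2)
qgram_strict = survives("smith", "smyth", overlap_size=3, word_level=False, q_val=2)

try:
    tokens("12 main st", word_level=True, q_val=2)
    conflict_rejected = False
except SyntaxError:
    conflict_rejected = True
```

The two addresses share the word "main", so the word-level check passes. "smith" and "smyth" share the 2-grams "sm" and "th", which is enough for overlap_size=1 but not for overlap_size=3.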

block_tables(ltable, rtable, l_overlap_attr, r_overlap_attr, rem_stop_words=False, q_val=None, word_level=True, overlap_size=1, l_output_attrs=None, r_output_attrs=None, l_output_prefix='ltable_', r_output_prefix='rtable_', allow_missing=False, verbose=False, show_progress=True, n_ltable_chunks=1, n_rtable_chunks=1)¶

WARNING THIS COMMAND IS EXPERIMENTAL AND NOT TESTED. USE AT YOUR OWN RISK.

Blocks two tables based on the overlap of token sets of attribute values. Finds tuple pairs from left and right tables such that the overlap between (a) the set of tokens obtained by tokenizing the value of attribute l_overlap_attr of a tuple from the left table, and (b) the set of tokens obtained by tokenizing the value of attribute r_overlap_attr of a tuple from the right table, is above a certain threshold.

Parameters

ltable (DataFrame) – The left input table.
rtable (DataFrame) – The right input table.
l_overlap_attr (string) – The overlap attribute in left table.
r_overlap_attr (string) – The overlap attribute in right table.
rem_stop_words (boolean) – A flag to indicate whether stop words (e.g., a, an, the) should be removed from the token sets of the overlap attribute values (defaults to False).
q_val (int) – The value of q to use if the overlap attribute values are to be tokenized as q-grams (defaults to None).
word_level (boolean) – A flag to indicate whether the overlap attributes should be tokenized as words (i.e., using whitespace as delimiter) (defaults to True).
overlap_size (int) – The minimum number of tokens that must overlap (defaults to 1).
l_output_attrs (list) – A list of attribute names from the left table to be included in the output candidate set (defaults to None).
r_output_attrs (list) – A list of attribute names from the right table to be included in the output candidate set (defaults to None).
l_output_prefix (string) – The prefix to be used for the attribute names coming from the left table in the output candidate set (defaults to ‘ltable_’).
r_output_prefix (string) – The prefix to be used for the attribute names coming from the right table in the output candidate set (defaults to ‘rtable_’).
allow_missing (boolean) – A flag to indicate whether tuple pairs with missing value in at least one of the blocking attributes should be included in the output candidate set (defaults to False). If this flag is set to True, a tuple in ltable with missing value in the blocking attribute will be matched with every tuple in rtable and vice versa.
verbose (boolean) – A flag to indicate whether the debug information should be logged (defaults to False).
show_progress (boolean) – A flag to indicate whether progress should be displayed to the user (defaults to True).
n_ltable_chunks (int) – The number of partitions to split the left table (defaults to 1). If it is set to -1, then the number of partitions is set to the number of cores in the machine.
n_rtable_chunks (int) – The number of partitions to split the right table (defaults to 1). If it is set to -1, then the number of partitions is set to the number of cores in the machine.

Returns

A candidate set of tuple pairs that survived blocking (DataFrame).

Raises

AssertionError – If ltable is not of type pandas DataFrame.
AssertionError – If rtable is not of type pandas DataFrame.
AssertionError – If l_overlap_attr is not of type string.
AssertionError – If r_overlap_attr is not of type string.
AssertionError – If l_output_attrs is not of type list.
AssertionError – If r_output_attrs is not of type list.
AssertionError – If the values in l_output_attrs are not of type string.
AssertionError – If the values in r_output_attrs are not of type string.
AssertionError – If l_output_prefix is not of type string.
AssertionError – If r_output_prefix is not of type string.
AssertionError – If q_val is not of type int.
AssertionError – If word_level is not of type boolean.
AssertionError – If overlap_size is not of type int.
AssertionError – If verbose is not of type boolean.
AssertionError – If allow_missing is not of type boolean.
AssertionError – If show_progress is not of type boolean.
AssertionError – If n_ltable_chunks is not of type int.
AssertionError – If n_rtable_chunks is not of type int.
AssertionError – If l_overlap_attr is not in the ltable columns.
AssertionError – If r_overlap_attr is not in the rtable columns.
AssertionError – If l_output_attrs are not in the ltable.
AssertionError – If r_output_attrs are not in the rtable.
SyntaxError – If q_val is set to a valid value and word_level is set to True.
SyntaxError – If q_val is set to None and word_level is set to False.

Examples

>>> import py_entitymatching as em
>>> from py_entitymatching.dask.dask_overlap_blocker import DaskOverlapBlocker
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> ob = DaskOverlapBlocker()
# Use all cores and the word-level tokenizer
>>> C1 = ob.block_tables(A, B, 'address', 'address', l_output_attrs=['name'], r_output_attrs=['name'], word_level=True, overlap_size=1, n_ltable_chunks=-1, n_rtable_chunks=-1)
# Use the q-gram tokenizer
>>> C2 = ob.block_tables(A, B, 'address', 'address', l_output_attrs=['name'], r_output_attrs=['name'], word_level=False, q_val=2, n_ltable_chunks=-1, n_rtable_chunks=-1)
# Include all possible tuple pairs with missing values
>>> C3 = ob.block_tables(A, B, 'address', 'address', l_output_attrs=['name'], r_output_attrs=['name'], allow_missing=True, n_ltable_chunks=-1, n_rtable_chunks=-1)
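
The chunk parameters used above (n_ltable_chunks, n_rtable_chunks, with -1 meaning "number of cores") can be pictured with the sketch below. It only illustrates the partitioning idea: the helper names, the cap on the number of chunks, and the notion that every (left chunk, right chunk) combination is an independent unit of work are assumptions, not a description of how the library schedules its Dask tasks.

```python
import itertools
import os


def resolve_n_chunks(n_chunks, n_rows):
    """Map -1 to the machine's core count; never use more chunks than rows (assumed cap)."""
    if n_chunks == -1:
        n_chunks = os.cpu_count() or 1
    return max(1, min(n_chunks, n_rows))


def split_rows(rows, n_chunks):
    """Split `rows` into contiguous chunks whose sizes differ by at most one."""
    n = resolve_n_chunks(n_chunks, len(rows))
    size, extra = divmod(len(rows), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


left_chunks = split_rows(list(range(10)), 3)
right_chunks = split_rows(list(range(4)), 2)
# Each combination of a left chunk and a right chunk could be blocked on its own.
work_units = list(itertools.product(left_chunks, right_chunks))
all_cores = split_rows(list(range(10)), -1)
```

With 3 left chunks and 2 right chunks there are 6 work units, so raising either chunk count increases the amount of work that can run in parallel.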

block_tuples(ltuple, rtuple, l_overlap_attr, r_overlap_attr, rem_stop_words=False, q_val=None, word_level=True, overlap_size=1, allow_missing=False)¶

Blocks a tuple pair based on the overlap of token sets of attribute values.

Parameters

ltuple (Series) – The input left tuple.
rtuple (Series) – The input right tuple.
l_overlap_attr (string) – The overlap attribute in left tuple.
r_overlap_attr (string) – The overlap attribute in right tuple.
rem_stop_words (boolean) – A flag to indicate whether stop words (e.g., a, an, the) should be removed from the token sets of the overlap attribute values (defaults to False).
q_val (int) – The value of q to use if the overlap attribute values are to be tokenized as q-grams (defaults to None).
word_level (boolean) – A flag to indicate whether the overlap attributes should be tokenized as words (i.e., using whitespace as delimiter) (defaults to True).
overlap_size (int) – The minimum number of tokens that must overlap (defaults to 1).
allow_missing (boolean) – A flag to indicate whether a tuple pair with missing value in at least one of the blocking attributes should be blocked (defaults to False). If this flag is set to True, the pair will be kept if either ltuple has missing value in l_overlap_attr or rtuple has missing value in r_overlap_attr or both.

Returns

A status indicating if the tuple pair is blocked (boolean).

Examples

>>> import py_entitymatching as em
>>> from py_entitymatching.dask.dask_overlap_blocker import DaskOverlapBlocker
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> ob = DaskOverlapBlocker()
>>> status = ob.block_tuples(A.iloc[0], B.iloc[0], 'address', 'address')

class py_entitymatching.dask.dask_rule_based_blocker.DaskRuleBasedBlocker(*args, **kwargs)¶

WARNING THIS BLOCKER IS EXPERIMENTAL AND NOT TESTED. USE AT YOUR OWN RISK.

Blocks based on a sequence of blocking rules supplied by the user.

add_rule(conjunct_list, feature_table=None, rule_name=None)¶

Adds a rule to the rule-based blocker.

Parameters

conjunct_list (list) – A list of conjuncts specifying the rule.
feature_table (DataFrame) – A DataFrame containing all the features that are being referenced by the rule (defaults to None). If the feature_table is not supplied here, then it must have been specified during the creation of the rule-based blocker or using set_feature_table function. Otherwise an AssertionError will be raised and the rule will not be added to the rule-based blocker.
rule_name (string) – A string specifying the name of the rule to be added (defaults to None). If the rule_name is not specified then a name will be automatically chosen. If there is already a rule with the specified rule_name, then an AssertionError will be raised and the rule will not be added to the rule-based blocker.

Returns

The name of the rule added (string).

Raises

AssertionError – If rule_name already exists.
AssertionError – If feature_table is not a valid value parameter.

Examples

>>> import py_entitymatching as em
>>> from py_entitymatching.dask.dask_rule_based_blocker import DaskRuleBasedBlocker
>>> rb = DaskRuleBasedBlocker()
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='id')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='id')
>>> block_f = em.get_features_for_blocking(A, B)
>>> rule = ['name_name_lev(ltuple, rtuple) > 3']
>>> rb.add_rule(rule, feature_table=block_f, rule_name='rule1')
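
The naming behaviour of add_rule (an automatically chosen name when rule_name is omitted, an AssertionError on a duplicate name) can be sketched with a toy registry. The class name, the "_rule_<n>" naming scheme, and the second feature name are hypothetical. The library may pick names differently.

```python
class RuleRegistrySketch:
    """Toy stand-in that only tracks rule names and their conjunct lists."""

    def __init__(self):
        self.rules = {}

    def add_rule(self, conjunct_list, rule_name=None):
        if rule_name is None:
            # Hypothetical naming scheme: the first unused "_rule_<n>".
            n = 0
            while f"_rule_{n}" in self.rules:
                n += 1
            rule_name = f"_rule_{n}"
        assert rule_name not in self.rules, f"a rule named {rule_name!r} already exists"
        self.rules[rule_name] = list(conjunct_list)
        return rule_name

    def get_rule_names(self):
        return list(self.rules)


registry = RuleRegistrySketch()
named = registry.add_rule(["name_name_lev(ltuple, rtuple) > 3"], rule_name="rule1")
auto = registry.add_rule(["zipcode_zipcode_exm(ltuple, rtuple) == 0"])  # hypothetical feature

try:
    registry.add_rule(["name_name_lev(ltuple, rtuple) > 5"], rule_name="rule1")
    duplicate_rejected = False
except AssertionError:
    duplicate_rejected = True
```

Returning the chosen name from add_rule lets the caller refer to an automatically named rule later on.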

block_candset(candset, verbose=False, show_progress=True, n_chunks=1)¶

WARNING THIS COMMAND IS EXPERIMENTAL AND NOT TESTED. USE AT YOUR OWN RISK.

Blocks an input candidate set of tuple pairs based on a sequence of blocking rules supplied by the user. Finds tuple pairs from an input candidate set of tuple pairs that survive the sequence of blocking rules. A tuple pair survives the sequence of blocking rules if none of the rules in the sequence returns True for that pair. If any of the rules returns True, then the pair is blocked (dropped).

Parameters

candset (DataFrame) – The input candidate set of tuple pairs.
verbose (boolean) – A flag to indicate whether the debug information should be logged (defaults to False).
show_progress (boolean) – A flag to indicate whether progress should be displayed to the user (defaults to True).
n_chunks (int) – The number of partitions to split the candidate set (defaults to 1). If it is set to -1, the number of partitions will be set to the number of cores in the machine.

Returns

A candidate set of tuple pairs that survived blocking (DataFrame).

Raises

AssertionError – If candset is not of type pandas DataFrame.
AssertionError – If verbose is not of type boolean.
AssertionError – If n_chunks is not of type int.
AssertionError – If show_progress is not of type boolean.
AssertionError – If l_block_attr is not in the ltable columns.
AssertionError – If r_block_attr is not in the rtable columns.
AssertionError – If there are no rules to apply.

Examples

>>> import py_entitymatching as em
>>> from py_entitymatching.dask.dask_rule_based_blocker import DaskRuleBasedBlocker
>>> rb = DaskRuleBasedBlocker()
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='id')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='id')
>>> block_f = em.get_features_for_blocking(A, B)
>>> rule = ['name_name_lev(ltuple, rtuple) > 3']
>>> rb.add_rule(rule, feature_table=block_f)
>>> D = rb.block_candset(C) # C is a candidate set produced by an earlier blocking step.
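
The survival semantics described for the rule sequence can be sketched as follows. This is an illustration only: rules are modelled as lists of Python predicates instead of feature expressions, a rule is assumed to return True only when all of its conjuncts hold, and the names and sample rules are hypothetical.

```python
def rule_fires(conjuncts, ltuple, rtuple):
    """A rule returns True only when every one of its conjuncts holds."""
    return all(conjunct(ltuple, rtuple) for conjunct in conjuncts)


def block_candset_sketch(rules, pairs):
    """Keep the pairs for which no rule in the sequence returns True."""
    assert rules, "there are no rules to apply"
    survivors = []
    for ltuple, rtuple in pairs:
        blocked = any(rule_fires(rule, ltuple, rtuple) for rule in rules)
        if not blocked:
            survivors.append((ltuple, rtuple))
    return survivors


# Hypothetical rules: one with a single conjunct, one with two conjuncts.
age_gap = [lambda l, r: abs(l["age"] - r["age"]) > 10]
city_and_zip_differ = [
    lambda l, r: l["city"] != r["city"],
    lambda l, r: l["zip"] != r["zip"],
]

left = {"age": 30, "city": "Madison", "zip": "53703"}
pairs = [
    (dict(left, pid=0), {"age": 31, "city": "Madison", "zip": "53703"}),
    (dict(left, pid=1), {"age": 55, "city": "Madison", "zip": "53703"}),
    (dict(left, pid=2), {"age": 30, "city": "Austin", "zip": "53703"}),
    (dict(left, pid=3), {"age": 30, "city": "Austin", "zip": "73301"}),
]
kept = block_candset_sketch([age_gap, city_and_zip_differ], pairs)
kept_ids = [l["pid"] for l, _ in kept]
```

Pair 1 is dropped by the age rule and pair 3 by the two-conjunct rule. Pair 2 survives because only one of that rule's two conjuncts is true.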

block_tables(ltable, rtable, l_output_attrs=None, r_output_attrs=None, l_output_prefix='ltable_', r_output_prefix='rtable_', verbose=False, show_progress=True, n_ltable_chunks=1, n_rtable_chunks=1)¶

WARNING THIS COMMAND IS EXPERIMENTAL AND NOT TESTED. USE AT YOUR OWN RISK.

Blocks two tables based on the sequence of rules supplied by the user. Finds tuple pairs from left and right tables that survive the sequence of blocking rules. A tuple pair survives the sequence of blocking rules if none of the rules in the sequence returns True for that pair. If any of the rules returns True, then the pair is blocked.

Parameters

ltable (DataFrame) – The left input table.
rtable (DataFrame) – The right input table.
l_output_attrs (list) – A list of attribute names from the left table to be included in the output candidate set (defaults to None).
r_output_attrs (list) – A list of attribute names from the right table to be included in the output candidate set (defaults to None).
l_output_prefix (string) – The prefix to be used for the attribute names coming from the left table in the output candidate set (defaults to ‘ltable_’).
r_output_prefix (string) – The prefix to be used for the attribute names coming from the right table in the output candidate set (defaults to ‘rtable_’).
verbose (boolean) – A flag to indicate whether the debug information should be logged (defaults to False).
show_progress (boolean) – A flag to indicate whether progress should be displayed to the user (defaults to True).
n_ltable_chunks (int) – The number of partitions to split the left table (defaults to 1). If it is set to -1, then the number of partitions is set to the number of cores in the machine.
n_rtable_chunks (int) – The number of partitions to split the right table (defaults to 1). If it is set to -1, then the number of partitions is set to the number of cores in the machine.

Returns

A candidate set of tuple pairs that survived the sequence of blocking rules (DataFrame).

Raises

AssertionError – If ltable is not of type pandas DataFrame.
AssertionError – If rtable is not of type pandas DataFrame.
AssertionError – If l_output_attrs is not of type list.
AssertionError – If r_output_attrs is not of type list.
AssertionError – If the values in l_output_attrs are not of type string.
AssertionError – If the values in r_output_attrs are not of type string.
AssertionError – If the input l_output_prefix is not of type string.
AssertionError – If the input r_output_prefix is not of type string.
AssertionError – If verbose is not of type boolean.
AssertionError – If show_progress is not of type boolean.
AssertionError – If n_ltable_chunks is not of type int.
AssertionError – If n_rtable_chunks is not of type int.
AssertionError – If l_output_attrs are not in the ltable.
AssertionError – If r_output_attrs are not in the rtable.
AssertionError – If there are no rules to apply.

Examples

>>> import py_entitymatching as em
>>> from py_entitymatching.dask.dask_rule_based_blocker import DaskRuleBasedBlocker
>>> rb = DaskRuleBasedBlocker()
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='id')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='id')
>>> block_f = em.get_features_for_blocking(A, B)
>>> rule = ['name_name_lev(ltuple, rtuple) > 3']
>>> rb.add_rule(rule, feature_table=block_f)
>>> C = rb.block_tables(A, B)

block_tuples(ltuple, rtuple)¶

Blocks a tuple pair based on a sequence of blocking rules supplied by the user.

Parameters

ltuple (Series) – The input left tuple.
rtuple (Series) – The input right tuple.

Returns

A status indicating if the tuple pair is blocked by applying the sequence of blocking rules (boolean).

Examples

>>> import py_entitymatching as em
>>> from py_entitymatching.dask.dask_rule_based_blocker import DaskRuleBasedBlocker
>>> rb = DaskRuleBasedBlocker()
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='id')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='id')
>>> block_f = em.get_features_for_blocking(A, B)
>>> rule = ['name_name_lev(ltuple, rtuple) > 3']
>>> rb.add_rule(rule, feature_table=block_f)
>>> D = rb.block_tuples(A.iloc[0], B.iloc[1])

delete_rule(rule_name)¶

Deletes a rule from the rule-based blocker.

Parameters: rule_name (string) – Name of the rule to be deleted.

Examples

>>> import py_entitymatching as em
>>> from py_entitymatching.dask.dask_rule_based_blocker import DaskRuleBasedBlocker
>>> rb = DaskRuleBasedBlocker()
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='id')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='id')
>>> block_f = em.get_features_for_blocking(A, B)
>>> rule = ['name_name_lev(ltuple, rtuple) > 3']
>>> rb.add_rule(rule, block_f, rule_name='rule_1')
>>> rb.delete_rule('rule_1')

get_rule(rule_name)¶

Returns the function corresponding to a rule.

Parameters: rule_name (string) – Name of the rule.
Returns: A function object corresponding to the specified rule.

Examples

>>> import py_entitymatching as em
>>> from py_entitymatching.dask.dask_rule_based_blocker import DaskRuleBasedBlocker
>>> rb = DaskRuleBasedBlocker()
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='id')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='id')
>>> block_f = em.get_features_for_blocking(A, B)
>>> rule = ['name_name_lev(ltuple, rtuple) > 3']
>>> rb.add_rule(rule, feature_table=block_f, rule_name='rule_1')
>>> rb.get_rule('rule_1')

get_rule_names()¶

Returns the names of all the rules in the rule-based blocker.

Returns: A list of names of all the rules in the rule-based blocker (list).

Examples

>>> import py_entitymatching as em
>>> from py_entitymatching.dask.dask_rule_based_blocker import DaskRuleBasedBlocker
>>> rb = DaskRuleBasedBlocker()
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='id')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='id')
>>> block_f = em.get_features_for_blocking(A, B)
>>> rule = ['name_name_lev(ltuple, rtuple) > 3']
>>> rb.add_rule(rule, block_f, rule_name='rule_1')
>>> rb.get_rule_names()

set_feature_table(feature_table)¶

Sets feature table for the rule-based blocker.

Parameters: feature_table (DataFrame) – A DataFrame containing features.

Examples

>>> import py_entitymatching as em
>>> from py_entitymatching.dask.dask_rule_based_blocker import DaskRuleBasedBlocker
>>> rb = DaskRuleBasedBlocker()
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='id')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='id')
>>> block_f = em.get_features_for_blocking(A, B)
>>> rb.set_feature_table(block_f)

view_rule(rule_name)¶

Prints the source code of the function corresponding to a rule.

Parameters: rule_name (string) – Name of the rule to be viewed.

Examples

>>> import py_entitymatching as em
>>> from py_entitymatching.dask.dask_rule_based_blocker import DaskRuleBasedBlocker
>>> rb = DaskRuleBasedBlocker()
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='id')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='id')
>>> block_f = em.get_features_for_blocking(A, B)
>>> rule = ['name_name_lev(ltuple, rtuple) > 3']
>>> rb.add_rule(rule, block_f, rule_name='rule_1')
>>> rb.view_rule('rule_1')

class py_entitymatching.dask.dask_black_box_blocker.DaskBlackBoxBlocker(*args, **kwargs)¶

WARNING THIS BLOCKER IS EXPERIMENTAL AND NOT TESTED. USE AT YOUR OWN RISK.

Blocks based on a black box function specified by the user.

block_candset(candset, verbose=True, show_progress=True, n_chunks=1)¶

WARNING THIS COMMAND IS EXPERIMENTAL AND NOT TESTED. USE AT YOUR OWN RISK.

Blocks an input candidate set of tuple pairs based on a black box blocking function specified by the user.

Finds tuple pairs from an input candidate set of tuple pairs that survive the black box function. A tuple pair survives the black box blocking function if the function returns False for that pair, otherwise the tuple pair is dropped.

Parameters

candset (DataFrame) – The input candidate set of tuple pairs.
verbose (boolean) – A flag to indicate whether logging should be done (defaults to True).
show_progress (boolean) – A flag to indicate whether progress should be displayed to the user (defaults to True).
n_chunks (int) – The number of partitions to split the candidate set into (defaults to 1). If it is set to -1, the number of partitions will be set to the number of cores in the machine.
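
The n_chunks convention can be illustrated with the standard library alone. This is a sketch of the general idea, not the library's code: `resolve_n_chunks` and `split_into_chunks` are hypothetical helpers, a plain list stands in for the candidate set, and `os.cpu_count()` is an assumed stand-in for however the library counts cores.

```python
import os


def resolve_n_chunks(n_chunks):
    """Map -1 to the number of cores; return any other value unchanged."""
    if n_chunks == -1:
        return os.cpu_count() or 1   # cpu_count() may return None
    return n_chunks


def split_into_chunks(rows, n_chunks):
    """Split rows into at most n_chunks contiguous, near-equal partitions."""
    n = max(1, min(resolve_n_chunks(n_chunks), len(rows)))
    size, extra = divmod(len(rows), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


pairs = list(range(10))              # stand-in for candidate-set rows
print(split_into_chunks(pairs, 3))   # three partitions covering all rows
print(len(split_into_chunks(pairs, -1)))  # depends on the machine
```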

Returns

A candidate set of tuple pairs that survived blocking (DataFrame).

Raises

AssertionError – If candset is not of type pandas DataFrame.
AssertionError – If verbose is not of type boolean.
AssertionError – If n_chunks is not of type int.
AssertionError – If show_progress is not of type boolean.
AssertionError – If l_block_attr is not in the ltable columns.
AssertionError – If r_block_attr is not in the rtable columns.

Examples

>>> def match_last_name(ltuple, rtuple):
...     # assume that there is a 'name' attribute in the input tables
...     # and each value in it has two words
...     l_last_name = ltuple['name'].split()[1]
...     r_last_name = rtuple['name'].split()[1]
...     if l_last_name != r_last_name:
...         return True
...     else:
...         return False
>>> import py_entitymatching as em
>>> from py_entitymatching.dask.dask_black_box_blocker import DaskBlackBoxBlocker
>>> bb = DaskBlackBoxBlocker()
>>> bb.set_black_box_function(match_last_name)
>>> D = bb.block_candset(C) # C is an output from block_tables
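
The survival rule (a pair is kept only when the black box function returns False) can be sketched without the library. The code below is an illustration under stated assumptions: a list of dict pairs stands in for the candidate set DataFrame, and this `block_candset` is a hypothetical free function, not the method documented above.

```python
def match_last_name(ltuple, rtuple):
    # Returns True when the pair should be dropped (last names differ).
    return ltuple['name'].split()[1] != rtuple['name'].split()[1]


def block_candset(candset, black_box_function):
    """Keep only the pairs for which the black box function returns False."""
    return [(l, r) for l, r in candset if not black_box_function(l, r)]


candset = [
    ({'id': 'a1', 'name': 'Kevin Smith'}, {'id': 'b1', 'name': 'Mark Smith'}),
    ({'id': 'a2', 'name': 'Ann Lee'}, {'id': 'b2', 'name': 'Ann Levy'}),
]
survivors = block_candset(candset, match_last_name)
print([(l['id'], r['id']) for l, r in survivors])
```

Only the first pair survives, because its last names agree and the function therefore returns False for it.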

block_tables(ltable, rtable, l_output_attrs=None, r_output_attrs=None, l_output_prefix='ltable_', r_output_prefix='rtable_', verbose=False, show_progress=True, n_ltable_chunks=1, n_rtable_chunks=1)¶

WARNING THIS COMMAND IS EXPERIMENTAL AND NOT TESTED. USE AT YOUR OWN RISK.

Blocks two tables based on a black box blocking function specified by the user. Finds tuple pairs from left and right tables that survive the black box function. A tuple pair survives the black box blocking function if the function returns False for that pair, otherwise the tuple pair is dropped.

Parameters

ltable (DataFrame) – The left input table.
rtable (DataFrame) – The right input table.
l_output_attrs (list) – A list of attribute names from the left table to be included in the output candidate set (defaults to None).
r_output_attrs (list) – A list of attribute names from the right table to be included in the output candidate set (defaults to None).
l_output_prefix (string) – The prefix to be used for the attribute names coming from the left table in the output candidate set (defaults to ‘ltable_’).
r_output_prefix (string) – The prefix to be used for the attribute names coming from the right table in the output candidate set (defaults to ‘rtable_’).
verbose (boolean) – A flag to indicate whether the debug information should be logged (defaults to False).
show_progress (boolean) – A flag to indicate whether progress should be displayed to the user (defaults to True).
n_ltable_chunks (int) – The number of partitions to split the left table (defaults to 1). If it is set to -1, then the number of partitions is set to the number of cores in the machine.
n_rtable_chunks (int) – The number of partitions to split the right table (defaults to 1). If it is set to -1, then the number of partitions is set to the number of cores in the machine.

Returns

A candidate set of tuple pairs that survived blocking (DataFrame).

Raises

AssertionError – If ltable is not of type pandas DataFrame.
AssertionError – If rtable is not of type pandas DataFrame.
AssertionError – If l_output_attrs is not of type list.
AssertionError – If r_output_attrs is not of type list.
AssertionError – If the values in l_output_attrs are not of type string.
AssertionError – If the values in r_output_attrs are not of type string.
AssertionError – If l_output_prefix is not of type string.
AssertionError – If r_output_prefix is not of type string.
AssertionError – If verbose is not of type boolean.
AssertionError – If show_progress is not of type boolean.
AssertionError – If n_ltable_chunks is not of type int.
AssertionError – If n_rtable_chunks is not of type int.
AssertionError – If l_output_attrs are not in the ltable.
AssertionError – If r_output_attrs are not in the rtable.

Examples

>>> def match_last_name(ltuple, rtuple):
...     # assume that there is a 'name' attribute in the input tables
...     # and each value in it has two words
...     l_last_name = ltuple['name'].split()[1]
...     r_last_name = rtuple['name'].split()[1]
...     if l_last_name != r_last_name:
...         return True
...     else:
...         return False
>>> import py_entitymatching as em
>>> from py_entitymatching.dask.dask_black_box_blocker import DaskBlackBoxBlocker
>>> bb = DaskBlackBoxBlocker()
>>> bb.set_black_box_function(match_last_name)
>>> C = bb.block_tables(A, B, l_output_attrs=['name'], r_output_attrs=['name']) # A, B are input tables.
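
To show how the output attributes and prefixes shape the candidate set, here is a minimal pure-Python sketch. It is not the library's implementation: lists of dicts stand in for the input tables, the `key` parameter and the inclusion of prefixed key columns in each output row are assumptions made for illustration, and every pair is enumerated sequentially rather than in partitions.

```python
from itertools import product


def last_names_differ(ltuple, rtuple):
    # Black box function: True means "drop this pair".
    return ltuple['name'].split()[1] != rtuple['name'].split()[1]


def block_tables(ltable, rtable, black_box_function,
                 l_output_attrs=None, r_output_attrs=None,
                 l_output_prefix='ltable_', r_output_prefix='rtable_',
                 key='id'):
    """Enumerate all pairs and keep those the black box function does not drop."""
    out = []
    for l, r in product(ltable, rtable):
        if black_box_function(l, r):
            continue
        # Assumption: each output row carries the prefixed keys of both tuples.
        row = {l_output_prefix + key: l[key], r_output_prefix + key: r[key]}
        for attr in l_output_attrs or []:
            row[l_output_prefix + attr] = l[attr]
        for attr in r_output_attrs or []:
            row[r_output_prefix + attr] = r[attr]
        out.append(row)
    return out


A = [{'id': 'a1', 'name': 'Kevin Smith'}, {'id': 'a2', 'name': 'Ann Lee'}]
B = [{'id': 'b1', 'name': 'Mark Smith'}, {'id': 'b2', 'name': 'Bo Lee'}]
C = block_tables(A, B, last_names_differ,
                 l_output_attrs=['name'], r_output_attrs=['name'])
for row in C:
    print(row)
```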

block_tuples(ltuple, rtuple)¶

Blocks a tuple pair based on a black box blocking function specified by the user.

Takes a tuple pair as input, applies the black box blocking function to it, and returns True (if the intention is to drop the pair) or False (if the intention is to keep the tuple pair).

Parameters

ltuple (Series) – The input left tuple.
rtuple (Series) – The input right tuple.

Returns

A status indicating if the tuple pair should be dropped or kept, based on the black box blocking function (boolean).

Examples

>>> def match_last_name(ltuple, rtuple):
...     # assume that there is a 'name' attribute in the input tables
...     # and each value in it has two words
...     l_last_name = ltuple['name'].split()[1]
...     r_last_name = rtuple['name'].split()[1]
...     if l_last_name != r_last_name:
...         return True
...     else:
...         return False
>>> from py_entitymatching.dask.dask_black_box_blocker import DaskBlackBoxBlocker
>>> bb = DaskBlackBoxBlocker()
>>> bb.set_black_box_function(match_last_name)
>>> status = bb.block_tuples(A.iloc[0], B.iloc[0]) # A, B are input tables.
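
The example function assumes that every name has exactly two words; `split()[1]` raises an IndexError on a single-word name. A more defensive variant might look like the sketch below. The function name, the choice to compare the final token case-insensitively, and the policy of keeping pairs that cannot be judged are all illustrative assumptions, not library behavior.

```python
def _name_tokens(tup):
    """Return the words of the 'name' attribute, or [] if it is missing or not a string."""
    name = tup.get('name')
    return name.split() if isinstance(name, str) else []


def last_names_differ(ltuple, rtuple):
    """Return True (drop the pair) only when both last tokens exist and differ."""
    l_tokens = _name_tokens(ltuple)
    r_tokens = _name_tokens(rtuple)
    if not l_tokens or not r_tokens:
        return False                 # keep pairs that cannot be judged
    return l_tokens[-1].lower() != r_tokens[-1].lower()


print(last_names_differ({'name': 'Ann Lee'}, {'name': 'Ann Levy'}))
print(last_names_differ({'name': 'Cher'}, {'name': None}))
```

Both dicts and pandas Series provide a `get` method, so a function written this way accepts either as a tuple.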

set_black_box_function(function)¶

Sets black box function to be used for blocking.

Parameters: function (function) – The black box function to be used for blocking.

Extracting Feature Vectors¶

py_entitymatching.dask.dask_extract_features.dask_extract_feature_vecs(candset, attrs_before=None, feature_table=None, attrs_after=None, verbose=False, show_progress=True, n_chunks=1)¶

WARNING THIS COMMAND IS EXPERIMENTAL AND NOT TESTED. USE AT YOUR OWN RISK.

This function extracts feature vectors from a DataFrame (typically a labeled candidate set).

Specifically, this function uses the feature table, together with the ltable and rtable (that are present in the candset’s metadata), to extract feature vectors.

Parameters

candset (DataFrame) – The input candidate set for which the feature vectors should be extracted.
attrs_before (list) – The list of attributes from the input candset that should be added before the feature vectors (defaults to None).
feature_table (DataFrame) – A DataFrame containing a list of features that should be used to compute the feature vectors (defaults to None).
attrs_after (list) – The list of attributes from the input candset that should be added after the feature vectors (defaults to None).
verbose (boolean) – A flag to indicate whether the debug information should be displayed (defaults to False).
show_progress (boolean) – A flag to indicate whether the progress of extracting feature vectors must be displayed (defaults to True).
n_chunks (int) – The number of partitions to split the candidate set into (defaults to 1). If it is set to -1, the number of partitions will be set to the number of cores in the machine.

Returns

A pandas DataFrame containing feature vectors.

The DataFrame will have metadata ltable and rtable, pointing to the same ltable and rtable as the input candset.

Also, the output DataFrame will have three columns: key, foreign key ltable, foreign key rtable copied from input candset to the output DataFrame. These three columns precede the columns mentioned in attrs_before.

Raises

AssertionError – If candset is not of type pandas DataFrame.
AssertionError – If attrs_before has attributes that are not present in the input candset.
AssertionError – If attrs_after has attributes that are not present in the input candset.
AssertionError – If feature_table is set to None.
AssertionError – If n_chunks is not of type int.

Examples

>>> import py_entitymatching as em
>>> from py_entitymatching.dask.dask_extract_features import dask_extract_feature_vecs
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> match_f = em.get_features_for_matching(A, B)
>>> # G is the labeled dataframe which should be converted into feature vectors
>>> H = dask_extract_feature_vecs(G, feature_table=match_f, attrs_before=['title'], attrs_after=['gold_labels'])
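
The column layout of the returned DataFrame, as described under Returns, can be sketched as follows. `output_columns` is a hypothetical helper, and the key, foreign-key, and feature names used in the call are made up for illustration.

```python
def output_columns(key, fk_ltable, fk_rtable, feature_names,
                   attrs_before=None, attrs_after=None):
    """Column order: key columns, attrs_before, feature columns, attrs_after."""
    return ([key, fk_ltable, fk_rtable]
            + list(attrs_before or [])
            + list(feature_names)
            + list(attrs_after or []))


cols = output_columns('_id', 'ltable_ID', 'rtable_ID',
                      feature_names=['title_title_lev', 'year_year_exm'],
                      attrs_before=['title'], attrs_after=['gold_labels'])
print(cols)
```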

ML-Matchers¶

class py_entitymatching.dask.dask_dtmatcher.DaskDTMatcher(*args, **kwargs)¶

WARNING THIS MATCHER IS EXPERIMENTAL AND NOT TESTED. USE AT YOUR OWN RISK.

Decision Tree matcher.

Parameters

*args,**kwargs – The arguments to scikit-learn’s Decision Tree classifier.
name (string) – The name of this matcher (defaults to None). If the matcher name is None, the class automatically generates a string and assigns it as the name.

fit(x=None, y=None, table=None, exclude_attrs=None, target_attr=None)¶

Fit interface for the matcher.

Specifically, there are two ways to call the fit method. The first is an interface similar to scikit-learn, where the feature vectors and the target attribute are given as projected DataFrames. The second is to give the whole DataFrame and explicitly specify the feature vectors (by specifying the attributes to be excluded) and the target attribute.

Note that all the input parameters default to None. This is done to support both interfaces in a single function.

Parameters

x (DataFrame) – The input feature vectors given as pandas DataFrame (defaults to None).
y (DataFrame) – The input target attribute given as pandas DataFrame with a single column (defaults to None).
table (DataFrame) – The input pandas DataFrame containing feature vectors and target attribute (defaults to None).
exclude_attrs (list) – The list of attributes that should be excluded from the input table to get the feature vectors.
target_attr (string) – The target attribute in the input table.
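
The two calling conventions can be reconciled by one small dispatch step. The sketch below only illustrates that idea: `resolve_fit_inputs` is a hypothetical helper, a list of dict rows stands in for a pandas DataFrame, and the precedence given to `(x, y)` when both conventions are supplied is an assumption.

```python
def resolve_fit_inputs(x=None, y=None, table=None,
                       exclude_attrs=None, target_attr=None):
    """Return (feature_rows, labels) from either calling convention."""
    if x is not None and y is not None:
        return x, y                                   # scikit-learn style
    if table is not None and target_attr is not None:
        # The target is never a feature, whether or not it is listed as excluded.
        excluded = set(exclude_attrs or []) | {target_attr}
        features = [{k: v for k, v in row.items() if k not in excluded}
                    for row in table]
        labels = [row[target_attr] for row in table]
        return features, labels
    raise ValueError('pass either (x, y) or (table, exclude_attrs, target_attr)')


table = [
    {'_id': 0, 'name_name_lev': 1.0, 'label': 1},
    {'_id': 1, 'name_name_lev': 7.0, 'label': 0},
]
X, y = resolve_fit_inputs(table=table, exclude_attrs=['_id'], target_attr='label')
print(X, y)
```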

predict(x=None, table=None, exclude_attrs=None, target_attr=None, append=False, return_probs=False, probs_attr=None, inplace=True, show_progress=False, n_chunks=1)¶

WARNING THIS COMMAND IS EXPERIMENTAL AND NOT TESTED. USE AT YOUR OWN RISK.

Predict interface for the matcher.

Specifically, there are two ways to call the predict method. The first is an interface similar to scikit-learn, where the feature vectors are given as a projected DataFrame. The second is to give the whole DataFrame and explicitly specify the feature vectors (by specifying the attributes to be excluded).

Note that x, table, exclude_attrs, target_attr and probs_attr all default to None. This is done to support both interfaces in a single function.

Currently, the Dask implementation supports only the case where table is not None and the flags inplace and append are both False. Because inplace defaults to True, it must be set to False explicitly.

Parameters

x (DataFrame) – The input pandas DataFrame containing only feature vectors (defaults to None).
table (DataFrame) – The input pandas DataFrame containing feature vectors and possibly other attributes (defaults to None).
exclude_attrs (list) – A list of attributes to be excluded from the input table to get the feature vectors (defaults to None).
target_attr (string) – The attribute name where the predictions need to be stored in the input table (defaults to None).
probs_attr (string) – The attribute name where the prediction probabilities need to be stored in the input table (defaults to None).
append (boolean) – A flag to indicate whether the predictions need to be appended in the input DataFrame (defaults to False).
return_probs (boolean) – A flag to indicate whether the prediction probabilities need to be returned (defaults to False). If set to True, returns the probability that the pair is a match.
inplace (boolean) – A flag to indicate whether the append needs to be done inplace (defaults to True).
show_progress (boolean) – A flag to indicate whether the progress of prediction should be displayed (defaults to False).
n_chunks (int) – The number of partitions to split the candidate set into (defaults to 1). If it is set to -1, the number of partitions will be set to the number of cores in the machine.

Returns

An array of predictions or a DataFrame with predictions updated.
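
Partitioned prediction can be illustrated with the standard library. This sketch is not how the Dask matchers are implemented: `concurrent.futures` stands in for Dask, `predict_chunk` is a hypothetical threshold classifier over a made-up `score` attribute, and the predictions are returned as a new list rather than written into the table (corresponding to append=False and inplace=False).

```python
from concurrent.futures import ThreadPoolExecutor


def predict_chunk(rows):
    """Hypothetical classifier: call a pair a match when its score is >= 0.5."""
    return [1 if row['score'] >= 0.5 else 0 for row in rows]


def predict_in_chunks(rows, n_chunks=1):
    """Predict each partition independently, then concatenate in input order."""
    n = max(1, min(n_chunks, len(rows)))
    step = max(1, -(-len(rows) // n))                # ceiling division
    chunks = [rows[i:i + step] for i in range(0, len(rows), step)]
    with ThreadPoolExecutor(max_workers=n) as pool:
        parts = list(pool.map(predict_chunk, chunks))    # map preserves order
    return [label for part in parts for label in part]


table = [{'_id': i, 'score': s}
         for i, s in enumerate([0.9, 0.2, 0.5, 0.4, 0.7])]
predictions = predict_in_chunks(table, n_chunks=2)
print(predictions)
```

Because each partition is predicted independently and the results are concatenated in order, the output does not depend on the number of chunks.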

class py_entitymatching.dask.dask_rfmatcher.DaskRFMatcher(*args, **kwargs)¶

WARNING THIS MATCHER IS EXPERIMENTAL AND NOT TESTED. USE AT YOUR OWN RISK.

Random Forest matcher.

Parameters

*args,**kwargs – The arguments to scikit-learn’s Random Forest classifier.
name (string) – The name of this matcher (defaults to None). If the matcher name is None, the class automatically generates a string and assigns it as the name.

fit(x=None, y=None, table=None, exclude_attrs=None, target_attr=None)¶

Fit interface for the matcher.

Specifically, there are two ways to call the fit method. The first is an interface similar to scikit-learn, where the feature vectors and the target attribute are given as projected DataFrames. The second is to give the whole DataFrame and explicitly specify the feature vectors (by specifying the attributes to be excluded) and the target attribute.

Note that all the input parameters default to None. This is done to support both interfaces in a single function.

Parameters

x (DataFrame) – The input feature vectors given as pandas DataFrame (defaults to None).
y (DataFrame) – The input target attribute given as pandas DataFrame with a single column (defaults to None).
table (DataFrame) – The input pandas DataFrame containing feature vectors and target attribute (defaults to None).
exclude_attrs (list) – The list of attributes that should be excluded from the input table to get the feature vectors.
target_attr (string) – The target attribute in the input table.

predict(x=None, table=None, exclude_attrs=None, target_attr=None, append=False, return_probs=False, probs_attr=None, inplace=True, show_progress=False, n_chunks=1)¶

WARNING THIS COMMAND IS EXPERIMENTAL AND NOT TESTED. USE AT YOUR OWN RISK.

Predict interface for the matcher.

Specifically, there are two ways to call the predict method. The first is an interface similar to scikit-learn, where the feature vectors are given as a projected DataFrame. The second is to give the whole DataFrame and explicitly specify the feature vectors (by specifying the attributes to be excluded).

Note that x, table, exclude_attrs, target_attr and probs_attr all default to None. This is done to support both interfaces in a single function.

Currently, the Dask implementation supports only the case where table is not None and the flags inplace and append are both False. Because inplace defaults to True, it must be set to False explicitly.

Parameters

x (DataFrame) – The input pandas DataFrame containing only feature vectors (defaults to None).
table (DataFrame) – The input pandas DataFrame containing feature vectors and possibly other attributes (defaults to None).
exclude_attrs (list) – A list of attributes to be excluded from the input table to get the feature vectors (defaults to None).
target_attr (string) – The attribute name where the predictions need to be stored in the input table (defaults to None).
probs_attr (string) – The attribute name where the prediction probabilities need to be stored in the input table (defaults to None).
append (boolean) – A flag to indicate whether the predictions need to be appended in the input DataFrame (defaults to False).
return_probs (boolean) – A flag to indicate whether the prediction probabilities need to be returned (defaults to False). If set to True, returns the probability that the pair is a match.
inplace (boolean) – A flag to indicate whether the append needs to be done inplace (defaults to True).
show_progress (boolean) – A flag to indicate whether the progress of prediction should be displayed (defaults to False).
n_chunks (int) – The number of partitions to split the candidate set into (defaults to 1). If it is set to -1, the number of partitions will be set to the number of cores in the machine.

Returns

An array of predictions or a DataFrame with predictions updated.

class py_entitymatching.dask.dask_nbmatcher.DaskNBMatcher(*args, **kwargs)¶

WARNING THIS MATCHER IS EXPERIMENTAL AND NOT TESTED. USE AT YOUR OWN RISK.

Naive Bayes matcher.

Parameters

*args,**kwargs – The arguments to scikit-learn’s Naive Bayes classifier.
name (string) – The name of this matcher (defaults to None). If the matcher name is None, the class automatically generates a string and assigns it as the name.

fit(x=None, y=None, table=None, exclude_attrs=None, target_attr=None)¶

Fit interface for the matcher.

Specifically, there are two ways to call the fit method. The first is an interface similar to scikit-learn, where the feature vectors and the target attribute are given as projected DataFrames. The second is to give the whole DataFrame and explicitly specify the feature vectors (by specifying the attributes to be excluded) and the target attribute.

Note that all the input parameters default to None. This is done to support both interfaces in a single function.

Parameters

x (DataFrame) – The input feature vectors given as pandas DataFrame (defaults to None).
y (DataFrame) – The input target attribute given as pandas DataFrame with a single column (defaults to None).
table (DataFrame) – The input pandas DataFrame containing feature vectors and target attribute (defaults to None).
exclude_attrs (list) – The list of attributes that should be excluded from the input table to get the feature vectors.
target_attr (string) – The target attribute in the input table.

predict(x=None, table=None, exclude_attrs=None, target_attr=None, append=False, return_probs=False, probs_attr=None, inplace=True, show_progress=False, n_chunks=1)¶

WARNING THIS COMMAND IS EXPERIMENTAL AND NOT TESTED. USE AT YOUR OWN RISK.

Predict interface for the matcher.

Specifically, there are two ways to call the predict method. The first is an interface similar to scikit-learn, where the feature vectors are given as a projected DataFrame. The second is to give the whole DataFrame and explicitly specify the feature vectors (by specifying the attributes to be excluded).

Note that x, table, exclude_attrs, target_attr and probs_attr all default to None. This is done to support both interfaces in a single function.

Currently, the Dask implementation supports only the case where table is not None and the flags inplace and append are both False. Because inplace defaults to True, it must be set to False explicitly.

Parameters

x (DataFrame) – The input pandas DataFrame containing only feature vectors (defaults to None).
table (DataFrame) – The input pandas DataFrame containing feature vectors and possibly other attributes (defaults to None).
exclude_attrs (list) – A list of attributes to be excluded from the input table to get the feature vectors (defaults to None).
target_attr (string) – The attribute name where the predictions need to be stored in the input table (defaults to None).
probs_attr (string) – The attribute name where the prediction probabilities need to be stored in the input table (defaults to None).
append (boolean) – A flag to indicate whether the predictions need to be appended in the input DataFrame (defaults to False).
return_probs (boolean) – A flag to indicate whether the prediction probabilities need to be returned (defaults to False). If set to True, returns the probability that the pair is a match.
inplace (boolean) – A flag to indicate whether the append needs to be done inplace (defaults to True).
show_progress (boolean) – A flag to indicate whether the progress of prediction should be displayed (defaults to False).
n_chunks (int) – The number of partitions to split the candidate set into (defaults to 1). If it is set to -1, the number of partitions will be set to the number of cores in the machine.

Returns

An array of predictions or a DataFrame with predictions updated.

class py_entitymatching.dask.dask_logregmatcher.DaskLogRegMatcher(*args, **kwargs)¶

WARNING THIS MATCHER IS EXPERIMENTAL AND NOT TESTED. USE AT YOUR OWN RISK.

Logistic Regression matcher.

Parameters

*args,**kwargs – The arguments to scikit-learn’s Logistic Regression classifier.
name (string) – The name of this matcher (defaults to None). If the matcher name is None, the class automatically generates a string and assigns it as the name.

fit(x=None, y=None, table=None, exclude_attrs=None, target_attr=None)¶

Fit interface for the matcher.

Specifically, there are two ways to call the fit method. The first is an interface similar to scikit-learn, where the feature vectors and the target attribute are given as projected DataFrames. The second is to give the whole DataFrame and explicitly specify the feature vectors (by specifying the attributes to be excluded) and the target attribute.

Note that all the input parameters default to None. This is done to support both interfaces in a single function.

Parameters

x (DataFrame) – The input feature vectors given as pandas DataFrame (defaults to None).
y (DataFrame) – The input target attribute given as pandas DataFrame with a single column (defaults to None).
table (DataFrame) – The input pandas DataFrame containing feature vectors and target attribute (defaults to None).
exclude_attrs (list) – The list of attributes that should be excluded from the input table to get the feature vectors.
target_attr (string) – The target attribute in the input table.

predict(x=None, table=None, exclude_attrs=None, target_attr=None, append=False, return_probs=False, probs_attr=None, inplace=True, show_progress=False, n_chunks=1)¶

WARNING THIS COMMAND IS EXPERIMENTAL AND NOT TESTED. USE AT YOUR OWN RISK.

Predict interface for the matcher.

Specifically, there are two ways to call the predict method. The first is an interface similar to scikit-learn, where the feature vectors are given as a projected DataFrame. The second is to give the whole DataFrame and explicitly specify the feature vectors (by specifying the attributes to be excluded).

Note that x, table, exclude_attrs, target_attr and probs_attr all default to None. This is done to support both interfaces in a single function.

Currently, the Dask implementation supports only the case where table is not None and the flags inplace and append are both False. Because inplace defaults to True, it must be set to False explicitly.

Parameters

x (DataFrame) – The input pandas DataFrame containing only feature vectors (defaults to None).
table (DataFrame) – The input pandas DataFrame containing feature vectors and possibly other attributes (defaults to None).
exclude_attrs (list) – A list of attributes to be excluded from the input table to get the feature vectors (defaults to None).
target_attr (string) – The attribute name where the predictions need to be stored in the input table (defaults to None).
probs_attr (string) – The attribute name where the prediction probabilities need to be stored in the input table (defaults to None).
append (boolean) – A flag to indicate whether the predictions need to be appended in the input DataFrame (defaults to False).
return_probs (boolean) – A flag to indicate whether the prediction probabilities need to be returned (defaults to False). If set to True, returns the probability that the pair is a match.
inplace (boolean) – A flag to indicate whether the append needs to be done inplace (defaults to True).
show_progress (boolean) – A flag to indicate whether the progress of prediction should be displayed (defaults to False).
n_chunks (int) – The number of partitions to split the candidate set into (defaults to 1). If it is set to -1, the number of partitions will be set to the number of cores in the machine.

Returns

An array of predictions or a DataFrame with predictions updated.

class py_entitymatching.dask.dask_xgboost_matcher.DaskXGBoostMatcher(*args, **kwargs)¶

WARNING THIS MATCHER IS EXPERIMENTAL AND NOT TESTED. USE AT YOUR OWN RISK.

XGBoost matcher.

Parameters

*args,**kwargs – The arguments to XGBoost classifier.
name (string) – The name of this matcher (defaults to None). If the matcher name is None, the class automatically generates a string and assigns it as the name.

fit(x=None, y=None, table=None, exclude_attrs=None, target_attr=None)¶

Fit interface for the matcher.

Specifically, there are two ways to call the fit method. The first is an interface similar to scikit-learn, where the feature vectors and the target attribute are given as projected DataFrames. The second is to give the whole DataFrame and explicitly specify the feature vectors (by specifying the attributes to be excluded) and the target attribute.

Note that all the input parameters default to None. This is done to support both interfaces in a single function.

Parameters

x (DataFrame) – The input feature vectors given as pandas DataFrame (defaults to None).
y (DataFrame) – The input target attribute given as pandas DataFrame with a single column (defaults to None).
table (DataFrame) – The input pandas DataFrame containing feature vectors and target attribute (defaults to None).
exclude_attrs (list) – The list of attributes that should be excluded from the input table to get the feature vectors.
target_attr (string) – The target attribute in the input table.

predict(x=None, table=None, exclude_attrs=None, target_attr=None, append=False, return_probs=False, probs_attr=None, inplace=True, show_progress=False, n_chunks=1)¶

WARNING THIS COMMAND IS EXPERIMENTAL AND NOT TESTED. USE AT YOUR OWN RISK.

Predict interface for the matcher.

Specifically, there are two ways to call the predict method. The first is an interface similar to scikit-learn, where the feature vectors are given as a projected DataFrame. The second is to give the whole DataFrame and explicitly specify the feature vectors (by specifying the attributes to be excluded).

Note that x, table, exclude_attrs, target_attr and probs_attr all default to None. This is done to support both interfaces in a single function.

Currently, the Dask implementation supports only the case where table is not None and the flags inplace and append are both False. Because inplace defaults to True, it must be set to False explicitly.

Parameters

x (DataFrame) – The input pandas DataFrame containing only feature vectors (defaults to None).
table (DataFrame) – The input pandas DataFrame containing feature vectors and possibly other attributes (defaults to None).
exclude_attrs (list) – A list of attributes to be excluded from the input table to get the feature vectors (defaults to None).
target_attr (string) – The attribute name where the predictions need to be stored in the input table (defaults to None).
probs_attr (string) – The attribute name where the prediction probabilities need to be stored in the input table (defaults to None).
append (boolean) – A flag to indicate whether the predictions need to be appended in the input DataFrame (defaults to False).
return_probs (boolean) – A flag to indicate whether the prediction probabilities need to be returned (defaults to False). If set to True, returns the probability that the pair is a match.
inplace (boolean) – A flag to indicate whether the append needs to be done inplace (defaults to True).
show_progress (boolean) – A flag to indicate whether the progress of prediction should be displayed (defaults to False).
n_chunks (int) – The number of partitions to split the candidate set into (defaults to 1). If it is set to -1, the number of partitions will be set to the number of cores in the machine.

Returns

An array of predictions or a DataFrame with predictions updated.
