Position Filter

class py_stringsimjoin.filter.position_filter.PositionFilter(tokenizer, sim_measure_type, threshold, allow_empty=True, allow_missing=False)

Finds candidate matching pairs of strings using the position filtering technique.

For similarity measures such as cosine, Dice, Jaccard and overlap, the filter finds candidate string pairs that may have a similarity score greater than or equal to the input threshold, as specified in “threshold”. For distance measures such as edit distance, the filter finds candidate string pairs that may have a distance less than or equal to the threshold.

To learn more about position filtering, refer to the string matching chapter of the “Principles of Data Integration” book.
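The idea behind the filter can be sketched for the Jaccard measure as follows. This is a minimal illustration and not the library's implementation: the helper names `required_overlap` and `position_filter_drops` are hypothetical, tokens are ordered alphabetically (an assumption made for simplicity; any ordering shared by both strings works), and only the first common token is examined. The return value follows the convention of filter_pair, where True means the pair is dropped.

```python
import math


def required_overlap(len_x, len_y, threshold):
    """Minimum token overlap needed for Jaccard(x, y) >= threshold.

    Jaccard(x, y) >= t  is equivalent to  |x & y| >= t / (1 + t) * (|x| + |y|).
    A small epsilon guards against floating-point error pushing the value
    just above an integer, which would make the filter prune too much.
    """
    return math.ceil(threshold / (1.0 + threshold) * (len_x + len_y) - 1e-9)


def position_filter_drops(x_tokens, y_tokens, threshold):
    """Return True if the pair can be safely dropped (hypothetical helper)."""
    # Assumption: both token sets are sorted with the same global ordering.
    x = sorted(set(x_tokens))
    y = sorted(set(y_tokens))
    alpha = required_overlap(len(x), len(y), threshold)

    y_pos = {tok: j for j, tok in enumerate(y)}
    for i, tok in enumerate(x):
        j = y_pos.get(tok)
        if j is not None:
            # tok is the first common token, so nothing before position i in x
            # or before position j in y can contribute to the overlap. The
            # overlap is at most 1 plus the shorter of the two remaining suffixes.
            upper_bound = 1 + min(len(x) - i - 1, len(y) - j - 1)
            return upper_bound < alpha
    # No common token at all: the overlap is 0.
    return alpha > 0


print(position_filter_drops(list("abcd"), list("cdef"), 0.8))
print(position_filter_drops(list("abcd"), list("cdef"), 0.3))
```

In the first call the overlap can be at most 2 while a Jaccard score of 0.8 would require 4 shared tokens, so the pair is dropped without computing the actual similarity. A pair that survives is only a candidate; its true score still has to be verified.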

Parameters

tokenizer (Tokenizer) – tokenizer to be used.
sim_measure_type (string) – similarity measure type. Supported types are ‘JACCARD’, ‘COSINE’, ‘DICE’, ‘OVERLAP’ and ‘EDIT_DISTANCE’.
threshold (float) – threshold to be used by the filter.
allow_empty (boolean) – A flag to indicate whether pairs in which both strings are tokenized into an empty set of tokens should survive the filter (defaults to True). This flag is not valid for measures such as ‘OVERLAP’ and ‘EDIT_DISTANCE’.
allow_missing (boolean) – A flag to indicate whether pairs containing a missing value should survive the filter (defaults to False).
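The effect of the two flags can be illustrated with a small pre-check that runs before any position-based pruning. This is a sketch under stated assumptions, not the library's code: the function name `precheck` and its string return values are hypothetical, a missing value is represented by None, and the measure type is ignored (so the restriction on ‘OVERLAP’ and ‘EDIT_DISTANCE’ is not modeled).

```python
def precheck(l_value, r_value, tokenize, allow_empty=True, allow_missing=False):
    """Decide pairs that the flags settle on their own (hypothetical helper).

    Returns 'keep' or 'drop' when a flag decides the outcome, and 'filter'
    when the pair must be handed to the actual position filter.
    """
    # Assumption: a missing value is represented by None.
    if l_value is None or r_value is None:
        return 'keep' if allow_missing else 'drop'

    l_tokens = set(tokenize(l_value))
    r_tokens = set(tokenize(r_value))

    # allow_empty only concerns pairs in which BOTH sides have no tokens.
    if not l_tokens and not r_tokens:
        return 'keep' if allow_empty else 'drop'

    return 'filter'


# str.split serves as a stand-in whitespace tokenizer.
print(precheck(None, "a b", str.split))
print(precheck("", "   ", str.split))
print(precheck("a b", "", str.split))
```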

tokenizer

An attribute to store the tokenizer.

Type: Tokenizer

sim_measure_type

An attribute to store the similarity measure type.

Type: string

threshold

An attribute to store the threshold value.

Type: float

allow_empty

An attribute to store the value of the flag allow_empty.

Type: boolean

allow_missing

An attribute to store the value of the flag allow_missing.

Type: boolean

filter_candset(candset, candset_l_key_attr, candset_r_key_attr, ltable, rtable, l_key_attr, r_key_attr, l_filter_attr, r_filter_attr, n_jobs=1, show_progress=True)

Finds candidate matching pairs of strings from the input candidate set.

Parameters

candset (DataFrame) – input candidate set.
candset_l_key_attr (string) – attribute in candidate set which is a key in left table.
candset_r_key_attr (string) – attribute in candidate set which is a key in right table.
ltable (DataFrame) – left input table.
rtable (DataFrame) – right input table.
l_key_attr (string) – key attribute in left table.
r_key_attr (string) – key attribute in right table.
l_filter_attr (string) – attribute in left table on which the filter should be applied.
r_filter_attr (string) – attribute in right table on which the filter should be applied.
n_jobs (int) – number of parallel jobs to use for the computation (defaults to 1). If -1 is given, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) CPUs are used, where n_cpus is the total number of CPUs in the machine. Thus, for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) is less than 1, no parallel computing code is used (i.e., the behavior is equivalent to the default).
show_progress (boolean) – flag to indicate whether task progress should be displayed to the user (defaults to True).

Returns

An output table containing tuple pairs from the candidate set that survive the filter (DataFrame).
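The n_jobs rule described above can be written out as a short function. This is a sketch of the documented arithmetic only: the name `effective_n_jobs` is hypothetical, and rejecting n_jobs = 0 is an assumption, since the text does not say how that value is treated.

```python
import os


def effective_n_jobs(n_jobs, n_cpus=None):
    """Resolve n_jobs to a number of workers, following the documented rule."""
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    if n_jobs == 0:
        # Assumption: the documentation does not define n_jobs = 0.
        raise ValueError("n_jobs = 0 is not covered by the documented rule")
    if n_jobs > 0:
        return n_jobs
    # Negative values: -1 means all CPUs, -2 means all but one, and so on.
    resolved = n_cpus + 1 + n_jobs
    # Fall back to a single job (no parallelism) when the result is below 1.
    return resolved if resolved >= 1 else 1


print(effective_n_jobs(-1, n_cpus=8))
print(effective_n_jobs(-2, n_cpus=8))
print(effective_n_jobs(-10, n_cpus=4))
```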

filter_pair(lstring, rstring)

Checks if the input strings get dropped by the position filter.

Parameters

lstring (string) – left input string.
rstring (string) – right input string.

Returns

A flag indicating whether the string pair is dropped (boolean).

filter_tables(ltable, rtable, l_key_attr, r_key_attr, l_filter_attr, r_filter_attr, l_out_attrs=None, r_out_attrs=None, l_out_prefix='l_', r_out_prefix='r_', n_jobs=1, show_progress=True)

Finds candidate matching pairs of strings from the input tables using the position filtering technique.

Parameters

ltable (DataFrame) – left input table.
rtable (DataFrame) – right input table.
l_key_attr (string) – key attribute in left table.
r_key_attr (string) – key attribute in right table.
l_filter_attr (string) – attribute in left table on which the filter should be applied.
r_filter_attr (string) – attribute in right table on which the filter should be applied.
l_out_attrs (list) – list of attribute names from the left table to be included in the output table (defaults to None).
r_out_attrs (list) – list of attribute names from the right table to be included in the output table (defaults to None).
l_out_prefix (string) – prefix to be used for the attribute names coming from the left table, in the output table (defaults to ‘l_’).
r_out_prefix (string) – prefix to be used for the attribute names coming from the right table, in the output table (defaults to ‘r_’).
n_jobs (int) – number of parallel jobs to use for the computation (defaults to 1). If -1 is given, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) CPUs are used, where n_cpus is the total number of CPUs in the machine. Thus, for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) is less than 1, no parallel computing code is used (i.e., the behavior is equivalent to the default).
show_progress (boolean) – flag to indicate whether task progress should be displayed to the user (defaults to True).

Returns

An output table containing tuple pairs that survive the filter (DataFrame).
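How the output attributes and prefixes combine can be illustrated with plain Python data. This is a conceptual sketch, not the library's implementation: tables are lists of dictionaries rather than DataFrames, the function name `filter_tables_sketch` and the `drops` callback are hypothetical, every pair is compared with a naive nested loop, and including the prefixed key attributes in each output row is an assumption.

```python
def filter_tables_sketch(ltable, rtable, l_key_attr, r_key_attr,
                         l_filter_attr, r_filter_attr, drops,
                         l_out_attrs=None, r_out_attrs=None,
                         l_out_prefix='l_', r_out_prefix='r_'):
    """Return the pairs for which drops(left_value, right_value) is False."""
    output = []
    for l_row in ltable:
        for r_row in rtable:
            if drops(l_row[l_filter_attr], r_row[r_filter_attr]):
                continue  # the pair does not survive the filter
            # Assumption: prefixed key attributes identify the surviving pair.
            out_row = {
                l_out_prefix + l_key_attr: l_row[l_key_attr],
                r_out_prefix + r_key_attr: r_row[r_key_attr],
            }
            # Optional extra attributes, each renamed with its table's prefix.
            for attr in l_out_attrs or []:
                out_row[l_out_prefix + attr] = l_row[attr]
            for attr in r_out_attrs or []:
                out_row[r_out_prefix + attr] = r_row[attr]
            output.append(out_row)
    return output


ltable = [{'id': 1, 'name': 'ann'}, {'id': 2, 'name': 'bob'}]
rtable = [{'id': 10, 'name': 'ann'}, {'id': 11, 'name': 'zed'}]

# Stand-in predicate: drop every pair whose names differ.
pairs = filter_tables_sketch(ltable, rtable, 'id', 'id', 'name', 'name',
                             drops=lambda a, b: a != b,
                             l_out_attrs=['name'])
print(pairs)
```

The prefixes keep attributes apart when both tables use the same column names, as with `id` and `name` here.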