Overlap Filter

class py_stringsimjoin.filter.overlap_filter.OverlapFilter(tokenizer, overlap_size=1, comp_op='>=', allow_missing=False)

Finds candidate matching pairs of strings using the overlap filtering technique.

A string pair is output by the overlap filter only if the number of tokens common to both strings satisfies the condition on the overlap size threshold. For example, if the comparison operator is ‘>=’, a string pair is output if the number of common tokens is greater than or equal to the threshold given by “overlap_size”.
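The condition described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper name `satisfies_overlap` is hypothetical, and the strings are assumed to be tokenized on whitespace with duplicate tokens collapsed into a set.

```python
import operator

# Map each supported comparison operator to its Python equivalent.
COMP_OPS = {'>=': operator.ge, '>': operator.gt, '=': operator.eq}


def satisfies_overlap(lstring, rstring, overlap_size=1, comp_op='>='):
    """Return True if the pair meets the overlap condition (hypothetical helper)."""
    # Assumption: whitespace tokenization, with duplicate tokens collapsed.
    l_tokens = set(lstring.split())
    r_tokens = set(rstring.split())
    num_common = len(l_tokens & r_tokens)
    return COMP_OPS[comp_op](num_common, overlap_size)


# The two strings share the tokens 'data' and 'systems'.
print(satisfies_overlap('data base systems', 'data systems', overlap_size=2))
print(satisfies_overlap('data base systems', 'data systems',
                        overlap_size=2, comp_op='>'))
```

With two common tokens, the pair passes under ‘>=’ and ‘=’ for a threshold of 2, but not under ‘>’.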

Parameters:
  • tokenizer (Tokenizer) – tokenizer to be used.
  • overlap_size (int) – overlap threshold to be used by the filter (defaults to 1).
  • comp_op (string) – comparison operator. Supported values are ‘>=’, ‘>’ and ‘=’ (defaults to ‘>=’).
  • allow_missing (boolean) – A flag to indicate whether pairs containing a missing value should survive the filter (defaults to False).
tokenizer

Tokenizer – An attribute to store the tokenizer.

overlap_size

int – An attribute to store the overlap threshold value.

comp_op

string – An attribute to store the comparison operator.

allow_missing

boolean – An attribute to store the value of the flag allow_missing.

filter_candset(candset, candset_l_key_attr, candset_r_key_attr, ltable, rtable, l_key_attr, r_key_attr, l_filter_attr, r_filter_attr, n_jobs=1, show_progress=True)

Finds candidate matching pairs of strings from the input candidate set.

Parameters:
  • candset (DataFrame) – input candidate set.
  • candset_l_key_attr (string) – attribute in candidate set which is a key in left table.
  • candset_r_key_attr (string) – attribute in candidate set which is a key in right table.
  • ltable (DataFrame) – left input table.
  • rtable (DataFrame) – right input table.
  • l_key_attr (string) – key attribute in left table.
  • r_key_attr (string) – key attribute in right table.
  • l_filter_attr (string) – attribute in left table on which the filter should be applied.
  • r_filter_attr (string) – attribute in right table on which the filter should be applied.
  • n_jobs (int) – number of parallel jobs to use for the computation (defaults to 1). If -1 is given, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) becomes less than 1, then no parallel computing code will be used (i.e., equivalent to the default).
  • show_progress (boolean) – flag to indicate whether task progress should be displayed to the user (defaults to True).
Returns:

An output table containing tuple pairs from the candidate set that survive the filter (DataFrame).
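The n_jobs rule described in the parameter list can be sketched as a small function. This is only an illustration of the documented rule, not the library's code: the helper name `effective_n_jobs` and its `n_cpus` argument are hypothetical, and treating 0 like any other value below 1 is an assumption, since the description does not cover that case.

```python
import multiprocessing


def effective_n_jobs(n_jobs, n_cpus=None):
    """Resolve n_jobs to a worker count per the documented rule (hypothetical helper)."""
    if n_cpus is None:
        n_cpus = multiprocessing.cpu_count()
    if n_jobs < 0:
        # -1 -> all CPUs, -2 -> all CPUs but one, and so on.
        n_jobs = n_cpus + 1 + n_jobs
    # Anything below 1 falls back to a single, non-parallel job.
    # Assumption: n_jobs=0 is handled the same way.
    return max(n_jobs, 1)


# On a machine assumed to have 8 CPUs:
for requested in (1, 4, -1, -2, -20):
    print(requested, effective_n_jobs(requested, n_cpus=8))
```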

filter_pair(lstring, rstring)

Checks if the input strings get dropped by the overlap filter.

Parameters:
  • lstring, rstring (string) – input strings.
Returns:

A flag indicating whether the string pair is dropped (boolean).
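Because the returned flag reports whether the pair is dropped, it is the negation of the overlap condition. The sketch below illustrates that polarity for the ‘>=’ operator only. The helper name `is_dropped` is hypothetical, the inputs are assumed to be already tokenized, and representing a missing value as `None` is an assumption made for this example.

```python
def is_dropped(l_tokens, r_tokens, overlap_size=1, allow_missing=False):
    """Return True if the pair is dropped (hypothetical helper, '>=' only)."""
    # Assumption: a missing value is represented by None.
    if l_tokens is None or r_tokens is None:
        # Pairs with a missing value survive only when allow_missing is True.
        return not allow_missing
    num_common = len(set(l_tokens) & set(r_tokens))
    # The pair survives when the overlap condition holds, so negate it.
    return not (num_common >= overlap_size)


print(is_dropped(['data', 'systems'], ['data', 'mining'], overlap_size=1))
print(is_dropped(['data', 'systems'], ['data', 'mining'], overlap_size=2))
print(is_dropped(None, ['data'], allow_missing=True))
```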
filter_tables(ltable, rtable, l_key_attr, r_key_attr, l_filter_attr, r_filter_attr, l_out_attrs=None, r_out_attrs=None, l_out_prefix='l_', r_out_prefix='r_', out_sim_score=False, n_jobs=1, show_progress=True)

Finds candidate matching pairs of strings from the input tables using the overlap filtering technique.

Parameters:
  • ltable (DataFrame) – left input table.
  • rtable (DataFrame) – right input table.
  • l_key_attr (string) – key attribute in left table.
  • r_key_attr (string) – key attribute in right table.
  • l_filter_attr (string) – attribute in left table on which the filter should be applied.
  • r_filter_attr (string) – attribute in right table on which the filter should be applied.
  • l_out_attrs (list) – list of attribute names from the left table to be included in the output table (defaults to None).
  • r_out_attrs (list) – list of attribute names from the right table to be included in the output table (defaults to None).
  • l_out_prefix (string) – prefix to be used for the attribute names coming from the left table, in the output table (defaults to ‘l_’).
  • r_out_prefix (string) – prefix to be used for the attribute names coming from the right table, in the output table (defaults to ‘r_’).
  • out_sim_score (boolean) – flag to indicate whether the overlap score should be included in the output table (defaults to False). Setting this flag to True will add a column named ‘_sim_score’ in the output table. This column will contain the overlap scores for the tuple pairs in the output.
  • n_jobs (int) – number of parallel jobs to use for the computation (defaults to 1). If -1 is given, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) becomes less than 1, then no parallel computing code will be used (i.e., equivalent to the default).
  • show_progress (boolean) – flag to indicate whether task progress should be displayed to the user (defaults to True).
Returns:

An output table containing tuple pairs that survive the filter (DataFrame).
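To make the shape of the output concrete, here is a naive nested-loop sketch of an overlap join over two small tables. It is not how the library computes the result, and it uses plain lists of dicts instead of DataFrames to stay dependency-free. The helper name `naive_overlap_join`, the whitespace tokenization, the ‘>=’ comparison, and the choice to emit only the prefixed key attributes are all assumptions made for this example.

```python
def naive_overlap_join(ltable, rtable, l_key_attr, r_key_attr,
                       l_filter_attr, r_filter_attr, overlap_size=1,
                       l_out_prefix='l_', r_out_prefix='r_',
                       out_sim_score=False):
    """Naive nested-loop sketch over lists of dicts (hypothetical helper)."""
    output = []
    for l_row in ltable:
        # Assumption: whitespace tokenization into a set of tokens.
        l_tokens = set(l_row[l_filter_attr].split())
        for r_row in rtable:
            r_tokens = set(r_row[r_filter_attr].split())
            overlap = len(l_tokens & r_tokens)
            if overlap >= overlap_size:
                # Key attributes are renamed with the left/right prefixes.
                out_row = {
                    l_out_prefix + l_key_attr: l_row[l_key_attr],
                    r_out_prefix + r_key_attr: r_row[r_key_attr],
                }
                if out_sim_score:
                    out_row['_sim_score'] = overlap
                output.append(out_row)
    return output


ltable = [{'id': 1, 'title': 'data base systems'},
          {'id': 2, 'title': 'operating systems'}]
rtable = [{'id': 10, 'title': 'data systems'},
          {'id': 20, 'title': 'compilers'}]

pairs = naive_overlap_join(ltable, rtable, 'id', 'id', 'title', 'title',
                           overlap_size=2, out_sim_score=True)
print(pairs)
```

Only the pair (1, 10) shares at least two tokens, so it is the only row that survives, with an overlap score of 2.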