Position Filter

class py_stringsimjoin.filter.position_filter.PositionFilter(tokenizer, sim_measure_type, threshold, allow_empty=True, allow_missing=False)

Finds candidate matching pairs of strings using the position filtering technique.

For similarity measures such as cosine, Dice, Jaccard and overlap, the filter finds candidate string pairs that may have a similarity score greater than or equal to the input threshold, as specified by the threshold parameter. For distance measures such as edit distance, the filter finds candidate string pairs that may have a distance less than or equal to the threshold.

To learn more about position filtering, refer to the string matching chapter of the “Principles of Data Integration” book.
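
As a rough illustration of the idea for Jaccard, the sketch below bounds the overlap two token sets can reach from the position of their first shared token, and drops the pair when that bound falls below what the threshold requires. It is a simplified sketch, not the library's implementation: the function names are hypothetical, and plain lexicographic order stands in for the global token ordering (real implementations typically order tokens by frequency and scan only a prefix).

```python
import math


def required_overlap(len_x, len_y, threshold):
    """Minimum overlap needed for Jaccard(x, y) >= threshold."""
    # Jaccard = o / (|x| + |y| - o) >= t  <=>  o >= t / (1 + t) * (|x| + |y|)
    bound = threshold / (1.0 + threshold) * (len_x + len_y)
    # Subtract a tiny epsilon so float noise does not push an exact integer up.
    return math.ceil(bound - 1e-9)


def position_filter_drops(x_tokens, y_tokens, threshold):
    """Return True if the pair can be safely dropped (hypothetical helper)."""
    # Assumption: lexicographic order stands in for the global token ordering.
    x = sorted(set(x_tokens))
    y = sorted(set(y_tokens))
    common = set(x) & set(y)
    if not common:
        return True  # no shared token, so the overlap is 0 (assumes threshold > 0)
    first = min(common)  # smallest shared token under the ordering
    i, j = x.index(first), y.index(first)
    # Tokens before `first` cannot match, so only the remaining suffixes add overlap.
    upper_bound = 1 + min(len(x) - i - 1, len(y) - j - 1)
    return upper_bound < required_overlap(len(x), len(y), threshold)


print(position_filter_drops(["a", "b", "c", "d"], ["c", "x", "y", "z"], 0.5))  # True
print(position_filter_drops(["a", "b", "c", "d"], ["a", "b", "c", "e"], 0.5))  # False
```

In the first call the only shared token sits near the end of one set, so at most 2 tokens can overlap while a Jaccard score of 0.5 needs 3; the pair is dropped without computing the actual similarity.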

Parameters
  • tokenizer (Tokenizer) – tokenizer to be used.

  • sim_measure_type (string) – similarity measure type. Supported types are ‘JACCARD’, ‘COSINE’, ‘DICE’, ‘OVERLAP’ and ‘EDIT_DISTANCE’.

  • threshold (float) – threshold to be used by the filter.

  • allow_empty (boolean) – A flag to indicate whether pairs in which both strings are tokenized into an empty set of tokens should survive the filter (defaults to True). This flag is not valid for measures such as ‘OVERLAP’ and ‘EDIT_DISTANCE’.

  • allow_missing (boolean) – A flag to indicate whether pairs containing a missing value should survive the filter (defaults to False).
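
The allow_empty and allow_missing flags only affect special-case pairs. A minimal sketch of that decision follows, assuming a missing value is modeled as None and using a hypothetical helper name. It illustrates the behavior described above rather than the library's code, and it ignores the restriction that allow_empty does not apply to ‘OVERLAP’ and ‘EDIT_DISTANCE’.

```python
def special_case_decision(l_tokens, r_tokens, allow_empty=True, allow_missing=False):
    """Return True (pair survives), False (pair dropped), or None (run the filter)."""
    # Assumption: a missing value is represented by None instead of a token list.
    if l_tokens is None or r_tokens is None:
        return allow_missing
    # Both strings tokenized into an empty set of tokens.
    if len(l_tokens) == 0 and len(r_tokens) == 0:
        return allow_empty
    # Not a special case: the regular filtering logic decides.
    return None


print(special_case_decision([], []))            # True  (allow_empty defaults to True)
print(special_case_decision(None, ["a", "b"]))  # False (allow_missing defaults to False)
```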

tokenizer

An attribute to store the tokenizer.

Type

Tokenizer

sim_measure_type

An attribute to store the similarity measure type.

Type

string

threshold

An attribute to store the threshold value.

Type

float

allow_empty

An attribute to store the value of the flag allow_empty.

Type

boolean

allow_missing

An attribute to store the value of the flag allow_missing.

Type

boolean

filter_candset(candset, candset_l_key_attr, candset_r_key_attr, ltable, rtable, l_key_attr, r_key_attr, l_filter_attr, r_filter_attr, n_jobs=1, show_progress=True)

Finds candidate matching pairs of strings from the input candidate set.

Parameters
  • candset (DataFrame) – input candidate set.

  • candset_l_key_attr (string) – attribute in candidate set which is a key in left table.

  • candset_r_key_attr (string) – attribute in candidate set which is a key in right table.

  • ltable (DataFrame) – left input table.

  • rtable (DataFrame) – right input table.

  • l_key_attr (string) – key attribute in left table.

  • r_key_attr (string) – key attribute in right table.

  • l_filter_attr (string) – attribute in left table on which the filter should be applied.

  • r_filter_attr (string) – attribute in right table on which the filter should be applied.

  • n_jobs (int) – number of parallel jobs to use for the computation (defaults to 1). If -1 is given, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) CPUs are used, where n_cpus is the total number of CPUs in the machine. Thus for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) is less than 1, no parallel computing code is used (i.e., equivalent to the default).

  • show_progress (boolean) – flag to indicate whether task progress should be displayed to the user (defaults to True).

Returns

An output table containing tuple pairs from the candidate set that survive the filter (DataFrame).
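
The n_jobs rule described in the parameter list can be sketched as a small function. The name effective_n_jobs and the n_cpus argument are hypothetical and exist only for illustration; treating n_jobs = 0 like the default is an assumption, since the text above does not cover that case.

```python
import os


def effective_n_jobs(n_jobs, n_cpus=None):
    """Resolve n_jobs to the number of parallel jobs actually used (sketch)."""
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    if n_jobs < 0:
        # -1 -> all CPUs, -2 -> all CPUs but one, and so on.
        n_jobs = n_cpus + 1 + n_jobs
    # Anything below 1 falls back to a single job, i.e. no parallelism.
    return max(n_jobs, 1)


print(effective_n_jobs(-1, n_cpus=8))   # 8
print(effective_n_jobs(-2, n_cpus=8))   # 7
print(effective_n_jobs(-20, n_cpus=8))  # 1
```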

filter_pair(lstring, rstring)

Checks if the input strings get dropped by the position filter.

Parameters
  • lstring (string) – left input string.

  • rstring (string) – right input string.

Returns

A flag indicating whether the string pair is dropped by the filter: True if dropped, False if it survives (boolean).

filter_tables(ltable, rtable, l_key_attr, r_key_attr, l_filter_attr, r_filter_attr, l_out_attrs=None, r_out_attrs=None, l_out_prefix='l_', r_out_prefix='r_', n_jobs=1, show_progress=True)

Finds candidate matching pairs of strings from the input tables using the position filtering technique.

Parameters
  • ltable (DataFrame) – left input table.

  • rtable (DataFrame) – right input table.

  • l_key_attr (string) – key attribute in left table.

  • r_key_attr (string) – key attribute in right table.

  • l_filter_attr (string) – attribute in left table on which the filter should be applied.

  • r_filter_attr (string) – attribute in right table on which the filter should be applied.

  • l_out_attrs (list) – list of attribute names from the left table to be included in the output table (defaults to None).

  • r_out_attrs (list) – list of attribute names from the right table to be included in the output table (defaults to None).

  • l_out_prefix (string) – prefix to be used for the attribute names coming from the left table, in the output table (defaults to ‘l_’).

  • r_out_prefix (string) – prefix to be used for the attribute names coming from the right table, in the output table (defaults to ‘r_’).

  • n_jobs (int) – number of parallel jobs to use for the computation (defaults to 1). If -1 is given, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) CPUs are used, where n_cpus is the total number of CPUs in the machine. Thus for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) is less than 1, no parallel computing code is used (i.e., equivalent to the default).

  • show_progress (boolean) – flag to indicate whether task progress should be displayed to the user (defaults to True).

Returns

An output table containing tuple pairs that survive the filter (DataFrame).
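
To make the effect of l_out_prefix and r_out_prefix concrete, the sketch below derives prefixed column names from the arguments. The helper name is hypothetical, and both the column order and the assumption that key attributes are prefixed in the same way are illustrative; the actual output table may contain additional columns, such as a pair identifier.

```python
def prefixed_output_columns(l_key_attr, r_key_attr,
                            l_out_attrs=None, r_out_attrs=None,
                            l_out_prefix='l_', r_out_prefix='r_'):
    """Build prefixed output column names (illustrative sketch only)."""
    # Assumption: key attributes are prefixed like any other attribute.
    columns = [l_out_prefix + l_key_attr, r_out_prefix + r_key_attr]
    # None means no extra attributes are requested from that table.
    columns += [l_out_prefix + attr for attr in (l_out_attrs or [])]
    columns += [r_out_prefix + attr for attr in (r_out_attrs or [])]
    return columns


print(prefixed_output_columns('id', 'id', ['name'], ['name', 'zip']))
# ['l_id', 'r_id', 'l_name', 'r_name', 'r_zip']
```

The prefixes keep same-named attributes from the two tables distinguishable in the output.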