Matchers

py_stringsimjoin.matcher.apply_matcher.apply_matcher(candset, candset_l_key_attr, candset_r_key_attr, ltable, rtable, l_key_attr, r_key_attr, l_match_attr, r_match_attr, tokenizer, sim_function, threshold, comp_op='>=', allow_missing=False, l_out_attrs=None, r_out_attrs=None, l_out_prefix='l_', r_out_prefix='r_', out_sim_score=True, n_jobs=1, show_progress=True)

Find matching string pairs in a candidate set (typically produced by applying a filter to two tables) by applying a matcher of the form (sim_function comp_op threshold).

Specifically, this function computes the input similarity function on each string pair in the candidate set and checks whether the resulting score satisfies the input threshold under the given comparison operator.
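The predicate can be pictured with a small standalone sketch. It does not call py_stringsimjoin: `COMP_OPS`, `jaccard`, and `satisfies` are hypothetical names chosen for illustration, whitespace splitting stands in for a tokenizer, and mapping '=' to equality is an assumption about how the comparison operator is interpreted.

```python
import operator

# Hypothetical lookup table from comp_op strings to Python comparisons.
COMP_OPS = {
    '>=': operator.ge,
    '>': operator.gt,
    '<=': operator.le,
    '<': operator.lt,
    '=': operator.eq,   # assumption: '=' means equality
    '!=': operator.ne,
}


def jaccard(left_tokens, right_tokens):
    """Jaccard similarity of two token collections (illustrative only)."""
    left, right = set(left_tokens), set(right_tokens)
    if not left and not right:
        return 1.0
    return len(left & right) / len(left | right)


def satisfies(l_string, r_string, sim_function, threshold, comp_op='>='):
    """Return (score, passed) for a single candidate pair."""
    # Whitespace splitting stands in for the tokenizer argument.
    score = sim_function(l_string.split(), r_string.split())
    return score, COMP_OPS[comp_op](score, threshold)


# Two of four distinct tokens are shared, so the score is 0.5.
score, passed = satisfies('data base systems', 'data base design', jaccard, 0.5)
```

With the default operator '>=' the pair above passes a threshold of 0.5; with '>' it would be dropped, since the score equals the threshold.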

Parameters:
  • candset (DataFrame) – input candidate set.
  • candset_l_key_attr (string) – attribute in candidate set which is a key in left table.
  • candset_r_key_attr (string) – attribute in candidate set which is a key in right table.
  • ltable (DataFrame) – left input table.
  • rtable (DataFrame) – right input table.
  • l_key_attr (string) – key attribute in left table.
  • r_key_attr (string) – key attribute in right table.
  • l_match_attr (string) – attribute in left table on which the matcher should be applied.
  • r_match_attr (string) – attribute in right table on which the matcher should be applied.
  • tokenizer (Tokenizer) – tokenizer to be used to tokenize the match attributes. If set to None, the matcher is applied directly on the match attributes.
  • sim_function (function) – matcher function to be applied.
  • threshold (float) – threshold to be satisfied.
  • comp_op (string) – comparison operator. Supported values are ‘>=’, ‘>’, ‘<=’, ‘<’, ‘=’ and ‘!=’ (defaults to ‘>=’).
  • allow_missing (boolean) – flag to indicate whether tuple pairs with a missing value in at least one of the match attributes should be included in the output (defaults to False).
  • l_out_attrs (list) – list of attribute names from the left table to be included in the output table (defaults to None).
  • r_out_attrs (list) – list of attribute names from the right table to be included in the output table (defaults to None).
  • l_out_prefix (string) – prefix to be used for the attribute names coming from the left table, in the output table (defaults to ‘l_’).
  • r_out_prefix (string) – prefix to be used for the attribute names coming from the right table, in the output table (defaults to ‘r_’).
  • out_sim_score (boolean) – flag to indicate whether similarity score should be included in the output table (defaults to True). Setting this flag to True will add a column named ‘_sim_score’ in the output table. This column will contain the similarity scores for the tuple pairs in the output.
  • n_jobs (int) – number of parallel jobs to use for the computation (defaults to 1). If -1 is given, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) becomes less than 1, then no parallel computing code will be used (i.e., equivalent to the default).
  • show_progress (boolean) – flag to indicate whether task progress should be displayed to the user (defaults to True).
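
The n_jobs rule described above can be sketched as a small helper. This is not the library's code: `effective_workers` is a hypothetical name, the `n_cpus` argument exists only to make the example deterministic, and treating n_jobs = 0 as 1 is an assumption, since the description does not cover that value.

```python
import os


def effective_workers(n_jobs, n_cpus=None):
    """Number of worker processes implied by n_jobs (illustrative only)."""
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    if n_jobs < 0:
        # -1 -> all CPUs, -2 -> all but one, and so on.
        n_jobs = n_cpus + 1 + n_jobs
    # Anything below 1 falls back to a single, non-parallel job.
    return max(n_jobs, 1)


# On a machine with 8 CPUs:
all_cpus = effective_workers(-1, n_cpus=8)      # 8
all_but_one = effective_workers(-2, n_cpus=8)   # 7
fallback = effective_workers(-20, n_cpus=8)     # 1
```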
Returns:

An output table containing tuple pairs from the candidate set that survive the matcher (DataFrame).