User Manual for py_stringsimjoin

This document shows users how to install and use the package. To contribute to or further develop the package, see the section “For Contributors and Developers” on the project website.


Installation

Requirements

  • Python 2.7 or Python 3.3+

Platforms

py_stringsimjoin has been tested on Linux (Ubuntu with Kernel Version 3.13.0-40-generic), OS X (Darwin with Kernel Version 13.4.0), and Windows 8.1.

Dependencies

  • pandas (to manage tables of tuples to be joined)
  • joblib (to write code that runs over multiple cores)
  • py_stringmatching (to tokenize and compute similarity scores between strings)
  • pyprind (to display progress bars)
  • six (to ensure our code runs on both Python 2.x and Python 3.x)

Note

The py_stringsimjoin installer will automatically install the above required packages.

There are two ways to install the py_stringsimjoin package: using pip or from a source distribution.

Installing Using pip

The easiest way to install the package is to use pip, which will retrieve py_stringsimjoin from PyPI and then install it:

pip install py_stringsimjoin

Installing from Source Distribution

Step 1: Download the source code of the py_stringsimjoin package from here. (Download code in tar.gz format for Linux and OS X, and code in zip format for Windows.)

Step 2: Untar or unzip the package and execute the following command from the package root:

python setup.py install

Note

The above command will try to install py_stringsimjoin into the default Python directory on your machine. If you do not have installation permission for that directory, you can install the package in your home directory as follows:

python setup.py install --user

For more information see the following StackOverflow link.

Overview

Given two tables A and B, this package provides commands to perform string similarity joins between two columns of these tables, such as A.name and B.name, or A.city and B.city. An example of such joins is to return all pairs (x,y) of tuples from the Cartesian product of Tables A and B such that

  • x is a tuple in Table A and y is a tuple in Table B.
  • Jaccard(3gram(x.name), 3gram(y.name)) > 0.7. That is, first tokenize the value of the attribute “name” of x into a set P of 3grams, and tokenize the value of the attribute “name” of y into a set Q of 3grams. Then compute the Jaccard score between P and Q. This score must exceed 0.7. This is often called the “join condition”.

Such joins are challenging because a naive implementation would consider all tuple pairs in the Cartesian product of Tables A and B, an often enormous number (for example, 10 billion pairs if each table has 100K tuples). The package provides efficient implementations of such joins, by using methods called “filtering” to quickly eliminate the pairs that obviously cannot satisfy the join condition.

To understand tokenizing and string similarity scores (such as Jaccard, edit distance, etc.), see the Web site of the package py_stringmatching (in particular, read the following book chapter on string matching). That package provides efficient implementations of a set of tokenizers and string similarity measures. It focuses on the case of tokenizing two strings and then applying a similarity measure to the outputs of the tokenizers to compute a similarity score between those two strings. This package builds on top of py_stringmatching, so it is important to understand the tokenizers and string similarity measures of the py_stringmatching package.

To read more about string similarity joins, see “String Similarity Joins: An Experimental Evaluation” and “String Similarity Search and Join : A Survey”.

We now explain the most important notions a user is likely to encounter while using this package. To use the package, the user typically loads into Python two tables A and B (as described above). These two tables will often be referred to in the commands of this package as ltable (for “left table”) and rtable (for “right table”), respectively. The notion “tuple pair” refers to a pair (x,y) where x is a tuple in ltable and y is a tuple in rtable.

To execute a string similarity join, the user calls a command such as jaccard_join, cosine_join, etc. The command’s arguments include

  • The two tables: ltable and rtable, the key attributes of these tables, and the target attributes (on which the join will be performed, such as A.name and B.name).
  • The join condition.
  • The desired attributes of the output table (which will contain the tuple pairs surviving the join), specified using arguments such as l_out_attrs, r_out_attrs, l_out_prefix, etc.
  • A flag (n_jobs) indicating how many cores the command should run on.
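For concreteness, here is a minimal sketch of such a call, using the running example from the Overview (Jaccard over 3-grams of the “name” attribute with threshold 0.7). The table contents are illustrative, the module path is the one listed in the Joins section below, and the tokenizer comes from py_stringmatching:

import pandas as pd
import py_stringmatching as sm
from py_stringsimjoin.join.jaccard_join import jaccard_join

# Two small illustrative tables; in practice these are loaded from files.
A = pd.DataFrame({'id': [1, 2], 'name': ['data integration', 'string matching']})
B = pd.DataFrame({'id': [1, 2], 'name': ['data integation', 'entity matching']})

# Tokenize the join attributes into sets of 3-grams.
qg3 = sm.QgramTokenizer(qval=3, return_set=True)

# Return all pairs (x, y) whose Jaccard score over 3-grams exceeds 0.7.
pairs = jaccard_join(A, B, 'id', 'id', 'name', 'name', qg3, 0.7, comp_op='>')
print(pairs)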

Internally, a command such as jaccard_join will first create a filter object, using an appropriate filtering technique (many such techniques exist; see the book chapter on string matching). Next, it uses this filter object to quickly drop many pairs that obviously do not satisfy the join condition. The set of remaining tuple pairs is referred to as a “candidate set”. Finally, it applies a matcher to the pairs in this set. The matcher simply checks and retains only those pairs that satisfy the join condition.

The implemented commands can be organized into the following groups: profilers, joins, filters, matchers, and utilities. We now briefly describe these.

Profilers

After loading the two tables A and B into Python, the user may want to run a profiler command, which will examine the two tables, detect possible problems for the subsequent string similarity joins, warn the user of these potential problems, and suggest possible solutions.

Currently only one profiler has been implemented, profile_table_for_join. This command examines the tables for unique and missing attribute values, and discusses possible problems stemming from these for the subsequent joins. Based on the report of this command, the user may want to take certain actions before actually applying join commands. Using the profiler is not required, of course.

Joins

After loading and optionally profiling and fixing the tables, most likely the user will just call a join command to do the join. We have implemented the following join commands:

  • cosine_join
  • dice_join
  • edit_distance_join
  • jaccard_join
  • overlap_coefficient_join
  • overlap_join

Filters & Matchers

Most users will just use join commands (described above). They do not need to know about filters and matchers. However, users who want to perform more complex string similarity joins (or joins that we currently do not yet support) may find filters and matchers useful and may want to use them. (See the How-to Guide for examples of performing complex string similarity joins such as TF/IDF joins.)

Filters are class objects. They form the following class hierarchy:

Filter
  • OverlapFilter
  • SizeFilter
  • PrefixFilter
  • PositionFilter
  • SuffixFilter

Currently we have implemented only one matcher, called “apply_matcher”.
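To illustrate the filter-then-match workflow, the following sketch builds a candidate set with an OverlapFilter and then applies apply_matcher with a Jaccard similarity function from py_stringmatching. The module paths are those listed in the Filters and Matchers sections below; the tables and thresholds are illustrative, and with the default output prefixes the candidate set’s key columns are assumed to be named ‘l_id’ and ‘r_id’:

import pandas as pd
import py_stringmatching as sm
from py_stringsimjoin.filter.overlap_filter import OverlapFilter
from py_stringsimjoin.matcher.apply_matcher import apply_matcher

A = pd.DataFrame({'id': [1, 2], 'name': ['data integration', 'string matching']})
B = pd.DataFrame({'id': [1, 2], 'name': ['principles of data integration', 'entity matching']})

ws = sm.WhitespaceTokenizer(return_set=True)

# Step 1: filtering -- keep only pairs that share at least one word token.
of = OverlapFilter(ws, overlap_size=1)
candset = of.filter_tables(A, B, 'id', 'id', 'name', 'name')

# Step 2: matching -- keep pairs whose Jaccard score is at least 0.5.
# 'l_id' and 'r_id' are the candidate set's key columns under the default prefixes.
jac = sm.Jaccard()
output = apply_matcher(candset, 'l_id', 'r_id', A, B, 'id', 'id',
                       'name', 'name', ws, jac.get_sim_score, 0.5)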

Utilities

Consider a table A with an attribute “year”. This attribute contains numeric, not string, values. So we cannot apply the commands in py_stringsimjoin on this attribute directly (as these commands input only string values). To apply the command, we first must convert the values of this attribute (for example, 1978, 2001, etc.) into strings. This conversion is somewhat tricky, because if we are not careful, missing values such as NaN will be converted into strings “NaN”. In this package we have provided several utility commands to do such conversion.
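For example, the dataframe_column_to_str command (documented in the Utilities section below) can be used as follows; the table is illustrative:

import pandas as pd
from py_stringsimjoin.utils.converter import dataframe_column_to_str

A = pd.DataFrame({'id': [1, 2, 3], 'year': [1978, 2001, None]})

# Convert the numeric 'year' column to string type, keeping missing values
# as NaN (rather than the string 'NaN'), so that join commands can be applied.
A['year_str'] = dataframe_column_to_str(A, 'year', inplace=False, return_col=True)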

Guides

To quickly learn to use the package, check out the How-To Guide. The package homepage provides a link to it; the guide gives a complete set of instructions for using this package (including performing more complex joins such as TF/IDF).

Profilers

py_stringsimjoin.profiler.profiler.profile_table_for_join(input_table, profile_attrs=None)[source]

Profiles the attributes in the table and reports potential issues for subsequent joins.

Parameters:
  • input_table (DataFrame) – input table to profile.
  • profile_attrs (list) – list of attribute names from the input table to be profiled (defaults to None). If not provided, all attributes in the input table will be profiled.
Returns:

A dataframe consisting of profile output. Specifically, the dataframe contains three columns,

  1. ‘Unique values’ column, which shows the number of unique values in each attribute,
  2. ‘Missing values’ column, which shows the number of missing values in each attribute, and
  3. ‘Comments’ column, which contains comments about each attribute.

The output dataframe is indexed by attribute name, so that the statistics for each attribute can be easily accessed using the attribute name.
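A minimal usage sketch (the table is illustrative; the statistics are accessed by attribute name as described above):

import pandas as pd
from py_stringsimjoin.profiler.profiler import profile_table_for_join

A = pd.DataFrame({'id': [1, 2, 3], 'name': ['apple', 'apple', None]})

# Profile all attributes of A; the report is a DataFrame indexed by attribute name.
report = profile_table_for_join(A)
print(report.loc['name', 'Missing values'])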

Joins

Cosine Join

py_stringsimjoin.join.cosine_join.cosine_join(ltable, rtable, l_key_attr, r_key_attr, l_join_attr, r_join_attr, tokenizer, threshold, comp_op='>=', allow_empty=True, allow_missing=False, l_out_attrs=None, r_out_attrs=None, l_out_prefix='l_', r_out_prefix='r_', out_sim_score=True, n_jobs=1, show_progress=True)[source]

Join two tables using a variant of cosine similarity known as Ochiai coefficient.

This is not the cosine measure that computes the cosine of the angle between two given vectors. Rather, it is a variant of cosine measure known as Ochiai coefficient (see the Wikipedia page Cosine Similarity). Specifically, for two sets X and Y, this measure computes:

\(cosine(X, Y) = \frac{|X \cap Y|}{\sqrt{|X| \cdot |Y|}}\)

In the case where one of X and Y is an empty set and the other is a non-empty set, we define their cosine score to be 0. In the case where both X and Y are empty sets, we define their cosine score to be 1.

Finds tuple pairs from left table and right table such that the cosine similarity between the join attributes satisfies the condition on input threshold. For example, if the comparison operator is ‘>=’, finds tuple pairs whose cosine similarity between the strings that are the values of the join attributes is greater than or equal to the input threshold, as specified in “threshold”.

Parameters:
  • ltable (DataFrame) – left input table.
  • rtable (DataFrame) – right input table.
  • l_key_attr (string) – key attribute in left table.
  • r_key_attr (string) – key attribute in right table.
  • l_join_attr (string) – join attribute in left table.
  • r_join_attr (string) – join attribute in right table.
  • tokenizer (Tokenizer) – tokenizer to be used to tokenize join attributes.
  • threshold (float) – cosine similarity threshold to be satisfied.
  • comp_op (string) – comparison operator. Supported values are ‘>=’, ‘>’ and ‘=’ (defaults to ‘>=’).
  • allow_empty (boolean) – flag to indicate whether tuple pairs with empty set of tokens in both the join attributes should be included in the output (defaults to True).
  • allow_missing (boolean) – flag to indicate whether tuple pairs with missing value in at least one of the join attributes should be included in the output (defaults to False). If this flag is set to True, a tuple in ltable with missing value in the join attribute will be matched with every tuple in rtable and vice versa.
  • l_out_attrs (list) – list of attribute names from the left table to be included in the output table (defaults to None).
  • r_out_attrs (list) – list of attribute names from the right table to be included in the output table (defaults to None).
  • l_out_prefix (string) – prefix to be used for the attribute names coming from the left table, in the output table (defaults to ‘l_’).
  • r_out_prefix (string) – prefix to be used for the attribute names coming from the right table, in the output table (defaults to ‘r_’).
  • out_sim_score (boolean) – flag to indicate whether similarity score should be included in the output table (defaults to True). Setting this flag to True will add a column named ‘_sim_score’ in the output table. This column will contain the similarity scores for the tuple pairs in the output.
  • n_jobs (int) – number of parallel jobs to use for the computation (defaults to 1). If -1 is given, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) becomes less than 1, then no parallel computing code will be used (i.e., equivalent to the default).
  • show_progress (boolean) – flag to indicate whether task progress should be displayed to the user (defaults to True).
Returns:

An output table containing tuple pairs that satisfy the join condition (DataFrame).
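A minimal usage sketch, with illustrative tables and a whitespace tokenizer from py_stringmatching; the name columns are carried over into the output via l_out_attrs and r_out_attrs:

import pandas as pd
import py_stringmatching as sm
from py_stringsimjoin.join.cosine_join import cosine_join

A = pd.DataFrame({'id': [1, 2], 'name': ['william j clinton', 'george w bush']})
B = pd.DataFrame({'id': [1, 2], 'name': ['bill clinton', 'george bush']})

ws = sm.WhitespaceTokenizer(return_set=True)

# Pairs whose cosine (Ochiai) score over word tokens is at least 0.5.
pairs = cosine_join(A, B, 'id', 'id', 'name', 'name', ws, 0.5,
                    l_out_attrs=['name'], r_out_attrs=['name'])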

Dice Join

py_stringsimjoin.join.dice_join.dice_join(ltable, rtable, l_key_attr, r_key_attr, l_join_attr, r_join_attr, tokenizer, threshold, comp_op='>=', allow_empty=True, allow_missing=False, l_out_attrs=None, r_out_attrs=None, l_out_prefix='l_', r_out_prefix='r_', out_sim_score=True, n_jobs=1, show_progress=True)[source]

Join two tables using Dice similarity measure.

For two sets X and Y, the Dice similarity score between them is given by:

\(dice(X, Y) = \frac{2 * |X \cap Y|}{|X| + |Y|}\)

In the case where both X and Y are empty sets, we define their Dice score to be 1.

Finds tuple pairs from left table and right table such that the Dice similarity between the join attributes satisfies the condition on input threshold. For example, if the comparison operator is ‘>=’, finds tuple pairs whose Dice similarity between the strings that are the values of the join attributes is greater than or equal to the input threshold, as specified in “threshold”.

Parameters:
  • ltable (DataFrame) – left input table.
  • rtable (DataFrame) – right input table.
  • l_key_attr (string) – key attribute in left table.
  • r_key_attr (string) – key attribute in right table.
  • l_join_attr (string) – join attribute in left table.
  • r_join_attr (string) – join attribute in right table.
  • tokenizer (Tokenizer) – tokenizer to be used to tokenize join attributes.
  • threshold (float) – Dice similarity threshold to be satisfied.
  • comp_op (string) – comparison operator. Supported values are ‘>=’, ‘>’ and ‘=’ (defaults to ‘>=’).
  • allow_empty (boolean) – flag to indicate whether tuple pairs with empty set of tokens in both the join attributes should be included in the output (defaults to True).
  • allow_missing (boolean) – flag to indicate whether tuple pairs with missing value in at least one of the join attributes should be included in the output (defaults to False). If this flag is set to True, a tuple in ltable with missing value in the join attribute will be matched with every tuple in rtable and vice versa.
  • l_out_attrs (list) – list of attribute names from the left table to be included in the output table (defaults to None).
  • r_out_attrs (list) – list of attribute names from the right table to be included in the output table (defaults to None).
  • l_out_prefix (string) – prefix to be used for the attribute names coming from the left table, in the output table (defaults to ‘l_’).
  • r_out_prefix (string) – prefix to be used for the attribute names coming from the right table, in the output table (defaults to ‘r_’).
  • out_sim_score (boolean) – flag to indicate whether similarity score should be included in the output table (defaults to True). Setting this flag to True will add a column named ‘_sim_score’ in the output table. This column will contain the similarity scores for the tuple pairs in the output.
  • n_jobs (int) – number of parallel jobs to use for the computation (defaults to 1). If -1 is given, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) becomes less than 1, then no parallel computing code will be used (i.e., equivalent to the default).
  • show_progress (boolean) – flag to indicate whether task progress should be displayed to the user (defaults to True).
Returns:

An output table containing tuple pairs that satisfy the join condition (DataFrame).

Edit Distance Join

py_stringsimjoin.join.edit_distance_join.edit_distance_join(ltable, rtable, l_key_attr, r_key_attr, l_join_attr, r_join_attr, threshold, comp_op='<=', allow_missing=False, l_out_attrs=None, r_out_attrs=None, l_out_prefix='l_', r_out_prefix='r_', out_sim_score=True, n_jobs=1, show_progress=True, tokenizer=QgramTokenizer(qval=2))[source]

Join two tables using edit distance measure.

Finds tuple pairs from left table and right table such that the edit distance between the join attributes satisfies the condition on input threshold. For example, if the comparison operator is ‘<=’, finds tuple pairs whose edit distance between the strings that are the values of the join attributes is less than or equal to the input threshold, as specified in “threshold”.

Note

Currently, this method only computes an approximate join result. This is because, to perform the join, we transform the edit distance measure between strings into an overlap measure between qgrams of the strings. Hence, two input strings need to share at least one qgram in order to appear in the join output. Short strings whose qgrams all differ cannot be matched by this method, even if their edit distance satisfies the threshold.

This method implements a simplified version of the algorithm proposed in Ed-Join: An Efficient Algorithm for Similarity Joins With Edit Distance Constraints (Chuan Xiao, Wei Wang and Xuemin Lin), VLDB 08.

Parameters:
  • ltable (DataFrame) – left input table.
  • rtable (DataFrame) – right input table.
  • l_key_attr (string) – key attribute in left table.
  • r_key_attr (string) – key attribute in right table.
  • l_join_attr (string) – join attribute in left table.
  • r_join_attr (string) – join attribute in right table.
  • threshold (float) – edit distance threshold to be satisfied.
  • comp_op (string) – comparison operator. Supported values are ‘<=’, ‘<’ and ‘=’ (defaults to ‘<=’).
  • allow_missing (boolean) – flag to indicate whether tuple pairs with missing value in at least one of the join attributes should be included in the output (defaults to False). If this flag is set to True, a tuple in ltable with missing value in the join attribute will be matched with every tuple in rtable and vice versa.
  • l_out_attrs (list) – list of attribute names from the left table to be included in the output table (defaults to None).
  • r_out_attrs (list) – list of attribute names from the right table to be included in the output table (defaults to None).
  • l_out_prefix (string) – prefix to be used for the attribute names coming from the left table, in the output table (defaults to ‘l_’).
  • r_out_prefix (string) – prefix to be used for the attribute names coming from the right table, in the output table (defaults to ‘r_’).
  • out_sim_score (boolean) – flag to indicate whether the edit distance score should be included in the output table (defaults to True). Setting this flag to True will add a column named ‘_sim_score’ in the output table. This column will contain the edit distance scores for the tuple pairs in the output.
  • n_jobs (int) – number of parallel jobs to use for the computation (defaults to 1). If -1 is given, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) becomes less than 1, then no parallel computing code will be used (i.e., equivalent to the default).
  • show_progress (boolean) – flag to indicate whether task progress should be displayed to the user (defaults to True).
  • tokenizer (Tokenizer) – tokenizer to be used to tokenize the join attributes during filtering, when edit distance measure is transformed into an overlap measure. This must be a q-gram tokenizer (defaults to 2-gram tokenizer).
Returns:

An output table containing tuple pairs that satisfy the join condition (DataFrame).
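A minimal usage sketch with illustrative tables; no tokenizer needs to be passed, since the default 2-gram tokenizer is used internally for filtering:

import pandas as pd
from py_stringsimjoin.join.edit_distance_join import edit_distance_join

A = pd.DataFrame({'id': [1, 2], 'name': ['jon smith', 'anne jones']})
B = pd.DataFrame({'id': [1, 2], 'name': ['john smith', 'ann johns']})

# Pairs whose edit distance between the name values is at most 2.
pairs = edit_distance_join(A, B, 'id', 'id', 'name', 'name', 2)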

Jaccard Join

py_stringsimjoin.join.jaccard_join.jaccard_join(ltable, rtable, l_key_attr, r_key_attr, l_join_attr, r_join_attr, tokenizer, threshold, comp_op='>=', allow_empty=True, allow_missing=False, l_out_attrs=None, r_out_attrs=None, l_out_prefix='l_', r_out_prefix='r_', out_sim_score=True, n_jobs=1, show_progress=True)[source]

Join two tables using Jaccard similarity measure.

For two sets X and Y, the Jaccard similarity score between them is given by:

\(jaccard(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}\)

In the case where both X and Y are empty sets, we define their Jaccard score to be 1.

Finds tuple pairs from left table and right table such that the Jaccard similarity between the join attributes satisfies the condition on input threshold. For example, if the comparison operator is ‘>=’, finds tuple pairs whose Jaccard similarity between the strings that are the values of the join attributes is greater than or equal to the input threshold, as specified in “threshold”.

Parameters:
  • ltable (DataFrame) – left input table.
  • rtable (DataFrame) – right input table.
  • l_key_attr (string) – key attribute in left table.
  • r_key_attr (string) – key attribute in right table.
  • l_join_attr (string) – join attribute in left table.
  • r_join_attr (string) – join attribute in right table.
  • tokenizer (Tokenizer) – tokenizer to be used to tokenize join attributes.
  • threshold (float) – Jaccard similarity threshold to be satisfied.
  • comp_op (string) – comparison operator. Supported values are ‘>=’, ‘>’ and ‘=’ (defaults to ‘>=’).
  • allow_empty (boolean) – flag to indicate whether tuple pairs with empty set of tokens in both the join attributes should be included in the output (defaults to True).
  • allow_missing (boolean) – flag to indicate whether tuple pairs with missing value in at least one of the join attributes should be included in the output (defaults to False). If this flag is set to True, a tuple in ltable with missing value in the join attribute will be matched with every tuple in rtable and vice versa.
  • l_out_attrs (list) – list of attribute names from the left table to be included in the output table (defaults to None).
  • r_out_attrs (list) – list of attribute names from the right table to be included in the output table (defaults to None).
  • l_out_prefix (string) – prefix to be used for the attribute names coming from the left table, in the output table (defaults to ‘l_’).
  • r_out_prefix (string) – prefix to be used for the attribute names coming from the right table, in the output table (defaults to ‘r_’).
  • out_sim_score (boolean) – flag to indicate whether similarity score should be included in the output table (defaults to True). Setting this flag to True will add a column named ‘_sim_score’ in the output table. This column will contain the similarity scores for the tuple pairs in the output.
  • n_jobs (int) – number of parallel jobs to use for the computation (defaults to 1). If -1 is given, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) becomes less than 1, then no parallel computing code will be used (i.e., equivalent to the default).
  • show_progress (boolean) – flag to indicate whether task progress should be displayed to the user (defaults to True).
Returns:

An output table containing tuple pairs that satisfy the join condition (DataFrame).

Overlap Join

py_stringsimjoin.join.overlap_join.overlap_join(ltable, rtable, l_key_attr, r_key_attr, l_join_attr, r_join_attr, tokenizer, threshold, comp_op='>=', allow_missing=False, l_out_attrs=None, r_out_attrs=None, l_out_prefix='l_', r_out_prefix='r_', out_sim_score=True, n_jobs=1, show_progress=True)[source]

Join two tables using overlap measure.

For two sets X and Y, the overlap between them is given by:

\(overlap(X, Y) = |X \cap Y|\)

Finds tuple pairs from left table and right table such that the overlap between the join attributes satisfies the condition on input threshold. For example, if the comparison operator is ‘>=’, finds tuple pairs whose overlap between the strings that are the values of the join attributes is greater than or equal to the input threshold, as specified in “threshold”.

Parameters:
  • ltable (DataFrame) – left input table.
  • rtable (DataFrame) – right input table.
  • l_key_attr (string) – key attribute in left table.
  • r_key_attr (string) – key attribute in right table.
  • l_join_attr (string) – join attribute in left table.
  • r_join_attr (string) – join attribute in right table.
  • tokenizer (Tokenizer) – tokenizer to be used to tokenize join attributes.
  • threshold (float) – overlap threshold to be satisfied.
  • comp_op (string) – comparison operator. Supported values are ‘>=’, ‘>’ and ‘=’ (defaults to ‘>=’).
  • allow_missing (boolean) – flag to indicate whether tuple pairs with missing value in at least one of the join attributes should be included in the output (defaults to False). If this flag is set to True, a tuple in ltable with missing value in the join attribute will be matched with every tuple in rtable and vice versa.
  • l_out_attrs (list) – list of attribute names from the left table to be included in the output table (defaults to None).
  • r_out_attrs (list) – list of attribute names from the right table to be included in the output table (defaults to None).
  • l_out_prefix (string) – prefix to be used for the attribute names coming from the left table, in the output table (defaults to ‘l_’).
  • r_out_prefix (string) – prefix to be used for the attribute names coming from the right table, in the output table (defaults to ‘r_’).
  • out_sim_score (boolean) – flag to indicate whether similarity score should be included in the output table (defaults to True). Setting this flag to True will add a column named ‘_sim_score’ in the output table. This column will contain the similarity scores for the tuple pairs in the output.
  • n_jobs (int) – number of parallel jobs to use for the computation (defaults to 1). If -1 is given, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) becomes less than 1, then no parallel computing code will be used (i.e., equivalent to the default).
  • show_progress (boolean) – flag to indicate whether task progress should be displayed to the user (defaults to True).
Returns:

An output table containing tuple pairs that satisfy the join condition (DataFrame).
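A minimal usage sketch with illustrative tables, requiring at least two common word tokens between the name values:

import pandas as pd
import py_stringmatching as sm
from py_stringsimjoin.join.overlap_join import overlap_join

A = pd.DataFrame({'id': [1], 'name': ['data integration principles']})
B = pd.DataFrame({'id': [1], 'name': ['principles of data integration']})

ws = sm.WhitespaceTokenizer(return_set=True)

# Pairs that share at least 2 tokens in the 'name' attribute.
pairs = overlap_join(A, B, 'id', 'id', 'name', 'name', ws, 2)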

Overlap Coefficient Join

py_stringsimjoin.join.overlap_coefficient_join.overlap_coefficient_join(ltable, rtable, l_key_attr, r_key_attr, l_join_attr, r_join_attr, tokenizer, threshold, comp_op='>=', allow_empty=True, allow_missing=False, l_out_attrs=None, r_out_attrs=None, l_out_prefix='l_', r_out_prefix='r_', out_sim_score=True, n_jobs=1, show_progress=True)[source]

Join two tables using overlap coefficient.

For two sets X and Y, the overlap coefficient between them is given by:

\(overlap\_coefficient(X, Y) = \frac{|X \cap Y|}{\min(|X|, |Y|)}\)

In the case where one of X and Y is an empty set and the other is a non-empty set, we define their overlap coefficient to be 0. In the case where both X and Y are empty sets, we define their overlap coefficient to be 1.

Finds tuple pairs from left table and right table such that the overlap coefficient between the join attributes satisfies the condition on input threshold. For example, if the comparison operator is ‘>=’, finds tuple pairs whose overlap coefficient between the strings that are the values of the join attributes is greater than or equal to the input threshold, as specified in “threshold”.

Parameters:
  • ltable (DataFrame) – left input table.
  • rtable (DataFrame) – right input table.
  • l_key_attr (string) – key attribute in left table.
  • r_key_attr (string) – key attribute in right table.
  • l_join_attr (string) – join attribute in left table.
  • r_join_attr (string) – join attribute in right table.
  • tokenizer (Tokenizer) – tokenizer to be used to tokenize join attributes.
  • threshold (float) – overlap coefficient threshold to be satisfied.
  • comp_op (string) – comparison operator. Supported values are ‘>=’, ‘>’ and ‘=’ (defaults to ‘>=’).
  • allow_empty (boolean) – flag to indicate whether tuple pairs with empty set of tokens in both the join attributes should be included in the output (defaults to True).
  • allow_missing (boolean) – flag to indicate whether tuple pairs with missing value in at least one of the join attributes should be included in the output (defaults to False). If this flag is set to True, a tuple in ltable with missing value in the join attribute will be matched with every tuple in rtable and vice versa.
  • l_out_attrs (list) – list of attribute names from the left table to be included in the output table (defaults to None).
  • r_out_attrs (list) – list of attribute names from the right table to be included in the output table (defaults to None).
  • l_out_prefix (string) – prefix to be used for the attribute names coming from the left table, in the output table (defaults to ‘l_’).
  • r_out_prefix (string) – prefix to be used for the attribute names coming from the right table, in the output table (defaults to ‘r_’).
  • out_sim_score (boolean) – flag to indicate whether similarity score should be included in the output table (defaults to True). Setting this flag to True will add a column named ‘_sim_score’ in the output table. This column will contain the similarity scores for the tuple pairs in the output.
  • n_jobs (int) – number of parallel jobs to use for the computation (defaults to 1). If -1 is given, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) becomes less than 1, then no parallel computing code will be used (i.e., equivalent to the default).
  • show_progress (boolean) – flag to indicate whether task progress should be displayed to the user (defaults to True).
Returns:

An output table containing tuple pairs that satisfy the join condition (DataFrame).

Filters

Overlap Filter

class py_stringsimjoin.filter.overlap_filter.OverlapFilter(tokenizer, overlap_size=1, comp_op='>=', allow_missing=False)[source]

Finds candidate matching pairs of strings using overlap filtering technique.

A string pair is output by the overlap filter only if the number of common tokens in the strings satisfies the condition on the overlap size threshold. For example, if the comparison operator is ‘>=’, a string pair is output if the number of common tokens is greater than or equal to the overlap size threshold, as specified by “overlap_size”.

Parameters:
  • tokenizer (Tokenizer) – tokenizer to be used.
  • overlap_size (int) – overlap threshold to be used by the filter.
  • comp_op (string) – comparison operator. Supported values are ‘>=’, ‘>’ and ‘=’ (defaults to ‘>=’).
  • allow_missing (boolean) – A flag to indicate whether pairs containing missing value should survive the filter (defaults to False).
Attributes:
  • tokenizer (Tokenizer) – An attribute to store the tokenizer.
  • overlap_size (int) – An attribute to store the overlap threshold value.
  • comp_op (string) – An attribute to store the comparison operator.
  • allow_missing (boolean) – An attribute to store the value of the flag allow_missing.

filter_candset(candset, candset_l_key_attr, candset_r_key_attr, ltable, rtable, l_key_attr, r_key_attr, l_filter_attr, r_filter_attr, n_jobs=1, show_progress=True)

Finds candidate matching pairs of strings from the input candidate set.

Parameters:
  • candset (DataFrame) – input candidate set.
  • candset_l_key_attr (string) – attribute in candidate set which is a key in left table.
  • candset_r_key_attr (string) – attribute in candidate set which is a key in right table.
  • ltable (DataFrame) – left input table.
  • rtable (DataFrame) – right input table.
  • l_key_attr (string) – key attribute in left table.
  • r_key_attr (string) – key attribute in right table.
  • l_filter_attr (string) – attribute in left table on which the filter should be applied.
  • r_filter_attr (string) – attribute in right table on which the filter should be applied.
  • n_jobs (int) – number of parallel jobs to use for the computation (defaults to 1). If -1 is given, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) becomes less than 1, then no parallel computing code will be used (i.e., equivalent to the default).
  • show_progress (boolean) – flag to indicate whether task progress should be displayed to the user (defaults to True).
Returns:

An output table containing tuple pairs from the candidate set that survive the filter (DataFrame).

filter_pair(lstring, rstring)[source]

Checks if the input strings get dropped by the overlap filter.

Parameters: lstring, rstring (string) – input strings.
Returns: A flag indicating whether the string pair is dropped (boolean).
filter_tables(ltable, rtable, l_key_attr, r_key_attr, l_filter_attr, r_filter_attr, l_out_attrs=None, r_out_attrs=None, l_out_prefix='l_', r_out_prefix='r_', out_sim_score=False, n_jobs=1, show_progress=True)[source]

Finds candidate matching pairs of strings from the input tables using overlap filtering technique.

Parameters:
  • ltable (DataFrame) – left input table.
  • rtable (DataFrame) – right input table.
  • l_key_attr (string) – key attribute in left table.
  • r_key_attr (string) – key attribute in right table.
  • l_filter_attr (string) – attribute in left table on which the filter should be applied.
  • r_filter_attr (string) – attribute in right table on which the filter should be applied.
  • l_out_attrs (list) – list of attribute names from the left table to be included in the output table (defaults to None).
  • r_out_attrs (list) – list of attribute names from the right table to be included in the output table (defaults to None).
  • l_out_prefix (string) – prefix to be used for the attribute names coming from the left table, in the output table (defaults to ‘l_’).
  • r_out_prefix (string) – prefix to be used for the attribute names coming from the right table, in the output table (defaults to ‘r_’).
  • out_sim_score (boolean) – flag to indicate whether the overlap score should be included in the output table (defaults to False). Setting this flag to True will add a column named ‘_sim_score’ in the output table. This column will contain the overlap scores for the tuple pairs in the output.
  • n_jobs (int) – number of parallel jobs to use for the computation (defaults to 1). If -1 is given, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) becomes less than 1, then no parallel computing code will be used (i.e., equivalent to the default).
  • show_progress (boolean) – flag to indicate whether task progress should be displayed to the user (defaults to True).
Returns:

An output table containing tuple pairs that survive the filter (DataFrame).
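As a quick sketch of filter_pair, the following checks whether individual string pairs would be dropped by an overlap filter requiring at least two common word tokens (True means the pair is dropped); the strings are illustrative:

import py_stringmatching as sm
from py_stringsimjoin.filter.overlap_filter import OverlapFilter

of = OverlapFilter(sm.WhitespaceTokenizer(return_set=True), overlap_size=2)

# Only one common token ('data'), so the pair is dropped.
print(of.filter_pair('data integration', 'data cleaning'))
# Two common tokens ('data', 'integration'), so the pair survives.
print(of.filter_pair('data integration', 'data integration principles'))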

Size Filter

class py_stringsimjoin.filter.size_filter.SizeFilter(tokenizer, sim_measure_type, threshold, allow_empty=True, allow_missing=False)[source]

Finds candidate matching pairs of strings using size filtering technique.

For similarity measures such as cosine, Dice, Jaccard and overlap, the filter finds candidate string pairs that may have similarity score greater than or equal to the input threshold, as specified in “threshold”. For distance measures such as edit distance, the filter finds candidate string pairs that may have distance score less than or equal to the threshold.

To know more about size filtering, refer to the string matching chapter of the “Principles of Data Integration” book.

Parameters:
  • tokenizer (Tokenizer) – tokenizer to be used.
  • sim_measure_type (string) – similarity measure type. Supported types are ‘JACCARD’, ‘COSINE’, ‘DICE’, ‘OVERLAP’ and ‘EDIT_DISTANCE’.
  • threshold (float) – threshold to be used by the filter.
  • allow_empty (boolean) – A flag to indicate whether pairs in which both strings are tokenized into an empty set of tokens should survive the filter (defaults to True). This flag is not valid for measures such as ‘OVERLAP’ and ‘EDIT_DISTANCE’.
  • allow_missing (boolean) – A flag to indicate whether pairs containing missing value should survive the filter (defaults to False).
Attributes:
  • tokenizer (Tokenizer) – An attribute to store the tokenizer.
  • sim_measure_type (string) – An attribute to store the similarity measure type.
  • threshold (float) – An attribute to store the threshold value.
  • allow_empty (boolean) – An attribute to store the value of the flag allow_empty.
  • allow_missing (boolean) – An attribute to store the value of the flag allow_missing.

filter_candset(candset, candset_l_key_attr, candset_r_key_attr, ltable, rtable, l_key_attr, r_key_attr, l_filter_attr, r_filter_attr, n_jobs=1, show_progress=True)

Finds candidate matching pairs of strings from the input candidate set.

Parameters:
  • candset (DataFrame) – input candidate set.
  • candset_l_key_attr (string) – attribute in candidate set which is a key in left table.
  • candset_r_key_attr (string) – attribute in candidate set which is a key in right table.
  • ltable (DataFrame) – left input table.
  • rtable (DataFrame) – right input table.
  • l_key_attr (string) – key attribute in left table.
  • r_key_attr (string) – key attribute in right table.
  • l_filter_attr (string) – attribute in left table on which the filter should be applied.
  • r_filter_attr (string) – attribute in right table on which the filter should be applied.
  • n_jobs (int) – number of parallel jobs to use for the computation (defaults to 1). If -1 is given, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) becomes less than 1, then no parallel computing code will be used (i.e., equivalent to the default).
  • show_progress (boolean) – flag to indicate whether task progress should be displayed to the user (defaults to True).
Returns:

An output table containing tuple pairs from the candidate set that survive the filter (DataFrame).

filter_pair(lstring, rstring)[source]

Checks if the input strings get dropped by the size filter.

Parameters: lstring, rstring (string) – input strings.
Returns: A flag indicating whether the string pair is dropped (boolean).
filter_tables(ltable, rtable, l_key_attr, r_key_attr, l_filter_attr, r_filter_attr, l_out_attrs=None, r_out_attrs=None, l_out_prefix='l_', r_out_prefix='r_', n_jobs=1, show_progress=True)[source]

Finds candidate matching pairs of strings from the input tables using size filtering technique.

Parameters:
  • ltable (DataFrame) – left input table.
  • rtable (DataFrame) – right input table.
  • l_key_attr (string) – key attribute in left table.
  • r_key_attr (string) – key attribute in right table.
  • l_filter_attr (string) – attribute in left table on which the filter should be applied.
  • r_filter_attr (string) – attribute in right table on which the filter should be applied.
  • l_out_attrs (list) – list of attribute names from the left table to be included in the output table (defaults to None).
  • r_out_attrs (list) – list of attribute names from the right table to be included in the output table (defaults to None).
  • l_out_prefix (string) – prefix to be used for the attribute names coming from the left table, in the output table (defaults to ‘l_’).
  • r_out_prefix (string) – prefix to be used for the attribute names coming from the right table, in the output table (defaults to ‘r_’).
  • n_jobs (int) – number of parallel jobs to use for the computation (defaults to 1). If -1 is given, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) becomes less than 1, then no parallel computing code will be used (i.e., equivalent to the default).
  • show_progress (boolean) – flag to indicate whether task progress should be displayed to the user (defaults to True).
Returns:

An output table containing tuple pairs that survive the filter (DataFrame).

Prefix Filter

class py_stringsimjoin.filter.prefix_filter.PrefixFilter(tokenizer, sim_measure_type, threshold, allow_empty=True, allow_missing=False)[source]

Finds candidate matching pairs of strings using prefix filtering technique.

For similarity measures such as cosine, Dice, Jaccard and overlap, the filter finds candidate string pairs that may have similarity score greater than or equal to the input threshold, as specified in “threshold”. For distance measures such as edit distance, the filter finds candidate string pairs that may have distance score less than or equal to the threshold.

To know more about prefix filtering, refer to the string matching chapter of the “Principles of Data Integration” book.

Parameters:
  • tokenizer (Tokenizer) – tokenizer to be used.
  • sim_measure_type (string) – similarity measure type. Supported types are ‘JACCARD’, ‘COSINE’, ‘DICE’, ‘OVERLAP’ and ‘EDIT_DISTANCE’.
  • threshold (float) – threshold to be used by the filter.
  • allow_empty (boolean) – A flag to indicate whether pairs in which both strings are tokenized into an empty set of tokens should survive the filter (defaults to True). This flag is not valid for measures such as ‘OVERLAP’ and ‘EDIT_DISTANCE’.
  • allow_missing (boolean) – A flag to indicate whether pairs containing missing value should survive the filter (defaults to False).
Attributes:
  • tokenizer (Tokenizer) – An attribute to store the tokenizer.
  • sim_measure_type (string) – An attribute to store the similarity measure type.
  • threshold (float) – An attribute to store the threshold value.
  • allow_empty (boolean) – An attribute to store the value of the flag allow_empty.
  • allow_missing (boolean) – An attribute to store the value of the flag allow_missing.

filter_candset(candset, candset_l_key_attr, candset_r_key_attr, ltable, rtable, l_key_attr, r_key_attr, l_filter_attr, r_filter_attr, n_jobs=1, show_progress=True)

Finds candidate matching pairs of strings from the input candidate set.

Parameters:
  • candset (DataFrame) – input candidate set.
  • candset_l_key_attr (string) – attribute in candidate set which is a key in left table.
  • candset_r_key_attr (string) – attribute in candidate set which is a key in right table.
  • ltable (DataFrame) – left input table.
  • rtable (DataFrame) – right input table.
  • l_key_attr (string) – key attribute in left table.
  • r_key_attr (string) – key attribute in right table.
  • l_filter_attr (string) – attribute in left table on which the filter should be applied.
  • r_filter_attr (string) – attribute in right table on which the filter should be applied.
  • n_jobs (int) – number of parallel jobs to use for the computation (defaults to 1). If -1 is given, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) becomes less than 1, then no parallel computing code will be used (i.e., equivalent to the default).
  • show_progress (boolean) – flag to indicate whether task progress should be displayed to the user (defaults to True).
Returns:

An output table containing tuple pairs from the candidate set that survive the filter (DataFrame).

filter_pair(lstring, rstring)[source]

Checks if the input strings get dropped by the prefix filter.

Parameters: lstring, rstring (string) – input strings.
Returns: A flag indicating whether the string pair is dropped (boolean).
filter_tables(ltable, rtable, l_key_attr, r_key_attr, l_filter_attr, r_filter_attr, l_out_attrs=None, r_out_attrs=None, l_out_prefix='l_', r_out_prefix='r_', n_jobs=1, show_progress=True)[source]

Finds candidate matching pairs of strings from the input tables using prefix filtering technique.

Parameters:
  • ltable (DataFrame) – left input table.
  • rtable (DataFrame) – right input table.
  • l_key_attr (string) – key attribute in left table.
  • r_key_attr (string) – key attribute in right table.
  • l_filter_attr (string) – attribute in left table on which the filter should be applied.
  • r_filter_attr (string) – attribute in right table on which the filter should be applied.
  • l_out_attrs (list) – list of attribute names from the left table to be included in the output table (defaults to None).
  • r_out_attrs (list) – list of attribute names from the right table to be included in the output table (defaults to None).
  • l_out_prefix (string) – prefix to be used for the attribute names coming from the left table, in the output table (defaults to ‘l_’).
  • r_out_prefix (string) – prefix to be used for the attribute names coming from the right table, in the output table (defaults to ‘r_’).
  • n_jobs (int) – number of parallel jobs to use for the computation (defaults to 1). If -1 is given, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) becomes less than 1, then no parallel computing code will be used (i.e., equivalent to the default).
  • show_progress (boolean) – flag to indicate whether task progress should be displayed to the user (defaults to True).
Returns:

An output table containing tuple pairs that survive the filter (DataFrame).

Position Filter

class py_stringsimjoin.filter.position_filter.PositionFilter(tokenizer, sim_measure_type, threshold, allow_empty=True, allow_missing=False)[source]

Finds candidate matching pairs of strings using position filtering technique.

For similarity measures such as cosine, Dice, Jaccard and overlap, the filter finds candidate string pairs that may have similarity score greater than or equal to the input threshold, as specified in “threshold”. For distance measures such as edit distance, the filter finds candidate string pairs that may have distance score less than or equal to the threshold.

To know more about position filtering, refer to the string matching chapter of the “Principles of Data Integration” book.

Parameters:
  • tokenizer (Tokenizer) – tokenizer to be used.
  • sim_measure_type (string) – similarity measure type. Supported types are ‘JACCARD’, ‘COSINE’, ‘DICE’, ‘OVERLAP’ and ‘EDIT_DISTANCE’.
  • threshold (float) – threshold to be used by the filter.
  • allow_empty (boolean) – A flag to indicate whether pairs in which both strings are tokenized into an empty set of tokens should survive the filter (defaults to True). This flag is not valid for measures such as ‘OVERLAP’ and ‘EDIT_DISTANCE’.
  • allow_missing (boolean) – A flag to indicate whether pairs containing missing value should survive the filter (defaults to False).
Attributes:
  • tokenizer (Tokenizer) – An attribute to store the tokenizer.
  • sim_measure_type (string) – An attribute to store the similarity measure type.
  • threshold (float) – An attribute to store the threshold value.
  • allow_empty (boolean) – An attribute to store the value of the flag allow_empty.
  • allow_missing (boolean) – An attribute to store the value of the flag allow_missing.

filter_candset(candset, candset_l_key_attr, candset_r_key_attr, ltable, rtable, l_key_attr, r_key_attr, l_filter_attr, r_filter_attr, n_jobs=1, show_progress=True)

Finds candidate matching pairs of strings from the input candidate set.

Parameters:
  • candset (DataFrame) – input candidate set.
  • candset_l_key_attr (string) – attribute in candidate set which is a key in left table.
  • candset_r_key_attr (string) – attribute in candidate set which is a key in right table.
  • ltable (DataFrame) – left input table.
  • rtable (DataFrame) – right input table.
  • l_key_attr (string) – key attribute in left table.
  • r_key_attr (string) – key attribute in right table.
  • l_filter_attr (string) – attribute in left table on which the filter should be applied.
  • r_filter_attr (string) – attribute in right table on which the filter should be applied.
  • n_jobs (int) – number of parallel jobs to use for the computation (defaults to 1). If -1 is given, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) becomes less than 1, then no parallel computing code will be used (i.e., equivalent to the default).
  • show_progress (boolean) – flag to indicate whether task progress should be displayed to the user (defaults to True).
Returns:

An output table containing tuple pairs from the candidate set that survive the filter (DataFrame).

filter_pair(lstring, rstring)[source]

Checks if the input strings get dropped by the position filter.

Parameters: lstring, rstring (string) – input strings.
Returns: A flag indicating whether the string pair is dropped (boolean).
filter_tables(ltable, rtable, l_key_attr, r_key_attr, l_filter_attr, r_filter_attr, l_out_attrs=None, r_out_attrs=None, l_out_prefix='l_', r_out_prefix='r_', n_jobs=1, show_progress=True)[source]

Finds candidate matching pairs of strings from the input tables using position filtering technique.

Parameters:
  • ltable (DataFrame) – left input table.
  • rtable (DataFrame) – right input table.
  • l_key_attr (string) – key attribute in left table.
  • r_key_attr (string) – key attribute in right table.
  • l_filter_attr (string) – attribute in left table on which the filter should be applied.
  • r_filter_attr (string) – attribute in right table on which the filter should be applied.
  • l_out_attrs (list) – list of attribute names from the left table to be included in the output table (defaults to None).
  • r_out_attrs (list) – list of attribute names from the right table to be included in the output table (defaults to None).
  • l_out_prefix (string) – prefix to be used for the attribute names coming from the left table, in the output table (defaults to ‘l_’).
  • r_out_prefix (string) – prefix to be used for the attribute names coming from the right table, in the output table (defaults to ‘r_’).
  • n_jobs (int) – number of parallel jobs to use for the computation (defaults to 1). If -1 is given, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) becomes less than 1, then no parallel computing code will be used (i.e., equivalent to the default).
  • show_progress (boolean) – flag to indicate whether task progress should be displayed to the user (defaults to True).
Returns:

An output table containing tuple pairs that survive the filter (DataFrame).

Suffix Filter

class py_stringsimjoin.filter.suffix_filter.SuffixFilter(tokenizer, sim_measure_type, threshold, allow_empty=True, allow_missing=False)[source]

Finds candidate matching pairs of strings using suffix filtering technique.

For similarity measures such as cosine, Dice, Jaccard and overlap, the filter finds candidate string pairs that may have similarity score greater than or equal to the input threshold, as specified in “threshold”. For distance measures such as edit distance, the filter finds candidate string pairs that may have distance score less than or equal to the threshold.

To know more about suffix filtering, refer to the paper Efficient Similarity Joins for Near Duplicate Detection (Chuan Xiao, Wei Wang, Xuemin Lin and Jeffrey Xu Yu), WWW 08.

Parameters:
  • tokenizer (Tokenizer) – tokenizer to be used.
  • sim_measure_type (string) – similarity measure type. Supported types are ‘JACCARD’, ‘COSINE’, ‘DICE’, ‘OVERLAP’ and ‘EDIT_DISTANCE’.
  • threshold (float) – threshold to be used by the filter.
  • allow_empty (boolean) – A flag to indicate whether pairs in which both strings are tokenized into an empty set of tokens should survive the filter (defaults to True). This flag is not valid for measures such as ‘OVERLAP’ and ‘EDIT_DISTANCE’.
  • allow_missing (boolean) – A flag to indicate whether pairs containing missing value should survive the filter (defaults to False).
Attributes:
  • tokenizer (Tokenizer) – An attribute to store the tokenizer.
  • sim_measure_type (string) – An attribute to store the similarity measure type.
  • threshold (float) – An attribute to store the threshold value.
  • allow_empty (boolean) – An attribute to store the value of the flag allow_empty.
  • allow_missing (boolean) – An attribute to store the value of the flag allow_missing.

filter_candset(candset, candset_l_key_attr, candset_r_key_attr, ltable, rtable, l_key_attr, r_key_attr, l_filter_attr, r_filter_attr, n_jobs=1, show_progress=True)

Finds candidate matching pairs of strings from the input candidate set.

Parameters:
  • candset (DataFrame) – input candidate set.
  • candset_l_key_attr (string) – attribute in candidate set which is a key in left table.
  • candset_r_key_attr (string) – attribute in candidate set which is a key in right table.
  • ltable (DataFrame) – left input table.
  • rtable (DataFrame) – right input table.
  • l_key_attr (string) – key attribute in left table.
  • r_key_attr (string) – key attribute in right table.
  • l_filter_attr (string) – attribute in left table on which the filter should be applied.
  • r_filter_attr (string) – attribute in right table on which the filter should be applied.
  • n_jobs (int) – number of parallel jobs to use for the computation (defaults to 1). If -1 is given, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) becomes less than 1, then no parallel computing code will be used (i.e., equivalent to the default).
  • show_progress (boolean) – flag to indicate whether task progress should be displayed to the user (defaults to True).
Returns:

An output table containing tuple pairs from the candidate set that survive the filter (DataFrame).

filter_pair(lstring, rstring)[source]

Checks if the input strings get dropped by the suffix filter.

Parameters: lstring, rstring (string) – input strings.
Returns: A flag indicating whether the string pair is dropped (boolean).
filter_tables(ltable, rtable, l_key_attr, r_key_attr, l_filter_attr, r_filter_attr, l_out_attrs=None, r_out_attrs=None, l_out_prefix='l_', r_out_prefix='r_', n_jobs=1, show_progress=True)[source]

Finds candidate matching pairs of strings from the input tables using suffix filtering technique.

Parameters:
  • ltable (DataFrame) – left input table.
  • rtable (DataFrame) – right input table.
  • l_key_attr (string) – key attribute in left table.
  • r_key_attr (string) – key attribute in right table.
  • l_filter_attr (string) – attribute in left table on which the filter should be applied.
  • r_filter_attr (string) – attribute in right table on which the filter should be applied.
  • l_out_attrs (list) – list of attribute names from the left table to be included in the output table (defaults to None).
  • r_out_attrs (list) – list of attribute names from the right table to be included in the output table (defaults to None).
  • l_out_prefix (string) – prefix to be used for the attribute names coming from the left table, in the output table (defaults to ‘l_’).
  • r_out_prefix (string) – prefix to be used for the attribute names coming from the right table, in the output table (defaults to ‘r_’).
  • n_jobs (int) – number of parallel jobs to use for the computation (defaults to 1). If -1 is given, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) becomes less than 1, then no parallel computing code will be used (i.e., equivalent to the default).
  • show_progress (boolean) – flag to indicate whether task progress should be displayed to the user (defaults to True).
Returns:

An output table containing tuple pairs that survive the filter (DataFrame).

Matchers

py_stringsimjoin.matcher.apply_matcher.apply_matcher(candset, candset_l_key_attr, candset_r_key_attr, ltable, rtable, l_key_attr, r_key_attr, l_match_attr, r_match_attr, tokenizer, sim_function, threshold, comp_op='>=', allow_missing=False, l_out_attrs=None, r_out_attrs=None, l_out_prefix='l_', r_out_prefix='r_', out_sim_score=True, n_jobs=1, show_progress=True)[source]

Find matching string pairs from the candidate set (typically produced by applying a filter to two tables) by applying a matcher of form (sim_function comp_op threshold).

Specifically, this method computes the input similarity function on string pairs in the candidate set and checks if the resulting score satisfies the input threshold (depending on the comparison operator).

Parameters:
  • candset (DataFrame) – input candidate set.
  • candset_l_key_attr (string) – attribute in candidate set which is a key in left table.
  • candset_r_key_attr (string) – attribute in candidate set which is a key in right table.
  • ltable (DataFrame) – left input table.
  • rtable (DataFrame) – right input table.
  • l_key_attr (string) – key attribute in left table.
  • r_key_attr (string) – key attribute in right table.
  • l_match_attr (string) – attribute in left table on which the matcher should be applied.
  • r_match_attr (string) – attribute in right table on which the matcher should be applied.
  • tokenizer (Tokenizer) – tokenizer to be used to tokenize the match attributes. If set to None, the matcher is applied directly on the match attributes.
  • sim_function (function) – matcher function to be applied.
  • threshold (float) – threshold to be satisfied.
  • comp_op (string) – comparison operator. Supported values are ‘>=’, ‘>’, ‘<=’, ‘<’, ‘=’ and ‘!=’ (defaults to ‘>=’).
  • allow_missing (boolean) – flag to indicate whether tuple pairs with missing value in at least one of the match attributes should be included in the output (defaults to False).
  • l_out_attrs (list) – list of attribute names from the left table to be included in the output table (defaults to None).
  • r_out_attrs (list) – list of attribute names from the right table to be included in the output table (defaults to None).
  • l_out_prefix (string) – prefix to be used for the attribute names coming from the left table, in the output table (defaults to ‘l_’).
  • r_out_prefix (string) – prefix to be used for the attribute names coming from the right table, in the output table (defaults to ‘r_’).
  • out_sim_score (boolean) – flag to indicate whether similarity score should be included in the output table (defaults to True). Setting this flag to True will add a column named ‘_sim_score’ in the output table. This column will contain the similarity scores for the tuple pairs in the output.
  • n_jobs (int) – number of parallel jobs to use for the computation (defaults to 1). If -1 is given, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) becomes less than 1, then no parallel computing code will be used (i.e., equivalent to the default).
  • show_progress (boolean) – flag to indicate whether task progress should be displayed to the user (defaults to True).
Returns:

An output table containing tuple pairs from the candidate set that survive the matcher (DataFrame).
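As a sketch, apply_matcher can also be called with tokenizer set to None, so that the similarity function is applied directly to the attribute strings. Here a normalized Levenshtein similarity from py_stringmatching is assumed, the tables are illustrative, and with the default output prefixes the candidate set’s key columns are assumed to be named ‘l_id’ and ‘r_id’:

import pandas as pd
import py_stringmatching as sm
from py_stringsimjoin.filter.overlap_filter import OverlapFilter
from py_stringsimjoin.matcher.apply_matcher import apply_matcher

A = pd.DataFrame({'id': [1, 2], 'name': ['jon smith', 'anne jones']})
B = pd.DataFrame({'id': [1, 2], 'name': ['john smith', 'ann johns']})

# Build a candidate set with an overlap filter over 2-grams.
qg2 = sm.QgramTokenizer(qval=2, return_set=True)
candset = OverlapFilter(qg2, overlap_size=1).filter_tables(A, B, 'id', 'id',
                                                           'name', 'name')

# Apply the matcher directly on the strings (tokenizer=None), keeping
# pairs whose normalized Levenshtein similarity is at least 0.8.
lev = sm.Levenshtein()
output = apply_matcher(candset, 'l_id', 'r_id', A, B, 'id', 'id',
                       'name', 'name', None, lev.get_sim_score, 0.8)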

Utilities

py_stringsimjoin.utils.converter.dataframe_column_to_str(dataframe, col_name, inplace=False, return_col=False)[source]

Convert a column in the dataframe into string type while preserving NaN values.

This method is useful when performing a join over numeric columns. Currently, the join methods expect the join columns to be of string type, so numeric columns need to be converted to string type before performing the join.

Parameters:
  • dataframe (DataFrame) – Input pandas dataframe.
  • col_name (string) – Name of the column in the dataframe to be converted.
  • inplace (boolean) – A flag indicating whether the input dataframe should be modified inplace or in a copy of it.
  • return_col (boolean) – A flag indicating whether a copy of the converted column should be returned. When this flag is set to True, the method will not modify the original dataframe and will return a new column of string type. Only one of inplace and return_col can be set to True.
Returns:

A Boolean value when inplace is set to True.

A new dataframe when inplace is set to False and return_col is set to False.

A series when inplace is set to False and return_col is set to True.

py_stringsimjoin.utils.converter.series_to_str(series, inplace=False)[source]

Convert a series into string type while preserving NaN values.

Parameters:
  • series (Series) – Input pandas series.
  • inplace (boolean) – A flag indicating whether the input series should be modified inplace or in a copy of it. This flag is ignored when the input series consists of only NaN values or the series is empty (with int or float type). In these two cases, we always return a copy irrespective of the inplace flag.
Returns:

A Boolean value when inplace is set to True.

A series when inplace is set to False.
