User Manual for py_stringmatching

This document shows users how to install and use the package. To contribute to or further develop the package, see the project website, section “For Contributors and Developers”.

Contents

What is New?

Compared to Version 0.4.0, the following changes are new:

  • The package is now built with Cython version 0.27.3 or higher.
  • Added support for Python 3.7 and dropped testing support for Python 3.3.

Installation

Requirements

  • Python 2.7 or Python 3.4+
  • C or C++ compiler (parts of the package are written in Cython for efficiency, and you need a C or C++ compiler to compile these parts)

Platforms

py_stringmatching has been tested on Linux (Ubuntu with Kernel Version 3.13.0-40-generic), OS X (Darwin with Kernel Version 13.4.0), and Windows 8.1.

Dependencies

  • numpy 1.7.0 or higher
  • six

Note

The py_stringmatching installer will automatically install the above required packages.

C Compiler Required

Before installing this package, you need to make sure that you have a C compiler installed. This is necessary because this package contains Cython files. Go here for more information about how to check whether you already have a C compiler and how to install a C compiler.

After you have confirmed that you have a C compiler installed, you are ready to install the package. There are two ways to install the py_stringmatching package: using pip or from a source distribution.

Installing Using pip

The easiest way to install the package is to use pip, which will retrieve py_stringmatching from PyPI and then install it:

pip install py_stringmatching

Installing from Source Distribution

Step 1: Download the py_stringmatching package from here.

Step 2: Unzip the package and execute the following command from the package root:

python setup.py install

Note

The above command will try to install py_stringmatching into the default Python directory on your machine. If you do not have installation permission for that directory, you can install the package in your home directory as follows:

python setup.py install --user

For more information see the StackOverflow link.

Tutorial

Once the package has been installed, you can import the package as follows:

In [1]: import py_stringmatching as sm

Computing a similarity score between two given strings x and y then typically consists of four steps: (1) selecting a similarity measure type, (2) selecting a tokenizer type, (3) creating a tokenizer object (of the selected type) and using it to tokenize the two given strings x and y, and (4) creating a similarity measure object (of the selected type) and applying it to the output of the tokenizer to compute a similarity score. We now elaborate on these steps.

1. Selecting a Similarity Measure

First, you must select a similarity measure. The package py_stringmatching currently provides a set of different measures (with plans to add more). Examples of such measures are Jaccard, Levenshtein, TF/IDF, etc. To understand more about these measures, a good place to start is the string matching chapter of the book “Principles of Data Integration”. (This chapter is available on the package’s homepage.)

A major group of similarity measures treats input strings as sequences of characters (e.g., Levenshtein, Smith Waterman). Another group treats input strings as sets of tokens (e.g., Jaccard). Yet another group treats input strings as bags of tokens (e.g., TF/IDF). A bag of tokens is a collection of tokens such that a token can appear multiple times in the collection (as opposed to a set of tokens, where each token can appear only once).
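
To make the set-versus-bag distinction concrete, here is a plain-Python illustration (independent of the package), using the same “up up and away” string that appears later in this tutorial:

```python
# The same token sequence, viewed as a bag (duplicates kept) and as a set.
tokens = 'up up and away'.split()

bag = tokens             # a bag of tokens: ['up', 'up', 'and', 'away']
token_set = set(tokens)  # a set of tokens: {'up', 'and', 'away'}

print(bag)                # ['up', 'up', 'and', 'away']
print(sorted(token_set))  # ['and', 'away', 'up']
```

Set-based measures such as Jaccard look only at membership, whereas bag-based measures such as TF/IDF also use the token counts that the bag preserves.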

The currently implemented similarity measures include:
  • sequence-based measures: affine gap, bag distance, editex, Hamming distance, Jaro, Jaro Winkler, Levenshtein, Needleman Wunsch, partial ratio, partial token sort, ratio, Smith Waterman, token sort.
  • set-based measures: cosine, Dice, Jaccard, overlap coefficient, Tversky Index.
  • bag-based measures: TF/IDF.
  • phonetic-based measures: soundex.

(There are also hybrid similarity measures: Monge Elkan, Soft TF/IDF, and Generalized Jaccard. They are so called because each of these measures uses multiple similarity measures. See their descriptions in this user manual to understand what types of input they expect.)

At this point, you should know if the selected similarity measure treats input strings as sequences, bags, or sets, so that later you can set the parameters of the tokenizing function properly (see Steps 2-3 below).

2. Selecting a Tokenizer Type

If the above selected similarity measure treats input strings as sequences of characters, then you do not need to tokenize the input strings x and y, and hence do not have to select a tokenizer type.

Otherwise, you need to select a tokenizer type. The package py_stringmatching currently provides a set of different tokenizer types: alphabetical tokenizer, alphanumeric tokenizer, delimiter-based tokenizer, qgram tokenizer, and whitespace tokenizer (more tokenizer types can easily be added).

A tokenizer will convert an input string into a set or a bag of tokens, as discussed in Step 3.

3. Creating a Tokenizer Object and Using It to Tokenize the Input Strings

If you have selected a tokenizer type in Step 2, then in Step 3 you create a tokenizer object of that type. If the intended similarity measure (selected in Step 1) treats the input strings as sets of tokens, then when creating the tokenizer object, you must set the flag return_set to True. Otherwise this flag defaults to False, and the created tokenizer object will tokenize a string into a bag of tokens.

The following examples create tokenizer objects where the flag return_set is not mentioned, thus defaulting to False. So these tokenizer objects will tokenize a string into a bag of tokens.

# create an alphabetical tokenizer that returns a bag of tokens
In [2]: alphabet_tok = sm.AlphabeticTokenizer()

# create an alphanumeric tokenizer
In [3]: alnum_tok = sm.AlphanumericTokenizer()

# create a delimiter tokenizer using comma as a delimiter
In [4]: delim_tok = sm.DelimiterTokenizer(delim_set=[','])

# create a qgram tokenizer using q=3
In [5]: qg3_tok = sm.QgramTokenizer(qval=3)

# create a whitespace tokenizer
In [6]: ws_tok = sm.WhitespaceTokenizer()

Given the string “up up and away”, the tokenizer alphabet_tok (defined above) will convert it into a bag of tokens [‘up’, ‘up’, ‘and’, ‘away’], where the token ‘up’ appears twice.

The following examples create tokenizer objects where the flag return_set is set to True. Thus these tokenizers will tokenize a string into a set of tokens.

# create an alphabetical tokenizer that returns a set of tokens
In [7]: alphabet_tok_set = sm.AlphabeticTokenizer(return_set=True)

# create a whitespace tokenizer that returns a set of tokens
In [8]: ws_tok_set = sm.WhitespaceTokenizer(return_set=True)

# create a qgram tokenizer with q=3 that returns a set of tokens
In [9]: qg3_tok_set = sm.QgramTokenizer(qval=3, return_set=True)

So given the same string “up up and away”, the tokenizer alphabet_tok_set (defined above) will convert it into a set of tokens [‘up’, ‘and’, ‘away’].

All tokenizers have a tokenize method which tokenizes a given input string into a set or bag of tokens (depending on whether the flag return_set is True or False), as these examples illustrate:

In [10]: test_string = ' .hello, world!! data, science, is    amazing!!. hello.'

# tokenize into a bag of alphabetical tokens
In [11]: alphabet_tok.tokenize(test_string)
Out[11]: ['hello', 'world', 'data', 'science', 'is', 'amazing', 'hello']

# tokenize into alphabetical tokens (with return_set set to True)
In [12]: alphabet_tok_set.tokenize(test_string)
Out[12]: ['hello', 'world', 'data', 'science', 'is', 'amazing']

# tokenize using comma as the delimiter
In [13]: delim_tok.tokenize(test_string)
Out[13]: [' .hello', ' world!! data', ' science', ' is    amazing!!. hello.']

# tokenize using whitespace as the delimiter
In [14]: ws_tok.tokenize(test_string)
Out[14]: ['.hello,', 'world!!', 'data,', 'science,', 'is', 'amazing!!.', 'hello.']

Thus, once you have created the tokenizer, you can use the tokenize method to tokenize the two input strings x and y (see more in Step 4 below).

Note

The tokenize method returns a Python list which represents a set of tokens or a bag of tokens, depending on whether the flag return_set is True or False.

4. Creating a Similarity Measure Object and Using It to Compute a Similarity Score

Recall that in Step 1 you have selected a similarity measure (e.g., Jaccard, Levenshtein). In this step you start by creating a similarity measure object of the selected type, as illustrated by these examples:

# create a Jaccard similarity measure object
In [15]: jac = sm.Jaccard()

# create a Levenshtein similarity measure object
In [16]: lev = sm.Levenshtein()

There are two main types of similarity measures.

  1. Those that when given two input strings will compute a true similarity score, which is a number in the range [0,1] such that the higher this number, the more similar the two input strings are.
  2. Those that when given two input strings will compute a distance score, which is a number such that the higher this number, the more dissimilar the two input strings are (this number is often not in the range [0,1]). Clearly, Type-2 measures (also known as distance measures) are the reverse of Type-1 measures.

For example, the Jaccard similarity measure computes a true similarity score in [0,1] for two input strings. The Levenshtein similarity measure, on the other hand, is really a distance measure: it computes the edit distance between the two input strings (see, for example, Wikipedia or the string matching chapter in the book “Principles of Data Integration”). It is easy to convert a distance score into a true similarity score (again, see examples in the above book chapter).

Given the above, each similarity measure object in py_stringmatching provides two methods: get_raw_score and get_sim_score. The first method computes the raw score as defined by that type of similarity measure, be it a similarity score or a distance score. For example, for Jaccard this method will return a true similarity score, whereas for Levenshtein it will return an edit distance score.

The method get_sim_score normalizes the raw score to obtain a true similarity score (a number in [0,1], such that the higher this number the more similar the two strings are). For Jaccard, get_sim_score will simply call get_raw_score. For Levenshtein, however, get_sim_score will normalize the edit distance to return a true similarity score in [0,1].
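
To make this normalization concrete, here is a plain-Python sketch (independent of the package) that computes the edit distance with the classic dynamic program and then converts it into a similarity score by dividing by the length of the longer string. This is one standard conversion, and it reproduces the Levenshtein numbers shown later in this tutorial:

```python
def levenshtein(s, t):
    """Classic dynamic-programming edit distance between two strings."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def lev_sim(s, t):
    """Normalize the edit distance into a similarity score in [0,1]."""
    if len(s) == 0 and len(t) == 0:
        return 1.0
    return 1.0 - levenshtein(s, t) / max(len(s), len(t))

x = 'string matching package'
y = 'string matching library'
print(levenshtein(x, y))          # 6
print(round(lev_sim(x, y), 4))    # 0.7391
```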

Here are some examples of using the get_raw_score method:

# input strings
In [17]: x = 'string matching package'

In [18]: y = 'string matching library'

# compute Jaccard score over sets of tokens of x and y, tokenized using whitespace
In [19]: jac.get_raw_score(ws_tok_set.tokenize(x), ws_tok_set.tokenize(y))
Out[19]: 0.5

# compute Jaccard score over sets of tokens of x and y, tokenized into qgrams (with q=3)
In [20]: jac.get_raw_score(qg3_tok_set.tokenize(x), qg3_tok_set.tokenize(y))
Out[20]: 0.4375

# compute Levenshtein distance between x and y
In [21]: lev.get_raw_score(x, y)
Out[21]: 6

Note that in the above examples, the Jaccard measure treats the input strings as sets of tokens. And indeed, the two tokenizers ws_tok_set and qg3_tok_set as defined earlier would tokenize a string into a set of tokens. The Levenshtein measure, on the other hand, treats the input strings as sequences of characters. Hence when using it we do not have to tokenize the two strings x and y.

Here are some examples of using the get_sim_score method:

# get normalized Levenshtein similarity score between x and y
In [22]: lev.get_sim_score(x, y)
Out[22]: 0.7391304347826086

# get normalized Jaccard similarity score (this is the same as the raw score)
In [23]: jac.get_sim_score(ws_tok_set.tokenize(x), ws_tok_set.tokenize(y))
Out[23]: 0.5

So depending on what you want, you can call get_raw_score or get_sim_score. Note, however, that certain measures such as affine gap, Monge-Elkan, Needleman-Wunsch, Smith-Waterman and Soft TF/IDF do not have a get_sim_score method, because there is no straightforward way to normalize the raw scores of these measures into similarity scores in [0,1] (see the Developer Manual for further explanation).

Handling a Large Number of String Pairs

Steps 1-4 above discuss the case where you want to compute the similarity score of only a single string pair.

There are however cases where you need to compute the similarity scores of many string pairs. For example, given a table A of 10K strings and a table B of 10K strings, you may need to compute the string similarity scores for all 100M string pairs in the Cartesian product of the two tables.

In such cases, you should avoid tokenizing the same string repeatedly, such as calling jac.get_sim_score(ws_tok_set.tokenize(x), ws_tok_set.tokenize(y)) for all pairs (x,y) in the Cartesian product. If you do this, a string x in table A will be tokenized 10K times, since it will appear in 10K pairs. This is clearly unnecessary and very expensive.

Instead, you should tokenize all strings in tables A and B only once, store the output of tokenizing in some Python structure, then call the similarity measure on these structures to compute similarity scores. This will avoid repeated tokenizing of the same strings.
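
A minimal sketch of this pattern, using a plain-Python Jaccard over pre-tokenized sets (the same structure applies with the package’s tokenizers and measures; the two tiny tables here are made up for illustration):

```python
def jaccard(set1, set2):
    """Jaccard score of two sets: |intersection| / |union|."""
    if not set1 and not set2:
        return 1.0
    return len(set1 & set2) / len(set1 | set2)

table_a = ['string matching package', 'data integration']
table_b = ['string matching library', 'data science']

# Tokenize every string exactly once, up front, and cache the token sets.
a_tokens = [set(s.split()) for s in table_a]
b_tokens = [set(s.split()) for s in table_b]

# Score all pairs in the Cartesian product using only the cached sets.
scores = {(i, j): jaccard(a_tokens[i], b_tokens[j])
          for i in range(len(table_a))
          for j in range(len(table_b))}

print(scores[(0, 0)])  # 0.5
```

With this structure each string is tokenized once, instead of once per pair.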

Handling Missing Values

By “missing values” we mean cases where the values of one or more strings are missing (e.g., represented as None or NaN in Python). For example, given a row “David,,36” in a CSV file, the value of the second cell of this row is missing. So when this file is read into a data frame, the corresponding cell in the data frame will have the value NaN. Note that missing values are different from empty string values, which are represented as “”.

Handling missing values is tricky and application dependent (see the Developer Manual for a detailed discussion). For these reasons, the tokenizers and similarity measures in the package py_stringmatching do not handle missing values. If one of their input arguments is missing, they will stop, raising an error. Put differently, they expect non-missing input arguments.
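
Consequently, the caller is responsible for filtering or special-casing missing values before invoking a tokenizer or similarity measure. One common convention is sketched below in plain Python; the names and the score assigned to a missing pair are purely illustrative and application dependent:

```python
import math

def is_missing(value):
    """Treat None and float NaN (e.g., from a pandas data frame) as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def safe_sim(x, y, sim_func, missing_score=0.0):
    """Skip the similarity measure entirely when either input is missing."""
    if is_missing(x) or is_missing(y):
        return missing_score
    return sim_func(x, y)

exact = lambda a, b: 1.0 if a == b else 0.0
print(safe_sim('data', None, exact))    # 0.0 (missing input, measure never called)
print(safe_sim('data', 'data', exact))  # 1.0
```

Note that the empty string “” is not missing under this convention, matching the distinction made above.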

Adding Prefix and Suffix to the Input String for Qgram Tokenizers

Consider computing a similarity score between two strings “mo” and “moo” using 3gram tokenizing followed by Jaccard scoring. Tokenizing “mo” returns an empty set, because “mo” contains no 3gram. Tokenizing “moo” returns the set {“moo”}. As a result, the Jaccard score between “mo” and “moo” is 0. This is somewhat counterintuitive, because the two strings are similar.

To address such cases, in practice it is common to add a prefix of (q-1) characters (using #) and a suffix of (q-1) characters (using $) to the input string, before generating qgram tokens. For example, “moo” will be padded to be “##moo$$”, before tokenizing. The flag “padding” in qgram tokenizers can be set for this purpose (the default is True, in which case the string will be padded).
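
The effect of padding can be seen with a plain-Python qgram generator (independent of the package, but mirroring its default prefix_pad='#' and suffix_pad='$'):

```python
def qgrams(s, q=3, padding=True, prefix_pad='#', suffix_pad='$'):
    """Generate the qgrams of s, optionally padded with (q-1) pad characters."""
    if padding:
        s = prefix_pad * (q - 1) + s + suffix_pad * (q - 1)
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def jaccard(set1, set2):
    if not set1 and not set2:
        return 1.0
    if not set1 or not set2:
        return 0.0
    return len(set1 & set2) / len(set1 | set2)

# Without padding, 'mo' yields no 3grams at all, so the score is 0.
print(jaccard(set(qgrams('mo', padding=False)),
              set(qgrams('moo', padding=False))))  # 0.0

# With padding, 'mo' and 'moo' share the qgrams '##m', '#mo', and 'o$$'.
print(jaccard(set(qgrams('mo')), set(qgrams('moo'))))  # 0.5
```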

Class Hierarchy for Tokenizers and Similarity Measures

The current version implements the following class hierarchy for tokenizers:

Tokenizer
  • DefinitionTokenizer
    • AlphabeticTokenizer
    • AlphanumericTokenizer
    • QgramTokenizer
  • DelimiterTokenizer
    • WhitespaceTokenizer

The version implements the following class hierarchy for similarity measures:

SimilarityMeasure
  • SequenceSimilarityMeasure
    • Affine
    • BagDistance
    • Editex
    • HammingDistance
    • Jaro
    • JaroWinkler
    • Levenshtein
    • NeedlemanWunsch
    • PartialRatio
    • PartialTokenSort
    • Ratio
    • SmithWaterman
    • TokenSort
  • TokenSimilarityMeasure
    • Cosine
    • Dice
    • Jaccard
    • OverlapCoefficient
    • TfIdf
    • TverskyIndex
  • HybridSimilarityMeasure
    • GeneralizedJaccard
    • MongeElkan
    • SoftTfIdf
  • PhoneticSimilarityMeasure
    • Soundex

References

AnHai Doan, Alon Halevy, Zachary Ives, “Principles of Data Integration”, Morgan Kaufmann, 2012. Chapter 4 “String Matching” (available on the package’s homepage).

Tokenizers

Alphabetic Tokenizer

class py_stringmatching.tokenizer.alphabetic_tokenizer.AlphabeticTokenizer(return_set=False)

Returns tokens that are maximal sequences of consecutive alphabetical characters.

Parameters:return_set (boolean) – A flag to indicate whether to return a set of tokens instead of a bag of tokens (defaults to False).
return_set

An attribute that stores the value for the flag return_set.

Type:boolean
get_return_set()

Gets the value of the return_set flag.

Returns:The boolean value of the return_set flag.
set_return_set(return_set)

Sets the value of the return_set flag.

Parameters:return_set (boolean) – a flag to indicate whether to return a set of tokens instead of a bag of tokens.
tokenize(input_string)

Tokenizes input string into alphabetical tokens.

Parameters:input_string (str) – The string to be tokenized.
Returns:A Python list, which represents a set of tokens if the flag return_set is True, and a bag of tokens otherwise.
Raises:TypeError – If the input is not a string.

Examples

>>> al_tok = AlphabeticTokenizer()
>>> al_tok.tokenize('data99science, data#integration.')
['data', 'science', 'data', 'integration']
>>> al_tok.tokenize('99')
[]
>>> al_tok = AlphabeticTokenizer(return_set=True)
>>> al_tok.tokenize('data99science, data#integration.')
['data', 'science', 'integration']

Alphanumeric Tokenizer

class py_stringmatching.tokenizer.alphanumeric_tokenizer.AlphanumericTokenizer(return_set=False)

Returns tokens that are maximal sequences of consecutive alphanumeric characters.

Parameters:return_set (boolean) – A flag to indicate whether to return a set of tokens instead of a bag of tokens (defaults to False).
return_set

An attribute to store the value of the flag return_set.

Type:boolean
get_return_set()

Gets the value of the return_set flag.

Returns:The boolean value of the return_set flag.
set_return_set(return_set)

Sets the value of the return_set flag.

Parameters:return_set (boolean) – a flag to indicate whether to return a set of tokens instead of a bag of tokens.
tokenize(input_string)

Tokenizes input string into alphanumeric tokens.

Parameters:input_string (str) – The string to be tokenized.
Returns:A Python list, which represents a set of tokens if the flag return_set is True, and a bag of tokens otherwise.
Raises:TypeError – If the input is not a string.

Examples

>>> alnum_tok = AlphanumericTokenizer()
>>> alnum_tok.tokenize('data9,(science), data9#.(integration).88')
['data9', 'science', 'data9', 'integration', '88']
>>> alnum_tok.tokenize('#.&')
[]
>>> alnum_tok = AlphanumericTokenizer(return_set=True)
>>> alnum_tok.tokenize('data9,(science), data9#.(integration).88')
['data9', 'science', 'integration', '88']

Delimiter Tokenizer

class py_stringmatching.tokenizer.delimiter_tokenizer.DelimiterTokenizer(delim_set={' '}, return_set=False)

Uses delimiters to find tokens, as opposed to using definitions.

Examples of delimiters include whitespace and punctuation. Examples of definitions include alphabetical and qgram tokens.

Parameters:
  • delim_set (set) – A set of delimiter strings (defaults to space delimiter).
  • return_set (boolean) – A flag to indicate whether to return a set of tokens instead of a bag of tokens (defaults to False).
return_set

An attribute to store the value of the flag return_set.

Type:boolean
get_delim_set()

Gets the current set of delimiters.

Returns:A Python set which is the current set of delimiters.
get_return_set()

Gets the value of the return_set flag.

Returns:The boolean value of the return_set flag.
set_delim_set(delim_set)

Sets the current set of delimiters.

Parameters:delim_set (set) – A set of delimiter strings.
set_return_set(return_set)

Sets the value of the return_set flag.

Parameters:return_set (boolean) – a flag to indicate whether to return a set of tokens instead of a bag of tokens.
tokenize(input_string)

Tokenizes input string based on the set of delimiters.

Parameters:input_string (str) – The string to be tokenized.
Returns:A Python list which is a set or a bag of tokens, depending on whether return_set flag is set to True or False.
Raises:TypeError – If the input is not a string.

Examples

>>> delim_tok = DelimiterTokenizer()
>>> delim_tok.tokenize('data science')
['data', 'science']
>>> delim_tok = DelimiterTokenizer(['$#$'])
>>> delim_tok.tokenize('data$#$science')
['data', 'science']
>>> delim_tok = DelimiterTokenizer([',', '.'])
>>> delim_tok.tokenize('data,science.data,integration.')
['data', 'science', 'data', 'integration']
>>> delim_tok = DelimiterTokenizer([',', '.'], return_set=True)
>>> delim_tok.tokenize('data,science.data,integration.')
['data', 'science', 'integration']

Qgram Tokenizer

class py_stringmatching.tokenizer.qgram_tokenizer.QgramTokenizer(qval=2, padding=True, prefix_pad='#', suffix_pad='$', return_set=False)

Returns tokens that are sequences of q consecutive characters.

A qgram of an input string s is a substring t (of s) which is a sequence of q consecutive characters. Qgrams are also known as ngrams or kgrams.

Parameters:
  • qval (int) – A value for q, that is, the qgram’s length (defaults to 2).
  • return_set (boolean) – A flag to indicate whether to return a set of tokens or a bag of tokens (defaults to False).
  • padding (boolean) – A flag to indicate whether a prefix and a suffix should be added to the input string (defaults to True).
  • prefix_pad (str) – A character (that is, a string of length 1 in Python) that should be replicated (qval-1) times and prepended to the input string, if padding was set to True (defaults to ‘#’).
  • suffix_pad (str) – A character (that is, a string of length 1 in Python) that should be replicated (qval-1) times and appended to the input string, if padding was set to True (defaults to ‘$’).
qval

An attribute to store the q value.

Type:int
return_set

An attribute to store the flag return_set.

Type:boolean
padding

An attribute to store the padding flag.

Type:boolean
prefix_pad

An attribute to store the prefix string that should be used for padding.

Type:str
suffix_pad

An attribute to store the suffix string that should be used for padding.

Type:str
get_padding()

Gets the value of the padding flag. This flag determines whether the padding should be done for the input strings or not.

Returns:The Boolean value of the padding flag.
get_prefix_pad()

Gets the value of the prefix pad.

Returns:The prefix pad string.
get_qval()

Gets the value of the qval attribute, which is the length of qgrams.

Returns:The value of the qval attribute.
get_return_set()

Gets the value of the return_set flag.

Returns:The boolean value of the return_set flag.
get_suffix_pad()

Gets the value of the suffix pad.

Returns:The suffix pad string.
set_padding(padding)

Sets the value of the padding flag.

Parameters:padding (boolean) – Flag to indicate whether padding should be done or not.
Returns:The Boolean value of True is returned if the update was successful.
Raises:AssertionError – If the padding is not of type boolean
set_prefix_pad(prefix_pad)

Sets the value of the prefix pad string.

Parameters:

prefix_pad (str) – String that should be prepended to the input string before tokenization.

Returns:

The Boolean value of True is returned if the update was successful.

Raises:
  • AssertionError – If the prefix_pad is not of type string.
  • AssertionError – If the length of prefix_pad is not one.
set_qval(qval)

Sets the value of the qval attribute.

Parameters:qval (int) – A value for q (the length of qgrams).
Raises:AssertionError – If qval is less than 1.
set_return_set(return_set)

Sets the value of the return_set flag.

Parameters:return_set (boolean) – a flag to indicate whether to return a set of tokens instead of a bag of tokens.
set_suffix_pad(suffix_pad)

Sets the value of the suffix pad string.

Parameters:

suffix_pad (str) – String that should be appended to the input string before tokenization.

Returns:

The boolean value of True is returned if the update was successful.

Raises:
  • AssertionError – If the suffix_pad is not of type string.
  • AssertionError – If the length of suffix_pad is not one.
tokenize(input_string)

Tokenizes input string into qgrams.

Parameters:input_string (str) – The string to be tokenized.
Returns:A Python list, which is a set or a bag of qgrams, depending on whether return_set flag is True or False.
Raises:TypeError – If the input is not a string

Examples

>>> qg2_tok = QgramTokenizer()
>>> qg2_tok.tokenize('database')
['#d', 'da', 'at', 'ta', 'ab', 'ba', 'as', 'se', 'e$']
>>> qg2_tok.tokenize('a')
['#a', 'a$']
>>> qg3_tok = QgramTokenizer(qval=3)
>>> qg3_tok.tokenize('database')
['##d', '#da', 'dat', 'ata', 'tab', 'aba', 'bas', 'ase', 'se$', 'e$$']
>>> qg2_nopad = QgramTokenizer(padding=False)
>>> qg2_nopad.tokenize('database')
['da', 'at', 'ta', 'ab', 'ba', 'as', 'se']
>>> qg2_diffpads = QgramTokenizer(prefix_pad='^', suffix_pad='!')
>>> qg2_diffpads.tokenize('database')
['^d', 'da', 'at', 'ta', 'ab', 'ba', 'as', 'se', 'e!']

Whitespace Tokenizer

class py_stringmatching.tokenizer.whitespace_tokenizer.WhitespaceTokenizer(return_set=False)

Segments the input string using whitespaces then returns the segments as tokens.

This tokenizer currently uses Python’s split function, so “whitespace” refers to the space character as well as the tab and newline characters.

Parameters:return_set (boolean) – A flag to indicate whether to return a set of tokens instead of a bag of tokens (defaults to False).
return_set

An attribute to store the flag return_set.

Type:boolean
get_delim_set()

Gets the current set of delimiters.

Returns:A Python set which is the current set of delimiters.
get_return_set()

Gets the value of the return_set flag.

Returns:The boolean value of the return_set flag.
set_delim_set(delim_set)

Sets the current set of delimiters.

Parameters:delim_set (set) – A set of delimiter strings.
set_return_set(return_set)

Sets the value of the return_set flag.

Parameters:return_set (boolean) – a flag to indicate whether to return a set of tokens instead of a bag of tokens.
tokenize(input_string)

Tokenizes input string based on white space.

Parameters:input_string (str) – The string to be tokenized.
Returns:A Python list, which is a set or a bag of tokens, depending on whether return_set is True or False.
Raises:TypeError – If the input is not a string.

Examples

>>> ws_tok = WhitespaceTokenizer()
>>> ws_tok.tokenize('data science')
['data', 'science']
>>> ws_tok.tokenize('data        science')
['data', 'science']
>>> ws_tok.tokenize('data   science')
['data', 'science']
>>> ws_tok = WhitespaceTokenizer(return_set=True)
>>> ws_tok.tokenize('data   science data integration')
['data', 'science', 'integration']

Similarity Measures

Affine Gap

class py_stringmatching.similarity_measure.affine.Affine(gap_start=1, gap_continuation=0.5, sim_func=identity_function)

Returns the affine gap score between two strings.

The affine gap measure is an extension of the Needleman-Wunsch measure that handles longer gaps more gracefully. For more information, refer to the string matching chapter in the DI book (“Principles of Data Integration”).

Parameters:
  • gap_start (float) – Cost for the gap at the start (defaults to 1).
  • gap_continuation (float) – Cost for the gap continuation (defaults to 0.5).
  • sim_func (function) – Function computing similarity score between two characters, which are represented as strings (defaults to an identity function, which returns 1 if the two characters are the same and returns 0 otherwise).
gap_start

An attribute to store the gap cost at the start.

Type:float
gap_continuation

An attribute to store the gap continuation cost.

Type:float
sim_func

An attribute to store the similarity function.

Type:function
get_gap_continuation()

Get gap continuation cost.

Returns:gap continuation cost (float).
get_gap_start()

Get gap start cost.

Returns:gap start cost (float).
get_raw_score(string1, string2)

Computes the affine gap score between two strings. This score can be outside the range [0,1].

Parameters:string1,string2 (str) – Input strings.
Returns:Affine gap score between the two input strings (float).
Raises:TypeError – If the inputs are not strings or if one of the inputs is None.

Examples

>>> aff = Affine()
>>> aff.get_raw_score('dva', 'deeva')
1.5
>>> aff = Affine(gap_start=2, gap_continuation=0.5)
>>> aff.get_raw_score('dva', 'deeve')
-0.5
>>> aff = Affine(gap_continuation=0.2, sim_func=lambda s1, s2: (int(1 if s1 == s2 else 0)))
>>> aff.get_raw_score('AAAGAATTCA', 'AAATCA')
4.4
get_sim_func()

Get similarity function.

Returns:similarity function (function).
set_gap_continuation(gap_continuation)

Set gap continuation cost.

Parameters:gap_continuation (float) – Cost for the gap continuation.
set_gap_start(gap_start)

Set gap start cost.

Parameters:gap_start (float) – Cost for the gap at the start.
set_sim_func(sim_func)

Set similarity function.

Parameters:sim_func (function) – Function computing similarity score between two characters, represented as strings.

Bag Distance

class py_stringmatching.similarity_measure.bag_distance.BagDistance

Bag distance measure class.

get_raw_score(string1, string2)

Computes the bag distance between two strings.

For two strings X and Y, the bag distance is: \(max( |bag(X)-bag(Y)|, |bag(Y)-bag(X)| )\)

Parameters:string1,string2 (str) – Input strings
Returns:Bag distance (int)
Raises:TypeError – If the inputs are not strings

Examples

>>> bd = BagDistance()
>>> bd.get_raw_score('cat', 'hat')
1
>>> bd.get_raw_score('Niall', 'Neil')
2
>>> bd.get_raw_score('aluminum', 'Catalan')
5
>>> bd.get_raw_score('ATCG', 'TAGC')
0
>>> bd.get_raw_score('abcde', 'xyz')
5

get_sim_score(string1, string2)

Computes the normalized bag similarity between two strings.

Parameters:string1,string2 (str) – Input strings
Returns:Normalized bag similarity (float)
Raises:TypeError – If the inputs are not strings

Examples

>>> bd = BagDistance()
>>> bd.get_sim_score('cat', 'hat')
0.6666666666666667
>>> bd.get_sim_score('Niall', 'Neil')
0.6
>>> bd.get_sim_score('aluminum', 'Catalan')
0.375
>>> bd.get_sim_score('ATCG', 'TAGC')
1.0
>>> bd.get_sim_score('abcde', 'xyz')
0.0

Cosine

class py_stringmatching.similarity_measure.cosine.Cosine

Computes a variant of cosine measure known as Ochiai coefficient.

This is not the cosine measure that computes the cosine of the angle between two given vectors. Rather, it computes a variant of cosine measure known as Ochiai coefficient (see the Wikipedia page “Cosine Similarity”). Specifically, for two sets X and Y, this measure computes:

\(cosine(X, Y) = \frac{|X \cap Y|}{\sqrt{|X| \cdot |Y|}}\)

Note

  • In the case where one of X and Y is an empty set and the other is a non-empty set, we define their cosine score to be 0.
  • In the case where both X and Y are empty sets, we define their cosine score to be 1.
get_raw_score(set1, set2)

Computes the raw cosine score between two sets.

Parameters:set1,set2 (set or list) – Input sets (or lists). Input lists are converted to sets.
Returns:Cosine similarity (float)
Raises:TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.

Examples

>>> cos = Cosine()
>>> cos.get_raw_score(['data', 'science'], ['data'])
0.7071067811865475
>>> cos.get_raw_score(['data', 'data', 'science'], ['data', 'management'])
0.4999999999999999
>>> cos.get_raw_score([], ['data'])
0.0

References

  • String Similarity Joins: An Experimental Evaluation (VLDB 2014).
  • Project Flamingo at http://flamingo.ics.uci.edu.
get_sim_score(set1, set2)

Computes the normalized cosine similarity between two sets.

Parameters:set1,set2 (set or list) – Input sets (or lists). Input lists are converted to sets.
Returns:Normalized cosine similarity (float)
Raises:TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.

Examples

>>> cos = Cosine()
>>> cos.get_sim_score(['data', 'science'], ['data'])
0.7071067811865475
>>> cos.get_sim_score(['data', 'data', 'science'], ['data', 'management'])
0.4999999999999999
>>> cos.get_sim_score([], ['data'])
0.0

Dice

class py_stringmatching.similarity_measure.dice.Dice

Returns the Dice score between two strings.

The Dice similarity score is defined as twice the shared information (intersection) divided by sum of cardinalities. For two sets X and Y, the Dice similarity score is:

\(dice(X, Y) = \frac{2 * |X \cap Y|}{|X| + |Y|}\)

Note

In the case where both X and Y are empty sets, we define their Dice score to be 1.
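The Dice formula above, including the empty-set convention, reduces to a few lines of Python (an illustrative sketch, not the class's actual implementation; the function name is ours):

```python
def dice_sketch(set1, set2):
    # Convert list inputs to sets, mirroring the class's behavior.
    x, y = set(set1), set(set2)
    if not x and not y:
        return 1.0  # both empty: defined as 1
    return 2 * len(x & y) / (len(x) + len(y))
```

For example, `dice_sketch(['data', 'science'], ['data'])` gives 2·1/(2+1) ≈ 0.6667, matching the doctest below.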

get_raw_score(set1, set2)

Computes the raw Dice score between two sets. This score is already in [0,1].

Parameters:set1,set2 (set or list) – Input sets (or lists). Input lists are converted to sets.
Returns:Dice similarity score (float).
Raises:TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.

Examples

>>> dice = Dice()
>>> dice.get_raw_score(['data', 'science'], ['data'])
0.6666666666666666
>>> dice.get_raw_score({1, 1, 2, 3, 4}, {2, 3, 4, 5, 6, 7, 7, 8})
0.5454545454545454
>>> dice.get_raw_score(['data', 'management'], ['data', 'data', 'science'])
0.5

get_sim_score(set1, set2)

Computes the normalized Dice similarity score between two sets. This simply calls get_raw_score, since the raw score is already normalized.

Parameters:set1,set2 (set or list) – Input sets (or lists). Input lists are converted to sets.
Returns:Normalized dice similarity (float).
Raises:TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.

Examples

>>> dice = Dice()
>>> dice.get_sim_score(['data', 'science'], ['data'])
0.6666666666666666
>>> dice.get_sim_score({1, 1, 2, 3, 4}, {2, 3, 4, 5, 6, 7, 7, 8})
0.5454545454545454
>>> dice.get_sim_score(['data', 'management'], ['data', 'data', 'science'])
0.5

Editex

Editex distance measure

class py_stringmatching.similarity_measure.editex.Editex(match_cost=0, group_cost=1, mismatch_cost=2, local=False)

Editex distance measure class.

Parameters:
  • match_cost (int) – Weight to give the correct char match, default=0
  • group_cost (int) – Weight to give if the chars are in the same editex group, default=1
  • mismatch_cost (int) – Weight to give the incorrect char match, default=2
  • local (boolean) – Local variant on/off, default=False
get_group_cost()

Get group cost

Returns:group cost (int)
get_local()

Get local flag

Returns:local flag (boolean)
get_match_cost()

Get match cost

Returns:match cost (int)
get_mismatch_cost()

Get mismatch cost

Returns:mismatch cost (int)
get_raw_score(string1, string2)

Computes the editex distance between two strings.

As described on pages 3 & 4 of Zobel, Justin and Philip Dart. 1996. Phonetic string matching: Lessons from information retrieval. In: Proceedings of the ACM-SIGIR Conference on Research and Development in Information Retrieval, Zurich, Switzerland. 166–173. http://goanna.cs.rmit.edu.au/~jz/fulltext/sigir96.pdf

The local variant is based on Ring, Nicholas and Alexandra L. Uitdenbogerd. 2009. Finding ‘Lucy in Disguise’: The Misheard Lyric Matching Problem. In: Proceedings of the 5th Asia Information Retrieval Symposium, Sapporo, Japan. 157-167. http://www.seg.rmit.edu.au/research/download.php?manuscript=404

Parameters:string1,string2 (str) – Input strings
Returns:Editex distance (int)
Raises:TypeError – If the inputs are not strings

Examples

>>> ed = Editex()
>>> ed.get_raw_score('cat', 'hat')
2
>>> ed.get_raw_score('Niall', 'Neil')
2
>>> ed.get_raw_score('aluminum', 'Catalan')
12
>>> ed.get_raw_score('ATCG', 'TAGC')
6

get_sim_score(string1, string2)

Computes the normalized editex similarity between two strings.

Parameters:string1,string2 (str) – Input strings
Returns:Normalized editex similarity (float)
Raises:TypeError – If the inputs are not strings

Examples

>>> ed = Editex()
>>> ed.get_sim_score('cat', 'hat')
0.66666666666666674
>>> ed.get_sim_score('Niall', 'Neil')
0.80000000000000004
>>> ed.get_sim_score('aluminum', 'Catalan')
0.25
>>> ed.get_sim_score('ATCG', 'TAGC')
0.25

set_group_cost(group_cost)

Set group cost

Parameters:group_cost (int) – Weight to give if the chars are in the same editex group
set_local(local)

Set local flag

Parameters:local (boolean) – Local variant on/off
set_match_cost(match_cost)

Set match cost

Parameters:match_cost (int) – Weight to give the correct char match
set_mismatch_cost(mismatch_cost)

Set mismatch cost

Parameters:mismatch_cost (int) – Weight to give the incorrect char match

Generalized Jaccard

Generalized jaccard similarity measure

class py_stringmatching.similarity_measure.generalized_jaccard.GeneralizedJaccard(sim_func=<bound method Jaro.get_raw_score of <py_stringmatching.similarity_measure.jaro.Jaro object>>, threshold=0.5)

Generalized jaccard similarity measure class.

Parameters:
  • sim_func (function) – Similarity function. This should return a similarity score between two strings (optional; defaults to the Jaro similarity measure).
  • threshold (float) – Threshold value (defaults to 0.5). If the similarity of a token pair exceeds the threshold, then the token pair is considered a match.
get_raw_score(set1, set2)

Computes the Generalized Jaccard measure between two sets.

This similarity measure is a softened version of the Jaccard measure. The Jaccard measure is a promising candidate for tokens that match exactly across the sets. However, in practice tokens are often misspelled, such as energy vs. eneryg. The generalized Jaccard measure enables matching in such cases.
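
The idea can be sketched as follows: pair up tokens whose secondary similarity exceeds the threshold, then divide the sum of the pair similarities by |X| + |Y| minus the number of matched pairs. This is a rough sketch using greedy best-match pairing; the library's exact pairing strategy may differ, and the function name is ours:

```python
def generalized_jaccard_sketch(set1, set2, sim_func, threshold=0.5):
    x, y = set(set1), set(set2)
    # Consider all cross pairs, best-scoring first.
    pairs = sorted(((sim_func(a, b), a, b) for a in x for b in y), reverse=True)
    used_a, used_b = set(), set()
    total, matches = 0.0, 0
    for score, a, b in pairs:
        if score > threshold and a not in used_a and b not in used_b:
            used_a.add(a)
            used_b.add(b)
            total += score
            matches += 1
    return total / (len(x) + len(y) - matches)
```

With an exact-match secondary similarity, `generalized_jaccard_sketch(['data', 'science'], ['data'], lambda a, b: 1.0 if a == b else 0.0)` gives 1/(2 + 1 − 1) = 0.5, consistent with the default example below.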

Parameters:

set1,set2 (set or list) – Input sets (or lists) of strings. Input lists are converted to sets.

Returns:

Generalized Jaccard similarity (float)

Raises:
  • TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.
  • ValueError – If the similarity measure doesn’t return values in the range [0,1]

Examples

>>> gj = GeneralizedJaccard()
>>> gj.get_raw_score(['data', 'science'], ['data'])
0.5
>>> gj.get_raw_score(['data', 'management'], ['data', 'data', 'science'])
0.3333333333333333
>>> gj.get_raw_score(['Niall'], ['Neal', 'Njall'])
0.43333333333333335
>>> gj = GeneralizedJaccard(sim_func=JaroWinkler().get_raw_score, threshold=0.8)
>>> gj.get_raw_score(['Comp', 'Sci.', 'and', 'Engr', 'Dept.,', 'Universty', 'of', 'Cal,', 'San', 'Deigo'],
...                  ['Department', 'of', 'Computer', 'Science,', 'Univ.', 'Calif.,', 'San', 'Diego'])
0.45810185185185187
get_sim_func()

Get similarity function

Returns:similarity function (function)
get_sim_score(set1, set2)

Computes the normalized Generalized Jaccard similarity between two sets.

Parameters:

set1,set2 (set or list) – Input sets (or lists) of strings. Input lists are converted to sets.

Returns:

Normalized Generalized Jaccard similarity (float)

Raises:
  • TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.
  • ValueError – If the similarity measure doesn’t return values in the range [0,1]

Examples

>>> gj = GeneralizedJaccard()
>>> gj.get_sim_score(['data', 'science'], ['data'])
0.5
>>> gj.get_sim_score(['data', 'management'], ['data', 'data', 'science'])
0.3333333333333333
>>> gj.get_sim_score(['Niall'], ['Neal', 'Njall'])
0.43333333333333335
>>> gj = GeneralizedJaccard(sim_func=JaroWinkler().get_raw_score, threshold=0.8)
>>> gj.get_sim_score(['Comp', 'Sci.', 'and', 'Engr', 'Dept.,', 'Universty', 'of', 'Cal,', 'San', 'Deigo'],
...                  ['Department', 'of', 'Computer', 'Science,', 'Univ.', 'Calif.,', 'San', 'Diego'])
0.45810185185185187
get_threshold()

Get threshold used for the similarity function

Returns:threshold (float)
set_sim_func(sim_func)

Set similarity function

Parameters:sim_func (function) – similarity function
set_threshold(threshold)

Set threshold value for the similarity function

Parameters:threshold (float) – threshold value

Hamming Distance

class py_stringmatching.similarity_measure.hamming_distance.HammingDistance

Computes Hamming distance.

The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. Thus, it measures the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could have transformed one string into the other.
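The definition above translates directly into a position-by-position comparison (an illustrative sketch, not the class's actual implementation; the function name is ours):

```python
def hamming_sketch(s1, s2):
    # Hamming distance is only defined for strings of equal length.
    if len(s1) != len(s2):
        raise ValueError('strings must be of equal length')
    # Count positions where the corresponding characters differ.
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))
```

For example, `hamming_sketch('alex', 'john')` gives 4, and the comparison is case-sensitive, so `hamming_sketch('JOHN', 'john')` also gives 4, matching the doctests below.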

get_raw_score(string1, string2)

Computes the raw hamming distance between two strings.

Parameters:

string1,string2 (str) – Input strings.

Returns:

Hamming distance (int).

Raises:
  • TypeError – If the inputs are not strings or if one of the inputs is None.
  • ValueError – If the input strings are not of same length.

Examples

>>> hd = HammingDistance()
>>> hd.get_raw_score('', '')
0
>>> hd.get_raw_score('alex', 'john')
4
>>> hd.get_raw_score(' ', 'a')
1
>>> hd.get_raw_score('JOHN', 'john')
4
get_sim_score(string1, string2)

Computes the normalized Hamming similarity score between two strings.

Parameters:

string1,string2 (str) – Input strings.

Returns:

Normalized Hamming similarity score (float).

Raises:
  • TypeError – If the inputs are not strings or if one of the inputs is None.
  • ValueError – If the input strings are not of same length.

Examples

>>> hd = HammingDistance()
>>> hd.get_sim_score('', '')
1.0
>>> hd.get_sim_score('alex', 'john')
0.0
>>> hd.get_sim_score(' ', 'a')
0.0
>>> hd.get_sim_score('JOHN', 'john')
0.0

Jaccard

class py_stringmatching.similarity_measure.jaccard.Jaccard

Computes Jaccard measure.

For two sets X and Y, the Jaccard similarity score is:

\(jaccard(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}\)

Note

In the case where both X and Y are empty sets, we define their Jaccard score to be 1.
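The Jaccard formula above, including the empty-set convention, reduces to a few lines of Python (an illustrative sketch, not the class's actual implementation; the function name is ours):

```python
def jaccard_sketch(set1, set2):
    # Convert list inputs to sets, mirroring the class's behavior.
    x, y = set(set1), set(set2)
    if not x and not y:
        return 1.0  # both empty: defined as 1
    return len(x & y) / len(x | y)
```

For example, `jaccard_sketch(['data', 'science'], ['data'])` gives 1/2 = 0.5, matching the doctest below.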

get_raw_score(set1, set2)

Computes the raw Jaccard score between two sets.

Parameters:set1,set2 (set or list) – Input sets (or lists). Input lists are converted to sets.
Returns:Jaccard similarity score (float).
Raises:TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.

Examples

>>> jac = Jaccard()
>>> jac.get_raw_score(['data', 'science'], ['data'])
0.5
>>> jac.get_raw_score({1, 1, 2, 3, 4}, {2, 3, 4, 5, 6, 7, 7, 8})
0.375
>>> jac.get_raw_score(['data', 'management'], ['data', 'data', 'science'])
0.3333333333333333
get_sim_score(set1, set2)

Computes the normalized Jaccard similarity between two sets. This simply calls get_raw_score, since the raw score is already normalized.

Parameters:set1,set2 (set or list) – Input sets (or lists). Input lists are converted to sets.
Returns:Normalized Jaccard similarity (float).
Raises:TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.

Examples

>>> jac = Jaccard()
>>> jac.get_sim_score(['data', 'science'], ['data'])
0.5
>>> jac.get_sim_score({1, 1, 2, 3, 4}, {2, 3, 4, 5, 6, 7, 7, 8})
0.375
>>> jac.get_sim_score(['data', 'management'], ['data', 'data', 'science'])
0.3333333333333333

Jaro

class py_stringmatching.similarity_measure.jaro.Jaro

Computes Jaro measure.

The Jaro measure is a type of edit distance, developed mainly to compare short strings, such as first and last names.

get_raw_score(string1, string2)

Computes the raw Jaro score between two strings.

Parameters:string1,string2 (str) – Input strings.
Returns:Jaro similarity score (float).
Raises:TypeError – If the inputs are not strings or if one of the inputs is None.

Examples

>>> jaro = Jaro()
>>> jaro.get_raw_score('MARTHA', 'MARHTA')
0.9444444444444445
>>> jaro.get_raw_score('DWAYNE', 'DUANE')
0.8222222222222223
>>> jaro.get_raw_score('DIXON', 'DICKSONX')
0.7666666666666666
get_sim_score(string1, string2)

Computes the normalized Jaro similarity score between two strings. This simply calls get_raw_score, since the raw score is already normalized.

Parameters:string1,string2 (str) – Input strings.
Returns:Normalized Jaro similarity score (float).
Raises:TypeError – If the inputs are not strings or if one of the inputs is None.

Examples

>>> jaro = Jaro()
>>> jaro.get_sim_score('MARTHA', 'MARHTA')
0.9444444444444445
>>> jaro.get_sim_score('DWAYNE', 'DUANE')
0.8222222222222223
>>> jaro.get_sim_score('DIXON', 'DICKSONX')
0.7666666666666666

Jaro Winkler

class py_stringmatching.similarity_measure.jaro_winkler.JaroWinkler(prefix_weight=0.1)

Computes Jaro-Winkler measure.

The Jaro-Winkler measure is designed to capture cases where two strings have a low Jaro score, but share a prefix and thus are likely to match.
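The prefix adjustment can be sketched as follows. This assumes the standard Winkler formula, where the Jaro score is boosted in proportion to the shared prefix, capped at four characters; the formula and the function name are assumptions about this class's internals, not taken from the source:

```python
def winkler_boost(jaro_score, s1, s2, prefix_weight=0.1):
    # Length of the common prefix, capped at 4 characters.
    l = 0
    for c1, c2 in zip(s1[:4], s2[:4]):
        if c1 != c2:
            break
        l += 1
    # Boost the Jaro score toward 1 in proportion to the prefix length.
    return jaro_score + l * prefix_weight * (1 - jaro_score)
```

For example, 'MARTHA' and 'MARHTA' share the 3-character prefix 'MAR', so a Jaro score of 0.9444... is boosted to about 0.9611, matching the doctest below.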

Parameters:prefix_weight (float) – Weight to give to the prefix (defaults to 0.1).
prefix_weight

An attribute to store the prefix weight.

Type:float
get_prefix_weight()

Get prefix weight.

Returns:prefix weight (float).
get_raw_score(string1, string2)

Computes the raw Jaro-Winkler score between two strings.

Parameters:string1,string2 (str) – Input strings.
Returns:Jaro-Winkler similarity score (float).
Raises:TypeError – If the inputs are not strings or if one of the inputs is None.

Examples

>>> jw = JaroWinkler()
>>> jw.get_raw_score('MARTHA', 'MARHTA')
0.9611111111111111
>>> jw.get_raw_score('DWAYNE', 'DUANE')
0.84
>>> jw.get_raw_score('DIXON', 'DICKSONX')
0.8133333333333332
get_sim_score(string1, string2)

Computes the normalized Jaro-Winkler similarity score between two strings. This simply calls get_raw_score, since the raw score is already normalized.

Parameters:string1,string2 (str) – Input strings.
Returns:Normalized Jaro-Winkler similarity (float).
Raises:TypeError – If the inputs are not strings or if one of the inputs is None.

Examples

>>> jw = JaroWinkler()
>>> jw.get_sim_score('MARTHA', 'MARHTA')
0.9611111111111111
>>> jw.get_sim_score('DWAYNE', 'DUANE')
0.84
>>> jw.get_sim_score('DIXON', 'DICKSONX')
0.8133333333333332
set_prefix_weight(prefix_weight)

Set prefix weight.

Parameters:prefix_weight (float) – Weight to give to the prefix.

Levenshtein

class py_stringmatching.similarity_measure.levenshtein.Levenshtein

Computes Levenshtein measure (also known as edit distance).

Levenshtein distance computes the minimum cost of transforming one string into the other. Transforming a string is carried out using a sequence of the following operators: delete a character, insert a character, and substitute one character for another.
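The minimum-cost transformation described above is typically computed with a dynamic program over prefixes of the two strings. The following is an illustrative sketch (the function name is ours; the package's actual implementation is in Cython):

```python
def levenshtein_sketch(s1, s2):
    # prev[j] holds the edit distances for the previous row of the DP table.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (c1 != c2)))   # substitution
        prev = curr
    return prev[-1]
```

For example, `levenshtein_sketch('example', 'samples')` gives 3, matching the doctest below. The normalized score then divides the distance by the length of the longer string and subtracts from 1.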

get_raw_score(string1, string2)

Computes the raw Levenshtein distance between two strings.

Parameters:string1,string2 (str) – Input strings.
Returns:Levenshtein distance (int).
Raises:TypeError – If the inputs are not strings.

Examples

>>> lev = Levenshtein()
>>> lev.get_raw_score('a', '')
1
>>> lev.get_raw_score('example', 'samples')
3
>>> lev.get_raw_score('levenshtein', 'frankenstein')
6
get_sim_score(string1, string2)

Computes the normalized Levenshtein similarity score between two strings.

Parameters:string1,string2 (str) – Input strings.
Returns:Normalized Levenshtein similarity (float).
Raises:TypeError – If the inputs are not strings.

Examples

>>> lev = Levenshtein()
>>> lev.get_sim_score('a', '')
0.0
>>> lev.get_sim_score('example', 'samples')
0.5714285714285714
>>> lev.get_sim_score('levenshtein', 'frankenstein')
0.5

Monge Elkan

class py_stringmatching.similarity_measure.monge_elkan.MongeElkan(sim_func=jaro_winkler_function)

Computes Monge-Elkan measure.

The Monge-Elkan similarity measure is a type of hybrid similarity measure that combines the benefits of sequence-based and set-based methods. This can be effective for domains in which more control is needed over the similarity measure. It implicitly uses a secondary similarity measure, such as Levenshtein, to compute an overall similarity score. See the string matching chapter in the DI book (Principles of Data Integration).
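
The measure averages, over the tokens of the first bag, each token's best secondary-similarity match in the second bag. A minimal sketch (the function name is ours; note the measure is asymmetric in its two arguments):

```python
def monge_elkan_sketch(bag1, bag2, sim_func):
    # For each token in bag1, take its best match in bag2, then average.
    return sum(max(sim_func(a, b) for b in bag2) for a in bag1) / len(bag1)
```

With a toy exact-match secondary similarity, `monge_elkan_sketch(['a', 'b'], ['a'], lambda s, t: 1.0 if s == t else 0.0)` gives (1.0 + 0.0)/2 = 0.5.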

Parameters:sim_func (function) – Secondary similarity function. This is expected to be a sequence-based similarity measure (defaults to Jaro-Winkler similarity measure).
sim_func

An attribute to store the secondary similarity function.

Type:function
get_raw_score(bag1, bag2)

Computes the raw Monge-Elkan score between two bags (lists).

Parameters:bag1,bag2 (list) – Input lists.
Returns:Monge-Elkan similarity score (float).
Raises:TypeError – If the inputs are not lists or if one of the inputs is None.

Examples

>>> me = MongeElkan()
>>> me.get_raw_score(['Niall'], ['Neal'])
0.8049999999999999
>>> me.get_raw_score(['Niall'], ['Nigel'])
0.7866666666666667
>>> me.get_raw_score(['Comput.', 'Sci.', 'and', 'Eng.', 'Dept.,', 'University', 'of', 'California,', 'San', 'Diego'], ['Department', 'of', 'Computer', 'Science,', 'Univ.', 'Calif.,', 'San', 'Diego'])
0.8677218614718616
>>> me.get_raw_score([''], ['a'])
0.0
>>> me = MongeElkan(sim_func=NeedlemanWunsch().get_raw_score)
>>> me.get_raw_score(['Comput.', 'Sci.', 'and', 'Eng.', 'Dept.,', 'University', 'of', 'California,', 'San', 'Diego'], ['Department', 'of', 'Computer', 'Science,', 'Univ.', 'Calif.,', 'San', 'Diego'])
2.0
>>> me = MongeElkan(sim_func=Affine().get_raw_score)
>>> me.get_raw_score(['Comput.', 'Sci.', 'and', 'Eng.', 'Dept.,', 'University', 'of', 'California,', 'San', 'Diego'], ['Department', 'of', 'Computer', 'Science,', 'Univ.', 'Calif.,', 'San', 'Diego'])
2.25

References

  • Principles of Data Integration book
get_sim_func()

Get the secondary similarity function.

Returns:secondary similarity function (function).
set_sim_func(sim_func)

Set the secondary similarity function.

Parameters:sim_func (function) – Secondary similarity function.

Needleman Wunsch

class py_stringmatching.similarity_measure.needleman_wunsch.NeedlemanWunsch(gap_cost=1.0, sim_func=identity_function)

Computes Needleman-Wunsch measure.

The Needleman-Wunsch distance generalizes the Levenshtein distance and considers global alignment between two strings. Specifically, it is computed by assigning a score to each alignment between the two input strings and choosing the score of the best alignment, that is, the maximal score. An alignment between two strings is a set of correspondences between their characters, allowing for gaps.
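The global alignment described above is computed with a dynamic program in which gaps at the boundaries are also penalized. An illustrative sketch with the default identity similarity and gap cost (the function name is ours; the package's actual implementation is in Cython):

```python
def needleman_wunsch_sketch(s1, s2, gap_cost=1.0,
                            sim_func=lambda a, b: 1.0 if a == b else 0.0):
    n, m = len(s1), len(s2)
    # score[i][j]: best global alignment score of s1[:i] and s2[:j].
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -i * gap_cost   # leading gaps in s2
    for j in range(1, m + 1):
        score[0][j] = -j * gap_cost   # leading gaps in s1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sim_func(s1[i - 1], s2[j - 1]),
                              score[i - 1][j] - gap_cost,    # gap in s2
                              score[i][j - 1] - gap_cost)    # gap in s1
    return score[n][m]
```

For example, `needleman_wunsch_sketch('dva', 'deeva')` gives 1.0 (three matches minus two gaps), matching the first doctest below.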

Parameters:
  • gap_cost (float) – Cost of gap (defaults to 1.0).
  • sim_func (function) – Similarity function to give a score for each correspondence between the characters (defaults to an identity function, which returns 1 if the two characters are the same and 0 otherwise).
gap_cost

An attribute to store the gap cost.

Type:float
sim_func

An attribute to store the similarity function.

Type:function
get_gap_cost()

Get gap cost.

Returns:Gap cost (float).
get_raw_score(string1, string2)

Computes the raw Needleman-Wunsch score between two strings.

Parameters:string1,string2 (str) – Input strings.
Returns:Needleman-Wunsch similarity score (float).
Raises:TypeError – If the inputs are not strings or if one of the inputs is None.

Examples

>>> nw = NeedlemanWunsch()
>>> nw.get_raw_score('dva', 'deeva')
1.0
>>> nw = NeedlemanWunsch(gap_cost=0.0)
>>> nw.get_raw_score('dva', 'deeve')
2.0
>>> nw = NeedlemanWunsch(gap_cost=1.0, sim_func=lambda s1, s2 : (2.0 if s1 == s2 else -1.0))
>>> nw.get_raw_score('dva', 'deeve')
1.0
>>> nw = NeedlemanWunsch(gap_cost=0.5, sim_func=lambda s1, s2 : (1.0 if s1 == s2 else -1.0))
>>> nw.get_raw_score('GCATGCUA', 'GATTACA')
2.5
get_sim_func()

Get the similarity function.

Returns:similarity function (function).
set_gap_cost(gap_cost)

Set gap cost.

Parameters:gap_cost (float) – Cost of gap.
set_sim_func(sim_func)

Set similarity function.

Parameters:sim_func (function) – Similarity function to give a score for the correspondence between characters.

Overlap Coefficient

class py_stringmatching.similarity_measure.overlap_coefficient.OverlapCoefficient

Computes overlap coefficient measure.

The overlap coefficient is a similarity measure related to the Jaccard measure that measures the overlap between two sets. It is defined as the size of the intersection divided by the smaller of the sizes of the two sets. For two sets X and Y, the overlap coefficient is:

\(overlap\_coefficient(X, Y) = \frac{|X \cap Y|}{\min(|X|, |Y|)}\)

Note

  • In the case where one of X and Y is an empty set and the other is a non-empty set, we define their overlap coefficient to be 0.
  • In the case where both X and Y are empty sets, we define their overlap coefficient to be 1.
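The formula and the empty-set conventions above can be sketched in plain Python (an illustrative stand-in, not the class's actual implementation; the function name is ours):

```python
def overlap_coefficient_sketch(set1, set2):
    # Convert list inputs to sets, mirroring the class's behavior.
    x, y = set(set1), set(set2)
    if not x and not y:
        return 1.0  # both empty: defined as 1
    if not x or not y:
        return 0.0  # exactly one empty: defined as 0
    return len(x & y) / min(len(x), len(y))
```

For example, `overlap_coefficient_sketch(['data', 'science'], ['data'])` gives 1/min(2, 1) = 1.0, matching the doctest below.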
get_raw_score(set1, set2)

Computes the raw overlap coefficient score between two sets.

Parameters:set1,set2 (set or list) – Input sets (or lists). Input lists are converted to sets.
Returns:Overlap coefficient (float).
Raises:TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.

Examples

>>> oc = OverlapCoefficient()
>>> oc.get_raw_score(['data', 'science'], ['data'])
1.0
>>> oc.get_raw_score([], [])
1.0
>>> oc.get_raw_score([], ['data'])
0

get_sim_score(set1, set2)

Computes the normalized overlap coefficient between two sets. This simply calls get_raw_score, since the raw score is already normalized.

Parameters:set1,set2 (set or list) – Input sets (or lists). Input lists are converted to sets.
Returns:Normalized overlap coefficient (float).
Raises:TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.

Examples

>>> oc = OverlapCoefficient()
>>> oc.get_sim_score(['data', 'science'], ['data'])
1.0
>>> oc.get_sim_score([], [])
1.0
>>> oc.get_sim_score([], ['data'])
0

Partial Ratio

Fuzzy Wuzzy Partial Ratio Similarity Measure

class py_stringmatching.similarity_measure.partial_ratio.PartialRatio

Computes the Fuzzy Wuzzy partial ratio similarity between two strings.

Fuzzy Wuzzy partial ratio raw score is a measure of the strings' similarity as an int in the range [0, 100]. Given two strings X and Y, let the shorter string be of length m. The measure computes the fuzzy wuzzy ratio similarity between the shorter string and every substring of length m of the longer string, and returns the maximum of those similarity measures. Fuzzy Wuzzy partial ratio sim score is a float in the range [0, 1] and is obtained by dividing the raw score by 100.

Note: In the case where either of the strings X or Y is empty, we define the Fuzzy Wuzzy partial ratio similarity score to be 0.
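The sliding-window computation above can be sketched with the standard library's difflib as the underlying ratio (an illustrative sketch only; the library's FuzzyWuzzy-style implementation aligns on matching blocks rather than trying every window, and the function name is ours):

```python
from difflib import SequenceMatcher

def partial_ratio_sketch(s1, s2):
    # Slide the shorter string over the longer one and keep the best ratio.
    shorter, longer = sorted((s1, s2), key=len)
    if not shorter or not longer:
        return 0  # either string empty: defined as 0
    m = len(shorter)
    best = max(SequenceMatcher(None, shorter, longer[i:i + m]).ratio()
               for i in range(len(longer) - m + 1))
    return int(round(best * 100))
```

For example, 'Rupert' occurs verbatim inside 'Robert Rupert', so `partial_ratio_sketch('Robert Rupert', 'Rupert')` gives 100, matching the doctest below.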

get_raw_score(string1, string2)

Computes the Fuzzy Wuzzy partial ratio measure raw score between two strings. This score is in the range [0,100].

Parameters:string1,string2 (str) – Input strings
Returns:Partial Ratio measure raw score (int) is returned
Raises:TypeError – If the inputs are not strings

Examples

>>> s = PartialRatio()
>>> s.get_raw_score('Robert Rupert', 'Rupert')
100
>>> s.get_raw_score('Sue', 'sue')
67
>>> s.get_raw_score('example', 'samples')
86

get_sim_score(string1, string2)

Computes the Fuzzy Wuzzy partial ratio similarity score between two strings. This score is in the range [0,1].

Parameters:string1,string2 (str) – Input strings
Returns:Partial Ratio measure similarity score (float) is returned
Raises:TypeError – If the inputs are not strings

Examples

>>> s = PartialRatio()
>>> s.get_sim_score('Robert Rupert', 'Rupert')
1.0
>>> s.get_sim_score('Sue', 'sue')
0.67
>>> s.get_sim_score('example', 'samples')
0.86

Partial Token Sort

Fuzzy Wuzzy Token Sort Similarity Measure

class py_stringmatching.similarity_measure.partial_token_sort.PartialTokenSort

Computes Fuzzy Wuzzy partial token sort similarity measure.

Fuzzy Wuzzy partial token sort ratio raw score is a measure of the strings' similarity as an int in the range [0, 100]. For two strings X and Y, the score is obtained by splitting the two strings into tokens and then sorting the tokens. The score is then the fuzzy wuzzy partial ratio raw score of the transformed strings. Fuzzy Wuzzy partial token sort sim score is a float in the range [0, 1] and is obtained by dividing the raw score by 100.

Note:
In the case where either of the strings X or Y is empty, we define the Fuzzy Wuzzy partial token sort similarity score to be 0.
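
The token-sort transformation described above is simple: split on whitespace, sort the tokens, and rejoin; the partial ratio of the transformed strings then gives the final raw score. A minimal sketch of the transformation (the function name is ours):

```python
def token_sort_transform(s):
    # Sort whitespace-delimited tokens so word order no longer matters.
    return ' '.join(sorted(s.split()))
```

For example, `token_sort_transform('java is great')` gives 'great is java', so it compares well against 'great is scala' after both are transformed.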
get_raw_score(string1, string2, force_ascii=True, full_process=True)

Computes the Fuzzy Wuzzy partial token sort measure raw score between two strings. This score is in the range [0,100].

Parameters:
  • string1,string2 (str) – Input strings
  • force_ascii (boolean) – Flag to remove non-ascii characters or not
  • full_process (boolean) – Flag to process the string or not. Processing includes removing non-alphanumeric characters, converting the string to lower case, and removing leading and trailing whitespace.
Returns:

Partial Token Sort measure raw score (int) is returned

Raises:

TypeError – If the inputs are not strings

Examples

>>> s = PartialTokenSort()
>>> s.get_raw_score('great is scala', 'java is great')
81
>>> s.get_raw_score('Sue', 'sue')
100
>>> s.get_raw_score('C++ and Java', 'Java and Python')
64

get_sim_score(string1, string2, force_ascii=True, full_process=True)

Computes the Fuzzy Wuzzy partial token sort similarity score between two strings. This score is in the range [0,1].

Parameters:
  • string1,string2 (str) – Input strings
  • force_ascii (boolean) – Flag to remove non-ascii characters or not
  • full_process (boolean) – Flag to process the string or not. Processing includes removing non-alphanumeric characters, converting the string to lower case, and removing leading and trailing whitespace.
Returns:

Partial Token Sort measure similarity score (float) is returned

Raises:

TypeError – If the inputs are not strings

Examples

>>> s = PartialTokenSort()
>>> s.get_sim_score('great is scala', 'java is great')
0.81
>>> s.get_sim_score('Sue', 'sue')
1.0
>>> s.get_sim_score('C++ and Java', 'Java and Python')
0.64

Ratio

Fuzzy Wuzzy Ratio Similarity Measure

class py_stringmatching.similarity_measure.ratio.Ratio

Computes Fuzzy Wuzzy ratio similarity measure.

Fuzzy Wuzzy ratio raw score is a measure of the strings' similarity as an int in the range [0, 100]. For two strings X and Y, the score is defined by int(round((2.0 * M / T) * 100)), where T is the total number of characters in both strings and M is the number of matches in the two strings. Fuzzy Wuzzy ratio sim score is a float in the range [0, 1] and is obtained by dividing the raw score by 100.

Note:
In the case where either of the strings X or Y is empty, we define the Fuzzy Wuzzy ratio similarity score to be 0.
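
The int(round((2.0 * M / T) * 100)) formula above can be sketched using the standard library's difflib, whose SequenceMatcher.ratio() computes exactly 2M/T over its matching blocks (an illustrative sketch; the library's own implementation may differ in preprocessing, and the function name is ours):

```python
from difflib import SequenceMatcher

def ratio_sketch(s1, s2):
    if not s1 or not s2:
        return 0  # either string empty: defined as 0
    # ratio() returns 2M/T, where M counts matched characters
    # and T = len(s1) + len(s2).
    return int(round(SequenceMatcher(None, s1, s2).ratio() * 100))
```

For example, 'Robert' and 'Rupert' share the matches 'R' and 'ert' (M = 4, T = 12), so `ratio_sketch('Robert', 'Rupert')` gives 67, matching the doctest below.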
get_raw_score(string1, string2)

Computes the Fuzzy Wuzzy ratio measure raw score between two strings. This score is in the range [0,100].

Parameters:string1,string2 (str) – Input strings
Returns:Ratio measure raw score (int) is returned
Raises:TypeError – If the inputs are not strings

Examples

>>> s = Ratio()
>>> s.get_raw_score('Robert', 'Rupert')
67
>>> s.get_raw_score('Sue', 'sue')
67
>>> s.get_raw_score('example', 'samples')
71

get_sim_score(string1, string2)

Computes the Fuzzy Wuzzy ratio similarity score between two strings. This score is in the range [0,1].

Parameters:string1,string2 (str) – Input strings
Returns:Ratio measure similarity score (float) is returned
Raises:TypeError – If the inputs are not strings

Examples

>>> s = Ratio()
>>> s.get_sim_score('Robert', 'Rupert')
0.67
>>> s.get_sim_score('Sue', 'sue')
0.67
>>> s.get_sim_score('example', 'samples')
0.71

Smith Waterman

class py_stringmatching.similarity_measure.smith_waterman.SmithWaterman(gap_cost=1.0, sim_func=identity_function)

Computes Smith-Waterman measure.

The Smith-Waterman algorithm performs local sequence alignment; that is, for determining similar regions between two strings. Instead of looking at the total sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure. See the string matching chapter in the DI book (Principles of Data Integration).
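Local alignment differs from the global Needleman-Wunsch alignment in that cell scores are floored at zero and the best cell anywhere in the table is returned, so only the best-matching region counts. An illustrative sketch with the default identity similarity and gap cost (the function name is ours; the package's actual implementation is in Cython):

```python
def smith_waterman_sketch(s1, s2, gap_cost=1.0,
                          sim_func=lambda a, b: 1.0 if a == b else 0.0):
    n, m = len(s1), len(s2)
    # h[i][j]: best local alignment score ending at s1[i-1], s2[j-1].
    h = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            h[i][j] = max(0.0,  # a local alignment may start fresh here
                          h[i - 1][j - 1] + sim_func(s1[i - 1], s2[j - 1]),
                          h[i - 1][j] - gap_cost,    # gap in s2
                          h[i][j - 1] - gap_cost)    # gap in s1
            best = max(best, h[i][j])
    return best
```

For example, `smith_waterman_sketch('cat', 'hat')` gives 2.0, from the shared local region 'at', matching the first doctest below.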

Parameters:
  • gap_cost (float) – Cost of gap (defaults to 1.0).
  • sim_func (function) – Similarity function to give a score for the correspondence between the characters (defaults to an identity function, which returns 1 if the two characters are the same and 0 otherwise).
gap_cost

An attribute to store the gap cost.

Type:float
sim_func

An attribute to store the similarity function.

Type:function
get_gap_cost()

Get gap cost.

Returns:Gap cost (float).
get_raw_score(string1, string2)

Computes the raw Smith-Waterman score between two strings.

Parameters:string1,string2 (str) – Input strings.
Returns:Smith-Waterman similarity score (float).
Raises:TypeError – If the inputs are not strings or if one of the inputs is None.

Examples

>>> sw = SmithWaterman()
>>> sw.get_raw_score('cat', 'hat')
2.0
>>> sw = SmithWaterman(gap_cost=2.2)
>>> sw.get_raw_score('dva', 'deeve')
1.0
>>> sw = SmithWaterman(gap_cost=1, sim_func=lambda s1, s2 : (2 if s1 == s2 else -1))
>>> sw.get_raw_score('dva', 'deeve')
2.0
>>> sw = SmithWaterman(gap_cost=1.4, sim_func=lambda s1, s2 : (1.5 if s1 == s2 else 0.5))
>>> sw.get_raw_score('GCATAGCU', 'GATTACA')
6.5
get_sim_func()

Get similarity function.

Returns:Similarity function (function).
set_gap_cost(gap_cost)

Set gap cost.

Parameters:gap_cost (float) – Cost of gap.
set_sim_func(sim_func)

Set similarity function.

Parameters:sim_func (function) – Similarity function to give a score for the correspondence between the characters.

Soft TF/IDF

class py_stringmatching.similarity_measure.soft_tfidf.SoftTfIdf(corpus_list=None, sim_func=jaro_function, threshold=0.5)

Computes soft TF/IDF measure.

Note

Currently, this measure is implemented without dampening. This is similar to setting the dampen flag to False in TF-IDF. We plan to add the dampen flag in the next release.

Parameters:
  • corpus_list (list of lists) – Corpus list of strings (defaults to None). If set to None, the input lists are considered the only corpus.
  • sim_func (function) – Secondary similarity function. This should return a similarity score between two strings (optional), default is the Jaro similarity measure.
  • threshold (float) – Threshold value for the secondary similarity function (defaults to 0.5). If the similarity of a token pair exceeds the threshold, then the token pair is considered a match.
sim_func

An attribute to store the secondary similarity function.

Type:function
threshold

An attribute to store the threshold value for the secondary similarity function.

Type:float
get_corpus_list()

Get corpus list.

Returns:corpus list (list of lists).
get_raw_score(bag1, bag2)

Computes the raw soft TF/IDF score between two lists given the corpus information.

Parameters:bag1,bag2 (list) – Input lists
Returns:Soft TF/IDF score between the input lists (float).
Raises:TypeError – If the inputs are not lists or if one of the inputs is None.

Examples

>>> soft_tfidf = SoftTfIdf([['a', 'b', 'a'], ['a', 'c'], ['a']], sim_func=Jaro().get_raw_score, threshold=0.8)
>>> soft_tfidf.get_raw_score(['a', 'b', 'a'], ['a', 'c'])
0.17541160386140586
>>> soft_tfidf = SoftTfIdf([['a', 'b', 'a'], ['a', 'c'], ['a']], threshold=0.9)
>>> soft_tfidf.get_raw_score(['a', 'b', 'a'], ['a'])
0.5547001962252291
>>> soft_tfidf = SoftTfIdf([['x', 'y'], ['w'], ['q']])
>>> soft_tfidf.get_raw_score(['a', 'b', 'a'], ['a'])
0.0
>>> soft_tfidf = SoftTfIdf(sim_func=Affine().get_raw_score, threshold=0.6)
>>> soft_tfidf.get_raw_score(['aa', 'bb', 'a'], ['ab', 'ba'])
0.81649658092772592

References

  • the string matching chapter of the “Principles of Data Integration” book.
get_sim_func()

Get secondary similarity function.

Returns:secondary similarity function (function).
get_threshold()

Get threshold used for the secondary similarity function.

Returns:threshold (float).
set_corpus_list(corpus_list)

Set corpus list.

Parameters:corpus_list (list of lists) – Corpus list.
set_sim_func(sim_func)

Set secondary similarity function.

Parameters:sim_func (function) – Secondary similarity function.
set_threshold(threshold)

Set threshold value for the secondary similarity function.

Parameters:threshold (float) – threshold value.

Soundex

Soundex phonetic similarity measure

class py_stringmatching.similarity_measure.soundex.Soundex

Soundex phonetic similarity measure class.

get_raw_score(string1, string2)

Computes the Soundex phonetic similarity between two strings.

Phonetic measures such as Soundex match strings based on their sound. These measures have been especially effective in matching names, since names are often spelled in different ways that sound the same. For example, Meyer, Meier, and Mire sound the same, as do Smith, Smithe, and Smythe.

Soundex is used primarily to match surnames. It does not work as well for names of East Asian origins, because much of the discriminating power of these names resides in the vowel sounds, which the code ignores.
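As context for the examples below, the classic Soundex coding can be sketched as follows. This is a simplified rendering of the standard algorithm, not necessarily the package's exact implementation:

```python
# Simplified classic Soundex coding (a sketch, not the package's code).
def soundex(name):
    codes = {}
    for group, digit in [('bfpv', '1'), ('cgjkqsxz', '2'), ('dt', '3'),
                         ('l', '4'), ('mn', '5'), ('r', '6')]:
        for ch in group:
            codes[ch] = digit
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ''
    first = letters[0].upper()       # the code keeps the first letter
    digits = []
    prev = codes.get(letters[0], '')
    for ch in letters[1:]:
        d = codes.get(ch, '')        # vowels, h, w, y have no digit
        if d and d != prev:          # adjacent duplicate codes collapse
            digits.append(d)
        if ch not in 'hw':           # 'h'/'w' do not separate duplicates
            prev = d
    # pad with zeros / truncate to the usual 4-character code
    return (first + ''.join(digits) + '000')[:4]

def soundex_match(s1, s2):
    # 1 if the two strings share a Soundex code, else 0
    return int(soundex(s1) == soundex(s2))
```

Under this coding, Robert and Rupert both map to R163 (hence a score of 1 below), while Gough (G200) and Goff (G100) differ.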

Parameters:string1,string2 (str) – Input strings
Returns:Soundex similarity score (int) is returned
Raises:TypeError – If the inputs are not strings

Examples

>>> s = Soundex()
>>> s.get_raw_score('Robert', 'Rupert')
1
>>> s.get_raw_score('Sue', 's')
1
>>> s.get_raw_score('Gough', 'Goff')
0
>>> s.get_raw_score('a,,li', 'ali')
1
get_sim_score(string1, string2)

Computes the normalized Soundex similarity between two strings.

Parameters:string1,string2 (str) – Input strings
Returns:Normalized Soundex similarity (int)
Raises:TypeError – If the inputs are not strings or if one of the inputs is None.

Examples

>>> s = Soundex()
>>> s.get_sim_score('Robert', 'Rupert')
1
>>> s.get_sim_score('Sue', 's')
1
>>> s.get_sim_score('Gough', 'Goff')
0
>>> s.get_sim_score('a,,li', 'ali')
1

TF/IDF

class py_stringmatching.similarity_measure.tfidf.TfIdf(corpus_list=None, dampen=True)

Computes TF/IDF measure.

This measure employs the notion of TF/IDF score commonly used in information retrieval (IR) to find documents that are relevant to keyword queries. The intuition underlying the TF/IDF measure is that two strings are similar if they share distinguishing terms. See the string matching chapter in the book “Principles of Data Integration”

Parameters:
  • corpus_list (list of lists) – The corpus that will be used to compute TF and IDF values. This corpus is a list of strings, where each string has been tokenized into a list of tokens (that is, a bag of tokens). The default is set to None. In this case, when we call this TF/IDF measure on two input strings (using get_raw_score or get_sim_score), the corpus is taken to be the list of those two strings.
  • dampen (boolean) – Flag to indicate whether ‘log’ should be used in TF and IDF formulas (defaults to True).
dampen

An attribute to store the dampen flag.

Type:boolean
get_corpus_list()

Get corpus list.

Returns:corpus list (list of lists).
get_dampen()

Get dampen flag.

Returns:dampen flag (boolean).
get_raw_score(bag1, bag2)

Computes the raw TF/IDF score between two lists.

Parameters:bag1,bag2 (list) – Input lists.
Returns:TF/IDF score between the input lists (float).
Raises:TypeError – If the inputs are not lists or if one of the inputs is None.

Examples

>>> # here the corpus is a list of three strings that
>>> # have been tokenized into three lists of tokens
>>> tfidf = TfIdf([['a', 'b', 'a'], ['a', 'c'], ['a']])
>>> tfidf.get_raw_score(['a', 'b', 'a'], ['b', 'c'])
0.7071067811865475
>>> tfidf.get_raw_score(['a', 'b', 'a'], ['a'])
0.0
>>> tfidf = TfIdf([['x', 'y'], ['w'], ['q']])
>>> tfidf.get_raw_score(['a', 'b', 'a'], ['a'])
0.0
>>> tfidf = TfIdf([['a', 'b', 'a'], ['a', 'c'], ['a'], ['b']], False)
>>> tfidf.get_raw_score(['a', 'b', 'a'], ['a', 'c'])
0.25298221281347033
>>> tfidf = TfIdf(dampen=False)
>>> tfidf.get_raw_score(['a', 'b', 'a'], ['a'])
0.7071067811865475
>>> tfidf = TfIdf()
>>> tfidf.get_raw_score(['a', 'b', 'a'], ['a'])
0.0
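The scores above can be reproduced by a small sketch. The dampening scheme assumed here (term weight (log(tf) + 1) · log(N/df) when dampen is True, plain tf · N/df otherwise, combined with cosine normalization) mirrors the documented behavior but is not the package's actual code:

```python
from collections import Counter
from math import log, sqrt

def tfidf_score(bag1, bag2, corpus, dampen=True):
    """Sketch of a TF/IDF cosine similarity over token bags (assumed weights)."""
    n = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))          # document frequency of each token

    def weight(token, tf):
        if not df[token]:
            return 0.0               # token absent from the corpus
        if dampen:
            return (log(tf) + 1) * log(n / df[token])
        return tf * (n / df[token])

    v1 = {t: weight(t, c) for t, c in Counter(bag1).items()}
    v2 = {t: weight(t, c) for t, c in Counter(bag2).items()}
    num = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    denom = sqrt(sum(w * w for w in v1.values())) * \
            sqrt(sum(w * w for w in v2.values()))
    return num / denom if denom else 0.0
```

When no corpus is given, the two input bags themselves serve as the corpus, matching the default-corpus behavior described above.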
get_sim_score(bag1, bag2)

Computes the normalized TF/IDF similarity score between two lists. This simply calls get_raw_score, since the raw score is already normalized.

Parameters:bag1,bag2 (list) – Input lists.
Returns:Normalized TF/IDF similarity score between the input lists (float).
Raises:TypeError – If the inputs are not lists or if one of the inputs is None.

Examples

>>> # here the corpus is a list of three strings that
>>> # have been tokenized into three lists of tokens
>>> tfidf = TfIdf([['a', 'b', 'a'], ['a', 'c'], ['a']])
>>> tfidf.get_sim_score(['a', 'b', 'a'], ['b', 'c'])
0.7071067811865475
>>> tfidf.get_sim_score(['a', 'b', 'a'], ['a'])
0.0
>>> tfidf = TfIdf([['x', 'y'], ['w'], ['q']])
>>> tfidf.get_sim_score(['a', 'b', 'a'], ['a'])
0.0
>>> tfidf = TfIdf([['a', 'b', 'a'], ['a', 'c'], ['a'], ['b']], False)
>>> tfidf.get_sim_score(['a', 'b', 'a'], ['a', 'c'])
0.25298221281347033
>>> tfidf = TfIdf(dampen=False)
>>> tfidf.get_sim_score(['a', 'b', 'a'], ['a'])
0.7071067811865475
>>> tfidf = TfIdf()
>>> tfidf.get_sim_score(['a', 'b', 'a'], ['a'])
0.0
set_corpus_list(corpus_list)

Set corpus list.

Parameters:corpus_list (list of lists) – Corpus list.
set_dampen(dampen)

Set dampen flag.

Parameters:dampen (boolean) – Flag to indicate whether ‘log’ should be applied to TF and IDF formulas.

Token Sort

Fuzzy Wuzzy Token Sort Similarity Measure

class py_stringmatching.similarity_measure.token_sort.TokenSort

Computes Fuzzy Wuzzy token sort similarity measure.

The Fuzzy Wuzzy token sort ratio raw score is a measure of the strings' similarity as an int in the range [0, 100]. For two strings X and Y, the score is obtained by splitting the two strings into tokens and then sorting the tokens. The score is then the Fuzzy Wuzzy ratio raw score of the transformed strings. The Fuzzy Wuzzy token sort sim score is a float in the range [0, 1], obtained by dividing the raw score by 100.

Note:
In the case where either of the strings X or Y is empty, we define the Fuzzy Wuzzy ratio similarity score to be 0.
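The transformation can be sketched with Python's difflib, whose SequenceMatcher underlies Fuzzy Wuzzy's pure-Python ratio. This is a sketch, not the package's exact code; the simple regex-based cleanup stands in for the full_process option described below:

```python
import re
from difflib import SequenceMatcher

def token_sort_ratio(s1, s2):
    def process(s):
        # stand-in for full_process: drop non-alphanumerics, lowercase,
        # split on whitespace, sort the tokens, and re-join
        tokens = re.sub(r'[^a-zA-Z0-9]+', ' ', s).lower().split()
        return ' '.join(sorted(tokens))
    t1, t2 = process(s1), process(s2)
    if not t1 or not t2:
        return 0                     # empty input gives score 0, per the note above
    # ratio of matched characters to total length, scaled to [0, 100]
    return int(round(100 * SequenceMatcher(None, t1, t2).ratio()))
```

Dividing the result by 100 yields the sim score in [0, 1].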
get_raw_score(string1, string2, force_ascii=True, full_process=True)

Computes the Fuzzy Wuzzy token sort measure raw score between two strings. This score is in the range [0,100].

Parameters:
  • string1,string2 (str) – Input strings
  • force_ascii (boolean) – Flag to remove non-ascii characters or not
  • full_process (boolean) – Flag to process the string or not. Processing includes removing non-alphanumeric characters, converting the string to lower case, and removing leading and trailing whitespace.
Returns:

Token Sort measure raw score (int) is returned

Raises:

TypeError – If the inputs are not strings

Examples

>>> s = TokenSort()
>>> s.get_raw_score('great is scala', 'java is great')
81
>>> s.get_raw_score('Sue', 'sue')
100
>>> s.get_raw_score('C++ and Java', 'Java and Python')
64

get_sim_score(string1, string2, force_ascii=True, full_process=True)

Computes the Fuzzy Wuzzy token sort similarity score between two strings. This score is in the range [0,1].

Parameters:
  • string1,string2 (str) – Input strings
  • force_ascii (boolean) – Flag to remove non-ascii characters or not
  • full_process (boolean) – Flag to process the string or not. Processing includes removing non-alphanumeric characters, converting the string to lower case, and removing leading and trailing whitespace.
Returns:

Token Sort measure similarity score (float) is returned

Raises:

TypeError – If the inputs are not strings

Examples

>>> s = TokenSort()
>>> s.get_sim_score('great is scala', 'java is great')
0.81
>>> s.get_sim_score('Sue', 'sue')
1.0
>>> s.get_sim_score('C++ and Java', 'Java and Python')
0.64

Tversky Index

Tversky index similarity measure

class py_stringmatching.similarity_measure.tversky_index.TverskyIndex(alpha=0.5, beta=0.5)

Tversky index similarity measure class.

Parameters:alpha, beta (float) – Tversky index parameters (each defaults to 0.5).
get_alpha()

Get alpha

Returns:alpha (float)
get_beta()

Get beta

Returns:beta (float)
get_raw_score(set1, set2)

Computes the Tversky index similarity between two sets.

The Tversky index is an asymmetric similarity measure on sets that compares a variant to a prototype. The Tversky index can be seen as a generalization of Dice’s coefficient and Tanimoto coefficient.

For sets X and Y, the Tversky index is a number between 0 and 1 given by: \(tversky\_index(X, Y) = \frac{|X \cap Y|}{|X \cap Y| + \alpha |X - Y| + \beta |Y - X|}\) where \(\alpha, \beta \ge 0\).
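A direct sketch of this index (the handling of two empty inputs here is an assumption of the sketch, not taken from the package):

```python
def tversky_index(x, y, alpha=0.5, beta=0.5):
    """Sketch of the Tversky index over two sets (or lists treated as sets)."""
    x, y = set(x), set(y)            # duplicate elements collapse
    inter = len(x & y)
    denom = inter + alpha * len(x - y) + beta * len(y - x)
    # returning 0 when the denominator is 0 is an assumption of this sketch
    return inter / denom if denom else 0.0
```

With the default alpha = beta = 0.5, the index coincides with Dice's coefficient; alpha = beta = 1 gives the Jaccard (Tanimoto) coefficient.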

Parameters:set1,set2 (set or list) – Input sets (or lists). Input lists are converted to sets.
Returns:Tversky index similarity (float)
Raises:TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.

Examples

>>> tvi = TverskyIndex()
>>> tvi.get_raw_score(['data', 'science'], ['data'])
0.6666666666666666
>>> tvi.get_raw_score(['data', 'management'], ['data', 'data', 'science'])
0.5
>>> tvi.get_raw_score({1, 1, 2, 3, 4}, {2, 3, 4, 5, 6, 7, 7, 8})
0.5454545454545454
>>> tvi = TverskyIndex(0.5, 0.5)
>>> tvi.get_raw_score({1, 1, 2, 3, 4}, {2, 3, 4, 5, 6, 7, 7, 8})
0.5454545454545454
>>> tvi = TverskyIndex(beta=0.5)
>>> tvi.get_raw_score(['data', 'management'], ['data', 'data', 'science'])
0.5
get_sim_score(set1, set2)

Computes the normalized Tversky index similarity between two sets.

Parameters:set1,set2 (set or list) – Input sets (or lists). Input lists are converted to sets.
Returns:Normalized Tversky index similarity (float)
Raises:TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.

Examples

>>> tvi = TverskyIndex()
>>> tvi.get_sim_score(['data', 'science'], ['data'])
0.6666666666666666
>>> tvi.get_sim_score(['data', 'management'], ['data', 'data', 'science'])
0.5
>>> tvi.get_sim_score({1, 1, 2, 3, 4}, {2, 3, 4, 5, 6, 7, 7, 8})
0.5454545454545454
>>> tvi = TverskyIndex(0.5, 0.5)
>>> tvi.get_sim_score({1, 1, 2, 3, 4}, {2, 3, 4, 5, 6, 7, 7, 8})
0.5454545454545454
>>> tvi = TverskyIndex(beta=0.5)
>>> tvi.get_sim_score(['data', 'management'], ['data', 'data', 'science'])
0.5
set_alpha(alpha)

Set alpha

Parameters:alpha (float) – Tversky index parameter
set_beta(beta)

Set beta

Parameters:beta (float) – Tversky index parameter

Runtime Benchmark

For this package, we provide a runtime benchmark (consisting of a script and several datasets) to measure the runtime performance of the similarity measures. Users can use this benchmark to judge whether the similarity measures are fast enough for their purposes, and developers can use it to speed up the measures.

Running the Benchmark

The user can run the benchmark as follows:

Step 1: Clone the py_stringmatching package from GitHub using the following command:

git clone https://github.com/anhaidgroup/py_stringmatching.git

Step 2: Change the working directory to py_stringmatching/benchmarks/custom_benchmarks

Step 3: Run the benchmark using the following sequence of commands:

>>> import py_stringmatching as sm
>>> from run_benchmark import *
# create an object for the similarity measure you need to benchmark
>>> jaccard = sm.Jaccard()
# create a tokenizer object (in case of token-based measures)
>>> ws = sm.WhitespaceTokenizer(return_set = True)
# Set dataset paths
>>> short_strings_path = 'datasets/short_strings.csv'
>>> medium_strings_path = 'datasets/medium_strings.csv'
>>> long_strings_path = 'datasets/long_strings.csv'
# Data size (number of string pairs) over which the benchmark should be run
>>> data_size = 10000
# Number of times to repeat
>>> num_repeat = 3
# Output file where the benchmark results should be written
>>> output_file = 'benchmark_results.csv'
# run the benchmark
>>> run_benchmark(short_strings_path, medium_strings_path, long_strings_path, jaccard.get_sim_score, ws.tokenize, data_size = data_size, num_repeat = num_repeat, output_file = output_file)

The benchmark contains three datasets in the datasets directory: (1) short_strings.csv, (2) medium_strings.csv, and (3) long_strings.csv. Each dataset contains 5000 strings. Specifically, short_strings.csv contains strings with length in the range of 2-15 (avg. of 10), medium_strings.csv contains strings with length in the range of 18-39 (avg. of 25), and long_strings.csv contains strings with length in the range of 60-1726 (avg. of 127).

The above command will run the benchmark over 9 different configurations (short-short, short-medium, short-long, medium-short, medium-medium, medium-long, long-short, long-medium, long-long) for the provided similarity measure, and write the results to the provided output file. See below for additional details.

Interpreting the Results

The benchmark results will be a CSV file containing the following information:

  • Configuration
  • Runtime (in secs) for each run of a configuration (note that each configuration is run for num_repeat times)
  • Average runtime (in secs) for each configuration

An example output file will look like this:

configuration,run_1 (in secs),run_2 (in secs),run_3 (in secs),average (in secs)
short_short,0.112642049789,0.112892866135,0.112852096558,0.112795670827
short_medium,0.115404129028,0.115512132645,0.115454912186,0.115457057953
short_long,0.194123983383,0.193922996521,0.193790912628,0.193945964177
medium_short,0.11647105217,0.116579055786,0.116438865662,0.116496324539
medium_medium,0.118470907211,0.118409156799,0.118496894836,0.118458986282
medium_long,0.206312894821,0.206974983215,0.206708908081,0.206665595373
long_short,0.205050945282,0.205410957336,0.205253124237,0.205238342285
long_medium,0.217441797256,0.21806883812,0.218235015869,0.217915217082
long_long,0.770321846008,0.76869893074,0.768806934357,0.769275903702
