User Manual for py_stringmatching¶
This document shows users how to install and use the package. To contribute to or further develop the package, see the project website, section “For Contributors and Developers”.
Contents¶
What is New?¶
Compared to Version 0.2.1, the following is new:
- Added nine new string similarity measures - Bag Distance, Editex, Generalized Jaccard, Partial Ratio, Partial Token Sort, Ratio, Soundex, Token Sort, and Tversky Index.
Installation¶
Requirements¶
- Python 2.7 or Python 3.3+
- C or C++ compiler (parts of the package are in Cython for efficiency reasons, and you need a C or C++ compiler to compile these parts)
Platforms¶
py_stringmatching has been tested on Linux (Ubuntu with Kernel Version 3.13.0-40-generic), OS X (Darwin with Kernel Version 13.4.0), and Windows 8.1.
Dependencies¶
- numpy 1.7.0 or higher
- six
Note
The py_stringmatching installer will automatically install the above required packages.
There are two ways to install the py_stringmatching package: using pip or installing from a source distribution.
Installing Using pip¶
The easiest way to install the package is to use pip, which will retrieve py_stringmatching from PyPI then install it:
pip install py_stringmatching
Installing from Source Distribution¶
Step 1: Download the py_stringmatching package from here.
Step 2: Unzip the package and execute the following command from the package root:
python setup.py install
Note
The above command will try to install py_stringmatching into the default Python directory on your machine. If you do not have installation permission for that directory, you can install the package in your home directory as follows:
python setup.py install --user
For more information see the StackOverflow link.
Tutorial¶
Once the package has been installed, you can import the package as follows:
In [1]: import py_stringmatching as sm
Computing a similarity score between two given strings x and y then typically consists of four steps: (1) selecting a similarity measure type, (2) selecting a tokenizer type, (3) creating a tokenizer object (of the selected type) and using it to tokenize the two given strings x and y, and (4) creating a similarity measure object (of the selected type) and applying it to the output of the tokenizer to compute a similarity score. We now elaborate on these steps.
1. Selecting a Similarity Measure¶
First, you must select a similarity measure. The package py_stringmatching currently provides 23 different measures (with plans to add more). Examples of such measures are Jaccard, Levenshtein, TF/IDF, etc. To understand more about these measures, a good place to start is the string matching chapter of the book “Principles of Data Integration”. (This chapter is available on the package’s homepage.)
A major group of similarity measures treats input strings as sequences of characters (e.g., Levenshtein, Smith Waterman). Another group treats input strings as sets of tokens (e.g., Jaccard). Yet another group treats input strings as bags of tokens (e.g., TF/IDF). A bag of tokens is a collection of tokens such that a token can appear multiple times in the collection (as opposed to a set of tokens, where each token can appear only once).
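To make the distinction concrete, here is a small illustration in plain Python (independent of the package): a bag can be represented by a list, which keeps duplicate tokens, while a set discards them.

# a bag of tokens keeps duplicates; a set does not
tokens_as_bag = ['up', 'up', 'and', 'away']   # the token 'up' appears twice
tokens_as_set = set(tokens_as_bag)            # {'up', 'and', 'away'}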
- For the currently implemented 23 similarity measures, we have:
  - sequence-based measures: affine gap, bag distance, editex, Hamming distance, Jaro, Jaro Winkler, Levenshtein, Needleman Wunsch, partial ratio, partial token sort, ratio, Smith Waterman, token sort.
  - set-based measures: cosine, Dice, Jaccard, overlap coefficient, Tversky Index.
  - bag-based measures: TF/IDF.
  - phonetic-based measures: soundex.
(There are also three hybrid similarity measures: Monge Elkan, Soft TF/IDF, and Generalized Jaccard. They are so called because each of these measures uses multiple similarity measures. See their descriptions in this user manual to understand what types of input they expect.)
At this point, you should know if the selected similarity measure treats input strings as sequences, bags, or sets, so that later you can set the parameters of the tokenizing function properly (see Steps 2-3 below).
2. Selecting a Tokenizer Type¶
If the above selected similarity measure treats input strings as sequences of characters, then you do not need to tokenize the input strings x and y, and hence do not have to select a tokenizer type.
Otherwise, you need to select a tokenizer type. The package py_stringmatching currently provides five different tokenizer types: alphabetical tokenizer, alphanumeric tokenizer, delimiter-based tokenizer, qgram tokenizer, and whitespace tokenizer (more tokenizer types can easily be added).
A tokenizer will convert an input string into a set or a bag of tokens, as discussed in Step 3.
3. Creating a Tokenizer Object and Using It to Tokenize the Input Strings¶
If you have selected a tokenizer type in Step 2, then in Step 3 you create a tokenizer object of that type. If the intended similarity measure (selected in Step 1) treats the input strings as sets of tokens, then when creating the tokenizer object, you must set the flag return_set to True. Otherwise this flag defaults to False, and the created tokenizer object will tokenize a string into a bag of tokens.
The following examples create tokenizer objects where the flag return_set is not mentioned, thus defaulting to False. So these tokenizer objects will tokenize a string into a bag of tokens.
# create an alphabetical tokenizer that returns a bag of tokens
In [2]: alphabet_tok = sm.AlphabeticTokenizer()
# create an alphanumeric tokenizer
In [3]: alnum_tok = sm.AlphanumericTokenizer()
# create a delimiter tokenizer using comma as a delimiter
In [4]: delim_tok = sm.DelimiterTokenizer(delim_set=[','])
# create a qgram tokenizer using q=3
In [5]: qg3_tok = sm.QgramTokenizer(qval=3)
# create a whitespace tokenizer
In [6]: ws_tok = sm.WhitespaceTokenizer()
Given the string “up up and away”, the tokenizer alphabet_tok (defined above) will convert it into a bag of tokens [‘up’, ‘up’, ‘and’, ‘away’], where the token ‘up’ appears twice.
The following examples create tokenizer objects where the flag return_set is set to True. Thus these tokenizers will tokenize a string into a set of tokens.
# create an alphabetical tokenizer that returns a set of tokens
In [7]: alphabet_tok_set = sm.AlphabeticTokenizer(return_set=True)
# create a whitespace tokenizer that returns a set of tokens
In [8]: ws_tok_set = sm.WhitespaceTokenizer(return_set=True)
# create a qgram tokenizer with q=3 that returns a set of tokens
In [9]: qg3_tok_set = sm.QgramTokenizer(qval=3, return_set=True)
So given the same string “up up and away”, the tokenizer alphabet_tok_set (defined above) will convert it into a set of tokens [‘up’, ‘and’, ‘away’].
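These two behaviors can be checked directly with the tokenizers defined above (a small illustration):

>>> alphabet_tok.tokenize('up up and away')
['up', 'up', 'and', 'away']
>>> alphabet_tok_set.tokenize('up up and away')
['up', 'and', 'away']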
All tokenizers have a tokenize method which tokenizes a given input string into a set or bag of tokens (depending on whether the flag return_set is True or False), as these examples illustrate:
In [10]: test_string = ' .hello, world!! data, science, is amazing!!. hello.'
# tokenize into a bag of alphabetical tokens
In [11]: alphabet_tok.tokenize(test_string)
Out[11]: ['hello', 'world', 'data', 'science', 'is', 'amazing', 'hello']
# tokenize into alphabetical tokens (with return_set set to True)
In [12]: alphabet_tok_set.tokenize(test_string)
Out[12]: ['hello', 'world', 'data', 'science', 'is', 'amazing']
# tokenize using comma as the delimiter
In [13]: delim_tok.tokenize(test_string)
Out[13]: [' .hello', ' world!! data', ' science', ' is amazing!!. hello.']
# tokenize using whitespace as the delimiter
In [14]: ws_tok.tokenize(test_string)
Out[14]: ['.hello,', 'world!!', 'data,', 'science,', 'is', 'amazing!!.', 'hello.']
Thus, once you have created the tokenizer, you can use the tokenize method to tokenize the two input strings x and y (see more in Step 4 below).
Note
The tokenize method returns a Python list which represents a set of tokens or a bag of tokens, depending on whether the flag return_set is True or False.
4. Creating a Similarity Measure Object and Using It to Compute a Similarity Score¶
Recall that in Step 1 you have selected a similarity measure (e.g., Jaccard, Levenshtein). In this step you start by creating a similarity measure object of the selected type, as illustrated by these examples:
# create a Jaccard similarity measure object
In [15]: jac = sm.Jaccard()
# create a Levenshtein similarity measure object
In [16]: lev = sm.Levenshtein()
There are two main types of similarity measures.
- Those that when given two input strings will compute a true similarity score, which is a number in the range [0,1] such that the higher this number, the more similar the two input strings are.
- Those that when given two input strings will compute a distance score, which is a number such that the higher this number, the more dissimilar the two input strings are (this number is often not in the range [0,1]). Clearly, Type-2 measures (also known as distance measures) are the reverse of Type-1 measures.
For example, the Jaccard similarity measure will compute a true similarity score in [0,1] for two input strings. The Levenshtein similarity measure, on the other hand, is really a distance measure, which computes the edit distance between the two input strings (see for example Wikipedia or the string matching chapter in the book “Principles of Data Integration”). It is easy to convert a distance score into a true similarity score (again, see examples in the above book chapter).
Given the above, each similarity measure object in py_stringmatching is supplied with two methods: get_raw_score and get_sim_score. The first method will compute the raw score as defined by that type of similarity measure, be it a similarity score or a distance score. For example, for Jaccard this method will return a true similarity score, whereas for Levenshtein it will return an edit distance score.
The method get_sim_score normalizes the raw score to obtain a true similarity score (a number in [0,1], such that the higher this number the more similar the two strings are). For Jaccard, get_sim_score will simply call get_raw_score. For Levenshtein, however, get_sim_score will normalize the edit distance to return a true similarity score in [0,1].
Here are some examples of using the get_raw_score method:
# input strings
In [17]: x = 'string matching package'
In [18]: y = 'string matching library'
# compute Jaccard score over sets of tokens of x and y, tokenized using whitespace
In [19]: jac.get_raw_score(ws_tok_set.tokenize(x), ws_tok_set.tokenize(y))
Out[19]: 0.5
# compute Jaccard score over sets of tokens of x and y, tokenized into qgrams (with q=3)
In [20]: jac.get_raw_score(qg3_tok_set.tokenize(x), qg3_tok_set.tokenize(y))
Out[20]: 0.4375
# compute Levenshtein distance between x and y
In [21]: lev.get_raw_score(x, y)
Out[21]: 6
Note that in the above examples, the Jaccard measure treats the input strings as sets of tokens. And indeed, the two tokenizers ws_tok_set and qg3_tok_set as defined earlier would tokenize a string into a set of tokens. The Levenshtein measure, on the other hand, treats the input strings as sequences of characters. Hence when using it we do not have to tokenize the two strings x and y.
Here are some examples of using the get_sim_score method:
# get normalized Levenshtein similarity score between x and y
In [22]: lev.get_sim_score(x, y)
Out[22]: 0.7391304347826086
# get normalized Jaccard similarity score (this is the same as the raw score)
In [23]: jac.get_sim_score(ws_tok_set.tokenize(x), ws_tok_set.tokenize(y))
Out[23]: 0.5
So depending on what you want, you can call get_raw_score or get_sim_score. Note, however, that certain measures such as affine gap, Monge-Elkan, Needleman-Wunsch, Smith-Waterman and Soft TF/IDF do not have a get_sim_score method, because there is no straightforward way to normalize the raw scores of these measures into similarity scores in [0,1] (see the Developer Manual for further explanation).
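If your application nevertheless needs a [0,1] score for one of these measures, a possible workaround (not provided by the package, and shown here only as a sketch) is to rescale the raw scores over the set of pairs you are scoring:

# min-max rescaling of raw scores to [0,1]; the pairs below are made up
nw = sm.NeedlemanWunsch()
pairs = [('dva', 'deeva'), ('dva', 'deeve'), ('data', 'date')]
raw_scores = [nw.get_raw_score(a, b) for a, b in pairs]
low, high = min(raw_scores), max(raw_scores)
rescaled = [(r - low) / (high - low) if high > low else 1.0 for r in raw_scores]

Note that such a rescaling is relative to the chosen pairs, unlike the absolute scores returned by get_sim_score.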
Handling a Large Number of String Pairs¶
Steps 1-4 above discuss the case where you want to compute the similarity score of only a single string pair.
There are however cases where you need to compute the similarity scores of many string pairs. For example, given a table A of 10K strings and a table B of 10K strings, you may need to compute the string similarity scores for all 100M string pairs in the Cartesian product of the two tables.
In such cases, you should avoid tokenizing the same string repeatedly, such as calling jac.get_sim_score(ws_tok_set.tokenize(x), ws_tok_set.tokenize(y)) for all pairs (x,y) in the Cartesian product. If you do this, a string x in table A will be tokenized 10K times, since it will appear in 10K pairs. This is clearly unnecessary and very expensive.
Instead, you should tokenize all strings in tables A and B only once, store the output of tokenizing in some Python structure, then call the similarity measure on these structures to compute similarity scores. This will avoid repeated tokenizing of the same strings.
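The following sketch illustrates this approach (the two tables here are small made-up lists; in practice they might come from files or data frames):

import py_stringmatching as sm

A = ['string matching package', 'data integration']   # stand-in for table A
B = ['string matching library', 'data cleaning']      # stand-in for table B

ws_tok_set = sm.WhitespaceTokenizer(return_set=True)
jac = sm.Jaccard()

# tokenize every string exactly once and cache the results
A_tokens = [ws_tok_set.tokenize(x) for x in A]
B_tokens = [ws_tok_set.tokenize(y) for y in B]

# score all pairs in the Cartesian product using the cached tokens
scores = [(x, y, jac.get_sim_score(tx, ty))
          for x, tx in zip(A, A_tokens)
          for y, ty in zip(B, B_tokens)]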
Handling Missing Values¶
By “missing values” we mean cases where the values of one or more strings are missing (e.g., represented as None or NaN in Python). For example, given a row “David,,36” in a CSV file, the value of the second cell of this row is missing. So when this file is read into a data frame, the corresponding cell in the data frame will have the value NaN. Note that missing values are different from empty string values, which are represented as “”.
Handling missing values is tricky and application dependent (see the Developer Manual for a detailed discussion). For these reasons, the tokenizers and similarity measures in the package py_stringmatching do not handle missing values. If one of their input arguments is missing, they will stop, raising an error. Put differently, they expect non-missing input arguments.
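For example, if your strings come from a pandas data frame, you can guard against missing values yourself before calling the package. The following is only a sketch (how to handle a pair with a missing value is up to your application; here we return None):

import pandas as pd

def safe_sim_score(x, y, tokenizer, measure):
    # treat a missing value as "score unknown" instead of raising an error
    if pd.isnull(x) or pd.isnull(y):
        return None
    return measure.get_sim_score(tokenizer.tokenize(x), tokenizer.tokenize(y))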
Adding Prefix and Suffix to the Input String for Qgram Tokenizers¶
Consider computing a similarity score between two strings “mo” and “moo” using 3gram tokenizing followed by Jaccard scoring. Tokenizing “mo” returns an empty set, because “mo” contains no 3gram. Tokenizing “moo” returns the set {“moo”}. As a result, the Jaccard score between “mo” and “moo” is 0. This is somewhat counterintuitive, because the two strings are similar.
To address such cases, in practice it is common to add a prefix of (q-1) characters (using #) and a suffix of (q-1) characters (using $) to the input string, before generating qgram tokens. For example, “moo” will be padded to be “##moo$$”, before tokenizing. The flag “padding” in qgram tokenizers can be set for this purpose (the default is True, in which case the string will be padded).
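The effect on the example above can be seen directly (a small illustration; recall that the Jaccard measure converts its input lists to sets):

>>> import py_stringmatching as sm
>>> qg3_tok = sm.QgramTokenizer(qval=3)                      # padding=True by default
>>> qg3_tok.tokenize('mo')
['##m', '#mo', 'mo$', 'o$$']
>>> sm.QgramTokenizer(qval=3, padding=False).tokenize('mo')
[]
>>> jac = sm.Jaccard()
>>> jac.get_sim_score(qg3_tok.tokenize('mo'), qg3_tok.tokenize('moo'))
0.5

With padding, the two strings share the qgrams '##m', '#mo', and 'o$$', so the score is no longer 0.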
Class Hierarchy for Tokenizers and Similarity Measures¶
The current version implements the following class hierarchy for tokenizers:
- Tokenizer
  - DefinitionTokenizer
    - AlphabeticTokenizer
    - AlphanumericTokenizer
    - QgramTokenizer
  - DelimiterTokenizer
    - WhitespaceTokenizer
It also implements the following class hierarchy for similarity measures:
- SimilarityMeasure
  - SequenceSimilarityMeasure
    - Affine
    - BagDistance
    - Editex
    - HammingDistance
    - Jaro
    - JaroWinkler
    - Levenshtein
    - NeedlemanWunsch
    - PartialRatio
    - PartialTokenSort
    - Ratio
    - SmithWaterman
    - TokenSort
  - TokenSimilarityMeasure
    - Cosine
    - Dice
    - Jaccard
    - OverlapCoefficient
    - TfIdf
    - TverskyIndex
  - HybridSimilarityMeasure
    - GeneralizedJaccard
    - MongeElkan
    - SoftTfIdf
  - PhoneticSimilarityMeasure
    - Soundex
References¶
AnHai Doan, Alon Halevy, Zachary Ives, “Principles of Data Integration”, Morgan Kaufmann, 2012. Chapter 4 “String Matching” (available on the package’s homepage).
Tokenizers¶
Alphabetic Tokenizer¶
class py_stringmatching.tokenizer.alphabetic_tokenizer.AlphabeticTokenizer(return_set=False)¶
Returns tokens that are maximal sequences of consecutive alphabetical characters.
Parameters: return_set (boolean) – A flag to indicate whether to return a set of tokens instead of a bag of tokens (defaults to False).
return_set (boolean)¶
An attribute that stores the value for the flag return_set.
get_return_set()¶
Gets the value of the return_set flag.
Returns: The boolean value of the return_set flag.
set_return_set(return_set)¶
Sets the value of the return_set flag.
Parameters: return_set (boolean) – A flag to indicate whether to return a set of tokens instead of a bag of tokens.
tokenize(input_string)¶
Tokenizes the input string into alphabetical tokens.
Parameters: input_string (str) – The string to be tokenized.
Returns: A Python list, which represents a set of tokens if the flag return_set is True, and a bag of tokens otherwise.
Raises: TypeError – If the input is not a string.
Examples
>>> al_tok = AlphabeticTokenizer()
>>> al_tok.tokenize('data99science, data#integration.')
['data', 'science', 'data', 'integration']
>>> al_tok.tokenize('99')
[]
>>> al_tok = AlphabeticTokenizer(return_set=True)
>>> al_tok.tokenize('data99science, data#integration.')
['data', 'science', 'integration']
Alphanumeric Tokenizer¶
class py_stringmatching.tokenizer.alphanumeric_tokenizer.AlphanumericTokenizer(return_set=False)¶
Returns tokens that are maximal sequences of consecutive alphanumeric characters.
Parameters: return_set (boolean) – A flag to indicate whether to return a set of tokens instead of a bag of tokens (defaults to False).
return_set (boolean)¶
An attribute to store the value of the flag return_set.
get_return_set()¶
Gets the value of the return_set flag.
Returns: The boolean value of the return_set flag.
set_return_set(return_set)¶
Sets the value of the return_set flag.
Parameters: return_set (boolean) – A flag to indicate whether to return a set of tokens instead of a bag of tokens.
tokenize(input_string)¶
Tokenizes the input string into alphanumeric tokens.
Parameters: input_string (str) – The string to be tokenized.
Returns: A Python list, which represents a set of tokens if the flag return_set is True, and a bag of tokens otherwise.
Raises: TypeError – If the input is not a string.
Examples
>>> alnum_tok = AlphanumericTokenizer()
>>> alnum_tok.tokenize('data9,(science), data9#.(integration).88')
['data9', 'science', 'data9', 'integration', '88']
>>> alnum_tok.tokenize('#.&')
[]
>>> alnum_tok = AlphanumericTokenizer(return_set=True)
>>> alnum_tok.tokenize('data9,(science), data9#.(integration).88')
['data9', 'science', 'integration', '88']
Delimiter Tokenizer¶
class py_stringmatching.tokenizer.delimiter_tokenizer.DelimiterTokenizer(delim_set=set([' ']), return_set=False)¶
Uses delimiters to find tokens, as opposed to using definitions.
Examples of delimiters include white space and punctuations. Examples of definitions include alphabetical and qgram tokens.
Parameters:
- delim_set (set) – A set of delimiter strings (defaults to space delimiter).
- return_set (boolean) – A flag to indicate whether to return a set of tokens instead of a bag of tokens (defaults to False).
return_set (boolean)¶
An attribute to store the value of the flag return_set.
get_delim_set()¶
Gets the current set of delimiters.
Returns: A Python set which is the current set of delimiters.
get_return_set()¶
Gets the value of the return_set flag.
Returns: The boolean value of the return_set flag.
set_delim_set(delim_set)¶
Sets the current set of delimiters.
Parameters: delim_set (set) – A set of delimiter strings.
set_return_set(return_set)¶
Sets the value of the return_set flag.
Parameters: return_set (boolean) – A flag to indicate whether to return a set of tokens instead of a bag of tokens.
tokenize(input_string)¶
Tokenizes the input string based on the set of delimiters.
Parameters: input_string (str) – The string to be tokenized.
Returns: A Python list which is a set or a bag of tokens, depending on whether the return_set flag is True or False.
Raises: TypeError – If the input is not a string.
Examples
>>> delim_tok = DelimiterTokenizer()
>>> delim_tok.tokenize('data science')
['data', 'science']
>>> delim_tok = DelimiterTokenizer(['$#$'])
>>> delim_tok.tokenize('data$#$science')
['data', 'science']
>>> delim_tok = DelimiterTokenizer([',', '.'])
>>> delim_tok.tokenize('data,science.data,integration.')
['data', 'science', 'data', 'integration']
>>> delim_tok = DelimiterTokenizer([',', '.'], return_set=True)
>>> delim_tok.tokenize('data,science.data,integration.')
['data', 'science', 'integration']
Qgram Tokenizer¶
class py_stringmatching.tokenizer.qgram_tokenizer.QgramTokenizer(qval=2, padding=True, prefix_pad='#', suffix_pad='$', return_set=False)¶
Returns tokens that are sequences of q consecutive characters.
A qgram of an input string s is a substring t (of s) which is a sequence of q consecutive characters. Qgrams are also known as ngrams or kgrams.
Parameters:
- qval (int) – A value for q, that is, the qgram’s length (defaults to 2).
- return_set (boolean) – A flag to indicate whether to return a set of tokens or a bag of tokens (defaults to False).
- padding (boolean) – A flag to indicate whether a prefix and a suffix should be added to the input string (defaults to True).
- prefix_pad (str) – A character (that is, a string of length 1 in Python) that should be replicated (qval-1) times and prepended to the input string, if padding is set to True (defaults to ‘#’).
- suffix_pad (str) – A character (that is, a string of length 1 in Python) that should be replicated (qval-1) times and appended to the input string, if padding is set to True (defaults to ‘$’).
qval (int)¶
An attribute to store the q value.
return_set (boolean)¶
An attribute to store the flag return_set.
padding (boolean)¶
An attribute to store the padding flag.
prefix_pad (str)¶
An attribute to store the prefix string that should be used for padding.
suffix_pad (str)¶
An attribute to store the suffix string that should be used for padding.
get_padding()¶
Gets the value of the padding flag. This flag determines whether the padding should be done for the input strings or not.
Returns: The boolean value of the padding flag.
get_qval()¶
Gets the value of the qval attribute, which is the length of qgrams.
Returns: The value of the qval attribute.
get_return_set()¶
Gets the value of the return_set flag.
Returns: The boolean value of the return_set flag.
set_padding(padding)¶
Sets the value of the padding flag.
Parameters: padding (boolean) – A flag to indicate whether padding should be done or not.
Returns: The boolean value True is returned if the update was successful.
Raises: AssertionError – If padding is not of type boolean.
set_prefix_pad(prefix_pad)¶
Sets the value of the prefix pad string.
Parameters: prefix_pad (str) – String that should be prepended to the input string before tokenization.
Returns: The boolean value True is returned if the update was successful.
Raises: AssertionError – If prefix_pad is not of type string. AssertionError – If the length of prefix_pad is not one.
set_qval(qval)¶
Sets the value of the qval attribute.
Parameters: qval (int) – A value for q (the length of qgrams).
Raises: AssertionError – If qval is less than 1.
set_return_set(return_set)¶
Sets the value of the return_set flag.
Parameters: return_set (boolean) – A flag to indicate whether to return a set of tokens instead of a bag of tokens.
set_suffix_pad(suffix_pad)¶
Sets the value of the suffix pad string.
Parameters: suffix_pad (str) – String that should be appended to the input string before tokenization.
Returns: The boolean value True is returned if the update was successful.
Raises: AssertionError – If suffix_pad is not of type string. AssertionError – If the length of suffix_pad is not one.
tokenize(input_string)¶
Tokenizes the input string into qgrams.
Parameters: input_string (str) – The string to be tokenized.
Returns: A Python list, which is a set or a bag of qgrams, depending on whether the return_set flag is True or False.
Raises: TypeError – If the input is not a string.
Examples
>>> qg2_tok = QgramTokenizer()
>>> qg2_tok.tokenize('database')
['#d', 'da', 'at', 'ta', 'ab', 'ba', 'as', 'se', 'e$']
>>> qg2_tok.tokenize('a')
['#a', 'a$']
>>> qg3_tok = QgramTokenizer(qval=3)
>>> qg3_tok.tokenize('database')
['##d', '#da', 'dat', 'ata', 'tab', 'aba', 'bas', 'ase', 'se$', 'e$$']
>>> qg2_nopad = QgramTokenizer(padding=False)
>>> qg2_nopad.tokenize('database')
['da', 'at', 'ta', 'ab', 'ba', 'as', 'se']
>>> qg2_diffpads = QgramTokenizer(prefix_pad='^', suffix_pad='!')
>>> qg2_diffpads.tokenize('database')
['^d', 'da', 'at', 'ta', 'ab', 'ba', 'as', 'se', 'e!']
Whitespace Tokenizer¶
class py_stringmatching.tokenizer.whitespace_tokenizer.WhitespaceTokenizer(return_set=False)¶
Segments the input string using whitespace, then returns the segments as tokens.
This tokenizer currently uses the split function in Python, so “whitespace” refers to the space character as well as the tab and newline characters.
Parameters: return_set (boolean) – A flag to indicate whether to return a set of tokens instead of a bag of tokens (defaults to False).
return_set (boolean)¶
An attribute to store the flag return_set.
get_delim_set()¶
Gets the current set of delimiters.
Returns: A Python set which is the current set of delimiters.
get_return_set()¶
Gets the value of the return_set flag.
Returns: The boolean value of the return_set flag.
set_return_set(return_set)¶
Sets the value of the return_set flag.
Parameters: return_set (boolean) – A flag to indicate whether to return a set of tokens instead of a bag of tokens.
tokenize(input_string)¶
Tokenizes the input string based on whitespace.
Parameters: input_string (str) – The string to be tokenized.
Returns: A Python list, which is a set or a bag of tokens, depending on whether return_set is True or False.
Raises: TypeError – If the input is not a string.
Examples
>>> ws_tok = WhitespaceTokenizer()
>>> ws_tok.tokenize('data science')
['data', 'science']
>>> ws_tok.tokenize('data        science')
['data', 'science']
>>> ws_tok.tokenize('data\tscience')
['data', 'science']
>>> ws_tok = WhitespaceTokenizer(return_set=True)
>>> ws_tok.tokenize('data science data integration')
['data', 'science', 'integration']
Similarity Measures¶
Affine Gap¶
class py_stringmatching.similarity_measure.affine.Affine(gap_start=1, gap_continuation=0.5, sim_func=identity_function)¶
Returns the affine gap score between two strings.
The affine gap measure is an extension of the Needleman-Wunsch measure that handles longer gaps more gracefully. For more information refer to the string matching chapter in the DI book (“Principles of Data Integration”).
Parameters:
- gap_start (float) – Cost for the gap at the start (defaults to 1).
- gap_continuation (float) – Cost for the gap continuation (defaults to 0.5).
- sim_func (function) – Function computing similarity score between two characters, which are represented as strings (defaults to an identity function, which returns 1 if the two characters are the same and returns 0 otherwise).
gap_start (float)¶
An attribute to store the gap cost at the start.
gap_continuation (float)¶
An attribute to store the gap continuation cost.
sim_func (function)¶
An attribute to store the similarity function.
get_raw_score(string1, string2)¶
Computes the affine gap score between two strings. This score can be outside the range [0,1].
Parameters: string1,string2 (str) – Input strings.
Returns: Affine gap score between the two input strings (float).
Raises: TypeError – If the inputs are not strings or if one of the inputs is None.
Examples
>>> aff = Affine()
>>> aff.get_raw_score('dva', 'deeva')
1.5
>>> aff = Affine(gap_start=2, gap_continuation=0.5)
>>> aff.get_raw_score('dva', 'deeve')
-0.5
>>> aff = Affine(gap_continuation=0.2, sim_func=lambda s1, s2: (int(1 if s1 == s2 else 0)))
>>> aff.get_raw_score('AAAGAATTCA', 'AAATCA')
4.4
set_gap_continuation(gap_continuation)¶
Sets the gap continuation cost.
Parameters: gap_continuation (float) – Cost for the gap continuation.
Bag Distance¶
Bag distance measure
class py_stringmatching.similarity_measure.bag_distance.BagDistance¶
Bag distance measure class.
get_raw_score(string1, string2)¶
Computes the bag distance between two strings.
For two strings X and Y, the bag distance is: \(max(|bag(X) - bag(Y)|, |bag(Y) - bag(X)|)\)
Parameters: string1,string2 (str) – Input strings.
Returns: Bag distance (int).
Raises: TypeError – If the inputs are not strings.
Examples
>>> bd = BagDistance()
>>> bd.get_raw_score('cat', 'hat')
1
>>> bd.get_raw_score('Niall', 'Neil')
2
>>> bd.get_raw_score('aluminum', 'Catalan')
5
>>> bd.get_raw_score('ATCG', 'TAGC')
0
>>> bd.get_raw_score('abcde', 'xyz')
5
get_sim_score(string1, string2)¶
Computes the normalized bag similarity between two strings.
Parameters: string1,string2 (str) – Input strings.
Returns: Normalized bag similarity (float).
Raises: TypeError – If the inputs are not strings.
Examples
>>> bd = BagDistance()
>>> bd.get_sim_score('cat', 'hat')
0.6666666666666667
>>> bd.get_sim_score('Niall', 'Neil')
0.6
>>> bd.get_sim_score('aluminum', 'Catalan')
0.375
>>> bd.get_sim_score('ATCG', 'TAGC')
1.0
>>> bd.get_sim_score('abcde', 'xyz')
0.0
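The formula above can be sketched with Python’s collections.Counter (illustrative only, not the package’s implementation):

from collections import Counter

def bag_distance_sketch(x, y):
    # bags (multisets) of characters
    bag_x, bag_y = Counter(x), Counter(y)
    # multiset differences in both directions; take the larger size
    return max(sum((bag_x - bag_y).values()), sum((bag_y - bag_x).values()))

For instance, bag_distance_sketch('Niall', 'Neil') returns 2, matching the example above.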
Cosine¶
class py_stringmatching.similarity_measure.cosine.Cosine¶
Computes a variant of the cosine measure known as the Ochiai coefficient.
This is not the cosine measure that computes the cosine of the angle between two given vectors. Rather, it computes a variant of the cosine measure known as the Ochiai coefficient (see the Wikipedia page “Cosine Similarity”). Specifically, for two sets X and Y, this measure computes:
\(cosine(X, Y) = \frac{|X \cap Y|}{\sqrt{|X| \cdot |Y|}}\)
Note
- In the case where one of X and Y is an empty set and the other is a non-empty set, we define their cosine score to be 0.
- In the case where both X and Y are empty sets, we define their cosine score to be 1.
get_raw_score(set1, set2)¶
Computes the raw cosine score between two sets.
Parameters: set1,set2 (set or list) – Input sets (or lists). Input lists are converted to sets.
Returns: Cosine similarity (float).
Raises: TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.
Examples
>>> cos = Cosine()
>>> cos.get_raw_score(['data', 'science'], ['data'])
0.7071067811865475
>>> cos.get_raw_score(['data', 'data', 'science'], ['data', 'management'])
0.4999999999999999
>>> cos.get_raw_score([], ['data'])
0.0
References
- String Similarity Joins: An Experimental Evaluation (a paper appearing in the VLDB 2014 Conference).
- Project Flamingo at http://flamingo.ics.uci.edu.
get_sim_score(set1, set2)¶
Computes the normalized cosine similarity between two sets.
Parameters: set1,set2 (set or list) – Input sets (or lists). Input lists are converted to sets.
Returns: Normalized cosine similarity (float).
Raises: TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.
Examples
>>> cos = Cosine()
>>> cos.get_sim_score(['data', 'science'], ['data'])
0.7071067811865475
>>> cos.get_sim_score(['data', 'data', 'science'], ['data', 'management'])
0.4999999999999999
>>> cos.get_sim_score([], ['data'])
0.0
Dice¶
class py_stringmatching.similarity_measure.dice.Dice¶
Returns the Dice score between two sets.
The Dice similarity score is defined as twice the shared information (intersection) divided by the sum of cardinalities. For two sets X and Y, the Dice similarity score is:
\(dice(X, Y) = \frac{2 * |X \cap Y|}{|X| + |Y|}\)
Note
In the case where both X and Y are empty sets, we define their Dice score to be 1.
get_raw_score(set1, set2)¶
Computes the raw Dice score between two sets. This score is already in [0,1].
Parameters: set1,set2 (set or list) – Input sets (or lists). Input lists are converted to sets.
Returns: Dice similarity score (float).
Raises: TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.
Examples
>>> dice = Dice()
>>> dice.get_raw_score(['data', 'science'], ['data'])
0.6666666666666666
>>> dice.get_raw_score({1, 1, 2, 3, 4}, {2, 3, 4, 5, 6, 7, 7, 8})
0.5454545454545454
>>> dice.get_raw_score(['data', 'management'], ['data', 'data', 'science'])
0.5
References
- Wikibooks article : https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Dice%27s_coefficient
- SimMetrics library.
get_sim_score(set1, set2)¶
Computes the normalized Dice similarity score between two sets. This simply calls get_raw_score.
Parameters: set1,set2 (set or list) – Input sets (or lists). Input lists are converted to sets.
Returns: Normalized Dice similarity (float).
Raises: TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.
Examples
>>> dice = Dice()
>>> dice.get_sim_score(['data', 'science'], ['data'])
0.6666666666666666
>>> dice.get_sim_score({1, 1, 2, 3, 4}, {2, 3, 4, 5, 6, 7, 7, 8})
0.5454545454545454
>>> dice.get_sim_score(['data', 'management'], ['data', 'data', 'science'])
0.5
Editex¶
Editex distance measure
class py_stringmatching.similarity_measure.editex.Editex(match_cost=0, group_cost=1, mismatch_cost=2, local=False)¶
Editex distance measure class.
Parameters:
- match_cost (int) – Weight to give the correct char match, default=0.
- group_cost (int) – Weight to give if the chars are in the same editex group, default=1.
- mismatch_cost (int) – Weight to give the incorrect char match, default=2.
- local (boolean) – Local variant on/off, default=False.
get_raw_score(string1, string2)¶
Computes the editex distance between two strings.
As described on pages 3 & 4 of Zobel, Justin and Philip Dart. 1996. Phonetic string matching: Lessons from information retrieval. In: Proceedings of the ACM-SIGIR Conference on Research and Development in Information Retrieval, Zurich, Switzerland. 166–173. http://goanna.cs.rmit.edu.au/~jz/fulltext/sigir96.pdf
The local variant is based on Ring, Nicholas and Alexandra L. Uitdenbogerd. 2009. Finding ‘Lucy in Disguise’: The Misheard Lyric Matching Problem. In: Proceedings of the 5th Asia Information Retrieval Symposium, Sapporo, Japan. 157-167. http://www.seg.rmit.edu.au/research/download.php?manuscript=404
Parameters: string1,string2 (str) – Input strings.
Returns: Editex distance (int).
Raises: TypeError – If the inputs are not strings.
Examples
>>> ed = Editex()
>>> ed.get_raw_score('cat', 'hat')
2
>>> ed.get_raw_score('Niall', 'Neil')
2
>>> ed.get_raw_score('aluminum', 'Catalan')
12
>>> ed.get_raw_score('ATCG', 'TAGC')
6
References
- Abydos Library - https://github.com/chrislit/abydos/blob/master/abydos/distance.py
get_sim_score(string1, string2)¶
Computes the normalized editex similarity between two strings.
Parameters: string1,string2 (str) – Input strings.
Returns: Normalized editex similarity (float).
Raises: TypeError – If the inputs are not strings.
Examples
>>> ed = Editex()
>>> ed.get_sim_score('cat', 'hat')
0.66666666666666674
>>> ed.get_sim_score('Niall', 'Neil')
0.80000000000000004
>>> ed.get_sim_score('aluminum', 'Catalan')
0.25
>>> ed.get_sim_score('ATCG', 'TAGC')
0.25
References
- Abydos Library - https://github.com/chrislit/abydos/blob/master/abydos/distance.py
set_group_cost(group_cost)¶
Sets the group cost.
Parameters: group_cost (int) – Weight to give if the chars are in the same editex group.
Generalized Jaccard¶
Generalized Jaccard similarity measure
class py_stringmatching.similarity_measure.generalized_jaccard.GeneralizedJaccard(sim_func=jaro_function, threshold=0.5)¶
Generalized Jaccard similarity measure class.
Parameters:
- sim_func (function) – Similarity function. This should return a similarity score between two strings in a set (optional), default is the Jaro similarity measure.
- threshold (float) – Threshold value (defaults to 0.5). If the similarity of a token pair exceeds the threshold, then the token pair is considered a match.
get_raw_score(set1, set2)¶
Computes the Generalized Jaccard measure between two sets.
This similarity measure is a softened version of the Jaccard measure. The Jaccard measure is a promising candidate for tokens which exactly match across the sets. However, in practice tokens are often misspelled, such as energy vs. eneryg. The Generalized Jaccard measure will enable matching in such cases.
Parameters: set1,set2 (set or list) – Input sets (or lists) of strings. Input lists are converted to sets.
Returns: Generalized Jaccard similarity (float).
Raises: TypeError – If the inputs are not sets (or lists) or if one of the inputs is None. ValueError – If the similarity measure doesn’t return values in the range [0,1].
Examples
>>> gj = GeneralizedJaccard()
>>> gj.get_raw_score(['data', 'science'], ['data'])
0.5
>>> gj.get_raw_score(['data', 'management'], ['data', 'data', 'science'])
0.3333333333333333
>>> gj.get_raw_score(['Niall'], ['Neal', 'Njall'])
0.43333333333333335
>>> gj = GeneralizedJaccard(sim_func=JaroWinkler().get_raw_score, threshold=0.8)
>>> gj.get_raw_score(['Comp', 'Sci.', 'and', 'Engr', 'Dept.,', 'Universty', 'of', 'Cal,', 'San', 'Deigo'], ['Department', 'of', 'Computer', 'Science,', 'Univ.', 'Calif.,', 'San', 'Diego'])
0.45810185185185187
get_sim_score(set1, set2)¶
Computes the normalized Generalized Jaccard similarity between two sets.
Parameters: set1,set2 (set or list) – Input sets (or lists) of strings. Input lists are converted to sets.
Returns: Normalized Generalized Jaccard similarity (float).
Raises: TypeError – If the inputs are not sets (or lists) or if one of the inputs is None. ValueError – If the similarity measure doesn’t return values in the range [0,1].
Examples
>>> gj = GeneralizedJaccard()
>>> gj.get_sim_score(['data', 'science'], ['data'])
0.5
>>> gj.get_sim_score(['data', 'management'], ['data', 'data', 'science'])
0.3333333333333333
>>> gj.get_sim_score(['Niall'], ['Neal', 'Njall'])
0.43333333333333335
>>> gj = GeneralizedJaccard(sim_func=JaroWinkler().get_raw_score, threshold=0.8)
>>> gj.get_sim_score(['Comp', 'Sci.', 'and', 'Engr', 'Dept.,', 'Universty', 'of', 'Cal,', 'San', 'Deigo'], ['Department', 'of', 'Computer', 'Science,', 'Univ.', 'Calif.,', 'San', 'Diego'])
0.45810185185185187
Hamming Distance¶
class py_stringmatching.similarity_measure.hamming_distance.HammingDistance¶
Computes Hamming distance.
The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. Thus, it measures the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could have transformed one string into the other.
get_raw_score(string1, string2)¶
Computes the raw Hamming distance between two strings.
Parameters: string1,string2 (str) – Input strings.
Returns: Hamming distance (int).
Raises: TypeError – If the inputs are not strings or if one of the inputs is None. ValueError – If the input strings are not of the same length.
Examples
>>> hd = HammingDistance()
>>> hd.get_raw_score('', '')
0
>>> hd.get_raw_score('alex', 'john')
4
>>> hd.get_raw_score(' ', 'a')
1
>>> hd.get_raw_score('JOHN', 'john')
4
get_sim_score(string1, string2)¶
Computes the normalized Hamming similarity score between two strings.
Parameters: string1,string2 (str) – Input strings.
Returns: Normalized Hamming similarity score (float).
Raises: TypeError – If the inputs are not strings or if one of the inputs is None. ValueError – If the input strings are not of the same length.
Examples
>>> hd = HammingDistance()
>>> hd.get_sim_score('', '')
1.0
>>> hd.get_sim_score('alex', 'john')
0.0
>>> hd.get_sim_score(' ', 'a')
0.0
>>> hd.get_sim_score('JOHN', 'john')
0.0
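The definition above fits in a few lines of Python (a sketch, not the package’s implementation):

def hamming_distance_sketch(s1, s2):
    if len(s1) != len(s2):
        raise ValueError('input strings must be of the same length')
    # count positions where the corresponding characters differ
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))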
Jaccard¶
class py_stringmatching.similarity_measure.jaccard.Jaccard¶
Computes Jaccard measure.
For two sets X and Y, the Jaccard similarity score is:
\(jaccard(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}\)
Note
In the case where both X and Y are empty sets, we define their Jaccard score to be 1.
get_raw_score(set1, set2)¶
Computes the raw Jaccard score between two sets.
Parameters: set1,set2 (set or list) – Input sets (or lists). Input lists are converted to sets.
Returns: Jaccard similarity score (float).
Raises: TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.
Examples
>>> jac = Jaccard()
>>> jac.get_raw_score(['data', 'science'], ['data'])
0.5
>>> jac.get_raw_score({1, 1, 2, 3, 4}, {2, 3, 4, 5, 6, 7, 7, 8})
0.375
>>> jac.get_raw_score(['data', 'management'], ['data', 'data', 'science'])
0.3333333333333333
get_sim_score(set1, set2)¶
Computes the normalized Jaccard similarity between two sets. This simply calls get_raw_score.
Parameters: set1,set2 (set or list) – Input sets (or lists). Input lists are converted to sets.
Returns: Normalized Jaccard similarity (float).
Raises: TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.
Examples
>>> jac = Jaccard()
>>> jac.get_sim_score(['data', 'science'], ['data'])
0.5
>>> jac.get_sim_score({1, 1, 2, 3, 4}, {2, 3, 4, 5, 6, 7, 7, 8})
0.375
>>> jac.get_sim_score(['data', 'management'], ['data', 'data', 'science'])
0.3333333333333333
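Given the formula and the empty-set convention above, a standalone sketch of the raw Jaccard computation looks as follows (illustrative only, not the package’s code):

def jaccard_sketch(set1, set2):
    x, y = set(set1), set(set2)
    if not x and not y:
        return 1.0   # both sets empty: defined to be 1
    return len(x & y) / len(x | y)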
Jaro¶
class py_stringmatching.similarity_measure.jaro.Jaro¶
Computes Jaro measure.
The Jaro measure is a type of edit distance, developed mainly to compare short strings, such as first and last names.
get_raw_score(string1, string2)¶
Computes the raw Jaro score between two strings.
Parameters: string1,string2 (str) – Input strings.
Returns: Jaro similarity score (float).
Raises: TypeError – If the inputs are not strings or if one of the inputs is None.
Examples
>>> jaro = Jaro()
>>> jaro.get_raw_score('MARTHA', 'MARHTA')
0.9444444444444445
>>> jaro.get_raw_score('DWAYNE', 'DUANE')
0.8222222222222223
>>> jaro.get_raw_score('DIXON', 'DICKSONX')
0.7666666666666666
get_sim_score(string1, string2)¶
Computes the normalized Jaro similarity score between two strings. This simply calls get_raw_score.
Parameters: string1,string2 (str) – Input strings.
Returns: Normalized Jaro similarity score (float).
Raises: TypeError – If the inputs are not strings or if one of the inputs is None.
Examples
>>> jaro = Jaro()
>>> jaro.get_sim_score('MARTHA', 'MARHTA')
0.9444444444444445
>>> jaro.get_sim_score('DWAYNE', 'DUANE')
0.8222222222222223
>>> jaro.get_sim_score('DIXON', 'DICKSONX')
0.7666666666666666
Jaro Winkler¶
class py_stringmatching.similarity_measure.jaro_winkler.JaroWinkler(prefix_weight=0.1)¶
Computes Jaro-Winkler measure.
The Jaro-Winkler measure is designed to capture cases where two strings have a low Jaro score, but share a prefix and thus are likely to match.
Parameters: prefix_weight (float) – Weight to give to the prefix (defaults to 0.1).
prefix_weight (float)¶
An attribute to store the prefix weight.
get_raw_score(string1, string2)¶
Computes the raw Jaro-Winkler score between two strings.
Parameters: string1,string2 (str) – Input strings.
Returns: Jaro-Winkler similarity score (float).
Raises: TypeError – If the inputs are not strings or if one of the inputs is None.
Examples
>>> jw = JaroWinkler()
>>> jw.get_raw_score('MARTHA', 'MARHTA')
0.9611111111111111
>>> jw.get_raw_score('DWAYNE', 'DUANE')
0.84
>>> jw.get_raw_score('DIXON', 'DICKSONX')
0.8133333333333332
get_sim_score(string1, string2)¶
Computes the normalized Jaro-Winkler similarity score between two strings. This simply calls get_raw_score.
Parameters: string1,string2 (str) – Input strings.
Returns: Normalized Jaro-Winkler similarity (float).
Raises: TypeError – If the inputs are not strings or if one of the inputs is None.
Examples
>>> jw = JaroWinkler()
>>> jw.get_sim_score('MARTHA', 'MARHTA')
0.9611111111111111
>>> jw.get_sim_score('DWAYNE', 'DUANE')
0.84
>>> jw.get_sim_score('DIXON', 'DICKSONX')
0.8133333333333332
Levenshtein¶
class py_stringmatching.similarity_measure.levenshtein.Levenshtein¶
Computes Levenshtein measure (also known as edit distance).
Levenshtein distance computes the minimum cost of transforming one string into the other. Transforming a string is carried out using a sequence of the following operators: delete a character, insert a character, and substitute one character for another.
get_raw_score(string1, string2)¶
Computes the raw Levenshtein distance between two strings.
Parameters: string1,string2 (str) – Input strings.
Returns: Levenshtein distance (int).
Raises: TypeError – If the inputs are not strings.
Examples
>>> lev = Levenshtein()
>>> lev.get_raw_score('a', '')
1
>>> lev.get_raw_score('example', 'samples')
3
>>> lev.get_raw_score('levenshtein', 'frankenstein')
6
get_sim_score(string1, string2)¶
Computes the normalized Levenshtein similarity score between two strings.
Parameters: string1,string2 (str) – Input strings.
Returns: Normalized Levenshtein similarity (float).
Raises: TypeError – If the inputs are not strings.
Examples
>>> lev = Levenshtein()
>>> lev.get_sim_score('a', '')
0.0
>>> lev.get_sim_score('example', 'samples')
0.5714285714285714
>>> lev.get_sim_score('levenshtein', 'frankenstein')
0.5
Monge Elkan¶
class py_stringmatching.similarity_measure.monge_elkan.MongeElkan(sim_func=jaro_winkler_function)¶
Computes Monge-Elkan measure.
The Monge-Elkan similarity measure is a type of hybrid similarity measure that combines the benefits of sequence-based and set-based methods. This can be effective for domains in which more control is needed over the similarity measure. It implicitly uses a secondary similarity measure, such as Levenshtein, to compute the overall similarity score. See the string matching chapter in the DI book (Principles of Data Integration).
Parameters: sim_func (function) – Secondary similarity function. This is expected to be a sequence-based similarity measure (defaults to the Jaro-Winkler similarity measure).
sim_func (function)¶
An attribute to store the secondary similarity function.
get_raw_score(bag1, bag2)¶
Computes the raw Monge-Elkan score between two bags (lists).
Parameters: bag1,bag2 (list) – Input lists.
Returns: Monge-Elkan similarity score (float).
Raises: TypeError – If the inputs are not lists or if one of the inputs is None.
Examples
>>> me = MongeElkan()
>>> me.get_raw_score(['Niall'], ['Neal'])
0.8049999999999999
>>> me.get_raw_score(['Niall'], ['Nigel'])
0.7866666666666667
>>> me.get_raw_score(['Comput.', 'Sci.', 'and', 'Eng.', 'Dept.,', 'University', 'of', 'California,', 'San', 'Diego'], ['Department', 'of', 'Computer', 'Science,', 'Univ.', 'Calif.,', 'San', 'Diego'])
0.8677218614718616
>>> me.get_raw_score([''], ['a'])
0.0
>>> me = MongeElkan(sim_func=NeedlemanWunsch().get_raw_score)
>>> me.get_raw_score(['Comput.', 'Sci.', 'and', 'Eng.', 'Dept.,', 'University', 'of', 'California,', 'San', 'Diego'], ['Department', 'of', 'Computer', 'Science,', 'Univ.', 'Calif.,', 'San', 'Diego'])
2.0
>>> me = MongeElkan(sim_func=Affine().get_raw_score)
>>> me.get_raw_score(['Comput.', 'Sci.', 'and', 'Eng.', 'Dept.,', 'University', 'of', 'California,', 'San', 'Diego'], ['Department', 'of', 'Computer', 'Science,', 'Univ.', 'Calif.,', 'San', 'Diego'])
2.25
References
- Principles of Data Integration book
Needleman Wunsch¶
class py_stringmatching.similarity_measure.needleman_wunsch.NeedlemanWunsch(gap_cost=1.0, sim_func=identity_function)¶
Computes Needleman-Wunsch measure.
The Needleman-Wunsch distance generalizes the Levenshtein distance and considers global alignment between two strings. Specifically, it is computed by assigning a score to each alignment between the two input strings and choosing the score of the best alignment, that is, the maximal score. An alignment between two strings is a set of correspondences between their characters, allowing for gaps.
Parameters:
- gap_cost (float) – Cost of gap (defaults to 1.0).
- sim_func (function) – Similarity function to give a score for each correspondence between the characters (defaults to an identity function, which returns 1 if the two characters are the same and 0 otherwise).
gap_cost (float)¶
An attribute to store the gap cost.
sim_func (function)¶
An attribute to store the similarity function.
get_raw_score(string1, string2)¶
Computes the raw Needleman-Wunsch score between two strings.
Parameters: string1,string2 (str) – Input strings.
Returns: Needleman-Wunsch similarity score (float).
Raises: TypeError – If the inputs are not strings or if one of the inputs is None.
Examples
>>> nw = NeedlemanWunsch()
>>> nw.get_raw_score('dva', 'deeva')
1.0
>>> nw = NeedlemanWunsch(gap_cost=0.0)
>>> nw.get_raw_score('dva', 'deeve')
2.0
>>> nw = NeedlemanWunsch(gap_cost=1.0, sim_func=lambda s1, s2 : (2.0 if s1 == s2 else -1.0))
>>> nw.get_raw_score('dva', 'deeve')
1.0
>>> nw = NeedlemanWunsch(gap_cost=0.5, sim_func=lambda s1, s2 : (1.0 if s1 == s2 else -1.0))
>>> nw.get_raw_score('GCATGCUA', 'GATTACA')
2.5
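The recurrence behind the description above can be sketched with a standard dynamic program (illustrative only; gap_cost and sim below mirror the class defaults described above):

def needleman_wunsch_sketch(s1, s2, gap_cost=1.0,
                            sim=lambda a, b: 1.0 if a == b else 0.0):
    n, m = len(s1), len(s2)
    # dp[i][j] = best global alignment score of s1[:i] and s2[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = -i * gap_cost
    for j in range(1, m + 1):
        dp[0][j] = -j * gap_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + sim(s1[i - 1], s2[j - 1]),  # align characters
                           dp[i - 1][j] - gap_cost,                       # gap in s2
                           dp[i][j - 1] - gap_cost)                       # gap in s1
    return dp[n][m]

With the defaults, needleman_wunsch_sketch('dva', 'deeva') returns 1.0, matching the first example above.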
Overlap Coefficient¶
class py_stringmatching.similarity_measure.overlap_coefficient.OverlapCoefficient¶
Computes overlap coefficient measure.
The overlap coefficient is a similarity measure related to the Jaccard measure that measures the overlap between two sets, and is defined as the size of the intersection divided by the size of the smaller of the two sets. For two sets X and Y, the overlap coefficient is:
\(overlap\_coefficient(X, Y) = \frac{|X \cap Y|}{\min(|X|, |Y|)}\)
Note
- In the case where one of X and Y is an empty set and the other is a non-empty set, we define their overlap coefficient to be 0.
- In the case where both X and Y are empty sets, we define their overlap coefficient to be 1.
get_raw_score(set1, set2)¶
Computes the raw overlap coefficient score between two sets.
Parameters: set1,set2 (set or list) – Input sets (or lists). Input lists are converted to sets.
Returns: Overlap coefficient (float).
Raises: TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.
Examples
>>> oc = OverlapCoefficient()
>>> oc.get_raw_score(['data', 'science'], ['data'])
1.0
>>> oc.get_raw_score([], [])
1.0
>>> oc.get_raw_score([], ['data'])
0
References
- Wikipedia article : https://en.wikipedia.org/wiki/Overlap_coefficient
- SimMetrics library
get_sim_score(set1, set2)¶
Computes the normalized overlap coefficient between two sets. This simply calls get_raw_score.
Parameters: set1,set2 (set or list) – Input sets (or lists). Input lists are converted to sets.
Returns: Normalized overlap coefficient (float).
Raises: TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.
Examples
>>> oc = OverlapCoefficient()
>>> oc.get_sim_score(['data', 'science'], ['data'])
1.0
>>> oc.get_sim_score([], [])
1.0
>>> oc.get_sim_score([], ['data'])
0
Partial Ratio¶
Fuzzy Wuzzy Partial Ratio Similarity Measure
class py_stringmatching.similarity_measure.partial_ratio.PartialRatio¶
Computes the Fuzzy Wuzzy partial ratio similarity between two strings.
The Fuzzy Wuzzy partial ratio raw score is a measure of the strings’ similarity as an int in the range [0, 100]. Given two strings X and Y, let m be the length of the shorter string. The measure computes the Fuzzy Wuzzy ratio similarity between the shorter string and every substring of length m of the longer string, and returns the maximum of those similarity scores. The Fuzzy Wuzzy partial ratio sim score is a float in the range [0, 1] and is obtained by dividing the raw score by 100.
Note
In the case where either of the strings X or Y is empty, we define the Fuzzy Wuzzy partial ratio similarity score to be 0.
get_raw_score(string1, string2)¶
Computes the Fuzzy Wuzzy partial ratio measure raw score between two strings. This score is in the range [0,100].
Parameters: string1,string2 (str) – Input strings.
Returns: Partial ratio measure raw score (int).
Raises: TypeError – If the inputs are not strings.
Examples
>>> s = PartialRatio()
>>> s.get_raw_score('Robert Rupert', 'Rupert')
100
>>> s.get_raw_score('Sue', 'sue')
67
>>> s.get_raw_score('example', 'samples')
86
get_sim_score(string1, string2)¶
Computes the Fuzzy Wuzzy partial ratio similarity score between two strings. This score is in the range [0,1].
Parameters: string1,string2 (str) – Input strings.
Returns: Partial ratio measure similarity score (float).
Raises: TypeError – If the inputs are not strings.
Examples
>>> s = PartialRatio()
>>> s.get_sim_score('Robert Rupert', 'Rupert')
1.0
>>> s.get_sim_score('Sue', 'sue')
0.67
>>> s.get_sim_score('example', 'samples')
0.86
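The sliding-window idea in the description above can be sketched as follows (a simplified illustration using Python’s difflib; the package’s actual Fuzzy Wuzzy based implementation may differ in its details):

from difflib import SequenceMatcher

def partial_ratio_sketch(s1, s2):
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    if not shorter or not longer:
        return 0
    m = len(shorter)
    # compare the shorter string against every length-m substring of the longer one
    best = max(SequenceMatcher(None, shorter, longer[i:i + m]).ratio()
               for i in range(len(longer) - m + 1))
    return int(round(best * 100))

For example, partial_ratio_sketch('Robert Rupert', 'Rupert') returns 100, since 'Rupert' itself is one of the windows.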
Partial Token Sort¶
Fuzzy Wuzzy Partial Token Sort Similarity Measure
class py_stringmatching.similarity_measure.partial_token_sort.PartialTokenSort¶
Computes Fuzzy Wuzzy partial token sort similarity measure.
The Fuzzy Wuzzy partial token sort raw score is a measure of the strings’ similarity as an int in the range [0, 100]. For two strings X and Y, the score is obtained by splitting the two strings into tokens and then sorting the tokens. The score is then the Fuzzy Wuzzy partial ratio raw score of the transformed strings. The Fuzzy Wuzzy partial token sort sim score is a float in the range [0, 1] and is obtained by dividing the raw score by 100.
Note
In the case where either of the strings X or Y is empty, we define the Fuzzy Wuzzy partial token sort similarity score to be 0.
get_raw_score(string1, string2, force_ascii=True, full_process=True)¶
Computes the Fuzzy Wuzzy partial token sort measure raw score between two strings. This score is in the range [0,100].
Parameters:
- string1,string2 (str) – Input strings.
- force_ascii (boolean) – Flag to remove non-ascii characters or not.
- full_process (boolean) – Flag to process the string or not. Processing includes removing non-alphanumeric characters, converting the string to lower case, and removing leading and trailing whitespace.
Returns: Partial token sort measure raw score (int).
Raises: TypeError – If the inputs are not strings.
Examples
>>> s = PartialTokenSort()
>>> s.get_raw_score('great is scala', 'java is great')
81
>>> s.get_raw_score('Sue', 'sue')
100
>>> s.get_raw_score('C++ and Java', 'Java and Python')
64
get_sim_score(string1, string2, force_ascii=True, full_process=True)¶
Computes the Fuzzy Wuzzy partial token sort similarity score between two strings. This score is in the range [0,1].
Parameters:
- string1,string2 (str) – Input strings.
- force_ascii (boolean) – Flag to remove non-ascii characters or not.
- full_process (boolean) – Flag to process the string or not. Processing includes removing non-alphanumeric characters, converting the string to lower case, and removing leading and trailing whitespace.
Returns: Partial token sort measure similarity score (float).
Raises: TypeError – If the inputs are not strings.
Examples
>>> s = PartialTokenSort()
>>> s.get_sim_score('great is scala', 'java is great')
0.81
>>> s.get_sim_score('Sue', 'sue')
1.0
>>> s.get_sim_score('C++ and Java', 'Java and Python')
0.64
Ratio¶
Fuzzy Wuzzy Ratio Similarity Measure
class py_stringmatching.similarity_measure.ratio.Ratio¶
Computes Fuzzy Wuzzy ratio similarity measure.
The Fuzzy Wuzzy ratio raw score is a measure of the strings’ similarity as an int in the range [0, 100]. For two strings X and Y, the score is defined by int(round((2.0 * M / T) * 100)), where T is the total number of characters in both strings, and M is the number of matches in the two strings. The Fuzzy Wuzzy ratio sim score is a float in the range [0, 1] and is obtained by dividing the raw score by 100.
Note
In the case where either of the strings X or Y is empty, we define the Fuzzy Wuzzy ratio similarity score to be 0.
get_raw_score(string1, string2)¶
Computes the Fuzzy Wuzzy ratio measure raw score between two strings. This score is in the range [0,100].
Parameters: string1,string2 (str) – Input strings.
Returns: Ratio measure raw score (int).
Raises: TypeError – If the inputs are not strings.
Examples
>>> s = Ratio()
>>> s.get_raw_score('Robert', 'Rupert')
67
>>> s.get_raw_score('Sue', 'sue')
67
>>> s.get_raw_score('example', 'samples')
71
get_sim_score(string1, string2)¶
Computes the Fuzzy Wuzzy ratio similarity score between two strings. This score is in the range [0,1].
Parameters: string1,string2 (str) – Input strings.
Returns: Ratio measure similarity score (float).
Raises: TypeError – If the inputs are not strings.
Examples
>>> s = Ratio()
>>> s.get_sim_score('Robert', 'Rupert')
0.67
>>> s.get_sim_score('Sue', 'sue')
0.67
>>> s.get_sim_score('example', 'samples')
0.71
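The formula above is essentially difflib’s SequenceMatcher ratio scaled to [0, 100]; a sketch (not the package’s code):

from difflib import SequenceMatcher

def ratio_sketch(s1, s2):
    if not s1 or not s2:
        return 0   # empty input: defined to be 0, as in the note above
    # SequenceMatcher.ratio() computes 2.0 * M / T
    return int(round(SequenceMatcher(None, s1, s2).ratio() * 100))

For example, ratio_sketch('Robert', 'Rupert') returns 67, matching the example above.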
Smith Waterman¶
- class py_stringmatching.similarity_measure.smith_waterman.SmithWaterman(gap_cost=1.0, sim_func=identity_function)[source]¶
Computes Smith-Waterman measure.
The Smith-Waterman algorithm performs local sequence alignment; that is, it determines similar regions between two strings. Instead of looking at the entire sequence, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure (a sketch of the underlying dynamic program follows the parameter list below). See the string matching chapter in the DI book (Principles of Data Integration).
Parameters: - gap_cost (float) – Cost of gap (defaults to 1.0).
- sim_func (function) – Similarity function to give a score for the correspondence between the characters (defaults to an identity function, which returns 1 if the two characters are the same and 0 otherwise).
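For intuition, the following is a simplified sketch of the dynamic program behind the measure (illustrative only, not the library's Cython implementation):

# Simplified Smith-Waterman sketch with a linear gap cost and a
# pluggable character similarity function.
def smith_waterman(s1, s2, gap_cost=1.0,
                   sim_func=lambda a, b: 1.0 if a == b else 0.0):
    n, m = len(s1), len(s2)
    # dp[i][j] = best local-alignment score ending at s1[:i] and s2[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                0.0,                                                # restart the alignment
                dp[i - 1][j - 1] + sim_func(s1[i - 1], s2[j - 1]),  # match/mismatch
                dp[i - 1][j] - gap_cost,                            # gap in s2
                dp[i][j - 1] - gap_cost,                            # gap in s1
            )
            best = max(best, dp[i][j])
    return best

print(smith_waterman('cat', 'hat'))  # 2.0 ('at' aligns with 'at')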
- gap_cost (float)¶
An attribute to store the gap cost.
- sim_func (function)¶
An attribute to store the similarity function.
- get_raw_score(string1, string2)[source]¶
Computes the raw Smith-Waterman score between two strings.
Parameters: string1, string2 (str) – Input strings.
Returns: Smith-Waterman similarity score (float).
Raises: TypeError – If the inputs are not strings or if one of the inputs is None.
Examples
>>> sw = SmithWaterman()
>>> sw.get_raw_score('cat', 'hat')
2.0
>>> sw = SmithWaterman(gap_cost=2.2)
>>> sw.get_raw_score('dva', 'deeve')
1.0
>>> sw = SmithWaterman(gap_cost=1, sim_func=lambda s1, s2 : (2 if s1 == s2 else -1))
>>> sw.get_raw_score('dva', 'deeve')
2.0
>>> sw = SmithWaterman(gap_cost=1.4, sim_func=lambda s1, s2 : (1.5 if s1 == s2 else 0.5))
>>> sw.get_raw_score('GCATAGCU', 'GATTACA')
6.5
Soft TF/IDF¶
- class py_stringmatching.similarity_measure.soft_tfidf.SoftTfIdf(corpus_list=None, sim_func=jaro_function, threshold=0.5)[source]¶
Computes soft TF/IDF measure.
Note
Currently, this measure is implemented without dampening. This is similar to setting the dampen flag to False in TF/IDF. We plan to add the dampen flag in the next release.
Parameters:
- corpus_list (list of lists) – The corpus used to compute TF and IDF values, given as a list of tokenized strings (defaults to None). If None, the two input lists are considered the corpus.
- sim_func (function) – Secondary similarity function, which should return a similarity score between two strings (defaults to the Jaro similarity measure).
- threshold (float) – Threshold value for the secondary similarity function (defaults to 0.5). If the similarity of a token pair exceeds the threshold, then the token pair is considered a match (see the illustration below).
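To illustrate the role of the threshold, the snippet below (illustrative usage, not part of the class) checks which token pairs the default Jaro secondary function would treat as matches at a threshold of 0.8:

import py_stringmatching as sm

# Token pairs whose Jaro score exceeds the threshold count as matches.
jaro = sm.Jaro()
threshold = 0.8
for t1, t2 in [('apple', 'appel'), ('apple', 'orange')]:
    score = jaro.get_raw_score(t1, t2)
    print('%s / %s: %.3f -> match: %s' % (t1, t2, score, score > threshold))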
- sim_func (function)¶
An attribute to store the secondary similarity function.
- threshold (float)¶
An attribute to store the threshold value for the secondary similarity function.
- get_raw_score(bag1, bag2)[source]¶
Computes the raw soft TF/IDF score between two lists given the corpus information.
Parameters: bag1, bag2 (list) – Input lists
Returns: Soft TF/IDF score between the input lists (float).
Raises: TypeError – If the inputs are not lists or if one of the inputs is None.
Examples
>>> soft_tfidf = SoftTfIdf([['a', 'b', 'a'], ['a', 'c'], ['a']], sim_func=Jaro().get_raw_score, threshold=0.8)
>>> soft_tfidf.get_raw_score(['a', 'b', 'a'], ['a', 'c'])
0.17541160386140586
>>> soft_tfidf = SoftTfIdf([['a', 'b', 'a'], ['a', 'c'], ['a']], threshold=0.9)
>>> soft_tfidf.get_raw_score(['a', 'b', 'a'], ['a'])
0.5547001962252291
>>> soft_tfidf = SoftTfIdf([['x', 'y'], ['w'], ['q']])
>>> soft_tfidf.get_raw_score(['a', 'b', 'a'], ['a'])
0.0
>>> soft_tfidf = SoftTfIdf(sim_func=Affine().get_raw_score, threshold=0.6)
>>> soft_tfidf.get_raw_score(['aa', 'bb', 'a'], ['ab', 'ba'])
0.81649658092772592
References
- the string matching chapter of the “Principles of Data Integration” book.
- get_sim_func()[source]¶
Get the secondary similarity function.
Returns: secondary similarity function (function).
- get_threshold()[source]¶
Get the threshold used for the secondary similarity function.
Returns: threshold (float).
- set_corpus_list(corpus_list)[source]¶
Set the corpus list.
Parameters: corpus_list (list of lists) – Corpus list.
Soundex¶
Soundex phonetic similarity measure
- class py_stringmatching.similarity_measure.soundex.Soundex[source]¶
Soundex phonetic similarity measure class.
- get_raw_score(string1, string2)[source]¶
Computes the Soundex phonetic similarity between two strings.
Phonetic measures such as Soundex match strings based on their sound. These measures have been especially effective in matching names, since names are often spelled in different ways that sound the same. For example, Meyer, Meier, and Mire sound the same, as do Smith, Smithe, and Smythe.
Soundex is used primarily to match surnames. It does not work as well for names of East Asian origin, because much of the discriminating power of these names resides in the vowel sounds, which the code ignores.
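For intuition, here is a minimal sketch of one common variant of the Soundex encoding (the library's implementation may differ in edge cases, such as the treatment of 'h' and 'w'):

# Sketch of classic Soundex: keep the first letter, map the remaining
# consonants to digits, drop vowels, collapse adjacent duplicate codes,
# and pad or truncate the result to four characters.
def soundex(name):
    codes = {}
    for group, digit in [('bfpv', '1'), ('cgjkqsxz', '2'), ('dt', '3'),
                         ('l', '4'), ('mn', '5'), ('r', '6')]:
        for ch in group:
            codes[ch] = digit
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ''
    out, prev = [], codes.get(letters[0], '')
    for c in letters[1:]:
        code = codes.get(c, '')
        if code and code != prev:
            out.append(code)
        if c not in 'hw':  # 'h' and 'w' do not reset the previous code
            prev = code
    return (letters[0].upper() + ''.join(out) + '000')[:4]

print(soundex('Robert'))  # 'R163'
print(soundex('Rupert'))  # 'R163' -> the two names match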
Parameters: string1, string2 (str) – Input strings
Returns: Soundex similarity score (int) is returned
Raises: TypeError – If the inputs are not strings
Examples
>>> s = Soundex()
>>> s.get_raw_score('Robert', 'Rupert')
1
>>> s.get_raw_score('Sue', 's')
1
>>> s.get_raw_score('Gough', 'Goff')
0
>>> s.get_raw_score('a,,li', 'ali')
1
- get_sim_score(string1, string2)[source]¶
Computes the normalized Soundex similarity between two strings.
Parameters: string1, string2 (str) – Input strings
Returns: Normalized Soundex similarity (int)
Raises: TypeError – If the inputs are not strings or if one of the inputs is None.
Examples
>>> s = Soundex()
>>> s.get_sim_score('Robert', 'Rupert')
1
>>> s.get_sim_score('Sue', 's')
1
>>> s.get_sim_score('Gough', 'Goff')
0
>>> s.get_sim_score('a,,li', 'ali')
1
TF/IDF¶
- class py_stringmatching.similarity_measure.tfidf.TfIdf(corpus_list=None, dampen=True)[source]¶
Computes TF/IDF measure.
This measure employs the notion of TF/IDF score commonly used in information retrieval (IR) to find documents that are relevant to keyword queries. The intuition underlying the TF/IDF measure is that two strings are similar if they share distinguishing terms. See the string matching chapter in the book "Principles of Data Integration". (A sketch of the computation follows the parameter list below.)
Parameters: - corpus_list (list of lists) – The corpus that will be used to compute TF and IDF values. This corpus is a list of strings, where each string has been tokenized into a list of tokens (that is, a bag of tokens). The default is set to None. In this case, when we call this TF/IDF measure on two input strings (using get_raw_score or get_sim_score), the corpus is taken to be the list of those two strings.
- dampen (boolean) – Flag to indicate whether ‘log’ should be used in TF and IDF formulas (defaults to True).
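The following sketch (not the library's code; shown to convey the shape of the computation under a common log-dampening convention) computes a TF/IDF cosine score between two bags of tokens:

import math
from collections import Counter

# Sketch of a dampened TF/IDF cosine similarity between two bags of
# tokens, given a corpus of tokenized strings.
def tfidf_sim(bag1, bag2, corpus, dampen=True):
    df = Counter(t for doc in corpus for t in set(doc))  # document frequency
    n = len(corpus)
    tf1, tf2 = Counter(bag1), Counter(bag2)

    def weight(tf, token):
        if df[token] == 0:
            return 0.0
        tf_term = math.log(tf[token]) + 1 if dampen else tf[token]
        idf_term = math.log(float(n) / df[token]) if dampen else float(n) / df[token]
        return tf_term * idf_term

    num = sum(weight(tf1, t) * weight(tf2, t) for t in set(bag1) & set(bag2))
    norm1 = math.sqrt(sum(weight(tf1, t) ** 2 for t in set(bag1)))
    norm2 = math.sqrt(sum(weight(tf2, t) ** 2 for t in set(bag2)))
    return num / (norm1 * norm2) if norm1 and norm2 else 0.0

Note how a token that appears in every corpus document gets an IDF of log(1) = 0 under dampening, which is why several of the examples below evaluate to 0.0.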
- dampen (boolean)¶
An attribute to store the dampen flag.
- get_raw_score(bag1, bag2)[source]¶
Computes the raw TF/IDF score between two lists.
Parameters: bag1, bag2 (list) – Input lists.
Returns: TF/IDF score between the input lists (float).
Raises: TypeError – If the inputs are not lists or if one of the inputs is None.
Examples
>>> # here the corpus is a list of three strings that
>>> # have been tokenized into three lists of tokens
>>> tfidf = TfIdf([['a', 'b', 'a'], ['a', 'c'], ['a']])
>>> tfidf.get_raw_score(['a', 'b', 'a'], ['b', 'c'])
0.7071067811865475
>>> tfidf.get_raw_score(['a', 'b', 'a'], ['a'])
0.0
>>> tfidf = TfIdf([['x', 'y'], ['w'], ['q']])
>>> tfidf.get_raw_score(['a', 'b', 'a'], ['a'])
0.0
>>> tfidf = TfIdf([['a', 'b', 'a'], ['a', 'c'], ['a'], ['b']], False)
>>> tfidf.get_raw_score(['a', 'b', 'a'], ['a', 'c'])
0.25298221281347033
>>> tfidf = TfIdf(dampen=False)
>>> tfidf.get_raw_score(['a', 'b', 'a'], ['a'])
0.7071067811865475
>>> tfidf = TfIdf()
>>> tfidf.get_raw_score(['a', 'b', 'a'], ['a'])
0.0
- get_sim_score(bag1, bag2)[source]¶
Computes the normalized TF/IDF similarity score between two lists. Since the raw TF/IDF score is already a normalized cosine score in [0, 1], this simply calls get_raw_score.
Parameters: bag1, bag2 (list) – Input lists.
Returns: Normalized TF/IDF similarity score between the input lists (float).
Raises: TypeError – If the inputs are not lists or if one of the inputs is None.
Examples
>>> # here the corpus is a list of three strings that
>>> # have been tokenized into three lists of tokens
>>> tfidf = TfIdf([['a', 'b', 'a'], ['a', 'c'], ['a']])
>>> tfidf.get_sim_score(['a', 'b', 'a'], ['b', 'c'])
0.7071067811865475
>>> tfidf.get_sim_score(['a', 'b', 'a'], ['a'])
0.0
>>> tfidf = TfIdf([['x', 'y'], ['w'], ['q']])
>>> tfidf.get_sim_score(['a', 'b', 'a'], ['a'])
0.0
>>> tfidf = TfIdf([['a', 'b', 'a'], ['a', 'c'], ['a'], ['b']], False)
>>> tfidf.get_sim_score(['a', 'b', 'a'], ['a', 'c'])
0.25298221281347033
>>> tfidf = TfIdf(dampen=False)
>>> tfidf.get_sim_score(['a', 'b', 'a'], ['a'])
0.7071067811865475
>>> tfidf = TfIdf()
>>> tfidf.get_sim_score(['a', 'b', 'a'], ['a'])
0.0
Token Sort¶
Fuzzy Wuzzy Token Sort Similarity Measure
- class py_stringmatching.similarity_measure.token_sort.TokenSort[source]¶
Computes Fuzzy Wuzzy token sort similarity measure.
Fuzzy Wuzzy token sort raw score is a measure of the similarity of two strings, as an int in the range [0, 100]. For two strings X and Y, the score is obtained by splitting the two strings into tokens and then sorting the tokens. The score is then the Fuzzy Wuzzy ratio raw score of the transformed strings. Fuzzy Wuzzy token sort sim score is a float in the range [0, 1] and is obtained by dividing the raw score by 100.
Note
If either of the strings X or Y is empty, we define the Fuzzy Wuzzy token sort similarity score to be 0.
- get_raw_score(string1, string2, force_ascii=True, full_process=True)[source]¶
Computes the Fuzzy Wuzzy token sort measure raw score between two strings. This score is in the range [0, 100].
Parameters:
- string1, string2 (str) – Input strings
- force_ascii (boolean) – Flag indicating whether non-ascii characters should be removed
- full_process (boolean) – Flag indicating whether the strings should be pre-processed. Processing includes removing non-alphanumeric characters, converting the string to lower case, and removing leading and trailing whitespace.
Returns: Token Sort measure raw score (int) is returned
Raises: TypeError – If the inputs are not strings
Examples
>>> s = TokenSort()
>>> s.get_raw_score('great is scala', 'java is great')
81
>>> s.get_raw_score('Sue', 'sue')
100
>>> s.get_raw_score('C++ and Java', 'Java and Python')
64
- get_sim_score(string1, string2, force_ascii=True, full_process=True)[source]¶
Computes the Fuzzy Wuzzy token sort similarity score between two strings. This score is in the range [0, 1].
Parameters:
- string1, string2 (str) – Input strings
- force_ascii (boolean) – Flag indicating whether non-ascii characters should be removed
- full_process (boolean) – Flag indicating whether the strings should be pre-processed. Processing includes removing non-alphanumeric characters, converting the string to lower case, and removing leading and trailing whitespace.
Returns: Token Sort measure similarity score (float) is returned
Raises: TypeError – If the inputs are not strings
Examples
>>> s = TokenSort()
>>> s.get_sim_score('great is scala', 'java is great')
0.81
>>> s.get_sim_score('Sue', 'sue')
1.0
>>> s.get_sim_score('C++ and Java', 'Java and Python')
0.64
Tversky Index¶
Tversky index similarity measure
- class py_stringmatching.similarity_measure.tversky_index.TverskyIndex(alpha=0.5, beta=0.5)[source]¶
Tversky index similarity measure class.
Parameters: alpha, beta (float) – Tversky index parameters (each defaults to 0.5).
- get_raw_score(set1, set2)[source]¶
Computes the Tversky index similarity between two sets.
The Tversky index is an asymmetric similarity measure on sets that compares a variant to a prototype. The Tversky index can be seen as a generalization of Dice’s coefficient and Tanimoto coefficient.
For sets X and Y, the Tversky index is a number between 0 and 1 given by \(tversky\_index(X, Y) = \frac{|X \cap Y|}{|X \cap Y| + \alpha |X - Y| + \beta |Y - X|}\), where \(\alpha, \beta \ge 0\).
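A direct evaluation of this formula (illustrative only, not the library's code) reproduces the set example below:

# Direct evaluation of the Tversky index formula.
def tversky(x, y, alpha=0.5, beta=0.5):
    x, y = set(x), set(y)
    inter = len(x & y)
    return inter / (inter + alpha * len(x - y) + beta * len(y - x))

# |X & Y| = 3, |X - Y| = 1, |Y - X| = 4, so 3 / (3 + 0.5 + 2.0) = 0.5454...
print(tversky({1, 2, 3, 4}, {2, 3, 4, 5, 6, 7, 8}))  # 0.5454545454545454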
Parameters: set1, set2 (set or list) – Input sets (or lists). Input lists are converted to sets.
Returns: Tversky index similarity (float)
Raises: TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.
Examples
>>> tvi = TverskyIndex()
>>> tvi.get_raw_score(['data', 'science'], ['data'])
0.6666666666666666
>>> tvi.get_raw_score(['data', 'management'], ['data', 'data', 'science'])
0.5
>>> tvi.get_raw_score({1, 1, 2, 3, 4}, {2, 3, 4, 5, 6, 7, 7, 8})
0.5454545454545454
>>> tvi = TverskyIndex(0.5, 0.5)
>>> tvi.get_raw_score({1, 1, 2, 3, 4}, {2, 3, 4, 5, 6, 7, 7, 8})
0.5454545454545454
>>> tvi = TverskyIndex(beta=0.5)
>>> tvi.get_raw_score(['data', 'management'], ['data', 'data', 'science'])
0.5
- get_sim_score(set1, set2)[source]¶
Computes the normalized Tversky index similarity between two sets.
Parameters: set1, set2 (set or list) – Input sets (or lists). Input lists are converted to sets.
Returns: Normalized Tversky index similarity (float)
Raises: TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.
Examples
>>> tvi = TverskyIndex()
>>> tvi.get_sim_score(['data', 'science'], ['data'])
0.6666666666666666
>>> tvi.get_sim_score(['data', 'management'], ['data', 'data', 'science'])
0.5
>>> tvi.get_sim_score({1, 1, 2, 3, 4}, {2, 3, 4, 5, 6, 7, 7, 8})
0.5454545454545454
>>> tvi = TverskyIndex(0.5, 0.5)
>>> tvi.get_sim_score({1, 1, 2, 3, 4}, {2, 3, 4, 5, 6, 7, 7, 8})
0.5454545454545454
>>> tvi = TverskyIndex(beta=0.5)
>>> tvi.get_sim_score(['data', 'management'], ['data', 'data', 'science'])
0.5