Tutorial¶
Once the package has been installed, you can import it as follows:
In [1]: import py_stringmatching as sm
Computing a similarity score between two given strings x and y then typically consists of four steps: (1) selecting a similarity measure type, (2) selecting a tokenizer type, (3) creating a tokenizer object (of the selected type) and using it to tokenize the two given strings x and y, and (4) creating a similarity measure object (of the selected type) and applying it to the output of the tokenizer to compute a similarity score. We now elaborate on these steps.
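To give a sense of how the four steps fit together, here is a minimal sketch using the Jaccard measure with a whitespace tokenizer; the tokenizer and measure classes used here are discussed in the rest of this tutorial and in the user manual, and the comments map each line to the step it illustrates:
# a minimal sketch of the four steps, using the Jaccard measure (step 1)
# and a whitespace tokenizer (step 2); both are covered in detail below
ws = sm.WhitespaceTokenizer(return_set=True)   # step 3: create the tokenizer
x_tokens = ws.tokenize('data science')         # step 3: tokenize x
y_tokens = ws.tokenize('data integration')     # step 3: tokenize y
jac = sm.Jaccard()                             # step 4: create the measure object
jac.get_sim_score(x_tokens, y_tokens)          # step 4: apply it; returns 1/3 here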
1. Selecting a Similarity Measure¶
First, you must select a similarity measure. The package py_stringmatching currently provides a set of similarity measures (with plans to add more). Examples of such measures are Jaccard, Levenshtein, TF/IDF, etc. To learn more about these measures, a good place to start is the string matching chapter of the book “Principles of Data Integration”. (This chapter is available on the package’s homepage.)
A major group of similarity measures treats input strings as sequences of characters (e.g., Levenshtein, Smith Waterman). Another group treats input strings as sets of tokens (e.g., Jaccard). Yet another group treats input strings as bags of tokens (e.g., TF/IDF). A bag of tokens is a collection of tokens such that a token can appear multiple times in the collection (as opposed to a set of tokens, where each token can appear only once).
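For instance, viewed as plain Python collections, the same tokens look like this:
# the same tokens as a bag (duplicates kept) and as a set (duplicates dropped)
bag_of_tokens = ['up', 'up', 'and', 'away']   # 'up' appears twice
set_of_tokens = {'up', 'and', 'away'}         # each token appears at most once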
The currently implemented similarity measures include:
- sequence-based measures: affine gap, bag distance, editex, Hamming distance, Jaro, Jaro Winkler, Levenshtein, Needleman Wunsch, partial ratio, partial token sort, ratio, Smith Waterman, token sort.
- set-based measures: cosine, Dice, Jaccard, overlap coefficient, Tversky Index.
- bag-based measures: TF/IDF.
- phonetic-based measures: soundex.
(There are also hybrid similarity measures: Monge Elkan, Soft TF/IDF, and Generalized Jaccard. They are so called because each of these measures uses multiple similarity measures. See their descriptions in this user manual to understand what types of input they expect.)
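For instance, Monge Elkan expects bags of tokens as input and internally scores pairs of tokens with a secondary, sequence-based measure. A rough sketch (assuming the default secondary measure; see the Monge Elkan description in this manual for details):
# rough sketch of a hybrid measure: Monge Elkan compares two bags of tokens,
# scoring token pairs with a secondary sequence-based measure internally
me = sm.MongeElkan()
me.get_raw_score(['Comput', 'Sci', 'Dept'], ['Department', 'of', 'Computer', 'Science'])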
At this point, you should know whether the selected similarity measure treats input strings as sequences, bags, or sets, so that later you can set the parameters of the tokenizer properly (see Steps 2-3 below).
2. Selecting a Tokenizer Type¶
If the above selected similarity measure treats input strings as sequences of characters, then you do not need to tokenize the input strings x and y, and hence do not have to select a tokenizer type.
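For example, a sequence-based measure such as Levenshtein can be applied directly to the two strings. A brief sketch (the Levenshtein class and its scoring methods are described in the user manual):
# sequence-based measures are applied directly to the input strings
lev = sm.Levenshtein()
lev.get_raw_score('kitten', 'sitting')   # edit distance between the two strings (3)
lev.get_sim_score('kitten', 'sitting')   # normalized similarity score in [0, 1]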
Otherwise, you need to select a tokenizer type. The package py_stringmatching currently provides a set of different tokenizer types: alphabetical tokenizer, alphanumeric tokenizer, delimiter-based tokenizer, qgram tokenizer, and whitespace tokenizer (more tokenizer types can easily be added).
A tokenizer will convert an input string into a set or a bag of tokens, as discussed in Step 3.
3. Creating a Tokenizer Object and Using It to Tokenize the Input Strings¶
If you have selected a tokenizer type in Step 2, then in Step 3 you create a tokenizer object of that type. If the intended similarity measure (selected in Step 1) treats the input strings as sets of tokens, then when creating the tokenizer object, you must set the flag return_set to True. Otherwise this flag defaults to False, and the created tokenizer object will tokenize a string into a bag of tokens.
The following examples create tokenizer objects where the flag return_set is not specified, and thus defaults to False. So these tokenizer objects will tokenize a string into a bag of tokens.
# create an alphabetical tokenizer that returns a bag of tokens
In [2]: alphabet_tok = sm.AlphabeticTokenizer()
# create an alphanumeric tokenizer
In [3]: alnum_tok = sm.AlphanumericTokenizer()
# create a delimiter tokenizer using comma as a delimiter
In [4]: delim_tok = sm.DelimiterTokenizer(delim_set=[','])
# create a qgram tokenizer using q=3
In [5]: qg3_tok = sm.QgramTokenizer(qval=3)
# create a whitespace tokenizer
In [6]: ws_tok = sm.WhitespaceTokenizer()
Given the string “up up and away”, the tokenizer alphabet_tok (defined above) will convert it into a bag of tokens [‘up’, ‘up’, ‘and’, ‘away’], where the token ‘up’ appears twice.
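As a quick check:
# the bag tokenizer keeps duplicate tokens, in order of occurrence
alphabet_tok.tokenize('up up and away')
# ['up', 'up', 'and', 'away']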
The following examples create tokenizer objects where the flag return_set is set to True. Thus these tokenizers will tokenize a string into a set of tokens.
# create an alphabetical tokenizer that returns a set of tokens
In [7]: alphabet_tok_set = sm.AlphabeticTokenizer(return_set=True)
# create a whitespace tokenizer that returns a set of tokens
In [8]: ws_tok_set = sm.WhitespaceTokenizer(return_set=True)
# create a qgram tokenizer with q=3 that returns a set of tokens
In [9]: qg3_tok_set = sm.QgramTokenizer(qval=3, return_set=True)
So given the same string “up up and away”, the tokenizer alphabet_tok_set (defined above) will convert it into a set of tokens [‘up’, ‘and’, ‘away’].
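Again as a quick check:
# with return_set=True, duplicate tokens are dropped
alphabet_tok_set.tokenize('up up and away')
# ['up', 'and', 'away']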
All tokenizers have a tokenize method which tokenizes a given input string into a set or bag of tokens (depending on whether the flag return_set is True or False), as these examples illustrate:
In [10]: test_string = ' .hello, world!! data, science, is amazing!!. hello.'
# tokenize into a bag of alphabetical tokens
In [11]: alphabet_tok.tokenize(test_string)
Out[11]: ['hello', 'world', 'data', 'science', 'is', 'amazing', 'hello']
# tokenize into a set of alphabetical tokens (with return_set set to True)
In [12]: alphabet_tok_set.tokenize(test_string)