User Manual for py_stringmatching¶
This document shows users how to install and use the package. To contribute to or further develop the package, see the project website, section “For Contributors and Developers”.
Contents¶
What is New?¶
Compared to Version 0.4.0, the following changes are new:
- The package is now built with an updated Cython version (>= 0.27.3).
- Added support for Python 3.7 and dropped testing support for Python 3.3.
Installation¶
Requirements¶
- Python 2.7 or Python 3.4+
- A C or C++ compiler (parts of the package are written in Cython for efficiency, and a C or C++ compiler is needed to compile these parts)
Platforms¶
py_stringmatching has been tested on Linux (Ubuntu with Kernel Version 3.13.0-40-generic), OS X (Darwin with Kernel Version 13.4.0), and Windows 8.1.
Dependencies¶
- numpy 1.7.0 or higher
- six
Note
The py_stringmatching installer will automatically install the above required packages.
C Compiler Required¶
Before installing this package, you need to make sure that you have a C compiler installed. This is necessary because this package contains Cython files. Go here for more information about how to check whether you already have a C compiler and how to install a C compiler.
After you have confirmed that you have a C compiler installed, you are ready to install the package. There are two ways to install the py_stringmatching package: using pip or from a source distribution.
Installing Using pip¶
The easiest way to install the package is to use pip, which will retrieve py_stringmatching from PyPI and then install it:
pip install py_stringmatching
Installing from Source Distribution¶
Step 1: Download the py_stringmatching package from here.
Step 2: Unzip the package and execute the following command from the package root:
python setup.py install
Note
The above command will try to install py_stringmatching into the default Python directory on your machine. If you do not have installation permission for that directory, you can install the package in your home directory as follows:
python setup.py install --user
For more information, see this StackOverflow link.
Tutorial¶
Once the package has been installed, you can import the package as follows:
In [1]: import py_stringmatching as sm
Computing a similarity score between two given strings x and y typically consists of four steps: (1) selecting a similarity measure type, (2) selecting a tokenizer type, (3) creating a tokenizer object (of the selected type) and using it to tokenize the two given strings x and y, and (4) creating a similarity measure object (of the selected type) and applying it to the output of the tokenizer to compute a similarity score. We now elaborate on these steps.
1. Selecting a Similarity Measure¶
First, you must select a similarity measure. The package py_stringmatching currently provides a set of different measures (with plans to add more). Examples of such measures are Jaccard, Levenshtein, TF/IDF, etc. To understand more about these measures, a good place to start is the string matching chapter of the book “Principles of Data Integration”. (This chapter is available on the package’s homepage.)
A major group of similarity measures treats input strings as sequences of characters (e.g., Levenshtein, Smith Waterman). Another group treats input strings as sets of tokens (e.g., Jaccard). Yet another group treats input strings as bags of tokens (e.g., TF/IDF). A bag of tokens is a collection of tokens such that a token can appear multiple times in the collection (as opposed to a set of tokens, where each token can appear only once).
- The currently implemented similarity measures include:
- sequence-based measures: affine gap, bag distance, editex, Hamming distance, Jaro, Jaro Winkler, Levenshtein, Needleman Wunsch, partial ratio, partial token sort, ratio, Smith Waterman, token sort.
- set-based measures: cosine, Dice, Jaccard, overlap coefficient, Tversky Index.
- bag-based measures: TF/IDF.
- phonetic-based measures: soundex.
(There are also hybrid similarity measures: Monge Elkan, Soft TF/IDF, and Generalized Jaccard. They are so called because each of these measures uses multiple similarity measures. See their descriptions in this user manual to understand what types of input they expect.)
At this point, you should know if the selected similarity measure treats input strings as sequences, bags, or sets, so that later you can set the parameters of the tokenizing function properly (see Steps 2-3 below).
2. Selecting a Tokenizer Type¶
If the above selected similarity measure treats input strings as sequences of characters, then you do not need to tokenize the input strings x and y, and hence do not have to select a tokenizer type.
Otherwise, you need to select a tokenizer type. The package py_stringmatching currently provides a set of different tokenizer types: alphabetical tokenizer, alphanumeric tokenizer, delimiter-based tokenizer, qgram tokenizer, and whitespace tokenizer (more tokenizer types can easily be added).
A tokenizer will convert an input string into a set or a bag of tokens, as discussed in Step 3.
3. Creating a Tokenizer Object and Using It to Tokenize the Input Strings¶
If you have selected a tokenizer type in Step 2, then in Step 3 you create a tokenizer object of that type. If the intended similarity measure (selected in Step 1) treats the input strings as sets of tokens, then when creating the tokenizer object, you must set the flag return_set to True. Otherwise this flag defaults to False, and the created tokenizer object will tokenize a string into a bag of tokens.
The following examples create tokenizer objects where the flag return_set is not mentioned, thus defaulting to False. So these tokenizer objects will tokenize a string into a bag of tokens.
# create an alphabetical tokenizer that returns a bag of tokens
In [2]: alphabet_tok = sm.AlphabeticTokenizer()
# create an alphanumeric tokenizer
In [3]: alnum_tok = sm.AlphanumericTokenizer()
# create a delimiter tokenizer using comma as a delimiter
In [4]: delim_tok = sm.DelimiterTokenizer(delim_set=[','])
# create a qgram tokenizer using q=3
In [5]: qg3_tok = sm.QgramTokenizer(qval=3)
# create a whitespace tokenizer
In [6]: ws_tok = sm.WhitespaceTokenizer()
Given the string “up up and away”, the tokenizer alphabet_tok (defined above) will convert it into a bag of tokens ['up', 'up', 'and', 'away'], where the token 'up' appears twice.
The following examples create tokenizer objects where the flag return_set is set to True. Thus these tokenizers will tokenize a string into a set of tokens.
# create an alphabetical tokenizer that returns a set of tokens
In [7]: alphabet_tok_set = sm.AlphabeticTokenizer(return_set=True)
# create a whitespace tokenizer that returns a set of tokens
In [8]: ws_tok_set = sm.WhitespaceTokenizer(return_set=True)
# create a qgram tokenizer with q=3 that returns a set of tokens
In [9]: qg3_tok_set = sm.QgramTokenizer(qval=3, return_set=True)
So given the same string “up up and away”, the tokenizer alphabet_tok_set (defined above) will convert it into a set of tokens ['up', 'and', 'away'].
All tokenizers have a tokenize method which tokenizes a given input string into a set or bag of tokens (depending on whether the flag return_set is True or False), as these examples illustrate:
In [10]: test_string = ' .hello, world!! data, science, is amazing!!. hello.'
# tokenize into a bag of alphabetical tokens
In [11]: alphabet_tok.tokenize(test_string)
Out[11]: ['hello', 'world', 'data', 'science', 'is', 'amazing', 'hello']
# tokenize into alphabetical tokens (with return_set set to True)
In [12]: alphabet_tok_set.tokenize(test_string)