Partial Token Sort
Fuzzy Wuzzy Token Sort Similarity Measure
class py_stringmatching.similarity_measure.partial_token_sort.PartialTokenSort
Computes the Fuzzy Wuzzy partial token sort similarity measure.
The Fuzzy Wuzzy partial token sort raw score measures the similarity of two strings as an int in the range [0, 100]. For two strings X and Y, each string is split into tokens, the tokens are sorted, and the sorted tokens are joined back into a string. The raw score is then the Fuzzy Wuzzy partial ratio raw score of the two transformed strings. The Fuzzy Wuzzy partial token sort similarity score is a float in the range [0, 1], obtained by dividing the raw score by 100.
Note: If either string X or Y is empty, the similarity score is defined to be 0.
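The transformation described above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: `sort_tokens`, `partial_ratio`, and `partial_token_sort_raw` are hypothetical names, inputs are assumed to be already lowercased and cleaned, and the partial ratio is approximated by a brute-force sliding window scored with `difflib.SequenceMatcher`. Scores from this sketch may therefore differ from the values `PartialTokenSort` returns.

```python
from difflib import SequenceMatcher


def sort_tokens(s):
    # Split on whitespace, sort the tokens, and rejoin with single spaces.
    return " ".join(sorted(s.split()))


def partial_ratio(a, b):
    # Empty input is defined to score 0, matching the note above.
    if not a or not b:
        return 0
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    best = 0.0
    # Compare the shorter string with every equal-length window of the longer one.
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return int(round(100 * best))


def partial_token_sort_raw(x, y):
    # Sort the tokens of both strings, then score the transformed strings.
    return partial_ratio(sort_tokens(x), sort_tokens(y))


print(sort_tokens("java is great"))                 # great is java
print(partial_token_sort_raw("sue", "sue"))         # 100
print(partial_token_sort_raw("", "java is great"))  # 0
```

Sorting the tokens first makes the comparison insensitive to word order, which is why 'great is scala' and 'java is great' are compared as 'great is scala' and 'great is java'.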
get_raw_score(string1, string2, force_ascii=True, full_process=True)
Computes the Fuzzy Wuzzy partial token sort raw score between two strings. This score is in the range [0, 100].
Parameters: - string1, string2 (str) – Input strings
- force_ascii (boolean) – Flag to remove non-ASCII characters or not
- full_process (boolean) – Flag to process the string or not. Processing includes removing non-alphanumeric characters, converting the string to lower case, and removing leading and trailing whitespace.
Returns: Partial token sort raw score (int)
Raises: TypeError – If the inputs are not strings
Examples
>>> s = PartialTokenSort()
>>> s.get_raw_score('great is scala', 'java is great')
81
>>> s.get_raw_score('Sue', 'sue')
100
>>> s.get_raw_score('C++ and Java', 'Java and Python')
64
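To make the `force_ascii` and `full_process` flags concrete, the following sketch shows one way the described preprocessing could behave. The helper `preprocess` is hypothetical, and its exact rules are assumptions read off the parameter descriptions: non-ASCII characters are dropped, runs of non-alphanumeric characters become a single space, and the input must be a `str`. The library's own handling may differ in detail.

```python
import re


def preprocess(s, force_ascii=True, full_process=True):
    # Mirror the documented TypeError for non-string inputs.
    if not isinstance(s, str):
        raise TypeError("input must be a string")
    if force_ascii:
        # Assumption: non-ASCII characters are dropped rather than transliterated.
        s = s.encode("ascii", "ignore").decode("ascii")
    if full_process:
        # Replace runs of non-alphanumeric characters with a single space,
        # then lowercase and trim leading and trailing whitespace.
        s = re.sub(r"[\W_]+", " ", s)
        s = s.lower().strip()
    return s


print(preprocess("  C++ and Java! "))           # c and java
print(preprocess("Sue", full_process=False))    # Sue
```

Under these assumptions 'Sue' and 'sue' become identical after processing, which is consistent with the raw score of 100 in the example above.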
get_sim_score(string1, string2, force_ascii=True, full_process=True)
Computes the Fuzzy Wuzzy partial token sort similarity score between two strings. This score is in the range [0, 1].
Parameters: - string1, string2 (str) – Input strings
- force_ascii (boolean) – Flag to remove non-ASCII characters or not
- full_process (boolean) – Flag to process the string or not. Processing includes removing non-alphanumeric characters, converting the string to lower case, and removing leading and trailing whitespace.
Returns: Partial token sort similarity score (float)
Raises: TypeError – If the inputs are not strings
Examples
>>> s = PartialTokenSort()
>>> s.get_sim_score('great is scala', 'java is great')
0.81
>>> s.get_sim_score('Sue', 'sue')
1.0
>>> s.get_sim_score('C++ and Java', 'Java and Python')
0.64
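Because the similarity score is defined as the raw score divided by 100, the two methods can be related with a one-line conversion. The helper `raw_to_sim` below is a hypothetical name used only for illustration; it is not part of the library.

```python
def raw_to_sim(raw_score):
    # The raw score is an int in [0, 100]; reject anything outside that range.
    if not 0 <= raw_score <= 100:
        raise ValueError("raw score must be in the range [0, 100]")
    # Dividing by 100 maps the raw score onto the [0, 1] similarity scale.
    return raw_score / 100.0


print(raw_to_sim(81))   # 0.81
print(raw_to_sim(100))  # 1.0
print(raw_to_sim(64))   # 0.64
```

The three inputs are the raw scores from the `get_raw_score` examples, and the results match the corresponding `get_sim_score` examples.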