Generalized Jaccard
Generalized Jaccard similarity measure
class py_stringmatching.similarity_measure.generalized_jaccard.GeneralizedJaccard(sim_func=<bound method Jaro.get_raw_score of <py_stringmatching.similarity_measure.jaro.Jaro object>>, threshold=0.5)
Generalized Jaccard similarity measure class.
- Parameters
sim_func (function) – Similarity function (optional). It should return a similarity score between two strings, one drawn from each set. Defaults to the Jaro similarity measure.
threshold (float) – Threshold value (defaults to 0.5). If the similarity of a token pair exceeds the threshold, then the token pair is considered a match.
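As a rough illustration of the two parameters, the sketch below defines a token-level similarity function that stays within [0, 1] and applies the threshold test described above. The names `difflib_ratio` and `is_match` are hypothetical helpers introduced only for this example, and `difflib.SequenceMatcher` from the standard library stands in for a measure such as Jaro; it is not what the class uses by default.

```python
from difflib import SequenceMatcher


def difflib_ratio(s1, s2):
    """Hypothetical token similarity: difflib's ratio, which lies in [0, 1]."""
    return SequenceMatcher(None, s1, s2).ratio()


def is_match(s1, s2, sim_func=difflib_ratio, threshold=0.5):
    """A token pair is a match only if its score exceeds the threshold."""
    return sim_func(s1, s2) > threshold


# A misspelled token still clears the default threshold,
# while unrelated tokens do not.
print(is_match('energy', 'eneryg'))
print(is_match('data', 'science'))
```

Any callable with this shape (two strings in, a float in [0, 1] out) could presumably be passed as `sim_func`, for example `GeneralizedJaccard(sim_func=difflib_ratio, threshold=0.7)`.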
get_raw_score(set1, set2)
Computes the Generalized Jaccard measure between two sets.
This similarity measure is a softened version of the Jaccard measure. The Jaccard measure works well when tokens match exactly across the sets. In practice, however, tokens are often misspelled, such as energy vs. eneryg. The generalized Jaccard measure enables matching in such cases.
- Parameters
set1 (set or list) – First input set (or list) of strings. An input list is converted to a set.
set2 (set or list) – Second input set (or list) of strings. An input list is converted to a set.
- Returns
Generalized Jaccard similarity (float)
- Raises
TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.
ValueError – If the similarity measure doesn’t return values in the range [0,1]
Examples
>>> from py_stringmatching import GeneralizedJaccard, JaroWinkler
>>> gj = GeneralizedJaccard()
>>> gj.get_raw_score(['data', 'science'], ['data'])
0.5
>>> gj.get_raw_score(['data', 'management'], ['data', 'data', 'science'])
0.3333333333333333
>>> gj.get_raw_score(['Niall'], ['Neal', 'Njall'])
0.43333333333333335
>>> gj = GeneralizedJaccard(sim_func=JaroWinkler().get_raw_score, threshold=0.8)
>>> gj.get_raw_score(['Comp', 'Sci.', 'and', 'Engr', 'Dept.,', 'Universty', 'of', 'Cal,', 'San', 'Deigo'], ['Department', 'of', 'Computer', 'Science,', 'Univ.', 'Calif.,', 'San', 'Diego'])
0.45810185185185187
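The computation behind these scores can be sketched in plain Python. This is a minimal sketch, not the library's source: it assumes a greedy one-to-one matching of token pairs in descending score order, with the summed scores of matched pairs divided by `len(set1) + len(set2) - number_of_matches`. That formula reproduces the first three example values when Jaro is the token similarity. The function name `generalized_jaccard`, the `difflib`-based default similarity, and the handling of two empty sets are assumptions made for this illustration, and the input type checks are omitted.

```python
from difflib import SequenceMatcher


def difflib_ratio(s1, s2):
    # Stand-in token similarity (assumption); the class defaults to Jaro.
    return SequenceMatcher(None, s1, s2).ratio()


def generalized_jaccard(set1, set2, sim_func=difflib_ratio, threshold=0.5):
    """Sketch of the generalized Jaccard measure under the stated assumptions."""
    set1, set2 = set(set1), set(set2)  # lists are converted to sets
    if not set1 and not set2:
        return 1.0  # this sketch's choice for two empty inputs

    # Score every cross-set token pair and keep those above the threshold.
    candidates = []
    for s1 in set1:
        for s2 in set2:
            score = sim_func(s1, s2)
            if score < 0 or score > 1:
                raise ValueError('similarity must be in the range [0,1]')
            if score > threshold:
                candidates.append((score, s1, s2))

    # Greedy one-to-one matching, best-scoring pairs first.
    candidates.sort(key=lambda c: c[0], reverse=True)
    used1, used2 = set(), set()
    total, matches = 0.0, 0
    for score, s1, s2 in candidates:
        if s1 in used1 or s2 in used2:
            continue
        used1.add(s1)
        used2.add(s2)
        total += score
        matches += 1

    return total / (len(set1) + len(set2) - matches)


print(generalized_jaccard(['data', 'science'], ['data']))  # 0.5
```

With one exact match out of three distinct tokens overall, the second example works out the same way: 1.0 / (2 + 2 - 1), since the duplicate 'data' disappears when the list becomes a set.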
get_sim_score(set1, set2)
Computes the normalized Generalized Jaccard similarity between two sets.
- Parameters
set1 (set or list) – First input set (or list) of strings. An input list is converted to a set.
set2 (set or list) – Second input set (or list) of strings. An input list is converted to a set.
- Returns
Normalized Generalized Jaccard similarity (float)
- Raises
TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.
ValueError – If the similarity measure doesn’t return values in the range [0,1]
Examples
>>> from py_stringmatching import GeneralizedJaccard, JaroWinkler
>>> gj = GeneralizedJaccard()
>>> gj.get_sim_score(['data', 'science'], ['data'])
0.5
>>> gj.get_sim_score(['data', 'management'], ['data', 'data', 'science'])
0.3333333333333333
>>> gj.get_sim_score(['Niall'], ['Neal', 'Njall'])
0.43333333333333335
>>> gj = GeneralizedJaccard(sim_func=JaroWinkler().get_raw_score, threshold=0.8)
>>> gj.get_sim_score(['Comp', 'Sci.', 'and', 'Engr', 'Dept.,', 'Universty', 'of', 'Cal,', 'San', 'Deigo'], ['Department', 'of', 'Computer', 'Science,', 'Univ.', 'Calif.,', 'San', 'Diego'])
0.45810185185185187