Generalized Jaccard
Generalized Jaccard similarity measure
class py_stringmatching.similarity_measure.generalized_jaccard.GeneralizedJaccard(sim_func=<bound method Jaro.get_raw_score of <py_stringmatching.similarity_measure.jaro.Jaro object>>, threshold=0.5)
Generalized Jaccard similarity measure class.
Parameters:
- sim_func (function) – Similarity function that takes two strings, one from each input set, and returns a similarity score in the range [0,1] (optional; defaults to the Jaro similarity measure).
- threshold (float) – Threshold value (defaults to 0.5). A token pair is considered a match if its similarity exceeds the threshold.
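To illustrate the contract described for sim_func and threshold, here is a minimal sketch that does not use py_stringmatching itself. The names ratio_similarity and is_match are hypothetical, and difflib's ratio is only a stand-in similarity, not the library's default Jaro measure.

```python
from difflib import SequenceMatcher


def ratio_similarity(s1, s2):
    """Hypothetical custom similarity: difflib's ratio, which lies in [0, 1]."""
    return SequenceMatcher(None, s1, s2).ratio()


def is_match(s1, s2, sim_func=ratio_similarity, threshold=0.5):
    """A token pair counts as a match only if its score exceeds the threshold."""
    return sim_func(s1, s2) > threshold


print(is_match('energy', 'eneryg'))               # misspelling still matches
print(is_match('data', 'science'))                # no characters in common
print(is_match('data', 'data', threshold=1.0))    # 1.0 does not exceed 1.0
```

A function with this shape could presumably be passed to the constructor as GeneralizedJaccard(sim_func=ratio_similarity, threshold=0.7), since the class only requires a callable that scores two strings in [0,1].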
get_raw_score(set1, set2)
Computes the Generalized Jaccard measure between two sets.
This similarity measure is a softened version of the Jaccard measure. The Jaccard measure works well when tokens match exactly across the sets. In practice, however, tokens are often misspelled, such as energy vs. eneryg. The generalized Jaccard measure enables matching in such cases.
Parameters: set1, set2 (set or list) – Input sets (or lists) of strings. Input lists are converted to sets.
Returns: Generalized Jaccard similarity (float)
Raises:
- TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.
- ValueError – If the similarity measure doesn’t return values in the range [0,1].
Examples
>>> gj = GeneralizedJaccard()
>>> gj.get_raw_score(['data', 'science'], ['data'])
0.5
>>> gj.get_raw_score(['data', 'management'], ['data', 'data', 'science'])
0.3333333333333333
>>> gj.get_raw_score(['Niall'], ['Neal', 'Njall'])
0.43333333333333335
>>> gj = GeneralizedJaccard(sim_func=JaroWinkler().get_raw_score, threshold=0.8)
>>> gj.get_raw_score(['Comp', 'Sci.', 'and', 'Engr', 'Dept.,', 'Universty', 'of', 'Cal,', 'San', 'Deigo'],
...                  ['Department', 'of', 'Computer', 'Science,', 'Univ.', 'Calif.,', 'San', 'Diego'])
0.45810185185185187
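The way get_raw_score combines approximate token matches into one score can be sketched as follows. This is an illustrative reading of the generalized Jaccard measure, not necessarily the library's exact implementation. It assumes that token pairs scoring above the threshold are matched greedily from highest score down, with each token used at most once, and that the summed match scores are divided by |set1| + |set2| minus the number of matches. The names generalized_jaccard_sketch and exact_match are hypothetical.

```python
def exact_match(s1, s2):
    """Hypothetical stand-in similarity: 1.0 for identical strings, else 0.0."""
    return 1.0 if s1 == s2 else 0.0


def generalized_jaccard_sketch(set1, set2, sim_func, threshold=0.5):
    set1, set2 = set(set1), set(set2)  # input lists are converted to sets
    if not set1 or not set2:
        return 0.0  # assumption: an empty input scores 0

    # Collect every cross-set pair whose score exceeds the threshold.
    candidates = []
    for a in set1:
        for b in set2:
            score = sim_func(a, b)
            if score < 0 or score > 1:
                raise ValueError('similarity measure must return values in [0,1]')
            if score > threshold:
                candidates.append((score, a, b))

    # Greedy one-to-one matching, best-scoring pairs first (an assumption).
    candidates.sort(key=lambda c: c[0], reverse=True)
    used1, used2 = set(), set()
    total = 0.0
    for score, a, b in candidates:
        if a not in used1 and b not in used2:
            used1.add(a)
            used2.add(b)
            total += score

    # Matched pairs are counted once in the denominator.
    return total / (len(set1) + len(set2) - len(used1))


print(generalized_jaccard_sketch(['data', 'science'], ['data'], exact_match))  # 0.5
```

With exact_match as the similarity function, the sketch reduces to the plain Jaccard measure, which is why the first two documented examples (0.5 and 0.3333333333333333) come out the same under it. A softer sim_func such as Jaro lets a pair like energy and eneryg contribute a partial score instead of zero.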
get_sim_score(set1, set2)
Computes the normalized Generalized Jaccard similarity between two sets.
Parameters: set1, set2 (set or list) – Input sets (or lists) of strings. Input lists are converted to sets.
Returns: Normalized Generalized Jaccard similarity (float)
Raises:
- TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.
- ValueError – If the similarity measure doesn’t return values in the range [0,1].
Examples
>>> gj = GeneralizedJaccard()
>>> gj.get_sim_score(['data', 'science'], ['data'])
0.5
>>> gj.get_sim_score(['data', 'management'], ['data', 'data', 'science'])
0.3333333333333333
>>> gj.get_sim_score(['Niall'], ['Neal', 'Njall'])
0.43333333333333335
>>> gj = GeneralizedJaccard(sim_func=JaroWinkler().get_raw_score, threshold=0.8)
>>> gj.get_sim_score(['Comp', 'Sci.', 'and', 'Engr', 'Dept.,', 'Universty', 'of', 'Cal,', 'San', 'Deigo'],
...                  ['Department', 'of', 'Computer', 'Science,', 'Univ.', 'Calif.,', 'San', 'Diego'])
0.45810185185185187