Generalized Jaccard

Generalized Jaccard similarity measure

class py_stringmatching.similarity_measure.generalized_jaccard.GeneralizedJaccard(sim_func=Jaro().get_raw_score, threshold=0.5)

Generalized Jaccard similarity measure class.

Parameters:
  • sim_func (function) – Similarity function used to compare a pair of tokens, one from each set. It should take two strings and return a similarity score in the range [0, 1]. Optional; defaults to the Jaro similarity measure.
  • threshold (float) – Threshold value (defaults to 0.5). If the similarity of a token pair exceeds the threshold, then the token pair is considered a match.
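
To illustrate the contract of these two parameters, here is a minimal sketch of a custom similarity function and of the threshold test. The names prefix_similarity and is_match are hypothetical and not part of py_stringmatching. The sketch assumes only what is stated above: the function takes two strings and returns a score in [0, 1], and a token pair counts as a match when its score strictly exceeds the threshold.

```python
def prefix_similarity(s1, s2):
    """Hypothetical similarity: shared-prefix length over the longer length."""
    if not s1 and not s2:
        return 1.0  # assumption: two empty strings are treated as identical
    shared = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2:
            break
        shared += 1
    return shared / max(len(s1), len(s2))


def is_match(s1, s2, sim_func, threshold=0.5):
    """A pair matches only when its score strictly exceeds the threshold."""
    return sim_func(s1, s2) > threshold


# 'energy' and 'eneryg' share the prefix 'ener': 4 / 6, about 0.667.
print(is_match('energy', 'eneryg', prefix_similarity))                 # True
print(is_match('energy', 'eneryg', prefix_similarity, threshold=0.8))  # False
```

A function with this shape could presumably be supplied as sim_func, and raising the threshold makes the measure stricter about which token pairs count as matches.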
get_raw_score(set1, set2)

Computes the Generalized Jaccard measure between two sets.

This similarity measure is a softened version of the Jaccard measure. The Jaccard measure works well when tokens match exactly across the sets. In practice, however, tokens are often misspelled, such as energy vs. eneryg. The generalized Jaccard measure enables matching in such cases.
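
The idea can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: the function names are hypothetical, difflib's SequenceMatcher ratio stands in for the default Jaro measure, pairs are matched greedily from the highest score down, and the conventions for empty inputs are assumed.

```python
from difflib import SequenceMatcher


def ratio_similarity(s1, s2):
    """Stand-in similarity in [0, 1]; NOT the Jaro measure used by default."""
    return SequenceMatcher(None, s1, s2).ratio()


def generalized_jaccard_sketch(tokens1, tokens2,
                               sim_func=ratio_similarity, threshold=0.5):
    """Hypothetical sketch of a generalized Jaccard score."""
    set1, set2 = set(tokens1), set(tokens2)
    # Assumed conventions for empty inputs.
    if not set1 and not set2:
        return 1.0
    if not set1 or not set2:
        return 0.0

    # Score every cross pair and keep those strictly above the threshold.
    candidates = []
    for t1 in set1:
        for t2 in set2:
            score = sim_func(t1, t2)
            if score > threshold:
                candidates.append((score, t1, t2))

    # Greedy matching: best pairs first, each token used at most once.
    candidates.sort(key=lambda c: c[0], reverse=True)
    used1, used2 = set(), set()
    total = 0.0
    for score, t1, t2 in candidates:
        if t1 not in used1 and t2 not in used2:
            used1.add(t1)
            used2.add(t2)
            total += score

    # Sum of matched scores over the size of the "soft" union.
    return total / (len(set1) + len(set2) - len(used1))


print(generalized_jaccard_sketch(['data', 'science'], ['data']))  # 0.5
print(generalized_jaccard_sketch(['energy'], ['eneryg']) > 0.5)   # True
```

With exact matching only, 'energy' and 'eneryg' would contribute nothing to the score; here the misspelled pair clears the threshold and contributes its similarity score instead.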

Parameters:

set1, set2 (set or list) – Input sets (or lists) of strings. Input lists are converted to sets.

Returns:

Generalized Jaccard similarity (float)

Raises:
  • TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.
  • ValueError – If the similarity measure doesn’t return values in the range [0,1]
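
The error conditions listed above can be pictured with the following sketch. The helpers check_token_inputs and check_score are hypothetical names written for illustration; they are assumptions about what such checks might look like, not the library's own validation code.

```python
def check_token_inputs(set1, set2):
    """Hypothetical check: reject None and anything that is not a set or list."""
    for value in (set1, set2):
        if value is None:
            raise TypeError('input cannot be None')
        if not isinstance(value, (set, list)):
            raise TypeError('input must be a set or a list')
    # Lists are converted to sets, so duplicate tokens collapse.
    return set(set1), set(set2)


def check_score(score):
    """Hypothetical check: a similarity score must lie in [0, 1]."""
    if score < 0 or score > 1:
        raise ValueError('similarity function must return values in [0, 1]')
    return score


print(check_token_inputs(['data', 'data'], {'science'}))  # ({'data'}, {'science'})
```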

Examples

>>> from py_stringmatching.similarity_measure.generalized_jaccard import GeneralizedJaccard
>>> gj = GeneralizedJaccard()
>>> gj.get_raw_score(['data', 'science'], ['data'])
0.5
>>> gj.get_raw_score(['data', 'management'], ['data', 'data', 'science'])
0.3333333333333333
>>> gj.get_raw_score(['Niall'], ['Neal', 'Njall'])
0.43333333333333335
>>> from py_stringmatching.similarity_measure.jaro_winkler import JaroWinkler
>>> gj = GeneralizedJaccard(sim_func=JaroWinkler().get_raw_score, threshold=0.8)
>>> gj.get_raw_score(['Comp', 'Sci.', 'and', 'Engr', 'Dept.,', 'Universty', 'of', 'Cal,', 'San', 'Deigo'],
                     ['Department', 'of', 'Computer', 'Science,', 'Univ.', 'Calif.,', 'San', 'Diego'])
0.45810185185185187
get_sim_func()

Get the similarity function.

Returns: similarity function (function)
get_sim_score(set1, set2)

Computes the normalized Generalized Jaccard similarity between two sets.

Parameters:

set1, set2 (set or list) – Input sets (or lists) of strings. Input lists are converted to sets.

Returns:

Normalized Generalized Jaccard similarity (float)

Raises:
  • TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.
  • ValueError – If the similarity measure doesn’t return values in the range [0,1]

Examples

>>> from py_stringmatching.similarity_measure.generalized_jaccard import GeneralizedJaccard
>>> gj = GeneralizedJaccard()
>>> gj.get_sim_score(['data', 'science'], ['data'])
0.5
>>> gj.get_sim_score(['data', 'management'], ['data', 'data', 'science'])
0.3333333333333333
>>> gj.get_sim_score(['Niall'], ['Neal', 'Njall'])
0.43333333333333335
>>> from py_stringmatching.similarity_measure.jaro_winkler import JaroWinkler
>>> gj = GeneralizedJaccard(sim_func=JaroWinkler().get_raw_score, threshold=0.8)
>>> gj.get_sim_score(['Comp', 'Sci.', 'and', 'Engr', 'Dept.,', 'Universty', 'of', 'Cal,', 'San', 'Deigo'],
                     ['Department', 'of', 'Computer', 'Science,', 'Univ.', 'Calif.,', 'San', 'Diego'])
0.45810185185185187
get_threshold()

Get the threshold used for the similarity function.

Returns: threshold (float)
set_sim_func(sim_func)

Set the similarity function.

Parameters: sim_func (function) – similarity function
set_threshold(threshold)

Set the threshold value for the similarity function.

Parameters: threshold (float) – threshold value