Jaccard

class py_stringmatching.similarity_measure.jaccard.Jaccard

Computes the Jaccard measure.

For two sets X and Y, the Jaccard similarity score is:

\(jaccard(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}\)

Note

In the case where both X and Y are empty sets, we define their Jaccard score to be 1.
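The definition above, including the empty-set convention, can be sketched in plain Python. This is only an illustration of the formula, not the library's implementation; the function name jaccard_sketch is hypothetical and is not part of py_stringmatching.

```python
def jaccard_sketch(x, y):
    """Illustrative Jaccard score; hypothetical helper, not the library's code."""
    # Lists are converted to sets, so duplicate elements are ignored.
    set_x, set_y = set(x), set(y)
    # Convention from the note above: two empty sets score 1.
    if not set_x and not set_y:
        return 1.0
    # |X intersect Y| / |X union Y|
    return len(set_x & set_y) / len(set_x | set_y)


print(jaccard_sketch(['data', 'science'], ['data']))  # 0.5
print(jaccard_sketch([], []))                         # 1.0
```

The explicit empty-set check is needed because the union is empty in that case, and the division would otherwise raise ZeroDivisionError.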

get_raw_score(set1, set2)

Computes the raw Jaccard score between two sets.

Parameters: set1, set2 (set or list) – Input sets (or lists). Input lists are converted to sets.
Returns: Jaccard similarity score (float).
Raises: TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.

Examples

>>> jac = Jaccard()
>>> jac.get_raw_score(['data', 'science'], ['data'])
0.5
>>> jac.get_raw_score({1, 1, 2, 3, 4}, {2, 3, 4, 5, 6, 7, 7, 8})
0.375
>>> jac.get_raw_score(['data', 'management'], ['data', 'data', 'science'])
0.3333333333333333
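
As a worked check of the last example: because input lists are converted to sets, the repeated 'data' in the second list counts once, so X = {data, management} and Y = {data, science}, giving

\(jaccard(X, Y) = \frac{|\{data\}|}{|\{data, management, science\}|} = \frac{1}{3} \approx 0.3333\)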

get_sim_score(set1, set2)

Computes the normalized Jaccard similarity between two sets. This simply calls get_raw_score, because the raw Jaccard score already lies in the range [0, 1].

Parameters: set1, set2 (set or list) – Input sets (or lists). Input lists are converted to sets.
Returns: Normalized Jaccard similarity (float).
Raises: TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.

Examples

>>> jac = Jaccard()
>>> jac.get_sim_score(['data', 'science'], ['data'])
0.5
>>> jac.get_sim_score({1, 1, 2, 3, 4}, {2, 3, 4, 5, 6, 7, 7, 8})
0.375
>>> jac.get_sim_score(['data', 'management'], ['data', 'data', 'science'])
0.3333333333333333