Overlap Coefficient

class py_stringmatching.similarity_measure.overlap_coefficient.OverlapCoefficient

Computes the overlap coefficient measure.

The overlap coefficient is a similarity measure related to the Jaccard measure. It measures the overlap between two sets and is defined as the size of their intersection divided by the size of the smaller of the two sets. For two sets X and Y, the overlap coefficient is:

\(overlap\_coefficient(X, Y) = \frac{|X \cap Y|}{\min(|X|, |Y|)}\)

Note

  • In the case where one of X and Y is an empty set and the other is a non-empty set, we define their overlap coefficient to be 0.
  • In the case where both X and Y are empty sets, we define their overlap coefficient to be 1.
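
The definition and the two empty-set conventions above can be sketched in plain Python. This is a minimal illustration only, not the library's implementation; the function name overlap_coefficient is a hypothetical helper introduced here.

```python
def overlap_coefficient(set1, set2):
    """Hypothetical standalone helper; not py_stringmatching's own code."""
    # Accept lists as well as sets by converting both inputs to sets.
    x, y = set(set1), set(set2)
    # Both sets empty: the coefficient is defined as 1.
    if not x and not y:
        return 1.0
    # Exactly one set empty: the coefficient is defined as 0.
    if not x or not y:
        return 0.0
    # General case: |X ∩ Y| / min(|X|, |Y|).
    return len(x & y) / min(len(x), len(y))


print(overlap_coefficient(['data', 'science'], ['data']))                # 1.0
print(overlap_coefficient(['data', 'science'], ['data', 'management']))  # 0.5
```

Because the denominator is the size of the smaller set, the score reaches 1.0 whenever one set is a subset of the other, which is the main practical difference from the Jaccard measure.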

get_raw_score(set1, set2)

Computes the raw overlap coefficient score between two sets.

Parameters: set1, set2 (set or list) – Input sets (or lists). Input lists are converted to sets.
Returns: Overlap coefficient (float).
Raises: TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.
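
The input contract described above (sets or lists accepted, lists converted to sets, TypeError otherwise) could be checked along the following lines. This is a sketch under stated assumptions: check_set_inputs is a hypothetical helper, and the error messages are invented for illustration; the library's internal validation may differ.

```python
def check_set_inputs(set1, set2):
    """Hypothetical validator mirroring the documented contract."""
    for arg in (set1, set2):
        # None is rejected explicitly, as the documentation states.
        if arg is None:
            raise TypeError("input cannot be None")
        # Only sets and lists are accepted; strings, tuples, etc. are not.
        if not isinstance(arg, (set, list)):
            raise TypeError("input must be a set or a list")
    # Lists are converted to sets; sets are copied unchanged.
    return set(set1), set(set2)


try:
    check_set_inputs('data', ['data'])
except TypeError as err:
    print("rejected:", err)
```

Under this contract, strings must be tokenized into a list or set before they are passed in; a bare string is rejected rather than being treated as a set of characters.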

Examples

>>> oc = OverlapCoefficient()
>>> oc.get_raw_score(['data', 'science'], ['data'])
1.0
>>> oc.get_raw_score([], [])
1.0
>>> oc.get_raw_score([], ['data'])
0

get_sim_score(set1, set2)

Computes the normalized overlap coefficient between two sets. Because the raw score already lies in [0, 1], this method simply calls get_raw_score.

Parameters: set1, set2 (set or list) – Input sets (or lists). Input lists are converted to sets.
Returns: Normalized overlap coefficient (float).
Raises: TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.

Examples

>>> oc = OverlapCoefficient()
>>> oc.get_sim_score(['data', 'science'], ['data'])
1.0
>>> oc.get_sim_score([], [])
1.0
>>> oc.get_sim_score([], ['data'])
0