Overlap Coefficient
class py_stringmatching.similarity_measure.overlap_coefficient.OverlapCoefficient
Computes the overlap coefficient measure.
The overlap coefficient is a similarity measure, related to the Jaccard measure, that quantifies the overlap between two sets. It is defined as the size of the intersection divided by the size of the smaller of the two sets. For two sets X and Y, the overlap coefficient is:
\(overlap\_coefficient(X, Y) = \frac{|X \cap Y|}{\min(|X|, |Y|)}\)
Note
If exactly one of X and Y is empty, their overlap coefficient is defined to be 0.
If both X and Y are empty, their overlap coefficient is defined to be 1.
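The definition and the two empty-set conventions above can be sketched in plain Python. This is an illustrative sketch only, not the library's implementation; the function name `overlap_coefficient` is a hypothetical helper introduced here.

```python
def overlap_coefficient(x, y):
    """Illustrative sketch of the definition above (hypothetical helper,
    not the py_stringmatching implementation)."""
    # Lists (or any iterables of hashable items) are converted to sets.
    set_x, set_y = set(x), set(y)

    # Convention from the note: both sets empty -> 1.
    if not set_x and not set_y:
        return 1.0
    # Convention from the note: exactly one set empty -> 0.
    if not set_x or not set_y:
        return 0.0

    # |X ∩ Y| / min(|X|, |Y|)
    return len(set_x & set_y) / min(len(set_x), len(set_y))


# ['data'] is a subset of ['data', 'science'], so the score is 1.0.
print(overlap_coefficient(['data', 'science'], ['data']))

# Intersection {'b', 'c'} has size 2 and the smaller set has size 3.
print(overlap_coefficient(['a', 'b', 'c'], ['b', 'c', 'd', 'e']))
```

Because the denominator is the size of the smaller set, the score reaches 1.0 whenever one set is a subset of the other.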
get_raw_score(set1, set2)
Computes the raw overlap coefficient score between two sets.
- Parameters
set1 (set or list) – First input set (or list). An input list is converted to a set.
set2 (set or list) – Second input set (or list). An input list is converted to a set.
- Returns
Overlap coefficient (float).
- Raises
TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.
Examples
>>> oc = OverlapCoefficient()
>>> oc.get_raw_score(['data', 'science'], ['data'])
1.0
>>> oc.get_raw_score([], [])
1.0
>>> oc.get_raw_score([], ['data'])
0
References
Wikipedia article: https://en.wikipedia.org/wiki/Overlap_coefficient
SimMetrics library
get_sim_score(set1, set2)
Computes the normalized overlap coefficient between two sets. This simply calls get_raw_score, whose result already lies in the range [0, 1].
- Parameters
set1 (set or list) – First input set (or list). An input list is converted to a set.
set2 (set or list) – Second input set (or list). An input list is converted to a set.
- Returns
Normalized overlap coefficient (float).
- Raises
TypeError – If the inputs are not sets (or lists) or if one of the inputs is None.
Examples
>>> oc = OverlapCoefficient()
>>> oc.get_sim_score(['data', 'science'], ['data'])
1.0
>>> oc.get_sim_score([], [])
1.0
>>> oc.get_sim_score([], ['data'])
0
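The raw score needs no further normalization because the intersection can never be larger than the smaller set, so the ratio always falls between 0 and 1. The sketch below illustrates this and contrasts the result with the Jaccard measure on the same inputs. Both `overlap` and `jaccard` are hypothetical helpers written for this example, defined here only for non-empty sets; they are not the library's code.

```python
def overlap(x, y):
    # |X ∩ Y| / min(|X|, |Y|), for non-empty inputs only (illustrative helper).
    x, y = set(x), set(y)
    return len(x & y) / min(len(x), len(y))


def jaccard(x, y):
    # |X ∩ Y| / |X ∪ Y|, for non-empty inputs only (illustrative helper).
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)


small = ['data']
large = ['data', 'science', 'management']

# The smaller set is fully contained in the larger one, so overlap is 1.0,
# while Jaccard penalizes the two extra tokens in the larger set (1/3).
print(overlap(small, large))
print(jaccard(small, large))

# The union is never smaller than the smaller set, so overlap >= Jaccard.
assert 0.0 <= jaccard(small, large) <= overlap(small, large) <= 1.0
```

The overlap coefficient is the more forgiving of the two when one token set is much shorter than the other, for example when matching an abbreviated name against a full one.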