Supported Similarity Functions
- py_entitymatching.affine(s1, s2)
This function computes the affine measure between the two input strings.
- Parameters
s1 (string) – The first input string for which the measure should be computed.
s2 (string) – The second input string for which the measure should be computed.
- Returns
The affine measure if neither string is missing (i.e., NaN or None); otherwise NaN.
Examples
>>> import py_entitymatching as em
>>> em.affine('dva', 'deeva')
1.5
>>> em.affine(None, 'deeva')
nan
- py_entitymatching.hamming_dist(s1, s2)
This function computes the Hamming distance between the two input strings.
- Parameters
s1 (string) – The first input string for which the distance should be computed.
s2 (string) – The second input string for which the distance should be computed.
- Returns
The Hamming distance if neither string is missing (i.e., NaN or None); otherwise NaN.
Examples
>>> import py_entitymatching as em
>>> em.hamming_dist('alex', 'john')
4
>>> em.hamming_dist(None, 'john')
nan
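As a rough illustration of what the Hamming distance measures, the sketch below counts the positions at which two equal-length strings differ. It is plain Python, not the library's implementation; the names `is_missing` and `hamming_distance_sketch`, and the decision to raise on unequal lengths, are assumptions made for this example.

```python
import math


def is_missing(value):
    """Return True for None or a float NaN (hypothetical helper)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def hamming_distance_sketch(s1, s2):
    """Count the positions at which two equal-length strings differ."""
    if is_missing(s1) or is_missing(s2):
        return float("nan")
    if len(s1) != len(s2):
        # Assumption: the distance is only defined for equal lengths.
        raise ValueError("inputs must have the same length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


# 'alex' and 'john' differ at every one of their four positions.
print(hamming_distance_sketch("alex", "john"))
```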
- py_entitymatching.hamming_sim(s1, s2)
This function computes the Hamming similarity between the two input strings.
- Parameters
s1 (string) – The first input string for which the similarity should be computed.
s2 (string) – The second input string for which the similarity should be computed.
- Returns
The Hamming similarity if neither string is missing (i.e., NaN or None); otherwise NaN.
Examples
>>> import py_entitymatching as em
>>> em.hamming_sim('alex', 'alxe')
0.5
>>> em.hamming_sim(None, 'alex')
nan
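The similarity value in the example above is consistent with normalizing the Hamming distance by the string length, that is, 1 - distance / length. A minimal sketch under that assumption follows; the function name is hypothetical and missing-value handling is omitted for brevity.

```python
def hamming_similarity_sketch(s1, s2):
    """Return 1 - (number of differing positions) / (string length)."""
    if len(s1) != len(s2):
        raise ValueError("inputs must have the same length")
    if not s1:
        # Assumption: two empty strings are treated as identical.
        return 1.0
    mismatches = sum(c1 != c2 for c1, c2 in zip(s1, s2))
    return 1.0 - mismatches / len(s1)


# 'alex' and 'alxe' differ in the last two of their four positions.
print(hamming_similarity_sketch("alex", "alxe"))
```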
- py_entitymatching.lev_dist(s1, s2)
This function computes the Levenshtein distance between the two input strings.
- Parameters
s1 (string) – The first input string for which the distance should be computed.
s2 (string) – The second input string for which the distance should be computed.
- Returns
The Levenshtein distance if neither string is missing (i.e., NaN or None); otherwise NaN.
Examples
>>> import py_entitymatching as em
>>> em.lev_dist('alex', 'alxe')
2
>>> em.lev_dist(None, 'alex')
nan
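The Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. The sketch below shows the standard dynamic-programming computation with unit costs; it is an illustration only (the function name is hypothetical, and missing-value handling is left out).

```python
def levenshtein_distance_sketch(s1, s2):
    """Edit distance with unit-cost insert, delete, and substitute."""
    # previous[j] is the distance between the prefix of s1 processed so far
    # (one character shorter than the current row) and the first j
    # characters of s2.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # distance from s1[:i] to the empty string
        for j, c2 in enumerate(s2, start=1):
            substitution = previous[j - 1] + (c1 != c2)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


# Swapping two adjacent characters costs two unit edits.
print(levenshtein_distance_sketch("alex", "alxe"))
```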
- py_entitymatching.lev_sim(s1, s2)
This function computes the Levenshtein similarity between the two input strings.
- Parameters
s1 (string) – The first input string for which the similarity should be computed.
s2 (string) – The second input string for which the similarity should be computed.
- Returns
The Levenshtein similarity if neither string is missing (i.e., NaN or None); otherwise NaN.
Examples
>>> import py_entitymatching as em
>>> em.lev_sim('alex', 'alxe')
0.5
>>> em.lev_sim(None, 'alex')
nan
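The example value above is consistent with normalizing the Levenshtein distance by the length of the longer string, that is, 1 - distance / max(len(s1), len(s2)). The sketch below shows only that normalization step; it takes an already computed distance as input, and its name and signature are hypothetical.

```python
def levenshtein_similarity_sketch(distance, len1, len2):
    """Map an edit distance to a similarity in [0, 1]."""
    longest = max(len1, len2)
    if longest == 0:
        # Assumption: two empty strings are treated as identical.
        return 1.0
    return 1.0 - distance / longest


# 'alex' and 'alxe' have edit distance 2 and both have length 4.
print(levenshtein_similarity_sketch(2, 4, 4))
```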
- py_entitymatching.jaro(s1, s2)
This function computes the Jaro measure between the two input strings.
- Parameters
s1 (string) – The first input string for which the measure should be computed.
s2 (string) – The second input string for which the measure should be computed.
- Returns
The Jaro measure if neither string is missing (i.e., NaN or None); otherwise NaN.
Examples
>>> import py_entitymatching as em
>>> em.jaro('MARTHA', 'MARHTA')
0.9444444444444445
>>> em.jaro(None, 'MARTHA')
nan
- py_entitymatching.jaro_winkler(s1, s2)
This function computes the Jaro-Winkler measure between the two input strings.
- Parameters
s1 (string) – The first input string for which the measure should be computed.
s2 (string) – The second input string for which the measure should be computed.
- Returns
The Jaro-Winkler measure if neither string is missing (i.e., NaN or None); otherwise NaN.
Examples
>>> import py_entitymatching as em
>>> em.jaro_winkler('MARTHA', 'MARHTA')
0.9611111111111111
>>> em.jaro_winkler('MARTHA', None)
nan
- py_entitymatching.needleman_wunsch(s1, s2)
This function computes the Needleman-Wunsch measure between the two input strings.
- Parameters
s1 (string) – The first input string for which the measure should be computed.
s2 (string) – The second input string for which the measure should be computed.
- Returns
The Needleman-Wunsch measure if neither string is missing (i.e., NaN or None); otherwise NaN.
Examples
>>> import py_entitymatching as em
>>> em.needleman_wunsch('dva', 'deeva')
1.0
>>> em.needleman_wunsch('dva', None)
nan
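The Needleman-Wunsch measure is a global alignment score computed by dynamic programming. The sketch below assumes a match score of 1, a mismatch score of 0, and a gap cost of 1; these settings are assumptions chosen because they reproduce the example value above, and the function name is hypothetical.

```python
def needleman_wunsch_sketch(s1, s2, gap_cost=1.0):
    """Global alignment score: match = 1, mismatch = 0, gap = -gap_cost."""
    rows, cols = len(s1) + 1, len(s2) + 1
    score = [[0.0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(rows):
        score[i][0] = -i * gap_cost
    for j in range(cols):
        score[0][j] = -j * gap_cost
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (1.0 if s1[i - 1] == s2[j - 1] else 0.0)
            up = score[i - 1][j] - gap_cost      # gap inserted in s2
            left = score[i][j - 1] - gap_cost    # gap inserted in s1
            score[i][j] = max(diagonal, up, left)
    return score[-1][-1]


# Three matching characters (d, v, a) minus two gaps for the extra 'e's.
print(needleman_wunsch_sketch("dva", "deeva"))
```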
- py_entitymatching.smith_waterman(s1, s2)
This function computes the Smith-Waterman measure between the two input strings.
- Parameters
s1 (string) – The first input string for which the measure should be computed.
s2 (string) – The second input string for which the measure should be computed.
- Returns
The Smith-Waterman measure if neither string is missing (i.e., NaN or None); otherwise NaN.
Examples
>>> import py_entitymatching as em
>>> em.smith_waterman('cat', 'hat')
2.0
>>> em.smith_waterman('cat', None)
nan
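The Smith-Waterman measure scores the best local alignment, so a poorly matching prefix or suffix does not lower the result. The sketch below assumes a match score of 1, a mismatch score of 0, and a gap cost of 1; these settings are assumptions that happen to reproduce the example value above, and the function name is hypothetical.

```python
def smith_waterman_sketch(s1, s2, gap_cost=1.0):
    """Best local alignment score: match = 1, mismatch = 0, gap = -gap_cost."""
    rows, cols = len(s1) + 1, len(s2) + 1
    score = [[0.0] * cols for _ in range(rows)]
    best = 0.0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (1.0 if s1[i - 1] == s2[j - 1] else 0.0)
            up = score[i - 1][j] - gap_cost
            left = score[i][j - 1] - gap_cost
            # Clamping at zero lets an alignment restart anywhere.
            score[i][j] = max(0.0, diagonal, up, left)
            best = max(best, score[i][j])
    return best


# The shared substring 'at' contributes two matches.
print(smith_waterman_sketch("cat", "hat"))
```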
- py_entitymatching.jaccard(arr1, arr2)
This function computes the Jaccard measure between the two input lists/sets.
- Parameters
arr1 (list or set) – The first input list or set for which the Jaccard measure should be computed.
arr2 (list or set) – The second input list or set for which the Jaccard measure should be computed.
- Returns
The Jaccard measure if neither list/set is None and neither contains missing tokens (i.e., NaN); otherwise NaN.
Examples
>>> import py_entitymatching as em
>>> em.jaccard(['data', 'science'], ['data'])
0.5
>>> em.jaccard(['data', 'science'], None)
nan
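The Jaccard measure is the size of the intersection of the two token sets divided by the size of their union. A minimal plain-Python sketch follows; the function name and the treatment of two empty inputs are assumptions, and missing-token handling is omitted.

```python
def jaccard_sketch(arr1, arr2):
    """|A intersect B| / |A union B| over the distinct tokens of each input."""
    set1, set2 = set(arr1), set(arr2)
    union = set1 | set2
    if not union:
        # Assumption: two empty token sets are treated as identical.
        return 1.0
    return len(set1 & set2) / len(union)


# One shared token ('data') out of two distinct tokens overall.
print(jaccard_sketch(["data", "science"], ["data"]))
```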
- py_entitymatching.cosine(arr1, arr2)
This function computes the cosine measure between the two input lists/sets.
- Parameters
arr1 (list or set) – The first input list or set for which the cosine measure should be computed.
arr2 (list or set) – The second input list or set for which the cosine measure should be computed.
- Returns
The cosine measure if neither list/set is None and neither contains missing tokens (i.e., NaN); otherwise NaN.
Examples
>>> import py_entitymatching as em
>>> em.cosine(['data', 'science'], ['data'])
0.7071067811865475
>>> em.cosine(['data', 'science'], None)
nan
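For token sets, the cosine measure reduces to the intersection size divided by the geometric mean of the two set sizes. The sketch below illustrates that set-based form; the function name and the empty-input conventions are assumptions, and missing-token handling is omitted.

```python
import math


def cosine_sketch(arr1, arr2):
    """|A intersect B| / sqrt(|A| * |B|) over the distinct tokens."""
    set1, set2 = set(arr1), set(arr2)
    if not set1 and not set2:
        # Assumption: two empty token sets are treated as identical.
        return 1.0
    if not set1 or not set2:
        # Assumption: an empty set shares nothing with a non-empty one.
        return 0.0
    return len(set1 & set2) / math.sqrt(len(set1) * len(set2))


# One shared token, set sizes 2 and 1, so the score is 1 / sqrt(2).
print(cosine_sketch(["data", "science"], ["data"]))
```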
- py_entitymatching.overlap_coeff(arr1, arr2)
This function computes the overlap coefficient between the two input lists/sets.
- Parameters
arr1 (list or set) – The first input list or set for which the overlap coefficient should be computed.
arr2 (list or set) – The second input list or set for which the overlap coefficient should be computed.
- Returns
The overlap coefficient if neither list/set is None and neither contains missing tokens (i.e., NaN); otherwise NaN.
Examples
>>> import py_entitymatching as em
>>> em.overlap_coeff(['data', 'science'], ['data'])
1.0
>>> em.overlap_coeff(['data', 'science'], None)
nan
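The overlap coefficient divides the intersection size by the size of the smaller set, so it reaches 1.0 whenever one set is contained in the other. A minimal sketch follows; the function name and the empty-input convention are assumptions, and missing-token handling is omitted.

```python
def overlap_coefficient_sketch(arr1, arr2):
    """|A intersect B| / min(|A|, |B|) over the distinct tokens."""
    set1, set2 = set(arr1), set(arr2)
    smaller = min(len(set1), len(set2))
    if smaller == 0:
        # Assumption: an empty set yields no overlap.
        return 0.0
    return len(set1 & set2) / smaller


# ['data'] is fully contained in ['data', 'science'].
print(overlap_coefficient_sketch(["data", "science"], ["data"]))
```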
- py_entitymatching.dice(arr1, arr2)
This function computes the Dice score between the two input lists/sets.
- Parameters
arr1 (list or set) – The first input list or set for which the Dice score should be computed.
arr2 (list or set) – The second input list or set for which the Dice score should be computed.
- Returns
The Dice score if neither list/set is None and neither contains missing tokens (i.e., NaN); otherwise NaN.
Examples
>>> import py_entitymatching as em
>>> em.dice(['data', 'science'], ['data'])
0.6666666666666666
>>> em.dice(['data', 'science'], None)
nan
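The Dice score is twice the intersection size divided by the sum of the two set sizes. A minimal sketch follows; the function name and the treatment of two empty inputs are assumptions, and missing-token handling is omitted.

```python
def dice_sketch(arr1, arr2):
    """2 * |A intersect B| / (|A| + |B|) over the distinct tokens."""
    set1, set2 = set(arr1), set(arr2)
    total = len(set1) + len(set2)
    if total == 0:
        # Assumption: two empty token sets are treated as identical.
        return 1.0
    return 2 * len(set1 & set2) / total


# One shared token, set sizes 2 and 1, so the score is 2 / 3.
print(dice_sketch(["data", "science"], ["data"]))
```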
- py_entitymatching.monge_elkan(arr1, arr2)
This function computes the Monge-Elkan measure between the two input lists/sets. Specifically, it uses the Jaro-Winkler measure as the secondary function to compute the similarity score.
- Parameters
arr1 (list or set) – The first input list or set for which the Monge-Elkan measure should be computed.
arr2 (list or set) – The second input list or set for which the Monge-Elkan measure should be computed.
- Returns
The Monge-Elkan measure if neither list/set is None and neither contains missing tokens (i.e., NaN); otherwise NaN.
Examples
>>> import py_entitymatching as em
>>> em.monge_elkan(['Niall'], ['Neal'])
0.8049999999999999
>>> em.monge_elkan(['Niall'], None)
nan
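The Monge-Elkan measure pairs each token of the first input with its best-matching token in the second input, as judged by a secondary similarity function, and averages those best scores. The sketch below takes the secondary function as a parameter so it stays self-contained; that parameter, the function names, and the toy `exact` measure are assumptions for illustration and do not reflect the library's signature.

```python
def monge_elkan_sketch(arr1, arr2, secondary_sim):
    """Average, over tokens of arr1, of the best secondary_sim score in arr2."""
    tokens1, tokens2 = list(arr1), list(arr2)
    if not tokens1 or not tokens2:
        # Assumption: no tokens on either side means no similarity.
        return 0.0
    total = 0.0
    for t1 in tokens1:
        total += max(secondary_sim(t1, t2) for t2 in tokens2)
    return total / len(tokens1)


def exact(a, b):
    """Toy secondary measure: 1.0 for identical tokens, else 0.0."""
    return 1.0 if a == b else 0.0


# 'data' finds a perfect partner, 'science' finds none: (1.0 + 0.0) / 2.
print(monge_elkan_sketch(["data", "science"], ["data"], exact))
```

Note that the measure is not symmetric: swapping the two inputs in the call above averages over a single token instead of two.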
- py_entitymatching.exact_match(d1, d2)
This function checks whether two objects match exactly. Typically the objects are strings, booleans, or integers.
- Parameters
d1 (str, boolean, int) – The first input object to be checked for an exact match.
d2 (str, boolean, int) – The second input object to be checked for an exact match.
- Returns
1 if the objects match exactly, else 0. If either object is NaN or None, NaN is returned.
Examples
>>> import py_entitymatching as em
>>> em.exact_match('Niall', 'Neal')
0
>>> em.exact_match('Niall', 'Niall')
1
>>> em.exact_match(10, 10)
1
>>> em.exact_match(10, 20)
0
>>> em.exact_match(True, True)
1
>>> em.exact_match(False, True)
0
>>> em.exact_match(10, None)
nan
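The behavior described above, including the missing-value rule, can be sketched in a few lines of plain Python. This is an illustration of the documented contract rather than the library's code; `is_missing` and `exact_match_sketch` are hypothetical names.

```python
import math


def is_missing(value):
    """Return True for None or a float NaN (hypothetical helper)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def exact_match_sketch(d1, d2):
    """Return 1 on equality, 0 otherwise, and NaN if either input is missing."""
    if is_missing(d1) or is_missing(d2):
        return float("nan")
    return 1 if d1 == d2 else 0


print(exact_match_sketch("Niall", "Niall"))
print(exact_match_sketch(10, 20))
```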
- py_entitymatching.rel_diff(d1, d2)
This function computes the relative difference between two numbers.
- Parameters
d1 (float) – The first input number for which the relative difference should be computed.
d2 (float) – The second input number for which the relative difference should be computed.
- Returns
The relative difference between the input numbers as a float (if they are valid). If either input is NaN or None, NaN is returned.
Examples
>>> import py_entitymatching as em
>>> em.rel_diff(100, 200)
0.6666666666666666
>>> em.rel_diff(100, 100)
0.0
>>> em.rel_diff(100, None)
nan
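The example values above are consistent with dividing the absolute difference by the mean of the two numbers, that is, 2 * |d1 - d2| / (d1 + d2). The sketch below uses that formula as an assumption inferred from the examples; the function name is hypothetical and missing-value handling is omitted.

```python
def rel_diff_sketch(d1, d2):
    """Absolute difference divided by the mean of the two inputs."""
    d1, d2 = float(d1), float(d2)
    if d1 == 0.0 and d2 == 0.0:
        # Assumption: two zeros have no relative difference.
        return 0.0
    # Note: raises ZeroDivisionError when d1 + d2 == 0 for nonzero inputs.
    return 2 * abs(d1 - d2) / (d1 + d2)


# |100 - 200| = 100 and the mean is 150, so the result is 2 / 3.
print(rel_diff_sketch(100, 200))
```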
- py_entitymatching.abs_norm(d1, d2)
This function computes the absolute norm similarity between two numbers.
- Parameters
d1 (float) – The first input number for which the absolute norm should be computed.
d2 (float) – The second input number for which the absolute norm should be computed.
- Returns
The absolute norm between the input numbers as a float (if they are valid). If either input is NaN or None, NaN is returned.
Examples
>>> import py_entitymatching as em
>>> em.abs_norm(100, 200)
0.5
>>> em.abs_norm(100, 100)
1.0
>>> em.abs_norm(100, None)
nan
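The example values above are consistent with subtracting the normalized absolute difference from one, that is, 1 - |d1 - d2| / max(|d1|, |d2|). The sketch below uses that formula as an assumption inferred from the examples; the function name and the zero-input convention are hypothetical, and missing-value handling is omitted.

```python
def abs_norm_sketch(d1, d2):
    """1 minus the absolute difference scaled by the larger magnitude."""
    d1, d2 = float(d1), float(d2)
    largest = max(abs(d1), abs(d2))
    if largest == 0.0:
        # Assumption: two zeros are treated as identical.
        return 1.0
    return 1.0 - abs(d1 - d2) / largest


# |100 - 200| = 100, scaled by 200, gives a similarity of 0.5.
print(abs_norm_sketch(100, 200))
```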