Supported Similarity Functions

py_entitymatching.affine(s1, s2)

This function computes the affine measure between the two input strings.

Parameters

s1 (string) – The first input string.
s2 (string) – The second input string.

Returns

The affine measure if neither string is missing (i.e., NaN or None); otherwise NaN.

Examples

>>> import py_entitymatching as em
>>> em.affine('dva', 'deeva')
1.5
>>> em.affine(None, 'deeva')
nan
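
To make the score above concrete, here is a minimal textbook-style sketch of affine-gap alignment in pure Python. It is not the library's implementation: the name affine_sketch is hypothetical, and the gap-opening cost of 1, gap-continuation cost of 0.5, and match score of 1 (mismatch 0) are assumptions chosen because they reproduce the documented value of 1.5 for 'dva' and 'deeva'.

```python
import math

NEG_INF = float("-inf")

def affine_sketch(s1, s2, gap_start=1.0, gap_continuation=0.5):
    """Affine-gap alignment score (hypothetical helper, assumed costs)."""
    # Missing inputs propagate as NaN, mirroring the documented behavior.
    if s1 is None or s2 is None:
        return math.nan
    n, m = len(s1), len(s2)
    # match[i][j]: best score where s1[i-1] is aligned with s2[j-1]
    # gap1[i][j]:  best score where s1[i-1] is aligned with a gap
    # gap2[i][j]:  best score where s2[j-1] is aligned with a gap
    match = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    gap1 = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    gap2 = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    match[0][0] = 0.0
    # A leading gap pays the opening cost once, then the continuation cost.
    for i in range(1, n + 1):
        gap1[i][0] = -(gap_start + (i - 1) * gap_continuation)
    for j in range(1, m + 1):
        gap2[0][j] = -(gap_start + (j - 1) * gap_continuation)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sim = 1.0 if s1[i - 1] == s2[j - 1] else 0.0
            match[i][j] = sim + max(match[i - 1][j - 1],
                                    gap1[i - 1][j - 1],
                                    gap2[i - 1][j - 1])
            # Either open a new gap after a match or extend an existing gap.
            gap1[i][j] = max(match[i - 1][j] - gap_start,
                             gap1[i - 1][j] - gap_continuation)
            gap2[i][j] = max(match[i][j - 1] - gap_start,
                             gap2[i][j - 1] - gap_continuation)
    return max(match[n][m], gap1[n][m], gap2[n][m])
```

Under these assumptions, 'dva' and 'deeva' align with three matches (score 3) and one gap of length two (cost 1 + 0.5), which gives 1.5.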

py_entitymatching.hamming_dist(s1, s2)

This function computes the Hamming distance between the two input strings.

Parameters

s1 (string) – The first input string.
s2 (string) – The second input string.

Returns

The Hamming distance if neither string is missing (i.e., NaN or None); otherwise NaN.

Examples

>>> import py_entitymatching as em
>>> em.hamming_dist('alex', 'john')
4
>>> em.hamming_dist(None, 'john')
nan
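
As an illustration of what the function above computes, the sketch below counts mismatched positions in pure Python. The names _is_missing and hamming_dist_sketch are hypothetical, and raising an error for strings of unequal length is an assumption of this sketch rather than documented behavior.

```python
import math

def _is_missing(value):
    """True for None or a float NaN (the missing markers described above)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def hamming_dist_sketch(s1, s2):
    """Number of positions at which two equal-length strings differ."""
    if _is_missing(s1) or _is_missing(s2):
        return math.nan
    if len(s1) != len(s2):
        # Assumption: Hamming distance is only defined for equal lengths.
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))
```

For 'alex' and 'john' all four positions differ, so the sketch returns 4.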

py_entitymatching.hamming_sim(s1, s2)

This function computes the Hamming similarity between the two input strings.

Parameters

s1 (string) – The first input string.
s2 (string) – The second input string.

Returns

The Hamming similarity if neither string is missing (i.e., NaN or None); otherwise NaN.

Examples

>>> import py_entitymatching as em
>>> em.hamming_sim('alex', 'alxe')
0.5
>>> em.hamming_sim(None, 'alex')
nan
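
The documented value of 0.5 for 'alex' and 'alxe' is consistent with normalizing the mismatch count by the string length. A minimal sketch under that assumption follows; the name hamming_sim_sketch is hypothetical, missing-value handling is omitted for brevity, and treating two empty strings as fully similar is a choice made here.

```python
def hamming_sim_sketch(s1, s2):
    """1 minus the fraction of positions that differ (equal-length strings)."""
    if len(s1) != len(s2):
        raise ValueError("Hamming similarity needs equal-length strings")
    if not s1:
        return 1.0  # assumption: two empty strings are identical
    mismatches = sum(c1 != c2 for c1, c2 in zip(s1, s2))
    return 1.0 - mismatches / len(s1)
```

'alex' and 'alxe' differ in the last two of four positions, so the result is 1 - 2/4 = 0.5.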

py_entitymatching.lev_dist(s1, s2)

This function computes the Levenshtein distance between the two input strings.

Parameters

s1 (string) – The first input string.
s2 (string) – The second input string.

Returns

The Levenshtein distance if neither string is missing (i.e., NaN or None); otherwise NaN.

Examples

>>> import py_entitymatching as em
>>> em.lev_dist('alex', 'alxe')
2
>>> em.lev_dist(None, 'alex')
nan
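
For readers unfamiliar with the measure, the following is a standard dynamic-programming sketch of Levenshtein distance with unit costs for insertion, deletion, and substitution. The name lev_dist_sketch is hypothetical, and missing-value handling is left out to keep the example short.

```python
def lev_dist_sketch(s1, s2):
    """Minimum number of single-character edits turning s1 into s2."""
    # prev[j] holds the distance between the processed prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to the empty string
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]
```

Turning 'alex' into 'alxe' takes two substitutions, matching the documented value of 2.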

py_entitymatching.lev_sim(s1, s2)

This function computes the Levenshtein similarity between the two input strings.

Parameters

s1 (string) – The first input string.
s2 (string) – The second input string.

Returns

The Levenshtein similarity if neither string is missing (i.e., NaN or None); otherwise NaN.

Examples

>>> import py_entitymatching as em
>>> em.lev_sim('alex', 'alxe')
0.5
>>> em.lev_sim(None, 'alex')
nan
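
The documented value of 0.5 is consistent with normalizing the Levenshtein distance by the length of the longer string. The sketch below shows only that normalization step; the helper name lev_sim_from_dist is hypothetical, and the distance is passed in as a parameter so the example stays self-contained.

```python
def lev_sim_from_dist(s1, s2, distance):
    """Turn a Levenshtein distance into a similarity in [0, 1]."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # assumption: two empty strings are identical
    return 1.0 - distance / longest
```

With a distance of 2 between 'alex' and 'alxe' and a maximum length of 4, the similarity is 1 - 2/4 = 0.5.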

py_entitymatching.jaro(s1, s2)

This function computes the Jaro measure between the two input strings.

Parameters

s1 (string) – The first input string.
s2 (string) – The second input string.

Returns

The Jaro measure if neither string is missing (i.e., NaN or None); otherwise NaN.

Examples

>>> import py_entitymatching as em
>>> em.jaro('MARTHA', 'MARHTA')
0.9444444444444445
>>> em.jaro(None, 'MARTHA')
nan
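
The Jaro measure rewards characters that match within a limited window and penalizes matched characters that appear out of order. The following is a minimal sketch of the textbook definition, not the library's code; jaro_sketch is a hypothetical name, and returning 0.0 for empty input is an assumption.

```python
def jaro_sketch(s1, s2):
    """Textbook Jaro similarity in [0, 1]."""
    if not s1 or not s2:
        return 0.0  # assumption: empty input has no similarity
    # Two equal characters match only if they are at most `window` apart.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters in each string, kept in their original order.
    seq1 = [c for c, used in zip(s1, matched1) if used]
    seq2 = [c for c, used in zip(s2, matched2) if used]
    # Each out-of-order pair counts as half a transposition.
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len(s1)
            + matches / len(s2)
            + (matches - transpositions) / matches) / 3.0
```

For 'MARTHA' and 'MARHTA' all six characters match and one transposition ('TH' versus 'HT') occurs, giving (1 + 1 + 5/6) / 3, approximately 0.9444.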

py_entitymatching.jaro_winkler(s1, s2)

This function computes the Jaro-Winkler measure between the two input strings.

Parameters

s1 (string) – The first input string.
s2 (string) – The second input string.

Returns

The Jaro-Winkler measure if neither string is missing (i.e., NaN or None); otherwise NaN.

Examples

>>> import py_entitymatching as em
>>> em.jaro_winkler('MARTHA', 'MARHTA')
0.9611111111111111
>>> em.jaro_winkler('MARTHA', None)
nan
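
Jaro-Winkler raises a Jaro score for strings that share a common prefix. The sketch below shows only that adjustment and takes the Jaro score as an input. The helper name winkler_boost is hypothetical, and the prefix weight of 0.1 and prefix cap of 4 are the conventional values, assumed here because they reproduce the documented result.

```python
def winkler_boost(s1, s2, jaro_score, prefix_weight=0.1, max_prefix=4):
    """Apply the Winkler prefix adjustment to an existing Jaro score."""
    prefix = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or prefix == max_prefix:
            break
        prefix += 1
    # The boost shrinks as the Jaro score approaches 1.
    return jaro_score + prefix * prefix_weight * (1.0 - jaro_score)
```

'MARTHA' and 'MARHTA' share the prefix 'MAR' (length 3) and have a Jaro score of 17/18, so the adjusted score is 17/18 + 3 * 0.1 * (1/18), approximately 0.9611.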

py_entitymatching.needleman_wunsch(s1, s2)

This function computes the Needleman-Wunsch measure between the two input strings.

Parameters

s1 (string) – The first input string.
s2 (string) – The second input string.

Returns

The Needleman-Wunsch measure if neither string is missing (i.e., NaN or None); otherwise NaN.

Examples

>>> import py_entitymatching as em
>>> em.needleman_wunsch('dva', 'deeva')
1.0
>>> em.needleman_wunsch('dva', None)
nan
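
Needleman-Wunsch scores the best global alignment of the two strings. The following textbook sketch assumes a match score of 1, a mismatch score of 0, and a gap cost of 1, values chosen because they reproduce the documented result; needleman_wunsch_sketch is a hypothetical name and missing-value handling is omitted.

```python
def needleman_wunsch_sketch(s1, s2, gap_cost=1.0):
    """Best global alignment score under the assumed scoring scheme."""
    rows, cols = len(s1) + 1, len(s2) + 1
    score = [[0.0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, rows):
        score[i][0] = -i * gap_cost
    for j in range(1, cols):
        score[0][j] = -j * gap_cost
    for i in range(1, rows):
        for j in range(1, cols):
            sim = 1.0 if s1[i - 1] == s2[j - 1] else 0.0
            score[i][j] = max(score[i - 1][j - 1] + sim,     # align both
                              score[i - 1][j] - gap_cost,    # gap in s2
                              score[i][j - 1] - gap_cost)    # gap in s1
    return score[-1][-1]
```

Aligning 'dva' with 'deeva' yields three matches and two gaps, so the score is 3 - 2 = 1.0.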

py_entitymatching.smith_waterman(s1, s2)

This function computes the Smith-Waterman measure between the two input strings.

Parameters

s1 (string) – The first input string.
s2 (string) – The second input string.

Returns

The Smith-Waterman measure if neither string is missing (i.e., NaN or None); otherwise NaN.

Examples

>>> import py_entitymatching as em
>>> em.smith_waterman('cat', 'hat')
2.0
>>> em.smith_waterman('cat', None)
nan
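
Smith-Waterman differs from a global alignment in that it scores the best local alignment, so poorly matching regions can be ignored. A textbook sketch follows, assuming a match score of 1, a mismatch score of 0, and a gap cost of 1; the name smith_waterman_sketch is hypothetical and missing-value handling is omitted.

```python
def smith_waterman_sketch(s1, s2, gap_cost=1.0):
    """Best local alignment score under the assumed scoring scheme."""
    rows, cols = len(s1) + 1, len(s2) + 1
    score = [[0.0] * cols for _ in range(rows)]
    best = 0.0
    for i in range(1, rows):
        for j in range(1, cols):
            sim = 1.0 if s1[i - 1] == s2[j - 1] else 0.0
            # Flooring at zero lets an alignment restart anywhere.
            score[i][j] = max(0.0,
                              score[i - 1][j - 1] + sim,
                              score[i - 1][j] - gap_cost,
                              score[i][j - 1] - gap_cost)
            best = max(best, score[i][j])
    return best
```

For 'cat' and 'hat' the best local alignment is the shared substring 'at', which scores 2.0.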

py_entitymatching.jaccard(arr1, arr2)

This function computes the Jaccard measure between the two input lists/sets.

Parameters

arr1 (list or set) – The first input list or set of tokens.
arr2 (list or set) – The second input list or set of tokens.

Returns

The Jaccard measure if neither input is None and neither contains missing tokens (i.e., NaN); otherwise NaN.

Examples

>>> import py_entitymatching as em
>>> em.jaccard(['data', 'science'], ['data'])
0.5
>>> em.jaccard(['data', 'science'], None)
nan
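
The Jaccard measure is the size of the intersection divided by the size of the union of the two token sets. A minimal sketch follows; jaccard_sketch is a hypothetical name, missing-value handling is omitted, and returning 1.0 for two empty inputs is an assumption.

```python
def jaccard_sketch(arr1, arr2):
    """|A intersect B| / |A union B| over the distinct tokens of each input."""
    set1, set2 = set(arr1), set(arr2)
    union = set1 | set2
    if not union:
        return 1.0  # assumption: two empty inputs are identical
    return len(set1 & set2) / len(union)
```

['data', 'science'] and ['data'] share one token out of two distinct tokens overall, giving 1/2 = 0.5.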

py_entitymatching.cosine(arr1, arr2)

This function computes the cosine measure between the two input lists/sets.

Parameters

arr1 (list or set) – The first input list or set of tokens.
arr2 (list or set) – The second input list or set of tokens.

Returns

The cosine measure if neither input is None and neither contains missing tokens (i.e., NaN); otherwise NaN.

Examples

>>> import py_entitymatching as em
>>> em.cosine(['data', 'science'], ['data'])
0.7071067811865475
>>> em.cosine(['data', 'science'], None)
nan
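
The documented value of 0.7071 is consistent with the set-based form of cosine similarity, where the intersection size is divided by the geometric mean of the two set sizes. The sketch below assumes that form; cosine_sketch is a hypothetical name, missing-value handling is omitted, and returning 0.0 when either input is empty is an assumption.

```python
import math

def cosine_sketch(arr1, arr2):
    """|A intersect B| / sqrt(|A| * |B|) over distinct tokens."""
    set1, set2 = set(arr1), set(arr2)
    if not set1 or not set2:
        return 0.0  # assumption: avoids division by zero
    return len(set1 & set2) / math.sqrt(len(set1) * len(set2))
```

For ['data', 'science'] and ['data'] this is 1 / sqrt(2 * 1), approximately 0.7071.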

py_entitymatching.overlap_coeff(arr1, arr2)

This function computes the overlap coefficient between the two input lists/sets.

Parameters

arr1 (list or set) – The first input list or set of tokens.
arr2 (list or set) – The second input list or set of tokens.

Returns

The overlap coefficient if neither input is None and neither contains missing tokens (i.e., NaN); otherwise NaN.

Examples

>>> import py_entitymatching as em
>>> em.overlap_coeff(['data', 'science'], ['data'])
1.0
>>> em.overlap_coeff(['data', 'science'], None)
nan
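
The overlap coefficient divides the intersection size by the size of the smaller set, so a set fully contained in the other scores 1.0. A minimal sketch follows; overlap_coeff_sketch is a hypothetical name, missing-value handling is omitted, and returning 0.0 when either input is empty is an assumption.

```python
def overlap_coeff_sketch(arr1, arr2):
    """|A intersect B| / min(|A|, |B|) over distinct tokens."""
    set1, set2 = set(arr1), set(arr2)
    if not set1 or not set2:
        return 0.0  # assumption: avoids division by zero
    return len(set1 & set2) / min(len(set1), len(set2))
```

Because ['data'] is a subset of ['data', 'science'], the result is 1/1 = 1.0.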

py_entitymatching.dice(arr1, arr2)

This function computes the Dice score between the two input lists/sets.

Parameters

arr1 (list or set) – The first input list or set of tokens.
arr2 (list or set) – The second input list or set of tokens.

Returns

The Dice score if neither input is None and neither contains missing tokens (i.e., NaN); otherwise NaN.

Examples

>>> import py_entitymatching as em
>>> em.dice(['data', 'science'], ['data'])
0.6666666666666666
>>> em.dice(['data', 'science'], None)
nan
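
The Dice score is twice the intersection size divided by the sum of the two set sizes. A minimal sketch follows; dice_sketch is a hypothetical name, missing-value handling is omitted, and returning 1.0 for two empty inputs is an assumption.

```python
def dice_sketch(arr1, arr2):
    """2 * |A intersect B| / (|A| + |B|) over distinct tokens."""
    set1, set2 = set(arr1), set(arr2)
    total = len(set1) + len(set2)
    if total == 0:
        return 1.0  # assumption: two empty inputs are identical
    return 2 * len(set1 & set2) / total
```

For ['data', 'science'] and ['data'] this is 2 * 1 / (2 + 1), approximately 0.6667.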

py_entitymatching.monge_elkan(arr1, arr2)

This function computes the Monge-Elkan measure between the two input lists/sets. Specifically, this function uses the Jaro-Winkler measure as the secondary function to compute the similarity score.

Parameters

arr1 (list or set) – The first input list or set of tokens.
arr2 (list or set) – The second input list or set of tokens.

Returns

The Monge-Elkan measure if neither input is None and neither contains missing tokens (i.e., NaN); otherwise NaN.

Examples

>>> import py_entitymatching as em
>>> em.monge_elkan(['Niall'], ['Neal'])
0.8049999999999999
>>> em.monge_elkan(['Niall'], None)
nan
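
Monge-Elkan averages, over the tokens of the first input, each token's best score against the tokens of the second input. The sketch below illustrates that structure with a pluggable secondary function. The names monge_elkan_sketch and exact_token_sim are hypothetical, and the toy exact-match secondary function is used only to keep the example short; it is not the Jaro-Winkler function the library uses, so the numbers differ from the documented example.

```python
def exact_token_sim(t1, t2):
    """Toy secondary function: 1.0 for identical tokens, else 0.0."""
    return 1.0 if t1 == t2 else 0.0

def monge_elkan_sketch(arr1, arr2, sim_func=exact_token_sim):
    """Mean over arr1 of the best secondary score against arr2."""
    if not arr1 or not arr2:
        return 0.0  # assumption: empty input has no similarity
    total = sum(max(sim_func(t1, t2) for t2 in arr2) for t1 in arr1)
    return total / len(arr1)
```

The measure is not symmetric: with the toy secondary function, monge_elkan_sketch(['data'], ['data', 'science']) is 1.0, while monge_elkan_sketch(['data', 'science'], ['data']) is 0.5.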

py_entitymatching.exact_match(d1, d2)

This function checks whether two objects match exactly. Typically the objects are strings, booleans, or ints.

Parameters

d1 (str, boolean, int) – The first object to compare.
d2 (str, boolean, int) – The second object to compare.

Returns

1 if the objects match exactly, otherwise 0. If either object is NaN or None, NaN is returned.

Examples

>>> import py_entitymatching as em
>>> em.exact_match('Niall', 'Neal')
0
>>> em.exact_match('Niall', 'Niall')
1
>>> em.exact_match(10, 10)
1
>>> em.exact_match(10, 20)
0
>>> em.exact_match(True, True)
1
>>> em.exact_match(False, True)
0
>>> em.exact_match(10, None)
nan
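
The three-way result described above (1, 0, or NaN) can be sketched in a few lines. The name exact_match_sketch is hypothetical, and comparing with == is an assumption about what "match exactly" means.

```python
import math

def exact_match_sketch(d1, d2):
    """Return 1 on equality, 0 otherwise, NaN if either input is missing."""
    for value in (d1, d2):
        if value is None or (isinstance(value, float) and math.isnan(value)):
            return math.nan
    return 1 if d1 == d2 else 0
```

Because a missing input yields NaN rather than 0, callers can tell "different" apart from "unknown".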

py_entitymatching.rel_diff(d1, d2)

This function computes the relative difference between two numbers.

Parameters

d1 (float) – The first input number.
d2 (float) – The second input number.

Returns

The relative difference between the input numbers as a float (if they are valid). If either input is NaN or None, NaN is returned.

Examples

>>> import py_entitymatching as em
>>> em.rel_diff(100, 200)
0.6666666666666666
>>> em.rel_diff(100, 100)
0.0
>>> em.rel_diff(100, None)
nan
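
The documented outputs are consistent with dividing the absolute difference by the mean of the two numbers, that is, 2 * |d1 - d2| / (d1 + d2). The sketch below assumes that formula; it is inferred from the examples, not confirmed by this page. The name rel_diff_sketch is hypothetical, missing-value handling is omitted, and returning 0.0 when both inputs are zero is an assumption.

```python
def rel_diff_sketch(d1, d2):
    """Absolute difference relative to the mean of the two numbers (assumed formula)."""
    d1, d2 = float(d1), float(d2)
    if d1 == 0.0 and d2 == 0.0:
        return 0.0  # assumption: avoids division by zero
    return 2 * abs(d1 - d2) / (d1 + d2)
```

For 100 and 200 this gives 2 * 100 / 300, approximately 0.6667, and for two equal numbers it gives 0.0.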

py_entitymatching.abs_norm(d1, d2)

This function computes the absolute norm similarity between two numbers.

Parameters

d1 (float) – The first input number.
d2 (float) – The second input number.

Returns

The absolute norm similarity between the input numbers as a float (if they are valid). If either input is NaN or None, NaN is returned.

Examples

>>> import py_entitymatching as em
>>> em.abs_norm(100, 200)
0.5
>>> em.abs_norm(100, 100)
1.0
>>> em.abs_norm(100, None)
nan
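
The documented outputs are consistent with the formula 1 - |d1 - d2| / max(d1, d2). The sketch below assumes that formula and positive inputs; it is inferred from the examples, not confirmed by this page. The name abs_norm_sketch is hypothetical and missing-value handling is omitted.

```python
def abs_norm_sketch(d1, d2):
    """1 minus the absolute difference scaled by the larger input (assumed formula)."""
    d1, d2 = float(d1), float(d2)
    # Assumption: both inputs are positive, so max(d1, d2) is nonzero.
    return 1.0 - abs(d1 - d2) / max(d1, d2)
```

For 100 and 200 this gives 1 - 100/200 = 0.5, and for two equal numbers it gives 1.0.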