Monge Elkan

class py_stringmatching.similarity_measure.monge_elkan.MongeElkan(sim_func=jaro_winkler_function)

Computes the Monge-Elkan measure.

The Monge-Elkan similarity measure is a hybrid similarity measure that combines the benefits of sequence-based and set-based methods. It can be effective in domains where more control over the similarity measure is needed. It implicitly uses a secondary similarity measure, such as Levenshtein, to compute the overall similarity score. See the string matching chapter in the DI book (Principles of Data Integration).

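The hybrid idea above can be illustrated with a minimal sketch: each token of the first bag is compared against every token of the second bag with the secondary measure, the best match per token is kept, and those best matches are averaged. This is not the library's source. The names monge_elkan_sketch and char_ratio are hypothetical, difflib's SequenceMatcher ratio stands in for Jaro-Winkler, and returning 0.0 for an empty bag is an assumption made for the sketch.

```python
from difflib import SequenceMatcher


def char_ratio(s1, s2):
    """Stand-in secondary similarity in [0, 1] (assumption: not Jaro-Winkler)."""
    return SequenceMatcher(None, s1, s2).ratio()


def monge_elkan_sketch(bag1, bag2, sim_func=char_ratio):
    """Average, over the tokens of bag1, of the best sim_func match in bag2."""
    if bag1 is None or bag2 is None:
        raise TypeError("inputs must not be None")
    if not isinstance(bag1, list) or not isinstance(bag2, list):
        raise TypeError("inputs must be lists")
    # Assumption for this sketch: an empty bag has no matches, so score 0.0.
    if not bag1 or not bag2:
        return 0.0
    # Sequence-based step: score token pairs with the secondary measure.
    # Set-based step: keep the best partner per token, then average.
    best_matches = [max(sim_func(t1, t2) for t2 in bag2) for t1 in bag1]
    return sum(best_matches) / len(bag1)
```

Because the average runs over the first bag only, any callable that takes two strings and returns a number can be passed as sim_func, and the result inherits that callable's scale.
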
Parameters: sim_func (function) – Secondary similarity function. This is expected to be a sequence-based similarity measure (defaults to the Jaro-Winkler similarity measure).
sim_func (function) – An attribute to store the secondary similarity function.

get_raw_score(bag1, bag2)

Computes the raw Monge-Elkan score between two bags (lists).

Parameters: bag1, bag2 (list) – Input lists.
Returns: Monge-Elkan similarity score (float).
Raises: TypeError – If the inputs are not lists or if one of the inputs is None.

Examples

>>> from py_stringmatching import MongeElkan, NeedlemanWunsch, Affine
>>> me = MongeElkan()
>>> me.get_raw_score(['Niall'], ['Neal'])
0.8049999999999999
>>> me.get_raw_score(['Niall'], ['Nigel'])
0.7866666666666667
>>> me.get_raw_score(['Comput.', 'Sci.', 'and', 'Eng.', 'Dept.,', 'University', 'of', 'California,', 'San', 'Diego'], ['Department', 'of', 'Computer', 'Science,', 'Univ.', 'Calif.,', 'San', 'Diego'])
0.8677218614718616
>>> me.get_raw_score([''], ['a'])
0.0
>>> me = MongeElkan(sim_func=NeedlemanWunsch().get_raw_score)
>>> me.get_raw_score(['Comput.', 'Sci.', 'and', 'Eng.', 'Dept.,', 'University', 'of', 'California,', 'San', 'Diego'], ['Department', 'of', 'Computer', 'Science,', 'Univ.', 'Calif.,', 'San', 'Diego'])
2.0
>>> me = MongeElkan(sim_func=Affine().get_raw_score)
>>> me.get_raw_score(['Comput.', 'Sci.', 'and', 'Eng.', 'Dept.,', 'University', 'of', 'California,', 'San', 'Diego'], ['Department', 'of', 'Computer', 'Science,', 'Univ.', 'Calif.,', 'San', 'Diego'])
2.25

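Two properties visible in the examples above may be easier to see with toy secondary measures: the score depends on the order of the arguments, and its range follows the secondary function, so an unnormalized secondary measure can produce values above 1. The sketch below is self-contained and does not use the library. The names best_match_average, exact, and shared_prefix_len are hypothetical and chosen only so the arithmetic is easy to check by hand.

```python
def best_match_average(bag1, bag2, sim_func):
    """Mean over bag1 of each token's best sim_func score against bag2."""
    return sum(max(sim_func(a, b) for b in bag2) for a in bag1) / len(bag1)


def exact(a, b):
    """Normalized toy measure: 1.0 for identical tokens, else 0.0."""
    return 1.0 if a == b else 0.0


def shared_prefix_len(a, b):
    """Unnormalized toy measure: length of the common prefix."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return float(n)


short = ['san', 'diego']
long_ = ['univ', 'of', 'san', 'diego']

# Every token of `short` has an exact partner in `long_`: (1 + 1) / 2.
forward = best_match_average(short, long_, exact)
# Only two of the four tokens of `long_` have a partner: (0 + 0 + 1 + 1) / 4.
backward = best_match_average(long_, short, exact)
# With an unnormalized measure the average is (3 + 5) / 2, well above 1.
unbounded = best_match_average(short, long_, shared_prefix_len)

print(forward, backward, unbounded)  # 1.0 0.5 4.0
```

When scores need to be comparable across inputs, a normalized secondary function is therefore the safer choice.
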
References

  • Principles of Data Integration book

get_sim_func()

Get the secondary similarity function.

Returns: secondary similarity function (function).

set_sim_func(sim_func)

Set the secondary similarity function.

Parameters: sim_func (function) – Secondary similarity function.