Monge Elkan

class py_stringmatching.similarity_measure.monge_elkan.MongeElkan(sim_func=jaro_winkler_function)

Computes the Monge-Elkan measure.

The Monge-Elkan similarity measure is a hybrid similarity measure that combines the benefits of sequence-based and set-based methods. This can be effective for domains in which more control is needed over the similarity measure. It relies on a secondary similarity measure, such as Levenshtein, to compute the overall similarity score. See the string matching chapter in the DI book (Principles of Data Integration).
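For orientation, the measure is commonly written as follows. The symbols are notation chosen here, not names from the library: A and B are the two token bags, and sim' is the secondary similarity function.

```latex
\mathrm{ME}(A, B) = \frac{1}{|A|} \sum_{i=1}^{|A|} \max_{j = 1, \dots, |B|} \mathrm{sim}'(A_i, B_j)
```

Each token of A is matched to its best-scoring token in B, and the best scores are averaged over A. Under this definition the result is generally not symmetric in A and B, and its range follows the range of sim'.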

Parameters

sim_func (function) – Secondary similarity function. This is expected to be a sequence-based similarity measure (defaults to Jaro-Winkler similarity measure).

sim_func

An attribute to store the secondary similarity function.

Type

function

get_raw_score(bag1, bag2)

Computes the raw Monge-Elkan score between two bags (lists).

Parameters
  • bag1 (list) – First input list.

  • bag2 (list) – Second input list.

Returns

Monge-Elkan similarity score (float).

Raises

TypeError – If the inputs are not lists or if one of the inputs is None.

Examples

>>> me = MongeElkan()
>>> me.get_raw_score(['Niall'], ['Neal'])
0.8049999999999999
>>> me.get_raw_score(['Niall'], ['Nigel'])
0.7866666666666667
>>> me.get_raw_score(['Comput.', 'Sci.', 'and', 'Eng.', 'Dept.,', 'University', 'of', 'California,', 'San', 'Diego'], ['Department', 'of', 'Computer', 'Science,', 'Univ.', 'Calif.,', 'San', 'Diego'])
0.8677218614718616
>>> me.get_raw_score([''], ['a'])
0.0
>>> me = MongeElkan(sim_func=NeedlemanWunsch().get_raw_score)
>>> me.get_raw_score(['Comput.', 'Sci.', 'and', 'Eng.', 'Dept.,', 'University', 'of', 'California,', 'San', 'Diego'], ['Department', 'of', 'Computer', 'Science,', 'Univ.', 'Calif.,', 'San', 'Diego'])
2.0
>>> me = MongeElkan(sim_func=Affine().get_raw_score)
>>> me.get_raw_score(['Comput.', 'Sci.', 'and', 'Eng.', 'Dept.,', 'University', 'of', 'California,', 'San', 'Diego'], ['Department', 'of', 'Computer', 'Science,', 'Univ.', 'Calif.,', 'San', 'Diego'])
2.25
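The scores above can be understood through a minimal pure-Python sketch of the usual Monge-Elkan definition: for each token in the first bag, take the best secondary score against the second bag, then average over the first bag. This is an illustration under stated assumptions, not the library's source. The names monge_elkan_sketch, ratio, and exact are hypothetical, difflib's ratio stands in for the Jaro-Winkler default, and returning 0.0 for an empty bag is an assumption.

```python
from difflib import SequenceMatcher


def ratio(s1, s2):
    """Stand-in secondary measure (assumption: not the library's Jaro-Winkler)."""
    return SequenceMatcher(None, s1, s2).ratio()


def exact(s1, s2):
    """Toy secondary measure: 1.0 for identical tokens, 0.0 otherwise."""
    return 1.0 if s1 == s2 else 0.0


def monge_elkan_sketch(bag1, bag2, sim_func=ratio):
    """Average, over bag1, of the best secondary score against bag2."""
    if not isinstance(bag1, list) or not isinstance(bag2, list):
        # Covers None as well as non-list inputs.
        raise TypeError("both inputs must be lists")
    if not bag1 or not bag2:
        # Assumption: an empty bag yields a score of 0.0.
        return 0.0
    total = 0.0
    for token1 in bag1:
        # Best match for this token among all tokens of bag2.
        total += max(sim_func(token1, token2) for token2 in bag2)
    return total / len(bag1)


# The average runs over the first bag only, so argument order matters.
print(monge_elkan_sketch(['a', 'b'], ['a'], sim_func=exact))  # 0.5
print(monge_elkan_sketch(['a'], ['a', 'b'], sim_func=exact))  # 1.0
```

Because the result is an average of secondary scores, it inherits their scale. That is consistent with the NeedlemanWunsch and Affine examples above returning values greater than 1.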

References

  • Principles of Data Integration book

get_sim_func()

Get the secondary similarity function.

Returns

Secondary similarity function (function).

set_sim_func(sim_func)

Set the secondary similarity function.

Parameters

sim_func (function) – Secondary similarity function.