Editex

Editex distance measure

class py_stringmatching.similarity_measure.editex.Editex(match_cost=0, group_cost=1, mismatch_cost=2, local=False)

Editex distance measure class.

Parameters
  • match_cost (int) – Cost assigned when the two characters are identical, default=0

  • group_cost (int) – Cost assigned when the two characters differ but belong to the same Editex letter group, default=1

  • mismatch_cost (int) – Cost assigned when the two characters differ and share no letter group, default=2

  • local (boolean) – Whether to use the local variant, default=False

get_group_cost()

Get group cost

Returns

group cost (int)

get_local()

Get local flag

Returns

local flag (boolean)

get_match_cost()

Get match cost

Returns

match cost (int)

get_mismatch_cost()

Get mismatch cost

Returns

mismatch cost (int)

get_raw_score(string1, string2)

Computes the editex distance between two strings.

The algorithm is described on pages 3 and 4 of Zobel, Justin and Philip Dart. 1996. Phonetic string matching: Lessons from information retrieval. In: Proceedings of the ACM-SIGIR Conference on Research and Development in Information Retrieval, Zurich, Switzerland. 166–173. http://goanna.cs.rmit.edu.au/~jz/fulltext/sigir96.pdf

The local variant is based on Ring, Nicholas and Alexandra L. Uitdenbogerd. 2009. Finding ‘Lucy in Disguise’: The Misheard Lyric Matching Problem. In: Proceedings of the 5th Asia Information Retrieval Symposium, Sapporo, Japan. 157–167. http://www.seg.rmit.edu.au/research/download.php?manuscript=404

Parameters
  • string1 (str) – First input string

  • string2 (str) – Second input string

Returns

Editex distance (int)

Raises

TypeError – If the inputs are not strings

Examples

>>> ed = Editex()
>>> ed.get_raw_score('cat', 'hat')
2
>>> ed.get_raw_score('Niall', 'Neil')
2
>>> ed.get_raw_score('aluminum', 'Catalan')
12
>>> ed.get_raw_score('ATCG', 'TAGC')
6

get_sim_score(string1, string2)

Computes the normalized editex similarity between two strings.

Parameters
  • string1 (str) – First input string

  • string2 (str) – Second input string

Returns

Normalized editex similarity (float)

Raises

TypeError – If the inputs are not strings

Examples

>>> ed = Editex()
>>> ed.get_sim_score('cat', 'hat')
0.66666666666666674
>>> ed.get_sim_score('Niall', 'Neil')
0.80000000000000004
>>> ed.get_sim_score('aluminum', 'Catalan')
0.25
>>> ed.get_sim_score('ATCG', 'TAGC')
0.25
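
The documented values are consistent with subtracting from 1 the raw distance divided by the longer string length times mismatch_cost. This normalization is inferred from the examples rather than stated on this page, so the helper below (normalized_similarity, a hypothetical name that is not part of the library) should be read as an assumption.

```python
def normalized_similarity(raw_score, len1, len2, mismatch_cost=2):
    """Map a raw Editex distance to a similarity in [0, 1] (assumed formula)."""
    # Two empty strings are treated as identical (an assumption).
    if len1 == 0 and len2 == 0:
        return 1.0
    # The largest possible distance is taken to be every position of the
    # longer string costing mismatch_cost.
    return 1.0 - raw_score / (max(len1, len2) * mismatch_cost)


# Raw distances taken from the get_raw_score examples.
print(normalized_similarity(2, len('cat'), len('hat')))
print(normalized_similarity(12, len('aluminum'), len('Catalan')))
```

For ('aluminum', 'Catalan') this is 1 - 12 / (8 * 2) = 0.25, which matches the value shown above.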

set_group_cost(group_cost)

Set the group cost.

Parameters

group_cost (int) – Cost assigned when the two characters differ but belong to the same Editex letter group

set_local(local)

Set the local flag.

Parameters

local (boolean) – Whether to use the local variant

set_match_cost(match_cost)

Set the match cost.

Parameters

match_cost (int) – Cost assigned when the two characters are identical

set_mismatch_cost(mismatch_cost)

Set the mismatch cost.

Parameters

mismatch_cost (int) – Cost assigned when the two characters differ and share no letter group