Editex

Editex distance measure

class py_stringmatching.similarity_measure.editex.Editex(match_cost=0, group_cost=1, mismatch_cost=2, local=False)

Editex distance measure class.

Parameters:
  • match_cost (int) – Cost charged when two characters are identical (default: 0).
  • group_cost (int) – Cost charged when two different characters belong to the same Editex letter group (default: 1).
  • mismatch_cost (int) – Cost charged when two characters differ and share no letter group (default: 2).
  • local (boolean) – Whether to use the local variant of the algorithm (default: False).

get_group_cost()

Get the group cost.

Returns: group cost (int)

get_local()

Get the local flag.

Returns: local flag (boolean)

get_match_cost()

Get the match cost.

Returns: match cost (int)

get_mismatch_cost()

Get the mismatch cost.

Returns: mismatch cost (int)

get_raw_score(string1, string2)

Computes the Editex distance between two strings.

The algorithm is described on pages 3 and 4 of Zobel, Justin and Philip Dart. 1996. Phonetic string matching: Lessons from information retrieval. In: Proceedings of the ACM-SIGIR Conference on Research and Development in Information Retrieval, Zurich, Switzerland. 166–173. http://goanna.cs.rmit.edu.au/~jz/fulltext/sigir96.pdf

The local variant is based on Ring, Nicholas and Alexandra L. Uitdenbogerd. 2009. Finding ‘Lucy in Disguise’: The Misheard Lyric Matching Problem. In: Proceedings of the 5th Asia Information Retrieval Symposium, Sapporo, Japan. 157–167. http://www.seg.rmit.edu.au/research/download.php?manuscript=404

Parameters: string1, string2 (str) – Input strings
Returns: Editex distance (int)
Raises: TypeError – If the inputs are not strings

Examples

>>> ed = Editex()
>>> ed.get_raw_score('cat', 'hat')
2
>>> ed.get_raw_score('Niall', 'Neil')
2
>>> ed.get_raw_score('aluminum', 'Catalan')
12
>>> ed.get_raw_score('ATCG', 'TAGC')
6
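
To make the recurrence from Zobel and Dart concrete, here is a minimal pure-Python sketch of the global (non-local) Editex distance. It is an illustration under stated assumptions, not the library's implementation: the name `editex_sketch` is hypothetical, the letter groups are the ones listed in the paper, and a leading space is assumed to stand for "no previous character". Results for some inputs may differ from `get_raw_score`.

```python
# Letter groups from Zobel and Dart (1996); a letter may sit in several groups.
GROUPS = ("AEIOUY", "BP", "CKQ", "DT", "LR", "MN", "GJ", "FPV", "SXZ", "CSZ")


def editex_sketch(s, t, match_cost=0, group_cost=1, mismatch_cost=2):
    """Global Editex distance by dynamic programming (illustrative only)."""

    def r(a, b):
        # Substitution cost: identical, same letter group, or unrelated.
        if a == b:
            return match_cost
        if any(a in g and b in g for g in GROUPS):
            return group_cost
        return mismatch_cost

    def d(a, b):
        # Insertion/deletion cost: like r, except that an H or W followed
        # by a different character is charged only the group cost.
        if a != b and a in "HW":
            return group_cost
        return r(a, b)

    # Assumption: a leading space marks the position before the first letter.
    s, t = " " + s.upper(), " " + t.upper()
    rows, cols = len(s), len(t)
    dist = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + d(s[i - 1], s[i])
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + d(t[j - 1], t[j])
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + d(s[i - 1], s[i]),      # delete from s
                dist[i][j - 1] + d(t[j - 1], t[j]),      # insert from t
                dist[i - 1][j - 1] + r(s[i], t[j]),      # substitute
            )
    return dist[-1][-1]


print(editex_sketch("cat", "hat"))  # 2: C and H share no group, A and T match
```

The three cost parameters map directly onto the constructor arguments of the same names, which is why raising `group_cost` toward `mismatch_cost` makes the measure behave more like a plain weighted edit distance.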

get_sim_score(string1, string2)

Computes the normalized Editex similarity between two strings. Higher values indicate greater similarity.

Parameters: string1, string2 (str) – Input strings
Returns: Normalized Editex similarity (float)
Raises: TypeError – If the inputs are not strings

Examples

>>> ed = Editex()
>>> ed.get_sim_score('cat', 'hat')
0.66666666666666674
>>> ed.get_sim_score('Niall', 'Neil')
0.80000000000000004
>>> ed.get_sim_score('aluminum', 'Catalan')
0.25
>>> ed.get_sim_score('ATCG', 'TAGC')
0.25
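
The four similarity values above are consistent with dividing the raw distance by the worst-case cost (a mismatch at every position of the longer string) and subtracting the result from 1. The sketch below checks that arithmetic against the documented raw scores. The function name `normalized_similarity` is hypothetical, and the formula is an inference from the examples rather than a statement of how the library computes the score.

```python
def normalized_similarity(raw_score, len1, len2, mismatch_cost=2):
    """Assumed normalization: 1 - raw / (longer length * mismatch cost)."""
    if len1 == 0 and len2 == 0:
        # Assumption: two empty strings are treated as identical.
        return 1.0
    return 1 - raw_score / (max(len1, len2) * mismatch_cost)


# Raw scores and string lengths taken from the get_raw_score examples.
examples = [
    ("cat", "hat", 2),
    ("Niall", "Neil", 2),
    ("aluminum", "Catalan", 12),
    ("ATCG", "TAGC", 6),
]
for a, b, raw in examples:
    print(a, b, normalized_similarity(raw, len(a), len(b)))
```

Under this assumption the default `mismatch_cost` of 2 gives 1 - 2/6 for 'cat' and 'hat' and 1 - 12/16 for 'aluminum' and 'Catalan', matching the values shown. Note that the sketch would divide by zero if `mismatch_cost` were 0.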

set_group_cost(group_cost)

Set the group cost.

Parameters: group_cost (int) – Cost charged when two different characters belong to the same Editex letter group

set_local(local)

Set the local flag.

Parameters: local (boolean) – Whether to use the local variant of the algorithm

set_match_cost(match_cost)

Set the match cost.

Parameters: match_cost (int) – Cost charged when two characters are identical

set_mismatch_cost(mismatch_cost)

Set the mismatch cost.

Parameters: mismatch_cost (int) – Cost charged when two characters differ and share no letter group