Smith-Waterman

class py_stringmatching.similarity_measure.smith_waterman.SmithWaterman(gap_cost=1.0, sim_func=identity_function)

Computes the Smith-Waterman measure.

The Smith-Waterman algorithm performs local sequence alignment; that is, it determines similar regions between two strings. Instead of looking at the total sequence, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure. See the string matching chapter in the DI book (Principles of Data Integration).
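The local-alignment idea described above can be sketched with the standard dynamic-programming recurrence. This is a minimal illustration under stated assumptions, not necessarily the library's implementation: the function name smith_waterman_score and its internals are hypothetical, a single linear gap cost is assumed, and the default similarity mirrors the documented identity function.

```python
def smith_waterman_score(string1, string2, gap_cost=1.0, sim_func=None):
    """Illustrative sketch (hypothetical helper, not the library's code).

    Returns the best local-alignment score between two strings using a
    linear gap cost and a per-character similarity function.
    """
    if sim_func is None:
        # Assumed default: 1 if the characters are equal, 0 otherwise.
        sim_func = lambda c1, c2: 1.0 if c1 == c2 else 0.0

    best = 0.0
    # prev holds the previous row of the score matrix; row 0 is all zeros.
    prev = [0.0] * (len(string2) + 1)
    for ch1 in string1:
        curr = [0.0]  # column 0 is always zero
        for j, ch2 in enumerate(string2, start=1):
            match = prev[j - 1] + sim_func(ch1, ch2)  # align ch1 with ch2
            delete = prev[j] - gap_cost               # gap in string2
            insert = curr[j - 1] - gap_cost           # gap in string1
            # Clamping at zero lets an alignment restart anywhere,
            # which is what makes the alignment local.
            cell = max(0.0, match, delete, insert)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(smith_waterman_score("cat", "hat"))  # 2.0
```

Keeping only two rows of the matrix is enough here because the sketch returns the score alone and does not trace back the aligned segments.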

Parameters:
  • gap_cost (float) – Cost of gap (defaults to 1.0).
  • sim_func (function) – Similarity function to give a score for the correspondence between the characters (defaults to an identity function, which returns 1 if the two characters are the same and 0 otherwise).
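As a small illustration of the sim_func parameter, a custom similarity function only needs to take two characters and return a numeric score. The helper below, case_insensitive_sim, is a hypothetical example and not part of the library.

```python
def case_insensitive_sim(char1, char2):
    """Hypothetical similarity function.

    Returns 1.0 when the two characters match ignoring case, 0.0 otherwise.
    """
    return 1.0 if char1.lower() == char2.lower() else 0.0


print(case_insensitive_sim("A", "a"))  # 1.0
print(case_insensitive_sim("a", "b"))  # 0.0
```

A function like this could then be supplied at construction time, for example SmithWaterman(sim_func=case_insensitive_sim).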
gap_cost

An attribute to store the gap cost.

Type: float
sim_func

An attribute to store the similarity function.

Type: function
get_gap_cost()

Get gap cost.

Returns: Gap cost (float).
get_raw_score(string1, string2)

Computes the raw Smith-Waterman score between two strings.

Parameters: string1, string2 (str) – Input strings.
Returns: Smith-Waterman similarity score (float).
Raises: TypeError – If the inputs are not strings or if one of the inputs is None.

Examples

>>> sw = SmithWaterman()
>>> sw.get_raw_score('cat', 'hat')
2.0
>>> sw = SmithWaterman(gap_cost=2.2)
>>> sw.get_raw_score('dva', 'deeve')
1.0
>>> sw = SmithWaterman(gap_cost=1, sim_func=lambda s1, s2 : (2 if s1 == s2 else -1))
>>> sw.get_raw_score('dva', 'deeve')
2.0
>>> sw = SmithWaterman(gap_cost=1.4, sim_func=lambda s1, s2 : (1.5 if s1 == s2 else 0.5))
>>> sw.get_raw_score('GCATAGCU', 'GATTACA')
6.5
get_sim_func()

Get similarity function.

Returns: Similarity function (function).
set_gap_cost(gap_cost)

Set gap cost.

Parameters: gap_cost (float) – Cost of gap.
set_sim_func(sim_func)

Set similarity function.

Parameters: sim_func (function) – Similarity function to give a score for the correspondence between the characters.