Smith-Waterman

class py_stringmatching.similarity_measure.smith_waterman.SmithWaterman(gap_cost=1.0, sim_func=identity_function)

Computes the Smith-Waterman measure.

The Smith-Waterman algorithm performs local sequence alignment; that is, it identifies similar regions between two strings. Instead of aligning the two sequences in their entirety, the algorithm compares segments of all possible lengths and returns the score of the best-matching pair of segments. See the string matching chapter in the DI book (Principles of Data Integration).
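
The local-alignment idea described above can be sketched with the standard Smith-Waterman dynamic-programming recurrence. This is a minimal illustration, not necessarily how the library implements it: the function name smith_waterman_score is hypothetical, and a linear gap penalty (each gap subtracts gap_cost) is assumed.

```python
def smith_waterman_score(s1, s2, gap_cost=1.0, sim_func=None):
    """Return the best local-alignment score between s1 and s2.

    Hypothetical helper for illustration; assumes a linear gap penalty.
    """
    if sim_func is None:
        # Identity scoring: 1 for equal characters, 0 otherwise.
        sim_func = lambda a, b: 1.0 if a == b else 0.0

    # Only the previous row of the score table is needed at any time.
    prev = [0.0] * (len(s2) + 1)
    best = 0.0
    for i in range(1, len(s1) + 1):
        curr = [0.0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            # Extend the alignment diagonally with a (mis)match.
            match = prev[j - 1] + sim_func(s1[i - 1], s2[j - 1])
            # Skip a character of s1 or of s2, paying the gap cost.
            delete = prev[j] - gap_cost
            insert = curr[j - 1] - gap_cost
            # Clamping at 0 lets a new local alignment start anywhere.
            curr[j] = max(0.0, match, delete, insert)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score('cat', 'hat'))  # 2.0 ('at' aligns with 'at')
```

The clamp at zero is what makes the alignment local: any prefix that would drag the score below zero is discarded, so the returned value is the score of the best-matching pair of segments.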

Parameters
  • gap_cost (float) – Cost of a gap (defaults to 1.0).

  • sim_func (function) – Similarity function to give a score for the correspondence between the characters (defaults to an identity function, which returns 1 if the two characters are the same and 0 otherwise).
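
As described for sim_func above, a similarity function takes two characters and returns a numeric score. The scorers below are hypothetical examples (neither name is part of the library), sketched to show the expected shape of such a function.

```python
def case_insensitive_sim(c1, c2):
    """Hypothetical scorer: 1.0 if the characters match ignoring case, else 0.0."""
    return 1.0 if c1.lower() == c2.lower() else 0.0


def match_mismatch_sim(c1, c2):
    """Hypothetical scorer: reward a match with 2.0, penalize a mismatch with -1.0."""
    return 2.0 if c1 == c2 else -1.0


print(case_insensitive_sim('A', 'a'))  # 1.0
print(match_mismatch_sim('a', 'b'))    # -1.0

# Either function could then be passed to the constructor, for example:
#   sw = SmithWaterman(gap_cost=1.0, sim_func=case_insensitive_sim)
```

A scorer that returns negative values for mismatches makes alignments break off sooner, while one that never goes below zero tends to produce longer aligned regions.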

gap_cost

An attribute to store the gap cost.

Type

float

sim_func

An attribute to store the similarity function.

Type

function

get_gap_cost()

Get gap cost.

Returns

Gap cost (float).

get_raw_score(string1, string2)

Computes the raw Smith-Waterman score between two strings.

Parameters
  • string1 (str) – First input string.

  • string2 (str) – Second input string.

Returns

Smith-Waterman similarity score (float).

Raises

TypeError – If the inputs are not strings or if one of the inputs is None.

Examples

>>> sw = SmithWaterman()
>>> sw.get_raw_score('cat', 'hat')
2.0
>>> sw = SmithWaterman(gap_cost=2.2)
>>> sw.get_raw_score('dva', 'deeve')
1.0
>>> sw = SmithWaterman(gap_cost=1, sim_func=lambda s1, s2 : (2 if s1 == s2 else -1))
>>> sw.get_raw_score('dva', 'deeve')
2.0
>>> sw = SmithWaterman(gap_cost=1.4, sim_func=lambda s1, s2 : (1.5 if s1 == s2 else 0.5))
>>> sw.get_raw_score('GCATAGCU', 'GATTACA')
6.5

get_sim_func()

Get similarity function.

Returns

Similarity function (function).

set_gap_cost(gap_cost)

Set gap cost.

Parameters

gap_cost (float) – Cost of a gap.

set_sim_func(sim_func)

Set similarity function.

Parameters

sim_func (function) – Similarity function to give a score for the correspondence between the characters.