Soft TF/IDF

class py_stringmatching.similarity_measure.soft_tfidf.SoftTfIdf(corpus_list=None, sim_func=jaro_function, threshold=0.5)

Computes the soft TF/IDF measure.

Note

Currently, this measure is implemented without dampening, which is similar to setting the dampen flag to False in TF-IDF. We plan to add the dampen flag in the next release.

Parameters:
  • corpus_list (list of lists) – Corpus, given as a list of lists of strings (defaults to None). If None, the two input lists are treated as the entire corpus.
  • sim_func (function) – Secondary similarity function (optional). It should take two strings and return a similarity score; the default is the Jaro similarity measure.
  • threshold (float) – Threshold value for the secondary similarity function (defaults to 0.5). A token pair is considered a match only if its similarity exceeds the threshold.

Attributes:
  • sim_func (function) – An attribute to store the secondary similarity function.
  • threshold (float) – An attribute to store the threshold value for the secondary similarity function.
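
The threshold rule described for the threshold parameter can be illustrated with a small standalone sketch. It is not the library's code: close_pairs and ratio are hypothetical names, difflib's SequenceMatcher ratio stands in for the Jaro default, and keeping only the best-scoring partner per token is an assumption.

```python
from difflib import SequenceMatcher


def ratio(s, t):
    # Stand-in secondary similarity in [0, 1]; the class defaults to Jaro instead.
    return SequenceMatcher(None, s, t).ratio()


def close_pairs(bag1, bag2, sim_func=ratio, threshold=0.5):
    """Map each token of bag1 to its best partner in bag2 above the threshold."""
    pairs = {}
    for x in set(bag1):
        for y in set(bag2):
            score = sim_func(x, y)
            # Strictly greater than the threshold, and better than any earlier partner.
            if score > threshold and score > pairs.get(x, (None, 0.0))[1]:
                pairs[x] = (y, score)
    return pairs


print(close_pairs(['apple', 'banana'], ['appel', 'cherry'], threshold=0.7))
# {'apple': ('appel', 0.8)}
```

Raising the threshold makes the measure stricter: at threshold=0.8 the pair above would no longer count as a match, because the score must exceed the threshold rather than equal it.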

get_corpus_list()

Get corpus list.

Returns: Corpus list (list of lists).

get_raw_score(bag1, bag2)

Computes the raw soft TF/IDF score between two lists given the corpus information.

Parameters: bag1, bag2 (list) – Input lists.
Returns: Soft TF/IDF score between the input lists (float).
Raises: TypeError – If the inputs are not lists or if one of the inputs is None.

Examples

>>> from py_stringmatching import SoftTfIdf, Jaro, Affine
>>> soft_tfidf = SoftTfIdf([['a', 'b', 'a'], ['a', 'c'], ['a']], sim_func=Jaro().get_raw_score, threshold=0.8)
>>> soft_tfidf.get_raw_score(['a', 'b', 'a'], ['a', 'c'])
0.17541160386140586
>>> soft_tfidf = SoftTfIdf([['a', 'b', 'a'], ['a', 'c'], ['a']], threshold=0.9)
>>> soft_tfidf.get_raw_score(['a', 'b', 'a'], ['a'])
0.5547001962252291
>>> soft_tfidf = SoftTfIdf([['x', 'y'], ['w'], ['q']])
>>> soft_tfidf.get_raw_score(['a', 'b', 'a'], ['a'])
0.0
>>> soft_tfidf = SoftTfIdf(sim_func=Affine().get_raw_score, threshold=0.6)
>>> soft_tfidf.get_raw_score(['aa', 'bb', 'a'], ['ab', 'ba'])
0.81649658092772592
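
To make the numbers above easier to follow, here is a minimal standalone sketch of an undampened soft TF/IDF computation. It is an illustration under stated assumptions, not the library's implementation: soft_tfidf_sketch and exact are hypothetical names, each token weight is assumed to be tf * N / df with no logarithm, tokens absent from the corpus are assumed to be skipped, and an exact-match function stands in for Jaro (for single-character tokens the two agree: 1.0 for identical tokens, 0.0 otherwise).

```python
from collections import Counter
from math import sqrt


def soft_tfidf_sketch(bag1, bag2, corpus=None, sim_func=None, threshold=0.5):
    """Illustrative, undampened soft TF/IDF; not the library's own code."""
    if not bag1 or not bag2:
        return 0.0
    if corpus is None:
        corpus = [bag1, bag2]  # the two inputs act as the whole corpus
    tf_x, tf_y = Counter(bag1), Counter(bag2)
    # Document frequency: number of corpus documents containing each token.
    df = Counter(tok for doc in corpus for tok in set(doc))
    n_docs = len(corpus)

    def weight(tf, tok):
        # Assumed undampened weight: raw term frequency times raw N / df.
        return tf[tok] * n_docs / df[tok]

    # For each token of bag1, keep its best partner in bag2 above the threshold.
    best = {}
    for x in tf_x:
        for y in tf_y:
            score = sim_func(x, y)
            if score > threshold and score > best.get(x, (None, 0.0))[1]:
                best[x] = (y, score)

    numerator = norm_x = norm_y = 0.0
    for tok in set(tf_x) | set(tf_y):
        if tok not in df:
            continue  # assumption: tokens unseen in the corpus carry no weight
        if tok in best and best[tok][0] in df:
            partner, score = best[tok]
            numerator += weight(tf_x, tok) * weight(tf_y, partner) * score
        norm_x += weight(tf_x, tok) ** 2
        norm_y += weight(tf_y, tok) ** 2
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return numerator / (sqrt(norm_x) * sqrt(norm_y))


def exact(s, t):
    # Stand-in secondary similarity: 1.0 for identical tokens, else 0.0.
    return 1.0 if s == t else 0.0


corpus = [['a', 'b', 'a'], ['a', 'c'], ['a']]
print(round(soft_tfidf_sketch(['a', 'b', 'a'], ['a', 'c'], corpus, exact, 0.8), 4))  # 0.1754
print(round(soft_tfidf_sketch(['a', 'b', 'a'], ['a'], corpus, exact, 0.9), 4))       # 0.5547
```

Under these assumptions the first call reduces to 2 / (sqrt(13) * sqrt(10)): only the token 'a' finds a partner, while 'b' and 'c' contribute to the norms alone. With the corpus [['x', 'y'], ['w'], ['q']] every input token is skipped, so the sketch returns 0.0, in line with the third example.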

References

  • The string matching chapter of the book “Principles of Data Integration” (Doan, Halevy, and Ives).

get_sim_func()

Get secondary similarity function.

Returns: Secondary similarity function (function).

get_threshold()

Get threshold used for the secondary similarity function.

Returns: Threshold (float).

set_corpus_list(corpus_list)

Set corpus list.

Parameters: corpus_list (list of lists) – Corpus list.

set_sim_func(sim_func)

Set secondary similarity function.

Parameters: sim_func (function) – Secondary similarity function.

set_threshold(threshold)

Set threshold value for the secondary similarity function.

Parameters: threshold (float) – Threshold value.