Soft TF/IDF

class py_stringmatching.similarity_measure.soft_tfidf.SoftTfIdf(corpus_list=None, sim_func=jaro_function, threshold=0.5)[source]

Computes soft TF/IDF measure.

Parameters:
  • corpus_list (list of lists) – Corpus, given as a list of token lists (defaults to None). If None, the two input lists are treated as the only corpus.
  • sim_func (function) – Secondary similarity function, which should return a similarity score between two strings (optional; defaults to the Jaro similarity measure).
  • threshold (float) – Threshold value for the secondary similarity function (defaults to 0.5). If the similarity of a token pair exceeds the threshold, then the token pair is considered a match.
sim_func

function

An attribute to store the secondary similarity function.

threshold

float

An attribute to store the threshold value for the secondary similarity function.

Note

Currently, this measure is implemented without dampening, which is similar to setting the dampen flag to False in TF-IDF. We plan to add a dampen flag in the next release.
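
To make the effect of dampening concrete, the sketch below compares the undampened term weight tf * idf with a log-dampened variant. The dampened form log(tf + 1) * log(N / df) and both helper names are assumptions chosen for illustration; the note does not say which formula a future dampen flag would use.

```python
import math

def raw_weight(tf, corpus_size, df):
    """Undampened weight: term frequency times inverse document frequency."""
    return tf * (corpus_size / df)

def dampened_weight(tf, corpus_size, df):
    """Log-dampened weight (assumed form): log(tf + 1) * log(N / df)."""
    return math.log(tf + 1) * math.log(corpus_size / df)

# A token that occurs 4 times in a bag and in 2 of 100 corpus documents.
print(raw_weight(4, 100, 2))       # 200.0
print(dampened_weight(4, 100, 2))  # much smaller than the raw weight
```

Without dampening, frequent or rare tokens can dominate the score, which is worth keeping in mind when choosing a threshold.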

get_corpus_list()[source]

Get corpus list.

Returns: Corpus list (list of lists).
get_raw_score(bag1, bag2)[source]

Computes the raw soft TF/IDF score between two lists given the corpus information.

Parameters: bag1, bag2 (list) – Input lists.
Returns: Soft TF/IDF score between the input lists (float).
Raises: TypeError – If the inputs are not lists or if one of the inputs is None.
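
Since get_raw_score expects token lists and raises TypeError otherwise, a caller-side guard along the following lines can normalize inputs before scoring. Note that to_bag is a hypothetical helper, not part of py_stringmatching, and whitespace splitting is only one possible tokenization.

```python
def to_bag(value):
    """Return a list of tokens, or raise TypeError for unsupported input."""
    if value is None:
        raise TypeError("input must not be None")
    if isinstance(value, str):
        # Assumption: whitespace tokenization is adequate for the data at hand.
        return value.split()
    if isinstance(value, list):
        return value
    raise TypeError("expected a list of tokens or a string, got %s"
                    % type(value).__name__)

print(to_bag("new york city"))  # ['new', 'york', 'city']
print(to_bag(['a', 'b', 'a']))  # ['a', 'b', 'a']
```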

Examples

>>> from py_stringmatching import Affine, Jaro, SoftTfIdf
>>> soft_tfidf = SoftTfIdf([['a', 'b', 'a'], ['a', 'c'], ['a']], sim_func=Jaro().get_raw_score, threshold=0.8)
>>> soft_tfidf.get_raw_score(['a', 'b', 'a'], ['a', 'c'])
0.17541160386140586
>>> soft_tfidf = SoftTfIdf([['a', 'b', 'a'], ['a', 'c'], ['a']], threshold=0.9)
>>> soft_tfidf.get_raw_score(['a', 'b', 'a'], ['a'])
0.5547001962252291
>>> soft_tfidf = SoftTfIdf([['x', 'y'], ['w'], ['q']])
>>> soft_tfidf.get_raw_score(['a', 'b', 'a'], ['a'])
0.0
>>> soft_tfidf = SoftTfIdf(sim_func=Affine().get_raw_score, threshold=0.6)
>>> soft_tfidf.get_raw_score(['aa', 'bb', 'a'], ['ab', 'ba'])
0.81649658092772592
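
The first three scores above can be checked by hand with a small pure-Python sketch of undampened soft TF/IDF. This is an illustrative reconstruction, not the library's source: soft_tfidf_sketch and exact_match are hypothetical names, and an exact-match function stands in for Jaro, which is equivalent on single-character tokens (1.0 for identical tokens, 0.0 otherwise).

```python
import math
from collections import Counter

def soft_tfidf_sketch(bag1, bag2, corpus, sim_func, threshold):
    """Undampened soft TF/IDF between two token lists (illustrative only)."""
    if not bag1 or not bag2:
        return 0.0
    n_docs = len(corpus)
    # Document frequency: number of corpus documents containing each token.
    df = Counter(tok for doc in corpus for tok in set(doc))
    tf1, tf2 = Counter(bag1), Counter(bag2)

    # For each token of bag1, keep its best partner in bag2 above the threshold.
    best = {}
    for t1 in tf1:
        for t2 in tf2:
            score = sim_func(t1, t2)
            if score > threshold and score > best.get(t1, (None, 0.0))[1]:
                best[t1] = (t2, score)

    numerator = norm1 = norm2 = 0.0
    for tok in set(tf1) | set(tf2):
        if tok not in df:  # tokens unseen in the corpus carry no weight
            continue
        idf = n_docs / df[tok]
        norm1 += (idf * tf1[tok]) ** 2
        norm2 += (idf * tf2[tok]) ** 2
        if tok in best:
            partner, score = best[tok]
            w1 = idf * tf1[tok]
            w2 = (n_docs / df.get(partner, 1)) * tf2[partner]
            numerator += w1 * w2 * score
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return numerator / (math.sqrt(norm1) * math.sqrt(norm2))

def exact_match(a, b):
    """Stand-in secondary similarity: 1.0 for identical tokens, else 0.0."""
    return 1.0 if a == b else 0.0

corpus = [['a', 'b', 'a'], ['a', 'c'], ['a']]
print(soft_tfidf_sketch(['a', 'b', 'a'], ['a', 'c'], corpus, exact_match, 0.8))  # about 0.1754
print(soft_tfidf_sketch(['a', 'b', 'a'], ['a'], corpus, exact_match, 0.9))       # about 0.5547
print(soft_tfidf_sketch(['a', 'b', 'a'], ['a'], [['x', 'y'], ['w'], ['q']], exact_match, 0.5))  # 0.0
```

In the second call, only 'a' matches: the numerator is 2 * 1 * 1.0 = 2, the squared norms are 4 + 9 = 13 and 1, and 2 / sqrt(13) is about 0.5547. In the third call, none of the input tokens occur in the corpus, so the score is 0.0.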

References

  • the string matching chapter of the “Principles of Data Integration” book.
get_sim_func()[source]

Get secondary similarity function.

Returns: Secondary similarity function (function).
get_threshold()[source]

Get threshold used for the secondary similarity function.

Returns: Threshold (float).
set_corpus_list(corpus_list)[source]

Set corpus list.

Parameters: corpus_list (list of lists) – Corpus list.
set_sim_func(sim_func)[source]

Set secondary similarity function.

Parameters: sim_func (function) – Secondary similarity function.
set_threshold(threshold)[source]

Set threshold value for the secondary similarity function.

Parameters: threshold (float) – Threshold value.