Soft TF/IDF
class py_stringmatching.similarity_measure.soft_tfidf.SoftTfIdf(corpus_list=None, sim_func=jaro_function, threshold=0.5)
Computes soft TF/IDF measure.
Note
Currently, this measure is implemented without dampening. This is similar to setting the dampen flag to False in the TF-IDF measure. We plan to add the dampen flag in the next release.
Parameters:
- corpus_list (list of lists) – Corpus list of strings (defaults to None). If set to None, the input lists are considered the only corpus.
- sim_func (function) – Secondary similarity function (optional). It should return a similarity score between two strings. Defaults to the Jaro similarity measure.
- threshold (float) – Threshold value for the secondary similarity function (defaults to 0.5). If the similarity of a token pair exceeds the threshold, then the token pair is considered a match.
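The role of sim_func and threshold can be illustrated with a minimal sketch. It is not the library's code: the helper names `ratio` and `best_matches` are hypothetical, and `difflib.SequenceMatcher` stands in for the default Jaro measure. It pairs each token of the first list with its highest-scoring partner in the second list, and keeps the pair only if the score strictly exceeds the threshold.

```python
from difflib import SequenceMatcher


def ratio(s1, s2):
    """Stand-in secondary similarity in [0, 1] (an assumption, not Jaro)."""
    return SequenceMatcher(None, s1, s2).ratio()


def best_matches(bag1, bag2, sim_func=ratio, threshold=0.5):
    """Map each token of bag1 to its best partner in bag2, if any.

    A pair counts as a match only when its score is strictly greater
    than the threshold; among matches, the highest score wins.
    """
    matches = {}
    for t1 in set(bag1):
        best_score = 0.0
        for t2 in set(bag2):
            score = sim_func(t1, t2)
            if score > threshold and score > best_score:
                matches[t1] = (t2, score)
                best_score = score
    return matches


# "apple" pairs with "aple"; "corp" has no partner above the threshold.
pairs = best_matches(["apple", "corp"], ["aple", "inc"], threshold=0.8)
print(pairs)
```

Raising the threshold makes matching stricter: with a value close to 1.0, only near-identical tokens are paired.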
sim_func (function) – An attribute to store the secondary similarity function.
threshold (float) – An attribute to store the threshold value for the secondary similarity function.
get_raw_score(bag1, bag2)
Computes the raw soft TF/IDF score between two lists given the corpus information.
Parameters: bag1, bag2 (list) – Input lists.
Returns: Soft TF/IDF score between the input lists (float).
Raises: TypeError – If the inputs are not lists or if one of the inputs is None.
Examples
>>> from py_stringmatching import SoftTfIdf, Jaro, Affine
>>> soft_tfidf = SoftTfIdf([['a', 'b', 'a'], ['a', 'c'], ['a']], sim_func=Jaro().get_raw_score, threshold=0.8)
>>> soft_tfidf.get_raw_score(['a', 'b', 'a'], ['a', 'c'])
0.17541160386140586
>>> soft_tfidf = SoftTfIdf([['a', 'b', 'a'], ['a', 'c'], ['a']], threshold=0.9)
>>> soft_tfidf.get_raw_score(['a', 'b', 'a'], ['a'])
0.5547001962252291
>>> soft_tfidf = SoftTfIdf([['x', 'y'], ['w'], ['q']])
>>> soft_tfidf.get_raw_score(['a', 'b', 'a'], ['a'])
0.0
>>> soft_tfidf = SoftTfIdf(sim_func=Affine().get_raw_score, threshold=0.6)
>>> soft_tfidf.get_raw_score(['aa', 'bb', 'a'], ['ab', 'ba'])
0.81649658092772592
References
- the string matching chapter of the “Principles of Data Integration” book.
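To make the computation behind get_raw_score more concrete, here is a minimal pure-Python sketch of an undampened soft TF/IDF score. It is an illustration under stated assumptions, not the library's implementation: `soft_tfidf_sketch` and `exact_match` are hypothetical names, the weight of a token is taken to be tf × (corpus size / document frequency), tokens absent from the corpus are skipped, and an exact-match function stands in for Jaro. The stand-in is harmless for single-character tokens, where Jaro also gives 1.0 for equal tokens and 0.0 otherwise.

```python
import math
from collections import Counter


def exact_match(s1, s2):
    """Stand-in secondary similarity: 1.0 for equal tokens, else 0.0."""
    return 1.0 if s1 == s2 else 0.0


def soft_tfidf_sketch(bag1, bag2, corpus_list=None,
                      sim_func=exact_match, threshold=0.5):
    """Undampened soft TF/IDF between two token lists (illustrative only)."""
    if not bag1 or not bag2:
        return 0.0
    tf_x, tf_y = Counter(bag1), Counter(bag2)

    # With no corpus, the two input lists act as the whole corpus.
    docs = corpus_list if corpus_list is not None else [bag1, bag2]
    n_docs = len(docs)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for doc in docs for tok in set(doc))

    # Best partner in bag2 for each bag1 token, if it beats the threshold.
    best = {}
    for tx in tf_x:
        for ty in tf_y:
            score = sim_func(tx, ty)
            if score > threshold and score > best.get(tx, (None, 0.0))[1]:
                best[tx] = (ty, score)

    num = norm_x = norm_y = 0.0
    for tok in set(tf_x) | set(tf_y):
        if tok not in df:
            continue  # assumption: tokens unseen in the corpus are ignored
        idf = n_docs / df[tok]
        norm_x += (idf * tf_x[tok]) ** 2
        norm_y += (idf * tf_y[tok]) ** 2
        if tok in best:
            partner, score = best[tok]
            # assumption: an unseen partner counts as appearing in one document
            w_partner = (n_docs / df.get(partner, 1)) * tf_y[partner]
            num += (idf * tf_x[tok]) * w_partner * score

    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return num / (math.sqrt(norm_x) * math.sqrt(norm_y))


corpus = [['a', 'b', 'a'], ['a', 'c'], ['a']]
print(soft_tfidf_sketch(['a', 'b', 'a'], ['a', 'c'], corpus, threshold=0.8))
print(soft_tfidf_sketch(['a', 'b', 'a'], ['a'], corpus, threshold=0.9))
print(soft_tfidf_sketch(['a', 'b', 'a'], ['a'], [['x', 'y'], ['w'], ['q']]))
```

Under these assumptions the three calls agree with the first three documented examples (about 0.1754, about 0.5547, and 0.0). In the second call, only 'a' is matched, so the numerator is 2 × 1 and the denominator is √(2² + 3²) × √(1²) = √13.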
get_sim_func()
Get secondary similarity function.
Returns: secondary similarity function (function).
get_threshold()
Get threshold used for the secondary similarity function.
Returns: threshold (float).
set_corpus_list(corpus_list)
Set corpus list.
Parameters: corpus_list (list of lists) – Corpus list.