TF/IDF

class py_stringmatching.similarity_measure.tfidf.TfIdf(corpus_list=None, dampen=False)

Computes the TF/IDF measure.

This measure uses the TF/IDF score, which is commonly used in information retrieval (IR) to find documents that are relevant to keyword queries. The underlying intuition is that two strings are similar if they share distinguishing terms, that is, terms that occur in few other strings of the corpus. See the string matching chapter in the book “Principles of Data Integration”.
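
To make the intuition concrete, here is a minimal pure-Python sketch of an undampened TF/IDF cosine score. It is not the library's source. The helper name tfidf_cosine and the weighting (raw term count times N/df, compared by cosine similarity) are assumptions, chosen because they appear consistent with the undampened example values listed under get_raw_score.

```python
import math
from collections import Counter


def tfidf_cosine(bag1, bag2, corpus):
    """Undampened TF/IDF cosine between two token lists (illustrative sketch)."""
    n_docs = len(corpus)
    # Document frequency: number of corpus documents that contain each token.
    df = Counter(token for doc in corpus for token in set(doc))
    tf1, tf2 = Counter(bag1), Counter(bag2)

    def weights(tf):
        # Assumed weighting: raw count times N / df.
        # Tokens that never occur in the corpus are skipped.
        return {t: count * n_docs / df[t] for t, count in tf.items() if df[t] > 0}

    w1, w2 = weights(tf1), weights(tf2)
    dot = sum(w * w2.get(t, 0.0) for t, w in w1.items())
    if dot == 0:
        return 0.0
    norm1 = math.sqrt(sum(w * w for w in w1.values()))
    norm2 = math.sqrt(sum(w * w for w in w2.values()))
    return dot / (norm1 * norm2)


corpus = [['a', 'b', 'a'], ['a', 'c'], ['a']]
score_ac = tfidf_cosine(['a', 'b', 'a'], ['a', 'c'], corpus)
score_a = tfidf_cosine(['a', 'b', 'a'], ['a'], corpus)
```

In this corpus the token 'a' occurs in every document, so sharing it carries little weight. The score against ['a', 'c'] stays low because the distinguishing tokens 'b' and 'c' are not shared.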

Note

Currently, when you create a TF/IDF similarity measure object, the dampen flag defaults to False. In most cases you will want to set this flag to True, so that the TF and IDF formulas use logarithms. This default will likely be changed in the next release.
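
The effect of dampening on a single term weight can be sketched as follows. The helper term_weight is hypothetical, and the dampened form log(tf + 1) * log(idf) is an assumption rather than something stated on this page, so treat it as an illustration of the idea and not as the library's exact formula.

```python
import math


def term_weight(tf, idf, dampen=False):
    """Weight of one term; the dampened form is an assumed formula."""
    if dampen:
        # Logarithms compress large term counts and large IDF values.
        return math.log(tf + 1) * math.log(idf)
    return tf * idf


# Same IDF, increasingly frequent term.
raw = [term_weight(tf, idf=4.0) for tf in (1, 10, 100)]
damped = [term_weight(tf, idf=4.0, dampen=True) for tf in (1, 10, 100)]
```

Without dampening the weight grows linearly with the term count, so one heavily repeated token can dominate the score. With dampening the growth is much slower.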

Parameters:
  • corpus_list (list of lists) – The corpus used to compute TF and IDF values. It is a list of strings, each of which has been tokenized into a list of tokens (that is, a bag of tokens). Defaults to None, in which case calling this measure on two input strings (using get_raw_score or get_sim_score) takes the corpus to be the list of those two strings.
  • dampen (boolean) – Flag indicating whether ‘log’ should be used in the TF and IDF formulas. In general, this flag should be set to True.
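
The default-corpus behavior described for corpus_list can be illustrated with a small sketch. The helper document_frequency and the variable effective_corpus are hypothetical names used only for this illustration. The sketch shows which documents document frequencies would be counted over when no corpus is supplied, under the assumption that each input counts as one document.

```python
from collections import Counter


def document_frequency(corpus):
    """Count, for each token, how many documents in the corpus contain it."""
    return Counter(token for doc in corpus for token in set(doc))


bag1, bag2 = ['a', 'b', 'a'], ['a']
corpus_list = None  # mirrors the documented default

# With no corpus supplied, the two inputs themselves act as the corpus.
effective_corpus = [bag1, bag2] if corpus_list is None else corpus_list
df = document_frequency(effective_corpus)
```

Here 'a' is counted in both documents and 'b' in one, so 'b' is the more distinguishing token of the pair.
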
dampen

boolean

An attribute to store the dampen flag.

get_corpus_list()

Get corpus list.

Returns: corpus list (list of lists).

get_dampen()

Get dampen flag.

Returns: dampen flag (boolean).

get_raw_score(bag1, bag2)

Computes the raw TF/IDF score between two lists.

Parameters: bag1, bag2 (list) – Input lists.
Returns: TF/IDF score between the input lists (float).
Raises: TypeError – If the inputs are not lists or if one of the inputs is None.

Examples

>>> from py_stringmatching.similarity_measure.tfidf import TfIdf
>>> # here the corpus is a list of three strings that
>>> # have been tokenized into three lists of tokens
>>> tfidf = TfIdf([['a', 'b', 'a'], ['a', 'c'], ['a']])
>>> tfidf.get_raw_score(['a', 'b', 'a'], ['a', 'c'])
0.17541160386140586
>>> tfidf.get_raw_score(['a', 'b', 'a'], ['a'])
0.5547001962252291
>>> tfidf = TfIdf([['a', 'b', 'a'], ['a', 'c'], ['a'], ['b']], True)
>>> tfidf.get_raw_score(['a', 'b', 'a'], ['a', 'c'])
0.11166746710505392
>>> tfidf = TfIdf([['x', 'y'], ['w'], ['q']])
>>> tfidf.get_raw_score(['a', 'b', 'a'], ['a'])
0.0
>>> tfidf = TfIdf([['x', 'y'], ['w'], ['q']], True)
>>> tfidf.get_raw_score(['a', 'b', 'a'], ['a'])
0.0
>>> tfidf = TfIdf()
>>> tfidf.get_raw_score(['a', 'b', 'a'], ['a'])
0.7071067811865475

get_sim_score(bag1, bag2)

Computes the normalized TF/IDF similarity score between two lists. This method simply calls get_raw_score.

Parameters: bag1, bag2 (list) – Input lists.
Returns: Normalized TF/IDF similarity score between the input lists (float).
Raises: TypeError – If the inputs are not lists or if one of the inputs is None.

Examples

>>> from py_stringmatching.similarity_measure.tfidf import TfIdf
>>> tfidf = TfIdf([['a', 'b', 'a'], ['a', 'c'], ['a']])
>>> tfidf.get_sim_score(['a', 'b', 'a'], ['a', 'c'])
0.17541160386140586
>>> tfidf.get_sim_score(['a', 'b', 'a'], ['a'])
0.5547001962252291
>>> tfidf = TfIdf([['a', 'b', 'a'], ['a', 'c'], ['a'], ['b']], True)
>>> tfidf.get_sim_score(['a', 'b', 'a'], ['a', 'c'])
0.11166746710505392
>>> tfidf = TfIdf([['x', 'y'], ['w'], ['q']])
>>> tfidf.get_sim_score(['a', 'b', 'a'], ['a'])
0.0
>>> tfidf = TfIdf([['x', 'y'], ['w'], ['q']], True)
>>> tfidf.get_sim_score(['a', 'b', 'a'], ['a'])
0.0
>>> tfidf = TfIdf()
>>> tfidf.get_sim_score(['a', 'b', 'a'], ['a'])
0.7071067811865475

set_corpus_list(corpus_list)

Set corpus list.

Parameters: corpus_list (list of lists) – Corpus list.

set_dampen(dampen)

Set dampen flag.

Parameters: dampen (boolean) – Flag to indicate whether ‘log’ should be applied to TF and IDF formulas.