TF/IDF

class py_stringmatching.similarity_measure.tfidf.TfIdf(corpus_list=None, dampen=True)

Computes TF/IDF measure.

This measure uses the TF/IDF score, which is common in information retrieval (IR) for finding documents that are relevant to keyword queries. The intuition behind the TF/IDF measure is that two strings are similar if they share distinguishing terms, that is, terms that occur in few other strings of the corpus. See the string matching chapter in the book “Principles of Data Integration”.

Parameters:

corpus_list (list of lists) – The corpus used to compute TF and IDF values. The corpus is a list of strings, where each string has been tokenized into a list of tokens (that is, a bag of tokens). Defaults to None, in which case each call to get_raw_score or get_sim_score treats the two input bags themselves as the corpus.
dampen (boolean) – Flag indicating whether a logarithm should be applied in the TF and IDF formulas (defaults to True).
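How the two parameters interact can be sketched in plain Python. The function below is an illustrative reimplementation, not the library's source code: the name `tfidf_cosine` and the exact weighting formulas are assumptions, chosen because they reproduce the example values documented for get_raw_score. Empty or identical inputs are not treated specially here, so the library may behave differently in those cases.

```python
from collections import Counter
from math import log, sqrt


def tfidf_cosine(bag1, bag2, corpus_list=None, dampen=True):
    """Illustrative TF/IDF cosine score (hypothetical helper, not the library code)."""
    tf1, tf2 = Counter(bag1), Counter(bag2)

    # With no corpus supplied, the two input bags themselves form the corpus.
    corpus = [bag1, bag2] if corpus_list is None else corpus_list

    # Document frequency: number of corpus documents that contain each token.
    df = Counter(token for doc in corpus for token in set(doc))
    n_docs = len(corpus)

    dot = norm1 = norm2 = 0.0
    for token in set(tf1) | set(tf2):
        if token not in df:
            continue  # tokens absent from the corpus contribute nothing
        idf = n_docs / df[token]
        if dampen:
            # Assumed damped form: log(idf) * log(tf + 1)
            w1 = log(idf) * log(tf1[token] + 1)
            w2 = log(idf) * log(tf2[token] + 1)
        else:
            # Assumed undamped form: idf * tf
            w1 = idf * tf1[token]
            w2 = idf * tf2[token]
        dot += w1 * w2
        norm1 += w1 * w1
        norm2 += w2 * w2

    # Cosine of the two weight vectors; 0.0 when they share no weighted token.
    return 0.0 if dot == 0 else dot / (sqrt(norm1) * sqrt(norm2))


corpus = [['a', 'b', 'a'], ['a', 'c'], ['a']]
print(tfidf_cosine(['a', 'b', 'a'], ['b', 'c'], corpus))  # approximately 0.7071
```

Under these assumptions, a token that occurs in every corpus document has an IDF of 1, so with dampen=True its weight is log(1) = 0 and it cannot contribute to the score.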

dampen

boolean

An attribute to store the dampen flag.

get_corpus_list()

Get corpus list.

Returns:	corpus list (list of lists).

get_dampen()

Get dampen flag.

Returns:	dampen flag (boolean).

get_raw_score(bag1, bag2)

Computes the raw TF/IDF score between two lists of tokens.

Parameters:	bag1, bag2 (list) – Input lists of tokens.
Returns:	TF/IDF score between the input lists (float).
Raises:	`TypeError` – If the inputs are not lists or if one of the inputs is None.
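Because both arguments must already be lists of tokens, raw strings have to be tokenized before they are scored. A minimal sketch, assuming plain whitespace tokenization is acceptable; `to_bag` is a hypothetical helper introduced only for illustration, and any tokenizer that keeps duplicate tokens would serve the same purpose.

```python
def to_bag(text):
    """Hypothetical helper: split on whitespace and keep duplicate tokens."""
    return text.split()


raw_strings = ["a b a", "a c", "a"]

# Each string becomes a bag of tokens; together the bags form a corpus list.
corpus_list = [to_bag(s) for s in raw_strings]
print(corpus_list)  # [['a', 'b', 'a'], ['a', 'c'], ['a']]
```

The resulting list of lists has the shape expected for corpus_list, and each inner list has the shape expected for bag1 and bag2.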

Examples

>>> # here the corpus is a list of three strings that
>>> # have been tokenized into three lists of tokens
>>> tfidf = TfIdf([['a', 'b', 'a'], ['a', 'c'], ['a']])
>>> tfidf.get_raw_score(['a', 'b', 'a'], ['b', 'c'])
0.7071067811865475
>>> tfidf.get_raw_score(['a', 'b', 'a'], ['a'])
0.0
>>> tfidf = TfIdf([['x', 'y'], ['w'], ['q']])
>>> tfidf.get_raw_score(['a', 'b', 'a'], ['a'])
0.0
>>> tfidf = TfIdf([['a', 'b', 'a'], ['a', 'c'], ['a'], ['b']], False)
>>> tfidf.get_raw_score(['a', 'b', 'a'], ['a', 'c'])
0.25298221281347033
>>> tfidf = TfIdf(dampen=False)
>>> tfidf.get_raw_score(['a', 'b', 'a'], ['a'])
0.7071067811865475
>>> tfidf = TfIdf()
>>> tfidf.get_raw_score(['a', 'b', 'a'], ['a'])
0.0
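Several of the examples above return 0.0 even though both bags contain 'a'. One way to see why, assuming the damped IDF is computed as log(N / df), with N the number of corpus documents and df the number of documents containing the token (an assumption that is consistent with the values above, not a statement of the library's internals):

```python
from collections import Counter
from math import log

corpus = [['a', 'b', 'a'], ['a', 'c'], ['a']]

# Document frequency: in how many corpus documents does each token appear?
df = Counter(token for doc in corpus for token in set(doc))

# Assumed damped IDF: log(N / df)
damped_idf = {token: log(len(corpus) / count) for token, count in df.items()}

print(df['a'], damped_idf['a'])  # 3 0.0
print(df['b'], df['c'])          # 1 1
```

The token 'a' occurs in all three documents, so its damped IDF is zero and it carries no weight; only the rarer tokens 'b' and 'c' can make two bags similar. The same reasoning applies when no corpus is given and a shared token appears in both input bags.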

get_sim_score(bag1, bag2)

Computes the normalized TF/IDF similarity score between two lists of tokens. This method simply calls get_raw_score.

Parameters:	bag1, bag2 (list) – Input lists of tokens.
Returns:	Normalized TF/IDF similarity score between the input lists (float).
Raises:	`TypeError` – If the inputs are not lists or if one of the inputs is None.

Examples

>>> # here the corpus is a list of three strings that
>>> # have been tokenized into three lists of tokens
>>> tfidf = TfIdf([['a', 'b', 'a'], ['a', 'c'], ['a']])
>>> tfidf.get_sim_score(['a', 'b', 'a'], ['b', 'c'])
0.7071067811865475
>>> tfidf.get_sim_score(['a', 'b', 'a'], ['a'])
0.0
>>> tfidf = TfIdf([['x', 'y'], ['w'], ['q']])
>>> tfidf.get_sim_score(['a', 'b', 'a'], ['a'])
0.0
>>> tfidf = TfIdf([['a', 'b', 'a'], ['a', 'c'], ['a'], ['b']], False)
>>> tfidf.get_sim_score(['a', 'b', 'a'], ['a', 'c'])
0.25298221281347033
>>> tfidf = TfIdf(dampen=False)
>>> tfidf.get_sim_score(['a', 'b', 'a'], ['a'])
0.7071067811865475
>>> tfidf = TfIdf()
>>> tfidf.get_sim_score(['a', 'b', 'a'], ['a'])
0.0

set_corpus_list(corpus_list)

Set corpus list.

Parameters:	corpus_list (list of lists) – Corpus list.

set_dampen(dampen)

Set dampen flag.

Parameters:	dampen (boolean) – Flag indicating whether a logarithm should be applied in the TF and IDF formulas.