Qgram Tokenizer

class py_stringmatching.tokenizer.qgram_tokenizer.QgramTokenizer(qval=2, return_set=False)

Returns tokens that are sequences of q consecutive characters.

A qgram of an input string s is a substring t (of s) which is a sequence of q consecutive characters. Qgrams are also known as ngrams or kgrams.
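
The definition above can be sketched in plain Python, independent of the library. The helper name `qgrams` is hypothetical and is not part of py_stringmatching; it only illustrates sliding a window of length q over the string, and it raises ValueError for q < 1 as an assumption of this sketch.

```python
def qgrams(s, q=2):
    """Return every substring of s made of q consecutive characters.

    Hypothetical helper for illustration only; not the library's code.
    """
    if q < 1:
        raise ValueError("q must be at least 1")
    # A string of length n has n - q + 1 qgrams, and none when n < q.
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("database"))     # ['da', 'at', 'ta', 'ab', 'ba', 'as', 'se']
print(qgrams("database", 3))  # ['dat', 'ata', 'tab', 'aba', 'bas', 'ase']
print(qgrams("a"))            # []
```

A string shorter than q yields an empty list, because no window of q characters fits inside it.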

Parameters:	qval (int) – A value for q, that is, the qgram’s length (defaults to 2).
	return_set (boolean) – A flag to indicate whether to return a set of tokens or a bag of tokens (defaults to False).

qval

int

An attribute to store the q value.

return_set

boolean

An attribute to store the flag return_set.

get_qval()

Gets the value of the qval attribute, which is the length of qgrams.

Returns:	The value of the qval attribute.

get_return_set()

Gets the value of the return_set flag.

Returns:	The boolean value of the return_set flag.

set_qval(qval)

Sets the value of the qval attribute.

Parameters:	qval (int) – A value for q (the length of qgrams).
Raises:	`AssertionError` – If qval is less than 1.

set_return_set(return_set)

Sets the value of the return_set flag.

Parameters:	return_set (boolean) – A flag to indicate whether to return a set of tokens instead of a bag of tokens.

tokenize(input_string)

Tokenizes input string into qgrams.

Parameters:	input_string (str) – The string to be tokenized.
Returns:	A Python list of qgrams. The list represents a set (duplicates removed) if the return_set flag is True, and a bag (duplicates kept) if it is False.
Raises:	`TypeError` – If the input is not a string.
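
The set-versus-bag behaviour and the type check described above can be illustrated with a plain-Python sketch. The function `tokenize_sketch` is hypothetical and does not reproduce the library's implementation. In particular, keeping the first-occurrence order when removing duplicates is an assumption of this sketch, since the order of tokens in the set case is not stated here.

```python
def tokenize_sketch(input_string, qval=2, return_set=False):
    """Hypothetical stand-in for tokenize(); for illustration only."""
    if not isinstance(input_string, str):
        raise TypeError("input_string must be a string")
    # Bag of qgrams: repeated qgrams appear once per occurrence.
    tokens = [input_string[i:i + qval]
              for i in range(len(input_string) - qval + 1)]
    if return_set:
        # Assumption: drop duplicates while keeping first-occurrence order.
        tokens = list(dict.fromkeys(tokens))
    return tokens


print(tokenize_sketch("banana"))                   # ['ba', 'an', 'na', 'an', 'na']
print(tokenize_sketch("banana", return_set=True))  # ['ba', 'an', 'na']
```

With return_set=False the qgrams 'an' and 'na' each appear twice; with return_set=True each distinct qgram appears once.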

Examples

>>> qg2_tok = QgramTokenizer()
>>> qg2_tok.tokenize('database')
['da', 'at', 'ta', 'ab', 'ba', 'as', 'se']
>>> qg2_tok.tokenize('a')
[]
>>> qg3_tok = QgramTokenizer(3)
>>> qg3_tok.tokenize('database')
['dat', 'ata', 'tab', 'aba', 'bas', 'ase']

As these examples show, the current qgram tokenizer does not pad the input string with # characters at the start and the end before extracting qgrams. This is left for future work.
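
For readers who need padded qgrams in the meantime, one possible approach is to pad the string before extracting qgrams. This is a minimal sketch, not a feature of the tokenizer: the helper `padded_qgrams` is hypothetical, and the choice of q - 1 copies of '#' on each side is an assumption (a common convention, not something this documentation specifies).

```python
def padded_qgrams(s, q=2, pad="#"):
    """Hypothetical helper: qgrams of s after padding both ends with `pad`.

    Assumption: q - 1 pad characters are added on each side.
    """
    if q < 1:
        raise ValueError("q must be at least 1")
    padded = pad * (q - 1) + s + pad * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


print(padded_qgrams("database"))
# ['#d', 'da', 'at', 'ta', 'ab', 'ba', 'as', 'se', 'e#']
print(padded_qgrams("a"))
# ['#a', 'a#']
```

With this padding, a one-character string such as 'a' yields two qgrams instead of an empty list, and the first and last characters of every string appear in as many qgrams as the interior characters do.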