Qgram Tokenizer

class py_stringmatching.tokenizer.qgram_tokenizer.QgramTokenizer(qval=2, padding=True, prefix_pad='#', suffix_pad='$', return_set=False)

Returns tokens that are sequences of q consecutive characters.

A qgram of an input string s is a substring of s that consists of q consecutive characters. Qgrams are also known as ngrams or kgrams.
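As a rough illustration of this definition (ignoring padding), the qgrams of a string can be listed with a sliding window. The function name `sliding_qgrams` is hypothetical and is not part of py_stringmatching; this is only a sketch of the idea.

```python
def sliding_qgrams(s, q):
    """Return every substring of s made of q consecutive characters, in order."""
    # A string of length n has n - q + 1 such substrings (none if n < q).
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(sliding_qgrams('database', 2))
# ['da', 'at', 'ta', 'ab', 'ba', 'as', 'se']
```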

Parameters:

qval (int) – A value for q, that is, the qgram’s length (defaults to 2).
return_set (boolean) – A flag to indicate whether to return a set of tokens or a bag of tokens (defaults to False).
padding (boolean) – A flag to indicate whether a prefix and a suffix should be added to the input string (defaults to True).
prefix_pad (str) – A single character (that is, a string of length 1 in Python) that is repeated (qval-1) times and prepended to the input string when padding is True (defaults to ‘#’).
suffix_pad (str) – A single character (that is, a string of length 1 in Python) that is repeated (qval-1) times and appended to the input string when padding is True (defaults to ‘$’).
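The padding rule described by these parameters can be sketched as plain string concatenation. The helper `pad_for_qgrams` is a hypothetical name used only for illustration and is not the library's implementation.

```python
def pad_for_qgrams(s, qval=2, prefix_pad='#', suffix_pad='$'):
    """Surround s with (qval - 1) copies of each pad character."""
    return prefix_pad * (qval - 1) + s + suffix_pad * (qval - 1)


padded = pad_for_qgrams('database', qval=3)
print(padded)                 # ##database$$
# Sliding a window of width qval over the padded string yields
# len(s) + qval - 1 tokens.
print(len(padded) - 3 + 1)    # 10
```

With qval=1 no pad characters are added, since each pad is repeated zero times.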

qval: int – An attribute to store the q value.

return_set: boolean – An attribute to store the flag return_set.

padding: boolean – An attribute to store the padding flag.

prefix_pad: str – An attribute to store the prefix string that should be used for padding.

suffix_pad: str – An attribute to store the suffix string that should be used for padding.

get_padding()

Gets the value of the padding flag, which determines whether input strings are padded before tokenization.

Returns:	The boolean value of the padding flag.

get_prefix_pad()

Gets the value of the prefix pad.

Returns:	The prefix pad string.

get_qval()

Gets the value of the qval attribute, which is the length of qgrams.

Returns:	The value of the qval attribute.

get_return_set()

Gets the value of the return_set flag.

Returns:	The boolean value of the return_set flag.

get_suffix_pad()

Gets the value of the suffix pad.

Returns:	The suffix pad string.

set_padding(padding)

Sets the value of the padding flag.

Parameters:	padding (boolean) – A flag to indicate whether input strings should be padded.
Returns:	True if the update was successful.
Raises:	`AssertionError` – If padding is not of type boolean.

set_prefix_pad(prefix_pad)

Sets the value of the prefix pad string.

Parameters:	prefix_pad (str) – A string of length one that should be prepended to the input string before tokenization.
Returns:	True if the update was successful.
Raises:	`AssertionError` – If prefix_pad is not of type string, or if the length of prefix_pad is not one.

set_qval(qval)

Sets the value of the qval attribute.

Parameters:	qval (int) – A value for q (the length of qgrams).
Raises:	`AssertionError` – If qval is less than 1.

set_return_set(return_set)

Sets the value of the return_set flag.

Parameters:	return_set (boolean) – A flag to indicate whether to return a set of tokens instead of a bag of tokens.

set_suffix_pad(suffix_pad)

Sets the value of the suffix pad string.

Parameters:	suffix_pad (str) – A string of length one that should be appended to the input string before tokenization.
Returns:	True if the update was successful.
Raises:	`AssertionError` – If suffix_pad is not of type string, or if the length of suffix_pad is not one.
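The checks documented for the pad and qval setters can be sketched as standalone functions. `check_pad` and `check_qval` are hypothetical helpers written for illustration, and the error messages and the `True` return of `check_qval` are assumptions rather than the library's behavior; only the conditions that raise `AssertionError` come from the descriptions above.

```python
def check_pad(pad):
    """Mirror the documented checks: pad must be a string of length one."""
    if not isinstance(pad, str):
        raise AssertionError('pad must be a string')
    if len(pad) != 1:
        raise AssertionError('pad must have length one')
    return True


def check_qval(qval):
    """Mirror the documented check: qval must be at least 1."""
    if qval < 1:
        raise AssertionError('qval cannot be less than 1')
    return True


print(check_pad('^'))   # True
try:
    check_pad('^^')     # two characters, so the length check fails
except AssertionError as err:
    print(err)          # pad must have length one
```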

tokenize(input_string)

Tokenizes the input string into qgrams.

Parameters:	input_string (str) – The string to be tokenized.
Returns:	A Python list of qgrams, which is a set if the return_set flag is True and a bag if it is False.
Raises:	`TypeError` – If the input is not a string.

Examples

>>> from py_stringmatching.tokenizer.qgram_tokenizer import QgramTokenizer
>>> qg2_tok = QgramTokenizer()
>>> qg2_tok.tokenize('database')
['#d', 'da', 'at', 'ta', 'ab', 'ba', 'as', 'se', 'e$']
>>> qg2_tok.tokenize('a')
['#a', 'a$']
>>> qg3_tok = QgramTokenizer(qval=3)
>>> qg3_tok.tokenize('database')
['##d', '#da', 'dat', 'ata', 'tab', 'aba', 'bas', 'ase', 'se$', 'e$$']
>>> qg2_nopad = QgramTokenizer(padding=False)
>>> qg2_nopad.tokenize('database')
['da', 'at', 'ta', 'ab', 'ba', 'as', 'se']
>>> qg2_diffpads = QgramTokenizer(prefix_pad='^', suffix_pad='!')
>>> qg2_diffpads.tokenize('database')
['^d', 'da', 'at', 'ta', 'ab', 'ba', 'as', 'se', 'e!']
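
The examples above all use the default return_set=False. As a hedged sketch of the set-versus-bag distinction, the snippet below builds unpadded qgrams and then removes duplicates. `qgram_bag` and `bag_to_set` are hypothetical helpers, not library functions, and keeping first-occurrence order is an assumption: the documentation above states only that a list is returned.

```python
def qgram_bag(s, q=2):
    """All qgrams of s in order, duplicates kept (a bag)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def bag_to_set(tokens):
    """Drop repeated tokens, keeping the first occurrence of each (assumed order)."""
    return list(dict.fromkeys(tokens))


bag = qgram_bag('banana')
print(bag)              # ['ba', 'an', 'na', 'an', 'na']
print(bag_to_set(bag))  # ['ba', 'an', 'na']
```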