Qgram Tokenizer

class py_stringmatching.tokenizer.qgram_tokenizer.QgramTokenizer(qval=2, padding=True, prefix_pad='#', suffix_pad='$', return_set=False)

Returns tokens that are sequences of q consecutive characters.

A qgram of an input string s is a substring t (of s) which is a sequence of q consecutive characters. Qgrams are also known as ngrams or kgrams.
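The definition above amounts to sliding a window of width q across the string. A minimal pure-Python sketch follows; the helper name `qgrams` is hypothetical and is not part of py_stringmatching, and no padding is applied here.

```python
def qgrams(s, q=2):
    """Return every substring of s made of q consecutive characters."""
    # A string of length n has n - q + 1 such windows (none if n < q).
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("database", 2))
# ['da', 'at', 'ta', 'ab', 'ba', 'as', 'se']
```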

Parameters
  • qval (int) – A value for q, that is, the qgram’s length (defaults to 2).

  • return_set (boolean) – A flag to indicate whether to return a set of tokens (True) instead of a bag of tokens (False, the default).

  • padding (boolean) – A flag to indicate whether a prefix and a suffix should be added to the input string (defaults to True).

  • prefix_pad (str) – A character (that is, a string of length 1 in Python) that should be replicated (qval-1) times and prepended to the input string, if padding was set to True (defaults to ‘#’).

  • suffix_pad (str) – A character (that is, a string of length 1 in Python) that should be replicated (qval-1) times and appended to the input string, if padding was set to True (defaults to ‘$’).
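To illustrate how these parameters interact, here is a rough sketch in plain Python. The function `pad_and_tokenize` is a hypothetical stand-in, not the library's implementation; in particular, it assumes that a "set" is returned as a list with duplicates dropped in first-seen order.

```python
def pad_and_tokenize(s, qval=2, padding=True, prefix_pad="#",
                     suffix_pad="$", return_set=False):
    """Hypothetical helper mirroring the documented parameters."""
    if padding:
        # Replicate each pad character (qval - 1) times around the input.
        s = prefix_pad * (qval - 1) + s + suffix_pad * (qval - 1)
    tokens = [s[i:i + qval] for i in range(len(s) - qval + 1)]
    if return_set:
        # Assumption: drop duplicates while keeping first-seen order.
        tokens = list(dict.fromkeys(tokens))
    return tokens


print(pad_and_tokenize("banana"))
# ['#b', 'ba', 'an', 'na', 'an', 'na', 'a$']
print(pad_and_tokenize("banana", return_set=True))
# ['#b', 'ba', 'an', 'na', 'a$']
```

With qval=3 the same helper would prepend '##' and append '$$', since each pad is replicated (qval-1) times.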

qval

An attribute to store the q value.

Type

int

return_set

An attribute to store the flag return_set.

Type

boolean

padding

An attribute to store the padding flag.

Type

boolean

prefix_pad

An attribute to store the prefix string that should be used for padding.

Type

str

suffix_pad

An attribute to store the suffix string that should be used for padding.

Type

str

get_padding()

Gets the value of the padding flag. This flag determines whether input strings are padded before tokenization.

Returns

The Boolean value of the padding flag.

get_prefix_pad()

Gets the value of the prefix pad.

Returns

The prefix pad string.

get_qval()

Gets the value of the qval attribute, which is the length of qgrams.

Returns

The value of the qval attribute.

get_return_set()

Gets the value of the return_set flag.

Returns

The boolean value of the return_set flag.

get_suffix_pad()

Gets the value of the suffix pad.

Returns

The suffix pad string.

set_padding(padding)

Sets the value of the padding flag.

Parameters

padding (boolean) – Flag to indicate whether padding should be done or not.

Returns

True if the update was successful.

Raises

AssertionError – If padding is not of type boolean.

set_prefix_pad(prefix_pad)

Sets the value of the prefix pad string.

Parameters

prefix_pad (str) – String that should be prepended to the input string before tokenization.

Returns

True if the update was successful.

Raises
  • AssertionError – If the prefix_pad is not of type string.

  • AssertionError – If the length of prefix_pad is not one.
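The two checks listed above could be expressed roughly as follows. This is only a sketch: `check_pad` is a hypothetical function written for illustration, and the error messages are invented, not taken from the library.

```python
def check_pad(pad):
    """Hypothetical validator following the documented Raises rules."""
    if not isinstance(pad, str):
        raise AssertionError("pad must be a string")
    if len(pad) != 1:
        raise AssertionError("pad must have length 1")
    # Mirror the documented return value of a successful update.
    return True


print(check_pad("^"))
for bad in ("##", 7):
    try:
        check_pad(bad)
    except AssertionError as err:
        print("rejected:", err)
```

Restricting the pad to a single character keeps every padded qgram exactly qval characters long.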

set_qval(qval)

Sets the value of the qval attribute.

Parameters

qval (int) – A value for q (the length of qgrams).

Raises

AssertionError – If qval is less than 1.

set_return_set(return_set)

Sets the value of the return_set flag.

Parameters

return_set (boolean) – A flag to indicate whether to return a set of tokens instead of a bag of tokens.

set_suffix_pad(suffix_pad)

Sets the value of the suffix pad string.

Parameters

suffix_pad (str) – String that should be appended to the input string before tokenization.

Returns

True if the update was successful.

Raises
  • AssertionError – If the suffix_pad is not of type string.

  • AssertionError – If the length of suffix_pad is not one.

tokenize(input_string)

Tokenizes the input string into qgrams.

Parameters

input_string (str) – The string to be tokenized.

Returns

A Python list of qgrams. The list represents a set of qgrams if the return_set flag is True, and a bag of qgrams if it is False.

Raises

TypeError – If the input is not a string.

Examples

>>> from py_stringmatching.tokenizer.qgram_tokenizer import QgramTokenizer
>>> qg2_tok = QgramTokenizer()
>>> qg2_tok.tokenize('database')
['#d', 'da', 'at', 'ta', 'ab', 'ba', 'as', 'se', 'e$']
>>> qg2_tok.tokenize('a')
['#a', 'a$']
>>> qg3_tok = QgramTokenizer(qval=3)
>>> qg3_tok.tokenize('database')
['##d', '#da', 'dat', 'ata', 'tab', 'aba', 'bas', 'ase', 'se$', 'e$$']
>>> qg2_nopad = QgramTokenizer(padding=False)
>>> qg2_nopad.tokenize('database')
['da', 'at', 'ta', 'ab', 'ba', 'as', 'se']
>>> qg2_diffpads = QgramTokenizer(prefix_pad='^', suffix_pad='!')
>>> qg2_diffpads.tokenize('database')
['^d', 'da', 'at', 'ta', 'ab', 'ba', 'as', 'se', 'e!']
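The token counts in the examples above are consistent with simple window arithmetic: padding adds (qval-1) characters on each side, and a string of length m contains m - qval + 1 windows. The sketch below checks that arithmetic; `expected_qgram_count` is a hypothetical helper, and it assumes tokenization is the plain sliding window that the examples suggest.

```python
def expected_qgram_count(n, qval=2, padding=True):
    """Number of qgrams for an input of length n (hypothetical helper)."""
    if padding:
        # (qval - 1) pad characters are added at each end.
        n += 2 * (qval - 1)
    # A string of length n holds n - qval + 1 windows, never fewer than 0.
    return max(n - qval + 1, 0)


print(expected_qgram_count(len("database")))                 # 9
print(expected_qgram_count(len("database"), qval=3))         # 10
print(expected_qgram_count(len("database"), padding=False))  # 7
```

With padding enabled this simplifies to n + qval - 1 tokens, which matches the 9 and 10 tokens shown for 'database' with qval=2 and qval=3.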