Qgram Tokenizer
- class py_stringmatching.tokenizer.qgram_tokenizer.QgramTokenizer(qval=2, padding=True, prefix_pad='#', suffix_pad='$', return_set=False)
Returns tokens that are sequences of q consecutive characters.
A qgram of an input string s is a substring t (of s) which is a sequence of q consecutive characters. Qgrams are also known as ngrams or kgrams.
Parameters:
- qval (int) – A value for q, that is, the qgram’s length (defaults to 2).
- return_set (boolean) – A flag to indicate whether to return a set of tokens or a bag of tokens (defaults to False).
- padding (boolean) – A flag to indicate whether a prefix and a suffix should be added to the input string (defaults to True).
- prefix_pad (str) – A character (that is, a string of length 1 in Python) that is replicated (qval-1) times and prepended to the input string if padding is True (defaults to ‘#’).
- suffix_pad (str) – A character (that is, a string of length 1 in Python) that is replicated (qval-1) times and appended to the input string if padding is True (defaults to ‘$’).
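The padding and windowing described by these parameters can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function name `qgrams_sketch` is hypothetical, and it assumes that a string too short to hold a single qgram yields an empty list.

```python
def qgrams_sketch(s, qval=2, padding=True, prefix_pad='#', suffix_pad='$'):
    """Illustrative only: slide a window of length qval over the (padded) string."""
    if padding:
        # Replicate each pad character (qval - 1) times, as the parameters describe.
        s = prefix_pad * (qval - 1) + s + suffix_pad * (qval - 1)
    # A string of length n holds n - qval + 1 windows (none if n < qval).
    return [s[i:i + qval] for i in range(len(s) - qval + 1)]


print(qgrams_sketch('database'))
print(qgrams_sketch('database', qval=3))
print(qgrams_sketch('database', padding=False))
```

Under this sketch, a string of length n produces n + qval - 1 qgrams with padding and n - qval + 1 without it, which agrees with the counts in the examples for 'database' (9, 10, and 7 tokens respectively).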
- qval (int) – An attribute to store the q value.
- return_set (boolean) – An attribute to store the flag return_set.
- padding (boolean) – An attribute to store the padding flag.
- prefix_pad (str) – An attribute to store the prefix string that should be used for padding.
- suffix_pad (str) – An attribute to store the suffix string that should be used for padding.
- get_padding()
Gets the value of the padding flag. This flag determines whether input strings are padded before tokenization.
Returns: The Boolean value of the padding flag.
- get_qval()
Gets the value of the qval attribute, which is the length of qgrams.
Returns: The value of the qval attribute.
- get_return_set()
Gets the value of the return_set flag.
Returns: The Boolean value of the return_set flag.
- set_padding(padding)
Sets the value of the padding flag.
Parameters: padding (boolean) – A flag to indicate whether padding should be done.
Returns: The Boolean value True if the update was successful.
Raises: AssertionError – If padding is not of type boolean.
- set_prefix_pad(prefix_pad)
Sets the value of the prefix pad string.
Parameters: prefix_pad (str) – String that should be prepended to the input string before tokenization.
Returns: The Boolean value True if the update was successful.
Raises:
- AssertionError – If prefix_pad is not of type string.
- AssertionError – If the length of prefix_pad is not one.
- set_qval(qval)
Sets the value of the qval attribute.
Parameters: qval (int) – A value for q (the length of qgrams).
Raises: AssertionError – If qval is less than 1.
- set_return_set(return_set)
Sets the value of the return_set flag.
Parameters: return_set (boolean) – A flag to indicate whether to return a set of tokens instead of a bag of tokens.
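To illustrate the difference between a bag and a set of tokens, the sketch below deduplicates a list of qgrams while keeping the first occurrence of each. The helper name `bag_to_set_sketch` is hypothetical, and preserving first-occurrence order is an assumption made for this example rather than documented behavior.

```python
def bag_to_set_sketch(tokens):
    """Illustrative only: drop repeated tokens, keeping first occurrences."""
    seen = set()
    unique = []
    for tok in tokens:
        if tok not in seen:
            seen.add(tok)
            unique.append(tok)
    return unique


# Unpadded 2-grams of 'banana'; 'an' and 'na' each occur twice in the bag.
bag = ['ba', 'an', 'na', 'an', 'na']
print(bag_to_set_sketch(bag))  # ['ba', 'an', 'na']
```

The result is still a Python list, consistent with the return type documented for tokenize; only the repeated tokens are gone.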
- set_suffix_pad(suffix_pad)
Sets the value of the suffix pad string.
Parameters: suffix_pad (str) – String that should be appended to the input string before tokenization.
Returns: The Boolean value True if the update was successful.
Raises:
- AssertionError – If suffix_pad is not of type string.
- AssertionError – If the length of suffix_pad is not one.
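The checks documented for set_prefix_pad and set_suffix_pad can be sketched as a standalone function. `check_pad_sketch` is a hypothetical name, and the code only mirrors the contract stated above (a string of length one, True on success, AssertionError otherwise); it is not taken from the library source.

```python
def check_pad_sketch(pad):
    """Illustrative only: validate a pad character the way the setters document."""
    if not isinstance(pad, str):
        raise AssertionError('pad is expected to be of type string')
    if len(pad) != 1:
        raise AssertionError('pad should have length equal to 1')
    return True


print(check_pad_sketch('^'))  # True

try:
    check_pad_sketch('##')  # two characters, so the length check fails
except AssertionError as err:
    print('rejected:', err)
```

Because the documented failure mode is an AssertionError rather than a ValueError or TypeError, callers that want to recover from a bad pad value would need to catch AssertionError specifically.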
- tokenize(input_string)
Tokenizes the input string into qgrams.
Parameters: input_string (str) – The string to be tokenized.
Returns: A Python list, which is a set or a bag of qgrams, depending on whether the return_set flag is True or False.
Raises: TypeError – If the input is not a string.
Examples
>>> from py_stringmatching.tokenizer.qgram_tokenizer import QgramTokenizer
>>> qg2_tok = QgramTokenizer()
>>> qg2_tok.tokenize('database')
['#d', 'da', 'at', 'ta', 'ab', 'ba', 'as', 'se', 'e$']
>>> qg2_tok.tokenize('a')
['#a', 'a$']
>>> qg3_tok = QgramTokenizer(qval=3)
>>> qg3_tok.tokenize('database')
['##d', '#da', 'dat', 'ata', 'tab', 'aba', 'bas', 'ase', 'se$', 'e$$']
>>> qg2_nopad = QgramTokenizer(padding=False)
>>> qg2_nopad.tokenize('database')
['da', 'at', 'ta', 'ab', 'ba', 'as', 'se']
>>> qg2_diffpads = QgramTokenizer(prefix_pad='^', suffix_pad='!')
>>> qg2_diffpads.tokenize('database')
['^d', 'da', 'at', 'ta', 'ab', 'ba', 'as', 'se', 'e!']