Alphanumeric Tokenizer

class py_stringmatching.tokenizer.alphanumeric_tokenizer.AlphanumericTokenizer(return_set=False)

Returns tokens that are maximal sequences of consecutive alphanumeric characters.
Parameters: return_set (boolean) – A flag to indicate whether to return a set of tokens instead of a bag of tokens (defaults to False).

return_set
boolean – An attribute to store the value of the flag return_set.

get_return_set()
Gets the value of the return_set flag.
Returns: The boolean value of the return_set flag.

set_return_set(return_set)
Sets the value of the return_set flag.
Parameters: return_set (boolean) – A flag to indicate whether to return a set of tokens instead of a bag of tokens.

tokenize(input_string)
Tokenizes the input string into alphanumeric tokens.
Parameters: input_string (str) – The string to be tokenized.
Returns: A Python list, which represents a set of tokens if the flag return_set is true, and a bag of tokens otherwise.
Raises: TypeError – If the input is not a string.

Examples
>>> from py_stringmatching.tokenizer.alphanumeric_tokenizer import AlphanumericTokenizer
>>> alnum_tok = AlphanumericTokenizer()
>>> alnum_tok.tokenize('data9,(science), data9#.(integration).88')
['data9', 'science', 'data9', 'integration', '88']
>>> alnum_tok.tokenize('#.&')
[]
>>> alnum_tok = AlphanumericTokenizer(return_set=True)
>>> alnum_tok.tokenize('data9,(science), data9#.(integration).88')
['data9', 'science', 'integration', '88']
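
The behaviour documented for tokenize can be approximated with the standard library alone. The following is a minimal sketch, not the py_stringmatching implementation: the helper name alphanumeric_tokens is hypothetical, and it assumes that "alphanumeric" means ASCII letters and digits and that the set form keeps tokens in order of first appearance, as the example outputs suggest.

```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits only.
_ALNUM_RUN = re.compile(r"[A-Za-z0-9]+")


def alphanumeric_tokens(input_string, return_set=False):
    """Hypothetical stand-in for AlphanumericTokenizer.tokenize."""
    if not isinstance(input_string, str):
        raise TypeError("input_string must be a string")
    # findall returns every maximal run of characters matching the class.
    tokens = _ALNUM_RUN.findall(input_string)
    if return_set:
        # Drop duplicates while keeping first-occurrence order.
        tokens = list(dict.fromkeys(tokens))
    return tokens


print(alphanumeric_tokens('data9,(science), data9#.(integration).88'))
print(alphanumeric_tokens('data9,(science), data9#.(integration).88', return_set=True))
```

Every non-alphanumeric character acts as a separator here, so a string made only of punctuation yields an empty list.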