Alphanumeric Tokenizer

class py_stringmatching.tokenizer.alphanumeric_tokenizer.AlphanumericTokenizer(return_set=False)

Returns tokens that are maximal sequences of consecutive alphanumeric characters.

Parameters:	return_set (boolean) – A flag to indicate whether to return a set of tokens instead of a bag of tokens (defaults to False).

return_set

boolean

An attribute to store the value of the flag return_set.

get_return_set()

Gets the value of the return_set flag.

Returns:	The boolean value of the return_set flag.

set_return_set(return_set)

Sets the value of the return_set flag.

Parameters:	return_set (boolean) – A flag to indicate whether to return a set of tokens instead of a bag of tokens.

tokenize(input_string)

Tokenizes the input string into alphanumeric tokens.

Parameters:	input_string (str) – The string to be tokenized.
Returns:	A Python list that represents a set of tokens (duplicates removed) if the flag return_set is True, and a bag of tokens (duplicates kept) otherwise.
Raises:	`TypeError` – If the input is not a string.

Examples

>>> from py_stringmatching.tokenizer.alphanumeric_tokenizer import AlphanumericTokenizer
>>> alnum_tok = AlphanumericTokenizer()
>>> alnum_tok.tokenize('data9,(science), data9#.(integration).88')
['data9', 'science', 'data9', 'integration', '88']
>>> alnum_tok.tokenize('#.&')
[]
>>> alnum_tok = AlphanumericTokenizer(return_set=True)
>>> alnum_tok.tokenize('data9,(science), data9#.(integration).88')
['data9', 'science', 'integration', '88']