Alphabetic Tokenizer

class py_stringmatching.tokenizer.alphabetic_tokenizer.AlphabeticTokenizer(return_set=False)

Returns tokens that are maximal sequences of consecutive alphabetical characters.

Parameters: return_set (boolean) – A flag indicating whether to return a set of tokens (no duplicates) instead of a bag of tokens (duplicates kept). Defaults to False.
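
The idea of "maximal sequences of consecutive alphabetical characters" can be illustrated with a regular expression. This is a minimal sketch, not the library's code: the helper name alphabetic_runs is hypothetical, and it assumes that "alphabetical" means the ASCII letters a-z and A-Z.

```python
import re

# Assumption: "alphabetical" is taken to mean ASCII letters only.
_ALPHA_RUN = re.compile(r"[a-zA-Z]+")


def alphabetic_runs(text):
    """Hypothetical helper: return each maximal run of letters, in order.

    The greedy '+' makes every match extend until the next non-letter,
    so each match is a maximal run. Duplicates are kept (a bag).
    """
    return _ALPHA_RUN.findall(text)


print(alphabetic_runs("data99science, data#integration."))
# ['data', 'science', 'data', 'integration']
print(alphabetic_runs("99"))
# []
```

Digits, punctuation, and whitespace all act as separators here, which is why 'data99science' splits into two tokens.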
return_set

An attribute that stores the value of the return_set flag.

Type: boolean
get_return_set()

Gets the value of the return_set flag.

Returns: The boolean value of the return_set flag.
set_return_set(return_set)

Sets the value of the return_set flag.

Parameters: return_set (boolean) – A flag indicating whether to return a set of tokens instead of a bag of tokens.
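
The getter and setter make it possible to switch an existing tokenizer between bag and set output. The sketch below uses a hypothetical stand-in class, AlphabeticTokenizerSketch, rather than the library class, and assumes that the flag is read each time tokenize is called and that set output keeps first-occurrence order.

```python
import re


class AlphabeticTokenizerSketch:
    """Hypothetical stand-in for illustration; not the library class."""

    def __init__(self, return_set=False):
        self.return_set = return_set

    def get_return_set(self):
        return self.return_set

    def set_return_set(self, return_set):
        self.return_set = return_set

    def tokenize(self, input_string):
        # Assumption: ASCII letters only.
        tokens = re.findall(r"[a-zA-Z]+", input_string)
        if self.return_set:
            # Assumption: drop duplicates, keeping first-occurrence order.
            tokens = list(dict.fromkeys(tokens))
        return tokens


tok = AlphabeticTokenizerSketch()
bag = tok.tokenize("data99science, data#integration.")
tok.set_return_set(True)  # same instance, now in set mode
unique = tok.tokenize("data99science, data#integration.")

print(bag)     # ['data', 'science', 'data', 'integration']
print(unique)  # ['data', 'science', 'integration']
```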
tokenize(input_string)

Tokenizes the input string into alphabetical tokens.

Parameters: input_string (str) – The string to be tokenized.
Returns: A Python list, which represents a set of tokens if the flag return_set is True, and a bag of tokens otherwise.
Raises: TypeError – If the input is not a string.

Examples

>>> from py_stringmatching.tokenizer.alphabetic_tokenizer import AlphabeticTokenizer
>>> al_tok = AlphabeticTokenizer()
>>> al_tok.tokenize('data99science, data#integration.')
['data', 'science', 'data', 'integration']
>>> al_tok.tokenize('99')
[]
>>> al_tok = AlphabeticTokenizer(return_set=True)
>>> al_tok.tokenize('data99science, data#integration.')
['data', 'science', 'integration']