Whitespace Tokenizer

class py_stringmatching.tokenizer.whitespace_tokenizer.WhitespaceTokenizer(return_set=False)[source]

Segments the input string on whitespace and returns the segments as tokens.

This implementation currently uses Python's built-in split function, so "whitespace" here covers the space character as well as the tab and newline characters.
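As a quick illustration of the note above, the behavior can be reproduced with Python's built-in split function alone: with no arguments, split treats any run of spaces, tabs, and newlines as a single delimiter and discards empty segments.

>>> # Plain str.split() with no arguments, as described above.
>>> 'data\tscience\nintegration   cleaning'.split()
['data', 'science', 'integration', 'cleaning']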

Parameters

return_set (boolean) – A flag to indicate whether to return a set of tokens instead of a bag of tokens (defaults to False).

return_set

An attribute to store the flag return_set.

Type

boolean
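To make the bag/set distinction concrete, here is a short sketch (the package-level import is an assumption based on the py_stringmatching tutorial; the Examples section below confirms the deduplication behavior):

>>> from py_stringmatching import WhitespaceTokenizer  # import path assumed
>>> WhitespaceTokenizer().tokenize('data science data')  # bag: duplicates kept
['data', 'science', 'data']
>>> WhitespaceTokenizer(return_set=True).tokenize('data science data')  # set: deduplicated
['data', 'science']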

get_delim_set()

Gets the current set of delimiters.

Returns

A Python set containing the current delimiters.
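A hedged usage sketch follows; the exact contents of the returned set are an assumption here, inferred from the note above about space, tab, and newline characters.

>>> from py_stringmatching import WhitespaceTokenizer  # import path assumed
>>> ws_tok = WhitespaceTokenizer()
>>> sorted(ws_tok.get_delim_set())  # contents assumed; sorted for a stable display
['\t', '\n', ' ']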

get_return_set()

Gets the value of the return_set flag.

Returns

The boolean value of the return_set flag.

set_delim_set(delim_set)[source]

Sets the current set of delimiters.

Parameters

delim_set (set) – A set of delimiter strings.

set_return_set(return_set)

Sets the value of the return_set flag.

Parameters

return_set (boolean) – A flag to indicate whether to return a set of tokens instead of a bag of tokens.
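A short sketch of the getter/setter pair, switching an existing tokenizer from bag to set semantics (the setter's own return value is not relied on here):

>>> from py_stringmatching import WhitespaceTokenizer  # import path assumed
>>> ws_tok = WhitespaceTokenizer()
>>> ws_tok.get_return_set()  # defaults to False (bag of tokens)
False
>>> _ = ws_tok.set_return_set(True)  # switch to set semantics
>>> ws_tok.get_return_set()
True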

tokenize(input_string)[source]

Tokenizes the input string based on whitespace.

Parameters

input_string (str) – The string to be tokenized.

Returns

A Python list, which represents a set of tokens if return_set is True, and a bag of tokens otherwise.

Raises

TypeError – If the input is not a string.
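The documented TypeError can be observed with a non-string input, as in this sketch:

>>> from py_stringmatching import WhitespaceTokenizer  # import path assumed
>>> ws_tok = WhitespaceTokenizer()
>>> try:
...     ws_tok.tokenize(42)  # non-string input, per the Raises note above
... except TypeError:
...     print('TypeError raised')
TypeError raised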

Examples

>>> from py_stringmatching import WhitespaceTokenizer
>>> ws_tok = WhitespaceTokenizer()
>>> ws_tok.tokenize('data science')
['data', 'science']
>>> ws_tok.tokenize('data        science')
['data', 'science']
>>> ws_tok.tokenize('data\tscience')
['data', 'science']
>>> ws_tok = WhitespaceTokenizer(return_set=True)
>>> ws_tok.tokenize('data   science data integration')
['data', 'science', 'integration']