Whitespace Tokenizer
class py_stringmatching.tokenizer.whitespace_tokenizer.WhitespaceTokenizer(return_set=False)
Segments the input string using whitespace, then returns the segments as tokens. The implementation currently uses Python's built-in split function, so whitespace here covers the space character as well as tabs and newlines (a short sketch of this equivalence follows the attribute below).

Parameters:
    return_set (boolean) – A flag to indicate whether to return a set of tokens instead of a bag of tokens (defaults to False).
return_set

boolean – An attribute to store the flag return_set.
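Because tokenization delegates to Python's built-in split, the behavior on runs of mixed whitespace can be previewed with split alone. A minimal sketch (the import path assumes the package's top-level export; the sample strings are illustrative, not taken from the library's docs):

    >>> # str.split() with no arguments splits on any run of whitespace
    >>> 'data\tscience\n analytics'.split()
    ['data', 'science', 'analytics']
    >>> from py_stringmatching import WhitespaceTokenizer
    >>> # the tokenizer is expected to mirror str.split() on the same input
    >>> WhitespaceTokenizer().tokenize('data\tscience\n analytics')
    ['data', 'science', 'analytics']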
get_delim_set()

Gets the current set of delimiters.

Returns:
    A Python set which is the current set of delimiters.
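Based on the class description above, the delimiter set should hold the whitespace characters the tokenizer splits on. A hedged sketch (the exact contents are an assumption derived from that description, not confirmed by the library's docs):

    >>> ws_tok = WhitespaceTokenizer()
    >>> # assumed delimiters: space, tab, and newline
    >>> ws_tok.get_delim_set() == {' ', '\t', '\n'}
    True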
get_return_set()

Gets the value of the return_set flag.

Returns:
    The boolean value of the return_set flag.
set_return_set(return_set)

Sets the value of the return_set flag.

Parameters:
    return_set (boolean) – A flag to indicate whether to return a set of tokens instead of a bag of tokens.
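A short usage sketch tying the getter and setter together (the setter's return value, if any, is discarded so the snippet does not depend on it):

    >>> ws_tok = WhitespaceTokenizer()
    >>> ws_tok.get_return_set()
    False
    >>> _ = ws_tok.set_return_set(True)  # switch from bag to set semantics
    >>> ws_tok.tokenize('data science data')
    ['data', 'science']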
tokenize(input_string)
Tokenizes the input string based on whitespace.

Parameters:
    input_string (str) – The string to be tokenized.

Returns:
    A Python list, which is a set or a bag of tokens, depending on whether return_set is True or False.

Raises:
    TypeError – If the input is not a string.

Examples:
    >>> ws_tok = WhitespaceTokenizer()
    >>> ws_tok.tokenize('data science')
    ['data', 'science']
    >>> ws_tok.tokenize('data\tscience')
    ['data', 'science']
    >>> ws_tok = WhitespaceTokenizer(return_set=True)
    >>> ws_tok.tokenize('data science data integration')
    ['data', 'science', 'integration']
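As noted under Raises, non-string input is rejected. A sketch of the failure mode (the exact exception message is not specified in these docs, hence the ellipses):

    >>> ws_tok = WhitespaceTokenizer()
    >>> ws_tok.tokenize(42)
    Traceback (most recent call last):
        ...
    TypeError: ...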