Delimiter Tokenizer

class py_stringmatching.tokenizer.delimiter_tokenizer.DelimiterTokenizer(delim_set={' '}, return_set=False)
Uses delimiters to find tokens, as opposed to using definitions.
Examples of delimiters include white space and punctuation. Examples of definitions include alphabetical and qgram tokens.
Parameters:
- delim_set (set) – A set of delimiter strings (defaults to a single space as the only delimiter).
- return_set (boolean) – A flag indicating whether to return a set of tokens instead of a bag of tokens (defaults to False).
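
To make the distinction between delimiter-based and definition-based tokenization concrete, here is a minimal sketch that uses only Python's standard `re` module, not py_stringmatching itself. The sample string and variable names are illustrative assumptions, and the q-gram part is an unpadded simplification that is not meant to mirror any particular tokenizer class in the library.

```python
import re

sample = "data,science 101"

# Delimiter-based: split wherever a comma or whitespace character occurs,
# then drop any empty strings left by adjacent delimiters.
delimiter_result = [tok for tok in re.split(r"[,\s]", sample) if tok]

# Definition-based (alphabetical): a token is a maximal run of ASCII letters,
# so the digits "101" do not form a token.
alphabetical_result = re.findall(r"[a-zA-Z]+", sample)

# Definition-based (q-gram, q = 3, no padding): a token is every
# substring of length 3 of the input word.
word = "data"
qgram_result = [word[i:i + 3] for i in range(len(word) - 2)]

print(delimiter_result)     # ['data', 'science', '101']
print(alphabetical_result)  # ['data', 'science']
print(qgram_result)         # ['dat', 'ata']
```

The delimiter-based result keeps `'101'` because it only cares about what separates tokens, whereas the alphabetical definition discards it because it only accepts runs of letters.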

return_set
boolean – An attribute to store the value of the flag return_set.

get_delim_set()
Gets the current set of delimiters.
Returns: A Python set which is the current set of delimiters.

get_return_set()
Gets the value of the return_set flag.
Returns: The boolean value of the return_set flag.

set_delim_set(delim_set)
Sets the current set of delimiters.
Parameters: delim_set (set) – A set of delimiter strings.

set_return_set(return_set)
Sets the value of the return_set flag.
Parameters: return_set (boolean) – A flag indicating whether to return a set of tokens instead of a bag of tokens.

tokenize(input_string)
Tokenizes the input string based on the set of delimiters.
Parameters: input_string (str) – The string to be tokenized.
Returns: A Python list of tokens. Duplicates are removed (a set) if the return_set flag is True and kept (a bag) if it is False.
Raises: TypeError – If the input is not a string.
Examples
>>> from py_stringmatching.tokenizer.delimiter_tokenizer import DelimiterTokenizer
>>> delim_tok = DelimiterTokenizer()
>>> delim_tok.tokenize('data science')
['data', 'science']
>>> delim_tok = DelimiterTokenizer(['$#$'])
>>> delim_tok.tokenize('data$#$science')
['data', 'science']
>>> delim_tok = DelimiterTokenizer([',', '.'])
>>> delim_tok.tokenize('data,science.data,integration.')
['data', 'science', 'data', 'integration']
>>> delim_tok = DelimiterTokenizer([',', '.'], return_set=True)
>>> delim_tok.tokenize('data,science.data,integration.')
['data', 'science', 'integration']
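
The behavior shown in the examples above can be approximated with a short standard-library sketch. This is not the library's implementation: `delimiter_tokenize` is a hypothetical helper, and two details are assumptions chosen to match the documented outputs, namely that empty tokens are dropped and that duplicates are removed in first-occurrence order when `return_set` is True.

```python
import re

def delimiter_tokenize(input_string, delim_set=(" ",), return_set=False):
    """Hypothetical stand-in for a delimiter tokenizer (illustration only)."""
    if not isinstance(input_string, str):
        raise TypeError("input_string must be a string")
    # Escape each delimiter so characters such as '.' or '$' match literally;
    # longer delimiters go first so they win over any shorter prefix.
    ordered = sorted(delim_set, key=len, reverse=True)
    pattern = "|".join(re.escape(d) for d in ordered)
    # Assumption: empty strings from adjacent or trailing delimiters are dropped.
    tokens = [tok for tok in re.split(pattern, input_string) if tok]
    if return_set:
        # Assumption: de-duplicate while keeping first-occurrence order.
        tokens = list(dict.fromkeys(tokens))
    return tokens

text = "data,science.data,integration."
bag = delimiter_tokenize(text, {",", "."})
unique = delimiter_tokenize(text, {",", "."}, return_set=True)
multi_char = delimiter_tokenize("data$#$science", {"$#$"})

print(bag)         # ['data', 'science', 'data', 'integration']
print(unique)      # ['data', 'science', 'integration']
print(multi_char)  # ['data', 'science']
```

Building one alternation pattern from the escaped delimiters lets single-character and multi-character delimiters be handled in the same pass.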