Supported Tokenizers
- py_entitymatching.tok_qgram(input_string, q)
This function splits the input string into a list of q-grams. Note that, by default, the input string is padded before it is tokenized.
- Parameters
input_string (string) – Input string that should be tokenized.
q (int) – q value that should be used to tokenize the input string.
- Returns
A list of tokens, if the input string is not NaN, else returns NaN.
Examples
>>> import py_entitymatching as em
>>> em.tok_qgram('database', q=2)
['#d', 'da', 'at', 'ta', 'ab', 'ba', 'as', 'se', 'e$']
>>> em.tok_qgram('database', q=3)
['##d', '#da', 'dat', 'ata', 'tab', 'aba', 'bas', 'ase', 'se$', 'e$$']
>>> em.tok_qgram(None, q=2)
nan
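For intuition, the padding behavior shown above can be reproduced in a few lines of plain Python. This is a minimal sketch, not the library's implementation; it assumes the default pad characters '#' (prefix) and '$' (suffix) seen in the examples.

def qgram_sketch(input_string, q):
    # Pad with q-1 copies of the pad character on each side, then
    # slide a window of width q over the padded string.
    padded = '#' * (q - 1) + input_string + '$' * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

>>> qgram_sketch('database', 2)
['#d', 'da', 'at', 'ta', 'ab', 'ba', 'as', 'se', 'e$']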
- py_entitymatching.tok_delim(input_string, d)
This function splits the input string into a list of tokens based on the given delimiter string.
- Parameters
input_string (string) – Input string that should be tokenized.
d (string) – Delimiter string.
- Returns
A list of tokens, if the input string is not NaN, else returns NaN.
Examples
>>> import py_entitymatching as em
>>> em.tok_delim('data science', ' ')
['data', 'science']
>>> em.tok_delim('data$#$science', '$#$')
['data', 'science']
>>> em.tok_delim(None, ' ')
nan
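For a single literal delimiter, the behavior above is close to Python's built-in str.split. A rough sketch (an approximation, not the library's implementation; it also drops empty tokens):

def delim_sketch(input_string, d):
    # Split on the literal delimiter string and discard empty tokens.
    return [tok for tok in input_string.split(d) if tok]

>>> delim_sketch('data$#$science', '$#$')
['data', 'science']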
- py_entitymatching.tok_wspace(input_string)
This function splits the input string into a list of tokens based on whitespace.
- Parameters
input_string (string) – Input string that should be tokenized.
- Returns
A list of tokens, if the input string is not NaN, else returns NaN.
Examples
>>> import py_entitymatching as em
>>> em.tok_wspace('data science')
['data', 'science']
>>> em.tok_wspace('data   science')
['data', 'science']
>>> em.tok_wspace(None)
nan
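The whitespace tokenizer behaves much like Python's str.split() with no arguments, which splits on any run of whitespace. A minimal sketch, not the library's implementation:

def wspace_sketch(input_string):
    # str.split() with no delimiter splits on runs of whitespace
    # and never produces empty tokens.
    return input_string.split()

>>> wspace_sketch('data   science')
['data', 'science']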
- py_entitymatching.tok_alphabetic(input_string)
This function returns a list of tokens that are maximal sequences of consecutive alphabetical characters.
- Parameters
input_string (string) – Input string that should be tokenized.
- Returns
A list of tokens, if the input string is not NaN, else returns NaN.
Examples
>>> import py_entitymatching as em
>>> em.tok_alphabetic('data99science, data#integration.')
['data', 'science', 'data', 'integration']
>>> em.tok_alphabetic('99')
[]
>>> em.tok_alphabetic(None)
nan
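Maximal runs of alphabetical characters can be extracted with a regular expression. A minimal sketch assuming ASCII letters (an approximation, not the library's implementation):

import re

def alphabetic_sketch(input_string):
    # Each token is one or more consecutive ASCII letters.
    return re.findall('[a-zA-Z]+', input_string)

>>> alphabetic_sketch('data99science, data#integration.')
['data', 'science', 'data', 'integration']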
- py_entitymatching.tok_alphanumeric(input_string)
This function returns a list of tokens that are maximal sequences of consecutive alphanumeric characters.
- Parameters
input_string (string) – Input string that should be tokenized.
- Returns
A list of tokens, if the input string is not NaN, else returns NaN.
Examples
>>> import py_entitymatching as em
>>> em.tok_alphanumeric('data9,(science), data9#.(integration).88')
['data9', 'science', 'data9', 'integration', '88']
>>> em.tok_alphanumeric('#.$')
[]
>>> em.tok_alphanumeric(None)
nan
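Similarly, maximal alphanumeric runs can be sketched with a regular expression (again assuming ASCII, and not the library's implementation):

import re

def alphanumeric_sketch(input_string):
    # Each token is one or more consecutive ASCII letters or digits.
    return re.findall('[a-zA-Z0-9]+', input_string)

>>> alphanumeric_sketch('data9,(science), data9#.(integration).88')
['data9', 'science', 'data9', 'integration', '88']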