Supported Tokenizers
py_entitymatching.tok_qgram(input_string, q)

This function splits the input string into a list of q-grams. Note that, by default, the input string is padded before it is tokenized.

Parameters:
- input_string (string) – Input string that should be tokenized.
- q (int) – q value that should be used to tokenize the input string.

Returns:
- A list of tokens if the input string is not NaN; otherwise NaN.

Examples

>>> import py_entitymatching as em
>>> em.tok_qgram('database', q=2)
['#d', 'da', 'at', 'ta', 'ab', 'ba', 'as', 'se', 'e$']
>>> em.tok_qgram('database', q=3)
['##d', '#da', 'dat', 'ata', 'tab', 'aba', 'bas', 'ase', 'se$', 'e$$']
>>> em.tok_qgram(None, q=2)
nan
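The examples above imply the padding scheme: q-1 '#' characters are prepended and q-1 '$' characters are appended before a sliding window of length q is taken. Below is a minimal sketch of that behavior; the name tok_qgram_sketch and the NaN check are illustrative assumptions, not the library's implementation.

import math

def tok_qgram_sketch(input_string, q):
    # Illustrative approximation of em.tok_qgram, not the library source.
    # Missing input maps to NaN, mirroring the documented behavior.
    if input_string is None or (isinstance(input_string, float) and math.isnan(input_string)):
        return float('nan')
    # Pad with q-1 '#' on the left and q-1 '$' on the right, then slide
    # a window of length q across the padded string.
    padded = '#' * (q - 1) + input_string + '$' * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

print(tok_qgram_sketch('database', 2))
# ['#d', 'da', 'at', 'ta', 'ab', 'ba', 'as', 'se', 'e$']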
py_entitymatching.tok_delim(input_string, d)

This function splits the input string into a list of tokens, using the given delimiter.

Parameters:
- input_string (string) – Input string that should be tokenized.
- d (string) – Delimiter string.

Returns:
- A list of tokens if the input string is not NaN; otherwise NaN.

Examples

>>> import py_entitymatching as em
>>> em.tok_delim('data science', ' ')
['data', 'science']
>>> em.tok_delim('data$#$science', '$#$')
['data', 'science']
>>> em.tok_delim(None, ' ')
nan
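Behaviorally, this resembles Python's str.split with an explicit delimiter, combined with the NaN convention described above. A minimal sketch under that assumption (tok_delim_sketch is a hypothetical name, not the library's code):

import math

def tok_delim_sketch(input_string, d):
    # Illustrative approximation of em.tok_delim, not the library source.
    if input_string is None or (isinstance(input_string, float) and math.isnan(input_string)):
        return float('nan')
    # Split on every occurrence of the delimiter string d.
    return input_string.split(d)

print(tok_delim_sketch('data$#$science', '$#$'))  # ['data', 'science']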
py_entitymatching.tok_wspace(input_string)

This function splits the input string into a list of tokens, using whitespace as the delimiter.

Parameters:
- input_string (string) – Input string that should be tokenized.

Returns:
- A list of tokens if the input string is not NaN; otherwise NaN.

Examples
>>> import py_entitymatching as em
>>> em.tok_wspace('data science')
['data', 'science']
>>> em.tok_wspace('data  science')
['data', 'science']
>>> em.tok_wspace(None)
nan
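Note that the second example contains repeated spaces yet yields the same tokens: str.split with no argument treats any run of whitespace as a single delimiter and produces no empty tokens, which matches the output shown. A minimal sketch (tok_wspace_sketch is a hypothetical name):

def tok_wspace_sketch(input_string):
    # Illustrative approximation of em.tok_wspace, not the library source.
    if input_string is None:
        return float('nan')
    # str.split() with no argument splits on runs of whitespace,
    # so repeated spaces do not yield empty tokens.
    return input_string.split()

print(tok_wspace_sketch('data  science'))  # ['data', 'science']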
py_entitymatching.tok_alphabetic(input_string)

This function returns a list of tokens that are maximal sequences of consecutive alphabetical characters.

Parameters:
- input_string (string) – Input string that should be tokenized.

Returns:
- A list of tokens if the input string is not NaN; otherwise NaN.

Examples

>>> import py_entitymatching as em
>>> em.tok_alphabetic('data99science, data#integration.')
['data', 'science', 'data', 'integration']
>>> em.tok_alphabetic('99')
[]
>>> em.tok_alphabetic(None)
nan
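A maximal run of alphabetical characters is exactly what the regular expression [a-zA-Z]+ matches, so the behavior in the examples can be sketched as follows (tok_alphabetic_sketch is a hypothetical name; the library may treat non-ASCII letters differently):

import re

def tok_alphabetic_sketch(input_string):
    # Illustrative approximation of em.tok_alphabetic, not the library source.
    if input_string is None:
        return float('nan')
    # Each match is a maximal run of consecutive letters.
    return re.findall(r'[a-zA-Z]+', input_string)

print(tok_alphabetic_sketch('data99science, data#integration.'))
# ['data', 'science', 'data', 'integration']
print(tok_alphabetic_sketch('99'))  # []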
py_entitymatching.tok_alphanumeric(input_string)

This function returns a list of tokens that are maximal sequences of consecutive alphanumeric characters.

Parameters:
- input_string (string) – Input string that should be tokenized.

Returns:
- A list of tokens if the input string is not NaN; otherwise NaN.

Examples

>>> import py_entitymatching as em
>>> em.tok_alphanumeric('data9,(science), data9#.(integration).88')
['data9', 'science', 'data9', 'integration', '88']
>>> em.tok_alphanumeric('#.$')
[]
>>> em.tok_alphanumeric(None)
nan
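Analogously, maximal alphanumeric runs correspond to the pattern [a-zA-Z0-9]+. A minimal sketch (tok_alphanumeric_sketch is a hypothetical name, not the library's code):

import re

def tok_alphanumeric_sketch(input_string):
    # Illustrative approximation of em.tok_alphanumeric, not the library source.
    if input_string is None:
        return float('nan')
    # Each match is a maximal run of consecutive letters or digits.
    return re.findall(r'[a-zA-Z0-9]+', input_string)

print(tok_alphanumeric_sketch('data9,(science), data9#.(integration).88'))
# ['data9', 'science', 'data9', 'integration', '88']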