Tuners for the Dask-based Commands

Downsampling

py_entitymatching.tuner.tuner_down_sample.tuner_down_sample(ltable, rtable, size, y_param, seed, rem_stop_words, rem_puncs, n_bins=50, sample_proportion=0.1, repeat=1)

WARNING THIS COMMAND IS EXPERIMENTAL AND NOT TESTED. USE AT YOUR OWN RISK.

Tunes the parameters for the down sample command implemented using Dask.

Given the input tables and the parameters for the Dask-based down sample command, this command returns a configuration that includes whether the input tables need to be swapped, the number of left table chunks, and the number of right table chunks. It uses a “Staged Tuning” approach to select the configuration. The key idea of this approach is to select the value of one parameter at a time.

Conceptually, this command performs the following steps. First, it samples the left table and the down sampled right table using stratified sampling. Next, it uses the sampled tables to decide whether they need to be swapped (by running the down sample command and comparing the runtimes). Next, it finds the number of left table partitions using the sampled tables (by trying a fixed set of partition counts and comparing the runtimes). The number of partitions selected is the last value tried before the runtime starts increasing. Then it finds the number of right table partitions in the same way; while doing this, the number of left table partitions is set to the value found in the previous step. Finally, it returns the configuration to the user as a triplet (x, y, z), where x indicates whether the tables need to be swapped, y indicates the number of left table partitions (if the tables need to be swapped, this is the number of left table partitions after swapping), and z indicates the number of down sampled right table partitions.
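The partition-selection steps above can be sketched in a few lines. This is a minimal illustration of staged tuning, not the library's implementation: the helper names (pick_partition_count, staged_tuning), the candidate list, the order in which the two counts are tuned, the choice of holding the second count at 1 during the first stage, and the synthetic cost function are all assumptions made for the example.

```python
from typing import Callable, Sequence, Tuple


def pick_partition_count(candidates: Sequence[int],
                         runtime_of: Callable[[int], float]) -> int:
    """Return the last candidate tried before the runtime starts increasing."""
    best = candidates[0]
    best_time = runtime_of(best)
    for n in candidates[1:]:
        t = runtime_of(n)
        if t > best_time:  # runtime started increasing: stop searching
            break
        best, best_time = n, t
    return best


def staged_tuning(candidates: Sequence[int],
                  runtime_of_pair: Callable[[int, int], float]) -> Tuple[int, int]:
    """Tune one partition count at a time (hypothetical helper).

    Stage 1 tunes the first count with the second held at 1 (an assumption).
    Stage 2 tunes the second count with the first fixed at its tuned value.
    """
    n_first = pick_partition_count(candidates, lambda n: runtime_of_pair(n, 1))
    n_second = pick_partition_count(candidates,
                                    lambda n: runtime_of_pair(n_first, n))
    return n_first, n_second


def toy_runtime(n_left: int, n_right: int) -> float:
    """Synthetic cost model, invented for illustration only.

    More partitions reduce the work per partition but add scheduling overhead.
    """
    return 100.0 / n_left + 2.0 * n_left + 40.0 / n_right + 3.0 * n_right


config = staged_tuning([1, 2, 4, 8, 16], toy_runtime)
print(config)  # (8, 4)
```

In a real run, the cost function would time the Dask command on the sampled tables instead of evaluating a formula; the search logic stays the same.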

Parameters
  • ltable (DataFrame) – The left input table, i.e., table A.

  • rtable (DataFrame) – The right input table, i.e., table B.

  • size (int) – The size that table B should be down sampled to.

  • y_param (int) – The parameter to control the down sample size of table A. Specifically, the down sampled size of table A should be close to size * y_param.

  • seed (int) – The seed for the pseudo random number generator to select the tuples from A and B (defaults to None).

  • rem_stop_words (boolean) – A flag to indicate whether a default set of stop words must be removed.

  • rem_puncs (boolean) – A flag to indicate whether punctuation must be removed from the strings.

  • n_bins (int) – The number of bins to be used for stratified sampling.

  • sample_proportion (float) – The proportion used to sample the tables. This value is expected to be greater than 0 and less than 1.

  • repeat (int) – The number of times to execute the down sample command while selecting the values for the parameters.
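
To make the roles of n_bins and sample_proportion concrete, here is a rough sketch of stratified sampling using only the standard library. It is not the library's sampling code: the function name stratified_sample, the use of equal-width bins, and the choice of string length as the binning key are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, n_bins=50, sample_proportion=0.1, seed=0):
    """Sample about `sample_proportion` of the rows from each equal-width bin.

    `key` maps a row to the numeric value used for binning (an assumption;
    the attribute the library stratifies on is not stated in this page).
    """
    if not 0 < sample_proportion < 1:
        raise ValueError("sample_proportion must be greater than 0 and less than 1")
    if not rows:
        return []
    values = [key(row) for row in rows]
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width when all values match
    bins = defaultdict(list)
    for row, value in zip(rows, values):
        idx = min(int((value - lo) / width), n_bins - 1)  # clamp the maximum
        bins[idx].append(row)
    rng = random.Random(seed)  # seeded, so the sample is reproducible
    sample = []
    for idx in sorted(bins):
        members = bins[idx]
        k = max(1, round(len(members) * sample_proportion))
        sample.extend(rng.sample(members, k))
    return sample


# 200 toy strings whose lengths cycle through 1..20
rows = ["x" * (i % 20 + 1) for i in range(200)]
sample = stratified_sample(rows, key=len, n_bins=5, sample_proportion=0.1, seed=42)
print(len(sample))  # 20
```

Sampling within bins keeps short and long values represented in the sample in roughly the same ratio as in the full table, which is why a small sample_proportion can still give representative runtimes.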

Returns

A tuple containing 3 values. For example, if the tuple is represented as (x, y, z), then x indicates whether the tables need to be swapped, y indicates the number of left table partitions (if the tables need to be swapped, this is the number of left table partitions after swapping), and z indicates the number of down sampled right table partitions.
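
The following sketch shows one way to read the returned triplet before calling the down sample command. The helper apply_tuned_config is hypothetical and not part of py_entitymatching; it only illustrates that the partition counts refer to the tables after any swap.

```python
def apply_tuned_config(ltable, rtable, config):
    """Unpack (x, y, z) and swap the tables when x is true (hypothetical helper)."""
    swap, n_left_chunks, n_right_chunks = config
    if swap:
        # The chunk counts describe the tables in their swapped roles.
        ltable, rtable = rtable, ltable
    return ltable, rtable, n_left_chunks, n_right_chunks


# Strings stand in for the two DataFrames.
result = apply_tuned_config("table A", "table B", (True, 4, 2))
print(result)  # ('table B', 'table A', 4, 2)
```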

Examples

>>> from py_entitymatching.tuner.tuner_down_sample import tuner_down_sample
>>> (swap_or_not, n_ltable_chunks, n_sample_rtable_chunks) = tuner_down_sample(ltable, rtable, size, y_param, seed, rem_stop_words, rem_puncs)

Overlap Blocker

py_entitymatching.tuner.tuner_overlap_blocker.tuner_overlap_blocker(ltable, rtable, l_key, r_key, l_overlap_attr, r_overlap_attr, rem_stop_words, q_val, word_level, overlap_size, ob_obj, n_bins=50, sample_proportion=0.1, seed=0, repeat=1)

WARNING THIS COMMAND IS EXPERIMENTAL AND NOT TESTED. USE AT YOUR OWN RISK.

Tunes the parameters for the command that blocks two tables, implemented using Dask.

Given the input tables and the parameters for the Dask-based overlap blocker command, this command returns a configuration that includes whether the input tables need to be swapped, the number of left table chunks, and the number of right table chunks. It uses a “Staged Tuning” approach to select the configuration. The key idea of this approach is to select the value of one parameter at a time.

Conceptually, this command performs the following steps. First, it samples the left table and the right table using stratified sampling. Next, it uses the sampled tables to decide whether they need to be swapped (by running the overlap blocker command and comparing the runtimes). Next, it finds the number of left table partitions using the sampled tables (by trying a fixed set of partition counts and comparing the runtimes). The number of partitions selected is the last value tried before the runtime starts increasing. Then it finds the number of right table partitions in the same way; while doing this, the number of left table partitions is set to the value found in the previous step. Finally, it returns the configuration to the user as a triplet (x, y, z), where x indicates whether the tables need to be swapped, y indicates the number of left table partitions (if the tables need to be swapped, this is the number of left table partitions after swapping), and z indicates the number of right table partitions.
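The swap decision described above comes down to timing the command with the tables in both orders. The sketch below illustrates that comparison under stated assumptions: the helper names (mean_runtime, should_swap), averaging over repeat runs, and the simulated clock with its invented cost model are not taken from the library.

```python
import time


def mean_runtime(command, args, repeat=1, clock=time.perf_counter):
    """Average elapsed time of command(*args) over `repeat` runs."""
    total = 0.0
    for _ in range(repeat):
        start = clock()
        command(*args)
        total += clock() - start
    return total / repeat


def should_swap(command, ltable, rtable, repeat=1, clock=time.perf_counter):
    """Return True when running with the tables swapped is faster."""
    original = mean_runtime(command, (ltable, rtable), repeat, clock)
    swapped = mean_runtime(command, (rtable, ltable), repeat, clock)
    return swapped < original


class SimulatedClock:
    """Deterministic clock for the example; the toy command advances it."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


sim_clock = SimulatedClock()


def toy_block_command(left_rows, right_rows):
    # Invented cost model: each left row costs twice as much as a right row.
    sim_clock.now += 2.0 * len(left_rows) + len(right_rows)


big, small = list(range(1000)), list(range(10))
swap = should_swap(toy_block_command, big, small, repeat=3, clock=sim_clock)
print(swap)  # True
```

With the default clock, the same functions measure real elapsed time, so a larger repeat value smooths out run-to-run noise at the cost of a longer tuning phase.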

Parameters
  • ltable (DataFrame) – The left input table.

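  • rtable (DataFrame) – The right input table.

  • l_key (string) – The key attribute of the left table.

  • r_key (string) – The key attribute of the right table.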

  • l_overlap_attr (string) – The overlap attribute in left table.

  • r_overlap_attr (string) – The overlap attribute in right table.

  • rem_stop_words (boolean) – A flag to indicate whether stop words (e.g., a, an, the) should be removed from the token sets of the overlap attribute values (defaults to False).

  • q_val (int) – The value of q to use if the overlap attribute values are to be tokenized as qgrams (defaults to None).

  • word_level (boolean) – A flag to indicate whether the overlap attributes should be tokenized as words (i.e., using whitespace as delimiter) (defaults to True).

  • overlap_size (int) – The minimum number of tokens that must overlap.

  • ob_obj (OverlapBlocker) – The object used to call commands to block two tables and a candidate set.

  • n_bins (int) – The number of bins to be used for stratified sampling.

  • sample_proportion (float) – The proportion used to sample the tables. This value is expected to be greater than 0 and less than 1.

  • repeat (int) – The number of times to execute the overlap blocker command while selecting the values for the parameters.
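
The tokenization parameters above (rem_stop_words, q_val, word_level, overlap_size) interact as in the following sketch of an overlap test for a single tuple pair. It is an illustration only, not the blocker's code: the function names, the three-word stop word set, lowercasing, and unpadded qgrams are assumptions.

```python
STOP_WORDS = {"a", "an", "the"}  # illustrative subset, not the library's list


def tokenize(value, word_level=True, q_val=None, rem_stop_words=False):
    """Turn an attribute value into a set of word or qgram tokens."""
    text = value.lower()  # lowercasing is an assumption of this sketch
    if rem_stop_words:
        text = " ".join(w for w in text.split() if w not in STOP_WORDS)
    if word_level:
        return set(text.split())  # whitespace-delimited words
    if q_val is None:
        raise ValueError("q_val is required when word_level is False")
    # Unpadded qgrams: every substring of length q_val
    return {text[i:i + q_val] for i in range(len(text) - q_val + 1)}


def survives_blocking(l_value, r_value, overlap_size=1, **tokenizer_args):
    """Keep the pair when at least `overlap_size` tokens are shared."""
    shared = (tokenize(l_value, **tokenizer_args)
              & tokenize(r_value, **tokenizer_args))
    return len(shared) >= overlap_size


print(survives_blocking("The Matrix Reloaded", "Matrix Revolutions",
                        overlap_size=1, rem_stop_words=True))  # True
print(survives_blocking("The Matrix Reloaded", "Matrix Revolutions",
                        overlap_size=2, rem_stop_words=True))  # False
```

A larger overlap_size prunes more pairs and so changes the runtime the tuner measures, which is why these values are passed to the tuner along with the tables.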

Returns

A tuple containing 3 values. For example, if the tuple is represented as (x, y, z), then x indicates whether the tables need to be swapped, y indicates the number of left table partitions (if the tables need to be swapped, this is the number of left table partitions after swapping), and z indicates the number of right table partitions.

Examples

>>> from py_entitymatching.tuner.tuner_overlap_blocker import tuner_overlap_blocker
>>> from py_entitymatching.dask.dask_overlap_blocker import DaskOverlapBlocker
>>> obj = DaskOverlapBlocker()
>>> (swap_or_not, n_ltable_chunks, n_rtable_chunks) = tuner_overlap_blocker(ltable, rtable, 'id', 'id', 'title', 'title', rem_stop_words=True, q_val=None, word_level=True, overlap_size=1, ob_obj=obj)