Downsampling

py_entitymatching.down_sample(table_a, table_b, size, y_param, show_progress=True, verbose=False, seed=None, rem_stop_words=True, rem_puncs=True, n_jobs=1)

This function down samples two tables A and B into smaller tables A’ and B’ respectively.

Specifically, it first randomly selects size tuples from table B to form table B’. Next, it builds an inverted index I (token, tuple_id) on table A. For each tuple x ∈ B’, the algorithm finds a set P of k/2 tuples from I that match x, and a set Q of k/2 tuples randomly selected from A - P. Here k appears to correspond to the y_param argument, and the tuples collected in P and Q across all x make up A’. The idea is for A’ and B’ to share some matches yet be as representative of A and B as possible.
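The steps above can be sketched in plain Python. This is an illustrative sketch, not the library's implementation: tables are lists of dicts rather than DataFrames, the names tokenize and down_sample_sketch are hypothetical, "match" is assumed to mean "shares the most word tokens", and an odd k is assumed to be split as k // 2 matching tuples plus the remainder chosen at random.

```python
import random
import re
from collections import defaultdict


def tokenize(record):
    """Lowercase every field of a record and return its set of word tokens."""
    text = " ".join(str(value) for value in record.values())
    return set(re.findall(r"\w+", text.lower()))


def down_sample_sketch(table_a, table_b, size, k, seed=None):
    """Hypothetical helper: tables are lists of dicts, not DataFrames."""
    rng = random.Random(seed)

    # Step 1: randomly select `size` tuples from B to form B'.
    sample_b = rng.sample(table_b, min(size, len(table_b)))

    # Step 2: build the inverted index I: token -> positions of tuples in A.
    index = defaultdict(set)
    for pos, record in enumerate(table_a):
        for token in tokenize(record):
            index[token].add(pos)

    # Assumption: an odd k is split into k // 2 matches and the rest random.
    n_match = k // 2
    n_random = k - n_match

    chosen = set()
    for x in sample_b:
        # Count how many tokens each tuple of A shares with x.
        overlap = defaultdict(int)
        for token in tokenize(x):
            for pos in index.get(token, ()):
                overlap[pos] += 1
        # P: up to k/2 tuples of A that share the most tokens with x.
        p = sorted(overlap, key=lambda pos: (-overlap[pos], pos))[:n_match]
        p_set = set(p)
        # Q: tuples drawn at random from A - P.
        rest = [pos for pos in range(len(table_a)) if pos not in p_set]
        q = rng.sample(rest, min(n_random, len(rest)))
        chosen.update(p_set)
        chosen.update(q)

    # A' is every tuple of A picked for at least one x, in original order.
    sample_a = [table_a[pos] for pos in sorted(chosen)]
    return sample_a, sample_b
```

Because each tuple of B’ contributes at most k tuples of A, the sketch returns at most size * k tuples for A’, and fewer when the same tuple of A is picked for several x.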

Parameters
  • table_a (DataFrame) – The input table A.

  • table_b (DataFrame) – The input table B.

  • size (int) – The size that table B should be down sampled to.

  • y_param (int) – The parameter to control the down sample size of table A. Specifically, the down sampled size of table A should be close to size * y_param.

  • show_progress (boolean) – A flag to indicate whether a progress bar should be displayed (defaults to True).

  • verbose (boolean) – A flag to indicate whether the debug information should be displayed (defaults to False).

  • seed (int) – The seed for the pseudo random number generator to select the tuples from A and B (defaults to None).

  • rem_stop_words (boolean) – A flag to indicate whether a default set of stop words must be removed (defaults to True).

  • rem_puncs (boolean) – A flag to indicate whether punctuation must be removed from the strings (defaults to True).

  • n_jobs (int) – The number of parallel jobs to be used for computation (defaults to 1). If -1 all CPUs are used. If 0 or 1, no parallel computation is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus, for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) is less than 1, then no parallel computation is used (i.e., equivalent to the default).
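The n_jobs rule described above can be written out as a small function. This is a minimal sketch of the rule as stated, not the library's own code: the name effective_n_jobs and the n_cpus override are hypothetical, and treating values greater than 1 as "use exactly that many jobs" is an assumption, since the description does not spell that case out.

```python
import multiprocessing


def effective_n_jobs(n_jobs, n_cpus=None):
    """Resolve an n_jobs setting to a concrete number of parallel jobs."""
    if n_cpus is None:
        n_cpus = multiprocessing.cpu_count()

    # 0 or 1: no parallel computation.
    if n_jobs in (0, 1):
        return 1

    # Assumption: a value above 1 requests exactly that many jobs.
    if n_jobs > 1:
        return n_jobs

    # Negative values: n_cpus + 1 + n_jobs, so -1 means all CPUs
    # and -2 means all CPUs but one.
    resolved = n_cpus + 1 + n_jobs

    # If the result drops below 1, fall back to no parallel computation.
    return resolved if resolved >= 1 else 1
```

For example, on a machine with 8 CPUs, effective_n_jobs(-2, n_cpus=8) resolves to 7 jobs, while effective_n_jobs(-20, n_cpus=8) falls back to 1.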

Returns

Down sampled tables A and B as pandas DataFrames.

Raises
  • AssertionError – If either input table (table_a, table_b) is empty or is not a DataFrame.

  • AssertionError – If size or y_param is empty or 0 or not a valid integer value.

  • AssertionError – If seed is not a valid integer value.

  • AssertionError – If verbose is not of type bool.

  • AssertionError – If show_progress is not of type bool.

  • AssertionError – If n_jobs is not of type int.

Examples

>>> import py_entitymatching as em
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> sample_A, sample_B = em.down_sample(A, B, 500, 1, n_jobs=-1)

# Example with seed = 0. This means the same sample data set will be returned
# each time this function is run.
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> sample_A, sample_B = em.down_sample(A, B, 500, 1, seed=0, n_jobs=-1)