Downsampling

py_entitymatching.down_sample(table_a, table_b, size, y_param, show_progress=True, verbose=False, seed=None)

This function down samples two tables A and B into smaller tables A’ and B’ respectively.

Specifically, it first randomly selects size tuples from table B to form table B’. Next, it builds an inverted index I (token, tuple_id) on table A. For each tuple x ∈ B’, the algorithm finds a set P of k/2 tuples from I that match x, and a set Q of k/2 tuples randomly selected from A - P. Here k corresponds to y_param, which is why the size of A’ ends up close to size * y_param. Table A’ is made up of the tuples collected in P and Q over all x ∈ B’. The idea is for A’ and B’ to share some matches yet be as representative of A and B as possible.
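The sampling steps described above can be sketched in plain Python. This is an illustration only, not the library's implementation: the name down_sample_sketch, the whitespace tokenizer, the use of lists of strings in place of DataFrames, and the split of y_param into floor(y_param / 2) matching tuples plus the remainder as random tuples are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokens; the library's tokenizer may differ.
    return set(text.lower().split())


def down_sample_sketch(rows_a, rows_b, size, y_param, seed=None):
    """Hypothetical sketch; rows_a and rows_b are lists of strings, one per tuple."""
    rng = random.Random(seed)

    # Step 1: randomly select `size` tuples of B to form B'.
    b_ids = sorted(rng.sample(range(len(rows_b)), min(size, len(rows_b))))

    # Step 2: build an inverted index (token -> ids of tuples in A).
    index = defaultdict(set)
    for a_id, row in enumerate(rows_a):
        for token in tokenize(row):
            index[token].add(a_id)

    # Assumption: k = y_param, split into matching and random halves.
    n_match = y_param // 2
    n_random = y_param - n_match

    a_ids = set()
    for b_id in b_ids:
        # Count the tokens each tuple of A shares with x, using the index.
        overlap = Counter()
        for token in sorted(tokenize(rows_b[b_id])):
            for a_id in sorted(index.get(token, ())):
                overlap[a_id] += 1
        # P: the tuples of A that share the most tokens with x.
        p = [a_id for a_id, _ in overlap.most_common(n_match)]
        # Q: tuples drawn at random from A - P.
        rest = [a_id for a_id in range(len(rows_a)) if a_id not in p]
        q = rng.sample(rest, min(n_random, len(rest)))
        a_ids.update(p)
        a_ids.update(q)

    sample_a = [rows_a[i] for i in sorted(a_ids)]
    sample_b = [rows_b[i] for i in b_ids]
    return sample_a, sample_b
```

Because P and Q can overlap across different tuples of B’, the sketch returns at most size * y_param tuples for A’, and often somewhat fewer.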

Parameters:
  • table_a, table_b (DataFrame) – The input tables A and B.
  • size (int) – The size that table B should be down sampled to.
  • y_param (int) – The parameter to control the down sample size of table A. Specifically, the down sampled size of table A should be close to size * y_param.
  • show_progress (boolean) – A flag to indicate whether a progress bar should be displayed (defaults to True).
  • verbose (boolean) – A flag to indicate whether the debug information should be displayed (defaults to False).
  • seed (int) – The seed for the pseudo random number generator to select the tuples from A and B (defaults to None).
Returns:

Down sampled tables A and B as pandas DataFrames.

Raises:
  • AssertionError – If either input table (table_a, table_b) is empty or is not a DataFrame.
  • AssertionError – If size or y_param is empty or 0 or not a valid integer value.
  • AssertionError – If seed is not a valid integer value.
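The documented conditions on size, y_param, and seed can be checked before calling the function, so that bad arguments are reported with a clearer message than a bare AssertionError. The helper below is a hypothetical sketch, not part of py_entitymatching. It mirrors only the conditions listed above, and its treatment of booleans as invalid integers is an assumption.

```python
def precheck_down_sample_args(size, y_param, seed=None):
    """Hypothetical helper: return a list of problems (empty if none found)."""
    problems = []
    for name, value in (("size", size), ("y_param", y_param)):
        # bool is a subclass of int; rejecting it is a choice made in this sketch.
        if isinstance(value, bool) or not isinstance(value, int):
            problems.append(
                "%s must be an integer, got %s" % (name, type(value).__name__)
            )
        elif value == 0:
            problems.append("%s must not be 0" % name)
    # seed may be None (the documented default) or an integer.
    if seed is not None and (isinstance(seed, bool) or not isinstance(seed, int)):
        problems.append("seed must be an integer or None")
    return problems
```

An empty list means the arguments pass these checks. The tables themselves still need to be non-empty DataFrames.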

Examples

>>> import py_entitymatching as em
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> sample_A, sample_B = em.down_sample(A, B, 500, 1)

# Example with seed = 0. This means the same sample data set will be returned
# each time this function is run.
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> sample_A, sample_B = em.down_sample(A, B, 500, 1, seed=0)
