Downsampling

py_entitymatching.down_sample(table_a, table_b, size, y_param, show_progress=True, verbose=False, seed=None)

This function down samples two tables A and B into smaller tables A’ and B’ respectively.

Specifically, it first randomly selects size tuples from table B to form table B’. Next, it builds an inverted index I of (token, tuple_id) pairs on table A. For each tuple x ∈ B’, the algorithm finds a set P of k/2 tuples from I that match x, and a set Q of k/2 tuples randomly selected from A - P, where k is given by y_param. Table A’ is the union of the sets P and Q collected over all tuples in B’. The idea is for A’ and B’ to share some matches yet be as representative of A and B as possible.
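The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: tables are modeled as dicts mapping tuple_id to a text string, "matching" is approximated by counting shared whitespace tokens, and an odd k is split so that P gets the extra tuple. The names down_sample_sketch and tokenize are hypothetical.

```python
import random
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; the library may tokenize differently)."""
    return set(text.lower().split())


def down_sample_sketch(table_a, table_b, size, y_param, seed=None):
    """Illustrative down sampling over dicts mapping tuple_id -> text."""
    rng = random.Random(seed)

    # Step 1: randomly select `size` tuples from B to form B'.
    b_ids = rng.sample(sorted(table_b), min(size, len(table_b)))
    sample_b = {b_id: table_b[b_id] for b_id in b_ids}

    # Step 2: build an inverted index token -> set of tuple ids over A.
    index = defaultdict(set)
    for a_id, text in table_a.items():
        for token in tokenize(text):
            index[token].add(a_id)

    # Assumed split of k = y_param into |P| and |Q| (P gets the extra one if k is odd).
    p_size = (y_param + 1) // 2
    q_size = y_param // 2

    selected = set()
    for text in sample_b.values():
        # Step 3a: P = tuples of A sharing the most tokens with x.
        overlap = Counter()
        for token in tokenize(text):
            overlap.update(index.get(token, ()))
        # Sort by descending overlap, then by id, so ties resolve deterministically.
        ranked = sorted(overlap.items(), key=lambda kv: (-kv[1], kv[0]))
        p = [a_id for a_id, _ in ranked[:p_size]]

        # Step 3b: Q = random tuples drawn from A - P.
        rest = sorted(set(table_a) - set(p))
        q = rng.sample(rest, min(q_size, len(rest)))

        selected.update(p)
        selected.update(q)

    # A' is the union of every P and Q collected above.
    sample_a = {a_id: table_a[a_id] for a_id in sorted(selected)}
    return sample_a, sample_b
```

Because the P and Q sets of different tuples can overlap, the sketch returns at most size * y_param tuples for A’, which is consistent with the "close to size * y_param" wording in the parameter description.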

Parameters:
    table_a, table_b (DataFrame) – The input tables A and B.
    size (int) – The size that table B should be down sampled to.
    y_param (int) – The parameter to control the down sample size of table A. Specifically, the down sampled size of table A should be close to size * y_param.
    show_progress (boolean) – A flag to indicate whether a progress bar should be displayed (defaults to True).
    verbose (boolean) – A flag to indicate whether the debug information should be displayed (defaults to False).
    seed (int) – The seed for the pseudo random number generator to select the tuples from A and B (defaults to None).
Returns:	Down sampled tables A and B as pandas DataFrames.
Raises:
    `AssertionError` – If any of the input tables (table_a, table_b) are empty or not a DataFrame.
    `AssertionError` – If size or y_param is empty or 0 or not a valid integer value.
    `AssertionError` – If seed is not a valid integer value.

Examples

>>> import py_entitymatching as em
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> sample_A, sample_B = em.down_sample(A, B, 500, 1)

# Example with seed = 0. This means the same sample data set will be returned
# each time this function is run.
>>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
>>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
>>> sample_A, sample_B = em.down_sample(A, B, 500, 1, seed=0)
