Down Sampling

Once the tables to be matched have been read in, they should be down sampled if the number of tuples in them is large (for example, 100K+ tuples). This is because working with large tables can be very time consuming, as every subsequent operation has to process them in full.

Random sampling, however, does not work: the sampled tables may end up sharing very few matches, especially if the number of matches between the input tables is small to begin with.
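To make this concrete, here is a small back-of-the-envelope simulation. The table sizes, match count, and sample size below are illustrative assumptions, not numbers from py_entitymatching: if each table has 100K tuples and you randomly keep 500 from each, a given matching pair survives in both samples with probability (500/100000)^2 = 2.5e-5, so even 1,000 true matches yield about 0.025 surviving pairs on average.

import numpy as np

# Illustrative sizes: 100K tuples per table, 1,000 true matching pairs,
# and a random sample of 500 tuples from each side.
n_a, n_b, n_matches, k = 100_000, 100_000, 1_000, 500

rng = np.random.default_rng(0)
sample_a = set(rng.choice(n_a, size=k, replace=False).tolist())
sample_b = set(rng.choice(n_b, size=k, replace=False).tolist())

# Suppose tuple i of A matches tuple i of B for i < n_matches.
surviving = sum(1 for i in range(n_matches)
                if i in sample_a and i in sample_b)
print(surviving)  # almost always 0; expected value is 1000 * (500/100000)**2 = 0.025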

In py_entitymatching, you can sample the input tables using the down_sample command. This command samples the input tables intelligently, in a way that ensures a reasonable number of matches between them.

If A and B are the input tables, then you can use the down_sample command as shown below:

>>> sample_A, sample_B = em.down_sample(A, B, size=500, y_param=1)
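For context, the snippet below sketches how this call fits into a typical workflow. The file names (tableA.csv, tableB.csv) and the key column (ID) are illustrative assumptions; read_csv_metadata and down_sample are the actual py_entitymatching commands:

>>> import py_entitymatching as em
>>> A = em.read_csv_metadata('tableA.csv', key='ID')  # hypothetical file and key
>>> B = em.read_csv_metadata('tableB.csv', key='ID')
>>> sample_A, sample_B = em.down_sample(A, B, size=500, y_param=1)
>>> len(sample_B)  # 500, the requested size of sample_B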

Conceptually, the command takes in the two original input tables, A and B (and some parameters), and produces two sampled tables, sample_A and sample_B. Specifically, you must set size to the number of tuples to sample from B (this will be the size of the sample_B table) and y_param to the number of tuples to select from A for each tuple in sample_B. The command internally uses a heuristic to ensure a reasonable number of matches between sample_A and sample_B.
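The exact heuristic is internal to the library, so the following is only a plausible sketch of one such heuristic (a whitespace-token inverted index over A, probed by each sampled B tuple); it is an illustrative assumption, not the actual py_entitymatching implementation:

import random
from collections import defaultdict

def down_sample_sketch(A, B, size, y_param, seed=0):
    """A and B are lists of strings (one concatenated string per tuple)."""
    # Inverted index: token -> ids of A tuples containing that token.
    index = defaultdict(set)
    for i, rec in enumerate(A):
        for tok in rec.lower().split():
            index[tok].add(i)

    rng = random.Random(seed)
    b_ids = rng.sample(range(len(B)), size)  # size tuples sampled from B

    a_ids = set()
    for j in b_ids:
        # Gather the A tuples sharing at least one token with this B tuple,
        # then keep up to y_param of them.
        cands = set()
        for tok in B[j].lower().split():
            cands |= index.get(tok, set())
        cands = list(cands)
        rng.shuffle(cands)
        a_ids.update(cands[:y_param])

    return [A[i] for i in sorted(a_ids)], [B[j] for j in sorted(b_ids)]

Sampling B first and then pulling token-sharing tuples from A is what keeps matches alive: each sampled B tuple drags its likely matches in A into the sample, instead of hoping both sides of a match survive two independent random draws.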

Please look at the API reference of down_sample() for more details.

Note

Currently, the input tables must be loaded into memory before they can be down sampled.