=============
Down Sampling
=============

Once the tables to be matched have been read, they must be down sampled if they contain a large number of tuples (for example, 100K+ tuples). This is because working with large tables can be very time consuming, as every subsequent operation would have to process them in full. Random sampling, however, does not work well, because the sampled tables may end up sharing very few matches, especially if the number of matches between the input tables is small to begin with.

In py_entitymatching, you can sample the input tables using the ``down_sample`` command. This command samples the input tables intelligently, ensuring a reasonable number of matches between them. If ``A`` and ``B`` are the input tables, then you can use the ``down_sample`` command as shown below:

>>> sample_A, sample_B = em.down_sample(A, B, size=500, y_param=1)

Conceptually, the command takes in two original input tables, ``A`` and ``B`` (and some parameters), and produces two sampled tables, ``sample_A`` and ``sample_B``. Specifically, you must set ``size`` to the number of tuples that should be sampled from ``B`` (this will be the size of the ``sample_B`` table) and set ``y_param`` to the number of tuples to be selected from ``A`` for each tuple in the ``sample_B`` table. The command internally uses a heuristic to ensure a reasonable number of matches between ``sample_A`` and ``sample_B``. Please refer to the API reference of :py:meth:`~py_entitymatching.down_sample` for more details.

.. note:: Currently, the input tables must be loaded in memory before the user can down sample.
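
Putting the above together, a minimal sketch of the workflow might look as follows. The CSV file names, the key column ``ID``, and the sample sizes are hypothetical placeholders; ``read_csv_metadata`` is shown only to illustrate loading the tables into memory first, as the note above requires:

>>> import py_entitymatching as em
>>> # Load the input tables into memory (required before down sampling);
>>> # 'tableA.csv' and 'tableB.csv' are hypothetical file names.
>>> A = em.read_csv_metadata('tableA.csv', key='ID')
>>> B = em.read_csv_metadata('tableB.csv', key='ID')
>>> # Sample 500 tuples from B; for each sampled tuple, select (roughly)
>>> # 1 probable matching tuple from A using the built-in heuristic.
>>> sample_A, sample_B = em.down_sample(A, B, size=500, y_param=1)

The resulting ``sample_A`` and ``sample_B`` can then be used in place of ``A`` and ``B`` for the rest of the matching workflow.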