Sampling

py_entitymatching.sample_table(table, sample_size, replace=False, verbose=False)

Samples a candidate set of tuple pairs (for labeling purposes).

This function samples a DataFrame, typically so that the sampled tuple pairs can be labeled. It expects the metadata of the input DataFrame as a candidate set (such as key, fk_ltable, fk_rtable, ltable, rtable) to be present in the catalog. Specifically, the function creates a copy of the input DataFrame, samples its rows using uniform random sampling (it uses the ‘random’ function from numpy), and returns the sampled DataFrame. It also copies the properties of the input DataFrame to the output DataFrame.
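The copy-and-sample step described above can be illustrated with a minimal sketch. This is not the library's implementation: `sample_rows_sketch`, its `seed` parameter, and the toy column names are assumptions made for illustration, and the sketch does not touch the catalog at all.

```python
import numpy as np
import pandas as pd


def sample_rows_sketch(table, sample_size, replace=False, seed=None):
    """Hypothetical helper: uniform random row sampling of a DataFrame."""
    rng = np.random.default_rng(seed)
    # Draw row positions uniformly at random, with or without replacement.
    positions = rng.choice(len(table), size=sample_size, replace=replace)
    # Select the drawn rows; copy() keeps the result independent of the input.
    return table.iloc[positions].copy()


# Toy candidate set with made-up column names and values.
candidates = pd.DataFrame({
    "_id": range(10),
    "ltable_id": [f"a{i}" for i in range(10)],
    "rtable_id": [f"b{i}" for i in range(10)],
})

sample = sample_rows_sketch(candidates, sample_size=4, seed=0)
print(sample)
```

With replace=False, every sampled row appears at most once, so the key column stays unique in the sample.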

Parameters
  • table (DataFrame) – The input DataFrame to be sampled. It should be a candidate set whose metadata (such as key, fk_ltable, fk_rtable, ltable, rtable) is present in the catalog.

  • sample_size (int) – The number of samples to be picked from the input DataFrame.

  • replace (boolean) – A flag to indicate whether sampling should be done with replacement or not (defaults to False).

  • verbose (boolean) – A flag to indicate whether more detailed information about the execution steps should be printed out (defaults to False).

Returns

A new DataFrame with ‘sample_size’ number of rows.

The function also sets the output DataFrame’s properties to match those of the input DataFrame.

Raises
  • AssertionError – If table is not of type pandas DataFrame.

  • AssertionError – If the size of table is 0.

  • AssertionError – If the sample_size is greater than the input DataFrame size.
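The three conditions listed above can be mirrored in a short sketch. The helper names and error messages are hypothetical and are not taken from the library; the checks simply follow the documented list.

```python
import pandas as pd


def check_sample_inputs(table, sample_size):
    """Hypothetical helper mirroring the three documented failure conditions."""
    if not isinstance(table, pd.DataFrame):
        raise AssertionError("table is not of type pandas DataFrame")
    if len(table) == 0:
        raise AssertionError("size of the input table is 0")
    if sample_size > len(table):
        raise AssertionError("sample_size is greater than the input table size")


def raises_assertion(table, sample_size):
    """Return True if the checks reject the given arguments."""
    try:
        check_sample_inputs(table, sample_size)
    except AssertionError:
        return True
    return False


small = pd.DataFrame({"_id": [0, 1, 2]})

results = {
    "not a DataFrame": raises_assertion([0, 1, 2], 2),
    "empty table": raises_assertion(pd.DataFrame(), 1),
    "sample too large": raises_assertion(small, 5),
    "valid request": raises_assertion(small, 2),
}
print(results)
```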

Examples

>>> import py_entitymatching as em
>>> S = em.sample_table(C, sample_size=450) # C is the candidate set to be sampled from.

Note

As described above, the output DataFrame is updated (in the catalog) with the properties of the input DataFrame. One subtlety: when the replace flag is set to True, the output DataFrame can contain duplicate keys. In that case, this function does not set the key, and it is up to the user to fix it after the function returns.
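One possible way to repair duplicate keys after sampling with replacement is to overwrite the key column with fresh, unique values. The sketch below uses pandas only; `assign_fresh_key`, the `_id` column name, and the toy data are assumptions for illustration. Registering the repaired key in the catalog afterwards is left out so that the sketch runs without py_entitymatching installed.

```python
import pandas as pd


def assign_fresh_key(sampled, key="_id"):
    """Hypothetical helper: replace a possibly duplicated key with 0..n-1."""
    # Drop the old index so row labels are unique as well.
    repaired = sampled.reset_index(drop=True).copy()
    # Overwrite the key column with consecutive integers, which are unique.
    repaired[key] = range(len(repaired))
    return repaired


# Toy sample drawn with replacement: the key value 3 appears twice.
sampled = pd.DataFrame({
    "_id": [3, 7, 3, 1],
    "ltable_id": ["a3", "a7", "a3", "a1"],
    "rtable_id": ["b3", "b7", "b3", "b1"],
})

repaired = assign_fresh_key(sampled)
print(repaired)
```

Note that overwriting the key discards the link back to the original candidate-set keys. If that link matters, an alternative is to keep the old key in a separate column before assigning the new one.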