Imputing Missing Values

py_entitymatching.impute_table(table, exclude_attrs=None, missing_val='NaN', strategy='mean', fill_value=None, val_all_nans=0, verbose=True)

Impute the missing values in a table.

Parameters
  • table (DataFrame) – DataFrame whose missing values should be imputed.

  • exclude_attrs (List) – list of attribute names to be excluded from imputing (defaults to None).

  • missing_val (string or int) – The placeholder for the missing values. All occurrences of missing_val will be imputed. For missing values encoded as np.nan, use the string value ‘NaN’ (defaults to ‘NaN’).

  • strategy (string) – String that specifies how to impute values. Valid strings: ‘mean’, ‘median’, ‘most_frequent’ (defaults to ‘mean’).

  • fill_value (any) – When strategy == “constant”, fill_value is used to replace all occurrences of missing values (defaults to None).

  • val_all_nans (float) – Value to fill in if all the values in the column are NaN (defaults to 0).

Returns

Imputed DataFrame.

Raises

AssertionError – If table is not of type pandas DataFrame.

Examples

>>> import py_entitymatching as em
>>> # H is the table of feature vectors to be imputed. Here, the missing values in each
>>> # column are replaced with the mean of that column.
>>> H = em.impute_table(H, exclude_attrs=['_id', 'ltable_id', 'rtable_id'], strategy='mean')
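To make the parameter semantics concrete, the following is a minimal plain-Python sketch of column-wise imputation. It is not the py_entitymatching implementation: `impute_columns` and `_is_missing` are hypothetical helpers, the table is modeled as a list of dicts rather than a DataFrame, the column names `name_sim` and `zip_sim` are made up, and letting `fill_value` take precedence over `val_all_nans` under the "constant" strategy is an assumption.

```python
import math
from collections import Counter
from statistics import mean, median


def _is_missing(value):
    """Treat None and float NaN as missing (assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def impute_columns(rows, exclude_attrs=None, strategy="mean",
                   fill_value=None, val_all_nans=0):
    """Hypothetical helper: column-wise imputation over a list of dicts.

    Mirrors the documented parameters only; it is not the library's code.
    """
    exclude = set(exclude_attrs or [])
    columns = [c for c in rows[0] if c not in exclude] if rows else []
    result = [dict(row) for row in rows]  # copy so the input is left untouched

    for col in columns:
        present = [row[col] for row in rows if not _is_missing(row[col])]
        if strategy == "constant":
            # Assumption: fill_value wins even if the whole column is missing.
            replacement = fill_value
        elif not present:
            # Every value in the column is missing.
            replacement = val_all_nans
        elif strategy == "mean":
            replacement = mean(present)
        elif strategy == "median":
            replacement = median(present)
        elif strategy == "most_frequent":
            replacement = Counter(present).most_common(1)[0][0]
        else:
            raise ValueError("unknown strategy: %r" % strategy)

        for row in result:
            if _is_missing(row[col]):
                row[col] = replacement
    return result


# Toy feature-vector table: '_id' is excluded, 'zip_sim' is entirely missing.
H = [
    {"_id": 0, "name_sim": 0.5, "zip_sim": float("nan")},
    {"_id": 1, "name_sim": float("nan"), "zip_sim": float("nan")},
    {"_id": 2, "name_sim": 1.0, "zip_sim": float("nan")},
]
imputed = impute_columns(H, exclude_attrs=["_id"], strategy="mean")
print(imputed[1]["name_sim"])  # 0.75, the mean of 0.5 and 1.0
print(imputed[0]["zip_sim"])   # 0, the val_all_nans fallback
```

In this sketch, excluded attributes are skipped entirely, so identifier columns such as `_id` are never averaged or overwritten.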