Splitting Data into Train and Test

py_entitymatching.split_train_test(labeled_data, train_proportion=0.5, random_state=None, verbose=True)

This function splits the input data into train and test sets.

Specifically, this function is a thin wrapper around scikit-learn's train_test_split function.

This function also takes care of copying the metadata from the input table to train and test splits.
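Conceptually, the split shuffles the input rows and allocates the first train_proportion of them to the train set. The following pure-Python sketch (the function name split_train_test_sketch and its list-based input are illustrative assumptions, not the library's actual implementation, which delegates to scikit-learn) shows the idea:

```python
import random

def split_train_test_sketch(rows, train_proportion=0.5, random_state=None):
    # Hypothetical sketch of the split logic; the real function operates on
    # pandas DataFrames and delegates to scikit-learn's train_test_split.
    rng = random.Random(random_state)
    indices = list(range(len(rows)))
    rng.shuffle(indices)  # randomize which rows land in which split
    cut = int(round(len(rows) * train_proportion))
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return {'train': train, 'test': test}
```

With 10 rows and train_proportion=0.7, the sketch yields 7 train rows and 3 test rows, and every input row appears in exactly one of the two splits.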

Parameters
  • labeled_data (DataFrame) – The input pandas DataFrame that needs to be split into train and test.

  • train_proportion (float) – A number between 0 and 1, indicating the proportion of tuples that should be included in the train split (defaults to 0.5).

  • random_state (object) – A number or a random number object (as in scikit-learn), used to seed the split for reproducibility.

  • verbose (boolean) – A flag to indicate whether the debug information should be displayed.

Returns

A Python dictionary containing two keys: 'train' and 'test'.

The value for the key 'train' is a pandas DataFrame containing the tuples allocated from the input table based on train_proportion.

Similarly, the value for the key 'test' is a pandas DataFrame containing the remaining tuples, intended for evaluation.

This function sets the metadata properties of both output DataFrames (train and test) to be the same as those of the input DataFrame.
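The metadata propagation can be pictured as copying every catalog property of the input table to each split. The sketch below is purely illustrative: the catalog dictionary and copy_metadata helper are hypothetical stand-ins for py_entitymatching's internal metadata catalog, not its actual API.

```python
# Hypothetical stand-in for py_entitymatching's metadata catalog,
# which maps each table to its properties (key, foreign keys, etc.).
catalog = {}

def copy_metadata(source_table, target_table):
    # Copy all metadata properties of the source table to the target table.
    catalog[target_table] = dict(catalog.get(source_table, {}))

# Example: the input table G has a key and foreign-key properties;
# both splits receive identical copies of them.
catalog['G'] = {'key': '_id', 'fk_ltable': 'ltable_id', 'fk_rtable': 'rtable_id'}
copy_metadata('G', 'train')
copy_metadata('G', 'test')
```

After the copy, reading any property (such as the key) from the train or test table gives the same answer as reading it from the input table.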

Examples

>>> import py_entitymatching as em
>>> # G is the labeled data or the feature vectors that should be split
>>> train_test = em.split_train_test(G, train_proportion=0.5)
>>> train, test = train_test['train'], train_test['test']