Splitting Data into Train and Test
py_entitymatching.split_train_test(labeled_data, train_proportion=0.5, random_state=None, verbose=True)

This function splits the input data into train and test. Specifically, this function is a wrapper around scikit-learn's train_test_split function. It also takes care of copying the metadata from the input table to the train and test splits.
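Because split_train_test is described as a wrapper around scikit-learn's train_test_split that additionally copies the catalog metadata, its behavior can be approximated as in the sketch below. This is a hedged illustration under assumptions, not the library's actual implementation: the helper name split_with_metadata is hypothetical, and using em.copy_properties to carry over the metadata is an assumption about the internal details.

import py_entitymatching as em
from sklearn.model_selection import train_test_split

def split_with_metadata(labeled_data, train_proportion=0.5, random_state=None):
    # Hypothetical helper sketching roughly what split_train_test does.
    # Split the rows of the labeled table; train_size controls the proportion
    # of tuples placed in the train split.
    train, test = train_test_split(labeled_data,
                                   train_size=train_proportion,
                                   random_state=random_state)
    # Copy the catalog metadata (e.g., the key) from the input table to both
    # splits (assumed here; the real function handles this internally).
    em.copy_properties(labeled_data, train)
    em.copy_properties(labeled_data, test)
    return {'train': train, 'test': test}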
- Parameters
labeled_data (DataFrame) – The input pandas DataFrame that needs to be split into train and test.
train_proportion (float) – A number between 0 and 1, indicating the proportion of tuples that should be included in the train split (defaults to 0.5).
random_state (object) – A number or a random number object (as in scikit-learn).
verbose (boolean) – A flag to indicate whether the debug information should be displayed.
- Returns
A Python dictionary containing two keys: 'train' and 'test'.
The value for the key ‘train’ is a pandas DataFrame containing tuples allocated from the input table based on train_proportion.
Similarly, the value for the key ‘test’ is a pandas DataFrame containing tuples for evaluation.
This function sets the properties of the output DataFrames (train, test) to be the same as those of the input DataFrame.
Examples
>>> import py_entitymatching as em
>>> # G is the labeled data or the feature vectors that should be split
>>> train_test = em.split_train_test(G, train_proportion=0.5)
>>> train, test = train_test['train'], train_test['test']
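For example, assuming G has its key set in the catalog, the copied metadata can be inspected with em.get_key (an illustrative check, not part of the function's documented behavior):

>>> em.get_key(train)  # expected to return the same key attribute as em.get_key(G)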