Splitting Labeled Data into Training and Testing Sets

During entity matching, you will need to split data for several purposes. Two examples are:

1. Split labeled data into development and test sets. The development set is used to come up with the right features for the learning-based matcher, and the test set is used to evaluate the matcher.

2. Split feature vectors into train and test sets. The train set is used to train the learning-based matcher, and the test set is used to evaluate it.

py_entitymatching provides the split_train_test command for these needs. An example of using split_train_test is shown below:

>>> train_test = em.split_train_test(G, train_proportion=0.5)

In the above, split_train_test returns a dictionary with two keys: train and test. The value for the key train is a DataFrame containing the tuples allocated from the input table based on train_proportion. Similarly, the value for the key test is a DataFrame containing the remaining tuples, to be used for evaluation. An example of getting the train and test DataFrames from the output of split_train_test is shown below:

>>> devel_set = train_test['train']
>>> eval_set = train_test['test']
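To make the shape of the result concrete, here is a minimal standard-library sketch of a proportion-based split. It is an illustration only, not py_entitymatching's implementation: the helper split_rows, its seed parameter, the rounding-down of the train size, and the tuple layout of the sample rows are all assumptions made for this example, and plain lists stand in for DataFrames.

```python
import math
import random


def split_rows(rows, train_proportion=0.5, seed=None):
    """Hypothetical stand-in for a proportion-based split (illustration only)."""
    if not 0 <= train_proportion <= 1:
        raise ValueError("train_proportion must be between 0 and 1")
    # Shuffle the row positions so the input list itself is left untouched.
    positions = list(range(len(rows)))
    random.Random(seed).shuffle(positions)
    # Assumption: the train size is rounded down to a whole number of rows.
    train_size = math.floor(len(rows) * train_proportion)
    train = [rows[i] for i in positions[:train_size]]
    test = [rows[i] for i in positions[train_size:]]
    # Same shape as described above: a dictionary with keys 'train' and 'test'.
    return {"train": train, "test": test}


# Hypothetical labeled pairs: (_id, ltable_id, rtable_id, label)
labeled = [(i, "a%d" % i, "b%d" % i, i % 2) for i in range(10)]

parts = split_rows(labeled, train_proportion=0.5, seed=0)
devel_set = parts["train"]
eval_set = parts["test"]
```

Every input row lands in exactly one of the two sets, so the development and evaluation data never overlap.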

The right value for train_proportion depends on the context in which the split is used. For instance, if the data is split for machine learning purposes, the train set is typically larger than the test set. The most commonly used values of train_proportion are between 0.5 and 0.8.
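The trade-off can be seen with simple arithmetic: every row given to the train set is a row taken away from evaluation. The sketch below is a rough illustration under stated assumptions; split_sizes is a hypothetical helper, the count of 400 labeled pairs is made up, and rounding the train size down is an assumption rather than documented behavior.

```python
import math


def split_sizes(num_rows, train_proportion):
    """Return (train_size, test_size) for a hypothetical proportion-based split."""
    # Assumption: the train size is rounded down; the rest goes to the test set.
    train_size = math.floor(num_rows * train_proportion)
    return train_size, num_rows - train_size


num_labeled = 400  # hypothetical number of labeled pairs

for train_proportion in (0.5, 0.6, 0.7, 0.8):
    train_size, test_size = split_sizes(num_labeled, train_proportion)
    print(train_proportion, train_size, test_size)
```

With a small labeled set, a high train_proportion can leave too few test tuples for a reliable accuracy estimate, which is one reason to stay within the range above.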

Please refer to the API reference of split_train_test() for more details.
