Debugging Matcher

py_entitymatching.vis_debug_dt(matcher, train, test, exclude_attrs, target_attr)

Visual debugger for Decision Tree matcher.

Parameters

matcher (DTMatcher) – The decision tree matcher to be debugged.
train (DataFrame) – The pandas DataFrame that will be used to train the matcher.
test (DataFrame) – The pandas DataFrame that will be used to test the matcher.
exclude_attrs (list) – The list of attributes to exclude from train and test during training and testing.
target_attr (string) – The name of the attribute in train that contains the true labels.

Examples

>>> import py_entitymatching as em
>>> dt = em.DTMatcher()
>>> # 'devel' is the labeled set used for development (e.g., selecting the best matcher)
>>> train_test = em.split_train_test(devel, 0.5)
>>> train, test = train_test['train'], train_test['test']
>>> em.vis_debug_dt(dt, train, test, exclude_attrs=['_id', 'ltable_id', 'rtable_id'], target_attr='gold_labels')

py_entitymatching.vis_debug_rf(matcher, train, test, exclude_attrs, target_attr)

Visual debugger for Random Forest matcher.

Parameters

matcher (RFMatcher) – The Random Forest matcher that should be debugged.
train (DataFrame) – The pandas DataFrame that will be used to train the matcher.
test (DataFrame) – The pandas DataFrame that will be used to test the matcher.
exclude_attrs (list) – The list of attributes to exclude from train and test during training and testing.
target_attr (string) – The name of the attribute in train that contains the true labels.

Examples

>>> import py_entitymatching as em
>>> rf = em.RFMatcher()
>>> # 'devel' is the labeled set used for development (e.g., selecting the best matcher)
>>> train_test = em.split_train_test(devel, 0.5)
>>> train, test = train_test['train'], train_test['test']
>>> em.vis_debug_rf(rf, train, test, exclude_attrs=['_id', 'ltable_id', 'rtable_id'], target_attr='gold_labels')

py_entitymatching.debug_decisiontree_matcher(decision_tree, tuple_1, tuple_2, feature_table, table_columns, exclude_attrs=None)

This function is used to debug a decision tree matcher using two input tuples.

Specifically, this function takes two tuples, computes their feature vector using the feature table, passes that vector to the decision tree, and displays the path the vector takes through the tree.
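To make "the path the feature vector takes" concrete, here is a minimal, self-contained sketch of tracing one feature vector through a small tree. It does not use py_entitymatching: the Node class, the trace_path helper, the feature names (name_jaccard, zip_exact) and the thresholds are all hypothetical, and the library's actual output format may differ.

```python
# Hypothetical illustration only: a hand-built tree, not a py_entitymatching DTMatcher.
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    feature: Optional[str] = None   # feature tested at an internal node
    threshold: float = 0.0          # go left when value <= threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[int] = None     # set only on leaves (1 = match, 0 = non-match)


def trace_path(node: Node, fv: Dict[str, float]) -> List[str]:
    """Return the conditions evaluated for fv, ending with the predicted label."""
    steps = []
    while node.label is None:
        value = fv[node.feature]
        went_left = value <= node.threshold
        steps.append(f"{node.feature} = {value} <= {node.threshold} is {went_left}")
        node = node.left if went_left else node.right
    steps.append(f"prediction = {node.label}")
    return steps


# Made-up feature names and thresholds.
tree = Node(
    feature="name_jaccard", threshold=0.5,
    left=Node(label=0),
    right=Node(feature="zip_exact", threshold=0.5,
               left=Node(label=0), right=Node(label=1)),
)

feature_vector = {"name_jaccard": 0.8, "zip_exact": 1.0}
path = trace_path(tree, feature_vector)
for step in path:
    print(step)
```

Reading the printed conditions from top to bottom shows which feature tests led to the final label, which is the kind of information a path-based debugger exposes for a single tuple pair.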

Parameters

decision_tree (DTMatcher) – The input decision tree object that should be debugged.
tuple_1 (Series) – The first input tuple to be debugged.
tuple_2 (Series) – The second input tuple to be debugged.
feature_table (DataFrame) – Feature table containing the functions for the features.
table_columns (list) – List of all columns that are output after the feature vectors are generated.
exclude_attrs (list) – List of attributes to remove from the table columns.
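The relationship between table_columns and exclude_attrs can be sketched as follows. This assumes, based only on the parameter descriptions above, that the columns remaining after exclusion are the ones treated as features. The column names are hypothetical and the snippet does not call the library.

```python
# Assumed behaviour, for illustration: feature columns = table_columns minus exclude_attrs.
# All column names below are hypothetical.
table_columns = ['_id', 'ltable_id', 'rtable_id',
                 'name_jaccard', 'zip_exact', 'gold_labels']
exclude_attrs = ['_id', 'ltable_id', 'rtable_id', 'gold_labels']

excluded = set(exclude_attrs)
# Keep the original column order so it lines up with the feature vector.
feature_columns = [c for c in table_columns if c not in excluded]
print(feature_columns)
```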

Raises

AssertionError – If the input feature table is not of type pandas DataFrame.

Examples

>>> import py_entitymatching as em
>>> # devel is the labeled data used for development purposes, match_f is the feature table
>>> H = em.extract_feat_vecs(devel, feat_table=match_f, attrs_after='gold_labels')
>>> dt = em.DTMatcher()
>>> dt.fit(table=H, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'gold_labels'], target_attr='gold_labels')
>>> # F is the set of feature vectors extracted from the evaluation set of the labeled data
>>> out = dt.predict(table=F, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'gold_labels'], target_attr='gold_labels')
>>> # A and B are input tables
>>> em.debug_decisiontree_matcher(dt, A.loc[1], B.loc[2], match_f, H.columns, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'gold_labels'])

py_entitymatching.debug_randomforest_matcher(random_forest, tuple_1, tuple_2, feature_table, table_columns, exclude_attrs=None)

This function is used to debug a random forest matcher using two input tuples.

Specifically, this function takes two tuples, computes their feature vector using the feature table, passes that vector to the random forest, and displays the path the vector takes through each of the decision trees that make up the random forest matcher.
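As a rough sketch of what per-tree tracing looks like, the snippet below walks one feature vector through three tiny hand-built trees and combines their leaf labels. It does not use py_entitymatching: the tuple-based tree encoding, the trace_tree and trace_forest helpers, and the feature names are hypothetical. Combining by majority vote is an assumption made for illustration, and the library's underlying forest may combine trees differently.

```python
# Hypothetical illustration only: hand-built trees, not a py_entitymatching RFMatcher.
from collections import Counter

# Internal node: (feature, threshold, left_subtree, right_subtree); leaf: int label.
forest = [
    ("name_jaccard", 0.5, 0, 1),
    ("zip_exact", 0.5, 0, ("name_jaccard", 0.9, 0, 1)),
    ("phone_lev", 0.7, 0, 1),
]


def trace_tree(tree, fv):
    """Walk one tree; return (list of evaluated conditions, leaf label)."""
    steps = []
    while not isinstance(tree, int):
        feature, threshold, left, right = tree
        went_left = fv[feature] <= threshold
        steps.append((feature, fv[feature], threshold, went_left))
        tree = left if went_left else right
    return steps, tree


def trace_forest(trees, fv):
    """Trace fv through every tree, then combine the leaf labels by majority vote."""
    traces = [trace_tree(t, fv) for t in trees]
    votes = Counter(label for _, label in traces)
    return traces, votes.most_common(1)[0][0]


fv = {"name_jaccard": 0.8, "zip_exact": 1.0, "phone_lev": 0.4}
traces, prediction = trace_forest(forest, fv)
for i, (steps, label) in enumerate(traces):
    print(f"tree {i}: {steps} -> {label}")
print("majority vote:", prediction)
```

In this made-up case the first tree votes for a match while the other two do not, so inspecting each tree's path shows which feature tests drove the disagreement.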

Parameters

random_forest (RFMatcher) – The input random forest object that should be debugged.
tuple_1 (Series) – The first input tuple to be debugged.
tuple_2 (Series) – The second input tuple to be debugged.
feature_table (DataFrame) – Feature table containing the functions for the features.
table_columns (list) – List of all columns that are output after the feature vectors are generated.
exclude_attrs (list) – List of attributes to remove from the table columns.

Raises

AssertionError – If the input feature table is not of type pandas DataFrame.

Examples

>>> import py_entitymatching as em
>>> # devel is the labeled data used for development purposes, match_f is the feature table
>>> H = em.extract_feat_vecs(devel, feat_table=match_f, attrs_after='gold_labels')
>>> rf = em.RFMatcher()
>>> rf.fit(table=H, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'gold_labels'], target_attr='gold_labels')
>>> # F is the set of feature vectors extracted from the evaluation set of the labeled data
>>> out = rf.predict(table=F, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'gold_labels'], target_attr='gold_labels')
>>> # A and B are input tables
>>> em.debug_randomforest_matcher(rf, A.loc[1], B.loc[2], match_f, H.columns, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'gold_labels'])