active_matcher.feature package
Submodules
active_matcher.feature.feature module
- class active_matcher.feature.feature.EditDistanceFeature(a_attr: str, b_attr: str)
Bases:
Feature
edit distance between two strings, case insensitive
- Attributes:
a_attr
the name of the attribute from table A used to generate this feature
b_attr
the name of the attribute from table B used to generate this feature
Methods
__call__
(rec, recs) compute the feature with A for each row in B; both A and B are preprocessed
build
(A, B, cache) Guaranteed to be called before the feature's preprocessing is done.
preprocess
(data, is_table_a) preprocess the data, adding the output column to data
preprocess_output_column
(for_table_a) get the name of the preprocessing output column for table A or B
template
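To make "edit distance between two strings, case insensitive" concrete, here is a minimal sketch of a Levenshtein distance computed after lowercasing both inputs. The function name `edit_distance` is hypothetical, and whether the library returns the raw distance or a normalized score is not stated here, so treat the return value as an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, ignoring case."""
    a, b = a.lower(), b.lower()
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

Only two rows of the dynamic-programming table are kept, so memory stays linear in the length of the second string.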
- class active_matcher.feature.feature.ExactMatchFeature(a_attr: str, b_attr: str)
Bases:
Feature
Case insensitive exact string match
- Attributes:
a_attr
the name of the attribute from table A used to generate this feature
b_attr
the name of the attribute from table B used to generate this feature
Methods
__call__
(rec, recs) compute the feature with A for each row in B; both A and B are preprocessed
build
(A, B, cache) Guaranteed to be called before the feature's preprocessing is done.
preprocess
(data, is_table_a) preprocess the data, adding the output column to data
preprocess_output_column
(for_table_a) get the name of the preprocessing output column for table A or B
template
- class active_matcher.feature.feature.Feature(a_attr: str, b_attr: str)
Bases:
ABC
- Attributes:
Methods
__call__
(A, B) compute the feature with A for each row in B; both A and B are preprocessed
build
(A, B, cache) Guaranteed to be called before the feature's preprocessing is done.
preprocess
(data, is_table_a) preprocess the data, adding the output column to data
preprocess_output_column
(for_table_a) get the name of the preprocessing output column for table A or B
template
- property a_attr
the name of the attribute from table A used to generate this feature
- property b_attr
the name of the attribute from table B used to generate this feature
- build(A, B, cache)
Guaranteed to be called before the feature's preprocessing is done. This method should generate and store all of the metadata required to compute the features over A and B. Note: B may be None.
- preprocess(data, is_table_a)
preprocess the data, adding the output column to data
- preprocess_output_column(for_table_a: bool)
get the name of the preprocessing output column for table A or B
- classmethod template(**kwargs)
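The call order described above (build, then preprocess, then `__call__`) can be sketched with a small stand-in. Everything here is hypothetical: `FeatureSketch` and `ExactMatchSketch` are not the library classes, rows are plain dicts rather than whatever table type the library uses, the output column naming is invented, and it is assumed that `rec` is one preprocessed record from table A and `recs` the preprocessed records from table B.

```python
from abc import ABC, abstractmethod


class FeatureSketch(ABC):
    """Hypothetical stand-in that mirrors the documented method names only."""

    def __init__(self, a_attr: str, b_attr: str):
        self.a_attr = a_attr
        self.b_attr = b_attr
        self.built = False

    def build(self, A, B, cache=None):
        # A real feature would compute and store metadata here; B may be None.
        self.built = True

    def preprocess_output_column(self, for_table_a: bool) -> str:
        # Invented naming scheme for the preprocessed column.
        attr = self.a_attr if for_table_a else self.b_attr
        return f"lower({attr})"

    def preprocess(self, data, is_table_a: bool):
        # Assumption: each row is a dict; the output column is added in place.
        attr = self.a_attr if is_table_a else self.b_attr
        out = self.preprocess_output_column(is_table_a)
        for row in data:
            value = row.get(attr)
            row[out] = None if value is None else str(value).lower()
        return data

    @abstractmethod
    def __call__(self, rec, recs):
        """Return one feature value per record in recs."""


class ExactMatchSketch(FeatureSketch):
    def __call__(self, rec, recs):
        a_col = self.preprocess_output_column(True)
        b_col = self.preprocess_output_column(False)
        return [1.0 if rec[a_col] == row[b_col] else 0.0 for row in recs]


A = [{"name": "Ada Lovelace"}]
B = [{"full_name": "ADA LOVELACE"}, {"full_name": "Alan Turing"}]

feature = ExactMatchSketch("name", "full_name")
feature.build(A, B)               # 1. build before any preprocessing
A = feature.preprocess(A, True)   # 2. preprocess each table
B = feature.preprocess(B, False)
scores = feature(A[0], B)         # 3. compare one A record with every B record
```

Keeping the lowercasing in `preprocess` means it runs once per row instead of once per compared pair, which is presumably why preprocessing is a separate step.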
- class active_matcher.feature.feature.NeedlemanWunschFeature(a_attr: str, b_attr: str)
Bases:
Feature
Needleman-Wunsch score between two strings, case insensitive
- Attributes:
a_attr
the name of the attribute from table A used to generate this feature
b_attr
the name of the attribute from table B used to generate this feature
Methods
__call__
(rec, recs) compute the feature with A for each row in B; both A and B are preprocessed
build
(A, B, cache) Guaranteed to be called before the feature's preprocessing is done.
preprocess
(data, is_table_a) preprocess the data, adding the output column to data
preprocess_output_column
(for_table_a) get the name of the preprocessing output column for table A or B
template
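As an illustration of the Needleman-Wunsch measure named above, the following sketch computes a global alignment score on lowercased strings. The scoring scheme (1 for a match, 0 for a mismatch, a gap cost of 1) and the function name `needleman_wunsch` are assumptions, not taken from the library.

```python
def needleman_wunsch(a: str, b: str, gap_cost: float = 1.0) -> float:
    """Global alignment score between a and b, ignoring case."""
    a, b = a.lower(), b.lower()
    # Aligning a prefix of b against the empty string costs one gap per character.
    prev = [-j * gap_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [-i * gap_cost]
        for j, cb in enumerate(b, start=1):
            diagonal = prev[j - 1] + (1.0 if ca == cb else 0.0)
            curr.append(max(diagonal,
                            prev[j] - gap_cost,       # gap in b
                            curr[j - 1] - gap_cost))  # gap in a
        prev = curr
    return prev[-1]
```

For example, aligning "dva" with "deeva" matches three characters and opens two gaps, giving a score of 1.0 under these assumed costs.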
- class active_matcher.feature.feature.RelDiffFeature(a_attr, b_attr)
Bases:
Feature
relative difference between two values
- Attributes:
a_attr
the name of the attribute from table A used to generate this feature
b_attr
the name of the attribute from table B used to generate this feature
Methods
__call__
(rec, recs) compute the feature with A for each row in B; both A and B are preprocessed
build
(A, B, cache) Guaranteed to be called before the feature's preprocessing is done.
preprocess
(data, is_table_a) preprocess the data, adding the output column to data
preprocess_output_column
(for_table_a) get the name of the preprocessing output column for table A or B
template
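The exact formula behind "relative difference between two values" is not given here. One common definition, shown purely as an assumption, divides the absolute difference by the larger magnitude; the function name `rel_diff` and the handling of two zeros are likewise hypothetical.

```python
def rel_diff(a: float, b: float) -> float:
    """Absolute difference scaled by the larger magnitude (assumed definition)."""
    denom = max(abs(a), abs(b))
    if denom == 0:
        # Both values are zero, so they are treated as identical.
        return 0.0
    return abs(a - b) / denom
```

Under this definition `rel_diff(10, 8)` is 0.2, and the result stays in the range 0 to 1 whenever both values have the same sign.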
- class active_matcher.feature.feature.SmithWatermanFeature(a_attr: str, b_attr: str)
Bases:
Feature
Smith-Waterman score between two strings, case insensitive
- Attributes:
a_attr
the name of the attribute from table A used to generate this feature
b_attr
the name of the attribute from table B used to generate this feature
Methods
__call__
(rec, recs) compute the feature with A for each row in B; both A and B are preprocessed
build
(A, B, cache) Guaranteed to be called before the feature's preprocessing is done.
preprocess
(data, is_table_a) preprocess the data, adding the output column to data
preprocess_output_column
(for_table_a) get the name of the preprocessing output column for table A or B
template
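The Smith-Waterman measure named above is a local alignment: it scores the best-matching pair of substrings and never lets a cell drop below zero. The sketch below assumes a score of 1 for a match, 0 for a mismatch, and a gap cost of 1; the name `smith_waterman` and these costs are illustrative, not the library's confirmed settings.

```python
def smith_waterman(a: str, b: str, gap_cost: float = 1.0) -> float:
    """Best local alignment score between a and b, ignoring case."""
    a, b = a.lower(), b.lower()
    best = 0.0
    prev = [0.0] * (len(b) + 1)
    for ca in a:
        curr = [0.0]
        for j, cb in enumerate(b, start=1):
            diagonal = prev[j - 1] + (1.0 if ca == cb else 0.0)
            # Flooring at 0 lets an alignment restart anywhere in either string.
            score = max(0.0, diagonal, prev[j] - gap_cost, curr[j - 1] - gap_cost)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best
```

Because the score is local, unrelated prefixes and suffixes do not reduce it: "xxABCyy" and "zzabcww" score 3.0 for the shared "abc".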
active_matcher.feature.token_feature module
- class active_matcher.feature.token_feature.CosineFeature(a_attr, b_attr, tokenizer)
Bases:
TokenFeature
- Attributes:
a_attr
the name of the attribute from table A used to generate this feature
b_attr
the name of the attribute from table B used to generate this feature
Methods
__call__
(rec, recs) compute the feature with A for each row in B; both A and B are preprocessed
build
(A, B, cache) Guaranteed to be called before the feature's preprocessing is done.
preprocess
(data, is_table_a) preprocess the data, adding the output column to data
preprocess_output_column
(for_table_a) get the name of the preprocessing output column for table A or B
sim_func
(x, y) function that takes in two sets of tokens and outputs a float
template
- sim_func(x, y)
function that takes in two sets of tokens and outputs a float
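For token sets, cosine similarity is usually taken to be the size of the intersection divided by the geometric mean of the two set sizes. The sketch below follows that definition; the name `cosine_sim` and the choice to return 0.0 for an empty set are assumptions.

```python
import math


def cosine_sim(x: set, y: set) -> float:
    """Set-based cosine similarity: |x & y| / sqrt(|x| * |y|)."""
    if not x or not y:
        return 0.0  # assumed convention for empty token sets
    return len(x & y) / math.sqrt(len(x) * len(y))
```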
- class active_matcher.feature.token_feature.JaccardFeature(a_attr, b_attr, tokenizer)
Bases:
TokenFeature
- Attributes:
a_attr
the name of the attribute from table A used to generate this feature
b_attr
the name of the attribute from table B used to generate this feature
Methods
__call__
(rec, recs) compute the feature with A for each row in B; both A and B are preprocessed
build
(A, B, cache) Guaranteed to be called before the feature's preprocessing is done.
preprocess
(data, is_table_a) preprocess the data, adding the output column to data
preprocess_output_column
(for_table_a) get the name of the preprocessing output column for table A or B
sim_func
(x, y) function that takes in two sets of tokens and outputs a float
template
- sim_func(x, y)
function that takes in two sets of tokens and outputs a float
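The Jaccard measure over two token sets is the size of their intersection divided by the size of their union. A minimal sketch, where the name `jaccard_sim` and the value returned for two empty sets are assumptions:

```python
def jaccard_sim(x: set, y: set) -> float:
    """Jaccard similarity: |x & y| / |x | y|."""
    union = x | y
    if not union:
        return 0.0  # assumed convention when both sets are empty
    return len(x & y) / len(union)
```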
- class active_matcher.feature.token_feature.MongeElkanFeature(a_attr, b_attr, tokenizer)
Bases:
TokenFeature
Monge-Elkan with Jaro-Winkler as the inner similarity function
- Attributes:
a_attr
the name of the attribute from table A used to generate this feature
b_attr
the name of the attribute from table B used to generate this feature
Methods
__call__
(rec, recs) compute the feature with A for each row in B; both A and B are preprocessed
build
(A, B, cache) Guaranteed to be called before the feature's preprocessing is done.
preprocess
(data, is_table_a) preprocess the data, adding the output column to data
preprocess_output_column
(for_table_a) get the name of the preprocessing output column for table A or B
sim_func
(x, y) function that takes in two sets of tokens and outputs a float
template
- sim_func(x, y)
function that takes in two sets of tokens and outputs a float
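Monge-Elkan scores two token collections by taking, for each token in the first, its best inner similarity against the second, and averaging those maxima. The sketch below pairs it with a textbook Jaro-Winkler. All function names are hypothetical, and details such as the prefix weight of 0.1, the four-character prefix cap, and the absence of a boost threshold are assumptions rather than the library's confirmed behavior.

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity between two strings."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        # Look for an unused equal character within the match window.
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s: str, t: str, prefix_weight: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    score = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1 - score)


def monge_elkan(x, y, inner=jaro_winkler) -> float:
    """Average, over tokens in x, of the best inner similarity against y."""
    if not x or not y:
        return 0.0  # assumed convention for empty inputs
    return sum(max(inner(a, b) for b in y) for a in x) / len(x)
```

Note that the measure is asymmetric: swapping `x` and `y` can change the result, because the average runs over the tokens of the first argument only.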
- class active_matcher.feature.token_feature.OverlapCoeffFeature(a_attr, b_attr, tokenizer)
Bases:
TokenFeature
- Attributes:
a_attr
the name of the attribute from table A used to generate this feature
b_attr
the name of the attribute from table B used to generate this feature
Methods
__call__
(rec, recs) compute the feature with A for each row in B; both A and B are preprocessed
build
(A, B, cache) Guaranteed to be called before the feature's preprocessing is done.
preprocess
(data, is_table_a) preprocess the data, adding the output column to data
preprocess_output_column
(for_table_a) get the name of the preprocessing output column for table A or B
sim_func
(x, y) function that takes in two sets of tokens and outputs a float
template
- sim_func(x, y)
function that takes in two sets of tokens and outputs a float
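The overlap coefficient divides the intersection size by the size of the smaller set, so a set fully contained in another scores 1.0. A minimal sketch, where the name `overlap_coeff` and the empty-set convention are assumptions:

```python
def overlap_coeff(x: set, y: set) -> float:
    """Overlap coefficient: |x & y| / min(|x|, |y|)."""
    if not x or not y:
        return 0.0  # assumed convention for empty token sets
    return len(x & y) / min(len(x), len(y))
```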
- class active_matcher.feature.token_feature.TokenFeature(a_attr, b_attr, tokenizer)
Bases:
Feature
- Attributes:
a_attr
the name of the attribute from table A used to generate this feature
b_attr
the name of the attribute from table B used to generate this feature
Methods
__call__
(rec, recs) compute the feature with A for each row in B; both A and B are preprocessed
build
(A, B, cache) Guaranteed to be called before the feature's preprocessing is done.
preprocess
(data, is_table_a) preprocess the data, adding the output column to data
preprocess_output_column
(for_table_a) get the name of the preprocessing output column for table A or B
sim_func
(x, y) function that takes in two sets of tokens and outputs a float
template
- abstractmethod sim_func(x, y)
function that takes in two sets of tokens and outputs a float
active_matcher.feature.vector_feature module
- class active_matcher.feature.vector_feature.DocFreqBuilder(a_attr, b_attr, tokenizer)
Bases:
object
Methods
build
- build(A, B)
- class active_matcher.feature.vector_feature.SIFFeature(a_attr, b_attr, tokenizer)
Bases:
TFIDFFeature
- Attributes:
a_attr
the name of the attribute from table A used to generate this feature
b_attr
the name of the attribute from table B used to generate this feature
Methods
__call__
(rec, recs) compute the feature with A for each row in B; both A and B are preprocessed
build
(A, B, cache) Guaranteed to be called before the feature's preprocessing is done.
preprocess
(data, is_table_a) preprocess the data, adding the output column to data
preprocess_output_column
(for_table_a) get the name of the preprocessing output column for table A or B
sim_func
(x, y) function that takes in two sets of tokens and outputs a float
template
- class active_matcher.feature.vector_feature.TFIDFFeature(a_attr, b_attr, tokenizer)
Bases:
TokenFeature
- Attributes:
a_attr
the name of the attribute from table A used to generate this feature
b_attr
the name of the attribute from table B used to generate this feature
Methods
__call__
(rec, recs) compute the feature with A for each row in B; both A and B are preprocessed
build
(A, B, cache) Guaranteed to be called before the feature's preprocessing is done.
preprocess
(data, is_table_a) preprocess the data, adding the output column to data
preprocess_output_column
(for_table_a) get the name of the preprocessing output column for table A or B
sim_func
(x, y) function that takes in two sets of tokens and outputs a float
template
- build(A, B, cache)
Guaranteed to be called before the feature's preprocessing is done. This method should generate and store all of the metadata required to compute the features over A and B. Note: B may be None.
- sim_func(x, y)
function that takes in two sets of tokens and outputs a float
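For a TF-IDF feature, the metadata that `build` generates is plausibly a table of document frequencies, which is then used to weight tokens before comparing two records. The sketch below illustrates that idea on plain token lists. The helper names, the smoothed IDF formula, and the use of cosine similarity over the weighted vectors are all assumptions, not the library's confirmed implementation.

```python
import math
from collections import Counter


def doc_freqs(docs):
    """Count how many token lists each token appears in."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def tfidf_vector(tokens, df, n_docs):
    """Sparse TF-IDF vector as a dict, using a smoothed IDF (an assumption)."""
    tf = Counter(tokens)
    return {t: count * (math.log((1 + n_docs) / (1 + df.get(t, 0))) + 1.0)
            for t, count in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [["red", "apple"], ["green", "apple"], ["red", "car"]]
df = doc_freqs(docs)                                   # the "build" step
vecs = [tfidf_vector(d, df, len(docs)) for d in docs]  # the "preprocess" step
```

With these inputs, `cosine(vecs[1], vecs[2])` is 0.0 because the two token lists share nothing, while `cosine(vecs[0], vecs[1])` falls strictly between 0 and 1 because only "apple" is shared.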