active_matcher.feature package

Submodules

active_matcher.feature.feature module

class active_matcher.feature.feature.EditDistanceFeature(a_attr: str, b_attr: str)

Bases: Feature

Edit distance between two strings, case-insensitive

Attributes:
a_attr

the name of the attribute from table a used to generate this feature

b_attr

the name of the attribute from table b used to generate this feature

Methods

__call__(rec, recs)

compute the feature with A for each row in B; both A and B are already preprocessed

build(A, B, cache)

Guaranteed to be called before the feature's preprocessing is done.

preprocess(data, is_table_a)

preprocess the data, adding the output column to data

preprocess_output_column(for_table_a)

get the name of the preprocessing output column for table A or B

template
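
To illustrate what a case-insensitive edit distance computes, here is a minimal sketch of Levenshtein distance over lowercased strings. The function name edit_distance is hypothetical and not part of active_matcher; whether EditDistanceFeature returns a raw distance or a normalized score is not stated in this reference, so the raw distance is an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, ignoring case (illustrative only)."""
    a, b = a.lower(), b.lower()
    # prev[j] is the distance between the prefix of a processed so far and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep on a match
            ))
        prev = curr
    return prev[-1]
```

Because both inputs are lowercased first, edit_distance("abc", "ABC") is 0.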

class active_matcher.feature.feature.ExactMatchFeature(a_attr: str, b_attr: str)

Bases: Feature

Case-insensitive exact string match

Attributes:
a_attr

the name of the attribute from table a used to generate this feature

b_attr

the name of the attribute from table b used to generate this feature

Methods

__call__(rec, recs)

compute the feature with A for each row in B; both A and B are already preprocessed

build(A, B, cache)

Guaranteed to be called before the feature's preprocessing is done.

preprocess(data, is_table_a)

preprocess the data, adding the output column to data

preprocess_output_column(for_table_a)

get the name of the preprocessing output column for table A or B

template
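
A case-insensitive exact match can be sketched in a few lines. The helper name exact_match and the 1.0 / 0.0 return values are assumptions for illustration; this reference does not state the value type ExactMatchFeature produces.

```python
def exact_match(a: str, b: str) -> float:
    """Return 1.0 when a and b are equal ignoring case, else 0.0 (illustrative only)."""
    # casefold() is a more aggressive lower() intended for caseless comparison
    return 1.0 if a.casefold() == b.casefold() else 0.0
```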

class active_matcher.feature.feature.Feature(a_attr: str, b_attr: str)

Bases: ABC

Attributes:
a_attr

the name of the attribute from table a used to generate this feature

b_attr

the name of the attribute from table b used to generate this feature

Methods

__call__(A, B)

compute the feature with A for each row in B; both A and B are already preprocessed

build(A, B, cache)

Guaranteed to be called before the feature's preprocessing is done.

preprocess(data, is_table_a)

preprocess the data, adding the output column to data

preprocess_output_column(for_table_a)

get the name of the preprocessing output column for table A or B

template

property a_attr

the name of the attribute from table a used to generate this feature

property b_attr

the name of the attribute from table b used to generate this feature

build(A, B, cache)

Guaranteed to be called before the feature's preprocessing is done. This method should generate and store all of the metadata required to compute the features over A and B. Note that B may be None.

preprocess(data, is_table_a)

preprocess the data, adding the output column to data

preprocess_output_column(for_table_a: bool)

get the name of the preprocessing output column for table A or B
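
The methods above suggest a lifecycle of build, then preprocess, then __call__. The sketch below mirrors that order with a hypothetical stand-in class; it does not import active_matcher. Everything in it is an assumption made for illustration: the class names, the use of lists of dicts as tables, the output column naming, and the reading of rec as one record from table B with recs as candidate records from table A.

```python
from abc import ABC, abstractmethod


class SketchFeature(ABC):
    """Hypothetical stand-in that mirrors the method names documented above."""

    def __init__(self, a_attr: str, b_attr: str):
        self.a_attr = a_attr
        self.b_attr = b_attr

    def build(self, A, B, cache=None):
        # This sketch needs no metadata, so nothing is stored; B may be None.
        pass

    def preprocess_output_column(self, for_table_a: bool) -> str:
        attr = self.a_attr if for_table_a else self.b_attr
        return f"lower({attr})"  # column naming scheme is an assumption

    def preprocess(self, data, is_table_a: bool):
        # Add the lowercased attribute to every record under the output column.
        attr = self.a_attr if is_table_a else self.b_attr
        out = self.preprocess_output_column(is_table_a)
        for rec in data:
            rec[out] = str(rec[attr]).lower()
        return data

    @abstractmethod
    def __call__(self, rec, recs):
        ...


class SketchExactMatch(SketchFeature):
    def __call__(self, rec, recs):
        # Assumed roles: rec is one preprocessed row of B, recs are rows of A.
        b_val = rec[self.preprocess_output_column(False)]
        a_col = self.preprocess_output_column(True)
        return [1.0 if r[a_col] == b_val else 0.0 for r in recs]


table_a = [{"name": "Acme Corp"}, {"name": "Globex"}]
table_b = [{"title": "ACME CORP"}]

feature = SketchExactMatch("name", "title")
feature.build(table_a, table_b)                 # 1. build metadata
feature.preprocess(table_a, is_table_a=True)    # 2. preprocess both tables
feature.preprocess(table_b, is_table_a=False)
scores = feature(table_b[0], table_a)           # 3. compute the feature
```

In this sketch, scores holds one value per record in table_a.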

classmethod template(**kwargs)

class active_matcher.feature.feature.NeedlemanWunschFeature(a_attr: str, b_attr: str)

Bases: Feature

Needleman-Wunsch score between two strings, case-insensitive

Attributes:
a_attr

the name of the attribute from table a used to generate this feature

b_attr

the name of the attribute from table b used to generate this feature

Methods

__call__(rec, recs)

compute the feature with A for each row in B; both A and B are already preprocessed

build(A, B, cache)

Guaranteed to be called before the feature's preprocessing is done.

preprocess(data, is_table_a)

preprocess the data, adding the output column to data

preprocess_output_column(for_table_a)

get the name of the preprocessing output column for table A or B

template
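
For readers unfamiliar with Needleman-Wunsch, the following is a minimal sketch of the global alignment score over lowercased strings. The function name and the scoring parameters (match 1.0, mismatch 0.0, gap cost 1.0) are assumptions chosen for illustration; this reference does not state which parameters NeedlemanWunschFeature uses or whether it normalizes the result.

```python
def needleman_wunsch(a: str, b: str, match: float = 1.0,
                     mismatch: float = 0.0, gap: float = 1.0) -> float:
    """Global alignment score between a and b, ignoring case (illustrative only)."""
    a, b = a.lower(), b.lower()
    # prev[j] is the best score aligning the prefix of a processed so far with b[:j];
    # aligning an empty prefix with b[:j] costs j gaps.
    prev = [-gap * j for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [-gap * i]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(diag, prev[j] - gap, curr[j - 1] - gap))
        prev = curr
    return prev[-1]
```

Unlike a local alignment, the score covers both strings end to end, so unmatched leading or trailing characters are penalized.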

class active_matcher.feature.feature.RelDiffFeature(a_attr, b_attr)

Bases: Feature

relative difference between two values

Attributes:
a_attr

the name of the attribute from table a used to generate this feature

b_attr

the name of the attribute from table b used to generate this feature

Methods

__call__(rec, recs)

compute the feature with A for each row in B; both A and B are already preprocessed

build(A, B, cache)

Guaranteed to be called before the feature's preprocessing is done.

preprocess(data, is_table_a)

preprocess the data, adding the output column to data

preprocess_output_column(for_table_a)

get the name of the preprocessing output column for table A or B

template
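
"Relative difference" has several common definitions, and this reference does not say which one RelDiffFeature uses. The sketch below assumes one of them, the absolute difference divided by the larger magnitude; the function name rel_diff and the handling of two zeros are likewise assumptions.

```python
def rel_diff(a: float, b: float) -> float:
    """|a - b| / max(|a|, |b|), an assumed definition of relative difference."""
    denom = max(abs(a), abs(b))
    if denom == 0:
        return 0.0  # both values are zero, so they do not differ
    return abs(a - b) / denom
```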

class active_matcher.feature.feature.SmithWatermanFeature(a_attr: str, b_attr: str)

Bases: Feature

Smith-Waterman score between two strings, case-insensitive

Attributes:
a_attr

the name of the attribute from table a used to generate this feature

b_attr

the name of the attribute from table b used to generate this feature

Methods

__call__(rec, recs)

compute the feature with A for each row in B; both A and B are already preprocessed

build(A, B, cache)

Guaranteed to be called before the feature's preprocessing is done.

preprocess(data, is_table_a)

preprocess the data, adding the output column to data

preprocess_output_column(for_table_a)

get the name of the preprocessing output column for table A or B

template
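
Smith-Waterman scores the best local alignment, so it rewards a shared substring even when the surrounding text differs. The following is a minimal sketch over lowercased strings. The function name and the scoring parameters (match 1.0, mismatch 0.0, gap cost 1.0) are assumptions for illustration, not the values SmithWatermanFeature is documented to use.

```python
def smith_waterman(a: str, b: str, match: float = 1.0,
                   mismatch: float = 0.0, gap: float = 1.0) -> float:
    """Best local alignment score between a and b, ignoring case (illustrative only)."""
    a, b = a.lower(), b.lower()
    best = 0.0
    prev = [0.0] * (len(b) + 1)
    for ca in a:
        curr = [0.0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Clamping at 0.0 lets an alignment restart anywhere in either string.
            score = max(0.0, diag, prev[j] - gap, curr[j - 1] - gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best
```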

active_matcher.feature.token_feature module

class active_matcher.feature.token_feature.CosineFeature(a_attr, b_attr, tokenizer)

Bases: TokenFeature

Attributes:
a_attr

the name of the attribute from table a used to generate this feature

b_attr

the name of the attribute from table b used to generate this feature

Methods

__call__(rec, recs)

compute the feature with A for each row in B; both A and B are already preprocessed

build(A, B, cache)

Guaranteed to be called before the feature's preprocessing is done.

preprocess(data, is_table_a)

preprocess the data, adding the output column to data

preprocess_output_column(for_table_a)

get the name of the preprocessing output column for table A or B

sim_func(x, y)

function that takes in two sets of tokens and outputs a float

template

sim_func(x, y)

function that takes in two sets of tokens and outputs a float
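
For two sets of tokens, cosine similarity is commonly defined as the size of the intersection divided by the square root of the product of the set sizes. The sketch below assumes that set-based definition; the function name cosine_sim and the 0.0 result for an empty set are assumptions, not documented behavior of CosineFeature.

```python
import math


def cosine_sim(x: set, y: set) -> float:
    """Set-based cosine: |x & y| / sqrt(|x| * |y|) (illustrative only)."""
    if not x or not y:
        return 0.0  # assumed convention when either side has no tokens
    return len(x & y) / math.sqrt(len(x) * len(y))
```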

class active_matcher.feature.token_feature.JaccardFeature(a_attr, b_attr, tokenizer)

Bases: TokenFeature

Attributes:
a_attr

the name of the attribute from table a used to generate this feature

b_attr

the name of the attribute from table b used to generate this feature

Methods

__call__(rec, recs)

compute the feature with A for each row in B; both A and B are already preprocessed

build(A, B, cache)

Guaranteed to be called before the feature's preprocessing is done.

preprocess(data, is_table_a)

preprocess the data, adding the output column to data

preprocess_output_column(for_table_a)

get the name of the preprocessing output column for table A or B

sim_func(x, y)

function that takes in two sets of tokens and outputs a float

template

sim_func(x, y)

function that takes in two sets of tokens and outputs a float
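
Jaccard similarity of two token sets is the size of their intersection divided by the size of their union. A minimal sketch follows; the function name jaccard_sim and the 0.0 result when both sets are empty are assumptions for illustration.

```python
def jaccard_sim(x: set, y: set) -> float:
    """|x & y| / |x | y| for two token sets (illustrative only)."""
    union = x | y
    if not union:
        return 0.0  # assumed convention when both sets are empty
    return len(x & y) / len(union)
```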

class active_matcher.feature.token_feature.MongeElkanFeature(a_attr, b_attr, tokenizer)

Bases: TokenFeature

Monge-Elkan similarity with Jaro-Winkler as the inner similarity function

Attributes:
a_attr

the name of the attribute from table a used to generate this feature

b_attr

the name of the attribute from table b used to generate this feature

Methods

__call__(rec, recs)

compute the feature with A for each row in B; both A and B are already preprocessed

build(A, B, cache)

Guaranteed to be called before the feature's preprocessing is done.

preprocess(data, is_table_a)

preprocess the data, adding the output column to data

preprocess_output_column(for_table_a)

get the name of the preprocessing output column for table A or B

sim_func(x, y)

function that takes in two sets of tokens and outputs a float

template

sim_func(x, y)

function that takes in two sets of tokens and outputs a float
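
Monge-Elkan matches each token in x to its most similar token in y under an inner similarity and averages those best scores. The sketch below uses a plain-Python Jaro-Winkler as the inner function, in line with the class description. All function names, the prefix weight of 0.1, the prefix cap of 4 characters, and the 0.0 result for empty input are assumptions for illustration; the library's own implementation may differ.

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity between two strings."""
    if not s or not t:
        return 0.0
    if s == t:
        return 1.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_flags = [False] * len(s)
    t_flags = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        # Look for an unused equal character of t within the match window.
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_flags[j] and t[j] == ch:
                s_flags[i] = t_flags[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_matched = [c for c, f in zip(s, s_flags) if f]
    t_matched = [c for c, f in zip(t, t_flags) if f]
    # Matched characters that appear in a different order, counted in pairs.
    transpositions = sum(a != b for a, b in zip(s_matched, t_matched)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s: str, t: str, prefix_weight: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the common prefix (up to 4)."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1.0 - base)


def monge_elkan(x, y) -> float:
    """Average, over tokens in x, of the best Jaro-Winkler match in y."""
    if not x or not y:
        return 0.0  # assumed convention for empty input
    return sum(max(jaro_winkler(a, b) for b in y) for a in x) / len(x)
```

Note that this measure is not symmetric: monge_elkan(x, y) averages over the tokens of x, so swapping the arguments can change the result.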

class active_matcher.feature.token_feature.OverlapCoeffFeature(a_attr, b_attr, tokenizer)

Bases: TokenFeature

Attributes:
a_attr

the name of the attribute from table a used to generate this feature

b_attr

the name of the attribute from table b used to generate this feature

Methods

__call__(rec, recs)

compute the feature with A for each row in B; both A and B are already preprocessed

build(A, B, cache)

Guaranteed to be called before the feature's preprocessing is done.

preprocess(data, is_table_a)

preprocess the data, adding the output column to data

preprocess_output_column(for_table_a)

get the name of the preprocessing output column for table A or B

sim_func(x, y)

function that takes in two sets of tokens and outputs a float

template

sim_func(x, y)

function that takes in two sets of tokens and outputs a float
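
The overlap coefficient divides the size of the intersection by the size of the smaller set, so a set that is fully contained in the other scores 1.0. A minimal sketch follows; the function name overlap_coeff and the 0.0 result for an empty set are assumptions for illustration.

```python
def overlap_coeff(x: set, y: set) -> float:
    """|x & y| / min(|x|, |y|) for two token sets (illustrative only)."""
    if not x or not y:
        return 0.0  # assumed convention when either side has no tokens
    return len(x & y) / min(len(x), len(y))
```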

class active_matcher.feature.token_feature.TokenFeature(a_attr, b_attr, tokenizer)

Bases: Feature

Attributes:
a_attr

the name of the attribute from table a used to generate this feature

b_attr

the name of the attribute from table b used to generate this feature

Methods

__call__(rec, recs)

compute the feature with A for each row in B; both A and B are already preprocessed

build(A, B, cache)

Guaranteed to be called before the feature's preprocessing is done.

preprocess(data, is_table_a)

preprocess the data, adding the output column to data

preprocess_output_column(for_table_a)

get the name of the preprocessing output column for table A or B

sim_func(x, y)

function that takes in two sets of tokens and outputs a float

template

abstractmethod sim_func(x, y)

function that takes in two sets of tokens and outputs a float
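
Since sim_func is abstract, a new token-based feature is defined by supplying that one method. The sketch below shows the pattern with a hypothetical stand-in base class rather than the real TokenFeature. The class names, the score helper, and the assumption that the tokenizer is a callable from a string to a set of tokens are all illustrative; the tokenizer interface is not described in this reference.

```python
from abc import ABC, abstractmethod


class SketchTokenFeature(ABC):
    """Hypothetical stand-in for a token feature with an abstract sim_func."""

    def __init__(self, a_attr, b_attr, tokenizer):
        self.a_attr = a_attr
        self.b_attr = b_attr
        self.tokenizer = tokenizer  # assumed: callable mapping str -> set of tokens

    @abstractmethod
    def sim_func(self, x, y):
        """Take two sets of tokens and return a float."""

    def score(self, a_value: str, b_value: str) -> float:
        # Hypothetical helper: tokenize both values, then delegate to sim_func.
        return self.sim_func(self.tokenizer(a_value), self.tokenizer(b_value))


class SketchDiceFeature(SketchTokenFeature):
    def sim_func(self, x, y):
        # Dice coefficient: 2 * |x & y| / (|x| + |y|)
        if not x and not y:
            return 0.0
        return 2 * len(x & y) / (len(x) + len(y))


def whitespace_tokens(text: str) -> set:
    """Lowercase the text and split it on whitespace."""
    return set(text.lower().split())


feature = SketchDiceFeature("name", "title", whitespace_tokens)
score = feature.score("Acme Corp Inc", "acme corp")
```

Here the two values share two tokens out of five in total, so score is 2 * 2 / 5.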

active_matcher.feature.vector_feature module

class active_matcher.feature.vector_feature.DocFreqBuilder(a_attr, b_attr, tokenizer)

Bases: object

Methods

build

build(A, B)

class active_matcher.feature.vector_feature.SIFFeature(a_attr, b_attr, tokenizer)

Bases: TFIDFFeature

Attributes:
a_attr

the name of the attribute from table a used to generate this feature

b_attr

the name of the attribute from table b used to generate this feature

Methods

__call__(rec, recs)

compute the feature with A for each row in B; both A and B are already preprocessed

build(A, B, cache)

Guaranteed to be called before the feature's preprocessing is done.

preprocess(data, is_table_a)

preprocess the data, adding the output column to data

preprocess_output_column(for_table_a)

get the name of the preprocessing output column for table A or B

sim_func(x, y)

function that takes in two sets of tokens and outputs a float

template
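
SIF usually stands for smooth inverse frequency, a weighting in which a token with relative frequency p gets weight a / (a + p) for a small constant a. This reference does not describe how SIFFeature computes its weights, so the sketch below is only an assumption about the general technique. The function name sif_weights, the default a = 1e-3, and the use of token sets as documents are all illustrative.

```python
from collections import Counter


def sif_weights(docs, a: float = 1e-3) -> dict:
    """Map each token to a / (a + p), where p is its share of all token occurrences."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    return {tok: a / (a + n / total) for tok, n in counts.items()}
```

Frequent tokens end up with weights close to zero, while rare tokens keep larger weights, which is the same direction of effect as an inverse document frequency.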

class active_matcher.feature.vector_feature.TFIDFFeature(a_attr, b_attr, tokenizer)

Bases: TokenFeature

Attributes:
a_attr

the name of the attribute from table a used to generate this feature

b_attr

the name of the attribute from table b used to generate this feature

Methods

__call__(rec, recs)

compute the feature with A for each row in B; both A and B are already preprocessed

build(A, B, cache)

Guaranteed to be called before the feature's preprocessing is done.

preprocess(data, is_table_a)

preprocess the data, adding the output column to data

preprocess_output_column(for_table_a)

get the name of the preprocessing output column for table A or B

sim_func(x, y)

function that takes in two sets of tokens and outputs a float

template

build(A, B, cache)

Guaranteed to be called before the feature's preprocessing is done. This method should generate and store all of the metadata required to compute the features over A and B. Note that B may be None.

sim_func(x, y)

function that takes in two sets of tokens and outputs a float
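
The split between build and sim_func fits a TF-IDF feature naturally: document frequencies must be collected over the tables before any pair can be scored. The sketch below illustrates that split under stated assumptions. The function names, the IDF variant log(n / df) + 1, the treatment of each record as a set of tokens (so term frequency is always 1), and the optional second table are all illustrative and not taken from the TFIDFFeature implementation.

```python
import math
from collections import Counter


def build_idf(a_docs, b_docs=None) -> dict:
    """Compute inverse document frequencies over A and, when given, B."""
    docs = list(a_docs) + list(b_docs or [])  # B may be None
    df = Counter(tok for doc in docs for tok in set(doc))
    n = len(docs)
    return {tok: math.log(n / count) + 1.0 for tok, count in df.items()}


def tfidf_cosine(x: set, y: set, idf: dict) -> float:
    """Cosine similarity of two token sets, weighting each token by its IDF."""
    def norm(tokens):
        return math.sqrt(sum(idf.get(t, 0.0) ** 2 for t in tokens))

    nx, ny = norm(x), norm(y)
    if nx == 0.0 or ny == 0.0:
        return 0.0  # assumed convention for empty or unseen-only input
    dot = sum(idf.get(t, 0.0) ** 2 for t in x & y)
    return dot / (nx * ny)
```

Tokens that occur in many records receive a low weight, so a shared rare token raises the similarity more than a shared common one.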

Module contents