active_matcher package

Subpackages

Submodules

active_matcher.active_learning module

active_matcher.algorithms module

active_matcher.algorithms.down_sample(fvs, percent, score_column='score', search_id_column='id2', bucket_size=25000)

down sample fvs by score_column, producing fvs.count() * percent rows

Parameters:
fvs : pyspark.sql.DataFrame

the feature vectors to be down sampled

percent : float

the fraction of the vectors to be output, in the range (0.0, 1.0]

score_column : str or pyspark.sql.Column

the column used to score the vectors; scores should be positively correlated with the probability that the pair is a match

search_id_column : str or pyspark.sql.Column

the column used to hash the vectors into buckets; if fvs is the output of top-k blocking, this should be the id of the search record

bucket_size : int

the size of the buckets that the vectors will be hashed into for sampling
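
A minimal usage sketch (names are illustrative; it assumes fvs is a scored feature-vector DataFrame with columns 'score' and 'id2', e.g. the output of FVGenerator.generate_and_score_fvs):

    from active_matcher.algorithms import down_sample

    # keep roughly 10% of the candidate pairs, favoring high-scoring ones;
    # 'score' and 'id2' are assumed column names in fvs
    sampled_fvs = down_sample(fvs, percent=0.1, score_column='score',
                              search_id_column='id2', bucket_size=25000)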

active_matcher.algorithms.select_seeds(fvs, nseeds, labeler, score_column='score')

select nseeds seed vectors from fvs, using score_column to guide selection, and label them with labeler

Parameters:
fvs : pyspark.sql.DataFrame

the feature vectors from which the seeds will be selected

score_column : str or pyspark.sql.Column

the column used to score the vectors; scores should be positively correlated with the probability that the pair is a match

nseeds : int

the number of seeds to be selected

labeler : Labeler

the labeler that will be used to label the seeds
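
A hedged sketch of seed selection (it assumes labeler is an already-constructed Labeler instance; the Labeler class is not documented in this section):

    from active_matcher.algorithms import select_seeds

    # label an initial set of 50 seed pairs drawn from fvs using the 'score' column
    seeds = select_seeds(fvs, nseeds=50, labeler=labeler, score_column='score')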

active_matcher.example module

active_matcher.feature_selector module

class active_matcher.feature_selector.FeatureSelector(extra_features=False)

Bases: object

Methods

select_features(A, B[, null_threshold])

EXTRA_TOKENIZERS = [<active_matcher.tokenizer.tokenizer.AlphaNumericTokenizer object>, <active_matcher.tokenizer.tokenizer.QGramTokenizer object>, <active_matcher.tokenizer.tokenizer.StrippedQGramTokenizer object>, <active_matcher.tokenizer.tokenizer.StrippedQGramTokenizer object>]
EXTRA_TOKEN_FEATURES = []
TOKENIZERS = [<active_matcher.tokenizer.tokenizer.StrippedWhiteSpaceTokenizer object>, <active_matcher.tokenizer.tokenizer.NumericTokenizer object>, <active_matcher.tokenizer.tokenizer.QGramTokenizer object>]
TOKEN_FEATURES = [<class 'active_matcher.feature.vector_feature.TFIDFFeature'>, <class 'active_matcher.feature.token_feature.JaccardFeature'>, <class 'active_matcher.feature.vector_feature.SIFFeature'>, <class 'active_matcher.feature.token_feature.OverlapCoeffFeature'>, <class 'active_matcher.feature.token_feature.CosineFeature'>]
select_features(A, B, null_threshold=0.5)
Parameters:
A : pyspark.sql.DataFrame

the raw data that feature vectors will be generated for

B : pyspark.sql.DataFrame or None

the second table of raw data that feature vectors will be generated for

null_threshold : float

the fraction of values that must be null for a column to be dropped and not considered for feature generation
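
A short sketch of feature selection over two raw tables (assuming A and B are pyspark.sql.DataFrames of the records to be matched):

    from active_matcher.feature_selector import FeatureSelector

    selector = FeatureSelector(extra_features=False)
    # columns that are more than 50% null are dropped before features are chosen
    features = selector.select_features(A, B, null_threshold=0.5)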

active_matcher.fv_generator module

class active_matcher.fv_generator.BuildCache

Bases: object

Methods

add_or_get

clear

add_or_get(builder)
clear()
class active_matcher.fv_generator.FVGenerator(features, fill_na=None)

Bases: object

Attributes:
feature_names
features

Methods

build

generate_and_score_fvs

generate_fvs

release_resources

build(A, B=None)
property feature_names
property features
generate_and_score_fvs(pairs)
generate_fvs(pairs)
release_resources()
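
A sketch of feature-vector generation (it assumes features comes from FeatureSelector.select_features and pairs is a DataFrame of candidate record pairs, e.g. the output of blocking):

    from active_matcher.fv_generator import FVGenerator

    fv_gen = FVGenerator(features, fill_na=0.0)
    fv_gen.build(A, B)                          # precompute per-table state
    fvs = fv_gen.generate_and_score_fvs(pairs)  # feature vectors plus a score column (assumed)
    fv_gen.release_resources()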

active_matcher.ml_model module

class active_matcher.ml_model.MLModel

Bases: ABC

Attributes:
nan_fill
use_floats
use_vectors

Methods

entropy

params_dict

predict

prediction_conf

prep_fvs

train

abstractmethod entropy(df, vector_col: str, output_col: str)
abstract property nan_fill
abstractmethod params_dict() dict
abstractmethod predict(df, vector_col: str, output_col: str)
abstractmethod prediction_conf(df, vector_col: str, label_column: str)
prep_fvs(fvs, feature_col='features')
abstractmethod train(df, vector_col: str, label_column: str)
abstract property use_floats
abstract property use_vectors
class active_matcher.ml_model.SKLearnModel(model, nan_fill=None, use_floats=True, **model_args)

Bases: MLModel

Attributes:
nan_fill
use_floats
use_vectors

Methods

cross_val_scores

entropy

get_model

params_dict

predict

prediction_conf

prep_fvs

train

cross_val_scores(df, vector_col: str, label_column: str)
entropy(df, vector_col: str, output_col: str)
get_model()
property nan_fill
params_dict()
predict(df, vector_col: str, output_col: str)
prediction_conf(df, vector_col: str, output_col: str)
train(df, vector_col: str, label_column: str)
property use_floats
property use_vectors
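
A sketch of wrapping a scikit-learn estimator (assuming labeled_fvs has a 'features' vector column and a 'label' column, and that the wrapper takes the estimator class plus its constructor arguments; both are assumptions, not documented here):

    from sklearn.ensemble import RandomForestClassifier
    from active_matcher.ml_model import SKLearnModel

    model = SKLearnModel(RandomForestClassifier, nan_fill=0.0, n_estimators=100)
    model.train(labeled_fvs, vector_col='features', label_column='label')
    preds = model.predict(fvs, vector_col='features', output_col='prediction')
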
class active_matcher.ml_model.SparkMLModel(model, nan_fill=0.0, **model_args)

Bases: MLModel

Attributes:
nan_fill
use_floats
use_vectors

Methods

entropy

get_model

params_dict

predict

prediction_conf

prep_fvs

train

entropy(df, vector_col: str, output_col: str)
get_model()
property nan_fill
params_dict()
predict(df, vector_col: str, output_col: str)
prediction_conf(df, vector_col: str, output_col: str)
train(df, vector_col: str, label_column: str)
property use_floats
property use_vectors
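
The analogous sketch for a Spark ML estimator, under the same assumptions about column names and the estimator-class calling convention:

    from pyspark.ml.classification import GBTClassifier
    from active_matcher.ml_model import SparkMLModel

    model = SparkMLModel(GBTClassifier, nan_fill=0.0, maxIter=20)
    model.train(labeled_fvs, vector_col='features', label_column='label')
    scored = model.entropy(fvs, vector_col='features', output_col='entropy')
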
active_matcher.ml_model.convert_to_array(df, col)
active_matcher.ml_model.convert_to_vector(df, col)

active_matcher.storage module

class active_matcher.storage.DistributableHashMap(arr)

Bases: object

Attributes:
on_spark

Methods

init

to_spark

init()
property on_spark
to_spark()
class active_matcher.storage.LongIntHashMap(arr)

Bases: DistributableHashMap

Attributes:
on_spark

Methods

build

init

to_spark

classmethod build(longs, ints, load_factor=0.75)
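
A hedged sketch of building a long-to-int hash map (it assumes build accepts parallel numpy arrays of keys and values; the accepted types are not documented here):

    import numpy as np
    from active_matcher.storage import LongIntHashMap

    keys = np.array([101, 202, 303], dtype=np.int64)
    vals = np.array([0, 1, 2], dtype=np.int32)
    hmap = LongIntHashMap.build(keys, vals, load_factor=0.75)
    hmap.to_spark()  # assumed: make the map readable from Spark workers
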
class active_matcher.storage.MemmapArray(arr)

Bases: object

Attributes:
shape
values

Methods

delete

init

to_spark

delete()
init()
property shape
to_spark()
property values
class active_matcher.storage.MemmapDataFrame

Bases: object

Methods

compress

decompress

delete

fetch

from_spark_df

init

to_spark

write_chunk

static compress(o)
static decompress(o)
delete()
fetch(ids)
classmethod from_spark_df(spark_df, pickle_column, stored_columns, id_col='_id')
init()
to_spark()
write_chunk(fd, _id, pic)
class active_matcher.storage.SqliteDataFrame

Bases: object

Methods

fetch

from_spark_df

init

to_spark

fetch(ids)
classmethod from_spark_df(spark_df, pickle_column, stored_columns, id_col='_id')
init()
to_spark()
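
A hedged sketch of spilling a Spark DataFrame into a local SQLite-backed store and fetching rows by id (it assumes pickle_column names a column holding pickled row payloads and that fetch returns the stored rows for the given ids; both are assumptions):

    from active_matcher.storage import SqliteDataFrame

    store = SqliteDataFrame.from_spark_df(df, pickle_column='pickled',
                                          stored_columns=['name', 'price'],
                                          id_col='_id')
    rows = store.fetch([1, 2, 3])
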
class active_matcher.storage.SqliteDict

Bases: object

Methods

deinit

from_dict

init

to_spark

deinit()
classmethod from_dict(d)
init()
to_spark()
active_matcher.storage.hash_map_get_key(arr, key)
active_matcher.storage.hash_map_get_keys(arr, keys)
active_matcher.storage.hash_map_insert_key(arr, key, val)
active_matcher.storage.hash_map_insert_keys(arr, keys, vals)
active_matcher.storage.spark_to_pandas_stream(df, chunk_size)

active_matcher.utils module

class active_matcher.utils.PerfectHashFunction(seed=None)

Bases: object

Methods

create_for_keys

hash

classmethod create_for_keys(keys)
hash(s)
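
A sketch of building a perfect hash function over a fixed key set (assuming create_for_keys returns a PerfectHashFunction instance for those keys):

    from active_matcher.utils import PerfectHashFunction

    phf = PerfectHashFunction.create_for_keys(['apple', 'banana', 'cherry'])
    code = phf.hash('banana')  # collision-free over the keys it was built for
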
class active_matcher.utils.SparseVec(size, indexes, values)

Bases: object

Attributes:
indexes
values

Methods

dot

dot(other)
property indexes
property values
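
A sketch of the sparse-vector dot product (assuming indexes and values are parallel arrays over a vector of the given size):

    import numpy as np
    from active_matcher.utils import SparseVec

    a = SparseVec(5, np.array([0, 2]), np.array([1.0, 3.0]))
    b = SparseVec(5, np.array([2, 4]), np.array([2.0, 1.0]))
    sim = a.dot(b)  # 6.0 -- only index 2 overlaps
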
active_matcher.utils.get_logger(name, level=10)

Get the logger for a module

Returns:
Logger
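
For example (level=10 is logging.DEBUG):

    from active_matcher.utils import get_logger

    log = get_logger(__name__)
    log.debug('feature generation started')
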
active_matcher.utils.is_null(o)

check if the object is null; this exists to work around the inconsistent behavior of np.isnan and pd.isnull

active_matcher.utils.is_persisted(df)

check if the pyspark dataframe is persisted

active_matcher.utils.persisted(df, storage_level=StorageLevel(True, True, False, False, 1))

context manager for persisting a dataframe in a with statement; the dataframe is automatically unpersisted at the end of the context
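
A sketch of the intended usage (assuming the context manager yields the persisted dataframe):

    from active_matcher.utils import persisted

    with persisted(df) as pdf:
        n = pdf.count()  # df stays cached for the duration of the block
    # the dataframe is unpersisted automatically here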

active_matcher.utils.repartition_df(df, part_size, by=None)

repartition the dataframe into chunks of size ‘part_size’ by column ‘by’
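
For example (assuming part_size is a target row count per partition):

    from active_matcher.utils import repartition_df

    df = repartition_df(df, part_size=50000, by='_id')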

active_matcher.utils.type_check(var, var_name, expected)

type checking utility; raises a TypeError if var is not of the expected type

active_matcher.utils.type_check_iterable(var, var_name, expected_var_type, expected_element_type)

type checking utility for iterables; raises a TypeError if var is not of the expected type or any of its elements is not of the expected element type
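
For example:

    from active_matcher.utils import type_check, type_check_iterable

    type_check(0.25, 'percent', float)                      # ok
    type_check_iterable(['a', 'b'], 'columns', list, str)   # ok
    type_check(0.25, 'percent', int)                        # raises TypeError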

Module contents