active_matcher package

Subpackages

Submodules

active_matcher.active_learning module

active_matcher.algorithms module

active_matcher.algorithms.down_sample(fvs, percent, score_column='score', search_id_column='id2', bucket_size=25000)

down sample fvs by score_column, producing fvs.count() * percent rows

Parameters:
fvs : pyspark.sql.DataFrame

the feature vectors to be down sampled

percent : float

the fraction of the vectors to be output, in the range (0.0, 1.0]

score_column : str or pyspark.sql.Column

the column used to score the vectors; scores should be positively correlated with the probability that the pair is a match

search_id_column : str or pyspark.sql.Column

the column used to hash the vectors into buckets; if fvs is the output of top-k blocking, this should be the id of the search record

bucket_size : int

the size of the buckets that the vectors will be hashed into for sampling
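
A minimal usage sketch (names are illustrative; it assumes fvs is a scored feature-vector DataFrame with columns 'score' and 'id2', e.g. the output of FVGenerator.generate_and_score_fvs):

    from active_matcher.algorithms import down_sample

    # keep roughly 10% of the candidate pairs, favoring high-scoring ones;
    # 'score' and 'id2' are assumed column names in fvs
    sampled_fvs = down_sample(fvs, percent=0.1, score_column='score',
                              search_id_column='id2', bucket_size=25000)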

active_matcher.algorithms.select_seeds(fvs, nseeds, labeler, score_column='score')

select nseeds seed vectors from fvs, using score_column to guide selection, and label them with labeler

Parameters:
fvs : pyspark.sql.DataFrame

the feature vectors from which the seeds will be selected

score_column : str or pyspark.sql.Column

the column used to score the vectors; scores should be positively correlated with the probability that the pair is a match

nseeds : int

the number of seeds to be selected

labeler : Labeler

the labeler that will be used to label the seeds
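
A hedged sketch of seed selection (it assumes labeler is an already-constructed Labeler instance; the Labeler class is not documented in this section):

    from active_matcher.algorithms import select_seeds

    # label an initial set of 50 seed pairs drawn from fvs using the 'score' column
    seeds = select_seeds(fvs, nseeds=50, labeler=labeler, score_column='score')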

active_matcher.example module

active_matcher.feature_selector module

class active_matcher.feature_selector.FeatureSelector(extra_features=False)

Bases: object

Methods

select_features(A, B[, null_threshold])

EXTRA_TOKENIZERS = [<active_matcher.tokenizer.tokenizer.AlphaNumericTokenizer object>, <active_matcher.tokenizer.tokenizer.QGramTokenizer object>, <active_matcher.tokenizer.tokenizer.StrippedQGramTokenizer object>, <active_matcher.tokenizer.tokenizer.StrippedQGramTokenizer object>]
EXTRA_TOKEN_FEATURES = []
TOKENIZERS = [<active_matcher.tokenizer.tokenizer.StrippedWhiteSpaceTokenizer object>, <active_matcher.tokenizer.tokenizer.NumericTokenizer object>, <active_matcher.tokenizer.tokenizer.QGramTokenizer object>]
TOKEN_FEATURES = [<class 'active_matcher.feature.vector_feature.TFIDFFeature'>, <class 'active_matcher.feature.token_feature.JaccardFeature'>, <class 'active_matcher.feature.vector_feature.SIFFeature'>, <class 'active_matcher.feature.token_feature.OverlapCoeffFeature'>, <class 'active_matcher.feature.token_feature.CosineFeature'>]
select_features(A, B, null_threshold=0.5)
Parameters:
A : pyspark.sql.DataFrame

the raw data that feature vectors will be generated for

B : pyspark.sql.DataFrame or None

the second table of raw data that feature vectors will be generated for

null_threshold : float

the fraction of values that must be null for a column to be dropped and not considered for feature generation
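
A short sketch of feature selection over two raw tables (assuming A and B are pyspark.sql.DataFrames of the records to be matched):

    from active_matcher.feature_selector import FeatureSelector

    selector = FeatureSelector(extra_features=False)
    # columns that are more than 50% null are dropped before features are chosen
    features = selector.select_features(A, B, null_threshold=0.5)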

active_matcher.fv_generator module

class active_matcher.fv_generator.BuildCache

Bases: object

Methods

add_or_get

clear

add_or_get(builder)
clear()
class active_matcher.fv_generator.FVGenerator(features, fill_na=None)

Bases: object

Attributes:
feature_names
features

Methods

build

generate_and_score_fvs

generate_fvs

release_resources

build(A, B=None)
property feature_names
property features
generate_and_score_fvs(pairs)
generate_fvs(pairs)
release_resources()
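
A sketch of feature-vector generation (it assumes features comes from FeatureSelector.select_features and pairs is a DataFrame of candidate record pairs, e.g. the output of blocking):

    from active_matcher.fv_generator import FVGenerator

    fv_gen = FVGenerator(features, fill_na=0.0)
    fv_gen.build(A, B)                          # precompute per-table state
    fvs = fv_gen.generate_and_score_fvs(pairs)  # feature vectors plus a score column (assumed)
    fv_gen.release_resources()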

active_matcher.ml_model module

class active_matcher.ml_model.MLModel

Bases: ABC

Attributes:
nan_fill
use_floats
use_vectors

Methods

entropy

params_dict

predict

prediction_conf

prep_fvs

train

abstractmethod entropy(df, vector_col: str, output_col: str)
abstract property nan_fill
abstractmethod params_dict() dict
abstractmethod predict(df, vector_col: str, output_col: str)
abstractmethod prediction_conf(df, vector_col: str, label_column: str)
prep_fvs(fvs, feature_col='features')
abstractmethod train(df, vector_col: str, label_column: str)
abstract property use_floats
abstract property use_vectors
class active_matcher.ml_model.SKLearnModel(model, nan_fill=None, use_floats=True, **model_args)

Bases: MLModel

Attributes:
nan_fill
use_floats
use_vectors

Methods

cross_val_scores

entropy

get_model

params_dict

predict

prediction_conf

prep_fvs

train

cross_val_scores(df, vector_col: str, label_column: str)
entropy(df, vector_col: str, output_col: str)
get_model()
property nan_fill
params_dict()
predict(df, vector_col: str, output_col: str)
prediction_conf(df, vector_col: str, output_col: str)
train(df, vector_col: str, label_column: str)
property use_floats
property use_vectors
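
A sketch of wrapping a scikit-learn estimator (assuming labeled_fvs has a 'features' vector column and a 'label' column, and that the wrapper takes the estimator class plus its constructor arguments; both are assumptions, not documented here):

    from sklearn.ensemble import RandomForestClassifier
    from active_matcher.ml_model import SKLearnModel

    model = SKLearnModel(RandomForestClassifier, nan_fill=0.0, n_estimators=100)
    model.train(labeled_fvs, vector_col='features', label_column='label')
    preds = model.predict(fvs, vector_col='features', output_col='prediction')
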
class active_matcher.ml_model.SparkMLModel(model, nan_fill=0.0, **model_args)

Bases: MLModel

Attributes:
nan_fill
use_floats
use_vectors

Methods

entropy

get_model

params_dict

predict

prediction_conf

prep_fvs

train

entropy(df, vector_col: str, output_col: str)
get_model()
property nan_fill
params_dict()
predict(df, vector_col: str, output_col: str)
prediction_conf(df, vector_col: str, output_col: str)
train(df, vector_col: str, label_column: str)
property use_floats
property use_vectors
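
The analogous sketch for a Spark ML estimator, under the same assumptions about column names and the estimator-class calling convention:

    from pyspark.ml.classification import GBTClassifier
    from active_matcher.ml_model import SparkMLModel

    model = SparkMLModel(GBTClassifier, nan_fill=0.0, maxIter=20)
    model.train(labeled_fvs, vector_col='features', label_column='label')
    scored = model.entropy(fvs, vector_col='features', output_col='entropy')
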
active_matcher.ml_model.convert_to_array(df, col)
active_matcher.ml_model.convert_to_vector(df, col)

active_matcher.storage module

class active_matcher.storage.DistributableHashMap(arr)

Bases: object

Attributes:
on_spark

Methods

init

to_spark

init()
property on_spark
to_spark()
class active_matcher.storage.LongIntHashMap(arr)

Bases: DistributableHashMap

Attributes:
on_spark

Methods

build

init

to_spark

classmethod build(longs, ints, load_factor=0.75)
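
A hedged sketch of building a long-to-int hash map (it assumes build accepts parallel numpy arrays of keys and values; the accepted types are not documented here):

    import numpy as np
    from active_matcher.storage import LongIntHashMap

    keys = np.array([101, 202, 303], dtype=np.int64)
    vals = np.array([0, 1, 2], dtype=np.int32)
    hmap = LongIntHashMap.build(keys, vals, load_factor=0.75)
    hmap.to_spark()  # assumed: make the map readable from Spark workers
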
class active_matcher.storage.MemmapArray(arr)

Bases: object

Attributes:
shape
values

Methods

delete

init

to_spark

delete()
init()
property shape
to_spark()
property values
class active_matcher.storage.MemmapDataFrame

Bases: object

Methods

compress

decompress

delete

fetch

from_spark_df

init

to_spark

write_chunk

static compress(o)
static decompress(o)
delete()
fetch(ids)
classmethod from_spark_df(spark_df, pickle_column, stored_columns, id_col='_id')
init()
to_spark()
write_chunk(fd, _id, pic)
class active_matcher.storage.SqliteDataFrame

Bases: object

Methods

fetch

from_spark_df

init

to_spark

fetch(ids)
classmethod from_spark_df(spark_df, pickle_column, stored_columns, id_col='_id')
init()
to_spark()
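
A hedged sketch of spilling a Spark DataFrame into a local SQLite-backed store and fetching rows by id (it assumes pickle_column names a column holding pickled row payloads and that fetch returns the stored rows for the given ids; both are assumptions):

    from active_matcher.storage import SqliteDataFrame

    store = SqliteDataFrame.from_spark_df(df, pickle_column='pickled',
                                          stored_columns=['name', 'price'],
                                          id_col='_id')
    rows = store.fetch([1, 2, 3])
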
class active_matcher.storage.SqliteDict

Bases: object

Methods

deinit

from_dict

init

to_spark

deinit()
classmethod from_dict(d)
init()
to_spark()
active_matcher.storage.hash_map_get_key(arr, key)
active_matcher.storage.hash_map_get_keys(arr, keys)
active_matcher.storage.hash_map_insert_key(arr, key, val)
active_matcher.storage.hash_map_insert_keys(arr, keys, vals)
active_matcher.storage.spark_to_pandas_stream(df, chunk_size)

active_matcher.utils module

class active_matcher.utils.PerfectHashFunction(seed=None)

Bases: object

Methods

create_for_keys

hash

classmethod create_for_keys(keys)
hash(s)
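
A sketch of building a perfect hash function over a fixed key set (assuming create_for_keys returns a PerfectHashFunction instance for those keys):

    from active_matcher.utils import PerfectHashFunction

    phf = PerfectHashFunction.create_for_keys(['apple', 'banana', 'cherry'])
    code = phf.hash('banana')  # collision-free over the keys it was built for
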
class active_matcher.utils.SparseVec(size, indexes, values)

Bases: object

Attributes:
indexes
values

Methods

dot

dot(other)
property indexes
property values
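
A sketch of the sparse-vector dot product (assuming indexes and values are parallel arrays over a vector of the given size):

    import numpy as np
    from active_matcher.utils import SparseVec

    a = SparseVec(5, np.array([0, 2]), np.array([1.0, 3.0]))
    b = SparseVec(5, np.array([2, 4]), np.array([2.0, 1.0]))
    sim = a.dot(b)  # 6.0 -- only index 2 overlaps
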
active_matcher.utils.get_logger(name, level=10)

Get the logger for a module

Returns:
Logger
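
For example (level=10 is logging.DEBUG):

    from active_matcher.utils import get_logger

    log = get_logger(__name__)
    log.debug('feature generation started')
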
active_matcher.utils.is_null(o)

check if the object is null; this exists to work around the inconsistent behavior of np.isnan and pd.isnull

active_matcher.utils.is_persisted(df)

check if the pyspark dataframe is persisted

active_matcher.utils.persisted(df, storage_level=StorageLevel(True, True, False, False, 1))

context manager for persisting a dataframe in a with statement; the dataframe is automatically unpersisted at the end of the context
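
A sketch of the intended usage (assuming the context manager yields the persisted dataframe):

    from active_matcher.utils import persisted

    with persisted(df) as pdf:
        n = pdf.count()  # df stays cached for the duration of the block
    # the dataframe is unpersisted automatically here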

active_matcher.utils.repartition_df(df, part_size, by=None)

repartition the dataframe into chunks of size ‘part_size’ by column ‘by’
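
For example (assuming part_size is a target row count per partition):

    from active_matcher.utils import repartition_df

    df = repartition_df(df, part_size=50000, by='_id')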

active_matcher.utils.type_check(var, var_name, expected)

type checking utility; raises a TypeError if var is not of the expected type

active_matcher.utils.type_check_iterable(var, var_name, expected_var_type, expected_element_type)

type checking utility for iterables; raises a TypeError if var is not of the expected type or any of its elements is not of the expected element type
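
For example:

    from active_matcher.utils import type_check, type_check_iterable

    type_check(0.25, 'percent', float)                      # ok
    type_check_iterable(['a', 'b'], 'columns', list, str)   # ok
    type_check(0.25, 'percent', int)                        # raises TypeError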

Module contents