active_matcher package
Subpackages
- active_matcher.feature package
- active_matcher.tokenizer package
Submodules
active_matcher.active_learning module
active_matcher.algorithms module
- active_matcher.algorithms.down_sample(fvs, percent, score_column='score', search_id_column='id2', bucket_size=25000)
Down-sample fvs by score_column, producing fvs.count() * percent rows (usage sketch below).
- Parameters:
- fvs : pyspark.sql.DataFrame
the feature vectors to be down-sampled
- percent : float
the portion of the vectors to be output, in (0.0, 1.0]
- score_column : str or pyspark.sql.Column
the column that will be used to score the vectors; should be positively correlated with the probability of the pair being a match
- search_id_column : str or pyspark.sql.Column
the column that will be used to hash the vectors into buckets; if fvs is the output of top-k blocking, this should be the id of the search record
- bucket_size : int
the size of the buckets that the vectors will be hashed into for sampling
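A minimal usage sketch that mirrors the signature above; fvs is assumed to be a feature-vector DataFrame whose 'score' and 'id2' columns match the defaults:

```python
from active_matcher.algorithms import down_sample

# Keep roughly 10% of the feature vectors, preferring high-scoring ones.
# 'fvs' is an assumed pyspark.sql.DataFrame with 'score' and 'id2' columns.
sampled = down_sample(fvs, percent=0.10, score_column='score',
                      search_id_column='id2', bucket_size=25000)
```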
- active_matcher.algorithms.select_seeds(fvs, nseeds, labeler, score_column='score')
select nseeds seed vectors from fvs and label them with labeler, using score_column to guide selection (usage sketch below)
- Parameters:
- fvs : pyspark.sql.DataFrame
the feature vectors from which the seeds will be selected
- score_column : str or pyspark.sql.Column
the column that will be used to score the vectors; should be positively correlated with the probability of the pair being a match
- nseeds : int
the number of seeds to be selected
- labeler : Labeler
the labeler that will be used to label the seeds
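A hedged sketch of seed selection; this listing only says a Labeler is required, so the labeler instance below is an assumed placeholder for whatever concrete labeler the package provides:

```python
from active_matcher.algorithms import select_seeds

# 'labeler' is assumed to implement the package's Labeler interface.
seeds = select_seeds(fvs, nseeds=50, labeler=labeler, score_column='score')
```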
active_matcher.example module
active_matcher.feature_selector module
- class active_matcher.feature_selector.FeatureSelector(extra_features=False)
Bases:
object
Methods
select_features(A, B[, null_threshold])
- EXTRA_TOKENIZERS = [<active_matcher.tokenizer.tokenizer.AlphaNumericTokenizer object>, <active_matcher.tokenizer.tokenizer.QGramTokenizer object>, <active_matcher.tokenizer.tokenizer.StrippedQGramTokenizer object>, <active_matcher.tokenizer.tokenizer.StrippedQGramTokenizer object>]
- EXTRA_TOKEN_FEATURES = []
- TOKENIZERS = [<active_matcher.tokenizer.tokenizer.StrippedWhiteSpaceTokenizer object>, <active_matcher.tokenizer.tokenizer.NumericTokenizer object>, <active_matcher.tokenizer.tokenizer.QGramTokenizer object>]
- TOKEN_FEATURES = [<class 'active_matcher.feature.vector_feature.TFIDFFeature'>, <class 'active_matcher.feature.token_feature.JaccardFeature'>, <class 'active_matcher.feature.vector_feature.SIFFeature'>, <class 'active_matcher.feature.token_feature.OverlapCoeffFeature'>, <class 'active_matcher.feature.token_feature.CosineFeature'>]
- select_features(A, B, null_threshold=0.5)
- Parameters:
- A : pyspark.sql.DataFrame
the raw data that feature vectors will be generated for
- B : pyspark.sql.DataFrame or None
the raw data that feature vectors will be generated for
- null_threshold : float
the portion of values that must be null in order for a column to be dropped and not considered for feature generation
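A minimal usage sketch, assuming A and B are raw pyspark.sql.DataFrames of records to be matched:

```python
from active_matcher.feature_selector import FeatureSelector

# A and B are assumed raw-record DataFrames.
selector = FeatureSelector(extra_features=False)
# Columns that are more than 50% null are dropped from consideration.
features = selector.select_features(A, B, null_threshold=0.5)
```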
active_matcher.fv_generator module
- class active_matcher.fv_generator.BuildCache
Bases:
object
Methods
add_or_get
clear
- add_or_get(builder)
- clear()
- class active_matcher.fv_generator.FVGenerator(features, fill_na=None)
Bases:
object
- Attributes:
- feature_names
- features
Methods
build
generate_and_score_fvs
generate_fvs
release_resources
- build(A, B=None)
- property feature_names
- property features
- generate_and_score_fvs(pairs)
- generate_fvs(pairs)
- release_resources()
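A sketch of the apparent call sequence inferred from the method listing (build, then generate_fvs, then release_resources); 'features' is assumed to come from FeatureSelector.select_features and 'pairs' is an assumed DataFrame of candidate record pairs:

```python
from active_matcher.fv_generator import FVGenerator

fv_gen = FVGenerator(features, fill_na=0.0)
fv_gen.build(A, B)                # precompute per-table state for the features
fvs = fv_gen.generate_fvs(pairs)  # one feature vector per candidate pair
fv_gen.release_resources()        # free build-time resources when done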
active_matcher.ml_model module
- class active_matcher.ml_model.MLModel
Bases:
ABC
- Attributes:
- nan_fill
- use_floats
- use_vectors
Methods
entropy
params_dict
predict
prediction_conf
prep_fvs
train
- abstractmethod entropy(df, vector_col: str, output_col: str)
- abstract property nan_fill
- abstractmethod params_dict() → dict
- abstractmethod predict(df, vector_col: str, output_col: str)
- abstractmethod prediction_conf(df, vector_col: str, label_column: str)
- prep_fvs(fvs, feature_col='features')
- abstractmethod train(df, vector_col: str, label_column: str)
- abstract property use_floats
- abstract property use_vectors
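Since MLModel is an ABC, a custom backend subclasses it and fills in the abstract methods and properties listed above. A minimal skeleton under those assumptions (the bodies here are placeholders, not the library's behavior):

```python
from active_matcher.ml_model import MLModel

class MyModel(MLModel):
    # Abstract properties from the listing above; values are placeholders.
    @property
    def nan_fill(self):
        return 0.0

    @property
    def use_floats(self):
        return True

    @property
    def use_vectors(self):
        return False

    # Abstract methods from the listing above; bodies are placeholders.
    def train(self, df, vector_col: str, label_column: str):
        raise NotImplementedError

    def predict(self, df, vector_col: str, output_col: str):
        raise NotImplementedError

    def prediction_conf(self, df, vector_col: str, label_column: str):
        raise NotImplementedError

    def entropy(self, df, vector_col: str, output_col: str):
        raise NotImplementedError

    def params_dict(self) -> dict:
        return {}
```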
- class active_matcher.ml_model.SKLearnModel(model, nan_fill=None, use_floats=True, **model_args)
Bases:
MLModel
- Attributes:
- nan_fill
- use_floats
- use_vectors
Methods
cross_val_scores
entropy
get_model
params_dict
predict
prediction_conf
prep_fvs
train
- cross_val_scores(df, vector_col: str, label_column: str)
- entropy(df, vector_col: str, output_col: str)
- get_model()
- property nan_fill
- params_dict()
- predict(df, vector_col: str, output_col: str)
- prediction_conf(df, vector_col: str, output_col: str)
- train(df, vector_col: str, label_column: str)
- property use_floats
- property use_vectors
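A hedged sketch of wrapping a scikit-learn classifier. Whether model expects the estimator class or an instance is not stated in this listing; a class is assumed here because **model_args looks like constructor arguments:

```python
from sklearn.ensemble import RandomForestClassifier
from active_matcher.ml_model import SKLearnModel

# Assumption: the estimator class is passed, and n_estimators is forwarded
# to its constructor via **model_args. Column names are illustrative.
model = SKLearnModel(RandomForestClassifier, nan_fill=0.0, n_estimators=100)
model.train(fvs, vector_col='features', label_column='label')
scored = model.predict(fvs, vector_col='features', output_col='prediction')
```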
- class active_matcher.ml_model.SparkMLModel(model, nan_fill=0.0, **model_args)
Bases:
MLModel
- Attributes:
- nan_fill
- use_floats
- use_vectors
Methods
entropy
get_model
params_dict
predict
prediction_conf
prep_fvs
train
- entropy(df, vector_col: str, output_col: str)
- get_model()
- property nan_fill
- params_dict()
- predict(df, vector_col: str, output_col: str)
- prediction_conf(df, vector_col: str, output_col: str)
- train(df, vector_col: str, label_column: str)
- property use_floats
- property use_vectors
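The same hedged pattern with a Spark ML estimator; as above, passing the model class and forwarding constructor arguments via **model_args is an assumption, not confirmed by this listing:

```python
from pyspark.ml.classification import RandomForestClassifier
from active_matcher.ml_model import SparkMLModel

model = SparkMLModel(RandomForestClassifier, nan_fill=0.0, numTrees=100)
model.train(fvs, vector_col='features', label_column='label')
scored = model.predict(fvs, vector_col='features', output_col='prediction')
```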
- active_matcher.ml_model.convert_to_array(df, col)
- active_matcher.ml_model.convert_to_vector(df, col)
active_matcher.storage module
- class active_matcher.storage.DistributableHashMap(arr)
Bases:
object
- Attributes:
- on_spark
Methods
init
to_spark
- init()
- property on_spark
- to_spark()
- class active_matcher.storage.LongIntHashMap(arr)
Bases:
DistributableHashMap
- Attributes:
- on_spark
Methods
build
init
to_spark
- classmethod build(longs, ints, load_factor=0.75)
- class active_matcher.storage.MemmapArray(arr)
Bases:
object
- Attributes:
- shape
- values
Methods
delete
init
to_spark
- delete()
- init()
- property shape
- to_spark()
- property values
- class active_matcher.storage.MemmapDataFrame
Bases:
object
Methods
compress
decompress
delete
fetch
from_spark_df
init
to_spark
write_chunk
- static compress(o)
- static decompress(o)
- delete()
- fetch(ids)
- classmethod from_spark_df(spark_df, pickle_column, stored_columns, id_col='_id')
- init()
- to_spark()
- write_chunk(fd, _id, pic)
- class active_matcher.storage.SqliteDataFrame
Bases:
object
Methods
fetch
from_spark_df
init
to_spark
- fetch(ids)
- classmethod from_spark_df(spark_df, pickle_column, stored_columns, id_col='_id')
- init()
- to_spark()
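A sketch of the apparent round-trip, mirroring the from_spark_df and fetch signatures; the semantics of pickle_column are not documented here, and the column names are placeholders. MemmapDataFrame shares the same from_spark_df/fetch interface:

```python
from active_matcher.storage import SqliteDataFrame

# Store selected columns of a Spark DataFrame in SQLite, keyed by '_id',
# then fetch rows back by id.
sdf = SqliteDataFrame.from_spark_df(
    spark_df,
    pickle_column='pickled_row',      # assumed column of pickled row data
    stored_columns=['name', 'price'],
    id_col='_id',
)
rows = sdf.fetch([0, 1, 2])
```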
- class active_matcher.storage.SqliteDict
Bases:
object
Methods
deinit
from_dict
init
to_spark
- deinit()
- classmethod from_dict(d)
- init()
- to_spark()
- active_matcher.storage.hash_map_get_key(arr, key)
- active_matcher.storage.hash_map_get_keys(arr, keys)
- active_matcher.storage.hash_map_insert_key(arr, key, val)
- active_matcher.storage.hash_map_insert_keys(arr, keys, vals)
- active_matcher.storage.spark_to_pandas_stream(df, chunk_size)
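A hedged sketch of spark_to_pandas_stream, assuming (from the name and signature) that it yields pandas DataFrames of chunk_size rows each:

```python
from active_matcher.storage import spark_to_pandas_stream

# Iterate over the Spark DataFrame as pandas chunks of 10,000 rows.
for pdf in spark_to_pandas_stream(df, chunk_size=10000):
    process(pdf)  # 'process' is a placeholder for per-chunk work
```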
active_matcher.utils module
- class active_matcher.utils.PerfectHashFunction(seed=None)
Bases:
object
Methods
create_for_keys
hash
- classmethod create_for_keys(keys)
- hash(s)
- class active_matcher.utils.SparseVec(size, indexes, values)
Bases:
object
- Attributes:
- indexes
- values
Methods
dot
- dot(other)
- property indexes
- property values
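A small sketch of SparseVec, assuming indexes and values are parallel arrays describing the nonzero entries:

```python
from active_matcher.utils import SparseVec

# Two sparse vectors of dimension 5; only the listed indexes are nonzero.
a = SparseVec(5, [0, 3], [1.0, 2.0])
b = SparseVec(5, [3, 4], [4.0, 1.0])
print(a.dot(b))  # only index 3 overlaps, so the expected dot product is 8.0
```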
- active_matcher.utils.get_logger(name, level=10)
Get the logger for a module
- Returns:
- Logger
- active_matcher.utils.is_null(o)
Check whether the object is null; this exists to work around the inconsistent behavior of np.isnan and pd.isnull.
- active_matcher.utils.is_persisted(df)
Check whether the PySpark DataFrame is persisted.
- active_matcher.utils.persisted(df, storage_level=StorageLevel(True, True, False, False, 1))
Context manager for persisting a DataFrame in a with statement; the DataFrame is automatically unpersisted at the end of the context (usage sketch below).
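A minimal sketch, assuming the context manager yields the persisted DataFrame:

```python
from active_matcher.utils import persisted

# df stays persisted for the duration of the block and is unpersisted on exit.
with persisted(df) as pdf:
    n = pdf.count()
    sample = pdf.limit(10).collect()
```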
- active_matcher.utils.repartition_df(df, part_size, by=None)
Repartition the DataFrame into chunks of part_size rows, optionally by column by.
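A one-line usage sketch; the part size and key column are illustrative:

```python
from active_matcher.utils import repartition_df

# Repartition so each partition holds about 50,000 rows, keyed by '_id'.
df = repartition_df(df, part_size=50000, by='_id')
```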
- active_matcher.utils.type_check(var, var_name, expected)
Type-checking utility; raises a TypeError if var is not the expected type.
- active_matcher.utils.type_check_iterable(var, var_name, expected_var_type, expected_element_type)
Type-checking utility for iterables; raises a TypeError if var is not the expected type or any of its elements is not the expected element type.
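A short sketch of both checks; the variable names are illustrative:

```python
from active_matcher.utils import type_check, type_check_iterable

def validate(percent, ids):
    # Raises TypeError unless percent is a float.
    type_check(percent, 'percent', float)
    # Raises TypeError unless ids is a list and every element is an int.
    type_check_iterable(ids, 'ids', list, int)
```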