sparkly.index_optimizer package

Submodules

sparkly.index_optimizer.index_optimizer module

class sparkly.index_optimizer.index_optimizer.IndexOptimizer(is_dedupe: bool, scorer: QueryScorer | None = None, conf: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Lt(lt=1.0)])] = 0.99, init_top_k: int = 10, max_combination_size: int = 3, opt_query_limit: int = 250, sample_size: int = 10000, use_early_pruning: bool = True)

Bases: object

A class for optimizing the search columns and analyzers used by indexes.

Attributes:

index

Methods

`make_index_config`(df[, id_col])	Create the starting index config, which can then be used for optimization. Throws out any column where the average number of whitespace-delimited tokens is >= 50.
`optimize`(index, search_df)

property index

make_index_config(df: DataFrame, id_col='_id') → IndexConfig

Create the starting index config, which can then be used for optimization. Throws out any column where the average number of whitespace-delimited tokens is >= 50.

Parameters:

df (pyspark.sql.DataFrame): the dataframe that we want to generate a config for
id_col (str): the unique id column for the records in the dataframe
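
The column filter described above can be illustrated without Spark. The following is a minimal sketch on plain Python rows, not sparkly's implementation: the helper names `average_token_count` and `columns_to_keep` are hypothetical, and skipping null values when averaging is an assumption.

```python
from statistics import mean

TOKEN_LIMIT = 50  # threshold quoted in the description above


def average_token_count(values):
    """Mean number of whitespace-delimited tokens over the non-null values."""
    # Assumption: null values are skipped rather than counted as zero tokens.
    counts = [len(str(v).split()) for v in values if v is not None]
    return mean(counts) if counts else 0.0


def columns_to_keep(rows, id_col="_id", limit=TOKEN_LIMIT):
    """Return the non-id columns whose average token count is below `limit`."""
    columns = [c for c in rows[0] if c != id_col]
    return [
        c for c in columns
        if average_token_count([r.get(c) for r in rows]) < limit
    ]


# Toy records: `description` averages 70 tokens, `title` averages 3.
rows = [
    {"_id": 1, "title": "red running shoes", "description": "word " * 80},
    {"_id": 2, "title": "blue hiking boots", "description": "word " * 60},
]
kept = columns_to_keep(rows)
print(kept)
```

Here only `title` survives, because `description` averages at least 50 tokens per record.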

optimize(index: Index, search_df: DataFrame) → QuerySpec

Parameters:

index (Index): the index that will have an optimized query spec created for it
search_df (pyspark.sql.DataFrame): the records that will be used to choose the query spec

Returns:

QuerySpec: a query spec optimized for searching for search_df using index

sparkly.index_optimizer.query_scorer module

class sparkly.index_optimizer.query_scorer.AUCQueryScorer

Bases: QueryScorer

Methods

score_query_result
score_query_results

score_query_result(query_result, query_spec, drop_first) → float

score_query_results(query_results, query_spec, drop_first) → list

class sparkly.index_optimizer.query_scorer.QueryScorer

Bases: ABC

Methods

score_query_results(query_results, query_spec)

score_query_result

abstractmethod score_query_result(query_result, query_spec) → float

abstractmethod score_query_results(query_results, query_spec) → list
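
Since both methods are abstract, a custom scorer has to implement each of them. The sketch below shows the pattern with a local stand-in for `QueryScorer` so that it runs without sparkly installed. `TopGapScorer` is hypothetical, and treating a query result as a plain list of float scores is an assumption made for illustration; the structure sparkly actually passes in is not documented here.

```python
from abc import ABC, abstractmethod


class QueryScorer(ABC):
    """Local stand-in mirroring the two abstract methods documented above."""

    @abstractmethod
    def score_query_result(self, query_result, query_spec) -> float: ...

    @abstractmethod
    def score_query_results(self, query_results, query_spec) -> list: ...


class TopGapScorer(QueryScorer):
    """Hypothetical scorer: gap between the best and second-best score."""

    def score_query_result(self, query_result, query_spec) -> float:
        # Assumption: query_result is an iterable of numeric match scores.
        ranked = sorted(query_result, reverse=True)
        if len(ranked) < 2:
            return 0.0
        return float(ranked[0] - ranked[1])

    def score_query_results(self, query_results, query_spec) -> list:
        # Score each result independently with the single-result method.
        return [self.score_query_result(r, query_spec) for r in query_results]


scorer = TopGapScorer()
scores = scorer.score_query_results([[9.0, 4.0, 1.0], [3.0]], query_spec=None)
print(scores)
```

With the real package, the subclass would derive from `sparkly.index_optimizer.query_scorer.QueryScorer` and could then be passed as the `scorer` argument of `IndexOptimizer`.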

class sparkly.index_optimizer.query_scorer.RankQueryScorer(threshold, k)

Bases: QueryScorer

Methods

score_query_result
score_query_results

score_query_result(query_result, query_spec) → float

score_query_results(query_results, query_spec) → list

sparkly.index_optimizer.query_scorer.compute_wilcoxon_score(x, y)

sparkly.index_optimizer.query_scorer.score_query_result(scores, drop_first=False)

sparkly.index_optimizer.query_scorer.score_query_result_sum(scores)

sparkly.index_optimizer.query_scorer.score_query_results(query_results)
