delex.lang.predicate package

Submodules

delex.lang.predicate.bootleg_predicate module

class delex.lang.predicate.bootleg_predicate.BootlegPredicate(index_col: str, search_col: str, invert: bool = False)

Bases: ThresholdPredicate

an experimental user-defined predicate for demonstration. In particular, it does some simple preprocessing of person names to make exact match more liberal by handling name variations
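The exact preprocessing is not documented here. As a purely hypothetical illustration of the kind of normalization that makes exact match more liberal for person names, one might lowercase, collapse whitespace, and drop single-letter middle initials (none of this is taken from the delex source):

```python
import re

def normalize_name(name: str) -> str:
    """Hypothetical person-name normalization; the real BootlegPredicate
    preprocessing is not specified in this documentation."""
    name = name.strip().lower()
    # collapse runs of whitespace into a single space
    name = re.sub(r"\s+", " ", name)
    # drop single-letter middle initials like "q." in "john q. smith"
    name = re.sub(r"\b[a-z]\.?\s", "", name)
    return name

# Under this normalization, common variants of a name compare equal:
normalize_name("John Q. Smith") == normalize_name("john  smith")  # True
```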

Attributes:
index_col
indexable

True if the predicate can be efficiently indexed

invertable
is_topk

True if self is top-k based, else False

op
search_col
sim

The similarity used by the predicate

streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

val

Methods

build(for_search, index_table[, ...])

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

contains(other)

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

filter(itr)

perform filter_batch for each batch in itr

filter_batch(queries, id1_lists)

filter each id_list in id1_lists using this predicate.

index_component_sizes(for_search)

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

index_size_in_bytes()

return the total size in bytes of all the files associated with this predicate

init()

initialize the predicate for searching or filtering

search(itr)

perform search_batch for each batch in itr

search_batch(queries)

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

compute_scores

search_index

build(for_search, index_table, index_id_col='_id', cache=None)

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

Parameters:
for_search : bool

build the predicate for searching, otherwise streaming / filtering

index_table : pyspark.sql.DataFrame

the dataframe that will be preprocessed / indexed

index_id_col : str

the name of the unique id column in index_table

cache : Optional[BuildCache] = None

the cache for built indexes and hash tables

compute_scores(query: str, id1_list)
contains(other)

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

property index_col
index_component_sizes(for_search: bool) dict

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

Parameters:
for_search : bool

return the sizes for searching or for filtering

Returns:
dict[Any, int | None]
index_size_in_bytes() int

return the total size in bytes of all the files associated with this predicate

property indexable

True if the predicate can be efficiently indexed

init()

initialize the predicate for searching or filtering

property is_topk

True if self is top-k based, else False

property op
property search_col
search_index(query)
property sim

The similarity used by the predicate

property streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

property val
class delex.lang.predicate.bootleg_predicate.BootlegSim(index_col: str, search_col: str, invert: bool)

Bases: object

index_col: str
invert: bool
search_col: str
class delex.lang.predicate.bootleg_predicate.CachedNameIndexKey(index_col: str, lowercase: bool)

Bases: CachedObjectKey

index_col: str
lowercase: bool
class delex.lang.predicate.bootleg_predicate.CachedNamesKey(index_col: str)

Bases: CachedObjectKey

index_col: str

delex.lang.predicate.exact_match_predicate module

class delex.lang.predicate.exact_match_predicate.ExactMatchPredicate(index_col: str, search_col: str, invert: bool, lowercase: bool = False)

Bases: ThresholdPredicate

an exact match predicate, i.e. returns 1.0 if x == y, else 0.0
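The similarity this encodes can be sketched directly, with lowercase and invert applied as the constructor parameters suggest (a toy rendering of the documented behavior, not the delex code):

```python
def exact_match_sim(x: str, y: str, lowercase: bool = False, invert: bool = False) -> float:
    """Exact match: 1.0 if the strings are equal, else 0.0.
    lowercase compares case-insensitively; invert flips the result."""
    if lowercase:
        x, y = x.lower(), y.lower()
    matched = 1.0 if x == y else 0.0
    return 1.0 - matched if invert else matched

exact_match_sim("Apple", "apple", lowercase=True)   # 1.0
exact_match_sim("Apple", "apple", lowercase=False)  # 0.0
```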

Attributes:
index_col
indexable

True if the predicate can be efficiently indexed

invertable
is_topk

True if self is top-k based, else False

op
search_col
sim

The similarity used by the predicate

streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

val

Methods

build(for_search, index_table[, ...])

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

contains(other)

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

filter(itr)

perform filter_batch for each batch in itr

filter_batch(queries, id1_lists)

filter each id_list in id1_lists using this predicate.

index_component_sizes(for_search)

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

index_size_in_bytes()

return the total size in bytes of all the files associated with this predicate

init()

initialize the predicate for searching or filtering

search(itr)

perform search_batch for each batch in itr

search_batch(queries)

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

Sim

compute_scores

search_index

class Sim(index_col: str, search_col: str, invert: bool, lowercase: bool)

Bases: object

index_col: str
invert: bool
lowercase: bool
search_col: str
build(for_search: bool, index_table: DataFrame, index_id_col: str = '_id', cache: BuildCache = None)

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

Parameters:
for_search : bool

build the predicate for searching, otherwise streaming / filtering

index_table : pyspark.sql.DataFrame

the dataframe that will be preprocessed / indexed

index_id_col : str

the name of the unique id column in index_table

cache : Optional[BuildCache] = None

the cache for built indexes and hash tables

compute_scores(query: str | int, id1_list) ndarray
contains(other: Predicate) bool

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

property index_col
index_component_sizes(for_search: bool) dict

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

Parameters:
for_search : bool

return the sizes for searching or for filtering

Returns:
dict[Any, int | None]
index_size_in_bytes() int

return the total size in bytes of all the files associated with this predicate

property indexable

True if the predicate can be efficiently indexed

init()

initialize the predicate for searching or filtering

property is_topk

True if self is top-k based, else False

property op
property search_col
search_index(query) ndarray
property sim

The similarity used by the predicate

property streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

property val

delex.lang.predicate.name_map module

delex.lang.predicate.predicate module

class delex.lang.predicate.predicate.Predicate

Bases: ABC

abstract base class for all Predicates used in writing blocking programs

Attributes:
indexable

True if the predicate can be efficiently indexed

is_topk

True if self is top-k based, else False

sim

The similarity used by the predicate

streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

Methods

build(for_search, index_table[, ...])

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

contains(other)

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

filter(itr)

perform filter_batch for each batch in itr

filter_batch(queries, id1_lists)

filter each id_list in id1_lists using this predicate.

index_component_sizes(for_search)

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

index_size_in_bytes()

return the total size in bytes of all the files associated with this predicate

init()

initialize the predicate for searching or filtering

search(itr)

perform search_batch for each batch in itr

search_batch(queries)

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

abstractmethod build(for_search: bool, index_table: DataFrame, index_id_col: str = '_id', cache: BuildCache | None = None)

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

Parameters:
for_search : bool

build the predicate for searching, otherwise streaming / filtering

index_table : pyspark.sql.DataFrame

the dataframe that will be preprocessed / indexed

index_id_col : str

the name of the unique id column in index_table

cache : Optional[BuildCache] = None

the cache for built indexes and hash tables

abstractmethod contains(other) bool

True if the set output by self is a superset (non-strict) of other

abstractmethod deinit()

release the resources acquired by self.init()

filter(itr: Iterator[Tuple[Series, Series]]) Iterator[DataFrame]

perform filter_batch for each batch in itr

abstractmethod filter_batch(queries: Series, id1_lists: Series) DataFrame

filter each id_list in id1_lists using this predicate. That is, for each (query, id_list) pair in zip(queries, id1_lists), return only the ids which satisfy predicate(query, id) for id in id_list. Return a dataframe with schema (ids array<long>, scores array<float>, time float)
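The per-batch semantics can be sketched in plain Python (a toy analogue using lists instead of pandas Series, with an arbitrary score function and a >= threshold standing in for the predicate; not the actual delex implementation):

```python
from typing import Callable, List, Tuple

def filter_batch_sketch(
    queries: List[str],
    id1_lists: List[List[int]],
    score: Callable[[str, int], float],
    threshold: float,
) -> List[Tuple[List[int], List[float]]]:
    """For each (query, id_list) pair, keep only the ids whose score
    satisfies the predicate (here: score >= threshold)."""
    out = []
    for query, id_list in zip(queries, id1_lists):
        scored = [(i, score(query, i)) for i in id_list]
        kept = [(i, s) for i, s in scored if s >= threshold]
        out.append(([i for i, _ in kept], [s for _, s in kept]))
    return out
```

The real method additionally records per-row timing and returns the result as a dataframe with schema (ids array<long>, scores array<float>, time float).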

abstractmethod index_component_sizes(for_search: bool) dict

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

Parameters:
for_search : bool

return the sizes for searching or for filtering

Returns:
dict[Any, int | None]
abstractmethod index_size_in_bytes() int

return the total size in bytes of all the files associated with this predicate

abstract property indexable

True if the predicate can be efficiently indexed

abstractmethod init()

initialize the predicate for searching or filtering

abstract property is_topk: bool

True if self is top-k based, else False

search(itr: Iterator[Series]) Iterator[DataFrame]

perform search_batch for each batch in itr

abstractmethod search_batch(queries: Series) DataFrame

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

abstract property sim

The similarity used by the predicate

abstract property streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

delex.lang.predicate.set_sim_predicate module

class delex.lang.predicate.set_sim_predicate.CosinePredicate(index_col: str, search_col: str, tokenizer, op, val: float)

Bases: SetSimPredicate

Attributes:
index_col
indexable

True if the predicate can be efficiently indexed

invertable
is_topk

True if self is top-k based, else False

op
search_col
sim

The similarity used by the predicate

streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

val

Methods

build(for_search, index_table[, ...])

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

contains(other)

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

filter(itr)

perform filter_batch for each batch in itr

filter_batch(queries, id1_lists)

filter each id_list in id1_lists using this predicate.

index_component_sizes(for_search)

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

index_size_in_bytes()

return the total size in bytes of all the files associated with this predicate

init()

initialize the predicate for searching or filtering

search(itr)

perform search_batch for each batch in itr

search_batch(queries)

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

Sim

compute_scores

invert

search_index

compute_scores(query, id1_list)
class delex.lang.predicate.set_sim_predicate.JaccardPredicate(index_col: str, search_col: str, tokenizer, op, val: float)

Bases: SetSimPredicate

Attributes:
index_col
indexable

True if the predicate can be efficiently indexed

invertable
is_topk

True if self is top-k based, else False

op
search_col
sim

The similarity used by the predicate

streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

val

Methods

build(for_search, index_table[, ...])

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

contains(other)

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

filter(itr)

perform filter_batch for each batch in itr

filter_batch(queries, id1_lists)

filter each id_list in id1_lists using this predicate.

index_component_sizes(for_search)

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

index_size_in_bytes()

return the total size in bytes of all the files associated with this predicate

init()

initialize the predicate for searching or filtering

search(itr)

perform search_batch for each batch in itr

search_batch(queries)

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

Sim

compute_scores

invert

search_index

compute_scores(query, id1_list)
class delex.lang.predicate.set_sim_predicate.OverlapCoeffPredicate(index_col: str, search_col: str, tokenizer, op, val: float)

Bases: SetSimPredicate

Attributes:
index_col
indexable

True if the predicate can be efficiently indexed

invertable
is_topk

True if self is top-k based, else False

op
search_col
sim

The similarity used by the predicate

streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

val

Methods

build(for_search, index_table[, ...])

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

contains(other)

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

filter(itr)

perform filter_batch for each batch in itr

filter_batch(queries, id1_lists)

filter each id_list in id1_lists using this predicate.

index_component_sizes(for_search)

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

index_size_in_bytes()

return the total size in bytes of all the files associated with this predicate

init()

initialize the predicate for searching or filtering

search(itr)

perform search_batch for each batch in itr

search_batch(queries)

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

Sim

compute_scores

invert

search_index

compute_scores(query, id1_list)
property indexable

True if the predicate can be efficiently indexed

class delex.lang.predicate.set_sim_predicate.SetSimPredicate(index_col: str, search_col: str, tokenizer, op, val: float)

Bases: ThresholdPredicate

Attributes:
index_col
indexable

True if the predicate can be efficiently indexed

invertable
is_topk

True if self is top-k based, else False

op
search_col
sim

The similarity used by the predicate

streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

val

Methods

build(for_search, index_table[, ...])

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

contains(other)

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

filter(itr)

perform filter_batch for each batch in itr

filter_batch(queries, id1_lists)

filter each id_list in id1_lists using this predicate.

index_component_sizes(for_search)

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

index_size_in_bytes()

return the total size in bytes of all the files associated with this predicate

init()

initialize the predicate for searching or filtering

search(itr)

perform search_batch for each batch in itr

search_batch(queries)

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

Sim

compute_scores

invert

search_index

class Sim(index_col: str, search_col: str, sim_name: str, tokenizer_name: str)

Bases: object

index_col: str
search_col: str
sim_name: str
tokenizer_name: str
build(for_search, index_table, index_id_col='_id', cache=None)

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

Parameters:
for_search : bool

build the predicate for searching, otherwise streaming / filtering

index_table : pyspark.sql.DataFrame

the dataframe that will be preprocessed / indexed

index_id_col : str

the name of the unique id column in index_table

cache : Optional[BuildCache] = None

the cache for built indexes and hash tables

contains(other)

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

index_component_sizes(for_search: bool) dict

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

Parameters:
for_search : bool

return the sizes for searching or for filtering

Returns:
dict[Any, int | None]
index_size_in_bytes() int

return the total size in bytes of all the files associated with this predicate

property indexable

True if the predicate can be efficiently indexed

init()

initialize the predicate for searching or filtering

invert()
property is_topk

True if self is top-k based, else False

search_index(query)
property sim

The similarity used by the predicate

property streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False
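The set similarities behind JaccardPredicate, CosinePredicate, and OverlapCoeffPredicate can be sketched with their standard definitions over token sets (assuming the tokenizer turns each string into a set of tokens; whether delex uses exactly these set-based formulas is an assumption based on the class names):

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|"""
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(a: set, b: set) -> float:
    """Set-based cosine: |A ∩ B| / sqrt(|A| * |B|)"""
    return len(a & b) / (len(a) * len(b)) ** 0.5 if a and b else 0.0

def overlap_coeff(a: set, b: set) -> float:
    """|A ∩ B| / min(|A|, |B|)"""
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0

a, b = {"data", "base", "systems"}, {"data", "systems"}
# jaccard(a, b) == 2/3; overlap_coeff(a, b) == 1.0
```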

delex.lang.predicate.string_sim_predicate module

class delex.lang.predicate.string_sim_predicate.EditDistancePredicate(index_col: str, search_col: str, op, val)

Bases: StringSimPredicate

Attributes:
index_col
indexable

True if the predicate can be efficiently indexed

invertable
is_topk

True if self is top-k based, else False

op
search_col
sim

The similarity used by the predicate

streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

val

Methods

build(for_search, index_table[, ...])

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

contains(other)

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

filter(itr)

perform filter_batch for each batch in itr

filter_batch(queries, id1_lists)

filter each id_list in id1_lists using this predicate.

index_component_sizes(for_search)

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

index_size_in_bytes()

return the total size in bytes of all the files associated with this predicate

init()

initialize the predicate for searching or filtering

search(itr)

perform search_batch for each batch in itr

search_batch(queries)

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

Sim

compute_scores

invert

search_index
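EditDistancePredicate is presumably backed by Levenshtein edit distance. For reference, a minimal dynamic-programming implementation of that measure (whether delex thresholds the raw distance or a normalized similarity is not stated here):

```python
def levenshtein(s: str, t: str) -> int:
    """Classic DP edit distance: minimum number of insertions,
    deletions, and substitutions turning s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (cs != ct)))    # substitution
        prev = cur
    return prev[-1]

# levenshtein("kitten", "sitting") == 3
```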

class delex.lang.predicate.string_sim_predicate.JaroPredicate(index_col: str, search_col: str, op, val)

Bases: StringSimPredicate

Attributes:
index_col
indexable

True if the predicate can be efficiently indexed

invertable
is_topk

True if self is top-k based, else False

op
search_col
sim

The similarity used by the predicate

streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

val

Methods

build(for_search, index_table[, ...])

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

contains(other)

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

filter(itr)

perform filter_batch for each batch in itr

filter_batch(queries, id1_lists)

filter each id_list in id1_lists using this predicate.

index_component_sizes(for_search)

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

index_size_in_bytes()

return the total size in bytes of all the files associated with this predicate

init()

initialize the predicate for searching or filtering

search(itr)

perform search_batch for each batch in itr

search_batch(queries)

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

Sim

compute_scores

invert

search_index

class delex.lang.predicate.string_sim_predicate.JaroWinklerPredicate(index_col: str, search_col: str, op, val, prefix_weight=0.1)

Bases: StringSimPredicate

Attributes:
index_col
indexable

True if the predicate can be efficiently indexed

invertable
is_topk

True if self is top-k based, else False

op
search_col
sim

The similarity used by the predicate

streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

val

Methods

build(for_search, index_table[, ...])

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

contains(o)

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

filter(itr)

perform filter_batch for each batch in itr

filter_batch(queries, id1_lists)

filter each id_list in id1_lists using this predicate.

index_component_sizes(for_search)

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

index_size_in_bytes()

return the total size in bytes of all the files associated with this predicate

init()

initialize the predicate for searching or filtering

search(itr)

perform search_batch for each batch in itr

search_batch(queries)

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

Sim

compute_scores

invert

search_index

class Sim(index_col: str, search_col: str, sim_name: str, prefix_weight: float)

Bases: Sim

prefix_weight: float
contains(o)

True if the set output by self is a superset (non-strict) of other
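The prefix_weight parameter (default 0.1 per the signature above) is the Winkler adjustment factor. In the standard Jaro-Winkler formula, a precomputed Jaro score is boosted by the length of the common prefix, conventionally capped at 4 characters (a sketch of the standard formula, not the delex internals):

```python
def jaro_winkler(jaro: float, common_prefix_len: int, prefix_weight: float = 0.1) -> float:
    """Standard Winkler adjustment applied to a precomputed Jaro score."""
    l = min(common_prefix_len, 4)  # prefix contribution is capped at 4 chars
    return jaro + l * prefix_weight * (1.0 - jaro)

# With jaro == 0.9 and a 4-char common prefix: 0.9 + 4 * 0.1 * 0.1 == 0.94
```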

class delex.lang.predicate.string_sim_predicate.SmithWatermanPredicate(index_col: str, search_col: str, op, val, gap_cost=1.0)

Bases: StringSimPredicate

Attributes:
index_col
indexable

True if the predicate can be efficiently indexed

invertable
is_topk

True if self is top-k based, else False

op
search_col
sim

The similarity used by the predicate

streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

val

Methods

build(for_search, index_table[, ...])

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

contains(o)

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

filter(itr)

perform filter_batch for each batch in itr

filter_batch(queries, id1_lists)

filter each id_list in id1_lists using this predicate.

index_component_sizes(for_search)

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

index_size_in_bytes()

return the total size in bytes of all the files associated with this predicate

init()

initialize the predicate for searching or filtering

search(itr)

perform search_batch for each batch in itr

search_batch(queries)

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

Sim

compute_scores

invert

search_index

class Sim(index_col: str, search_col: str, sim_name: str, gap_cost: float)

Bases: Sim

gap_cost: float
contains(o)

True if the set output by self is a superset (non-strict) of other

class delex.lang.predicate.string_sim_predicate.StringSimPredicate(index_col: str, search_col: str, op, val)

Bases: ThresholdPredicate

Attributes:
index_col
indexable

True if the predicate can be efficiently indexed

invertable
is_topk

True if self is top-k based, else False

op
search_col
sim

The similarity used by the predicate

streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

val

Methods

build(for_search, index_table[, ...])

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

contains(other)

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

filter(itr)

perform filter_batch for each batch in itr

filter_batch(queries, id1_lists)

filter each id_list in id1_lists using this predicate.

index_component_sizes(for_search)

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

index_size_in_bytes()

return the total size in bytes of all the files associated with this predicate

init()

initialize the predicate for searching or filtering

search(itr)

perform search_batch for each batch in itr

search_batch(queries)

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

Sim

compute_scores

invert

search_index

class Sim(index_col: str, search_col: str, sim_name: str)

Bases: object

index_col: str
search_col: str
sim_name: str
build(for_search, index_table, index_id_col='_id', cache=None)

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

Parameters:
for_search : bool

build the predicate for searching, otherwise streaming / filtering

index_table : pyspark.sql.DataFrame

the dataframe that will be preprocessed / indexed

index_id_col : str

the name of the unique id column in index_table

cache : Optional[BuildCache] = None

the cache for built indexes and hash tables

compute_scores(query: str, id1_list)
deinit()

release the resources acquired by self.init()

index_component_sizes(for_search: bool) dict

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

Parameters:
for_search : bool

return the sizes for searching or for filtering

Returns:
dict[Any, int | None]
index_size_in_bytes() int

return the total size in bytes of all the files associated with this predicate

property indexable

True if the predicate can be efficiently indexed

init()

initialize the predicate for searching or filtering

invert()
property is_topk

True if self is top-k based, else False

search(itr)

perform search_batch for each batch in itr

search_index(query)
property sim

The similarity used by the predicate

property streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

delex.lang.predicate.threshold_predicate module

class delex.lang.predicate.threshold_predicate.ThresholdPredicate(index_col, search_col, op, val: float)

Bases: Predicate, ABC

Attributes:
index_col
indexable

True if the predicate can be efficiently indexed

invertable
is_topk

True if self is top-k based, else False

op
search_col
sim

The similarity used by the predicate

streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

val

Methods

build(for_search, index_table[, ...])

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

contains(other)

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

filter(itr)

perform filter_batch for each batch in itr

filter_batch(queries, id1_lists)

filter each id_list in id1_lists using this predicate.

index_component_sizes(for_search)

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

index_size_in_bytes()

return the total size in bytes of all the files associated with this predicate

init()

initialize the predicate for searching or filtering

search(itr)

perform search_batch for each batch in itr

search_batch(queries)

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

compute_scores

search_index

abstractmethod compute_scores(query, id1_list)
contains(other) bool

True if the set output by self is a superset (non-strict) of other

filter_batch(queries: Series, id1_lists: Series) Iterator[DataFrame]

filter each id_list in id1_lists using this predicate. That is, for each (query, id_list) pair in zip(queries, id1_lists), return only the ids which satisfy predicate(query, id) for id in id_list. Return a dataframe with schema (ids array<long>, scores array<float>, time float)

property index_col
property invertable: bool
property op: Callable
search_batch(queries: Series) DataFrame

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

property search_col
abstractmethod search_index(query)
property val: float
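Putting op and val together: a ThresholdPredicate presumably holds for a pair when op(similarity, val) is true, with op a comparison callable such as operator.ge (a sketch of the documented op/val semantics, not the delex internals):

```python
import operator

def threshold_holds(sim_score: float, op, val: float) -> bool:
    """A threshold predicate accepts a pair when op(similarity, threshold) is true."""
    return op(sim_score, val)

threshold_holds(0.85, operator.ge, 0.8)  # True
threshold_holds(0.75, operator.ge, 0.8)  # False
```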

delex.lang.predicate.topk_predicate module

class delex.lang.predicate.topk_predicate.BM25TopkPredicate(index_col, search_col, tokenizer: str, k: int)

Bases: Predicate

Attributes:
index_col
indexable

True if the predicate can be efficiently indexed

invertable
is_topk

True if self is top-k based, else False

k
search_col
sim

The similarity used by the predicate

streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

Methods

build(for_search, index_table[, ...])

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

contains(other)

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

filter(itr)

perform filter_batch for each batch in itr

filter_batch(queries, id1_lists)

filter each id_list in id1_lists using this predicate.

index_component_sizes(for_search)

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

index_size_in_bytes()

return the total size in bytes of all the files associated with this predicate

init()

initialize the predicate for searching or filtering

search(itr)

perform search_batch for each batch in itr

search_batch(queries)

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

Sim

class Sim(index_col: str, search_col: str, tokenizer_name: str)

Bases: object

index_col: str
search_col: str
tokenizer_name: str
build(for_search, index_table, index_id_col='_id', cache=None)

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

Parameters:
for_search : bool

build the predicate for searching, otherwise streaming / filtering

index_table : pyspark.sql.DataFrame

the dataframe that will be preprocessed / indexed

index_id_col : str

the name of the unique id column in index_table

cache : Optional[BuildCache] = None

the cache for built indexes and hash tables

contains(other) bool

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

filter_batch(queries: Series, id1_lists: Series) Iterator[DataFrame]

filter each id_list in id1_lists using this predicate. That is, for each (query, id_list) pair in zip(queries, id1_lists), return only the ids which satisfy predicate(query, id) for id in id_list. Return a dataframe with schema (ids array<long>, scores array<float>, time float)

property index_col
index_component_sizes(for_search: bool) dict

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

Parameters:
for_search : bool

return the sizes for searching or for filtering

Returns:
dict[Any, int | None]
index_size_in_bytes() int

return the total size in bytes of all the files associated with this predicate

property indexable

True if the predicate can be efficiently indexed

init()

initialize the predicate for searching or filtering

property invertable: bool
property is_topk: bool

True if self is top-k based, else False

property k
search_batch(queries)

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

property search_col
property sim

The similarity used by the predicate

property streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False
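For reference, the standard Okapi BM25 scoring that presumably underlies BM25TopkPredicate; the k1 and b values below are conventional defaults, not values taken from delex, and the docs, tokens, and query here are purely illustrative:

```python
import math
from collections import Counter

def bm25_score(query_tokens, doc_tokens, corpus, k1=1.2, b=0.75):
    """Okapi BM25 score of one tokenized document for a tokenized query."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_tokens:
        n = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log((N - n + 0.5) / (n + 0.5) + 1.0)  # smoothed IDF
        f = tf[term]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_tokens) / avgdl))
    return score

docs = [["spark", "sql"], ["spark", "ml", "pipelines"], ["pandas", "dataframes"]]
# a top-k predicate would keep, per query, the k documents with the highest scores
```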

class delex.lang.predicate.topk_predicate.CachedBM25IndexKey(index_col: str, tokenizer: str)

Bases: CachedObjectKey

index_col: str
tokenizer: str

Module contents