delex.lang.predicate package

Submodules

delex.lang.predicate.bootleg_predicate module

class delex.lang.predicate.bootleg_predicate.BootlegPredicate(index_col: str, search_col: str, invert: bool = False)

Bases: ThresholdPredicate

an experimental user-defined predicate for demonstration. In particular, it does some simple preprocessing of person names to make exact match more liberal by handling name variations
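The exact preprocessing is not documented here. As a purely hypothetical illustration of the kind of normalization that makes exact match more liberal for person names, one might lowercase, collapse whitespace, and drop single-letter middle initials (none of this is taken from the delex source):

```python
import re

def normalize_name(name: str) -> str:
    """Hypothetical person-name normalization; the real BootlegPredicate
    preprocessing is not specified in this documentation."""
    name = name.strip().lower()
    # collapse runs of whitespace into a single space
    name = re.sub(r"\s+", " ", name)
    # drop single-letter middle initials like "q." in "john q. smith"
    name = re.sub(r"\b[a-z]\.?\s", "", name)
    return name

# Under this normalization, common variants of a name compare equal:
normalize_name("John Q. Smith") == normalize_name("john  smith")  # True
```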

Attributes:
index_col
indexable

True if the predicate can be efficiently indexed

invertable
is_topk

True if self is top-k based, else False

op
search_col
sim

The similarity used by the predicate

streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

val

Methods

build(for_search, index_table[, ...])

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

contains(other)

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

filter(itr)

perform filter_batch for each batch in itr

filter_batch(queries, id1_lists)

filter each id_list in id1_lists using this predicate.

index_component_sizes(for_search)

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

index_size_in_bytes()

return the total size in bytes of all the files associated with this predicate

init()

initialize the predicate for searching or filtering

search(itr)

perform search_batch for each batch in itr

search_batch(queries)

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

compute_scores

search_index

build(for_search, index_table, index_id_col='_id', cache=None)

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

Parameters:
for_search : bool

build the predicate for searching, otherwise streaming / filtering

index_table : pyspark.sql.DataFrame

the dataframe that will be preprocessed / indexed

index_id_col : str

the name of the unique id column in index_table

cache : Optional[BuildCache] = None

the cache for built indexes and hash tables

compute_scores(query: str, id1_list)
contains(other)

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

property index_col
index_component_sizes(for_search: bool) dict

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

Parameters:
for_search : bool

return the sizes for searching or for filtering

Returns:
dict[Any, int | None]
index_size_in_bytes() int

return the total size in bytes of all the files associated with this predicate

property indexable

True if the predicate can be efficiently indexed

init()

initialize the predicate for searching or filtering

property is_topk

True if self is top-k based, else False

property op
property search_col
search_index(query)
property sim

The similarity used by the predicate

property streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

property val
class delex.lang.predicate.bootleg_predicate.BootlegSim(index_col: str, search_col: str, invert: bool)

Bases: object

index_col: str
invert: bool
search_col: str
class delex.lang.predicate.bootleg_predicate.CachedNameIndexKey(index_col: str, lowercase: bool)

Bases: CachedObjectKey

index_col: str
lowercase: bool
class delex.lang.predicate.bootleg_predicate.CachedNamesKey(index_col: str)

Bases: CachedObjectKey

index_col: str

delex.lang.predicate.exact_match_predicate module

class delex.lang.predicate.exact_match_predicate.ExactMatchPredicate(index_col: str, search_col: str, invert: bool, lowercase: bool = False)

Bases: ThresholdPredicate

an exact match predicate, i.e. returns 1.0 if x == y, else 0.0
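The similarity this encodes can be sketched directly, with lowercase and invert applied as the constructor parameters suggest (a toy rendering of the documented behavior, not the delex code):

```python
def exact_match_sim(x: str, y: str, lowercase: bool = False, invert: bool = False) -> float:
    """Exact match: 1.0 if the strings are equal, else 0.0.
    lowercase compares case-insensitively; invert flips the result."""
    if lowercase:
        x, y = x.lower(), y.lower()
    matched = 1.0 if x == y else 0.0
    return 1.0 - matched if invert else matched

exact_match_sim("Apple", "apple", lowercase=True)   # 1.0
exact_match_sim("Apple", "apple", lowercase=False)  # 0.0
```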

Attributes:
index_col
indexable

True if the predicate can be efficiently indexed

invertable
is_topk

True if self is top-k based, else False

op
search_col
sim

The similarity used by the predicate

streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

val

Methods

build(for_search, index_table[, ...])

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

contains(other)

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

filter(itr)

perform filter_batch for each batch in itr

filter_batch(queries, id1_lists)

filter each id_list in id1_lists using this predicate.

index_component_sizes(for_search)

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

index_size_in_bytes()

return the total size in bytes of all the files associated with this predicate

init()

initialize the predicate for searching or filtering

search(itr)

perform search_batch for each batch in itr

search_batch(queries)

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

Sim

compute_scores

search_index

class Sim(index_col: str, search_col: str, invert: bool, lowercase: bool)

Bases: object

index_col: str
invert: bool
lowercase: bool
search_col: str
build(for_search: bool, index_table: DataFrame, index_id_col: str = '_id', cache: BuildCache = None)

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

Parameters:
for_search : bool

build the predicate for searching, otherwise streaming / filtering

index_table : pyspark.sql.DataFrame

the dataframe that will be preprocessed / indexed

index_id_col : str

the name of the unique id column in index_table

cache : Optional[BuildCache] = None

the cache for built indexes and hash tables

compute_scores(query: str | int, id1_list) ndarray
contains(other: Predicate) bool

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

property index_col
index_component_sizes(for_search: bool) dict

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

Parameters:
for_search : bool

return the sizes for searching or for filtering

Returns:
dict[Any, int | None]
index_size_in_bytes() int

return the total size in bytes of all the files associated with this predicate

property indexable

True if the predicate can be efficiently indexed

init()

initialize the predicate for searching or filtering

property is_topk

True if self is top-k based, else False

property op
property search_col
search_index(query) ndarray
property sim

The similarity used by the predicate

property streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

property val

delex.lang.predicate.name_map module

delex.lang.predicate.predicate module

class delex.lang.predicate.predicate.Predicate

Bases: ABC

abstract base class for all Predicates used in writing blocking programs

Attributes:
indexable

True if the predicate can be efficiently indexed

is_topk

True if self is top-k based, else False

sim

The similarity used by the predicate

streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

Methods

build(for_search, index_table[, ...])

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

contains(other)

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

filter(itr)

perform filter_batch for each batch in itr

filter_batch(queries, id1_lists)

filter each id_list in id1_lists using this predicate.

index_component_sizes(for_search)

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

index_size_in_bytes()

return the total size in bytes of all the files associated with this predicate

init()

initialize the predicate for searching or filtering

search(itr)

perform search_batch for each batch in itr

search_batch(queries)

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

abstractmethod build(for_search: bool, index_table: DataFrame, index_id_col: str = '_id', cache: BuildCache | None = None)

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

Parameters:
for_search : bool

build the predicate for searching, otherwise streaming / filtering

index_table : pyspark.sql.DataFrame

the dataframe that will be preprocessed / indexed

index_id_col : str

the name of the unique id column in index_table

cache : Optional[BuildCache] = None

the cache for built indexes and hash tables

abstractmethod contains(other) bool

True if the set output by self is a superset (non-strict) of other

abstractmethod deinit()

release the resources acquired by self.init()

filter(itr: Iterator[Tuple[Series, Series]]) Iterator[DataFrame]

perform filter_batch for each batch in itr

abstractmethod filter_batch(queries: Series, id1_lists: Series) DataFrame

filter each id_list in id1_lists using this predicate. That is, for each (query, id_list) pair in zip(queries, id1_lists), return only the ids which satisfy predicate(query, id) for id in id_list. Return a dataframe with schema (ids array<long>, scores array<float>, time float)
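The per-batch semantics can be sketched in plain Python (a toy analogue using lists instead of pandas Series, with an arbitrary score function and a >= threshold standing in for the predicate; not the actual delex implementation):

```python
from typing import Callable, List, Tuple

def filter_batch_sketch(
    queries: List[str],
    id1_lists: List[List[int]],
    score: Callable[[str, int], float],
    threshold: float,
) -> List[Tuple[List[int], List[float]]]:
    """For each (query, id_list) pair, keep only the ids whose score
    satisfies the predicate (here: score >= threshold)."""
    out = []
    for query, id_list in zip(queries, id1_lists):
        scored = [(i, score(query, i)) for i in id_list]
        kept = [(i, s) for i, s in scored if s >= threshold]
        out.append(([i for i, _ in kept], [s for _, s in kept]))
    return out
```

The real method additionally records per-row timing and returns the result as a dataframe with schema (ids array<long>, scores array<float>, time float).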

abstractmethod index_component_sizes(for_search: bool) dict

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

Parameters:
for_search : bool

return the sizes for searching or for filtering

Returns:
dict[Any, int | None]
abstractmethod index_size_in_bytes() int

return the total size in bytes of all the files associated with this predicate

abstract property indexable

True if the predicate can be efficiently indexed

abstractmethod init()

initialize the predicate for searching or filtering

abstract property is_topk: bool

True if self is top-k based, else False

search(itr: Iterator[Series]) Iterator[DataFrame]

perform search_batch for each batch in itr

abstractmethod search_batch(queries: Series) DataFrame

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

abstract property sim

The similarity used by the predicate

abstract property streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

delex.lang.predicate.set_sim_predicate module

class delex.lang.predicate.set_sim_predicate.CosinePredicate(index_col: str, search_col: str, tokenizer, op, val: float)

Bases: SetSimPredicate

Attributes:
index_col
indexable

True if the predicate can be efficiently indexed

invertable
is_topk

True if self is top-k based, else False

op
search_col
sim

The similarity used by the predicate

streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

val

Methods

build(for_search, index_table[, ...])

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

contains(other)

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

filter(itr)

perform filter_batch for each batch in itr

filter_batch(queries, id1_lists)

filter each id_list in id1_lists using this predicate.

index_component_sizes(for_search)

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

index_size_in_bytes()

return the total size in bytes of all the files associated with this predicate

init()

initialize the predicate for searching or filtering

search(itr)

perform search_batch for each batch in itr

search_batch(queries)

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

Sim

compute_scores

invert

search_index

compute_scores(query, id1_list)
class delex.lang.predicate.set_sim_predicate.JaccardPredicate(index_col: str, search_col: str, tokenizer, op, val: float)

Bases: SetSimPredicate

Attributes:
index_col
indexable

True if the predicate can be efficiently indexed

invertable
is_topk

True if self is top-k based, else False

op
search_col
sim

The similarity used by the predicate

streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

val

Methods

build(for_search, index_table[, ...])

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

contains(other)

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

filter(itr)

perform filter_batch for each batch in itr

filter_batch(queries, id1_lists)

filter each id_list in id1_lists using this predicate.

index_component_sizes(for_search)

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

index_size_in_bytes()

return the total size in bytes of all the files associated with this predicate

init()

initialize the predicate for searching or filtering

search(itr)

perform search_batch for each batch in itr

search_batch(queries)

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

Sim

compute_scores

invert

search_index

compute_scores(query, id1_list)
class delex.lang.predicate.set_sim_predicate.OverlapCoeffPredicate(index_col: str, search_col: str, tokenizer, op, val: float)

Bases: SetSimPredicate

Attributes:
index_col
indexable

True if the predicate can be efficiently indexed

invertable
is_topk

True if self is top-k based, else False

op
search_col
sim

The similarity used by the predicate

streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

val

Methods

build(for_search, index_table[, ...])

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

contains(other)

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

filter(itr)

perform filter_batch for each batch in itr

filter_batch(queries, id1_lists)

filter each id_list in id1_lists using this predicate.

index_component_sizes(for_search)

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

index_size_in_bytes()

return the total size in bytes of all the files associated with this predicate

init()

initialize the predicate for searching or filtering

search(itr)

perform search_batch for each batch in itr

search_batch(queries)

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

Sim

compute_scores

invert

search_index

compute_scores(query, id1_list)
property indexable

True if the predicate can be efficiently indexed

class delex.lang.predicate.set_sim_predicate.SetSimPredicate(index_col: str, search_col: str, tokenizer, op, val: float)

Bases: ThresholdPredicate

Attributes:
index_col
indexable

True if the predicate can be efficiently indexed

invertable
is_topk

True if self is top-k based, else False

op
search_col
sim

The similarity used by the predicate

streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

val

Methods

build(for_search, index_table[, ...])

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

contains(other)

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

filter(itr)

perform filter_batch for each batch in itr

filter_batch(queries, id1_lists)

filter each id_list in id1_lists using this predicate.

index_component_sizes(for_search)

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

index_size_in_bytes()

return the total size in bytes of all the files associated with this predicate

init()

initialize the predicate for searching or filtering

search(itr)

perform search_batch for each batch in itr

search_batch(queries)

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

Sim

compute_scores

invert

search_index

class Sim(index_col: str, search_col: str, sim_name: str, tokenizer_name: str)

Bases: object

index_col: str
search_col: str
sim_name: str
tokenizer_name: str
build(for_search, index_table, index_id_col='_id', cache=None)

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

Parameters:
for_search : bool

build the predicate for searching, otherwise streaming / filtering

index_table : pyspark.sql.DataFrame

the dataframe that will be preprocessed / indexed

index_id_col : str

the name of the unique id column in index_table

cache : Optional[BuildCache] = None

the cache for built indexes and hash tables

contains(other)

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

index_component_sizes(for_search: bool) dict

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

Parameters:
for_search : bool

return the sizes for searching or for filtering

Returns:
dict[Any, int | None]
index_size_in_bytes() int

return the total size in bytes of all the files associated with this predicate

property indexable

True if the predicate can be efficiently indexed

init()

initialize the predicate for searching or filtering

invert()
property is_topk

True if self is top-k based, else False

search_index(query)
property sim

The similarity used by the predicate

property streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False
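The set similarities behind JaccardPredicate, CosinePredicate, and OverlapCoeffPredicate can be sketched with their standard definitions over token sets (assuming the tokenizer turns each string into a set of tokens; whether delex uses exactly these set-based formulas is an assumption based on the class names):

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|"""
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(a: set, b: set) -> float:
    """Set-based cosine: |A ∩ B| / sqrt(|A| * |B|)"""
    return len(a & b) / (len(a) * len(b)) ** 0.5 if a and b else 0.0

def overlap_coeff(a: set, b: set) -> float:
    """|A ∩ B| / min(|A|, |B|)"""
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0

a, b = {"data", "base", "systems"}, {"data", "systems"}
# jaccard(a, b) == 2/3; overlap_coeff(a, b) == 1.0
```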

delex.lang.predicate.string_sim_predicate module

class delex.lang.predicate.string_sim_predicate.EditDistancePredicate(index_col: str, search_col: str, op, val)

Bases: StringSimPredicate

Attributes:
index_col
indexable

True if the predicate can be efficiently indexed

invertable
is_topk

True if self is top-k based, else False

op
search_col
sim

The similarity used by the predicate

streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

val

Methods

build(for_search, index_table[, ...])

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

contains(other)

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

filter(itr)

perform filter_batch for each batch in itr

filter_batch(queries, id1_lists)

filter each id_list in id1_lists using this predicate.

index_component_sizes(for_search)

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

index_size_in_bytes()

return the total size in bytes of all the files associated with this predicate

init()

initialize the predicate for searching or filtering

search(itr)

perform search_batch for each batch in itr

search_batch(queries)

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

Sim

compute_scores

invert

search_index
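EditDistancePredicate is presumably backed by Levenshtein edit distance. For reference, a minimal dynamic-programming implementation of that measure (whether delex thresholds the raw distance or a normalized similarity is not stated here):

```python
def levenshtein(s: str, t: str) -> int:
    """Classic DP edit distance: minimum number of insertions,
    deletions, and substitutions turning s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (cs != ct)))    # substitution
        prev = cur
    return prev[-1]

# levenshtein("kitten", "sitting") == 3
```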

class delex.lang.predicate.string_sim_predicate.JaroPredicate(index_col: str, search_col: str, op, val)

Bases: StringSimPredicate

Attributes:
index_col
indexable

True if the predicate can be efficiently indexed

invertable
is_topk

True if self is top-k based, else False

op
search_col
sim

The similarity used by the predicate

streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

val

Methods

build(for_search, index_table[, ...])

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

contains(other)

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

filter(itr)

perform filter_batch for each batch in itr

filter_batch(queries, id1_lists)

filter each id_list in id1_lists using this predicate.

index_component_sizes(for_search)

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

index_size_in_bytes()

return the total size in bytes of all the files associated with this predicate

init()

initialize the predicate for searching or filtering

search(itr)

perform search_batch for each batch in itr

search_batch(queries)

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

Sim

compute_scores

invert

search_index

class delex.lang.predicate.string_sim_predicate.JaroWinklerPredicate(index_col: str, search_col: str, op, val, prefix_weight=0.1)

Bases: StringSimPredicate

Attributes:
index_col
indexable

True if the predicate can be efficiently indexed

invertable
is_topk

True if self is top-k based, else False

op
search_col
sim

The similarity used by the predicate

streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

val

Methods

build(for_search, index_table[, ...])

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

contains(o)

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

filter(itr)

perform filter_batch for each batch in itr

filter_batch(queries, id1_lists)

filter each id_list in id1_lists using this predicate.

index_component_sizes(for_search)

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

index_size_in_bytes()

return the total size in bytes of all the files associated with this predicate

init()

initialize the predicate for searching or filtering

search(itr)

perform search_batch for each batch in itr

search_batch(queries)

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

Sim

compute_scores

invert

search_index

class Sim(index_col: str, search_col: str, sim_name: str, prefix_weight: float)

Bases: Sim

prefix_weight: float
contains(o)

True if the set output by self is a superset (non-strict) of other
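The prefix_weight parameter (default 0.1 per the signature above) is the Winkler adjustment factor. In the standard Jaro-Winkler formula, a precomputed Jaro score is boosted by the length of the common prefix, conventionally capped at 4 characters (a sketch of the standard formula, not the delex internals):

```python
def jaro_winkler(jaro: float, common_prefix_len: int, prefix_weight: float = 0.1) -> float:
    """Standard Winkler adjustment applied to a precomputed Jaro score."""
    l = min(common_prefix_len, 4)  # prefix contribution is capped at 4 chars
    return jaro + l * prefix_weight * (1.0 - jaro)

# With jaro == 0.9 and a 4-char common prefix: 0.9 + 4 * 0.1 * 0.1 == 0.94
```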

class delex.lang.predicate.string_sim_predicate.SmithWatermanPredicate(index_col: str, search_col: str, op, val, gap_cost=1.0)

Bases: StringSimPredicate

Attributes:
index_col
indexable

True if the predicate can be efficiently indexed

invertable
is_topk

True if self is top-k based, else False

op
search_col
sim

The similarity used by the predicate

streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

val

Methods

build(for_search, index_table[, ...])

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

contains(o)

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

filter(itr)

perform filter_batch for each batch in itr

filter_batch(queries, id1_lists)

filter each id_list in id1_lists using this predicate.

index_component_sizes(for_search)

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

index_size_in_bytes()

return the total size in bytes of all the files associated with this predicate

init()

initialize the predicate for searching or filtering

search(itr)

perform search_batch for each batch in itr

search_batch(queries)

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

Sim

compute_scores

invert

search_index

class Sim(index_col: str, search_col: str, sim_name: str, gap_cost: float)

Bases: Sim

gap_cost: float
contains(o)

True if the set output by self is a superset (non-strict) of other

class delex.lang.predicate.string_sim_predicate.StringSimPredicate(index_col: str, search_col: str, op, val)

Bases: ThresholdPredicate

Attributes:
index_col
indexable

True if the predicate can be efficiently indexed

invertable
is_topk

True if self is top-k based, else False

op
search_col
sim

The similarity used by the predicate

streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

val

Methods

build(for_search, index_table[, ...])

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

contains(other)

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

filter(itr)

perform filter_batch for each batch in itr

filter_batch(queries, id1_lists)

filter each id_list in id1_lists using this predicate.

index_component_sizes(for_search)

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

index_size_in_bytes()

return the total size in bytes of all the files associated with this predicate

init()

initialize the predicate for searching or filtering

search(itr)

perform search_batch for each batch in itr

search_batch(queries)

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

Sim

compute_scores

invert

search_index

class Sim(index_col: str, search_col: str, sim_name: str)

Bases: object

index_col: str
search_col: str
sim_name: str
build(for_search, index_table, index_id_col='_id', cache=None)

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

Parameters:
for_search : bool

build the predicate for searching, otherwise streaming / filtering

index_table : pyspark.sql.DataFrame

the dataframe that will be preprocessed / indexed

index_id_col : str

the name of the unique id column in index_table

cache : Optional[BuildCache] = None

the cache for built indexes and hash tables

compute_scores(query: str, id1_list)
deinit()

release the resources acquired by self.init()

index_component_sizes(for_search: bool) dict

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

Parameters:
for_search : bool

return the sizes for searching or for filtering

Returns:
dict[Any, int | None]
index_size_in_bytes() int

return the total size in bytes of all the files associated with this predicate

property indexable

True if the predicate can be efficiently indexed

init()

initialize the predicate for searching or filtering

invert()
property is_topk

True if self is top-k based, else False

search(itr)

perform search_batch for each batch in itr

search_index(query)
property sim

The similarity used by the predicate

property streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

delex.lang.predicate.threshold_predicate module

class delex.lang.predicate.threshold_predicate.ThresholdPredicate(index_col, search_col, op, val: float)

Bases: Predicate, ABC

Attributes:
index_col
indexable

True if the predicate can be efficiently indexed

invertable
is_topk

True if self is top-k based, else False

op
search_col
sim

The similarity used by the predicate

streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

val

Methods

build(for_search, index_table[, ...])

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

contains(other)

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

filter(itr)

perform filter_batch for each batch in itr

filter_batch(queries, id1_lists)

filter each id_list in id1_lists using this predicate.

index_component_sizes(for_search)

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

index_size_in_bytes()

return the total size in bytes of all the files associated with this predicate

init()

initialize the predicate for searching or filtering

search(itr)

perform search_batch for each batch in itr

search_batch(queries)

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

compute_scores

search_index

abstractmethod compute_scores(query, id1_list)
contains(other) bool

True if the set output by self is a superset (non-strict) of other

filter_batch(queries: Series, id1_lists: Series) Iterator[DataFrame]

filter each id_list in id1_lists using this predicate. That is, for each (query, id_list) pair in zip(queries, id1_lists), return only the ids which satisfy predicate(query, id) for id in id_list. Return a dataframe with schema (ids array<long>, scores array<float>, time float)

property index_col
property invertable: bool
property op: Callable
search_batch(queries: Series) DataFrame

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

property search_col
abstractmethod search_index(query)
property val: float
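Putting op and val together: a ThresholdPredicate presumably holds for a pair when op(similarity, val) is true, with op a comparison callable such as operator.ge (a sketch of the documented op/val semantics, not the delex internals):

```python
import operator

def threshold_holds(sim_score: float, op, val: float) -> bool:
    """A threshold predicate accepts a pair when op(similarity, threshold) is true."""
    return op(sim_score, val)

threshold_holds(0.85, operator.ge, 0.8)  # True
threshold_holds(0.75, operator.ge, 0.8)  # False
```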

delex.lang.predicate.topk_predicate module

class delex.lang.predicate.topk_predicate.BM25TopkPredicate(index_col, search_col, tokenizer: str, k: int)

Bases: Predicate

Attributes:
index_col
indexable

True if the predicate can be efficiently indexed

invertable
is_topk

True if self is top-k based, else False

k
search_col
sim

The similarity used by the predicate

streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False

Methods

build(for_search, index_table[, ...])

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

contains(other)

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

filter(itr)

perform filter_batch for each batch in itr

filter_batch(queries, id1_lists)

filter each id_list in id1_lists using this predicate.

index_component_sizes(for_search)

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

index_size_in_bytes()

return the total size in bytes of all the files associated with this predicate

init()

initialize the predicate for searching or filtering

search(itr)

perform search_batch for each batch in itr

search_batch(queries)

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

Sim

class Sim(index_col: str, search_col: str, tokenizer_name: str)

Bases: object

index_col: str
search_col: str
tokenizer_name: str
build(for_search, index_table, index_id_col='_id', cache=None)

build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index

Parameters:
for_search : bool

build the predicate for searching, otherwise streaming / filtering

index_table : pyspark.sql.DataFrame

the dataframe that will be preprocessed / indexed

index_id_col : str

the name of the unique id column in index_table

cache : Optional[BuildCache] = None

the cache for built indexes and hash tables

contains(other) bool

True if the set output by self is a superset (non-strict) of other

deinit()

release the resources acquired by self.init()

filter_batch(queries: Series, id1_lists: Series) Iterator[DataFrame]

filter each id_list in id1_lists using this predicate. That is, for each (query, id_list) pair in zip(queries, id1_lists), return only the ids which satisfy predicate(query, id) for id in id_list. Return a dataframe with schema (ids array<long>, scores array<float>, time float)

property index_col
index_component_sizes(for_search: bool) dict

return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None

Parameters:
for_search : bool

return the sizes for searching or for filtering

Returns:
dict[Any, int | None]
index_size_in_bytes() int

return the total size in bytes of all the files associated with this predicate

property indexable

True if the predicate can be efficiently indexed

init()

initialize the predicate for searching or filtering

property invertable: bool
property is_topk: bool

True if self is top-k based, else False

property k
search_batch(queries)

perform search with queries and return a dataframe with schema (ids array<long>, scores array<float>, time float)

property search_col
property sim

The similarity used by the predicate

property streamable

True if the predicate can be evaluated over a single partition of the indexed table, otherwise False
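For reference, the standard Okapi BM25 scoring that presumably underlies BM25TopkPredicate; the k1 and b values below are conventional defaults, not values taken from delex, and the docs, tokens, and query here are purely illustrative:

```python
import math
from collections import Counter

def bm25_score(query_tokens, doc_tokens, corpus, k1=1.2, b=0.75):
    """Okapi BM25 score of one tokenized document for a tokenized query."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_tokens:
        n = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log((N - n + 0.5) / (n + 0.5) + 1.0)  # smoothed IDF
        f = tf[term]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_tokens) / avgdl))
    return score

docs = [["spark", "sql"], ["spark", "ml", "pipelines"], ["pandas", "dataframes"]]
# a top-k predicate would keep, per query, the k documents with the highest scores
```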

class delex.lang.predicate.topk_predicate.CachedBM25IndexKey(index_col: str, tokenizer: str)

Bases: CachedObjectKey

index_col: str
tokenizer: str

Module contents