delex.lang.predicate package
Submodules
delex.lang.predicate.bootleg_predicate module
- class delex.lang.predicate.bootleg_predicate.BootlegPredicate(index_col: str, search_col: str, invert: bool = False)
Bases:
ThresholdPredicate
An experimental user-defined predicate for demonstration. It performs simple preprocessing of person names so that exact matching becomes more liberal, handling common name variations. (A usage sketch follows this class reference.)
- Attributes:
- index_col
- indexable: True if the predicate can be efficiently indexed
- invertable
- is_topk: True if self is top-k based, else False
- op
- search_col
- sim: the similarity used by the predicate
- streamable: True if the predicate can be evaluated over a single partition of the indexed table
- val
Methods
- build(for_search, index_table[, ...]): build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index
- contains(other): True if the set output by self is a superset (non-strict) of other
- deinit(): release the resources acquired by self.init()
- filter(itr): perform filter_batch for each batch in itr
- filter_batch(queries, id1_lists): filter each id_list in id1_lists using this predicate
- index_component_sizes(for_search): return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None
- index_size_in_bytes(): return the total size in bytes of all the files associated with this predicate
- init(): initialize the predicate for searching or filtering
- search(itr): perform search_batch for each batch in itr
- search_batch(queries): perform search with queries; return a dataframe with schema (ids array<long>, scores array<float>, time float)
- compute_scores
- search_index
- build(for_search, index_table, index_id_col='_id', cache=None)
build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index
- Parameters:
- for_searchbool
build the predicate for searching, otherwise streaming / filtering
- index_tablepyspark.sql.DataFrame
the dataframe that will be preprocessed / indexed
- index_id_colstr
the name of the unique id column in index_table
- cacheOptional[BuildCache] = None
the cache for built indexes and hash tables
- compute_scores(query: str, id1_list)
- contains(other)
True if the set output by self is a superset (non-strict) of other
- deinit()
release the resources acquired by self.init()
- property index_col
- index_component_sizes(for_search: bool) dict
return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None
- Parameters:
- for_searchbool
return the sizes for searching or for filtering
- Returns:
- dict[Any, int | None]
- index_size_in_bytes() int
return the total size in bytes of all the files associated with this predicate
- property indexable
True if the predicate can be efficiently indexed
- init()
initialize the predicate for searching or filtering
- property is_topk
True if self is top-k based, else False
- property op
- property search_col
- search_index(query)
- property sim
The similarity used by the predicate
- property streamable
True if the predicate can be evaluated over a single partition of the indexed table, otherwise False
- property val
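A minimal usage sketch for BootlegPredicate (not taken from the library; the SparkSession setup, the column name "name", and the sample rows are placeholder assumptions):

    # sketch only: column names and data are hypothetical
    from pyspark.sql import SparkSession
    from delex.lang.predicate.bootleg_predicate import BootlegPredicate

    spark = SparkSession.builder.getOrCreate()
    index_table = spark.createDataFrame(
        [(0, "John A. Smith"), (1, "Jane Doe")],
        schema="_id long, name string",
    )

    pred = BootlegPredicate(index_col="name", search_col="name")
    pred.build(for_search=True, index_table=index_table, index_id_col="_id")
    pred.init()                                # acquire search resources
    ids = pred.search_index("john smith")      # assumption: ids of names that match after normalization
    pred.deinit()                              # release resources acquired by init()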
- class delex.lang.predicate.bootleg_predicate.BootlegSim(index_col: str, search_col: str, invert: bool)
Bases:
object
- index_col: str
- invert: bool
- search_col: str
- class delex.lang.predicate.bootleg_predicate.CachedNameIndexKey(index_col: str, lowercase: bool)
Bases:
CachedObjectKey
- index_col: str
- lowercase: bool
- class delex.lang.predicate.bootleg_predicate.CachedNamesKey(index_col: str)
Bases:
CachedObjectKey
- index_col: str
delex.lang.predicate.exact_match_predicate module
- class delex.lang.predicate.exact_match_predicate.ExactMatchPredicate(index_col: str, search_col: str, invert: bool, lowercase: bool = False)
Bases:
ThresholdPredicate
An exact match predicate: returns 1.0 if x == y, else 0.0. (A usage sketch follows this class reference.)
- Attributes:
- index_col
- indexable: True if the predicate can be efficiently indexed
- invertable
- is_topk: True if self is top-k based, else False
- op
- search_col
- sim: the similarity used by the predicate
- streamable: True if the predicate can be evaluated over a single partition of the indexed table
- val
Methods
- build(for_search, index_table[, ...]): build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index
- contains(other): True if the set output by self is a superset (non-strict) of other
- deinit(): release the resources acquired by self.init()
- filter(itr): perform filter_batch for each batch in itr
- filter_batch(queries, id1_lists): filter each id_list in id1_lists using this predicate
- index_component_sizes(for_search): return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None
- index_size_in_bytes(): return the total size in bytes of all the files associated with this predicate
- init(): initialize the predicate for searching or filtering
- search(itr): perform search_batch for each batch in itr
- search_batch(queries): perform search with queries; return a dataframe with schema (ids array<long>, scores array<float>, time float)
- Sim
- compute_scores
- search_index
- class Sim(index_col: str, search_col: str, invert: bool, lowercase: bool)
Bases:
object
- index_col: str
- invert: bool
- lowercase: bool
- search_col: str
- build(for_search: bool, index_table: DataFrame, index_id_col: str = '_id', cache: BuildCache = None)
build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index
- Parameters:
- for_searchbool
build the predicate for searching, otherwise streaming / filtering
- index_tablepyspark.sql.DataFrame
the dataframe that will be preprocessed / indexed
- index_id_colstr
the name of the unique id column in index_table
- cacheOptional[BuildCache] = None
the cache for built indexes and hash tables
- compute_scores(query: str | int, id1_list) ndarray
- contains(other: Predicate) bool
True if the set output by self is a superset (non-strict) of other
- deinit()
release the resources acquired by self.init()
- property index_col
- index_component_sizes(for_search: bool) dict
return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None
- Parameters:
- for_searchbool
return the sizes for searching or for filtering
- Returns:
- dict[Any, int | None]
- index_size_in_bytes() int
return the total size in bytes of all the files associated with this predicate
- property indexable
True if the predicate can be efficiently indexed
- init()
initialize the predicate for searching or filtering
- property is_topk
True if self is top-k based, else False
- property op
- property search_col
- search_index(query) ndarray
- property sim
The similarity used by the predicate
- property streamable
True if the predicate can be evaluated over a single partition of the indexed table, otherwise False
- property val
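A construction sketch for ExactMatchPredicate (the column names and the index_table dataframe are placeholders; note that invert has no default and must be passed):

    from delex.lang.predicate.exact_match_predicate import ExactMatchPredicate

    # case-insensitive exact match between the indexed "title" column and the query "title" column
    pred = ExactMatchPredicate(index_col="title", search_col="title",
                               invert=False, lowercase=True)
    pred.build(for_search=True, index_table=index_table, index_id_col="_id")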
delex.lang.predicate.name_map module
delex.lang.predicate.predicate module
- class delex.lang.predicate.predicate.Predicate
Bases:
ABC
Abstract base class for all Predicates used to write blocking programs. (A life-cycle sketch follows this class reference.)
- Attributes:
- indexable: True if the predicate can be efficiently indexed
- is_topk: True if self is top-k based, else False
- sim: the similarity used by the predicate
- streamable: True if the predicate can be evaluated over a single partition of the indexed table
Methods
- build(for_search, index_table[, ...]): build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index
- contains(other): True if the set output by self is a superset (non-strict) of other
- deinit(): release the resources acquired by self.init()
- filter(itr): perform filter_batch for each batch in itr
- filter_batch(queries, id1_lists): filter each id_list in id1_lists using this predicate
- index_component_sizes(for_search): return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None
- index_size_in_bytes(): return the total size in bytes of all the files associated with this predicate
- init(): initialize the predicate for searching or filtering
- search(itr): perform search_batch for each batch in itr
- search_batch(queries): perform search with queries; return a dataframe with schema (ids array<long>, scores array<float>, time float)
- abstractmethod build(for_search: bool, index_table: DataFrame, index_id_col: str = '_id', cache: BuildCache | None = None)
build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index
- Parameters:
- for_searchbool
build the predicate for searching, otherwise streaming / filtering
- index_tablepyspark.sql.DataFrame
the dataframe that will be preprocessed / indexed
- index_id_colstr
the name of the unique id column in index_table
- cacheOptional[BuildCache] = None
the cache for built indexes and hash tables
- abstractmethod contains(other) bool
True if the set output by self is a superset (non-strict) of other
- abstractmethod deinit()
release the resources acquired by self.init()
- filter(itr: Iterator[Tuple[Series, Series]]) Iterator[DataFrame]
perform filter_batch for each batch in itr
- abstractmethod filter_batch(queries: Series, id1_lists: Series) DataFrame
filter each id_list in id1_lists using this predicate. That is, for each (query, id_list) pair in zip(queries, id1_lists), return only the ids that satisfy predicate(query, id) for id in id_list. Return a dataframe with schema (ids array<long>, scores array<float>, time float)
- abstractmethod index_component_sizes(for_search: bool) dict
return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None
- Parameters:
- for_searchbool
return the sizes for searching or for filtering
- Returns:
- dict[Any, int | None]
- abstractmethod index_size_in_bytes() int
return the total size in bytes of all the files associated with this predicate
- abstract property indexable
True if the predicate can be efficiently indexed
- abstractmethod init()
initialize the predicate for searching or filtering
- abstract property is_topk: bool
True if self is top-k based, else False
- search(itr: Iterator[Series]) Iterator[DataFrame]
perform search_batch for each batch in itr
- abstractmethod search_batch(queries: Series) DataFrame
perform search with queries; return a dataframe with schema (ids array<long>, scores array<float>, time float)
- abstract property sim
The similarity used by the predicate
- abstract property streamable
True if the predicate can be evaluated over a single partition of the indexed table, otherwise False
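The typical life cycle shared by all predicates, pieced together from the methods above (a sketch only; the concrete predicate, dataframe, and queries are placeholders):

    import pandas as pd

    def run_search(pred, index_table, query_values):
        """Sketch of the documented build / init / search_batch / deinit life cycle."""
        pred.build(for_search=True, index_table=index_table, index_id_col="_id")
        pred.init()                      # acquire search resources
        try:
            # search_batch returns a dataframe with schema
            # (ids array<long>, scores array<float>, time float)
            return pred.search_batch(pd.Series(query_values))
        finally:
            pred.deinit()                # always release init() resources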
delex.lang.predicate.set_sim_predicate module
- class delex.lang.predicate.set_sim_predicate.CosinePredicate(index_col: str, search_col: str, tokenizer, op, val: float)
Bases:
SetSimPredicate
- Attributes:
- index_col
- indexable: True if the predicate can be efficiently indexed
- invertable
- is_topk: True if self is top-k based, else False
- op
- search_col
- sim: the similarity used by the predicate
- streamable: True if the predicate can be evaluated over a single partition of the indexed table
- val
Methods
- build(for_search, index_table[, ...]): build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index
- contains(other): True if the set output by self is a superset (non-strict) of other
- deinit(): release the resources acquired by self.init()
- filter(itr): perform filter_batch for each batch in itr
- filter_batch(queries, id1_lists): filter each id_list in id1_lists using this predicate
- index_component_sizes(for_search): return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None
- index_size_in_bytes(): return the total size in bytes of all the files associated with this predicate
- init(): initialize the predicate for searching or filtering
- search(itr): perform search_batch for each batch in itr
- search_batch(queries): perform search with queries; return a dataframe with schema (ids array<long>, scores array<float>, time float)
- Sim
- compute_scores
- invert
- search_index
- compute_scores(query, id1_list)
- class delex.lang.predicate.set_sim_predicate.JaccardPredicate(index_col: str, search_col: str, tokenizer, op, val: float)
Bases:
SetSimPredicate
- Attributes:
- index_col
- indexable: True if the predicate can be efficiently indexed
- invertable
- is_topk: True if self is top-k based, else False
- op
- search_col
- sim: the similarity used by the predicate
- streamable: True if the predicate can be evaluated over a single partition of the indexed table
- val
Methods
- build(for_search, index_table[, ...]): build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index
- contains(other): True if the set output by self is a superset (non-strict) of other
- deinit(): release the resources acquired by self.init()
- filter(itr): perform filter_batch for each batch in itr
- filter_batch(queries, id1_lists): filter each id_list in id1_lists using this predicate
- index_component_sizes(for_search): return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None
- index_size_in_bytes(): return the total size in bytes of all the files associated with this predicate
- init(): initialize the predicate for searching or filtering
- search(itr): perform search_batch for each batch in itr
- search_batch(queries): perform search with queries; return a dataframe with schema (ids array<long>, scores array<float>, time float)
- Sim
- compute_scores
- invert
- search_index
- compute_scores(query, id1_list)
- class delex.lang.predicate.set_sim_predicate.OverlapCoeffPredicate(index_col: str, search_col: str, tokenizer, op, val: float)
Bases:
SetSimPredicate
- Attributes:
- index_col
- indexable: True if the predicate can be efficiently indexed
- invertable
- is_topk: True if self is top-k based, else False
- op
- search_col
- sim: the similarity used by the predicate
- streamable: True if the predicate can be evaluated over a single partition of the indexed table
- val
Methods
- build(for_search, index_table[, ...]): build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index
- contains(other): True if the set output by self is a superset (non-strict) of other
- deinit(): release the resources acquired by self.init()
- filter(itr): perform filter_batch for each batch in itr
- filter_batch(queries, id1_lists): filter each id_list in id1_lists using this predicate
- index_component_sizes(for_search): return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None
- index_size_in_bytes(): return the total size in bytes of all the files associated with this predicate
- init(): initialize the predicate for searching or filtering
- search(itr): perform search_batch for each batch in itr
- search_batch(queries): perform search with queries; return a dataframe with schema (ids array<long>, scores array<float>, time float)
- Sim
- compute_scores
- invert
- search_index
- compute_scores(query, id1_list)
- property indexable
True if the predicate can be efficiently indexed
- class delex.lang.predicate.set_sim_predicate.SetSimPredicate(index_col: str, search_col: str, tokenizer, op, val: float)
Bases:
ThresholdPredicate
- Attributes:
- index_col
- indexable: True if the predicate can be efficiently indexed
- invertable
- is_topk: True if self is top-k based, else False
- op
- search_col
- sim: the similarity used by the predicate
- streamable: True if the predicate can be evaluated over a single partition of the indexed table
- val
Methods
- build(for_search, index_table[, ...]): build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index
- contains(other): True if the set output by self is a superset (non-strict) of other
- deinit(): release the resources acquired by self.init()
- filter(itr): perform filter_batch for each batch in itr
- filter_batch(queries, id1_lists): filter each id_list in id1_lists using this predicate
- index_component_sizes(for_search): return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None
- index_size_in_bytes(): return the total size in bytes of all the files associated with this predicate
- init(): initialize the predicate for searching or filtering
- search(itr): perform search_batch for each batch in itr
- search_batch(queries): perform search with queries; return a dataframe with schema (ids array<long>, scores array<float>, time float)
- Sim
- compute_scores
- invert
- search_index
- class Sim(index_col: str, search_col: str, sim_name: str, tokenizer_name: str)
Bases:
object
- index_col: str
- search_col: str
- sim_name: str
- tokenizer_name: str
- build(for_search, index_table, index_id_col='_id', cache=None)
build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index
- Parameters:
- for_searchbool
build the predicate for searching, otherwise streaming / filtering
- index_tablepyspark.sql.DataFrame
the dataframe that will be preprocessed / indexed
- index_id_colstr
the name of the unique id column in index_table
- cacheOptional[BuildCache] = None
the cache for built indexes and hash tables
- contains(other)
True if the set output by self is a superset (non-strict) of other
- deinit()
release the resources acquired by self.init()
- index_component_sizes(for_search: bool) dict
return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None
- Parameters:
- for_searchbool
return the sizes for searching or for filtering
- Returns:
- dict[Any, int | None]
- index_size_in_bytes() int
return the total size in bytes of all the files associated with this predicate
- property indexable
True if the predicate can be efficiently indexed
- init()
initialize the predicate for searching or filtering
- invert()
- property is_topk
True if self is top-k based, else False
- search_index(query)
- property sim
The similarity used by the predicate
- property streamable
True if the predicate can be evaluated over a single partition of the indexed table, otherwise False
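A construction sketch for a set-similarity predicate (the tokenizer argument's exact type is not shown in this reference, and treating op as a comparison callable such as operator.ge is an assumption):

    import operator
    from delex.lang.predicate.set_sim_predicate import JaccardPredicate

    # tokenizer is a placeholder for a delex tokenizer object (type not documented here)
    pred = JaccardPredicate(
        index_col="title", search_col="title",
        tokenizer=tokenizer,   # placeholder
        op=operator.ge,        # assumption: keep pairs with score >= val
        val=0.7,
    )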
delex.lang.predicate.string_sim_predicate module
- class delex.lang.predicate.string_sim_predicate.EditDistancePredicate(index_col: str, search_col: str, op, val)
Bases:
StringSimPredicate
- Attributes:
- index_col
- indexable: True if the predicate can be efficiently indexed
- invertable
- is_topk: True if self is top-k based, else False
- op
- search_col
- sim: the similarity used by the predicate
- streamable: True if the predicate can be evaluated over a single partition of the indexed table
- val
Methods
- build(for_search, index_table[, ...]): build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index
- contains(other): True if the set output by self is a superset (non-strict) of other
- deinit(): release the resources acquired by self.init()
- filter(itr): perform filter_batch for each batch in itr
- filter_batch(queries, id1_lists): filter each id_list in id1_lists using this predicate
- index_component_sizes(for_search): return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None
- index_size_in_bytes(): return the total size in bytes of all the files associated with this predicate
- init(): initialize the predicate for searching or filtering
- search(itr): perform search_batch for each batch in itr
- search_batch(queries): perform search with queries; return a dataframe with schema (ids array<long>, scores array<float>, time float)
- Sim
- compute_scores
- invert
- search_index
- class delex.lang.predicate.string_sim_predicate.JaroPredicate(index_col: str, search_col: str, op, val)
Bases:
StringSimPredicate
- Attributes:
- index_col
- indexable: True if the predicate can be efficiently indexed
- invertable
- is_topk: True if self is top-k based, else False
- op
- search_col
- sim: the similarity used by the predicate
- streamable: True if the predicate can be evaluated over a single partition of the indexed table
- val
Methods
- build(for_search, index_table[, ...]): build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index
- contains(other): True if the set output by self is a superset (non-strict) of other
- deinit(): release the resources acquired by self.init()
- filter(itr): perform filter_batch for each batch in itr
- filter_batch(queries, id1_lists): filter each id_list in id1_lists using this predicate
- index_component_sizes(for_search): return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None
- index_size_in_bytes(): return the total size in bytes of all the files associated with this predicate
- init(): initialize the predicate for searching or filtering
- search(itr): perform search_batch for each batch in itr
- search_batch(queries): perform search with queries; return a dataframe with schema (ids array<long>, scores array<float>, time float)
- Sim
- compute_scores
- invert
- search_index
- class delex.lang.predicate.string_sim_predicate.JaroWinklerPredicate(index_col: str, search_col: str, op, val, prefix_weight=0.1)
Bases:
StringSimPredicate
- Attributes:
- index_col
- indexable: True if the predicate can be efficiently indexed
- invertable
- is_topk: True if self is top-k based, else False
- op
- search_col
- sim: the similarity used by the predicate
- streamable: True if the predicate can be evaluated over a single partition of the indexed table
- val
Methods
- build(for_search, index_table[, ...]): build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index
- contains(o): True if the set output by self is a superset (non-strict) of other
- deinit(): release the resources acquired by self.init()
- filter(itr): perform filter_batch for each batch in itr
- filter_batch(queries, id1_lists): filter each id_list in id1_lists using this predicate
- index_component_sizes(for_search): return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None
- index_size_in_bytes(): return the total size in bytes of all the files associated with this predicate
- init(): initialize the predicate for searching or filtering
- search(itr): perform search_batch for each batch in itr
- search_batch(queries): perform search with queries; return a dataframe with schema (ids array<long>, scores array<float>, time float)
- Sim
- compute_scores
- invert
- search_index
- class Sim(index_col: str, search_col: str, sim_name: str, prefix_weight: float)
Bases:
Sim
- prefix_weight: float
- contains(o)
True if the set output by self is a superset (non-strict) of other
- class delex.lang.predicate.string_sim_predicate.SmithWatermanPredicate(index_col: str, search_col: str, op, val, gap_cost=1.0)
Bases:
StringSimPredicate
- Attributes:
- index_col
- indexable: True if the predicate can be efficiently indexed
- invertable
- is_topk: True if self is top-k based, else False
- op
- search_col
- sim: the similarity used by the predicate
- streamable: True if the predicate can be evaluated over a single partition of the indexed table
- val
Methods
- build(for_search, index_table[, ...]): build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index
- contains(o): True if the set output by self is a superset (non-strict) of other
- deinit(): release the resources acquired by self.init()
- filter(itr): perform filter_batch for each batch in itr
- filter_batch(queries, id1_lists): filter each id_list in id1_lists using this predicate
- index_component_sizes(for_search): return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None
- index_size_in_bytes(): return the total size in bytes of all the files associated with this predicate
- init(): initialize the predicate for searching or filtering
- search(itr): perform search_batch for each batch in itr
- search_batch(queries): perform search with queries; return a dataframe with schema (ids array<long>, scores array<float>, time float)
- Sim
- compute_scores
- invert
- search_index
- class Sim(index_col: str, search_col: str, sim_name: str, gap_cost: float)
Bases:
Sim
- gap_cost: float
- contains(o)
True if the set output by self is a superset (non-strict) of other
- class delex.lang.predicate.string_sim_predicate.StringSimPredicate(index_col: str, search_col: str, op, val)
Bases:
ThresholdPredicate
- Attributes:
- index_col
- indexable: True if the predicate can be efficiently indexed
- invertable
- is_topk: True if self is top-k based, else False
- op
- search_col
- sim: the similarity used by the predicate
- streamable: True if the predicate can be evaluated over a single partition of the indexed table
- val
Methods
- build(for_search, index_table[, ...]): build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index
- contains(other): True if the set output by self is a superset (non-strict) of other
- deinit(): release the resources acquired by self.init()
- filter(itr): perform filter_batch for each batch in itr
- filter_batch(queries, id1_lists): filter each id_list in id1_lists using this predicate
- index_component_sizes(for_search): return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None
- index_size_in_bytes(): return the total size in bytes of all the files associated with this predicate
- init(): initialize the predicate for searching or filtering
- search(itr): perform search_batch for each batch in itr
- search_batch(queries): perform search with queries; return a dataframe with schema (ids array<long>, scores array<float>, time float)
- Sim
- compute_scores
- invert
- search_index
- class Sim(index_col: str, search_col: str, sim_name: str)
Bases:
object
- index_col: str
- search_col: str
- sim_name: str
- build(for_search, index_table, index_id_col='_id', cache=None)
build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index
- Parameters:
- for_searchbool
build the predicate for searching, otherwise streaming / filtering
- index_tablepyspark.sql.DataFrame
the dataframe that will be preprocessed / indexed
- index_id_colstr
the name of the unique id column in index_table
- cacheOptional[BuildCache] = None
the cache for built indexes and hash tables
- compute_scores(query: str, id1_list)
- deinit()
release the resources acquired by self.init()
- index_component_sizes(for_search: bool) dict
return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None
- Parameters:
- for_searchbool
return the sizes for searching or for filtering
- Returns:
- dict[Any, int | None]
- index_size_in_bytes() int
return the total size in bytes of all the files associated with this predicate
- property indexable
True if the predicate can be efficiently indexed
- init()
initialize the predicate for searching or filtering
- invert()
- property is_topk
True if self is top-k based, else False
- search(itr)
perform search_batch for each batch in itr
- search_index(query)
- property sim
The similarity used by the predicate
- property streamable
True if the predicate can be evaluated over a single partition of the indexed table, otherwise False
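A construction sketch for a string-similarity predicate (assumptions: op is a comparison callable, and for edit distance a <= comparison is the natural choice, since smaller distances mean closer strings):

    import operator
    from delex.lang.predicate.string_sim_predicate import EditDistancePredicate

    # keep pairs whose edit distance between "name" values is at most 2 (assumption about op direction)
    pred = EditDistancePredicate(index_col="name", search_col="name",
                                 op=operator.le, val=2)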
delex.lang.predicate.threshold_predicate module
- class delex.lang.predicate.threshold_predicate.ThresholdPredicate(index_col, search_col, op, val: float)
Bases:
Predicate, ABC
- Attributes:
- index_col
- indexable: True if the predicate can be efficiently indexed
- invertable
- is_topk: True if self is top-k based, else False
- op
- search_col
- sim: the similarity used by the predicate
- streamable: True if the predicate can be evaluated over a single partition of the indexed table
- val
Methods
- build(for_search, index_table[, ...]): build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index
- contains(other): True if the set output by self is a superset (non-strict) of other
- deinit(): release the resources acquired by self.init()
- filter(itr): perform filter_batch for each batch in itr
- filter_batch(queries, id1_lists): filter each id_list in id1_lists using this predicate
- index_component_sizes(for_search): return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None
- index_size_in_bytes(): return the total size in bytes of all the files associated with this predicate
- init(): initialize the predicate for searching or filtering
- search(itr): perform search_batch for each batch in itr
- search_batch(queries): perform search with queries; return a dataframe with schema (ids array<long>, scores array<float>, time float)
- compute_scores
- search_index
- abstractmethod compute_scores(query, id1_list)
- contains(other) bool
True if the set output by self is a superset (non-strict) of other
- filter_batch(queries: Series, id1_lists: Series) Iterator[DataFrame]
filter each id_list in id1_lists using this predicate. That is, for each (query, id_list) pair in zip(queries, id1_lists), return only the ids that satisfy predicate(query, id) for id in id_list. Return a dataframe with schema (ids array<long>, scores array<float>, time float)
- property index_col
- property invertable: bool
- property op: Callable
- search_batch(queries: Series) DataFrame
perform search with queries; return a dataframe with schema (ids array<long>, scores array<float>, time float)
- property search_col
- abstractmethod search_index(query)
- property val: float
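A conceptual sketch of the threshold semantics implied by op, val, and compute_scores (not library code; it assumes compute_scores returns one score per id and that op broadcasts over a NumPy array):

    import numpy as np

    def keep_satisfying_ids(pred, query, id1_list):
        """For one query, keep the ids whose similarity score satisfies op(score, val)."""
        scores = pred.compute_scores(query, id1_list)   # one score per id (assumption)
        mask = pred.op(scores, pred.val)                # e.g. operator.ge broadcasts elementwise
        return np.asarray(id1_list)[mask], scores[mask]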
delex.lang.predicate.topk_predicate module
- class delex.lang.predicate.topk_predicate.BM25TopkPredicate(index_col, search_col, tokenizer: str, k: int)
Bases:
Predicate
- Attributes:
- index_col
- indexable: True if the predicate can be efficiently indexed
- invertable
- is_topk: True if self is top-k based, else False
- k
- search_col
- sim: the similarity used by the predicate
- streamable: True if the predicate can be evaluated over a single partition of the indexed table
Methods
- build(for_search, index_table[, ...]): build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index
- contains(other): True if the set output by self is a superset (non-strict) of other
- deinit(): release the resources acquired by self.init()
- filter(itr): perform filter_batch for each batch in itr
- filter_batch(queries, id1_lists): filter each id_list in id1_lists using this predicate
- index_component_sizes(for_search): return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None
- index_size_in_bytes(): return the total size in bytes of all the files associated with this predicate
- init(): initialize the predicate for searching or filtering
- search(itr): perform search_batch for each batch in itr
- search_batch(queries): perform search with queries; return a dataframe with schema (ids array<long>, scores array<float>, time float)
- Sim
- class Sim(index_col: str, search_col: str, tokenizer_name: str)
Bases:
object
- index_col: str
- search_col: str
- tokenizer_name: str
- build(for_search, index_table, index_id_col='_id', cache=None)
build the Predicate over index_table using index_id_col as a unique id, optionally using cache to get or set the index
- Parameters:
- for_searchbool
build the predicate for searching, otherwise streaming / filtering
- index_tablepyspark.sql.DataFrame
the dataframe that will be preprocessed / indexed
- index_id_colstr
the name of the unique id column in index_table
- cacheOptional[BuildCache] = None
the cache for built indexes and hash tables
- contains(other) bool
True if the set output by self is a superset (non-strict) of other
- deinit()
release the resources acquired by self.init()
- filter_batch(queries: Series, id1_lists: Series) Iterator[DataFrame]
filter each id_list in id1_lists using this predicate. That is, for each (query, id_list) pair in zip(queries, id1_lists), return only the ids that satisfy predicate(query, id) for id in id_list. Return a dataframe with schema (ids array<long>, scores array<float>, time float)
- property index_col
- index_component_sizes(for_search: bool) dict
return a dictionary of file sizes for each data structure used by this predicate; if the predicate hasn't been built yet, the sizes are None
- Parameters:
- for_searchbool
return the sizes for searching or for filtering
- Returns:
- dict[Any, int | None]
- index_size_in_bytes() int
return the total size in bytes of all the files associated with this predicate
- property indexable
True if the predicate can be efficiently indexed
- init()
initialize the predicate for searching or filtering
- property invertable: bool
- property is_topk: bool
True if self is top-k based, else False
- property k
- search_batch(queries)
perform search with queries; return a dataframe with schema (ids array<long>, scores array<float>, time float)
- property search_col
- property sim
The similarity used by the predicate
- property streamable
True if the predicate can be evaluated over a single partition of the indexed table, otherwise False
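A construction sketch for BM25TopkPredicate (the tokenizer name "whitespace" is a placeholder; the accepted tokenizer strings are not listed in this reference, and index_table is a placeholder dataframe):

    from delex.lang.predicate.topk_predicate import BM25TopkPredicate

    # retrieve, per query, the ids and scores of up to k best BM25 matches on "title"
    pred = BM25TopkPredicate(index_col="title", search_col="title",
                             tokenizer="whitespace", k=10)
    pred.build(for_search=True, index_table=index_table, index_id_col="_id")
    pred.init()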
- class delex.lang.predicate.topk_predicate.CachedBM25IndexKey(index_col: str, tokenizer: str)
Bases:
CachedObjectKey
- index_col: str
- tokenizer: str