delex.storage package
Submodules
delex.storage.memmap_arr module
- class delex.storage.memmap_arr.MemmapArray(arr)
Bases: SparkDistributable
- Attributes:
- shape
- values
Methods
- deinit(): deinitialize the object, closing resources (e.g. file handles)
- init(): initialize the object to be used in a spark worker
- to_spark(): send the object to the spark cluster to be used on spark workers
- delete()
- size_in_bytes()
- deinit()
deinitialize the object, closing resources (e.g. file handles)
- delete()
- init()
initialize the object to be used in a spark worker
- property shape
- size_in_bytes()
- to_spark()
send the object to the spark cluster to be used on spark workers
- property values
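A minimal usage sketch; the assumption that the arr constructor argument is an in-memory numpy array is not confirmed by these docs:

    import numpy as np
    from delex.storage.memmap_arr import MemmapArray

    # assumption: MemmapArray(arr) wraps a numpy array in a disk-backed memmap
    arr = MemmapArray(np.arange(10, dtype=np.float32))

    arr.init()           # open resources (e.g. file handles)
    print(arr.shape)     # shape of the stored array
    print(arr.values)    # the memmapped values
    arr.deinit()         # close file handles when done

    arr.to_spark()       # ship the object for use on spark workers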
delex.storage.memmap_seqs module
- class delex.storage.memmap_seqs.MemmapSeqs
Bases: SparkDistributable
a class to hold arbitrary sequences of elements, e.g. strings, arrays of ints, etc.
Methods
- build(df, seq_col, dtype[, id_col]): create a MemmapSeqs instance from a spark dataframe
- deinit(): deinitialize the object, closing resources (e.g. file handles)
- fetch(i, /): retrieve the sequence associated with i
- init(): initialize the object to be used in a spark worker
- size_in_bytes(): return the size in bytes on disk
- to_spark(): send the object to the spark cluster to be used on spark workers
- delete()
- classmethod build(df: DataFrame, seq_col: str, dtype: type, id_col: str = '_id')
create a MemmapSeqs instance from a spark dataframe
- Parameters:
- df : pyspark.sql.DataFrame
the dataframe containing the sequences and ids
- seq_col : str
the name of the column in df that contains the sequences, e.g. strings, arrays
- dtype : type
the dtype of the elements in seq_col
- id_col : str
the name of the column in df that contains the ids for retrieving the sequences
- Returns:
- MemmapSeqs
- deinit()
deinitialize the object, closing resources (e.g. file handles)
- delete()
- fetch(i: int, /) → ndarray | None
retrieve the sequence associated with i
- Returns:
- np.ndarray if i is found, else None
- init()
initialize the object to be used in a spark worker
- size_in_bytes() → int
return the size in bytes on disk
- to_spark()
send the object to the spark cluster to be used on spark workers
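A hedged end-to-end sketch of the lifecycle documented above: build on the driver, ship with to_spark, open with init, fetch by id, then deinit. The column names and data are illustrative, and calling init() driver-side is an assumption:

    import numpy as np
    from pyspark.sql import SparkSession
    from delex.storage.memmap_seqs import MemmapSeqs

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(0, [1, 2, 3]), (1, [4, 5])],
        schema='_id long, tokens array<int>',
    )

    seqs = MemmapSeqs.build(df, seq_col='tokens', dtype=np.int32)
    seqs.to_spark()        # make the object usable on spark workers
    seqs.init()            # open file handles before fetching
    print(seqs.fetch(0))   # the sequence stored for id 0
    print(seqs.fetch(99))  # None for an id that is not present
    seqs.deinit()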
delex.storage.packed_memmap_arrs module
- class delex.storage.packed_memmap_arrs.PackedMemmapArrays(arrs)
Bases: SparkDistributable
a container for many MemmapArrays, used to store many MemmapArrays in a single file
Methods
- deinit(): deinitialize the object, closing resources (e.g. file handles)
- init(): initialize the object to be used in a spark worker
- to_spark(): send the object to the spark cluster to be used on spark workers
- unpack(): read all of the memmap arrays and return them as a list
- delete()
- size_in_bytes()
- deinit()
deinitialize the object, closing resources (e.g. file handles)
- delete()
- init()
initialize the object to be used in a spark worker
- size_in_bytes() → int
- to_spark()
send the object to the spark cluster to be used on spark workers
- unpack() → List[ndarray]
read all of the memmap arrays and return as a list
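A minimal sketch, assuming the arrs constructor argument is a list of MemmapArray instances (the docs do not spell this out):

    import numpy as np
    from delex.storage.memmap_arr import MemmapArray
    from delex.storage.packed_memmap_arrs import PackedMemmapArrays

    # assumption: pack several MemmapArrays into a single backing file
    arrs = [MemmapArray(np.arange(n, dtype=np.float32)) for n in (3, 5)]
    packed = PackedMemmapArrays(arrs)

    packed.init()
    unpacked = packed.unpack()  # List[ndarray], one array per input
    assert len(unpacked) == 2
    packed.deinit()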
delex.storage.sorted_set module
- class delex.storage.sorted_set.MemmapSortedSets
Bases: MemmapSeqs
a class for storing sorted sets of token ids (as arrays)
Methods
- build(df, col[, id_col]): create a new MemmapSortedSets over tokens in df[col], writing to disk
- cosine(query, ids): compute cosine score between query and the sequences referenced by ids
- deinit(): deinitialize the object, closing resources (e.g. file handles)
- fetch(i, /): retrieve the sequence associated with i
- init(): initialize the object to be used in a spark worker
- jaccard(query, ids): compute jaccard score between query and the sequences referenced by ids
- overlap_coeff(query, ids): compute overlap_coefficient score between query and the sequences referenced by ids
- size_in_bytes(): return the size in bytes on disk
- to_spark(): send the object to the spark cluster to be used on spark workers
- CacheKey
- delete()
- class CacheKey(index_col: str, search_col: str | None, tokenizer_type: str)
Bases: CachedObjectKey
- index_col: str
- search_col: str | None
- tokenizer_type: str
- classmethod build(df: DataFrame, col: str, id_col: str = '_id')
Create a new MemmapSortedSets over tokens in df[col], writing to disk
- cosine(query: ndarray, ids: ndarray) → ndarray
compute cosine score between query and the sequences referenced by ids
- Parameters:
- query : np.ndarray
a sorted unique array of token ids
- ids : np.ndarray
an array of ids of token sets in self
- Returns:
- an array of scores where scores[i] = cosine(query, token_sets[ids[i]]) if ids[i] is in token_sets, else scores[i] = np.nan
- jaccard(query: ndarray, ids: ndarray) → ndarray
compute jaccard score between query and the sequences referenced by ids
- Parameters:
- query : np.ndarray
a sorted unique array of token ids
- ids : np.ndarray
an array of ids of token sets in self
- Returns:
- an array of scores where scores[i] = jaccard(query, token_sets[ids[i]]) if ids[i] is in token_sets, else scores[i] = np.nan
- overlap_coeff(query: ndarray, ids: ndarray) → ndarray
compute overlap_coefficient score between query and the sequences referenced by ids
- Parameters:
- query : np.ndarray
a sorted unique array of token ids
- ids : np.ndarray
an array of ids of token sets in self
- Returns:
- an array of scores where scores[i] = overlap_coefficient(query, token_sets[ids[i]]) if ids[i] is in token_sets, else scores[i] = np.nan
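On sorted unique token-id arrays these scores reduce to the standard set-similarity measures: jaccard(A, B) = |A ∩ B| / |A ∪ B|, cosine(A, B) = |A ∩ B| / sqrt(|A| * |B|), and overlap_coefficient(A, B) = |A ∩ B| / min(|A|, |B|). A plain-numpy sketch of what a single entry of the batched scores computes, independent of delex internals:

    import numpy as np

    def set_scores(query: np.ndarray, other: np.ndarray):
        # reference definitions over sorted unique token-id arrays
        inter = np.intersect1d(query, other, assume_unique=True).size
        jaccard = inter / (query.size + other.size - inter)
        cosine = inter / np.sqrt(query.size * other.size)
        overlap = inter / min(query.size, other.size)
        return jaccard, cosine, overlap

    print(set_scores(np.array([1, 2, 3]), np.array([2, 3, 4])))
    # (0.5, 0.666..., 0.666...)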
delex.storage.span_map module
- delex.storage.span_map.create_span_map(keys, offsets, lengths, load_factor=0.75)
create a new span map for keys, offsets, and lengths
- Returns:
- np.ndarray
- delex.storage.span_map.span_map_get_key(arr, key)
get the entry from the span map, returning the offset and length as a tuple
- delex.storage.span_map.span_map_insert_key(arr, key, offset, length)
insert a single key into the span_map arr
- delex.storage.span_map.span_map_insert_keys(arr, keys, offsets, lengths)
insert many keys into the span_map arr
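A hedged round-trip sketch of the span-map functions; the dtypes are assumptions, and the layout of the array returned by create_span_map is internal:

    import numpy as np
    from delex.storage.span_map import create_span_map, span_map_get_key

    # each key owns an (offset, length) span, e.g. into a memmapped file
    keys = np.array([10, 20, 30], dtype=np.int64)
    offsets = np.array([0, 100, 250], dtype=np.int64)
    lengths = np.array([100, 150, 75], dtype=np.int64)

    span_map = create_span_map(keys, offsets, lengths, load_factor=0.75)
    offset, length = span_map_get_key(span_map, 20)  # expected (100, 150)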
delex.storage.string_store module
- class delex.storage.string_store.MemmapStrings
Bases: MemmapSeqs
Methods
- build(df, col[, id_col]): create a MemmapSeqs instance from a spark dataframe
- deinit(): deinitialize the object, closing resources (e.g. file handles)
- fetch(i): retrieve the sequence associated with i
- init(): initialize the object to be used in a spark worker
- size_in_bytes(): return the size in bytes on disk
- to_spark(): send the object to the spark cluster to be used on spark workers
- CacheKey
- delete()
- fetch_bytes()
- class CacheKey(index_col: str)
Bases: CachedObjectKey
- index_col: str
- classmethod build(df, col, id_col='_id')
create a MemmapSeqs instance from a spark dataframe
- Parameters:
- df : pyspark.sql.DataFrame
the dataframe containing the sequences and ids
- col : str
the name of the column in df that contains the sequences, e.g. strings
- id_col : str
the name of the column in df that contains the ids for retrieving the sequences
- Returns:
- MemmapSeqs
- fetch(i)
retrieve the sequence associated with i
- Returns:
- np.ndarray if i is found, else None
- fetch_bytes(i)
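A hedged sketch of building a MemmapStrings over a string column and fetching by id; fetch_bytes is undocumented, so its return value is an assumption:

    from pyspark.sql import SparkSession
    from delex.storage.string_store import MemmapStrings

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(0, 'apple'), (1, 'banana')],
        schema='_id long, name string',
    )

    strings = MemmapStrings.build(df, col='name')
    strings.init()
    print(strings.fetch(1))        # the stored sequence for id 1
    print(strings.fetch_bytes(1))  # presumably the raw encoded bytes
    strings.deinit()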
delex.storage.vector_store module
- class delex.storage.vector_store.MemmapVectorStore
Bases: MemmapSeqs
a class for storing sparse vectors (as arrays of index/value pairs)
Methods
- build(df, seq_col[, id_col]): create a MemmapSeqs instance from a spark dataframe
- deinit(): deinitialize the object, closing resources (e.g. file handles)
- dot(query, ids): compute the dot product between query and the vectors referenced by ids
- fetch(i, /): retrieve the sequence associated with i
- init(): initialize the object to be used in a spark worker
- size_in_bytes(): return the size in bytes on disk
- to_spark(): send the object to the spark cluster to be used on spark workers
- CacheKey
- arrays_to_encoded_sparse_vector()
- decode_sparse_vector()
- delete()
- class CacheKey(index_col: str, search_col: str | None, tokenizer_type: str)
Bases: CachedObjectKey
- index_col: str
- search_col: str | None
- tokenizer_type: str
- static arrays_to_encoded_sparse_vector(ind: ndarray, val: ndarray) → bytes
- classmethod build(df: DataFrame, seq_col: str, id_col: str = '_id')
create a MemmapSeqs instance from a spark dataframe
- Parameters:
- df : pyspark.sql.DataFrame
the dataframe containing the sequences and ids
- seq_col : str
the name of the column in df that contains the sequences, e.g. strings, arrays
- id_col : str
the name of the column in df that contains the ids for retrieving the sequences
- Returns:
- MemmapSeqs
- static decode_sparse_vector(bin: bytes) → ndarray
- dot(query: ndarray, ids: ndarray) → ndarray
compute the dot product between query and the vectors referenced by ids
- Parameters:
- query : np.ndarray
the query vector
- ids : np.ndarray
an array of ids of vectors in self
- Returns:
- an array of scores where scores[i] = dot(query, vectors[ids[i]]) if ids[i] is found, else scores[i] = np.nan
- fetch(i: int, /) → ndarray | None
retrieve the sequence associated with i
- Returns:
- np.ndarray if i is found, else None
- vector_dtype = dtype([('ind', '<i4'), ('val', '<f4')])
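A round-trip sketch using only the documented static helpers; that the decoded array uses vector_dtype (fields 'ind' and 'val') is an assumption consistent with the attribute above:

    import numpy as np
    from delex.storage.vector_store import MemmapVectorStore

    ind = np.array([0, 4, 9], dtype=np.int32)          # nonzero indices
    val = np.array([0.5, 1.0, 2.0], dtype=np.float32)  # their values

    blob = MemmapVectorStore.arrays_to_encoded_sparse_vector(ind, val)
    vec = MemmapVectorStore.decode_sparse_vector(blob)

    # assumption: structured array with vector_dtype fields
    print(vec['ind'], vec['val'])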
- delex.storage.vector_store.iter_spark_rows(df, prefetch_size: int)
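The signature suggests a row iterator with read-ahead; a hedged usage sketch, assuming it yields pyspark Rows while keeping prefetch_size rows buffered:

    from pyspark.sql import SparkSession
    from delex.storage.vector_store import iter_spark_rows

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(i,) for i in range(1000)], schema='_id long')

    for row in iter_spark_rows(df, prefetch_size=256):
        pass  # process each Row as it streams in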