delex.storage package

Submodules

delex.storage.memmap_arr module

class delex.storage.memmap_arr.MemmapArray(arr)

Bases: SparkDistributable

Attributes:
shape
values

Methods

deinit()

deinitialize the object, closing resources (e.g. file handles).

init()

initialize the object to be used in a spark worker

to_spark()

send the object to the spark cluster to be used on spark workers

delete

size_in_bytes

deinit()

deinitialize the object, closing resources (e.g. file handles)

delete()
init()

initialize the object to be used in a spark worker

property shape
size_in_bytes()
to_spark()

send the object to the spark cluster to be used on spark workers

property values
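
A minimal usage sketch (a hypothetical lifecycle; only the init/deinit/to_spark contract and the attributes above are documented):

    import numpy as np
    from delex.storage.memmap_arr import MemmapArray

    # wrap a numpy array in a memory-mapped, spark-distributable container
    arr = MemmapArray(np.arange(1000, dtype=np.float32))

    arr.to_spark()     # ship the backing data/metadata to the cluster

    # inside a spark task:
    arr.init()         # open file handles on this worker
    print(arr.shape, arr.values[:5])
    arr.deinit()       # close file handles when done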

delex.storage.memmap_seqs module

class delex.storage.memmap_seqs.MemmapSeqs

Bases: SparkDistributable

a class to hold arbitrary sequences of elements, e.g. strings, arrays of ints, etc.

Methods

build(df, seq_col, dtype[, id_col])

create a MemmapSeqs instance from a spark dataframe

deinit()

deinitialize the object, closing resources (e.g. file handles).

fetch(i, /)

retrieve the sequence associated with i

init()

initialize the object to be used in a spark worker

size_in_bytes()

return the size in bytes on disk

to_spark()

send the object to the spark cluster to be used on spark workers

delete

classmethod build(df: DataFrame, seq_col: str, dtype: type, id_col: str = '_id')

create a MemmapSeqs instance from a spark dataframe

Parameters:
df : pyspark.sql.DataFrame

the dataframe containing the sequences and ids

seq_col : str

the name of the column in df that contains the sequences, e.g. strings, arrays

dtype : type

the dtype of the elements in seq_col

id_col : str

the name of the column in df that contains the ids for retrieving the sequences

Returns:
MemmapSeqs
deinit()

deinitialize the object, closing resources (e.g. file handles)

delete()
fetch(i: int, /) → ndarray | None

retrieve the sequence associated with i

Returns:
np.ndarray if i is found, else None
init()

initialize the object to be used in a spark worker

size_in_bytes() → int

return the size in bytes on disk

to_spark()

send the object to the spark cluster to be used on spark workers
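
A build-and-fetch sketch, assuming an active SparkSession named spark and long-typed _id values (both assumptions; only the signatures above are documented):

    import numpy as np
    from delex.storage.memmap_seqs import MemmapSeqs

    df = spark.createDataFrame(
        [(0, [1, 2, 3]), (1, [4, 5])],
        schema='_id long, toks array<int>',
    )
    seqs = MemmapSeqs.build(df, seq_col='toks', dtype=np.int32)

    seqs.init()                   # open resources before fetching
    print(seqs.fetch(0))          # -> array([1, 2, 3], dtype=int32)
    print(seqs.fetch(42))         # -> None for an unknown id
    print(seqs.size_in_bytes())   # on-disk footprint
    seqs.deinit()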

delex.storage.packed_memmap_arrs module

class delex.storage.packed_memmap_arrs.PackedMemmapArrays(arrs)

Bases: SparkDistributable

a container for many MemmapArrays, used to store them in a single file

Methods

deinit()

deinitialize the object, closing resources (e.g. file handles).

init()

initialize the object to be used in a spark worker

to_spark()

send the object to the spark cluster to be used on spark workers

unpack()

read all of the memmap arrays and return them as a list

delete

size_in_bytes

deinit()

deinitialize the object, closing resources (e.g. file handles)

delete()
init()

initialize the object to be used in a spark worker

size_in_bytes() → int
to_spark()

send the object to the spark cluster to be used on spark workers

unpack() → List[ndarray]

read all of the memmap arrays and return them as a list
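
A pack/unpack sketch (constructing directly from MemmapArray instances and order preservation in unpack() are assumptions):

    import numpy as np
    from delex.storage.memmap_arr import MemmapArray
    from delex.storage.packed_memmap_arrs import PackedMemmapArrays

    parts = [MemmapArray(np.arange(n, dtype=np.int64)) for n in (3, 5, 7)]
    packed = PackedMemmapArrays(parts)   # consolidate into a single file

    packed.init()
    for a in packed.unpack():            # plain ndarrays, per the List[ndarray] return type
        print(a.shape)
    packed.deinit()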

delex.storage.sorted_set module

class delex.storage.sorted_set.MemmapSortedSets

Bases: MemmapSeqs

a class for storing sorted sets of token ids (as arrays)

Methods

build(df, col[, id_col])

Create a new MemmapSortedSets over the tokens in df[col], writing it to disk

cosine(query, ids)

compute cosine score between query and the sequences referenced by ids

deinit()

deinitialize the object, closing resources (e.g. file handles).

fetch(i, /)

retrieve the sequence associated with i

init()

initialize the object to be used in a spark worker

jaccard(query, ids)

compute jaccard score between query and the sequences referenced by ids

overlap_coeff(query, ids)

compute overlap_coefficient score between query and the sequences referenced by ids

size_in_bytes()

return the size in bytes on disk

to_spark()

send the object to the spark cluster to be used on spark workers

CacheKey

delete

class CacheKey(index_col: str, search_col: str | None, tokenizer_type: str)

Bases: CachedObjectKey

index_col: str
search_col: str | None
tokenizer_type: str
classmethod build(df: DataFrame, col: str, id_col: str = '_id')

Create a new MemmapSortedSets over the tokens in df[col], writing it to disk

cosine(query: ndarray, ids: ndarray) → ndarray

compute cosine score between query and the sequences referenced by ids

Parameters:
query : np.ndarray

a sorted unique array of token ids

ids : np.ndarray

an array of ids of token sets in self

Returns:
an array of scores where

scores[i] = cosine(query, token_sets[ids[i]]) if ids[i] is in token_sets, else np.nan

jaccard(query: ndarray, ids: ndarray) → ndarray

compute jaccard score between query and the sequences referenced by ids

Parameters:
query : np.ndarray

a sorted unique array of token ids

ids : np.ndarray

an array of ids of token sets in self

Returns:
an array of scores where

scores[i] = jaccard(query, token_sets[ids[i]]) if ids[i] is in token_sets, else np.nan

overlap_coeff(query: ndarray, ids: ndarray) → ndarray

compute overlap_coefficient score between query and the sequences referenced by ids

Parameters:
query : np.ndarray

a sorted unique array of token ids

ids : np.ndarray

an array of ids of token sets in self

Returns:
an array of scores where

scores[i] = overlap_coefficient(query, token_sets[ids[i]]) if ids[i] is in token_sets, else np.nan
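
A scoring sketch, assuming df holds an _id column and a tokens column of token-id arrays (the column names and query dtype are assumptions; the docs only require sorted, unique ids):

    import numpy as np
    from delex.storage.sorted_set import MemmapSortedSets

    sets = MemmapSortedSets.build(df, col='tokens')

    sets.init()
    query = np.array([3, 7, 19])      # sorted, unique token ids
    ids = np.array([0, 1, 999_999])   # a missing id yields np.nan
    print(sets.jaccard(query, ids))
    print(sets.cosine(query, ids))
    print(sets.overlap_coeff(query, ids))
    sets.deinit()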

delex.storage.span_map module

delex.storage.span_map.create_span_map(keys, offsets, lengths, load_factor=0.75)

create a new span map for keys, offsets, and lengths

Returns:
np.ndarray
delex.storage.span_map.span_map_get_key(arr, key)

get the entry from the span map, returning the offset and length as a tuple

delex.storage.span_map.span_map_insert_key(arr, key, offset, length)

insert a single key into the span_map arr

delex.storage.span_map.span_map_insert_keys(arr, keys, offsets, lengths)

insert many keys into the span_map arr
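
A usage sketch for the span map (the integer dtypes, spare capacity for later inserts, and missing-key behavior are assumptions; only the signatures above are documented):

    import numpy as np
    from delex.storage.span_map import (
        create_span_map,
        span_map_get_key,
        span_map_insert_key,
    )

    keys = np.array([10, 20, 30], dtype=np.int64)
    offsets = np.array([0, 128, 640], dtype=np.int64)
    lengths = np.array([128, 512, 64], dtype=np.int64)

    # build the table at the default 0.75 load factor
    span_map = create_span_map(keys, offsets, lengths)

    offset, length = span_map_get_key(span_map, 20)   # -> (128, 512)
    span_map_insert_key(span_map, 40, 704, 32)        # add one more entry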

delex.storage.string_store module

class delex.storage.string_store.MemmapStrings

Bases: MemmapSeqs

Methods

build(df, col[, id_col])

create a MemmapStrings instance from a spark dataframe

deinit()

deinitialize the object, closing resources (e.g. file handles).

fetch(i)

retrieve the sequence associated with i

init()

initialize the object to be used in a spark worker

size_in_bytes()

return the size in bytes on disk

to_spark()

send the object to the spark cluster to be used on spark workers

CacheKey

delete

fetch_bytes

class CacheKey(index_col: str)

Bases: CachedObjectKey

index_col: str
classmethod build(df, col, id_col='_id')

create a MemmapStrings instance from a spark dataframe

Parameters:
df : pyspark.sql.DataFrame

the dataframe containing the sequences and ids

col : str

the name of the column in df that contains the strings

id_col : str

the name of the column in df that contains the ids for retrieving the sequences

Returns:
MemmapStrings
fetch(i)

retrieve the sequence associated with i

Returns:
np.ndarray if i is found, else None
fetch_bytes(i)
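
A usage sketch, assuming df holds an _id column and a name string column (fetch_bytes is undocumented above; returning the raw encoded bytes is an assumption):

    from delex.storage.string_store import MemmapStrings

    strings = MemmapStrings.build(df, col='name')

    strings.init()
    print(strings.fetch(0))         # stored sequence for id 0, or None if absent
    print(strings.fetch_bytes(0))   # raw bytes for id 0 (assumed semantics)
    strings.deinit()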

delex.storage.vector_store module

class delex.storage.vector_store.MemmapVectorStore

Bases: MemmapSeqs

a class for storing sparse vectors (as arrays of (ind, val) pairs)

Methods

build(df, seq_col[, id_col])

create a MemmapVectorStore instance from a spark dataframe

deinit()

deinitialize the object, closing resources (e.g. file handles).

dot(query, ids)

compute the dot product between query and the vectors referenced by ids

fetch(i, /)

retrieve the sequence associated with i

init()

initialize the object to be used in a spark worker

size_in_bytes()

return the size in bytes on disk

to_spark()

send the object to the spark cluster to be used on spark workers

CacheKey

arrays_to_encoded_sparse_vector

decode_sparse_vector

delete

class CacheKey(index_col: str, search_col: str | None, tokenizer_type: str)

Bases: CachedObjectKey

index_col: str
search_col: str | None
tokenizer_type: str
static arrays_to_encoded_sparse_vector(ind: ndarray, val: ndarray) → bytes
classmethod build(df: DataFrame, seq_col: str, id_col: str = '_id')

create a MemmapVectorStore instance from a spark dataframe

Parameters:
df : pyspark.sql.DataFrame

the dataframe containing the sequences and ids

seq_col : str

the name of the column in df that contains the sequences, e.g. strings, arrays

id_col : str

the name of the column in df that contains the ids for retrieving the sequences

Returns:
MemmapVectorStore
static decode_sparse_vector(bin: bytes) → ndarray
dot(query: ndarray, ids: ndarray) → ndarray

compute the dot product between query and the vectors referenced by ids

Parameters:
query : np.ndarray

a sorted unique array of token ids

ids : np.ndarray

an array of ids of vectors in self

Returns:
an array of scores where

scores[i] = dot(query, vectors[ids[i]]) if ids[i] is found, else np.nan

fetch(i: int, /) → ndarray | None

retrieve the sequence associated with i

Returns:
np.ndarray if i is found, else None
vector_dtype = dtype([('ind', '<i4'), ('val', '<f4')])
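
An encode/score sketch. The byte layout follows the documented vector_dtype; that dot() accepts a fetched vector as its query, and that df holds encoded vectors in a vec column, are assumptions:

    import numpy as np
    from delex.storage.vector_store import MemmapVectorStore

    # encode a sparse vector as bytes
    ind = np.array([2, 7, 11], dtype=np.int32)
    val = np.array([0.5, 0.25, 1.0], dtype=np.float32)
    blob = MemmapVectorStore.arrays_to_encoded_sparse_vector(ind, val)
    print(MemmapVectorStore.decode_sparse_vector(blob))  # ('ind', 'val') pairs

    store = MemmapVectorStore.build(df, seq_col='vec')
    store.init()
    query = store.fetch(0)                        # stored vector for id 0
    print(store.dot(query, np.array([0, 1, 2])))  # np.nan where an id is absent
    store.deinit()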
delex.storage.vector_store.iter_spark_rows(df, prefetch_size: int)
