delex.storage package
Submodules
delex.storage.memmap_arr module
- class delex.storage.memmap_arr.MemmapArray(arr)
Bases: SparkDistributable
- Attributes:
- shape
- values
Methods
- deinit(): deinitialize the object, closing resources (e.g. file handles)
- init(): initialize the object to be used in a spark worker
- to_spark(): send the object to the spark cluster to be used on spark workers
- delete()
- size_in_bytes()
- deinit()
deinitialize the object, closing resources (e.g. file handles)
- delete()
- init()
initialize the object to be used in a spark worker
- property shape
- size_in_bytes()
- to_spark()
send the object to the spark cluster to be used on spark workers
- property values
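A minimal usage sketch; the assumption that the arr constructor argument is an in-memory numpy array is not confirmed by these docs:

    import numpy as np
    from delex.storage.memmap_arr import MemmapArray

    # assumption: MemmapArray(arr) wraps a numpy array in a disk-backed memmap
    arr = MemmapArray(np.arange(10, dtype=np.float32))

    arr.init()           # open resources (e.g. file handles)
    print(arr.shape)     # shape of the stored array
    print(arr.values)    # the memmapped values
    arr.deinit()         # close file handles when done

    arr.to_spark()       # ship the object for use on spark workers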
delex.storage.memmap_seqs module
- class delex.storage.memmap_seqs.MemmapSeqs
Bases: SparkDistributable
a class to hold arbitrary sequences of elements, e.g. strings, arrays of ints, etc.
Methods
- build(df, seq_col, dtype[, id_col]): create a MemmapSeqs instance from a spark dataframe
- deinit(): deinitialize the object, closing resources (e.g. file handles)
- fetch(i, /): retrieve the sequence associated with i
- init(): initialize the object to be used in a spark worker
- size_in_bytes(): return the size in bytes on disk
- to_spark(): send the object to the spark cluster to be used on spark workers
- delete()
- classmethod build(df: DataFrame, seq_col: str, dtype: type, id_col: str = '_id')
create a MemmapSeqs instance from a spark dataframe
- Parameters:
- df : pyspark.sql.DataFrame
the dataframe containing the sequences and ids
- seq_col : str
the name of the column in df that contains the sequences, e.g. strings, arrays
- dtype : type
the dtype of the elements in seq_col
- id_col : str
the name of the column in df that contains the ids for retrieving the sequences
- Returns:
- MemmapSeqs
- deinit()
deinitialize the object, closing resources (e.g. file handles)
- delete()
- fetch(i: int, /) → ndarray | None
retrieve the sequence associated with i
- Returns:
- np.ndarray if i is found, else None
- init()
initialize the object to be used in a spark worker
- size_in_bytes() → int
return the size in bytes on disk
- to_spark()
send the object to the spark cluster to be used on spark workers
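A hedged end-to-end sketch of the lifecycle documented above: build on the driver, ship with to_spark, open with init, fetch by id, then deinit. The column names and data are illustrative, and calling init() driver-side is an assumption:

    import numpy as np
    from pyspark.sql import SparkSession
    from delex.storage.memmap_seqs import MemmapSeqs

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(0, [1, 2, 3]), (1, [4, 5])],
        schema='_id long, tokens array<int>',
    )

    seqs = MemmapSeqs.build(df, seq_col='tokens', dtype=np.int32)
    seqs.to_spark()        # make the object usable on spark workers
    seqs.init()            # open file handles before fetching
    print(seqs.fetch(0))   # the sequence stored for id 0
    print(seqs.fetch(99))  # None for an id that is not present
    seqs.deinit()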
delex.storage.packed_memmap_arrs module
- class delex.storage.packed_memmap_arrs.PackedMemmapArrays(arrs)
Bases: SparkDistributable
a container for many MemmapArrays, used to store many MemmapArrays in a single file
Methods
- deinit(): deinitialize the object, closing resources (e.g. file handles)
- init(): initialize the object to be used in a spark worker
- to_spark(): send the object to the spark cluster to be used on spark workers
- unpack(): read all of the memmap arrays and return them as a list
- delete()
- size_in_bytes()
- deinit()
deinitialize the object, closing resources (e.g. file handles)
- delete()
- init()
initialize the object to be used in a spark worker
- size_in_bytes() → int
- to_spark()
send the object to the spark cluster to be used on spark workers
- unpack() → List[ndarray]
read all of the memmap arrays and return as a list
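A minimal sketch, assuming the arrs constructor argument is a list of MemmapArray instances (the docs do not spell this out):

    import numpy as np
    from delex.storage.memmap_arr import MemmapArray
    from delex.storage.packed_memmap_arrs import PackedMemmapArrays

    # assumption: pack several MemmapArrays into a single backing file
    arrs = [MemmapArray(np.arange(n, dtype=np.float32)) for n in (3, 5)]
    packed = PackedMemmapArrays(arrs)

    packed.init()
    unpacked = packed.unpack()  # List[ndarray], one array per input
    assert len(unpacked) == 2
    packed.deinit()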
delex.storage.sorted_set module
- class delex.storage.sorted_set.MemmapSortedSets
Bases: MemmapSeqs
a class for storing sorted sets of token ids (as arrays)
Methods
- build(df, col[, id_col]): create a new MemmapSortedSets over tokens in df[col], writing to disk
- cosine(query, ids): compute cosine score between query and the sequences referenced by ids
- deinit(): deinitialize the object, closing resources (e.g. file handles)
- fetch(i, /): retrieve the sequence associated with i
- init(): initialize the object to be used in a spark worker
- jaccard(query, ids): compute jaccard score between query and the sequences referenced by ids
- overlap_coeff(query, ids): compute overlap_coefficient score between query and the sequences referenced by ids
- size_in_bytes(): return the size in bytes on disk
- to_spark(): send the object to the spark cluster to be used on spark workers
- CacheKey
- delete()
- class CacheKey(index_col: str, search_col: str | None, tokenizer_type: str)
Bases: CachedObjectKey
- index_col: str
- search_col: str | None
- tokenizer_type: str
- classmethod build(df: DataFrame, col: str, id_col: str = '_id')
Create a new MemmapSortedSets over tokens in df[col], writing to disk
- cosine(query: ndarray, ids: ndarray) → ndarray
compute cosine score between query and the sequences referenced by ids
- Parameters:
- query : np.ndarray
a sorted unique array of token ids
- ids : np.ndarray
an array of ids of token sets in self
- Returns:
- an array of scores where scores[i] = cosine(query, token_sets[ids[i]]) if ids[i] is in token_sets, else scores[i] = np.nan
- jaccard(query: ndarray, ids: ndarray) → ndarray
compute jaccard score between query and the sequences referenced by ids
- Parameters:
- query : np.ndarray
a sorted unique array of token ids
- ids : np.ndarray
an array of ids of token sets in self
- Returns:
- an array of scores where scores[i] = jaccard(query, token_sets[ids[i]]) if ids[i] is in token_sets, else scores[i] = np.nan
- overlap_coeff(query: ndarray, ids: ndarray) → ndarray
compute overlap_coefficient score between query and the sequences referenced by ids
- Parameters:
- query : np.ndarray
a sorted unique array of token ids
- ids : np.ndarray
an array of ids of token sets in self
- Returns:
- an array of scores where scores[i] = overlap_coefficient(query, token_sets[ids[i]]) if ids[i] is in token_sets, else scores[i] = np.nan
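On sorted unique token-id arrays these scores reduce to the standard set-similarity measures: jaccard(A, B) = |A ∩ B| / |A ∪ B|, cosine(A, B) = |A ∩ B| / sqrt(|A| * |B|), and overlap_coefficient(A, B) = |A ∩ B| / min(|A|, |B|). A plain-numpy sketch of what a single entry of the batched scores computes, independent of delex internals:

    import numpy as np

    def set_scores(query: np.ndarray, other: np.ndarray):
        # reference definitions over sorted unique token-id arrays
        inter = np.intersect1d(query, other, assume_unique=True).size
        jaccard = inter / (query.size + other.size - inter)
        cosine = inter / np.sqrt(query.size * other.size)
        overlap = inter / min(query.size, other.size)
        return jaccard, cosine, overlap

    print(set_scores(np.array([1, 2, 3]), np.array([2, 3, 4])))
    # (0.5, 0.666..., 0.666...)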
delex.storage.span_map module
- delex.storage.span_map.create_span_map(keys, offsets, lengths, load_factor=0.75)
create a new span map for keys, offsets, and lengths
- Returns:
- np.ndarray
- delex.storage.span_map.span_map_get_key(arr, key)
get the entry from the span map, returning the offset and length as a tuple
- delex.storage.span_map.span_map_insert_key(arr, key, offset, length)
insert a single key into the span_map arr
- delex.storage.span_map.span_map_insert_keys(arr, keys, offsets, lengths)
insert many keys into the span_map arr
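A hedged round-trip sketch of the span-map functions; the dtypes are assumptions, and the layout of the array returned by create_span_map is internal:

    import numpy as np
    from delex.storage.span_map import create_span_map, span_map_get_key

    # each key owns an (offset, length) span, e.g. into a memmapped file
    keys = np.array([10, 20, 30], dtype=np.int64)
    offsets = np.array([0, 100, 250], dtype=np.int64)
    lengths = np.array([100, 150, 75], dtype=np.int64)

    span_map = create_span_map(keys, offsets, lengths, load_factor=0.75)
    offset, length = span_map_get_key(span_map, 20)  # expected (100, 150)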
delex.storage.string_store module
- class delex.storage.string_store.MemmapStrings
Bases: MemmapSeqs
Methods
- build(df, col[, id_col]): create a MemmapSeqs instance from a spark dataframe
- deinit(): deinitialize the object, closing resources (e.g. file handles)
- fetch(i): retrieve the sequence associated with i
- init(): initialize the object to be used in a spark worker
- size_in_bytes(): return the size in bytes on disk
- to_spark(): send the object to the spark cluster to be used on spark workers
- CacheKey
- delete()
- fetch_bytes()
- class CacheKey(index_col: str)
Bases: CachedObjectKey
- index_col: str
- classmethod build(df, col, id_col='_id')
create a MemmapSeqs instance from a spark dataframe
- Parameters:
- df : pyspark.sql.DataFrame
the dataframe containing the sequences and ids
- col : str
the name of the column in df that contains the sequences, e.g. strings
- id_col : str
the name of the column in df that contains the ids for retrieving the sequences
- Returns:
- MemmapSeqs
- fetch(i)
retrieve the sequence associated with i
- Returns:
- np.ndarray if i is found, else None
- fetch_bytes(i)
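A hedged sketch of building a MemmapStrings over a string column and fetching by id; fetch_bytes is undocumented, so its return value is an assumption:

    from pyspark.sql import SparkSession
    from delex.storage.string_store import MemmapStrings

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(0, 'apple'), (1, 'banana')],
        schema='_id long, name string',
    )

    strings = MemmapStrings.build(df, col='name')
    strings.init()
    print(strings.fetch(1))        # the stored sequence for id 1
    print(strings.fetch_bytes(1))  # presumably the raw encoded bytes
    strings.deinit()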
delex.storage.vector_store module
- class delex.storage.vector_store.MemmapVectorStore
Bases: MemmapSeqs
a class for storing sparse vectors (as arrays of index/value pairs)
Methods
- build(df, seq_col[, id_col]): create a MemmapSeqs instance from a spark dataframe
- deinit(): deinitialize the object, closing resources (e.g. file handles)
- dot(query, ids): compute the dot product between query and the vectors referenced by ids
- fetch(i, /): retrieve the sequence associated with i
- init(): initialize the object to be used in a spark worker
- size_in_bytes(): return the size in bytes on disk
- to_spark(): send the object to the spark cluster to be used on spark workers
- CacheKey
- arrays_to_encoded_sparse_vector()
- decode_sparse_vector()
- delete()
- class CacheKey(index_col: str, search_col: str | None, tokenizer_type: str)
Bases: CachedObjectKey
- index_col: str
- search_col: str | None
- tokenizer_type: str
- static arrays_to_encoded_sparse_vector(ind: ndarray, val: ndarray) → bytes
- classmethod build(df: DataFrame, seq_col: str, id_col: str = '_id')
create a MemmapSeqs instance from a spark dataframe
- Parameters:
- df : pyspark.sql.DataFrame
the dataframe containing the sequences and ids
- seq_col : str
the name of the column in df that contains the sequences, e.g. strings, arrays
- id_col : str
the name of the column in df that contains the ids for retrieving the sequences
- Returns:
- MemmapSeqs
- static decode_sparse_vector(bin: bytes) → ndarray
- dot(query: ndarray, ids: ndarray) → ndarray
compute the dot product between query and the vectors referenced by ids
- Parameters:
- query : np.ndarray
the query vector
- ids : np.ndarray
an array of ids of vectors in self
- Returns:
- an array of scores where scores[i] = dot(query, vectors[ids[i]]) if ids[i] is found, else scores[i] = np.nan
- fetch(i: int, /) → ndarray | None
retrieve the sequence associated with i
- Returns:
- np.ndarray if i is found, else None
- vector_dtype = dtype([('ind', '<i4'), ('val', '<f4')])
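A round-trip sketch using only the documented static helpers; that the decoded array uses vector_dtype (fields 'ind' and 'val') is an assumption consistent with the attribute above:

    import numpy as np
    from delex.storage.vector_store import MemmapVectorStore

    ind = np.array([0, 4, 9], dtype=np.int32)          # nonzero indices
    val = np.array([0.5, 1.0, 2.0], dtype=np.float32)  # their values

    blob = MemmapVectorStore.arrays_to_encoded_sparse_vector(ind, val)
    vec = MemmapVectorStore.decode_sparse_vector(blob)

    # assumption: structured array with vector_dtype fields
    print(vec['ind'], vec['val'])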
- delex.storage.vector_store.iter_spark_rows(df, prefetch_size: int)
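The signature suggests a row iterator with read-ahead; a hedged usage sketch, assuming it yields pyspark Rows while keeping prefetch_size rows buffered:

    from pyspark.sql import SparkSession
    from delex.storage.vector_store import iter_spark_rows

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(i,) for i in range(1000)], schema='_id long')

    for row in iter_spark_rows(df, prefetch_size=256):
        pass  # process each Row as it streams in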