Python bindings for the sshash compressed k-mer dictionary

sshash

Python bindings for sshash-rs, a compressed dictionary for DNA k-mers based on Sparse and Skew Hashing.

sshash stores a set of k-mers (strings of length k over {A, C, G, T}) compactly using minimal perfect hashing and succinct data structures (Elias-Fano, BitFieldVec), and supports fast individual and streaming lookups. It is the k-mer index underlying the piscem read mapper.
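To make the definition of a k-mer concrete, here is a minimal pure-Python sketch that enumerates the k-mers of a sequence and checks that a string is a valid k-mer over {A, C, G, T}. The helper names `iter_kmers` and `is_valid_kmer` are illustrative only and are not part of the sshash API.

```python
DNA_ALPHABET = set("ACGT")


def iter_kmers(sequence, k):
    """Yield every length-k substring of `sequence`, left to right."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length n contains n - k + 1 k-mers (none if n < k).
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k]


def is_valid_kmer(kmer, k):
    """True if `kmer` has length k and uses only the letters A, C, G, T."""
    return len(kmer) == k and set(kmer) <= DNA_ALPHABET


print(list(iter_kmers("ACGTAC", 4)))  # ['ACGT', 'CGTA', 'GTAC']
print(is_valid_kmer("ACGN", 4))       # False
```

A check like `is_valid_kmer` can be useful before querying, since how the library treats strings of the wrong length or with other characters is not described here.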

Installation

pip install sshash

Building an index

From a FASTA/FASTQ file

import sshash

config = sshash.BuildConfig(k=31, m=19)
config.canonical = True   # k-mer and its reverse complement map to the same entry
config.threads = 8        # parallel build (0 = all cores)
config.verbose = False

dict = config.build_from_file("sequences.fa.gz")
dict.save("my_index")
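The idea behind canonical mode can be sketched in plain Python. This is only an illustration: it assumes the common convention of taking the lexicographically smaller of a k-mer and its reverse complement as the representative, which may differ from what sshash does internally. The function names are hypothetical.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Complement every base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Pick one representative for a k-mer and its reverse complement.

    Assumption: the lexicographic minimum is used as the representative.
    """
    return min(kmer, reverse_complement(kmer))


print(reverse_complement("AACG"))              # CGTT
print(canonical("CGTT") == canonical("AACG"))  # True
```

With such a representative, a k-mer and its reverse complement collapse to a single dictionary entry, which is the behaviour that `canonical = True` requests.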

From a list of sequences in memory

config = sshash.BuildConfig(k=31, m=19)
sequences = ["ACGTACGTACGTACGTACGTACGTACGTACG",
             "TTGCAACCGTTAGCAACGTACGTACGTACGT"]
dict = config.build(sequences)

From a Cuttlefish .cf_seg file

When sequences come from Cuttlefish, build_from_cf_seg also returns a mapping from sshash string IDs back to the original Cuttlefish node IDs:

config = sshash.BuildConfig(k=31, m=19)
dict, segment_ids = config.build_from_cf_seg("unitigs.cf_seg")
# segment_ids[i] is the Cuttlefish node ID for sshash string_id i
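As a rough sketch of how this mapping might be used, the snippet below translates the `string_id` of a query hit back to a Cuttlefish node ID. It runs without an index: `FakeHit` is a hypothetical stand-in that mimics only the attributes it needs, and the node IDs are made up for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a query hit; only `string_id` matters here.
FakeHit = namedtuple("FakeHit", ["kmer_id", "string_id"])

# Made-up values: segment_ids[i] is the Cuttlefish node ID of string i.
segment_ids = [1042, 7, 315]


def cuttlefish_node_id(hit, segment_ids):
    """Return the Cuttlefish node ID for a hit, or None for a miss."""
    if hit is None:
        return None
    return segment_ids[hit.string_id]


print(cuttlefish_node_id(FakeHit(kmer_id=12, string_id=2), segment_ids))  # 315
print(cuttlefish_node_id(None, segment_ids))                              # None
```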

Loading and saving

# Save to disk (writes <prefix>.ssi and <prefix>.ssi.mphf)
dict.save("my_index")

# Load from disk
dict = sshash.Dictionary.load("my_index")
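Since saving writes two files per prefix, it can help to confirm both are present before loading. The following is a small sketch using only the standard library; the helper names are hypothetical and the file suffixes are taken from the comment above.

```python
from pathlib import Path


def index_files(prefix):
    """Return the two paths that saving under `prefix` is described as writing."""
    # Build the names by string concatenation so prefixes containing dots are kept intact.
    return [Path(f"{prefix}.ssi"), Path(f"{prefix}.ssi.mphf")]


def index_exists(prefix):
    """True only if both index files are present on disk."""
    return all(path.is_file() for path in index_files(prefix))


print([str(path) for path in index_files("my_index")])
# ['my_index.ssi', 'my_index.ssi.mphf']
```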

Querying

Single k-mer lookup

# Returns a Hit object, or None if not found
hit = dict.query("ACGTACGTACGTACGTACGTACGTACGTACG")
if hit is not None:
    print(hit.kmer_id)           # global k-mer ID
    print(hit.string_id)         # unitig containing this k-mer
    print(hit.kmer_id_in_string) # position within that unitig
    print(hit.orientation)       # +1 forward, -1 reverse complement

# Just the k-mer ID (faster if location info isn't needed)
kmer_id = dict.lookup("ACGTACGTACGTACGTACGTACGTACGTACG")  # None if absent

# Membership test
present = dict.contains("ACGTACGTACGTACGTACGTACGTACGTACG")

Streaming queries over a sequence

The streaming engine maintains minimizer state across consecutive k-mers, avoiding redundant MPHF lookups for adjacent positions. This is significantly faster than calling query in a loop when processing full reads or contigs.

engine = dict.streaming_query()

# Query all k-mers in a sequence at once (returns a list with one entry per k-mer; None where absent)
hits = engine.query_sequence("ACGTACGTACGTACGTACGTACGTACGTACGTACGT")
for hit in hits:
    if hit is not None:
        print(hit.kmer_id, hit.string_id)

# Lazy iterator (memory-efficient for long sequences)
for hit in engine.iter_sequence(b"ACGTACGT..."):
    if hit is not None:
        print(hit.kmer_id)

# Efficiency statistics
print(engine.num_searches)    # full MPHF lookups performed
print(engine.num_extensions)  # k-mers resolved by sliding-window extension
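To see why consecutive k-mers can share work, the sketch below computes the minimizer of each k-mer in a sequence and counts how often it changes. It is a simplified illustration, not the library's algorithm: it assumes lexicographic order as a stand-in for the hash-based minimizer order, and its counts are only loosely analogous to num_searches and num_extensions.

```python
def minimizer(kmer, m):
    """Return the smallest length-m substring of `kmer` (lexicographic order)."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def count_minimizer_runs(sequence, k, m):
    """Return (num_kmers, num_runs).

    A run is a maximal stretch of consecutive k-mers sharing one minimizer.
    """
    previous = None
    num_kmers = 0
    num_runs = 0
    for start in range(len(sequence) - k + 1):
        current = minimizer(sequence[start:start + k], m)
        num_kmers += 1
        if current != previous:
            # The minimizer changed, so a new run starts at this k-mer.
            num_runs += 1
            previous = current
    return num_kmers, num_runs


print(count_minimizer_runs("ACGTACGTTTGCA", k=7, m=3))  # (7, 3)
```

In this toy input, seven k-mers fall into three minimizer runs, so state carried over from the previous k-mer would be reusable for most positions.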

Index properties

print(dict.k)           # k-mer length
print(dict.m)           # minimizer length
print(dict.canonical)   # canonical mode flag
print(dict.num_strings) # number of unitigs
print(dict.num_bits)    # total index size in bits
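Because num_bits is reported in bits, a small conversion makes the figure easier to compare with file sizes. The helpers below are an illustrative sketch; `num_kmers` is a value the caller supplies, since no k-mer count property is listed above.

```python
def bits_to_mib(num_bits):
    """Convert a size in bits to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return num_bits / 8 / (1024 ** 2)


def bits_per_kmer(num_bits, num_kmers):
    """Average index space per k-mer; `num_kmers` must be supplied by the caller."""
    if num_kmers <= 0:
        raise ValueError("num_kmers must be positive")
    return num_bits / num_kmers


print(bits_to_mib(8 * 1024 * 1024))  # 1.0
print(bits_per_kmer(800, 100))       # 8.0
```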

BuildConfig options

Property       Default        Description
canonical      False          Map each k-mer and its reverse complement to the same entry
threads        0              Worker threads during build (0 = all available cores)
ram_limit_gib  8              RAM budget (GiB) before switching to external sort
seed           internal       Seed for internal hash functions
verbose        True           Print progress during building
tmp_dir        "sshash_tmp"   Directory for temporary files during external sort

k and m are set at construction time and cannot be changed afterwards.

License

BSD 3-Clause
