Python bindings for the sshash compressed k-mer dictionary

sshash

Python bindings for sshash-rs — a compressed dictionary for DNA k-mers based on Sparse and Skew Hashing.

sshash stores a set of k-mers (strings of length k over {A, C, G, T}) compactly using minimal perfect hashing and succinct data structures (Elias-Fano, BitFieldVec), and supports fast individual and streaming lookups. It is the k-mer index underlying the piscem read mapper.
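
As a small illustration of the definition above, the following sketch enumerates the k-mers of a short sequence in plain Python. It does not use sshash; the helper name kmers is made up for this example. A sequence of length n contains n - k + 1 k-mers, which is also the number of lookups needed to scan that sequence in full.

```python
from typing import Iterator


def kmers(sequence: str, k: int) -> Iterator[str]:
    """Yield every length-k substring of `sequence`, left to right."""
    # A sequence shorter than k yields nothing (the range is empty).
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k]


read = "ACGTTGCA"
for kmer in kmers(read, 5):
    print(kmer)
```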

Installation

pip install sshash

Building an index

From a FASTA/FASTQ file

import sshash

config = sshash.BuildConfig(k=31, m=19)
config.canonical = True   # k-mer and its reverse complement map to the same entry
config.threads = 8        # parallel build (0 = all cores)
config.verbose = False

dict = config.build_from_file("sequences.fa.gz")
dict.save("my_index")
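
To make the canonical option more concrete, here is a minimal sketch of what "a k-mer and its reverse complement map to the same entry" means. The helpers reverse_complement and canonical are hypothetical and not part of sshash. Picking the lexicographically smaller of the two strands is one common convention and is assumed here only for illustration; the library may choose its representative differently.

```python
# Translation table for complementing uppercase DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Complement every base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Return one representative for a k-mer and its reverse complement.

    Assumption: the representative is the lexicographic minimum.
    """
    return min(kmer, reverse_complement(kmer))


forward = "AACG"
reverse = reverse_complement(forward)
# Both strands reduce to the same representative.
print(forward, reverse, canonical(forward) == canonical(reverse))
```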

From a list of sequences in memory

config = sshash.BuildConfig(k=31, m=19)
sequences = ["ACGTACGTACGTACGTACGTACGTACGTACG",
             "TTGCAACCGTTAGCAACGTACGTACGTACGT"]
dict = config.build(sequences)

From a Cuttlefish .cf_seg file

When sequences come from Cuttlefish, build_from_cf_seg also returns a mapping from sshash string IDs back to the original Cuttlefish node IDs:

config = sshash.BuildConfig(k=31, m=19)
dict, segment_ids = config.build_from_cf_seg("unitigs.cf_seg")
# segment_ids[i] is the Cuttlefish node ID for sshash string_id i
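
The mapping can be used to translate a query result back to a Cuttlefish node. The sketch below assumes that segment_ids can be indexed like a list and that a hit exposes string_id as described under Querying. FakeHit is a stand-in so the example runs without an index, and the node IDs are made up.

```python
from typing import NamedTuple, Optional, Sequence


class FakeHit(NamedTuple):
    """Stand-in for a query result; only string_id is needed here."""
    string_id: int


def cuttlefish_node_id(hit, segment_ids: Sequence[int]) -> Optional[int]:
    """Translate a hit's string_id into the original Cuttlefish node ID.

    Returns None when the k-mer was not found (hit is None).
    """
    if hit is None:
        return None
    return segment_ids[hit.string_id]


segment_ids = [1042, 7, 358]  # made-up node IDs for illustration
print(cuttlefish_node_id(FakeHit(string_id=2), segment_ids))
print(cuttlefish_node_id(None, segment_ids))
```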

Loading and saving

# Save to disk (writes <prefix>.ssi and <prefix>.ssi.mphf)
dict.save("my_index")

# Load from disk
dict = sshash.Dictionary.load("my_index")
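
Because an index is stored as two files, it can be useful to check that both are present before loading. The following sketch relies only on the file names given in the comment above (<prefix>.ssi and <prefix>.ssi.mphf). The helper names are hypothetical, and how the library itself reports a missing file is not covered here.

```python
from pathlib import Path
from typing import List


def index_paths(prefix: str) -> List[Path]:
    """Paths of the two files that make up a saved index."""
    return [Path(f"{prefix}.ssi"), Path(f"{prefix}.ssi.mphf")]


def missing_index_files(prefix: str) -> List[Path]:
    """Return the index files that do not exist on disk."""
    return [path for path in index_paths(prefix) if not path.is_file()]


missing = missing_index_files("my_index")
if missing:
    print("cannot load, missing:", ", ".join(str(p) for p in missing))
else:
    print("both index files are present")
```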

Querying

Single k-mer lookup

# Returns a Hit object, or None if not found
hit = dict.query("ACGTACGTACGTACGTACGTACGTACGTACG")
if hit is not None:
    print(hit.kmer_id)           # global k-mer ID
    print(hit.string_id)         # unitig containing this k-mer
    print(hit.kmer_id_in_string) # position within that unitig
    print(hit.orientation)       # +1 forward, -1 reverse complement

# Just the k-mer ID (faster if location info isn't needed)
kmer_id = dict.lookup("ACGTACGTACGTACGTACGTACGTACGTACG")  # None if absent

# Membership test
present = dict.contains("ACGTACGTACGTACGTACGTACGTACGTACG")
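
Each lookup above expects a string of exactly k characters over {A, C, G, T}. A cheap pre-check like the one sketched below can filter out malformed queries before they reach the index. The helper is_valid_kmer is hypothetical, and this sketch accepts uppercase bases only; how the library treats lowercase or ambiguous bases such as N is not stated here.

```python
VALID_BASES = frozenset("ACGT")


def is_valid_kmer(kmer: str, k: int) -> bool:
    """True if `kmer` has exactly k characters, all uppercase A, C, G or T."""
    return len(kmer) == k and set(kmer) <= VALID_BASES


queries = [
    "ACGTACGTACGTACGTACGTACGTACGTACG",  # 31 valid bases
    "ACGTACGTACGTACGTACGTACGTACGTACN",  # contains an ambiguous base
    "ACGT",                             # too short for k = 31
]
for query in queries:
    print(len(query), is_valid_kmer(query, 31))
```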

Streaming queries over a sequence

The streaming engine maintains minimizer state across consecutive k-mers, avoiding redundant MPHF lookups for adjacent positions. This is significantly faster than calling query in a loop when processing full reads or contigs.

engine = dict.streaming_query()

# Query all k-mers in a sequence at once (returns list)
hits = engine.query_sequence("ACGTACGTACGTACGTACGTACGTACGTACGTACGT")
for hit in hits:
    if hit is not None:
        print(hit.kmer_id, hit.string_id)

# Lazy iterator (memory-efficient for long sequences)
for hit in engine.iter_sequence(b"ACGTACGT..."):
    if hit is not None:
        print(hit.kmer_id)

# Efficiency statistics
print(engine.num_searches)    # full MPHF lookups performed
print(engine.num_extensions)  # k-mers resolved by sliding-window extension
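
The reason streaming helps is that consecutive k-mers overlap in k - 1 bases and therefore often share a minimizer. The toy sketch below shows this effect in plain Python. It assumes the minimizer is the lexicographically smallest m-mer of a k-mer, which is a simplification; the library may order m-mers differently (for example by hash value). The helper names are made up for this example.

```python
from typing import List


def minimizer(kmer: str, m: int) -> str:
    """Smallest length-m substring of `kmer` (lexicographic order assumed)."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def minimizers_along(sequence: str, k: int, m: int) -> List[str]:
    """Minimizer of every k-mer in `sequence`, in left-to-right order."""
    return [minimizer(sequence[i:i + k], m)
            for i in range(len(sequence) - k + 1)]


mins = minimizers_along("GGTTAACTGGT", k=7, m=3)
# All five k-mers contain "AAC", the smallest 3-mer in the sequence,
# so the minimizer never changes between adjacent positions.
print(mins)
print("distinct minimizers:", len(set(mins)))
```

When the minimizer stays the same from one k-mer to the next, a streaming query can reuse work from the previous position instead of starting a fresh lookup.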

Index properties

print(dict.k)           # k-mer length
print(dict.m)           # minimizer length
print(dict.canonical)   # canonical mode flag
print(dict.num_strings) # number of unitigs
print(dict.num_bits)    # total index size in bits
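
Since num_bits is reported in bits, a small conversion makes the figure easier to compare with file sizes or with other indexes. The helpers below are hypothetical. The k-mer count passed to bits_per_kmer must come from elsewhere (for example from the tool that produced the input), because no such property is listed above.

```python
def bits_to_mib(num_bits: int) -> float:
    """Convert a size in bits to mebibytes (1 MiB = 2**20 bytes)."""
    return num_bits / 8 / (1024 ** 2)


def bits_per_kmer(num_bits: int, num_kmers: int) -> float:
    """Average index space per k-mer, given an externally known k-mer count."""
    if num_kmers <= 0:
        raise ValueError("num_kmers must be positive")
    return num_bits / num_kmers


# Made-up numbers for illustration only.
example_bits = 8 * 1024 ** 2
print(bits_to_mib(example_bits))
print(bits_per_kmer(example_bits, 1_000_000))
```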

BuildConfig options

Property        Default         Description
canonical       False           Map each k-mer and its reverse complement to the same entry
threads         0               Worker threads during build (0 = all available cores)
ram_limit_gib   8               RAM budget (GiB) before switching to external sort
seed            internal        Seed for internal hash functions
verbose         True            Print progress during building
tmp_dir         "sshash_tmp"    Directory for temporary files during external sort

k and m are set at construction time and cannot be changed afterwards.

License

BSD 3-Clause

Download files

Download the file for your platform.

Source Distribution

sshash-0.3.0.tar.gz (110.0 kB)

Uploaded: Source

Built Distributions

sshash-0.3.0-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB)

Uploaded: CPython 3.8+, manylinux (glibc 2.17+) x86-64

sshash-0.3.0-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.2 MB)

Uploaded: CPython 3.8+, manylinux (glibc 2.17+) ARM64

sshash-0.3.0-cp38-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (2.3 MB)

Uploaded: CPython 3.8+, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64

File details

Details for the file sshash-0.3.0.tar.gz.

File metadata

  • Download URL: sshash-0.3.0.tar.gz
  • Upload date:
  • Size: 110.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for sshash-0.3.0.tar.gz
Algorithm Hash digest
SHA256 365ea9c8f52f3ef217d4ec273f4727d5d9f9b0fe04d1562f96714f33020c96b9
MD5 726618a49ce7823410e105977e3ec716
BLAKE2b-256 7316fc02b0892805301fb6360dafbc3d1040ead39d4455274f37d09f3f5d20c7

Provenance

The following attestation bundles were made for sshash-0.3.0.tar.gz:

Publisher: publish-sshash-py.yml on COMBINE-lab/sshash-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sshash-0.3.0-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for sshash-0.3.0-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 3e552a26b407c1ce2f675ef20a51441a26aa6bcce3d8a7e755b624cc8b6f0549
MD5 094752eb37785cfc5e6ff28185c64b44
BLAKE2b-256 8af6738e6abf9d8c5aa01ae8cc83c9cd564d95d07602b24f6d03dca6a869379a

File details

Details for the file sshash-0.3.0-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for sshash-0.3.0-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 ea9f67ea255e94eb07ec3eb4310ed4cf972d9a01ef96e13a29dfab360292c442
MD5 2992e1f3dc657ff601882c0a370a5c51
BLAKE2b-256 9843acae16fd191da730bfc6c1352f1118164d2b19fec7af437d7d7177887a93

File details

Details for the file sshash-0.3.0-cp38-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl.

File hashes

Hashes for sshash-0.3.0-cp38-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl
Algorithm Hash digest
SHA256 01c14ac05cfff2a99eeb6445ea08a7e7a020640316997fe772a457d5a6e21510
MD5 7ab20eed0f0314c08f85aacea22024ec
BLAKE2b-256 0d7027db6dfc94af576046cbe5c419ab89aa415d2ca2839eb13e0a16a677f9ce
