Python bindings for the sshash compressed k-mer dictionary

sshash

Python bindings for sshash-rs — a compressed dictionary for DNA k-mers based on Sparse and Skew Hashing.

sshash stores a set of k-mers (strings of length k over {A, C, G, T}) compactly using minimal perfect hashing and succinct data structures (Elias-Fano, BitFieldVec), and supports fast individual and streaming lookups. It is the k-mer index underlying the piscem read mapper.
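To make the notion of a k-mer set concrete, here is a minimal pure-Python sketch that enumerates the k-mers of a sequence. It does not use sshash at all; the helper name iter_kmers is hypothetical and only illustrates that a sequence of length n contains n - k + 1 overlapping k-mers.

```python
from typing import Iterator


def iter_kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every length-k window of seq, left to right.

    Illustrative helper only (not part of the sshash API).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence shorter than k has no k-mers: the range is empty.
    for start in range(len(seq) - k + 1):
        yield seq[start:start + k]


kmers = list(iter_kmers("ACGTAC", 3))
print(kmers)  # ['ACG', 'CGT', 'GTA', 'TAC']
```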

Installation

pip install sshash

Building an index

From a FASTA/FASTQ file

import sshash

config = sshash.BuildConfig(k=31, m=19)
config.canonical = True   # k-mer and its reverse complement map to the same entry
config.threads = 8        # parallel build (0 = all cores)
config.verbose = False

dict = config.build_from_file("sequences.fa.gz")
dict.save("my_index")
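The canonical option above treats a k-mer and its reverse complement as one entry. The sketch below shows one common way to define a canonical form, by taking the lexicographically smaller of the two strings. This convention is an assumption for illustration; sshash may pick its representative differently internally, and the function names are hypothetical.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Complement every base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of kmer and its reverse complement.

    Assumed convention for illustration; not necessarily what sshash uses.
    """
    return min(kmer, reverse_complement(kmer))


# Both strands collapse to the same representative.
print(canonical("ACGTT"))  # AACGT
print(canonical("AACGT"))  # AACGT
```

With canonical = True, a query for either strand should therefore reach the same entry, and hit.orientation reports which strand matched.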

From a list of sequences in memory

config = sshash.BuildConfig(k=31, m=19)
sequences = ["ACGTACGTACGTACGTACGTACGTACGTACG",
             "TTGCAACCGTTAGCAACGTACGTACGTACGT"]
dict = config.build(sequences)

From a Cuttlefish .cf_seg file

When sequences come from Cuttlefish, build_from_cf_seg also returns a mapping from sshash string IDs back to the original Cuttlefish node IDs:

config = sshash.BuildConfig(k=31, m=19)
dict, segment_ids = config.build_from_cf_seg("unitigs.cf_seg")
# segment_ids[i] is the Cuttlefish node ID for sshash string_id i
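As a small illustration of how that mapping might be used, the sketch below translates a batch of sshash string IDs into Cuttlefish node IDs. It runs on stand-in data: the values in segment_ids are made up, the helper name is hypothetical, and the only assumption about the real return value is that it can be indexed by string ID as the comment above states.

```python
from typing import List, Optional, Sequence


def to_cuttlefish_ids(
    string_ids: Sequence[Optional[int]],
    segment_ids: Sequence[int],
) -> List[Optional[int]]:
    """Map each sshash string ID to its Cuttlefish node ID.

    None entries (k-mers that were not found) are passed through unchanged.
    """
    return [None if sid is None else segment_ids[sid] for sid in string_ids]


# Stand-in data: in real use, segment_ids would come from build_from_cf_seg
# and the string IDs from hit.string_id.
segment_ids = [1042, 17, 993]
print(to_cuttlefish_ids([2, None, 0], segment_ids))  # [993, None, 1042]
```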

Loading and saving

# Save to disk (writes <prefix>.ssi and <prefix>.ssi.mphf)
dict.save("my_index")

# Load from disk
dict = sshash.Dictionary.load("my_index")

Querying

Single k-mer lookup

# Returns a Hit object, or None if not found
hit = dict.query("ACGTACGTACGTACGTACGTACGTACGTACG")
if hit is not None:
    print(hit.kmer_id)           # global k-mer ID
    print(hit.string_id)         # unitig containing this k-mer
    print(hit.kmer_id_in_string) # position within that unitig
    print(hit.orientation)       # +1 forward, -1 reverse complement

# Just the k-mer ID (faster if location info isn't needed)
kmer_id = dict.lookup("ACGTACGTACGTACGTACGTACGTACGTACG")  # None if absent

# Membership test
present = dict.contains("ACGTACGTACGTACGTACGTACGTACGTACG")

Streaming queries over a sequence

The streaming engine maintains minimizer state across consecutive k-mers, avoiding redundant MPHF lookups for adjacent positions. This is significantly faster than calling query in a loop when processing full reads or contigs.

engine = dict.streaming_query()

# Query all k-mers in a sequence at once (returns list)
hits = engine.query_sequence("ACGTACGTACGTACGTACGTACGTACGTACGTACGT")
for hit in hits:
    if hit is not None:
        print(hit.kmer_id, hit.string_id)

# Lazy iterator (memory-efficient for long sequences)
for hit in engine.iter_sequence(b"ACGTACGT..."):
    if hit is not None:
        print(hit.kmer_id)

# Efficiency statistics
print(engine.num_searches)    # full MPHF lookups performed
print(engine.num_extensions)  # k-mers resolved by sliding-window extension
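To see why keeping minimizer state across consecutive k-mers can save work, the following pure-Python sketch counts how often the minimizer changes while sliding over a sequence. It is a simplified model, not the library's implementation: it assumes the minimizer is the lexicographically smallest m-mer (sshash may order m-mers differently, for example by hash), and the counters only loosely correspond to num_searches and num_extensions.

```python
from typing import Optional, Tuple


def minimizer(kmer: str, m: int) -> Tuple[int, str]:
    """Return (offset, m-mer) of the smallest m-mer in kmer.

    Ties are broken in favour of the leftmost occurrence.
    """
    candidates = [(kmer[i:i + m], i) for i in range(len(kmer) - m + 1)]
    mmer, offset = min(candidates)
    return offset, mmer


def count_minimizer_reuse(seq: str, k: int, m: int) -> Tuple[int, int]:
    """Return (fresh, reused) counts over all k-mers of seq.

    A k-mer counts as 'reused' when its minimizer sits at the same absolute
    position as the previous k-mer's minimizer, and as 'fresh' otherwise.
    """
    fresh = 0
    reused = 0
    prev_pos: Optional[int] = None
    for start in range(len(seq) - k + 1):
        offset, _ = minimizer(seq[start:start + k], m)
        pos = start + offset  # absolute position of the minimizer in seq
        if pos == prev_pos:
            reused += 1
        else:
            fresh += 1
        prev_pos = pos
    return fresh, reused


print(count_minimizer_reuse("GGGACGGGG", k=5, m=3))  # (3, 2)
```

In this toy input, two of the five k-mers share their minimizer with the preceding k-mer, so a streaming approach would only need a fresh minimizer-based lookup for the other three.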

Index properties

print(dict.k)           # k-mer length
print(dict.m)           # minimizer length
print(dict.canonical)   # canonical mode flag
print(dict.num_strings) # number of unitigs
print(dict.num_bits)    # total index size in bits
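Since num_bits is reported in bits, a little arithmetic turns it into more familiar units. The helpers below are hypothetical and independent of sshash; the number of k-mers is something the caller must supply, because the property list above does not include it.

```python
def bits_to_mib(num_bits: int) -> float:
    """Convert a size in bits to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return num_bits / 8 / (1024 ** 2)


def bits_per_kmer(num_bits: int, num_kmers: int) -> float:
    """Average index space per k-mer, a common way to compare k-mer indexes."""
    if num_kmers <= 0:
        raise ValueError("num_kmers must be positive")
    return num_bits / num_kmers


# Example with made-up numbers: an 8 Mibit index holding 1,048,576 k-mers.
print(bits_to_mib(8 * 1024 ** 2))               # 1.0
print(bits_per_kmer(8 * 1024 ** 2, 1024 ** 2))  # 8.0
```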

BuildConfig options

Property        Default         Description
canonical       False           Map each k-mer and its reverse complement to the same entry
threads         0               Worker threads during build (0 = all available cores)
ram_limit_gib   8               RAM budget (GiB) before switching to external sort
seed            internal        Seed for internal hash functions
verbose         True            Print progress during building
tmp_dir         "sshash_tmp"    Directory for temporary files during external sort

k and m are set at construction time and cannot be changed afterwards.

License

BSD 3-Clause
