Python bindings for the sshash compressed k-mer dictionary
sshash
Python bindings for sshash-rs — a compressed dictionary for DNA k-mers based on Sparse and Skew Hashing.
sshash stores a set of k-mers (strings of length k over {A, C, G, T}) compactly using minimal perfect hashing and succinct data structures (Elias-Fano, BitFieldVec), and supports fast individual and streaming lookups. It is the k-mer index underlying the piscem read mapper.
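To make the terminology concrete, here is a minimal plain-Python sketch of what "the k-mers of a sequence" means and how a k-mer over {A, C, G, T} fits in 2 bits per base. It does not use the sshash API; the helper names `kmers` and `pack` and the particular 2-bit code are assumptions made for illustration, and the library's internal encoding may differ.

```python
# Illustrative only: plain-Python k-mer enumeration, not the sshash API.
# Assumed 2-bit code; the encoding used inside sshash may differ.
ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmers(sequence, k):
    """Return every length-k substring of `sequence`, left to right."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def pack(kmer):
    """Pack a k-mer into an integer using 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODING[base]
    return value


print(kmers("ACGTAC", 3))  # ['ACG', 'CGT', 'GTA', 'TAC']
print(pack("ACG"))         # 6
```

A sequence of length n contains n - k + 1 k-mers, so a sequence shorter than k yields none.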
Installation
pip install sshash
Building an index
From a FASTA/FASTQ file
import sshash
config = sshash.BuildConfig(k=31, m=19)
config.canonical = True # k-mer and its reverse complement map to the same entry
config.threads = 8 # parallel build (0 = all cores)
config.verbose = False
dict = config.build_from_file("sequences.fa.gz")
dict.save("my_index")
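The effect of canonical mode can be sketched in plain Python. The example below assumes the common convention of taking the lexicographically smaller of a k-mer and its reverse complement as the representative; whether sshash picks its representative this way is not stated here, so treat `canonical` as a hypothetical helper rather than the library's rule.

```python
# Illustrative only: not part of the sshash API.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Complement every base, then reverse the string."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Pick one representative for a k-mer and its reverse complement.

    Assumed convention: the lexicographic minimum of the two.
    """
    return min(kmer, reverse_complement(kmer))


print(reverse_complement("ACGTT"))                 # AACGT
print(canonical("ACGTT") == canonical("AACGT"))    # True
```

Because both strands collapse to one representative, a canonical index can answer a query for either strand with a single stored entry.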
From a list of sequences in memory
config = sshash.BuildConfig(k=31, m=19)
sequences = [
    "ACGTACGTACGTACGTACGTACGTACGTACG",
    "TTGCAACCGTTAGCAACGTACGTACGTACGT",
]
dict = config.build(sequences)
From a Cuttlefish .cf_seg file
When sequences come from Cuttlefish, build_from_cf_seg also returns a mapping from sshash string IDs back to the original Cuttlefish node IDs:
config = sshash.BuildConfig(k=31, m=19)
dict, segment_ids = config.build_from_cf_seg("unitigs.cf_seg")
# segment_ids[i] is the Cuttlefish node ID for sshash string_id i
Loading and saving
# Save to disk (writes <prefix>.ssi and <prefix>.ssi.mphf)
dict.save("my_index")
# Load from disk
dict = sshash.Dictionary.load("my_index")
Querying
Single k-mer lookup
# Returns a Hit object, or None if not found
hit = dict.query("ACGTACGTACGTACGTACGTACGTACGTACG")
if hit is not None:
    print(hit.kmer_id)            # global k-mer ID
    print(hit.string_id)          # unitig containing this k-mer
    print(hit.kmer_id_in_string)  # position within that unitig
    print(hit.orientation)        # +1 forward, -1 reverse complement
# Just the k-mer ID (faster if location info isn't needed)
kmer_id = dict.lookup("ACGTACGTACGTACGTACGTACGTACGTACG") # None if absent
# Membership test
present = dict.contains("ACGTACGTACGTACGTACGTACGTACGTACG")
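To show how the fields of a hit relate to one another, here is a brute-force toy lookup over a small list of unitigs. It is not how sshash works internally (the library uses minimal perfect hashing rather than a scan), and `ToyHit` and `toy_query` are hypothetical names. The sketch also assumes that global k-mer IDs are assigned in order along the stored strings, which is an assumption rather than something stated above.

```python
from collections import namedtuple

# Hypothetical stand-in that mirrors the documented Hit fields.
ToyHit = namedtuple("ToyHit", "kmer_id string_id kmer_id_in_string orientation")
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def toy_query(unitigs, kmer):
    """Scan every unitig for `kmer` or its reverse complement.

    Returns a ToyHit, or None if neither strand is present.
    """
    k = len(kmer)
    rc = kmer.translate(_COMPLEMENT)[::-1]
    first_id = 0  # assumed global ID of the first k-mer in the current unitig
    for string_id, unitig in enumerate(unitigs):
        num_kmers = max(len(unitig) - k + 1, 0)
        for offset in range(num_kmers):
            window = unitig[offset:offset + k]
            if window == kmer:
                return ToyHit(first_id + offset, string_id, offset, +1)
            if window == rc:
                return ToyHit(first_id + offset, string_id, offset, -1)
        first_id += num_kmers
    return None


unitigs = ["AACTG", "GGAC"]
print(toy_query(unitigs, "CTG"))  # forward match in unitig 0 at offset 2
print(toy_query(unitigs, "GTC"))  # reverse-complement match in unitig 1
print(toy_query(unitigs, "TTT"))  # None
```

In this toy, `kmer_id` equals the number of k-mers in earlier unitigs plus `kmer_id_in_string`, and `orientation` is -1 when the stored strand is the reverse complement of the query.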
Streaming queries over a sequence
The streaming engine maintains minimizer state across consecutive k-mers, avoiding redundant MPHF lookups for adjacent positions. This is significantly faster than calling query in a loop when processing full reads or contigs.
engine = dict.streaming_query()
# Query all k-mers in a sequence at once (returns list)
hits = engine.query_sequence("ACGTACGTACGTACGTACGTACGTACGTACGTACGT")
for hit in hits:
    if hit is not None:
        print(hit.kmer_id, hit.string_id)
# Lazy iterator (memory-efficient for long sequences)
for hit in engine.iter_sequence(b"ACGTACGT..."):
    if hit is not None:
        print(hit.kmer_id)
# Efficiency statistics
print(engine.num_searches) # full MPHF lookups performed
print(engine.num_extensions) # k-mers resolved by sliding-window extension
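The reason carrying minimizer state pays off is that consecutive k-mers overlap in k - 1 bases and therefore often share a minimizer. The sketch below illustrates that effect in plain Python. It is not the streaming engine's algorithm: it orders m-mers lexicographically (an assumption made for simplicity; hash-based orderings are also common), and `minimizer` and `minimizer_runs` are hypothetical helpers.

```python
# Illustrative only: shows minimizer sharing, not the sshash streaming engine.
def minimizer(kmer, m):
    """Smallest length-m substring of `kmer` under lexicographic order."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def minimizer_runs(sequence, k, m):
    """Group consecutive k-mers of `sequence` that share a minimizer.

    Returns a list of (minimizer, number_of_consecutive_kmers) pairs.
    """
    runs = []
    for i in range(len(sequence) - k + 1):
        mini = minimizer(sequence[i:i + k], m)
        if runs and runs[-1][0] == mini:
            runs[-1] = (mini, runs[-1][1] + 1)
        else:
            runs.append((mini, 1))
    return runs


runs = minimizer_runs("GATTACAGATTACA", k=7, m=3)
print(runs)  # [('ACA', 5), ('AGA', 2), ('ACA', 1)]
```

Here 8 consecutive k-mers fall into only 3 runs, so a lookup keyed on the minimizer would need to be repeated far less often than once per k-mer.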
Index properties
print(dict.k) # k-mer length
print(dict.m) # minimizer length
print(dict.canonical) # canonical mode flag
print(dict.num_strings) # number of unitigs
print(dict.num_bits) # total index size in bits
BuildConfig options
| Property | Default | Description |
|---|---|---|
| canonical | False | Map each k-mer and its reverse complement to the same entry |
| threads | 0 | Worker threads during build (0 = all available cores) |
| ram_limit_gib | 8 | RAM budget (GiB) before switching to external sort |
| seed | internal | Seed for internal hash functions |
| verbose | True | Print progress during building |
| tmp_dir | "sshash_tmp" | Directory for temporary files during external sort |
k and m are set at construction time and cannot be changed afterwards.
References
- Giulio Ermanno Pibiri. "Sparse and Skew Hashing of K-Mers." Bioinformatics, 2022.
- Giulio Ermanno Pibiri and Rob Patro. "Optimizing sparse and skew hashing: faster k-mer dictionaries." bioRxiv, 2026.
License
BSD 3-Clause