SeqMat
Fast, vectorized genomic sequences with first-class mutation tracking.
SeqMat treats a DNA sequence as a NumPy-backed matrix of (nucleotide, genomic position) — so slicing, mutation, complement, and splicing are all vectorized array operations. It ships with a compact gene/transcript model that loads from a single indexed SQLite file built from Ensembl annotations, plus a position → gene lookup that resolves overlapping genes in microseconds.
from seqmat import Gene, SeqMat
# What gene is at chr12:25,245,350?
Gene.from_position("12", 25_245_350) # [Gene(KRAS)]
# Load it, assemble the mature mRNA, translate.
kras = Gene.from_file("KRAS")
tx = kras.transcript()
tx.generate_mature_mrna() # 0.2 ms
tx.generate_protein()
# Mutate a sequence with full history and conflict detection.
seq = SeqMat("ATCGATCGATCG")
seq.apply_mutations([(3, "C", "G"), (6, "-", "AAA"), (10, "TC", "-")])
seq.mutations # [(SNP, 3, C, G), (INS, 6, -, AAA), (DEL, 10, TC, -)]
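The mutation list above mixes three variant types. As a rough illustration of what classification and conflict detection can involve, the sketch below labels each (position, ref, alt) tuple and flags pairs whose reference spans overlap. It is not SeqMat's actual implementation: `classify`, `ref_span`, and `find_conflicts` are hypothetical helpers, and treating an insertion as occupying its single anchor position is an assumption made for the example.

```python
from itertools import combinations


def classify(ref: str, alt: str) -> str:
    """Label a (ref, alt) pair, using "-" as the empty allele."""
    if ref == "-":
        return "INS"
    if alt == "-":
        return "DEL"
    return "SNP" if len(ref) == len(alt) == 1 else "SUB"


def ref_span(pos: int, ref: str) -> tuple[int, int]:
    """Half-open span of reference positions a mutation touches.

    Assumption: an insertion occupies only its anchor position.
    """
    width = 1 if ref == "-" else len(ref)
    return pos, pos + width


def find_conflicts(mutations):
    """Return every pair of mutations whose reference spans overlap."""
    conflicts = []
    for a, b in combinations(sorted(mutations), 2):
        a_start, a_end = ref_span(a[0], a[1])
        b_start, b_end = ref_span(b[0], b[1])
        if a_start < b_end and b_start < a_end:
            conflicts.append((a, b))
    return conflicts


muts = [(3, "C", "G"), (6, "-", "AAA"), (10, "TC", "-")]
labels = [classify(ref, alt) for _, ref, alt in muts]

# Adding a SNP inside the deleted "TC" creates one overlapping pair.
clashing = find_conflicts(muts + [(11, "C", "T")])
```

The pairwise check is quadratic and is only meant to make the idea concrete; a sorted sweep over span starts would scale better for large batches.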
Install
pip install seqmat
seqmat setup # one-time: downloads hg38 genes.db + FASTA (~4 GB)
seqmat setup writes a small env block (SEQMAT_DATA_DIR, SEQMAT_DEFAULT_ORGANISM) to your shell rc file so the data is found automatically in the next session. For mouse, run seqmat setup --organism mm39. No config file is needed.
If you only want SeqMat for in-memory sequence work, you can skip seqmat setup entirely. Gene/Transcript are the only things that need the gene database.
Quick start
Sequence operations
from seqmat import SeqMat
import numpy as np
seq = SeqMat("ATCGATCGATCG", indices=np.arange(1000, 1012))
seq[1005] # base at genomic position 1005
seq[1003:1008].seq # "GATCG"
seq.reverse_complement() # in place
seq.remove_regions([(1003, 1005), (1008, 1009)]) # splice out introns
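To make the "matrix of (nucleotide, genomic position)" idea concrete, here is a minimal sketch using a plain NumPy structured array. The field names (`nt`, `pos`) and the helper functions are hypothetical and do not come from SeqMat's source. The sketch also assumes a half-open slice and inclusive region bounds, which may differ from the library's own conventions.

```python
import numpy as np

# Hypothetical layout: one record per base, pairing the nucleotide with its coordinate.
dtype = np.dtype([("nt", "S1"), ("pos", np.int64)])
arr = np.zeros(12, dtype=dtype)
arr["nt"] = np.frombuffer(b"ATCGATCGATCG", dtype="S1")
arr["pos"] = np.arange(1000, 1012)


def slice_by_position(a, start, stop):
    """Select bases with start <= pos < stop (half-open, by assumption)."""
    mask = (a["pos"] >= start) & (a["pos"] < stop)
    return a[mask]


def remove_regions(a, regions):
    """Drop every base inside any (start, end) region, bounds inclusive by assumption."""
    keep = np.ones(len(a), dtype=bool)
    for start, end in regions:
        keep &= ~((a["pos"] >= start) & (a["pos"] <= end))
    return a[keep]


def to_str(a):
    """Join the nucleotide column back into a Python string."""
    return a["nt"].tobytes().decode("ascii")


window = slice_by_position(arr, 1003, 1008)
spliced = remove_regions(arr, [(1003, 1005), (1008, 1009)])
```

Because each base carries its own coordinate, the surviving positions stay correct after splicing without any offset bookkeeping.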
Genes and transcripts
from seqmat import Gene
kras = Gene.from_file("KRAS")
kras # Gene: KRAS, ID: ENSG00000133703, Chr: 12, Transcripts: 14
tx = kras.transcript() # primary transcript
tx.generate_mature_mrna()
tx.generate_protein()
tx.protein[:20] # 'MTEYKLVVVGAGGVGKSALT'
acceptors, donors = kras.splice_sites() # Counter across all transcripts
Position → gene (new in 1.4.0)
from seqmat import Gene, gene_names_at_position
Gene.from_position("12", 25_245_350) # [Gene(KRAS)] — point query
Gene.from_position("chr12", (25_200_000, 25_300_000)) # all overlapping genes in a range
gene_names_at_position("X", 100_000) # names only (no BLOB load), ~17 µs
The lookup is backed by a per-chromosome sorted NumPy index, persisted as a sidecar gene_locations.npz next to genes.db. The index is built lazily on the first call; fresh seqmat setup builds also emit it at no extra cost.
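A sorted-starts index of this kind can be sketched in a few lines. The example below is illustrative only: the gene names and coordinates are made up, `genes_at` is a hypothetical helper, and the real index in seqmat/locator.py may be organised differently.

```python
import numpy as np

# Toy index for one chromosome; names and coordinates are invented for illustration.
names = np.array(["GENE_A", "GENE_B", "GENE_C", "GENE_D"])
starts = np.array([100, 150, 400, 900])
ends = np.array([300, 250, 500, 950])

# Sort once by start so lookups can use binary search.
order = np.argsort(starts, kind="stable")
names, starts, ends = names[order], starts[order], ends[order]


def genes_at(pos: int) -> list[str]:
    """Return the names of all genes whose [start, end] interval contains pos."""
    # Every gene with start <= pos lies left of this insertion point.
    hi = np.searchsorted(starts, pos, side="right")
    # Among those candidates, keep the ones that have not ended yet.
    hit = ends[:hi] >= pos
    return names[:hi][hit].tolist()
```

Here `genes_at(200)` finds both overlapping toy genes, while `genes_at(600)` finds none. The end filter scans the candidate prefix, so a production index would likely bound that scan, for instance with the maximum gene length.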
Loading from FASTA
from seqmat import SeqMat
seq = SeqMat.from_fasta_file("chr12.fasta", "chr12", start=25398284, end=25398384)
seq.apply_mutations([(25398284, "C", "T")]) # KRAS G12D (c.35G>A; C>T on the forward strand, hg19 coordinates), the most famous KRAS variant
Performance
Numbers from an M-series Mac on hg38 (one core, warm caches):
| Operation | Time |
|---|---|
| gene_names_at_position(chrm, pos) | 17 µs |
| Gene.from_file("KRAS") (SQLite + unpickle) | 24 ms |
| Gene.from_position(chrm, pos) end-to-end | 24 ms |
| KRAS mature mRNA assembly | 0.2 ms |
| 1,000-SNP batch on 4 kb sequence | 19 ms |
The hot paths use NumPy structured arrays, LUT-based complement, np.searchsorted on sorted starts, FASTA range-scoped reads, and copy-on-write clone(). See seqmat/seqmat.py and seqmat/locator.py.
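As an illustration of the LUT-based complement mentioned above, the sketch below builds a 256-entry byte lookup table and applies it with a single fancy-indexing step. It shows the general technique under simple assumptions (ASCII input, only A/C/G/T/N handled) and is not copied from seqmat/seqmat.py; `LUT`, `complement`, and `reverse_complement` are names chosen for this example.

```python
import numpy as np

# 256-entry table mapping each byte to its complement; unlisted bytes map to themselves.
LUT = np.arange(256, dtype=np.uint8)
for src, dst in zip(b"ACGTNacgtn", b"TGCANtgcan"):
    LUT[src] = dst


def complement(seq: str) -> str:
    """Complement every base with one vectorized table lookup."""
    codes = np.frombuffer(seq.encode("ascii"), dtype=np.uint8)
    return LUT[codes].tobytes().decode("ascii")


def reverse_complement(seq: str) -> str:
    """Reverse the byte view, then complement it through the same table."""
    codes = np.frombuffer(seq.encode("ascii"), dtype=np.uint8)
    return LUT[codes[::-1]].tobytes().decode("ascii")


rc = reverse_complement("ATCGATCGATCG")
```

The table lookup replaces a per-character dictionary or `if` chain with one array-indexing operation, which is why it suits long sequences.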
Command-line interface
seqmat setup [--organism hg38|mm39] [--path PATH] [--build-from-sources]
seqmat summary # what's installed
seqmat info --organism hg38
seqmat search --organism hg38 --query KRAS
seqmat list --organism hg38 --biotype protein_coding --limit 20
seqmat count --organism hg38
Data setup
By default seqmat setup downloads prebuilt genes.db and FASTA from the SeqMat S3 bucket — no build step. To regenerate genes.db from a specific Ensembl release or custom GTF:
seqmat setup --organism hg38 --build-from-sources
For custom organisms, mirroring the prebuilt bucket, ephemeral environments (Docker / Run.ai), shared multi-user installs, and the full configuration system — see docs/SETUP.md.
API at a glance
from seqmat import (
SeqMat, # vectorized sequence with mutation tracking
Gene, Transcript, # gene/transcript model
gene_names_at_position, # fast name-only positional lookup
build_location_index, # force-rebuild the position index
setup_genomics_data, # programmatic setup
search_genes, available_genes, # discovery helpers
)
Key classes:
- SeqMat: apply_mutations, clone, complement, reverse_complement, remove_regions, from_fasta_file
- Gene: from_file, from_position, transcript, splice_sites, primary_transcript
- Transcript: generate_pre_mrna, generate_mature_mrna, generate_protein, exons, introns
Requirements
Python ≥ 3.10. Core deps: numpy, pandas, pyarrow, pysam, requests, tqdm, platformdirs. Optional: lmdb (faster gene loading on large workloads — install with pip install seqmat[lmdb]).
Contributing
PRs welcome. Run the test suite with pytest tests/. Benchmarks live under tests/bench_*.py.
License
MIT — see LICENSE.
Citation
If SeqMat is useful in your research, please cite:
Lynn Vila, N. (2025). SeqMat: a fast, vectorized genomic sequence library
with mutation tracking. https://github.com/nicolasalynn/seqmat