Fast, vectorized genomic sequences with first-class mutation tracking.
Project description
A Python library for genomic sequences that keeps genomic coordinates and mutation history attached to the bases.
SeqMat stores a DNA sequence as a NumPy structured array — (nt, ref, index, mut_type, valid) in parallel columns rather than a byte string. Slicing, mutation, complement, and splicing preserve genomic coordinates and the reference state through every transform, and the mutation history is recorded as a side product. The library also includes a gene/transcript model loaded from a single SQLite file built from Ensembl annotations, and a position → gene lookup over a sidecar index.
from seqmat import Gene, SeqMat
# What gene is at chr12:25,245,350?
Gene.from_position("12", 25_245_350) # [Gene(KRAS)]
# Load it, assemble the mature mRNA, translate.
kras = Gene.from_file("KRAS")
tx = kras.transcript()
tx.generate_mature_mrna()
tx.generate_protein()
# Apply mutations; history is kept on the SeqMat.
seq = SeqMat("ATCGATCGATCG")
seq.apply_mutations([(3, "C", "G"), (6, "-", "AAA"), (10, "TC", "-")])
seq.mutations
A worked example, end to end: KRAS G12D — coordinate to protein.
Install
pip install seqmat
seqmat setup # one-time: downloads hg38 genes.db + FASTA (~4 GB)
seqmat setup appends SEQMAT_DATA_DIR and SEQMAT_DEFAULT_ORGANISM to your shell rc so the data is found automatically next session. Mouse: seqmat setup --organism mm39.
If you only want SeqMat for in-memory sequence work, you can skip seqmat setup — Gene and Transcript are the only objects that need the gene database.
Quick start
Sequence operations
from seqmat import SeqMat
import numpy as np
seq = SeqMat("ATCGATCGATCG", indices=np.arange(1000, 1012))
seq[1005] # base at genomic position 1005
seq[1003:1008].seq # "GATCG"
seq.reverse_complement() # in place
seq.remove_regions([(1003, 1005), (1008, 1009)]) # splice out introns
Genes and transcripts
from seqmat import Gene
kras = Gene.from_file("KRAS")
kras # Gene: KRAS, ID: ENSG00000133703, Chr: 12, Transcripts: 14
tx = kras.transcript() # primary transcript
tx.generate_mature_mrna()
tx.generate_protein()
tx.protein[:20] # 'MTEYKLVVVGAGGVGKSALT'
acceptors, donors = kras.splice_sites() # Counter across all transcripts
Position → gene (added in 1.4.0)
from seqmat import Gene, gene_names_at_position
Gene.from_position("12", 25_245_350) # [Gene(KRAS)] — point query
Gene.from_position("chr12", (25_200_000, 25_300_000)) # all overlapping genes in a range
gene_names_at_position("X", 100_000) # names only (no BLOB load)
Backed by a per-chromosome sorted NumPy index persisted as a sidecar gene_locations.npz next to genes.db. Built lazily on first call; fresh seqmat setup builds also emit it.
Loading from FASTA
seq = SeqMat.from_fasta_file("chr12.fasta", "chr12", start=25398284, end=25398384)
seq.apply_mutations([(25398290, "G", "A")]) # G12D, a common KRAS variant
Performance
Measured on an M-series Mac, hg38, one core, warm caches. Reproducible via the scripts in benchmarks/.
Headline numbers
| Operation | Time |
|---|---|
gene_names_at_position(chrm, pos) |
2.5 µs |
| KRAS mature mRNA assembly (splice + translate) | 0.2 ms |
| 1,000-SNP batch on 45 kb sequence (with history) | 0.5 ms |
Gene.from_file("KRAS") (SQLite + unpickle) |
24 ms |
Gene.from_position(chrm, pos) end-to-end |
24 ms |
Position → gene
Same 63,241 hg38 gene intervals, same 10,000 random point queries. Different libraries are designed for different access patterns, so we report two workloads. Reproduce with python benchmarks/bench_position_lookup.py.
Per-query (one coordinate, one answer, in a loop — the Gene.from_position pattern):
| Implementation | Per query |
|---|---|
| SeqMat (locator) | 2.5 µs |
Python dict + bisect |
21 µs |
pandas (groupby chrm + boolean mask) |
79 µs |
| PyRanges constructed per call | 2.0 ms |
Batched (the whole query set in one call — PyRanges' native idiom):
| Implementation | Per query |
|---|---|
PyRanges .join |
2.07 µs |
| SeqMat (serial loop) | 2.59 µs |
PyRanges and SeqMat are in roughly the same range when each is used the way it was designed for. Constructing a PyRanges per call is the slow path and not the PyRanges authors' intent — included here only because it's a common mistake.
Sequence operations vs Biopython
50 kb sequence, 1,000 SNPs / 10 introns; 1 Mb reverse-complement. Reproduce with python benchmarks/bench_sequence_ops.py.
| Workload | Biopython / native | SeqMat |
|---|---|---|
| 1,000 SNPs (no history) | MutableSeq: 0.21 ms |
— |
| 1,000 SNPs with full mutation history | — | 0.5 ms |
| 10-intron splice (50 kb) | str.join: 0.002 ms |
0.8 ms |
| Reverse-complement (1 Mb) | Seq.reverse_complement: 0.57 ms |
9.4 ms |
For raw byte-string throughput, Bio.Seq and bytearray are faster. SeqMat's structured array adds bookkeeping that those types don't carry — genomic coordinates that survive indels, the reference held alongside the current sequence, and a mutation history list. The trade-off makes sense when you need that bookkeeping anyway (the KRAS G12D walkthrough is a typical case) and doesn't when you don't.
Implementation notes
NumPy structured arrays, LUT-based complement, np.searchsorted on sorted exon and gene-start coordinates, FASTA range-scoped reads, copy-on-write clone(). See seqmat/seqmat.py and seqmat/locator.py.
Command-line interface
seqmat setup [--organism hg38|mm39] [--path PATH] [--build-from-sources]
seqmat summary
seqmat info --organism hg38
seqmat search --organism hg38 --query KRAS
seqmat list --organism hg38 --biotype protein_coding --limit 20
seqmat count --organism hg38
Data setup
By default seqmat setup downloads prebuilt genes.db and FASTA from the SeqMat S3 bucket. To regenerate genes.db from a specific Ensembl release or a custom GTF:
seqmat setup --organism hg38 --build-from-sources
Custom organisms, mirror buckets, ephemeral environments (Docker / Run.ai), shared multi-user installs, and the full configuration system are documented in docs/SETUP.md.
API at a glance
from seqmat import (
SeqMat, # coordinate-tracked sequence with mutation history
Gene, Transcript, # gene/transcript model
gene_names_at_position, # name-only positional lookup
build_location_index, # force-rebuild the position index
setup_genomics_data, # programmatic setup
search_genes, available_genes, # discovery helpers
)
Key classes:
SeqMat—apply_mutations,clone,complement,reverse_complement,remove_regions,from_fasta_fileGene—from_file,from_position,get,transcript,splice_sites,primary_transcriptTranscript—generate_pre_mrna,generate_mature_mrna,generate_protein,exons,introns
Full API reference (auto-generated from docstrings): nicolasalynn.github.io/seqmat.
Requirements
Python ≥ 3.10. Core deps: numpy, pandas, pyarrow, pysam, requests, tqdm, platformdirs. Optional: lmdb for faster gene loading on large workloads (pip install seqmat[lmdb]).
A note on indexing
SeqMat defaults to 1-based indices to match genomic conventions (UCSC, GenBank). If you'd rather work in 0-based offsets — e.g. for a comparison with bytearray or Bio.Seq — pass indices=np.arange(len(seq)) at construction.
Contributing
PRs welcome. Run the test suite with pytest tests/. Comparative benchmarks are under benchmarks/.
License
MIT — see LICENSE.
Citation
If SeqMat is useful in your research:
Lynn Vila, N. (2025). SeqMat: a genomic sequence library with
coordinate tracking and mutation history.
https://github.com/nicolasalynn/seqmat
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file seqmat-1.5.1.tar.gz.
File metadata
- Download URL: seqmat-1.5.1.tar.gz
- Upload date:
- Size: 61.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
4eecbfd78212b222e9ee1270f4e11734de8b80f43fd1d01a3f76ed7a075b1080
|
|
| MD5 |
103a97836ff3bca89cf54db07befda1f
|
|
| BLAKE2b-256 |
fcf37c8445897827c6b31c6d33fb1619fb9def6e6b236937bbd8f76baa8eb737
|
File details
Details for the file seqmat-1.5.1-py3-none-any.whl.
File metadata
- Download URL: seqmat-1.5.1-py3-none-any.whl
- Upload date:
- Size: 54.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
1751cf0df7ea3c6493bba77f9e212cdb1bcc9652b50a139c763f0dfc90d6973c
|
|
| MD5 |
d70021e9ae2d5a3c8aee15c1d1c3b252
|
|
| BLAKE2b-256 |
90cc04543315dec11fbb525eb6ba99d98ad632ca0c267091c4ff4c0ca226d2a4
|