Fast, vectorized genomic sequences with first-class mutation tracking.

SeqMat

A Python library for genomic sequences that keeps genomic coordinates and mutation history attached to the bases.

SeqMat stores a DNA sequence as a NumPy structured array with parallel columns (nt, ref, index, mut_type, valid) rather than as a byte string. Slicing, mutation, complement, and splicing preserve the genomic coordinates and the reference state through every transform, and the mutation history is recorded as a by-product. The library also includes a gene/transcript model loaded from a single SQLite file built from Ensembl annotations, and a position → gene lookup backed by a sidecar index.
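To make the column layout concrete, here is a minimal sketch of a structured array with the five fields named above. The dtypes, the helper names make_bases and substitute, and the "snp" label are assumptions chosen for illustration; SeqMat's actual field definitions may differ.

```python
import numpy as np

# Hypothetical layout mirroring the five named fields; dtypes are assumptions.
BASE_DTYPE = np.dtype([
    ("nt", "S1"),          # current base
    ("ref", "S1"),         # reference base
    ("index", np.int64),   # genomic coordinate
    ("mut_type", "S3"),    # empty, or a short label such as b"snp"
    ("valid", np.bool_),   # whether the record holds a real base
])


def make_bases(seq: str, start: int) -> np.ndarray:
    """Build one record per base with consecutive genomic coordinates."""
    bases = np.zeros(len(seq), dtype=BASE_DTYPE)
    letters = np.frombuffer(seq.encode("ascii"), dtype="S1")
    bases["nt"] = letters
    bases["ref"] = letters
    bases["index"] = np.arange(start, start + len(seq))
    bases["valid"] = True
    return bases


def substitute(bases: np.ndarray, pos: int, alt: str) -> None:
    """Change the base at genomic position `pos`; `ref` is left untouched."""
    row = np.searchsorted(bases["index"], pos)
    bases["nt"][row] = alt.encode("ascii")
    bases["mut_type"][row] = b"snp"


bases = make_bases("ATCGATCG", start=1000)
substitute(bases, 1003, "T")

window = bases[2:6]                            # a slice keeps all columns aligned
print(window["index"].tolist())                # [1002, 1003, 1004, 1005]
print(window["nt"].tobytes().decode("ascii"))  # CTAT
print(window["ref"].tobytes().decode("ascii")) # CGAT
```

Because every column is sliced together, the window still knows both its genomic coordinates and the reference bases after the substitution.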

from seqmat import Gene, SeqMat

# What gene is at chr12:25,245,350?
Gene.from_position("12", 25_245_350)            # [Gene(KRAS)]

# Load it, assemble the mature mRNA, translate.
kras = Gene.from_file("KRAS")
tx   = kras.transcript()
tx.generate_mature_mrna()
tx.generate_protein()

# Apply mutations; history is kept on the SeqMat.
seq = SeqMat("ATCGATCGATCG")
seq.apply_mutations([(3, "C", "G"), (6, "-", "AAA"), (10, "TC", "-")])
seq.mutations

A worked example, end to end: KRAS G12D — coordinate to protein.


Install

pip install seqmat
seqmat setup                                    # one-time: downloads hg38 genes.db + FASTA (~4 GB)

seqmat setup appends SEQMAT_DATA_DIR and SEQMAT_DEFAULT_ORGANISM to your shell rc so the data is found automatically in the next session. For mouse, run seqmat setup --organism mm39.

If you only want SeqMat for in-memory sequence work, you can skip seqmat setup: Gene and Transcript are the only objects that need the gene database.

Quick start

Sequence operations

from seqmat import SeqMat
import numpy as np

seq = SeqMat("ATCGATCGATCG", indices=np.arange(1000, 1012))
seq[1005]                                       # base at genomic position 1005
seq[1003:1008].seq                              # "GATCG"
seq.reverse_complement()                        # in place
seq.remove_regions([(1003, 1005), (1008, 1009)])  # splice out introns
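As a rough picture of why coordinates survive a splice, the sketch below removes regions from plain NumPy arrays with a boolean mask. The function name splice_out and the choice of inclusive (start, end) bounds are assumptions for this illustration, not a description of how remove_regions is implemented.

```python
import numpy as np


def splice_out(index, nt, regions):
    """Drop every base whose coordinate lies in an inclusive (start, end) region."""
    keep = np.ones(len(index), dtype=bool)
    for start, end in regions:
        keep &= ~((index >= start) & (index <= end))
    # The same mask is applied to both arrays, so bases and coordinates stay paired.
    return index[keep], nt[keep]


index = np.arange(1000, 1012)
nt = np.frombuffer(b"ATCGATCGATCG", dtype="S1")

new_index, new_nt = splice_out(index, nt, [(1003, 1005), (1008, 1009)])
print(new_index.tolist())                 # [1000, 1001, 1002, 1006, 1007, 1010, 1011]
print(new_nt.tobytes().decode("ascii"))   # ATCCGCG
```

The surviving bases keep their original genomic positions, so the gaps in new_index show exactly where sequence was removed.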

Genes and transcripts

from seqmat import Gene

kras = Gene.from_file("KRAS")
kras                                            # Gene: KRAS, ID: ENSG00000133703, Chr: 12, Transcripts: 14

tx = kras.transcript()                          # primary transcript
tx.generate_mature_mrna()
tx.generate_protein()
tx.protein[:20]                                 # 'MTEYKLVVVGAGGVGKSALT'

acceptors, donors = kras.splice_sites()         # Counter across all transcripts
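For readers who want to see what the translation step amounts to, here is a standalone sketch using the standard genetic code. It does not call SeqMat; the translate helper is hypothetical, and the 60-nucleotide string is typed in by hand to encode the 20 residues shown above rather than being read from genes.db.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence codon by codon, stopping at the first stop."""
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Hand-typed illustrative coding sequence (20 codons).
cds_start = "".join(
    "ATG ACT GAA TAT AAA CTT GTG GTA GTT GGA "
    "GCT GGT GGC GTA GGC AAG AGT GCC TTG ACG".split()
)
print(translate(cds_start))            # MTEYKLVVVGAGGVGKSALT

# c.35G>A: the 35th coding base (0-based offset 34) is the middle base of codon 12.
mutant = cds_start[:34] + "A" + cds_start[35:]
print(translate(mutant)[11])           # D
```

Changing the middle base of codon 12 from G to A turns GGT (Gly) into GAT (Asp), which is the G12D substitution at the protein level.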

Position → gene (added in 1.4.0)

from seqmat import Gene, gene_names_at_position

Gene.from_position("12", 25_245_350)            # [Gene(KRAS)] — point query
Gene.from_position("chr12", (25_200_000, 25_300_000))  # all overlapping genes in a range
gene_names_at_position("X", 100_000)            # names only (no BLOB load)

The lookup is backed by a per-chromosome sorted NumPy index, persisted as a sidecar file gene_locations.npz next to genes.db. The index is built lazily on the first call; a fresh seqmat setup also writes it.
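The idea behind a sorted positional index can be sketched in a few lines with np.searchsorted. Everything here is illustrative: the gene names and coordinates are made up, build_index and names_at are hypothetical helpers, ends are treated as inclusive, and the real locator may prune candidates differently.

```python
import numpy as np

# Toy gene table for one chromosome: (name, start, end), made-up coordinates.
genes = [("GENE_B", 400, 900), ("GENE_A", 100, 500), ("GENE_C", 2_000, 2_600)]


def build_index(genes):
    """Sort genes by start and store names, starts, and ends as parallel arrays."""
    ordered = sorted(genes, key=lambda g: g[1])
    names = np.array([g[0] for g in ordered])
    starts = np.array([g[1] for g in ordered], dtype=np.int64)
    ends = np.array([g[2] for g in ordered], dtype=np.int64)
    return names, starts, ends


def names_at(index, pos):
    """Return the names of all genes whose [start, end] interval contains pos."""
    names, starts, ends = index
    hi = np.searchsorted(starts, pos, side="right")  # genes with start <= pos
    hit = ends[:hi] >= pos                           # of those, keep end >= pos
    return names[:hi][hit].tolist()


index = build_index(genes)
print(names_at(index, 450))     # ['GENE_A', 'GENE_B']
print(names_at(index, 1_000))   # []
print(names_at(index, 2_000))   # ['GENE_C']
```

The binary search discards every gene that starts after the query position; only the remaining candidates need their end coordinate checked.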

Loading from FASTA

seq = SeqMat.from_fasta_file("chr12.fasta", "chr12", start=25398284, end=25398384)
seq.apply_mutations([(25398290, "G", "A")])     # G12D, a common KRAS variant

Performance

Measured on an M-series Mac, hg38, one core, warm caches. Reproducible via the scripts in benchmarks/.

Headline numbers

Operation | Time
gene_names_at_position(chrm, pos) | 2.5 µs
KRAS mature mRNA assembly (splice + translate) | 0.2 ms
1,000-SNP batch on 45 kb sequence (with history) | 0.5 ms
Gene.from_file("KRAS") (SQLite + unpickle) | 24 ms
Gene.from_position(chrm, pos) end-to-end | 24 ms

Position → gene

Every implementation runs the same 10,000 random point queries against the same 63,241 hg38 gene intervals. Different libraries are designed for different access patterns, so we report two workloads. Reproduce with python benchmarks/bench_position_lookup.py.

Per-query (one coordinate, one answer, in a loop — the Gene.from_position pattern):

Implementation | Per query
SeqMat (locator) | 2.5 µs
Python dict + bisect | 21 µs
pandas (groupby chrm + boolean mask) | 79 µs
PyRanges constructed per call | 2.0 ms

Batched (the whole query set in one call — PyRanges' native idiom):

Implementation | Per query
PyRanges .join | 2.07 µs
SeqMat (serial loop) | 2.59 µs

PyRanges and SeqMat are in roughly the same range when each is used the way it was designed to be used. Constructing a PyRanges per call is the slow path and not what the PyRanges authors intend; it is included here only because it is a common mistake.

Sequence operations vs Biopython

50 kb sequence, 1,000 SNPs / 10 introns; 1 Mb reverse-complement. Reproduce with python benchmarks/bench_sequence_ops.py.

Workload | Biopython / native | SeqMat
1,000 SNPs (no history) | MutableSeq: 0.21 ms |
1,000 SNPs with full mutation history | | 0.5 ms
10-intron splice (50 kb) | str.join: 0.002 ms | 0.8 ms
Reverse-complement (1 Mb) | Seq.reverse_complement: 0.57 ms | 9.4 ms

For raw byte-string throughput, Bio.Seq and bytearray are faster. SeqMat's structured array adds bookkeeping that those types don't carry — genomic coordinates that survive indels, the reference held alongside the current sequence, and a mutation history list. The trade-off makes sense when you need that bookkeeping anyway (the KRAS G12D walkthrough is a typical case) and doesn't when you don't.

Implementation notes

NumPy structured arrays, LUT-based complement, np.searchsorted on sorted exon and gene-start coordinates, FASTA range-scoped reads, copy-on-write clone(). See seqmat/seqmat.py and seqmat/locator.py.
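As an illustration of the lookup-table idea, the sketch below complements a sequence by indexing a 256-entry uint8 table with the ASCII codes of the bases. It is a generic version of the technique and makes no claim about the exact table or function names used in seqmat/seqmat.py.

```python
import numpy as np

# 256-entry table: identity by default, complement for DNA letters.
LUT = np.arange(256, dtype=np.uint8)
for base, comp in zip(b"ACGTNacgtn", b"TGCANtgcan"):
    LUT[base] = comp


def reverse_complement(seq: str) -> str:
    """Complement via one vectorized table lookup, then reverse."""
    codes = np.frombuffer(seq.encode("ascii"), dtype=np.uint8)
    return LUT[codes][::-1].tobytes().decode("ascii")


print(reverse_complement("AAGC"))           # GCTT
print(reverse_complement("ATCGATCGATCG"))   # CGATCGATCGAT
```

A single fancy-indexing operation replaces a per-base Python loop, which is why the table approach scales to long sequences.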

Command-line interface

seqmat setup [--organism hg38|mm39] [--path PATH] [--build-from-sources]
seqmat summary
seqmat info --organism hg38
seqmat search --organism hg38 --query KRAS
seqmat list   --organism hg38 --biotype protein_coding --limit 20
seqmat count  --organism hg38

Data setup

By default seqmat setup downloads prebuilt genes.db and FASTA from the SeqMat S3 bucket. To regenerate genes.db from a specific Ensembl release or a custom GTF:

seqmat setup --organism hg38 --build-from-sources

Custom organisms, mirror buckets, ephemeral environments (Docker / Run.ai), shared multi-user installs, and the full configuration system are documented in docs/SETUP.md.

API at a glance

from seqmat import (
    SeqMat,                           # coordinate-tracked sequence with mutation history
    Gene, Transcript,                 # gene/transcript model
    gene_names_at_position,           # name-only positional lookup
    build_location_index,             # force-rebuild the position index
    setup_genomics_data,              # programmatic setup
    search_genes, available_genes,    # discovery helpers
)

Key classes:

  • SeqMat: apply_mutations, clone, complement, reverse_complement, remove_regions, from_fasta_file
  • Gene: from_file, from_position, get, transcript, splice_sites, primary_transcript
  • Transcript: generate_pre_mrna, generate_mature_mrna, generate_protein, exons, introns

Full API reference (auto-generated from docstrings): nicolasalynn.github.io/seqmat.

Requirements

Python ≥ 3.10. Core deps: numpy, pandas, pyarrow, pysam, requests, tqdm, platformdirs. Optional: lmdb for faster gene loading on large workloads (pip install seqmat[lmdb]).

A note on indexing

SeqMat defaults to 1-based indices to match genomic conventions (UCSC, GenBank). If you'd rather work in 0-based offsets — e.g. for a comparison with bytearray or Bio.Seq — pass indices=np.arange(len(seq)) at construction.
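When comparing against plain Python strings, the off-by-one arithmetic is easy to get wrong. The helper below is a hypothetical convenience, not part of SeqMat, that converts a 1-based inclusive interval into the equivalent 0-based half-open slice.

```python
def to_slice(start_1based: int, end_1based: int) -> slice:
    """Convert a 1-based inclusive interval to a 0-based half-open slice."""
    return slice(start_1based - 1, end_1based)


seq = "ATCGATCGATCG"
print(seq[to_slice(4, 8)])   # GATCG (bases 4 through 8, counting from 1)
print(seq[to_slice(1, 1)])   # A
```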

Contributing

PRs welcome. Run the test suite with pytest tests/. Comparative benchmarks are under benchmarks/.

License

MIT — see LICENSE.

Citation

If SeqMat is useful in your research:

Lynn Vila, N. (2025). SeqMat: a genomic sequence library with
coordinate tracking and mutation history.
https://github.com/nicolasalynn/seqmat
