seqmat

Fast, vectorized genomic sequences with first-class mutation tracking.

SeqMat

A Python library for genomic sequences that keeps genomic coordinates and mutation history attached to the bases.

SeqMat stores a DNA sequence as a NumPy structured array with parallel columns (nt, ref, index, mut_type, valid) rather than as a byte string. Slicing, mutation, complement, and splicing all preserve genomic coordinates and the reference state, and the mutation history is recorded as a by-product. The library also includes a gene/transcript model loaded from a single SQLite file built from Ensembl annotations, plus a position → gene lookup backed by a sidecar index.
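To make the column layout concrete, here is a minimal NumPy sketch of a structured array with the five fields named above. The field names come from this README; the dtypes, the helper names (`make_table`, `slice_by_position`), and the half-open slicing convention are assumptions made for illustration and are not taken from SeqMat's source.

```python
import numpy as np

# Field names follow the README; the dtypes are assumptions for this sketch.
BASE_DTYPE = np.dtype([
    ("nt", "S1"),          # current base
    ("ref", "S1"),         # reference base at this position
    ("index", np.int64),   # genomic coordinate
    ("mut_type", "S4"),    # empty when the base is unmodified
    ("valid", np.bool_),   # whether the row belongs to the live sequence
])

def make_table(seq: str, start: int = 1) -> np.ndarray:
    """Build one row per base with coordinates start, start + 1, ..."""
    bases = np.frombuffer(seq.encode("ascii"), dtype="S1")
    table = np.zeros(len(bases), dtype=BASE_DTYPE)
    table["nt"] = bases
    table["ref"] = bases
    table["index"] = np.arange(start, start + len(bases))
    table["valid"] = True
    return table

def slice_by_position(table: np.ndarray, lo: int, hi: int) -> np.ndarray:
    """Rows with lo <= index < hi (half-open, a choice made for this sketch)."""
    keep = (table["index"] >= lo) & (table["index"] < hi)
    return table[keep]

table = make_table("ATCGATCGATCG", start=1000)
window = slice_by_position(table, 1003, 1008)
print(b"".join(window["nt"].tolist()).decode())   # GATCG
print(window["index"].tolist())                   # [1003, 1004, 1005, 1006, 1007]
```

Because every row carries its own coordinate, the slice still knows where it came from without any offset arithmetic on the caller's side.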

from seqmat import Gene, SeqMat

# What gene is at chr12:25,245,350?
Gene.from_position("12", 25_245_350)            # [Gene(KRAS)]

# Load it, assemble the mature mRNA, translate.
kras = Gene.from_file("KRAS")
tx   = kras.transcript()
tx.generate_mature_mrna()
tx.generate_protein()

# Apply mutations; history is kept on the SeqMat.
seq = SeqMat("ATCGATCGATCG")
seq.apply_mutations([(3, "C", "G"), (6, "-", "AAA"), (10, "TC", "-")])
seq.mutations
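The (position, ref, alt) tuples above can be read as follows: a "-" in the ref slot is an insertion and a "-" in the alt slot is a deletion. The plain-string sketch below illustrates that reading with 1-based positions. It is not SeqMat's implementation; in particular, placing inserted bases immediately before the given position is an assumption, and the function name and history format are hypothetical.

```python
def apply_mutations(seq, mutations, first_index=1):
    """Apply (pos, ref, alt) edits to a plain string; return (new_seq, history)."""
    history = []
    # Work right to left so that indels never shift a position still to be edited.
    for pos, ref, alt in sorted(mutations, key=lambda m: m[0], reverse=True):
        off = pos - first_index
        if ref == "-":
            # Insertion. ASSUMPTION: new bases go immediately before `pos`.
            seq = seq[:off] + alt + seq[off:]
            kind = "ins"
        else:
            found = seq[off:off + len(ref)]
            if found != ref:
                raise ValueError(f"expected {ref!r} at {pos}, found {found!r}")
            seq = seq[:off] + ("" if alt == "-" else alt) + seq[off + len(ref):]
            kind = "del" if alt == "-" else "sub"
        history.append((pos, ref, alt, kind))
    return seq, history[::-1]   # report history in ascending position order

mutated, history = apply_mutations(
    "ATCGATCGATCG",
    [(3, "C", "G"), (6, "-", "AAA"), (10, "TC", "-")],
)
print(mutated)   # ATGGAAAATCGAG
print(history)
```

A plain string loses the original coordinates as soon as an indel is applied, which is the gap the structured array is meant to close.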

A worked example, end to end: KRAS G12D — coordinate to protein.

Install

pip install seqmat
seqmat setup                                    # one-time: downloads hg38 genes.db + FASTA (~4 GB)

seqmat setup appends SEQMAT_DATA_DIR and SEQMAT_DEFAULT_ORGANISM to your shell rc file so that the data is found automatically in later sessions. For mouse, run seqmat setup --organism mm39.

If you only want SeqMat for in-memory sequence work, you can skip seqmat setup: Gene and Transcript are the only objects that need the gene database.

Quick start

Sequence operations

from seqmat import SeqMat
import numpy as np

seq = SeqMat("ATCGATCGATCG", indices=np.arange(1000, 1012))
seq[1005]                                       # base at genomic position 1005
seq[1003:1008].seq                              # "GATCG"
seq.reverse_complement()                        # in place
seq.remove_regions([(1003, 1005), (1008, 1009)])  # splice out introns

Genes and transcripts

from seqmat import Gene

kras = Gene.from_file("KRAS")
kras                                            # Gene: KRAS, ID: ENSG00000133703, Chr: 12, Transcripts: 14

tx = kras.transcript()                          # primary transcript
tx.generate_mature_mrna()
tx.generate_protein()
tx.protein[:20]                                 # 'MTEYKLVVVGAGGVGKSALT'

acceptors, donors = kras.splice_sites()         # Counter across all transcripts
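As an illustration of what a Counter across all transcripts can look like, the sketch below tallies exon boundaries over toy transcripts with the standard library. The function name, the input format, and the choice of which coordinate counts as the donor or acceptor are assumptions. It also assumes the + strand; on the minus strand the roles of exon start and exon end swap.

```python
from collections import Counter

def count_splice_sites(transcripts):
    """Tally acceptor and donor positions for + strand transcripts.

    Each transcript is a list of (exon_start, exon_end) pairs sorted by start.
    """
    acceptors, donors = Counter(), Counter()
    for exons in transcripts:
        for i, (start, end) in enumerate(exons):
            if i > 0:
                acceptors[start] += 1   # an intron ends where this exon begins
            if i < len(exons) - 1:
                donors[end] += 1        # an intron begins after this exon
    return acceptors, donors

# Toy coordinates, not real annotation.
toy_transcripts = [
    [(100, 200), (300, 400), (500, 600)],
    [(100, 200), (500, 600)],           # skips the middle exon
]
acceptors, donors = count_splice_sites(toy_transcripts)
print(acceptors)   # Counter({500: 2, 300: 1})
print(donors)      # Counter({200: 2, 400: 1})
```

Sites with a count equal to the number of transcripts are used by every isoform; lower counts point at alternative splicing.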

Position → gene (added in 1.4.0)

from seqmat import Gene, gene_names_at_position

Gene.from_position("12", 25_245_350)            # [Gene(KRAS)] — point query
Gene.from_position("chr12", (25_200_000, 25_300_000))  # all overlapping genes in a range
gene_names_at_position("X", 100_000)            # names only (no BLOB load)

The lookup is backed by a per-chromosome sorted NumPy index, persisted as a sidecar file gene_locations.npz next to genes.db. The index is built lazily on the first call; data directories created by a fresh seqmat setup already include it.
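A point lookup over sorted gene starts can be sketched with np.searchsorted as below. This is an illustrative toy for a single chromosome, not the code in seqmat/locator.py: the class name, the use of the longest gene to bound the search window, and the gene names and coordinates are all made up for the example.

```python
import numpy as np

class ToyLocator:
    """Point lookup over gene intervals on one chromosome (illustrative only)."""

    def __init__(self, genes):
        # genes: iterable of (name, start, end), 1-based inclusive
        genes = sorted(genes, key=lambda g: g[1])
        self.names = [g[0] for g in genes]
        self.starts = np.array([g[1] for g in genes], dtype=np.int64)
        self.ends = np.array([g[2] for g in genes], dtype=np.int64)
        # The longest interval bounds how far left an overlapping gene can start.
        self.max_len = int((self.ends - self.starts).max()) if genes else 0

    def names_at(self, pos):
        lo = np.searchsorted(self.starts, pos - self.max_len, side="left")
        hi = np.searchsorted(self.starts, pos, side="right")
        hits = np.nonzero(self.ends[lo:hi] >= pos)[0] + lo
        return [self.names[i] for i in hits]

locator = ToyLocator([
    ("GENE_A", 100, 500),
    ("GENE_B", 450, 900),
    ("GENE_C", 2000, 2100),
])
print(locator.names_at(475))    # ['GENE_A', 'GENE_B']
print(locator.names_at(1500))   # []
```

Two binary searches narrow the candidates to genes that start close enough to the query, and only that small window is checked against the end coordinates.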

Loading from FASTA

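from seqmat import SeqMat

seq = SeqMat.from_fasta_file("chr12.fasta", "chr12", start=25_245_300, end=25_245_400)  # hg38 window around KRAS codon 12
seq.apply_mutations([(25_245_350, "C", "T")])   # G12D (c.35G>A); KRAS is on the minus strand, so forward-strand C>T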
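For readers curious how a coordinate window can be read without loading a whole chromosome, the sketch below uses the same offset arithmetic as a .fai index (bases per line, bytes per line, offset of the first base). It is a standard-library illustration of a range-scoped read and not how from_fasta_file is implemented; the function names and the toy file are hypothetical.

```python
import os
import tempfile

def build_index(path):
    """Scan once; per record store [seq_len, data_offset, bases_per_line, bytes_per_line]."""
    index = {}
    with open(path, "rb") as fh:
        name = None
        while True:
            line = fh.readline()
            if not line:
                break
            if line.startswith(b">"):
                name = line[1:].split()[0].decode()
                index[name] = [0, fh.tell(), 0, 0]
            elif name is not None:
                entry = index[name]
                bases = len(line.rstrip(b"\r\n"))
                if entry[2] == 0:            # the first sequence line sets the layout
                    entry[2], entry[3] = bases, len(line)
                entry[0] += bases
    return index

def fetch(path, index, name, start, end):
    """Return bases start..end (1-based, inclusive) by seeking to the byte range."""
    seq_len, offset, line_bases, line_bytes = index[name]
    lo, hi = start - 1, min(end, seq_len)    # convert to 0-based half-open
    first = offset + (lo // line_bases) * line_bytes + lo % line_bases
    last = offset + (hi // line_bases) * line_bytes + hi % line_bases
    with open(path, "rb") as fh:
        fh.seek(first)
        chunk = fh.read(last - first)
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()

with tempfile.TemporaryDirectory() as tmp:
    fasta = os.path.join(tmp, "toy.fasta")
    with open(fasta, "wb") as fh:
        fh.write(b">chr1 toy record\nACGT\nTTGG\nCA\n")
    idx = build_index(fasta)
    window = fetch(fasta, idx, "chr1", 3, 7)

print(window)   # GTTTG
```

Only the bytes covering the requested window are read, so the cost depends on the window size rather than on the chromosome length.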

Performance

Measured on an M-series Mac, hg38, one core, warm caches. Reproducible via the scripts in benchmarks/.

Headline numbers

Operation	Time
`gene_names_at_position(chrm, pos)`	2.5 µs
KRAS mature mRNA assembly (splice + translate)	0.2 ms
1,000-SNP batch on 45 kb sequence (with history)	0.5 ms
`Gene.from_file("KRAS")` (SQLite + unpickle)	24 ms
`Gene.from_position(chrm, pos)` end-to-end	24 ms

Position → gene

All implementations are run against the same 63,241 hg38 gene intervals and the same 10,000 random point queries. Different libraries are designed for different access patterns, so we report two workloads. Reproduce with python benchmarks/bench_position_lookup.py.

Per-query (one coordinate, one answer, in a loop — the Gene.from_position pattern):

Implementation	Per query
SeqMat (locator)	2.5 µs
Python `dict` + `bisect`	21 µs
pandas (`groupby` chrm + boolean mask)	79 µs
PyRanges constructed per call	2.0 ms

Batched (the whole query set in one call — PyRanges' native idiom):

Implementation	Per query
PyRanges `.join`	2.07 µs
SeqMat (serial loop)	2.59 µs

PyRanges and SeqMat land within about 25% of each other (2.07 µs vs 2.59 µs per query) when each is used the way it was designed to be used. Constructing a PyRanges object per call is the slow path and not what the PyRanges authors intend; it is included here only because it is a common mistake.
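For orientation, a per-query baseline in the spirit of the "dict + bisect" row might look like the sketch below, together with a small timing loop. This is not the code in benchmarks/bench_position_lookup.py; the function names, toy genes, and query generator are assumptions, and the timing it prints will differ from the table above.

```python
import bisect
import random
import timeit
from collections import defaultdict

def build_index(genes):
    """genes: iterable of (chrom, start, end, name). Returns {chrom: (starts, records)}."""
    by_chrom = defaultdict(list)
    for chrom, start, end, name in genes:
        by_chrom[chrom].append((start, end, name))
    index = {}
    for chrom, records in by_chrom.items():
        records.sort()
        index[chrom] = ([r[0] for r in records], records)
    return index

def names_at(index, chrom, pos):
    starts, records = index.get(chrom, ([], []))
    hi = bisect.bisect_right(starts, pos)    # genes starting at or before pos
    # Linear scan over the candidates: simple, but O(n) in the worst case.
    return [name for _, end, name in records[:hi] if end >= pos]

toy_genes = [("12", 100, 500, "GENE_A"), ("12", 450, 900, "GENE_B"), ("X", 50, 80, "GENE_C")]
index = build_index(toy_genes)
print(names_at(index, "12", 475))   # ['GENE_A', 'GENE_B']

# Per-query workload: one coordinate per call, in a loop.
rng = random.Random(0)
queries = [("12", rng.randint(1, 1000)) for _ in range(10_000)]
elapsed = timeit.timeit(lambda: [names_at(index, c, p) for c, p in queries], number=5)
print(f"{elapsed / (5 * len(queries)) * 1e6:.2f} µs per query")
```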

Sequence operations vs Biopython

50 kb sequence, 1,000 SNPs / 10 introns; 1 Mb reverse-complement. Reproduce with python benchmarks/bench_sequence_ops.py.

Workload	Biopython / native	SeqMat
1,000 SNPs (no history)	`MutableSeq`: 0.21 ms	—
1,000 SNPs with full mutation history	—	0.5 ms
10-intron splice (50 kb)	`str.join`: 0.002 ms	0.8 ms
Reverse-complement (1 Mb)	`Seq.reverse_complement`: 0.57 ms	9.4 ms

For raw byte-string throughput, Bio.Seq and bytearray are faster. SeqMat's structured array adds bookkeeping that those types don't carry — genomic coordinates that survive indels, the reference held alongside the current sequence, and a mutation history list. The trade-off makes sense when you need that bookkeeping anyway (the KRAS G12D walkthrough is a typical case) and doesn't when you don't.
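The bookkeeping described above can be illustrated with a small NumPy sketch: a substitution overwrites the current base while the reference column stays put, and a deletion drops rows so that the surviving rows keep their original coordinates. The three-field layout and the history tuples are simplifications assumed for this example, not SeqMat's internal format.

```python
import numpy as np

dtype = np.dtype([("nt", "S1"), ("ref", "S1"), ("index", np.int64)])
bases = np.frombuffer(b"ATCGATCGATCG", dtype="S1")
table = np.zeros(len(bases), dtype=dtype)
table["nt"] = bases
table["ref"] = bases
table["index"] = np.arange(1000, 1012)

history = []

# Substitution at 1002: overwrite the current base, leave the reference alone.
row = np.searchsorted(table["index"], 1002)
table["nt"][row] = b"G"
history.append((1002, "C", "G"))

# Deletion of 1005..1006: drop the rows; the others keep their coordinates.
keep = ~np.isin(table["index"], [1005, 1006])
table = table[keep]
history.append((1005, "TC", "-"))

print(b"".join(table["nt"].tolist()).decode())   # ATGGAGATCG
print(table["index"].tolist())   # [1000, 1001, 1002, 1003, 1004, 1007, 1008, 1009, 1010, 1011]
print(table["ref"][2])           # b'C'
```

With a bytearray, the bases after the deletion would silently shift two positions to the left; here position 1007 is still position 1007.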

Implementation notes

SeqMat is built on NumPy structured arrays, a lookup-table (LUT) complement, np.searchsorted over sorted exon and gene-start coordinates, range-scoped FASTA reads, and a copy-on-write clone(). See seqmat/seqmat.py and seqmat/locator.py.
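As an example of the LUT technique, a complement can be expressed as a 256-entry table indexed by byte value, so the whole sequence is complemented in one vectorized indexing step. The sketch below is a generic NumPy version written for illustration; the table contents (ACGT in both cases, everything else passed through) and the function name are assumptions, not SeqMat's code.

```python
import numpy as np

# 256-entry table: every byte maps to itself except the bases we complement.
COMPLEMENT = np.arange(256, dtype=np.uint8)
for base, partner in zip(b"ACGTacgt", b"TGCAtgca"):
    COMPLEMENT[base] = partner

def reverse_complement(seq: str) -> str:
    codes = np.frombuffer(seq.encode("ascii"), dtype=np.uint8)
    # Fancy indexing applies the table to every base at once; [::-1] reverses.
    return COMPLEMENT[codes][::-1].tobytes().decode("ascii")

print(reverse_complement("AAGC"))    # GCTT
print(reverse_complement("ATCGN"))   # NCGAT
```

Characters outside the table, such as N, pass through unchanged because the table starts out as the identity mapping.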

Command-line interface

seqmat setup [--organism hg38|mm39] [--path PATH] [--build-from-sources]
seqmat summary
seqmat info --organism hg38
seqmat search --organism hg38 --query KRAS
seqmat list   --organism hg38 --biotype protein_coding --limit 20
seqmat count  --organism hg38

Data setup

By default seqmat setup downloads prebuilt genes.db and FASTA from the SeqMat S3 bucket. To regenerate genes.db from a specific Ensembl release or a custom GTF:

seqmat setup --organism hg38 --build-from-sources

Custom organisms, mirror buckets, ephemeral environments (Docker / Run.ai), shared multi-user installs, and the full configuration system are documented in docs/SETUP.md.

API at a glance

from seqmat import (
    SeqMat,                           # coordinate-tracked sequence with mutation history
    Gene, Transcript,                 # gene/transcript model
    gene_names_at_position,           # name-only positional lookup
    build_location_index,             # force-rebuild the position index
    setup_genomics_data,              # programmatic setup
    search_genes, available_genes,    # discovery helpers
)

Key classes:

SeqMat — apply_mutations, clone, complement, reverse_complement, remove_regions, from_fasta_file
Gene — from_file, from_position, get, transcript, splice_sites, primary_transcript
Transcript — generate_pre_mrna, generate_mature_mrna, generate_protein, exons, introns

Full API reference (auto-generated from docstrings): nicolasalynn.github.io/seqmat.

Requirements

Python ≥ 3.10. Core deps: numpy, pandas, pyarrow, pysam, requests, tqdm, platformdirs. Optional: lmdb for faster gene loading on large workloads (pip install seqmat[lmdb]).

A note on indexing

SeqMat defaults to 1-based indices to match genomic conventions (UCSC, GenBank). If you'd rather work in 0-based offsets — e.g. for a comparison with bytearray or Bio.Seq — pass indices=np.arange(len(seq)) at construction.
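When comparing against plain strings or bytearray, the conversion between an inclusive coordinate range and a Python slice is the usual source of off-by-one errors. The helper below is a hypothetical utility written for this note, not part of the SeqMat API.

```python
def to_slice(start: int, end: int, first_index: int = 1) -> slice:
    """Convert an inclusive coordinate range into a slice over a string
    whose first base has coordinate `first_index`."""
    if start < first_index or end < start:
        raise ValueError("invalid range")
    return slice(start - first_index, end - first_index + 1)

seq = "ATCGATCGATCG"
print(seq[to_slice(3, 5)])                              # CGA
print(seq[to_slice(1003, 1007, first_index=1000)])      # GATCG
```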

Contributing

PRs welcome. Run the test suite with pytest tests/. Comparative benchmarks are under benchmarks/.

License

MIT — see LICENSE.

Citation

If SeqMat is useful in your research:

Lynn Vila, N. (2025). SeqMat: a genomic sequence library with
coordinate tracking and mutation history.
https://github.com/nicolasalynn/seqmat

Project details

Current release: 1.5.1 (May 15, 2026).
