MHC sequence curation and binding groove extraction

mhcseqs

Self-contained pipeline for downloading, curating, and extracting binding grooves from MHC (Major Histocompatibility Complex) protein sequences.

Install

pip install mhcseqs

For development:

git clone https://github.com/openvax/mhcseqs.git
cd mhcseqs
./develop.sh          # uv pip install -e ".[dev]"
./test.sh             # pytest
./lint.sh             # ruff

Quick start

CLI

# Build all output CSVs (writes to ~/.cache/mhcseqs/)
mhcseqs build

# Build to a specific directory instead
mhcseqs build --output-dir output/

# Look up a specific allele
mhcseqs lookup "HLA-A*02:01"

# Check version
mhcseqs --version

Python API

import mhcseqs

# Build the database (downloads to ~/.cache/mhcseqs/, only needed once)
paths = mhcseqs.build()  # BuildPaths dataclass

# Look up any allele → AlleleRecord with everything
r = mhcseqs.lookup("HLA-A*02:01")
r.sequence          # full protein (with signal peptide)
r.mature_sequence   # signal peptide removed (computed property)
r.mature_start      # signal peptide length (24 for HLA-A*02:01)
r.groove1           # α1 domain
r.groove2           # α2 domain
r.ig_domain         # α3 Ig-fold
r.tail              # TM + cytoplasmic
r.species_category  # "human"

# Apply mutations (IEDB-style, e.g. "K66A")
m = mhcseqs.lookup("HLA-A*02:01", mutations=["K66A", "D77S"])
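
The mutation strings above read as original residue, position, replacement residue. As a rough illustration of that notation, the sketch below applies one substitution to a plain string. The helper `apply_mutation` is hypothetical and not part of mhcseqs, and the 1-based numbering is an assumption; the package may number positions differently (for example, relative to the mature protein).

```python
import re

# Hypothetical helper for illustration; not the mhcseqs implementation.
_MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def apply_mutation(sequence: str, mutation: str) -> str:
    """Apply one substitution such as "K66A", assuming 1-based positions."""
    match = _MUTATION.match(mutation)
    if match is None:
        raise ValueError(f"unrecognized mutation string: {mutation!r}")
    original = match.group(1)
    position = int(match.group(2))
    replacement = match.group(3)
    index = position - 1  # convert to a 0-based string index
    if not 0 <= index < len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[index] != original:
        raise ValueError(
            f"expected {original} at position {position}, found {sequence[index]}"
        )
    return sequence[:index] + replacement + sequence[index + 1:]


# Toy sequence, not a real MHC protein.
toy = "MKTAYK"
print(apply_mutation(toy, "K2A"))  # MATAYK
```

Checking the original residue before substituting catches off-by-one numbering mistakes early, which is the most common failure when mutation strings and sequences come from different sources.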

Load as a DataFrame

import mhcseqs

# As a DataFrame (full sequence + groove decomposition + metadata)
df = mhcseqs.load_sequences_dataframe()

# Or as a list of dicts (no pandas dependency)
rows = mhcseqs.load_sequences_dict()
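
Once the rows are loaded as dictionaries, ordinary Python is enough to filter and tally them. The sketch below uses three hand-written rows in place of real output; the column names come from the "Key columns" section, while the `mhc_class` values `"I"` and `"II"` and the helper `count_by` are assumptions made for illustration.

```python
from collections import Counter

# Illustrative rows only; real rows would come from mhcseqs.load_sequences_dict().
# The "mhc_class" values ("I", "II") are an assumption about the encoding.
rows = [
    {"gene": "A", "mhc_class": "I", "species_category": "human"},
    {"gene": "DRB1", "mhc_class": "II", "species_category": "human"},
    {"gene": "K", "mhc_class": "I", "species_category": "murine"},
]


def count_by(rows, *keys):
    """Count rows grouped by the values of the given columns."""
    return Counter(tuple(row[key] for key in keys) for row in rows)


counts = count_by(rows, "species_category", "mhc_class")
human_rows = [row for row in rows if row["species_category"] == "human"]

for group, n in sorted(counts.items()):
    print(group, n)
```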

Data directory

By default, mhcseqs build downloads FASTA files and writes output CSVs to ~/.cache/mhcseqs/ (override with $MHCSEQS_DATA or --output-dir).

~/.cache/mhcseqs/
├── fasta/                     # Downloaded FASTA source files
│   ├── hla_prot.fasta         # IMGT/HLA (human)
│   └── ipd_mhc_prot.fasta    # IPD-MHC (non-human)
├── mhc-seqs-raw.csv           # Every protein entry from all sources
├── mhc-full-seqs.csv          # One representative per two-field allele (with grooves)
├── mhc-merge-report.txt       # Deduplication decisions
└── mhc-validation-report.txt  # Sanity checks

Pre-built CSVs are also attached to each GitHub release.
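
The location rules above can be sketched as a small resolver. The function `resolve_data_dir` is hypothetical, and the precedence shown (explicit directory first, then `$MHCSEQS_DATA`, then the default cache) is an assumption; the text only says that both overrides exist.

```python
import os
from pathlib import Path


def resolve_data_dir(output_dir=None, environ=None) -> Path:
    """Pick a data directory; the precedence order is assumed, not documented."""
    environ = os.environ if environ is None else environ
    if output_dir:
        # Explicit argument, as with --output-dir on the command line.
        return Path(output_dir).expanduser()
    env_value = environ.get("MHCSEQS_DATA")
    if env_value:
        return Path(env_value).expanduser()
    # Default cache location described above.
    return Path.home() / ".cache" / "mhcseqs"


print(resolve_data_dir())
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.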

Output files

File	Description
`mhc-seqs-raw.csv`	Every protein entry from all sources
`mhc-full-seqs.csv`	One representative per two-field allele: full sequence, groove decomposition, and metadata
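
To make "one representative per two-field allele" concrete, the sketch below truncates allele names to their first two fields and keeps one entry per truncated name. Both helpers are hypothetical: the package parses names with mhcgnomes rather than string splitting, and the rule used here (keep the first entry seen) is a placeholder for whatever selection the merge step actually applies.

```python
def two_field_name(allele: str) -> str:
    """Truncate "HLA-A*02:01:01:01" to "HLA-A*02:01" by plain string splitting.

    Expression suffixes and non-standard names are not handled here.
    """
    gene, sep, fields = allele.partition("*")
    if not sep:
        return allele  # no "*" separator; leave the name unchanged
    return gene + "*" + ":".join(fields.split(":")[:2])


def pick_representatives(entries):
    """Keep the first sequence seen for each two-field name (assumed rule)."""
    representatives = {}
    for name, sequence in entries:
        representatives.setdefault(two_field_name(name), sequence)
    return representatives


# Toy sequences, not real proteins.
entries = [
    ("HLA-A*02:01:01:01", "AAA"),
    ("HLA-A*02:01:02", "AAB"),
    ("HLA-B*07:02:01", "CCC"),
]
print(pick_representatives(entries))
```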

Current data summary

All sources (IMGT/HLA, IPD-MHC, UniProt curated references, and 15,860 diverse MHC sequences from UniProt) are merged into a single dataset:

Category	Class I	Class II	Total
human	17,462	7,878	25,340
nhp	4,639	2,486	7,125
murine	59	29	88
ungulate	638	1,128	1,766
carnivore	166	318	484
other_mammal	469	386	855
bird	5,961	3,351	9,312
fish	1,292	3,569	4,861
other_vertebrate	470	666	1,136
total	31,156	19,811	50,967

The merged dataset covers 466+ species prefixes. The groove parse success rate on IMGT/IPD-MHC entries is 99.6%.
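
As a quick consistency check, the per-category counts in the table can be summed and compared with the totals row. The numbers below are copied from the table; nothing is read from the package.

```python
# (class I, class II) counts per category, copied from the summary table.
counts = {
    "human": (17_462, 7_878),
    "nhp": (4_639, 2_486),
    "murine": (59, 29),
    "ungulate": (638, 1_128),
    "carnivore": (166, 318),
    "other_mammal": (469, 386),
    "bird": (5_961, 3_351),
    "fish": (1_292, 3_569),
    "other_vertebrate": (470, 666),
}

class_i_total = sum(class_i for class_i, _ in counts.values())
class_ii_total = sum(class_ii for _, class_ii in counts.values())

print(class_i_total, class_ii_total, class_i_total + class_ii_total)
# 31156 19811 50967
```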

Structural decomposition

Each protein chain is decomposed into up to four contiguous regions (a dash marks a region that is absent from that chain type):

Column	Class I alpha	Class II alpha	Class II beta
`groove1`	α1 domain (~90 aa)	α1 domain (~83 aa)	—
`groove2`	α2 domain (~93 aa)	—	β1 domain (~93 aa)
`ig_domain`	α3 Ig-fold (~95 aa)	α2 Ig-fold (~95 aa)	β2 Ig-fold (~95 aa)
`tail`	TM + cytoplasmic	TM + cytoplasmic	TM + cytoplasmic

For a class I chain: mature_protein = groove1 + groove2 + ig_domain + tail
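
The class I identity above can be illustrated by slicing a string at three boundaries and checking that the pieces concatenate back to the input. The function `split_mature` and the boundary values are hypothetical; in the package the boundaries come from the groove parser, not from fixed offsets.

```python
from typing import NamedTuple


class Regions(NamedTuple):
    groove1: str
    groove2: str
    ig_domain: str
    tail: str


def split_mature(mature: str, groove1_end: int, groove2_end: int, ig_end: int) -> Regions:
    """Slice a mature class I sequence at three 0-based end offsets."""
    if not 0 <= groove1_end <= groove2_end <= ig_end <= len(mature):
        raise ValueError("boundaries must be ordered and within the sequence")
    return Regions(
        groove1=mature[:groove1_end],
        groove2=mature[groove1_end:groove2_end],
        ig_domain=mature[groove2_end:ig_end],
        tail=mature[ig_end:],
    )


# Toy sequence using the approximate domain lengths from the table (90, 93, 95)
# plus an arbitrary 30-residue tail.
toy = "A" * 90 + "B" * 93 + "C" * 95 + "D" * 30
regions = split_mature(toy, 90, 183, 278)

# Contiguous, non-overlapping slices always rebuild the input.
assert "".join(regions) == toy
print([len(part) for part in regions])  # [90, 93, 95, 30]
```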

Key columns

Both CSVs share: gene, mhc_class, chain, species, species_category, species_prefix, source, source_id.

source is one of: imgt, ipd_mhc, uniprot_curated, uniprot_reference, uniprot_diverse.

source_id is the database accession for provenance tracking (e.g., HLA00001 for IMGT, NHP00001 for IPD-MHC, P01901 for UniProt).

species_category is one of: human, nhp, murine, ungulate, carnivore, cetacean, other_mammal, bird, fish, other_vertebrate.
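
The documented vocabularies make it straightforward to sanity-check rows read from either CSV. The sketch below is a hypothetical validator, not the package's `validate.py`; the example row's `mhc_class` and `chain` values are assumptions, since their encodings are not listed here.

```python
SHARED_COLUMNS = (
    "gene", "mhc_class", "chain", "species",
    "species_category", "species_prefix", "source", "source_id",
)
SOURCES = {
    "imgt", "ipd_mhc", "uniprot_curated", "uniprot_reference", "uniprot_diverse",
}
SPECIES_CATEGORIES = {
    "human", "nhp", "murine", "ungulate", "carnivore", "cetacean",
    "other_mammal", "bird", "fish", "other_vertebrate",
}


def check_row(row: dict) -> list[str]:
    """Return a list of problems found in one row (empty if none)."""
    problems = [f"missing column: {col}" for col in SHARED_COLUMNS if col not in row]
    if "source" in row and row["source"] not in SOURCES:
        problems.append(f"unexpected source: {row['source']}")
    if "species_category" in row and row["species_category"] not in SPECIES_CATEGORIES:
        problems.append(f"unexpected species_category: {row['species_category']}")
    return problems


# Example row; "mhc_class" and "chain" values are assumed encodings.
row = {
    "gene": "A", "mhc_class": "I", "chain": "alpha", "species": "Homo sapiens",
    "species_category": "human", "species_prefix": "HLA",
    "source": "imgt", "source_id": "HLA00001",
}
print(check_row(row))  # []
```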

Data sources

Source	`source` value	Species	Data
IMGT/HLA	`imgt`	Human	Downloaded at build time
IPD-MHC	`ipd_mhc`	Non-human	Downloaded at build time
UniProt	`uniprot_reference`	Multi-species	B2M references (shipped in package)
UniProt	`uniprot_curated`	Mouse	30 H-2 alleles (shipped in package)
UniProt	`uniprot_diverse`	614 species	16,208 diverse MHC sequences (shipped in package)

IMGT/HLA and IPD-MHC FASTA files are downloaded on first build and cached. The UniProt-derived CSVs (b2m_sequences.csv, mouse_h2_sequences.csv, diverse_mhc_sequences.csv) ship inside the mhcseqs package, so no download is needed for them.

To refresh the diverse MHC dataset from UniProt:

python scripts/fetch_diverse_mhc.py    # Download raw data → data/diverse_mhc_raw.csv
python scripts/curate_diverse_mhc.py   # Curate → mhcseqs/diverse_mhc_sequences.csv

Species prefixes

Species	Latin name	MHC prefix
human	Homo sapiens	HLA
macaque	Macaca mulatta	Mamu
chimpanzee	Pan troglodytes	Patr
gorilla	Gorilla gorilla	Gogo
mouse	Mus musculus	H2
rat	Rattus norvegicus	RT1
cattle	Bos taurus	BoLA
pig	Sus scrofa	SLA
horse	Equus caballus	ELA
sheep	Ovis aries	OLA
dog	Canis lupus familiaris	DLA
cat	Felis catus	FLA
chicken	Gallus gallus	Gaga
salmon	Salmo salar	Sasa
zebrafish	Danio rerio	Dare
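
The prefix table can be used as a simple lookup from allele name to species. The sketch below takes everything before the first hyphen as the prefix; this is a simplification for illustration, since the package relies on mhcgnomes for real name parsing, and the helper `species_from_allele` is hypothetical.

```python
# Prefix-to-species mapping copied from the table above.
PREFIX_TO_SPECIES = {
    "HLA": "human", "Mamu": "macaque", "Patr": "chimpanzee", "Gogo": "gorilla",
    "H2": "mouse", "RT1": "rat", "BoLA": "cattle", "SLA": "pig", "ELA": "horse",
    "OLA": "sheep", "DLA": "dog", "FLA": "cat", "Gaga": "chicken",
    "Sasa": "salmon", "Dare": "zebrafish",
}


def species_from_allele(allele: str):
    """Return the species for an allele name, or None if the prefix is unknown.

    The match is case-sensitive and assumes a "PREFIX-rest" name layout.
    """
    prefix = allele.split("-", 1)[0]
    return PREFIX_TO_SPECIES.get(prefix)


for name in ("HLA-A*02:01", "Mamu-A1*001:01", "H2-Kb", "Xyz-A*01"):
    print(name, species_from_allele(name))
```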

Groove extraction algorithm

The groove parser is alignment-free: it uses conserved Cys-Cys disulfide pairs in Ig-fold domains as structural landmarks to locate domain boundaries, so no multiple sequence alignment is needed.

See the groove.py module docstring for detailed algorithm documentation with ASCII structural diagrams.
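
To give a feel for the landmark idea, the sketch below scans a sequence for cysteine pairs whose separation falls inside a fixed window. This is not the algorithm in groove.py: the function `candidate_cys_pairs` and the 50 to 70 residue window are assumptions chosen for illustration, and a real parser needs further rules to pick the right pair and turn it into domain boundaries.

```python
def candidate_cys_pairs(sequence: str, min_gap: int = 50, max_gap: int = 70):
    """List 0-based (i, j) cysteine index pairs with min_gap <= j - i <= max_gap.

    The gap window is an illustrative assumption, not a value from mhcseqs.
    """
    cys = [i for i, residue in enumerate(sequence) if residue == "C"]
    return [
        (a, b)
        for n, a in enumerate(cys)
        for b in cys[n + 1:]
        if min_gap <= b - a <= max_gap
    ]


# Toy sequence: glycine filler with cysteines at 1-based positions
# 101, 164, 203 and 259, mimicking the spacing of the two disulfide pairs
# in a mature HLA class I heavy chain.
toy = "G" * 100 + "C" + "G" * 62 + "C" + "G" * 38 + "C" + "G" * 55 + "C" + "G" * 20

print(candidate_cys_pairs(toy))  # [(100, 163), (202, 258)]
```

Anchoring on disulfide spacing rather than on aligned columns is what lets this kind of approach work on divergent sequences where a reliable alignment is hard to build.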

Dependencies

Python 3.10+
mhcgnomes >= 3.1.0 — allele name parsing

No alignment tools, BLAST, or structure databases are required.

Repository structure

mhcseqs/
├── mhcseqs/
│   ├── __init__.py        # Public API
│   ├── __main__.py        # CLI entry point
│   ├── version.py         # Package version
│   ├── download.py        # FASTA source downloading
│   ├── species.py         # Species taxonomy (29-class → 10-class)
│   ├── alleles.py         # Allele name parsing (mhcgnomes wrapper)
│   ├── groove.py          # Binding groove extraction + mutation support
│   ├── imgt.py            # IMGT G-DOMAIN position numbering
│   ├── pipeline.py        # Two-step build pipeline
│   ├── validate.py        # Post-build validation
│   ├── b2m_sequences.csv                # Reference B2M sequences (UniProt)
│   ├── mouse_h2_sequences.csv          # Mouse H-2 sequences (UniProt)
│   └── diverse_mhc_sequences.csv      # 16k diverse MHC sequences (UniProt)
├── tests/                 # pytest test suite
├── scripts/
│   ├── fetch_diverse_mhc.py      # Download diverse MHC from UniProt
│   ├── curate_diverse_mhc.py     # Curate into shipped CSV
│   └── validate_signal_peptides.py
├── data/                          # Intermediate data (not shipped)
├── build.py               # Convenience shim
├── pyproject.toml         # Package metadata
├── develop.sh             # Install in dev mode
├── lint.sh                # Run ruff
├── test.sh                # Run pytest
└── deploy.sh              # Build + publish to PyPI

License

Apache 2.0



mhcseqs 1.1.0
