MHC sequence curation and binding groove extraction
mhcseqs
Self-contained pipeline for downloading, curating, and extracting binding grooves from MHC (Major Histocompatibility Complex) protein sequences.
Install
pip install mhcseqs
For development:
git clone https://github.com/openvax/mhcseqs.git
cd mhcseqs
./develop.sh # uv pip install -e ".[dev]"
./test.sh # pytest
./lint.sh # ruff
Quick start
CLI
# Build all output CSVs (writes to ~/.cache/mhcseqs/)
mhcseqs build
# Build to a specific directory instead
mhcseqs build --output-dir output/
# Look up a specific allele
mhcseqs lookup "HLA-A*02:01"
# Check version
mhcseqs --version
Python API
import mhcseqs
# Build the database (downloads to ~/.cache/mhcseqs/, only needed once)
paths = mhcseqs.build() # BuildPaths dataclass
# Look up any allele → AlleleRecord with everything
r = mhcseqs.lookup("HLA-A*02:01")
r.sequence # full protein (with signal peptide)
r.mature_sequence # signal peptide removed (computed property)
r.mature_start # signal peptide length (24 for HLA-A*02:01)
r.groove1 # α1 domain
r.groove2 # α2 domain
r.ig_domain # α3 Ig-fold
r.tail # TM + cytoplasmic
r.species_category # "human"
# Apply mutations (IEDB-style, e.g. "K66A")
m = mhcseqs.lookup("HLA-A*02:01", mutations=["K66A", "D77S"])
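An IEDB-style mutation string such as "K66A" names the wild-type residue, a 1-based position, and the replacement residue. The sketch below shows one way such a string could be applied to a sequence. `apply_mutation` is a hypothetical helper, not part of the mhcseqs API, and it assumes positions count from the first residue of whatever sequence is passed in; how mhcseqs itself numbers positions is not stated here.

```python
import re

# Hypothetical helper for illustration; mhcseqs' own implementation may differ.
_MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def apply_mutation(sequence: str, mutation: str) -> str:
    """Apply one point mutation like "K66A" (1-based position) to a sequence."""
    match = _MUTATION_RE.match(mutation)
    if match is None:
        raise ValueError(f"unrecognized mutation string: {mutation!r}")
    wild_type = match.group(1)
    position = int(match.group(2))
    replacement = match.group(3)
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    index = position - 1  # convert to 0-based indexing
    if sequence[index] != wild_type:
        raise ValueError(
            f"expected {wild_type} at position {position}, found {sequence[index]}"
        )
    return sequence[:index] + replacement + sequence[index + 1:]


# Toy sequence, not a real MHC protein.
toy = "MKTAYK"
mutated = apply_mutation(apply_mutation(toy, "K2A"), "Y5F")
print(mutated)  # MATAFK
```

Checking the wild-type residue before substituting catches off-by-one numbering mistakes early, which is the main failure mode when mutation strings come from a different numbering scheme.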
Load as a DataFrame
import mhcseqs
# As a DataFrame (full sequence + groove decomposition + metadata)
df = mhcseqs.load_sequences_dataframe()
# Or as a list of dicts (no pandas dependency)
rows = mhcseqs.load_sequences_dict()
Data directory
By default, mhcseqs build downloads FASTA files and writes output CSVs to
~/.cache/mhcseqs/ (override with $MHCSEQS_DATA or --output-dir).
~/.cache/mhcseqs/
├── fasta/ # Downloaded FASTA source files
│ ├── hla_prot.fasta # IMGT/HLA (human)
│ └── ipd_mhc_prot.fasta # IPD-MHC (non-human)
├── mhc-seqs-raw.csv # Every protein entry from all sources
├── mhc-full-seqs.csv # One representative per two-field allele (with grooves)
├── mhc-merge-report.txt # Deduplication decisions
└── mhc-validation-report.txt # Sanity checks
Pre-built CSVs are also attached to each GitHub release.
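The directory choice described above can be sketched as follows. `resolve_data_dir` is a hypothetical name, and the precedence shown (explicit output directory first, then `$MHCSEQS_DATA`, then the default) is an assumption; the README states only that both overrides exist.

```python
import os
from pathlib import Path


def resolve_data_dir(output_dir=None, env=None):
    """Pick a data directory: explicit argument, then $MHCSEQS_DATA, then default.

    Hypothetical sketch; the precedence order is an assumption.
    """
    env = os.environ if env is None else env
    if output_dir:
        return Path(output_dir).expanduser()
    if env.get("MHCSEQS_DATA"):
        return Path(env["MHCSEQS_DATA"]).expanduser()
    return Path.home() / ".cache" / "mhcseqs"


# With no overrides, the default cache location is used.
print(resolve_data_dir(env={}))
# An environment override replaces the default.
print(resolve_data_dir(env={"MHCSEQS_DATA": "/tmp/mhc"}))
```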
Output files
| File | Description |
|---|---|
| mhc-seqs-raw.csv | Every protein entry from all sources |
| mhc-full-seqs.csv | One representative per two-field allele: full sequence, groove decomposition, and metadata |
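To illustrate what "one representative per two-field allele" means, the sketch below truncates allele names to their first two colon-separated fields and keeps one entry per truncated name. Both function names are hypothetical, the plain string split is a simplification (the package parses names with mhcgnomes), and preferring the longest sequence is an assumed rule; the actual decisions are recorded in mhc-merge-report.txt.

```python
def two_field_name(allele: str) -> str:
    """Truncate an allele name to its first two colon-separated fields."""
    return ":".join(allele.split(":")[:2])


def pick_representatives(entries):
    """Keep one (name, sequence) entry per two-field allele.

    Assumed rule for illustration: prefer the longest sequence.
    """
    best = {}
    for name, sequence in entries:
        key = two_field_name(name)
        if key not in best or len(sequence) > len(best[key][1]):
            best[key] = (name, sequence)
    return best


# Placeholder sequences, not real MHC proteins.
entries = [
    ("HLA-A*02:01:01:01", "AAAAAAAAAA"),
    ("HLA-A*02:01:01:02", "AAAAAAAA"),
    ("HLA-A*02:02:01", "AAAAAAAAAG"),
]
reps = pick_representatives(entries)
print(sorted(reps))  # ['HLA-A*02:01', 'HLA-A*02:02']
```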
Current data summary
All sources (IMGT/HLA, IPD-MHC, UniProt curated references, and 16,208 diverse MHC sequences from UniProt) are merged into a single dataset:
| Category | Class I | Class II | Total |
|---|---|---|---|
| human | 17,462 | 7,878 | 25,340 |
| nhp | 4,639 | 2,486 | 7,125 |
| murine | 59 | 29 | 88 |
| ungulate | 638 | 1,128 | 1,766 |
| carnivore | 166 | 318 | 484 |
| cetacean | 3 | 98 | 101 |
| other_mammal | 483 | 288 | 771 |
| bird | 6,135 | 3,351 | 9,486 |
| fish | 1,416 | 3,569 | 4,985 |
| other_vertebrate | 502 | 666 | 1,168 |
| total | 31,503 | 19,811 | 51,314 |
The dataset covers 614+ species prefixes. The groove parser succeeds on 99.6% of IMGT/IPD-MHC entries.
Structural decomposition
Each protein chain is decomposed into up to four contiguous regions (a class II chain carries only one groove domain):
| Column | Class I alpha | Class II alpha | Class II beta |
|---|---|---|---|
| groove1 | α1 domain (~90 aa) | α1 domain (~83 aa) | - |
| groove2 | α2 domain (~93 aa) | - | β1 domain (~93 aa) |
| ig_domain | α3 Ig-fold (~95 aa) | α2 Ig-fold (~95 aa) | β2 Ig-fold (~95 aa) |
| tail | TM + cytoplasmic | TM + cytoplasmic | TM + cytoplasmic |
For a class I chain: mature_protein = groove1 + groove2 + ig_domain + tail
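The concatenation identity for a class I chain can be checked with a small stand-in record. `ToyRecord` is a hypothetical class that borrows the field names used by the lookup result (`sequence`, `mature_start`, `groove1`, and so on); it is not the package's `AlleleRecord`, and the strings are placeholders rather than real MHC sequence.

```python
from dataclasses import dataclass


@dataclass
class ToyRecord:
    """Stand-in for a decomposed class I chain (hypothetical, for illustration)."""

    sequence: str      # full protein, signal peptide included
    mature_start: int  # signal peptide length
    groove1: str
    groove2: str
    ig_domain: str
    tail: str

    @property
    def mature_sequence(self) -> str:
        # Drop the signal peptide from the front of the full sequence.
        return self.sequence[self.mature_start:]


record = ToyRecord(
    sequence="MMM" + "GGGG" + "HHHH" + "III" + "TT",
    mature_start=3,
    groove1="GGGG",
    groove2="HHHH",
    ig_domain="III",
    tail="TT",
)

rebuilt = record.groove1 + record.groove2 + record.ig_domain + record.tail
print(rebuilt == record.mature_sequence)  # True
```

For a class II chain the same check would simply concatenate an empty string for the missing groove column, assuming the absent region is stored as empty.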
Key columns
Both CSVs share: gene, mhc_class, chain, species,
species_category, species_prefix, source, source_id.
source is one of: imgt, ipd_mhc, uniprot_curated, uniprot_reference,
uniprot_diverse.
source_id is the database accession for provenance tracking (e.g.,
HLA00001 for IMGT, NHP00001 for IPD-MHC, P01901 for UniProt).
species_category is one of: human, nhp, murine, ungulate,
carnivore, cetacean, other_mammal, bird, fish, other_vertebrate.
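Because `source` and `species_category` are controlled vocabularies, rows can be validated and tallied with the standard library alone. The sketch below uses the shared column names and the allowed values listed above; the sample rows are invented for illustration (only the three accessions appear in this README), and the values shown for `gene`, `mhc_class`, and `chain` are assumptions about how those columns are encoded.

```python
import csv
import io
from collections import Counter

ALLOWED_SOURCES = {
    "imgt", "ipd_mhc", "uniprot_curated", "uniprot_reference", "uniprot_diverse",
}
ALLOWED_CATEGORIES = {
    "human", "nhp", "murine", "ungulate", "carnivore", "cetacean",
    "other_mammal", "bird", "fish", "other_vertebrate",
}

# Invented sample rows; real files have many more columns and rows.
SAMPLE = """gene,mhc_class,chain,species,species_category,species_prefix,source,source_id
A,I,alpha,Homo sapiens,human,HLA,imgt,HLA00001
A1,I,alpha,Macaca mulatta,nhp,Mamu,ipd_mhc,NHP00001
K,I,alpha,Mus musculus,murine,H2,uniprot_curated,P01901
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Flag any row whose vocabulary fields fall outside the documented values.
bad_rows = [
    row for row in rows
    if row["source"] not in ALLOWED_SOURCES
    or row["species_category"] not in ALLOWED_CATEGORIES
]

counts = Counter(row["species_category"] for row in rows)
print(len(bad_rows))   # 0
print(dict(counts))    # {'human': 1, 'nhp': 1, 'murine': 1}
```

The same loop works on a real output file by replacing `io.StringIO(SAMPLE)` with an open file handle.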
Data sources
| Source | source value | Species | Data |
|---|---|---|---|
| IMGT/HLA | imgt | Human | Downloaded at build time |
| IPD-MHC | ipd_mhc | Non-human | Downloaded at build time |
| UniProt | uniprot_reference | Multi-species | B2M references (shipped in package) |
| UniProt | uniprot_curated | Mouse | 30 H-2 alleles (shipped in package) |
| UniProt | uniprot_diverse | 614 species | 16,208 diverse MHC sequences (shipped in package) |
IMGT/HLA and IPD-MHC FASTA files are downloaded on first build and cached.
The UniProt curated CSVs (b2m_sequences.csv, mouse_h2_sequences.csv,
diverse_mhc_sequences.csv) ship inside the mhcseqs package — no download needed.
To refresh the diverse MHC dataset from UniProt:
python scripts/fetch_diverse_mhc.py # Download raw data → data/diverse_mhc_raw.csv
python scripts/curate_diverse_mhc.py # Curate → mhcseqs/diverse_mhc_sequences.csv
Species prefixes
| Species | Latin name | MHC prefix |
|---|---|---|
| human | Homo sapiens | HLA |
| macaque | Macaca mulatta | Mamu |
| chimpanzee | Pan troglodytes | Patr |
| gorilla | Gorilla gorilla | Gogo |
| mouse | Mus musculus | H2 |
| rat | Rattus norvegicus | RT1 |
| cattle | Bos taurus | BoLA |
| pig | Sus scrofa | SLA |
| horse | Equus caballus | ELA |
| sheep | Ovis aries | OLA |
| dog | Canis lupus familiaris | DLA |
| cat | Felis catus | FLA |
| chicken | Gallus gallus | Gaga |
| salmon | Salmo salar | Sasa |
| zebrafish | Danio rerio | Dare |
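A prefix table like this one lends itself to a simple lookup. The sketch below copies a few rows into a dictionary and takes the text before the first hyphen as the prefix. `species_for_allele` is a hypothetical helper and the split is a simplification: names written with a hyphenated prefix (for example "H-2") would not resolve, which is one reason the package delegates name parsing to mhcgnomes.

```python
# A few rows from the species prefix table.
PREFIX_TO_SPECIES = {
    "HLA": "Homo sapiens",
    "Mamu": "Macaca mulatta",
    "Patr": "Pan troglodytes",
    "H2": "Mus musculus",
    "BoLA": "Bos taurus",
    "SLA": "Sus scrofa",
    "DLA": "Canis lupus familiaris",
    "Gaga": "Gallus gallus",
}


def species_for_allele(allele: str):
    """Return the Latin species name for an allele, or None if the prefix is unknown."""
    prefix = allele.split("-", 1)[0]
    return PREFIX_TO_SPECIES.get(prefix)


print(species_for_allele("HLA-A*02:01"))     # Homo sapiens
print(species_for_allele("Mamu-A1*001:01"))  # Macaca mulatta
print(species_for_allele("Xxxx-A*01:01"))    # None
```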
Groove extraction algorithm
The groove parser is alignment-free: it uses conserved Cys-Cys disulfide pairs in Ig-fold domains as structural landmarks to slice domain boundaries, with no multiple sequence alignment required.
See the groove.py module docstring for detailed algorithm documentation with ASCII structural diagrams.
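The landmark idea can be illustrated with a toy search for cysteine pairs at a characteristic spacing. This is not the algorithm in groove.py, whose actual rules live in its docstring; `find_cys_pairs` is a hypothetical function and the 50 to 70 residue window is an assumption chosen for illustration. For context, the two conserved disulfides of HLA class I heavy chains join Cys101 to Cys164 and Cys203 to Cys259 in mature numbering, spacings of 63 and 56 residues.

```python
def find_cys_pairs(sequence, min_gap=50, max_gap=70):
    """Return 0-based (i, j) index pairs of cysteines whose spacing fits the window.

    Illustrative only: the spacing window is an assumed parameter.
    """
    cys_positions = [i for i, aa in enumerate(sequence) if aa == "C"]
    return [
        (i, j)
        for n, i in enumerate(cys_positions)
        for j in cys_positions[n + 1:]
        if min_gap <= j - i <= max_gap
    ]


# Synthetic sequence with cysteines at 0-based indices 10, 73 and 120.
toy = "A" * 10 + "C" + "A" * 62 + "C" + "A" * 46 + "C" + "A" * 20
pairs = find_cys_pairs(toy)
print(pairs)  # [(10, 73)]
```

Once a candidate pair is located, domain boundaries could be placed at fixed offsets from the two cysteines. A real parser also has to resolve ambiguous or missing pairs, which this sketch does not attempt.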
Dependencies
- Python 3.10+
- mhcgnomes >= 3.1.0 — allele name parsing
No alignment tools, BLAST, or structure databases are required.
Repository structure
mhcseqs/
├── mhcseqs/
│ ├── __init__.py # Public API
│ ├── __main__.py # CLI entry point
│ ├── version.py # Package version
│ ├── download.py # FASTA source downloading
│ ├── species.py # Species taxonomy (29-class → 10-class)
│ ├── alleles.py # Allele name parsing (mhcgnomes wrapper)
│ ├── groove.py # Binding groove extraction + mutation support
│ ├── imgt.py # IMGT G-DOMAIN position numbering
│ ├── pipeline.py # Two-step build pipeline
│ ├── validate.py # Post-build validation
│ ├── b2m_sequences.csv # Reference B2M sequences (UniProt)
│ ├── mouse_h2_sequences.csv # Mouse H-2 sequences (UniProt)
│ └── diverse_mhc_sequences.csv # 16k diverse MHC sequences (UniProt)
├── tests/ # pytest test suite
├── scripts/
│ ├── fetch_diverse_mhc.py # Download diverse MHC from UniProt
│ ├── curate_diverse_mhc.py # Curate into shipped CSV
│ └── validate_signal_peptides.py
├── data/ # Intermediate data (not shipped)
├── build.py # Convenience shim
├── pyproject.toml # Package metadata
├── develop.sh # Install in dev mode
├── lint.sh # Run ruff
├── test.sh # Run pytest
└── deploy.sh # Build + publish to PyPI
License
Apache 2.0