MHC sequence curation and binding groove extraction
Project description
mhcseqs
Self-contained pipeline for downloading, curating, and extracting binding grooves from MHC (Major Histocompatibility Complex) protein sequences.
Install
pip install mhcseqs
For development:
git clone https://github.com/pirl-unc/mhcseqs.git
cd mhcseqs
./develop.sh # uv pip install -e ".[dev]"
./test.sh # pytest
./lint.sh # ruff
Quick start
CLI
# Build all output CSVs (writes to ~/.cache/mhcseqs/)
mhcseqs build
# Build to a specific directory instead
mhcseqs build --output-dir output/
# Look up a specific allele
mhcseqs lookup "HLA-A*02:01"
# Check version
mhcseqs --version
Python API
import mhcseqs
# Build the database (downloads to ~/.cache/mhcseqs/, only needed once)
paths = mhcseqs.build() # BuildPaths dataclass
# Look up any allele → AlleleRecord with everything
r = mhcseqs.lookup("HLA-A*02:01")
r.sequence # full protein (with signal peptide)
r.mature_sequence # signal peptide removed (computed property)
r.mature_start # signal peptide length (24 for HLA-A*02:01)
r.groove1 # α1 domain
r.groove2 # α2 domain
r.ig_domain # α3 Ig-fold
r.tail # TM + cytoplasmic
r.species_category # "human"
# Apply mutations (IEDB-style, e.g. "K66A")
m = mhcseqs.lookup("HLA-A*02:01", mutations=["K66A", "D77S"])
Load as a DataFrame
import mhcseqs
# As a DataFrame (full sequence + groove decomposition + metadata)
df = mhcseqs.load_sequences_dataframe()
# Or as a list of dicts (no pandas dependency)
rows = mhcseqs.load_sequences_dict()
Current data summary
Three sources are merged into a single dataset:
| Source | Entries | Species | Notes |
|---|---|---|---|
| IMGT/HLA | 44,630 | Human | Downloaded at build time |
| IPD-MHC | 12,380 | Non-human mammals, birds, fish | Downloaded at build time |
| UniProt | 20,566 | 500+ species | Curated diverse MHC, B2M, H-2 references (shipped in package) |
| Total raw | 77,576 | ||
| After merge/dedup | 55,658 | One representative per two-field allele | |
| Groove OK | 54,121 | 97.2% of representatives |
By species category
| Category | Count |
|---|---|
| human | 25,364 |
| bird | 9,312 |
| nhp | 7,125 |
| fish | 4,859 |
| ungulate | 1,768 |
| murine | 1,587 |
| other_vertebrate | 1,137 |
| other_mammal | 943 |
| carnivore | 484 |
Species categories: human, nhp (non-human primates), murine (mice, rats, rodents), ungulate (cattle, pig, horse, sheep, goat), carnivore (dog, cat), other_mammal (marsupials, monotremes, bats, cetaceans, rabbit), bird, fish, other_vertebrate (reptiles, amphibians).
Data directory
By default, mhcseqs build downloads FASTA files and writes output CSVs to
~/.cache/mhcseqs/ (override with $MHCSEQS_DATA or --output-dir).
~/.cache/mhcseqs/
├── fasta/ # Downloaded FASTA source files
│ ├── hla_prot.fasta # IMGT/HLA (human)
│ └── ipd_mhc_prot.fasta # IPD-MHC (non-human)
├── mhc-seqs-raw.csv # Every protein entry from all sources
├── mhc-full-seqs.csv # One representative per two-field allele (with grooves)
├── mhc-merge-report.txt # Deduplication decisions
└── mhc-validation-report.txt # Sanity checks
Structural decomposition
Each protein chain is decomposed into four contiguous regions:
| Column | Class I alpha | Class II alpha | Class II beta |
|---|---|---|---|
groove1 |
α1 domain (~90 aa) | α1 domain (~83 aa) | — |
groove2 |
α2 domain (~93 aa) | — | β1 domain (~93 aa) |
ig_domain |
α3 Ig-fold (~95 aa) | α2 Ig-fold (~95 aa) | β2 Ig-fold (~95 aa) |
tail |
TM + cytoplasmic | TM + cytoplasmic | TM + cytoplasmic |
For a class I chain: mature_protein = groove1 + groove2 + ig_domain + tail
Groove extraction algorithm
The groove parser is alignment-free — it uses conserved Cys-Cys disulfide pairs in Ig-fold domains as structural landmarks to slice domain boundaries without multiple sequence alignment.
Signal peptide inference
Signal peptide length is inferred from the Cys pair position, not from sequence motifs. The conserved Ig-fold Cys has a known position in the mature protein (e.g., position 100 for class I α2). The offset between the raw sequence position and the expected mature position gives the signal peptide length:
mature_start = raw_cys_position - expected_mature_cys_position
Gene-specific constants account for groove domain length variation: DQA (109), DMA (120), DPB (114). All others use the defaults (class I: 100, class II alpha: 106, class II beta: 116).
Groove status values
| Status | Meaning |
|---|---|
ok |
Full decomposition: groove1 + groove2 + ig_domain + tail |
alpha1_only |
Single-exon class I fragment — α1 domain (no Cys pair) |
alpha2_only |
Single-exon class I fragment — α2 domain (has Cys pair) |
beta1_only_fallback |
Class II beta with β1 pair only (no β2 Ig pair) |
fragment_fallback |
Short fragment used as raw groove sequence |
inferred_from_alpha3 |
Groove boundaries estimated from α3 Cys pair |
not_applicable |
Non-groove gene (B2M, MICA, MICB, HFE, MR1) |
non_classical |
Non-classical MHC lineage (fish L/S/P/H) |
short |
Groove half < 70 aa — unlikely to be functional |
suspect_anchor |
Cys mutation produced implausible mature_start |
See groove.py module docstring for detailed algorithm documentation with ASCII structural diagrams.
Key columns
| Column | Description |
|---|---|
two_field_allele |
Allele name at two-field resolution |
gene |
MHC gene (e.g., A, DRB1, BF, UA) |
mhc_class |
I or II |
chain |
alpha, beta, or B2M |
species |
Latin binomial from source |
species_category |
One of 9 categories above |
source |
imgt, ipd_mhc, or uniprot |
source_id |
Database accession for provenance |
groove_status |
See table above |
is_functional |
True if groove parsed and not null/pseudogene |
Dependencies
- Python 3.10+
- mhcgnomes >= 3.14.0 — allele name parsing with species-directed disambiguation
No alignment tools, BLAST, or structure databases are required.
License
Apache 2.0
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file mhcseqs-1.2.0.tar.gz.
File metadata
- Download URL: mhcseqs-1.2.0.tar.gz
- Upload date:
- Size: 786.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
e518712b5103d6af47064c37cf96b9470258a1c623ef0a1205e618613862e3b1
|
|
| MD5 |
41466284f3e0a5d02c706f7e5ab018b9
|
|
| BLAKE2b-256 |
68ce4e54184a3c4aa1d90b3a07803d45ff18bf5fc38f7e473160fd56c225ab44
|
File details
Details for the file mhcseqs-1.2.0-py3-none-any.whl.
File metadata
- Download URL: mhcseqs-1.2.0-py3-none-any.whl
- Upload date:
- Size: 786.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
0a61c1b06e631a88e8bf7b938afdbad9c300a66cb29b0f30e1070c8a20aca597
|
|
| MD5 |
ecf3fe5f8ca7bd818fab708e25e19433
|
|
| BLAKE2b-256 |
8e68c51864bfd1c04e50d6e2c6686fc3991b5d21a266edd69c72a2fa344f9476
|