MHC sequence curation and binding groove extraction
Project description
mhcseqs
Self-contained pipeline for downloading, curating, and extracting binding grooves from MHC (Major Histocompatibility Complex) protein sequences.
Install
pip install mhcseqs
For development:
git clone https://github.com/pirl-unc/mhcseqs.git
cd mhcseqs
./develop.sh # uv pip install -e ".[dev]"
./test.sh # pytest
./lint.sh # ruff
Quick start
CLI
# Build all output CSVs (writes to ~/.cache/mhcseqs/)
mhcseqs build
# Build to a specific directory instead
mhcseqs build --output-dir output/
# Look up a specific allele
mhcseqs lookup "HLA-A*02:01"
# Check version
mhcseqs --version
Python API
import mhcseqs
# Build the database (downloads to ~/.cache/mhcseqs/, only needed once)
paths = mhcseqs.build() # BuildPaths dataclass
# Look up any allele → AlleleRecord with everything
r = mhcseqs.lookup("HLA-A*02:01")
r.sequence # full protein (with signal peptide)
r.mature_sequence # signal peptide removed (computed property)
r.mature_start # signal peptide length (24 for HLA-A*02:01)
r.groove1 # α1 domain
r.groove2 # α2 domain
r.ig_domain # α3 Ig-fold
r.tail # TM + cytoplasmic
r.domains # typed domain spans
r.domain_architecture
r.domain_spans
r.species_category # "human"
# Apply mutations (IEDB-style, e.g. "K66A")
m = mhcseqs.lookup("HLA-A*02:01", mutations=["K66A", "D77S"])
Load as a DataFrame
import mhcseqs
# As a DataFrame (full sequence + groove decomposition + metadata)
df = mhcseqs.load_sequences_dataframe()
# Or as a list of dicts (no pandas dependency)
rows = mhcseqs.load_sequences_dict()
Current data summary
Three sources are merged into a single dataset:
| Source | Entries | Species | Notes |
|---|---|---|---|
| IMGT/HLA | 44,630 | Human | Downloaded at build time |
| IPD-MHC | 12,380 | Non-human mammals, birds, fish | Downloaded at build time |
| UniProt | 20,566 | 500+ species | Curated diverse MHC, B2M, H-2 references (shipped in package) |
| Total raw | 77,576 | ||
| After merge/dedup | 55,658 | One representative per two-field allele | |
| Groove OK | 54,098 | 97.2% of representatives |
By species category
| Category | Count | Groove OK | Class I full | Class II full |
|---|---|---|---|---|
| human | 25,364 | 99.8% | 17,426 | 5,347 |
| nhp | 7,125 | 98.0% | 4,583 | 1,426 |
| bird | 9,312 | 98.9% | 546 | 562 |
| fish | 4,859 | 97.4% | 776 | 1,259 |
| ungulate | 1,768 | 94.7% | 577 | 303 |
| murine | 1,587 | 89.7% | 490 | 118 |
| carnivore | 484 | 99.6% | 164 | 0 |
| other_mammal | 943 | 95.1% | 280 | 146 |
| other_vertebrate | 1,137 | 96.0% | 324 | 129 |
"Groove OK" includes both full-length and fragment parses. "Full" means both groove
halves present (class I: α1 + α2, ~183 aa; class II: single chain's groove half with
Ig support, ~88 aa groove + ~90 aa Ig). The remaining entries are single-exon fragments
(one groove half only) — common in bird and fish submissions to IPD-MHC where only
the polymorphic exon is sequenced. The parser correctly characterizes these as
alpha1_only, alpha2_only, beta1_only_fallback, or fragment_fallback.
Species categories: human, nhp (non-human primates), murine (mice, rats, rodents), ungulate (cattle, pig, horse, sheep, goat), carnivore (dog, cat), other_mammal (marsupials, monotremes, bats, cetaceans, rabbit), bird, fish, other_vertebrate (reptiles, amphibians).
Signal peptide detection accuracy
Validated against 2,403 UniProt ground-truth SP annotations:
| Species | Class I exact | Class II exact |
|---|---|---|
| Human | 99.2% | 88.4% |
| NHP | 100.0% | 91.2% |
| Bird | 94.4% | 89.3% |
| Fish | 86.9% | 81.9% |
| Other vertebrate | 68.2% | 86.1% |
| Overall | 82.0% exact, 89.9% within ±2 aa |
False positive rate on 2,155 mature-only controls: 3.8%.
Data directory
By default, mhcseqs build downloads FASTA files and writes output CSVs to
~/.cache/mhcseqs/ (override with $MHCSEQS_DATA or --output-dir).
~/.cache/mhcseqs/
├── fasta/ # Downloaded FASTA source files
│ ├── hla_prot.fasta # IMGT/HLA (human)
│ └── ipd_mhc_prot.fasta # IPD-MHC (non-human)
├── mhc-seqs-raw.csv # Every protein entry from all sources
├── mhc-full-seqs.csv # One representative per two-field allele (with grooves)
├── mhc-merge-report.txt # Deduplication decisions
└── mhc-validation-report.txt # Sanity checks
Structural decomposition
The parser materializes an explicit domain grammar:
| Chain | Grammar |
|---|---|
| Class I alpha | signal_peptide? -> g_alpha1 -> g_alpha2 -> c1_alpha3 -> transmembrane? -> cytoplasmic_tail? |
| Class II alpha | signal_peptide? -> g_alpha1 -> c1_alpha2 -> transmembrane? -> cytoplasmic_tail? |
| Class II beta | signal_peptide? -> g_beta1 -> c1_beta2 -> transmembrane? -> cytoplasmic_tail? |
The exported contiguous sequence fields are:
| Column | Class I alpha | Class II alpha | Class II beta |
|---|---|---|---|
groove1 |
α1 domain (~80-95 aa typical) | α1 domain (~75-95 aa typical) | — |
groove2 |
α2 domain (~80-100 aa typical) | — | β1 domain (~70-100 aa typical) |
ig_domain |
α3 C-like support domain | α2 C-like support domain | β2 C-like support domain |
tail |
linker + TM + cytoplasmic tail | linker + TM + cytoplasmic tail | linker + TM + cytoplasmic tail |
domain_architecture and domain_spans expose the typed domain grammar
directly, for example:
- class I:
signal_peptide>g_alpha1>g_alpha2>c1_alpha3>tail_linker>transmembrane>cytoplasmic_tail - class II beta:
signal_peptide>g_beta1>c1_beta2>tail_linker>transmembrane>cytoplasmic_tail
How Parsing Works
The parser is alignment-free and holistic. It does not rely on one absolute Cys position to define the mature start.
For each sequence it:
- Enumerates all plausible Cys-Cys pairs in the Ig/C-like separation range.
- Scores each pair as a candidate G-domain or C-like anchor using fold-topology
evidence, especially the Trp41-like signal around
c1+14. - Enumerates candidate SP boundaries and whole domain parses, including partial parses when only fragment evidence is available.
- Chooses the best full parse using factored multiplicative scoring: three structural claims (SP grammar, domain architecture, completeness) each produce a [0,1] factor. Contradictory evidence in any factor gates the score down multiplicatively, while missing evidence is a softer penalty.
The strongest evidence types are:
- SP cleavage grammar: hydrophobic h-region, short c-region, von Heijne
-3/-1compatibility, exclusion of impossible-3/-1property pairs, and mild+1mature-sequence penalties. - Domain-fold grammar: canonical G-domain versus C-like disulfide topology,
including the IMGT-style
Cys11-Cys74G-domain signature and theCys23/Trp41/Cys104C-like grammar. - Class-specific groove boundaries:
- class I α1/α2 junction motifs
- class I α2 -> α3 boundary motifs
- class II α1 -> α2 and β1 -> β2 boundary motifs
- Soft priors on groove/support-domain lengths and TM support downstream.
The parser handles:
- full-length proteins with or without signal peptides
- SP-stripped deposits (
mature_start = 0) - common fragments:
- class I exon 2 only ->
alpha1_only - class I exon 3 only ->
alpha2_only - class II exon 2-like fragments ->
fragment_fallback
- class I exon 2 only ->
- low-evidence salvage:
- class I from α3 C-like support only ->
inferred_from_alpha3 - class II beta from β1 groove pair only ->
beta1_only_fallback
- class I from α3 C-like support only ->
- true groove absence / insufficient structural evidence ->
missing_groove
Groove Status Values
These are the important parser-facing statuses:
| Status | Meaning |
|---|---|
ok |
Full decomposition from the main structural grammar |
alpha1_only |
Class I fragment consistent with α1 / exon 2 only |
alpha2_only |
Class I fragment consistent with α2 / exon 3 only |
fragment_fallback |
Short fragment retained as the observable groove half |
inferred_from_alpha3 |
Class I salvage parse using a downstream α3 C-like anchor |
beta1_only_fallback |
Class II beta salvage parse using only the β1 groove pair |
missing_groove |
No recoverable groove architecture from the available evidence |
non_classical |
Non-classical class-I lineage flagged post-parse |
short |
Groove half too short to look functionally peptide-binding |
Pipeline-only statuses can still appear in CSV outputs:
| Status | Meaning |
|---|---|
not_applicable |
Row intentionally excluded from groove functionality, mainly B2M in build outputs |
Literature Basis
The parser is built around conserved sequence grammar from the MHC literature:
- MHC domain organization is more conserved than short local motifs across vertebrates: Primordial Linkage of β2-Microglobulin to the MHC
- IMGT domain numbering and the G-domain versus C-like disulfide grammar: PMC3913909
- Classical class-I domain layout and landmarks: PMC2434379
- Salmonid class-II alpha/beta cysteine topology and lineage-specific extra cysteines: PMC2386828
- Teleost class-II evolutionary divergence while retaining the same modular architecture: PMC4219347
Signal-peptide logic follows the standard SPase grammar:
- von Heijne
-3/-1cleavage rule: How signal sequences maintain cleavage specificity - flanking
-2/+1effects: Flanking signal and mature peptide residues influence signal peptide cleavage - strong penalty for
+1 Pro: PubMed 1544500 - h-region / c-region structural context: Structure of the human signal peptidase complex reveals the determinants for signal peptide cleavage
Key columns
| Column | Description |
|---|---|
two_field_allele |
Allele name at two-field resolution |
gene |
MHC gene (e.g., A, DRB1, BF, UA) |
mhc_class |
I or II |
chain |
alpha, beta, or B2M |
species |
Latin binomial from source |
species_category |
One of 9 categories above |
source |
imgt, ipd_mhc, or uniprot |
source_id |
Database accession for provenance |
groove_status |
See table above |
is_functional |
True if groove parsed and not null/pseudogene |
Dependencies
- Python 3.10+
- mhcgnomes >= 3.14.0 — allele name parsing with species-directed disambiguation
No alignment tools, BLAST, or structure databases are required.
License
Apache 2.0
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file mhcseqs-2.2.0.tar.gz.
File metadata
- Download URL: mhcseqs-2.2.0.tar.gz
- Upload date:
- Size: 843.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
68c73c609d5e74c985117d1124af35dfae01d472e1bc39b695ae4eb70f509f2a
|
|
| MD5 |
b75b999edd0525c1179442eec5fd4d41
|
|
| BLAKE2b-256 |
2d63588834b40f87fb3d9e3d37bc71d0f17bf78ff467b3d9be95b0641fc8987d
|
File details
Details for the file mhcseqs-2.2.0-py3-none-any.whl.
File metadata
- Download URL: mhcseqs-2.2.0-py3-none-any.whl
- Upload date:
- Size: 832.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2a3e7c66c1f532f81848c954fbe6aa0f8f262f349567b6e75335b2e47f2a5ea3
|
|
| MD5 |
153a64b3f7d43edfd7c11647edd2b16e
|
|
| BLAKE2b-256 |
fbf52ec9ce3ba0a5befd7de2216ad508d388c947388d2bc3adf11b8c33551d5f
|