Skip to main content

MHC sequence curation and binding groove extraction

Project description

mhcseqs

Self-contained pipeline for downloading, curating, and extracting binding grooves from MHC (Major Histocompatibility Complex) protein sequences.

Install

pip install mhcseqs

For development:

git clone https://github.com/pirl-unc/mhcseqs.git
cd mhcseqs
./develop.sh          # uv pip install -e ".[dev]"
./test.sh             # pytest
./lint.sh             # ruff

Quick start

CLI

# Build all output CSVs (writes to ~/.cache/mhcseqs/)
mhcseqs build

# Build to a specific directory instead
mhcseqs build --output-dir output/

# Look up a specific allele
mhcseqs lookup "HLA-A*02:01"

# Check version
mhcseqs --version

Python API

import mhcseqs

# Build the database (downloads to ~/.cache/mhcseqs/, only needed once)
paths = mhcseqs.build()  # BuildPaths dataclass

# Look up any allele → AlleleRecord with everything
r = mhcseqs.lookup("HLA-A*02:01")
r.sequence          # full protein (with signal peptide)
r.mature_sequence   # signal peptide removed (computed property)
r.mature_start      # signal peptide length (24 for HLA-A*02:01)
r.groove1           # α1 domain
r.groove2           # α2 domain
r.ig_domain         # α3 Ig-fold
r.tail              # TM + cytoplasmic
r.domains           # typed domain spans
r.domain_architecture
r.domain_spans
r.species_category  # "human"

# Apply mutations (IEDB-style, e.g. "K66A")
m = mhcseqs.lookup("HLA-A*02:01", mutations=["K66A", "D77S"])

Load as a DataFrame

import mhcseqs

# As a DataFrame (full sequence + groove decomposition + metadata)
df = mhcseqs.load_sequences_dataframe()

# Or as a list of dicts (no pandas dependency)
rows = mhcseqs.load_sequences_dict()

Current data summary

Three sources are merged into a single dataset:

Source Entries Species Notes
IMGT/HLA 44,630 Human Downloaded at build time
IPD-MHC 12,380 Non-human mammals, birds, fish Downloaded at build time
UniProt 20,566 500+ species Curated diverse MHC, B2M, H-2 references (shipped in package)
Total raw 77,576
After merge/dedup 55,696 One representative per two-field allele
Groove OK 54,155 97.2% of representatives

By species category

Category Count Groove OK Class I full Class II full
human 25,364 99.8% 17,364 5,347
nhp 7,125 98.0% 4,582 1,429
bird 9,312 99.1% 553 558
fish 4,859 97.2% 777 1,259
ungulate 1,768 96.2% 603 303
murine 1,625 89.0% 466 158
carnivore 484 99.6% 164 0
other_mammal 943 95.0% 275 152
other_vertebrate 1,137 96.2% 327 129

"Groove OK" includes both full-length and fragment parses. "Full" means both groove halves present (class I: α1 + α2, ~183 aa; class II: single chain's groove half with Ig support, ~88 aa groove + ~90 aa Ig). The remaining entries are single-exon fragments (one groove half only) — common in bird and fish submissions to IPD-MHC where only the polymorphic exon is sequenced. The parser correctly characterizes these as alpha1_only, alpha2_only, beta1_only_fallback, or fragment_fallback.

Species categories: human, nhp (non-human primates), murine (mice, rats, rodents), ungulate (cattle, pig, horse, sheep, goat), carnivore (dog, cat), other_mammal (marsupials, monotremes, bats, cetaceans, rabbit), bird, fish, other_vertebrate (reptiles, amphibians).

Signal peptide detection accuracy

Validated against 2,403 UniProt ground-truth SP annotations:

Species Class I exact Class II exact
Human 99.2% 88.4%
NHP 100.0% 91.2%
Bird 94.4% 89.3%
Fish 86.9% 81.9%
Other vertebrate 68.2% 86.1%
Overall 82.0% exact, 89.9% within ±2 aa

False positive rate on 2,155 mature-only controls: 3.8%.

Data directory

By default, mhcseqs build downloads FASTA files and writes output CSVs to ~/.cache/mhcseqs/ (override with $MHCSEQS_DATA or --output-dir).

~/.cache/mhcseqs/
├── fasta/                     # Downloaded FASTA source files
│   ├── hla_prot.fasta         # IMGT/HLA (human)
│   └── ipd_mhc_prot.fasta    # IPD-MHC (non-human)
├── mhc-seqs-raw.csv           # Every protein entry from all sources
├── mhc-full-seqs.csv          # One representative per two-field allele (with grooves)
├── mhc-merge-report.txt       # Deduplication decisions
└── mhc-validation-report.txt  # Sanity checks

Structural decomposition

The parser materializes an explicit domain grammar:

Chain Grammar
Class I alpha signal_peptide? -> g_alpha1 -> g_alpha2 -> c1_alpha3 -> transmembrane? -> cytoplasmic_tail?
Class II alpha signal_peptide? -> g_alpha1 -> c1_alpha2 -> transmembrane? -> cytoplasmic_tail?
Class II beta signal_peptide? -> g_beta1 -> c1_beta2 -> transmembrane? -> cytoplasmic_tail?

The exported contiguous sequence fields are:

Column Class I alpha Class II alpha Class II beta
groove1 α1 domain (~80-95 aa typical) α1 domain (~75-95 aa typical)
groove2 α2 domain (~80-100 aa typical) β1 domain (~70-100 aa typical)
ig_domain α3 C-like support domain α2 C-like support domain β2 C-like support domain
tail linker + TM + cytoplasmic tail linker + TM + cytoplasmic tail linker + TM + cytoplasmic tail

domain_architecture and domain_spans expose the typed domain grammar directly, for example:

  • class I: signal_peptide>g_alpha1>g_alpha2>c1_alpha3>tail_linker>transmembrane>cytoplasmic_tail
  • class II beta: signal_peptide>g_beta1>c1_beta2>tail_linker>transmembrane>cytoplasmic_tail

How Parsing Works

The parser is alignment-free and holistic. It does not rely on one absolute Cys position to define the mature start.

For each sequence it:

  1. Enumerates all plausible Cys-Cys pairs in the Ig/C-like separation range.
  2. Scores each pair as a candidate G-domain or C-like anchor using fold-topology evidence, especially the Trp41-like signal around c1+14.
  3. Enumerates candidate SP boundaries and whole domain parses, including partial parses when only fragment evidence is available.
  4. Chooses the best full parse using factored multiplicative scoring: three structural claims (SP grammar, domain architecture, completeness) each produce a [0,1] factor. Contradictory evidence in any factor gates the score down multiplicatively, while missing evidence is a softer penalty.

The strongest evidence types are:

  • SP cleavage grammar: hydrophobic h-region, short c-region, von Heijne -3/-1 compatibility, exclusion of impossible -3/-1 property pairs, and mild +1 mature-sequence penalties.
  • Domain-fold grammar: canonical G-domain versus C-like disulfide topology, including the IMGT-style Cys11-Cys74 G-domain signature and the Cys23/Trp41/Cys104 C-like grammar.
  • Class-specific groove boundaries:
    • class I α1/α2 junction motifs
    • class I α2 -> α3 boundary motifs
    • class II α1 -> α2 and β1 -> β2 boundary motifs
  • Soft priors on groove/support-domain lengths and TM support downstream.

The parser handles:

  • full-length proteins with or without signal peptides
  • SP-stripped deposits (mature_start = 0)
  • common fragments:
    • class I exon 2 only -> alpha1_only
    • class I exon 3 only -> alpha2_only
    • class II exon 2-like fragments -> fragment_fallback
  • low-evidence salvage:
    • class I from α3 C-like support only -> inferred_from_alpha3
    • class II beta from β1 groove pair only -> beta1_only_fallback
  • true groove absence / insufficient structural evidence -> missing_groove

Groove Status Values

These are the important parser-facing statuses:

Status Meaning
ok Full decomposition from the main structural grammar
alpha1_only Class I fragment consistent with α1 / exon 2 only
alpha2_only Class I fragment consistent with α2 / exon 3 only
fragment_fallback Short fragment retained as the observable groove half
inferred_from_alpha3 Class I salvage parse using a downstream α3 C-like anchor
beta1_only_fallback Class II beta salvage parse using only the β1 groove pair
missing_groove No recoverable groove architecture from the available evidence
non_classical Non-classical class-I lineage flagged post-parse
short Groove half too short to look functionally peptide-binding

Pipeline-only statuses can still appear in CSV outputs:

Status Meaning
not_applicable Row intentionally excluded from groove functionality, mainly B2M in build outputs

Literature Basis

The parser is built around conserved sequence grammar from the MHC literature:

  • MHC domain organization is more conserved than short local motifs across vertebrates: Primordial Linkage of β2-Microglobulin to the MHC
  • IMGT domain numbering and the G-domain versus C-like disulfide grammar: PMC3913909
  • Classical class-I domain layout and landmarks: PMC2434379
  • Salmonid class-II alpha/beta cysteine topology and lineage-specific extra cysteines: PMC2386828
  • Teleost class-II evolutionary divergence while retaining the same modular architecture: PMC4219347

Signal-peptide logic follows the standard SPase grammar:

Key columns

Column Description
two_field_allele Allele name at two-field resolution
gene MHC gene (e.g., A, DRB1, BF, UA)
mhc_class I or II
chain alpha, beta, or B2M
species Latin binomial from source
species_category One of 9 categories above
source imgt, ipd_mhc, or uniprot
source_id Database accession for provenance
groove_status See table above
is_functional True if groove parsed and not null/pseudogene

Dependencies

  • Python 3.10+
  • mhcgnomes >= 3.18.0 — allele name parsing with species-directed disambiguation

No alignment tools, BLAST, or structure databases are required.

License

Apache 2.0

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mhcseqs-2.3.1.tar.gz (844.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

mhcseqs-2.3.1-py3-none-any.whl (833.0 kB view details)

Uploaded Python 3

File details

Details for the file mhcseqs-2.3.1.tar.gz.

File metadata

  • Download URL: mhcseqs-2.3.1.tar.gz
  • Upload date:
  • Size: 844.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for mhcseqs-2.3.1.tar.gz
Algorithm Hash digest
SHA256 45f391bc04201b109031392549ce5c754ec6cc03f705d3e1bbc599afcf3b4efe
MD5 be958ea64a0b761d8f3b1d15744f0aa0
BLAKE2b-256 6c48fe9ea7e060ec602775e0504c2428aa71cf1b9bd27cfaa9ad6bdcc9a5d490

See more details on using hashes here.

File details

Details for the file mhcseqs-2.3.1-py3-none-any.whl.

File metadata

  • Download URL: mhcseqs-2.3.1-py3-none-any.whl
  • Upload date:
  • Size: 833.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for mhcseqs-2.3.1-py3-none-any.whl
Algorithm Hash digest
SHA256 bc6a534d3fc075bfda9c7f2700902e409a09b2154edc9b1bca642a99a15cc23b
MD5 5c3de8967e5e9ffbe78f941b9f5ac906
BLAKE2b-256 3e1e90f981313189a4776ec41f394e17958f5e2b9b8fda7b0e70e4c785ea0630

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page