MHC sequence curation and binding groove extraction

mhcseqs

Self-contained pipeline for downloading, curating, and extracting binding grooves from MHC (Major Histocompatibility Complex) protein sequences.

Install

pip install mhcseqs

For development:

git clone https://github.com/openvax/mhcseqs.git
cd mhcseqs
./develop.sh          # uv pip install -e ".[dev]"
./test.sh             # pytest
./lint.sh             # ruff

Quick start

CLI

# Build all output CSVs (writes to ~/.cache/mhcseqs/)
mhcseqs build

# Build to a specific directory instead
mhcseqs build --output-dir output/

# Look up a specific allele
mhcseqs lookup "HLA-A*02:01"

# Check version
mhcseqs --version

Python API

import mhcseqs

# Build the database (downloads to ~/.cache/mhcseqs/, only needed once)
paths = mhcseqs.build()  # BuildPaths dataclass

# Look up any allele → AlleleRecord with everything
r = mhcseqs.lookup("HLA-A*02:01")
r.sequence          # full protein (with signal peptide)
r.mature_sequence   # signal peptide removed (computed property)
r.mature_start      # signal peptide length (24 for HLA-A*02:01)
r.groove1           # α1 domain
r.groove2           # α2 domain
r.ig_domain         # α3 Ig-fold
r.tail              # TM + cytoplasmic
r.species_category  # "human"

# Apply mutations (IEDB-style, e.g. "K66A")
m = mhcseqs.lookup("HLA-A*02:01", mutations=["K66A", "D77S"])
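To make the mutation notation concrete, here is a minimal sketch of how an IEDB-style string such as "K66A" (wild-type residue, 1-based position, replacement residue) could be applied to a plain sequence. The helper `apply_mutations` is hypothetical and is not part of mhcseqs. Which numbering mhcseqs uses internally (full or mature sequence) is not stated here, so the sketch simply numbers from the first character of whatever string it is given.

```python
import re

# Hypothetical helper, NOT the mhcseqs implementation.
# Pattern: wild-type residue, 1-based position, replacement residue.
_MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def apply_mutations(sequence: str, mutations: list[str]) -> str:
    """Return a copy of `sequence` with each point mutation applied."""
    residues = list(sequence)
    for mutation in mutations:
        match = _MUTATION_RE.match(mutation)
        if match is None:
            raise ValueError(f"unrecognized mutation: {mutation!r}")
        wild_type, position, replacement = match.groups()
        index = int(position) - 1  # 1-based position -> 0-based index
        if not 0 <= index < len(residues):
            raise ValueError(f"position out of range: {mutation!r}")
        if residues[index] != wild_type:
            # Guard against applying a mutation to the wrong numbering scheme.
            raise ValueError(
                f"{mutation!r} expects {wild_type} but found {residues[index]}"
            )
        residues[index] = replacement
    return "".join(residues)


# Toy sequence, not a real MHC protein.
print(apply_mutations("GSHSMRYF", ["H3A", "F8Y"]))  # GSASMRYY
```

Checking the wild-type residue before substituting is a cheap way to catch off-by-signal-peptide errors when positions come from an external source.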

Load as a DataFrame

import mhcseqs

# As a DataFrame (full sequence + groove decomposition + metadata)
df = mhcseqs.load_sequences_dataframe()

# Or as a list of dicts (no pandas dependency)
rows = mhcseqs.load_sequences_dict()

Data directory

By default, mhcseqs build downloads FASTA files and writes output CSVs to ~/.cache/mhcseqs/ (override with $MHCSEQS_DATA or --output-dir).

~/.cache/mhcseqs/
├── fasta/                     # Downloaded FASTA source files
│   ├── hla_prot.fasta         # IMGT/HLA (human)
│   └── ipd_mhc_prot.fasta    # IPD-MHC (non-human)
├── mhc-seqs-raw.csv           # Every protein entry from all sources
├── mhc-full-seqs.csv          # One representative per two-field allele (with grooves)
├── mhc-merge-report.txt       # Deduplication decisions
└── mhc-validation-report.txt  # Sanity checks

Pre-built CSVs are also attached to each GitHub release.

Output files

File Description
mhc-seqs-raw.csv Every protein entry from all sources
mhc-full-seqs.csv One representative per two-field allele: full sequence, groove decomposition, and metadata
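As an illustration of what "one representative per two-field allele" means, the sketch below truncates allele names to two fields and groups the raw names under them. It uses naive string handling and hypothetical helper names; the package itself parses names with mhcgnomes, and the rule it uses to pick the representative within each group is not shown here.

```python
from collections import defaultdict


def two_field_name(allele: str) -> str:
    """Truncate e.g. 'HLA-A*02:01:01:01' to 'HLA-A*02:01' (naive sketch)."""
    gene, sep, fields = allele.partition("*")
    if not sep:
        return allele  # no '*' separator: leave the name untouched
    return gene + "*" + ":".join(fields.split(":")[:2])


def group_by_two_field(alleles: list[str]) -> dict[str, list[str]]:
    """Map each two-field name to the raw allele names that collapse onto it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for allele in alleles:
        groups[two_field_name(allele)].append(allele)
    return dict(groups)


raw = ["HLA-A*02:01:01:01", "HLA-A*02:01:02", "HLA-A*02:02:01"]
for name, members in group_by_two_field(raw).items():
    print(name, members)
```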

Current data summary

All sources (IMGT/HLA, IPD-MHC, UniProt curated references, and 15,860 diverse MHC sequences from UniProt) are merged into a single dataset:

Category Class I Class II Total
human 17,462 7,878 25,340
nhp 4,639 2,486 7,125
murine 59 29 88
ungulate 638 1,128 1,766
carnivore 166 318 484
other_mammal 469 386 855
bird 5,961 3,351 9,312
fish 1,292 3,569 4,861
other_vertebrate 470 666 1,136
total 31,156 19,811 50,967

The merged dataset covers 466+ species prefixes. The groove parse success rate on IMGT/IPD-MHC entries is 99.6%.

Structural decomposition

Each protein chain is decomposed into four contiguous regions:

Column     Class I alpha        Class II alpha       Class II beta
groove1    α1 domain (~90 aa)   α1 domain (~83 aa)   -
groove2    α2 domain (~93 aa)   -                    β1 domain (~93 aa)
ig_domain  α3 Ig-fold (~95 aa)  α2 Ig-fold (~95 aa)  β2 Ig-fold (~95 aa)
tail       TM + cytoplasmic     TM + cytoplasmic     TM + cytoplasmic

For a class I chain: mature_protein = groove1 + groove2 + ig_domain + tail
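The class I identity above, combined with the fact that mature_start is the signal peptide length, gives a simple consistency check. The sketch below uses a hypothetical `ToyRecord` stand-in with the same field names as the record shown in the Python API section; the sequence is a made-up toy, not a real MHC protein.

```python
from dataclasses import dataclass


@dataclass
class ToyRecord:
    # Hypothetical stand-in that mirrors the record fields named in this README.
    sequence: str
    mature_start: int
    groove1: str
    groove2: str
    ig_domain: str
    tail: str

    @property
    def mature_sequence(self) -> str:
        # Signal peptide removed: drop the first `mature_start` residues.
        return self.sequence[self.mature_start:]


def is_consistent(record) -> bool:
    """Check mature_protein == groove1 + groove2 + ig_domain + tail."""
    parts = record.groove1 + record.groove2 + record.ig_domain + record.tail
    return record.mature_sequence == parts


toy = ToyRecord(
    sequence="MAVLLGSHSMRYCQTLW",  # toy: 5-residue "signal peptide" + 12 residues
    mature_start=5,
    groove1="GSHS",
    groove2="MRY",
    ig_domain="CQT",
    tail="LW",
)
print(is_consistent(toy))  # True
```

The same function should accept a real record returned by a lookup, provided it exposes these attributes as described above.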

Key columns

Both CSVs share: gene, mhc_class, chain, species, species_category, species_prefix, source, source_id.

source is one of: imgt, ipd_mhc, uniprot_curated, uniprot_reference, uniprot_diverse.

source_id is the database accession for provenance tracking (e.g., HLA00001 for IMGT, NHP00001 for IPD-MHC, P01901 for UniProt).

species_category is one of: human, nhp, murine, ungulate, carnivore, cetacean, other_mammal, bird, fish, other_vertebrate.
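The shared columns make simple summaries possible with the standard library alone. The sketch below counts rows per (species_category, mhc_class) pair from CSV text. The sample rows are made up for illustration, and the mhc_class and chain values shown ("I", "II", "alpha", "beta") are assumptions about how the CSVs encode those columns.

```python
import csv
import io
from collections import Counter

# Toy rows using the shared column names; values are illustrative only.
SAMPLE = """\
gene,mhc_class,chain,species,species_category,species_prefix,source,source_id
A,I,alpha,Homo sapiens,human,HLA,imgt,HLA00001
DRB1,II,beta,Homo sapiens,human,HLA,imgt,TOY00002
K,I,alpha,Mus musculus,murine,H2,uniprot_curated,P01901
A1,I,alpha,Macaca mulatta,nhp,Mamu,ipd_mhc,NHP00001
"""


def count_by_category(csv_text: str) -> Counter:
    """Count rows per (species_category, mhc_class) pair."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter((row["species_category"], row["mhc_class"]) for row in reader)


for (category, mhc_class), n in sorted(count_by_category(SAMPLE).items()):
    print(category, mhc_class, n)
```

To run this against a real build, read the text of mhc-full-seqs.csv from the data directory and pass it to `count_by_category`.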

Data sources

Source    source value       Species        Data
IMGT/HLA  imgt               Human          Downloaded at build time
IPD-MHC   ipd_mhc            Non-human      Downloaded at build time
UniProt   uniprot_reference  Multi-species  B2M references (shipped in package)
UniProt   uniprot_curated    Mouse          30 H-2 alleles (shipped in package)
UniProt   uniprot_diverse    614 species    16,208 diverse MHC sequences (shipped in package)

IMGT/HLA and IPD-MHC FASTA files are downloaded on first build and cached. The UniProt curated CSVs (b2m_sequences.csv, mouse_h2_sequences.csv, diverse_mhc_sequences.csv) ship inside the mhcseqs package — no download needed.

To refresh the diverse MHC dataset from UniProt:

python scripts/fetch_diverse_mhc.py    # Download raw data → data/diverse_mhc_raw.csv
python scripts/curate_diverse_mhc.py   # Curate → mhcseqs/diverse_mhc_sequences.csv

Species prefixes

Species Latin name MHC prefix
human Homo sapiens HLA
macaque Macaca mulatta Mamu
chimpanzee Pan troglodytes Patr
gorilla Gorilla gorilla Gogo
mouse Mus musculus H2
rat Rattus norvegicus RT1
cattle Bos taurus BoLA
pig Sus scrofa SLA
horse Equus caballus ELA
sheep Ovis aries OLA
dog Canis lupus familiaris DLA
cat Felis catus FLA
chicken Gallus gallus Gaga
salmon Salmo salar Sasa
zebrafish Danio rerio Dare

Groove extraction algorithm

The groove parser is alignment-free — it uses conserved Cys-Cys disulfide pairs in Ig-fold domains as structural landmarks to slice domain boundaries without multiple sequence alignment.

See the groove.py module docstring for detailed algorithm documentation with ASCII structural diagrams.
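To give a feel for the landmark idea, the sketch below scans a sequence for pairs of cysteines whose spacing falls inside a window, as candidate intradomain disulfide partners. This is not the algorithm in groove.py. The function name and the 50 to 70 residue window are illustrative assumptions, and a real parser would need further rules to choose among candidates and to derive domain boundaries from them.

```python
def find_cys_pairs(
    sequence: str,
    min_gap: int = 50,  # assumed lower bound on Cys-Cys spacing (illustrative)
    max_gap: int = 70,  # assumed upper bound on Cys-Cys spacing (illustrative)
) -> list[tuple[int, int]]:
    """Return 0-based index pairs of cysteines separated by min_gap..max_gap residues."""
    cys_positions = [i for i, aa in enumerate(sequence) if aa == "C"]
    pairs = []
    for n, first in enumerate(cys_positions):
        for second in cys_positions[n + 1:]:
            gap = second - first
            if gap > max_gap:
                break  # positions are sorted, so later ones are even farther away
            if gap >= min_gap:
                pairs.append((first, second))
    return pairs


# Toy sequence: cysteines at indices 10 and 70, 60 residues apart.
toy = "A" * 10 + "C" + "A" * 59 + "C" + "A" * 10
print(find_cys_pairs(toy))  # [(10, 70)]
```

Because the scan needs only the sequence itself, it shows why a landmark-based approach can work without a multiple sequence alignment.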

Dependencies

  • Python 3.10+
  • mhcgnomes >= 3.1.0 — allele name parsing

No alignment tools, BLAST, or structure databases are required.

Repository structure

mhcseqs/
├── mhcseqs/
│   ├── __init__.py        # Public API
│   ├── __main__.py        # CLI entry point
│   ├── version.py         # Package version
│   ├── download.py        # FASTA source downloading
│   ├── species.py         # Species taxonomy (29-class → 10-class)
│   ├── alleles.py         # Allele name parsing (mhcgnomes wrapper)
│   ├── groove.py          # Binding groove extraction + mutation support
│   ├── imgt.py            # IMGT G-DOMAIN position numbering
│   ├── pipeline.py        # Two-step build pipeline
│   ├── validate.py        # Post-build validation
│   ├── b2m_sequences.csv                # Reference B2M sequences (UniProt)
│   ├── mouse_h2_sequences.csv          # Mouse H-2 sequences (UniProt)
│   └── diverse_mhc_sequences.csv      # 16k diverse MHC sequences (UniProt)
├── tests/                 # pytest test suite
├── scripts/
│   ├── fetch_diverse_mhc.py      # Download diverse MHC from UniProt
│   ├── curate_diverse_mhc.py     # Curate into shipped CSV
│   └── validate_signal_peptides.py
├── data/                          # Intermediate data (not shipped)
├── build.py               # Convenience shim
├── pyproject.toml         # Package metadata
├── develop.sh             # Install in dev mode
├── lint.sh                # Run ruff
├── test.sh                # Run pytest
└── deploy.sh              # Build + publish to PyPI

License

Apache 2.0
