MHC sequence curation and binding groove extraction

These details have not been verified by PyPI

Project description

mhcseqs

Self-contained pipeline for downloading, curating, and extracting binding grooves from MHC (Major Histocompatibility Complex) protein sequences.

Install

pip install mhcseqs

For development:

git clone https://github.com/pirl-unc/mhcseqs.git
cd mhcseqs
./develop.sh          # uv pip install -e ".[dev]"
./test.sh             # pytest
./lint.sh             # ruff

Quick start

CLI

# Build all output CSVs (writes to ~/.cache/mhcseqs/)
mhcseqs build

# Build to a specific directory instead
mhcseqs build --output-dir output/

# Look up a specific allele
mhcseqs lookup "HLA-A*02:01"

# Check version
mhcseqs --version

Python API

import mhcseqs

# Build the database (downloads to ~/.cache/mhcseqs/, only needed once)
paths = mhcseqs.build()  # BuildPaths dataclass

# Look up any allele → AlleleRecord with everything
r = mhcseqs.lookup("HLA-A*02:01")
r.sequence          # full protein (with signal peptide)
r.mature_sequence   # signal peptide removed (computed property)
r.mature_start      # signal peptide length (24 for HLA-A*02:01)
r.groove1           # α1 domain
r.groove2           # α2 domain
r.ig_domain         # α3 Ig-fold
r.tail              # TM + cytoplasmic
r.domains           # typed domain spans
r.domain_architecture
r.domain_spans
r.species_category  # "human"

# Apply mutations (IEDB-style, e.g. "K66A")
m = mhcseqs.lookup("HLA-A*02:01", mutations=["K66A", "D77S"])
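
The relationship between `sequence`, `mature_start`, and an IEDB-style substitution can be sketched in plain Python. This is a minimal illustration, not the package's implementation: the helper names `mature_sequence` and `apply_mutation` are hypothetical, the toy sequence is invented, and the sketch assumes mutation positions are 1-based and counted on the mature protein.

```python
import re

# Hypothetical helpers for illustration; not part of the mhcseqs API.
_MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def mature_sequence(sequence: str, mature_start: int) -> str:
    """Drop the signal peptide by keeping residues from index mature_start on."""
    return sequence[mature_start:]


def apply_mutation(mature: str, mutation: str) -> str:
    """Apply one substitution such as "K66A": wild type, 1-based position, new residue."""
    match = _MUTATION.match(mutation)
    if match is None:
        raise ValueError(f"unrecognized mutation: {mutation!r}")
    wild_type, position, new = match.group(1), int(match.group(2)), match.group(3)
    index = position - 1  # assumption: positions are 1-based on the mature protein
    if not 0 <= index < len(mature):
        raise ValueError(f"position {position} is outside the sequence")
    if mature[index] != wild_type:
        raise ValueError(f"expected {wild_type} at {position}, found {mature[index]}")
    return mature[:index] + new + mature[index + 1:]


# Toy input: a 5-residue stand-in "signal peptide" followed by 5 mature residues.
full = "MKLLA" + "GSHSM"
mature = mature_sequence(full, 5)
mutant = apply_mutation(mature, "H3A")
```

Checking the wild-type residue before substituting catches numbering mismatches early, for example when a position was counted on the full precursor rather than the mature protein.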

Load as a DataFrame

import mhcseqs

# As a DataFrame (full sequence + groove decomposition + metadata)
df = mhcseqs.load_sequences_dataframe()

# Or as a list of dicts (no pandas dependency)
rows = mhcseqs.load_sequences_dict()

Current data summary

Three sources are merged into a single dataset:

Source	Entries	Species	Notes
IMGT/HLA	44,630	Human	Downloaded at build time
IPD-MHC	12,380	Non-human mammals, birds, fish	Downloaded at build time
UniProt	20,566	500+ species	Curated diverse MHC, B2M, H-2 references (shipped in package)
Total raw	77,576
After merge/dedup	55,658		One representative per two-field allele
Groove OK	54,098		97.2% of representatives
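
To make "one representative per two-field allele" concrete, here is a minimal sketch of truncating allele names to two fields and grouping entries. The functions `two_field` and `pick_representatives` are hypothetical, the string splitting only covers names of the form `GENE*f1:f2:...` (the package itself relies on mhcgnomes for name parsing), and choosing the longest sequence as the representative is an assumption made for illustration. The sequences are placeholders.

```python
from collections import defaultdict


def two_field(allele: str) -> str:
    """Truncate e.g. 'HLA-A*02:01:01:01' to 'HLA-A*02:01' (simple string split)."""
    gene, sep, fields = allele.partition("*")
    if not sep:
        return allele  # no '*' separator: leave the name unchanged
    return gene + "*" + ":".join(fields.split(":")[:2])


def pick_representatives(entries):
    """Group (name, sequence) pairs by two-field name and keep one per group.

    Assumption: the longest sequence wins; ties keep the first entry seen.
    """
    groups = defaultdict(list)
    for name, sequence in entries:
        groups[two_field(name)].append((name, sequence))
    return {key: max(members, key=lambda m: len(m[1])) for key, members in groups.items()}


# Placeholder sequences, not real protein data.
entries = [
    ("HLA-A*02:01:01:01", "AAAAAAAA"),
    ("HLA-A*02:01:01:02", "AAAA"),
    ("HLA-A*02:02:01", "AAAAAA"),
]
representatives = pick_representatives(entries)
```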

By species category

Category	Count	Groove OK	Class I full	Class II full
human	25,364	99.8%	17,426	5,347
nhp	7,125	98.0%	4,583	1,426
bird	9,312	98.9%	546	562
fish	4,859	97.4%	776	1,259
ungulate	1,768	94.7%	577	303
murine	1,587	89.7%	490	118
carnivore	484	99.6%	164	0
other_mammal	943	95.1%	280	146
other_vertebrate	1,137	96.0%	324	129

"Groove OK" counts both full-length and fragment parses. "Full" means the complete groove-bearing region of the chain is present: for class I, both groove halves (α1 + α2, ~183 aa); for class II, the chain's single groove half plus its Ig support domain (~88 aa groove + ~90 aa Ig). The remaining entries are single-exon fragments that carry only one groove half. These are common in bird and fish submissions to IPD-MHC, where often only the polymorphic exon is sequenced. The parser labels them alpha1_only, alpha2_only, beta1_only_fallback, or fragment_fallback.

Species categories: human, nhp (non-human primates), murine (mice, rats, other rodents), ungulate (cattle, pig, horse, sheep, goat), carnivore (dog, cat), other_mammal (marsupials, monotremes, bats, cetaceans, rabbit), bird, fish, other_vertebrate (reptiles, amphibians).

Signal peptide detection accuracy

Validated against 2,403 UniProt ground-truth SP annotations:

Species	Class I exact	Class II exact
Human	99.2%	88.4%
NHP	100.0%	91.2%
Bird	94.4%	89.3%
Fish	86.9%	81.9%
Other vertebrate	68.2%	86.1%

Overall: 82.0% exact, 89.9% within ±2 aa.

False positive rate on 2,155 mature-only controls: 3.8%.
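
The three figures reported here (exact match, within ±2 aa, false positive rate) can be computed as in the sketch below. The function names and the toy numbers are hypothetical, and the definitions reflect one reading of the text: a prediction is the called signal peptide length, and a false positive is any nonzero call on a sequence that has no signal peptide.

```python
def sp_accuracy(predicted, truth, tolerance=2):
    """Return (exact rate, within-tolerance rate) for predicted SP lengths."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("need two non-empty lists of equal length")
    n = len(truth)
    exact = sum(p == t for p, t in zip(predicted, truth))
    near = sum(abs(p - t) <= tolerance for p, t in zip(predicted, truth))
    return exact / n, near / n


def false_positive_rate(predicted_on_controls):
    """Fraction of mature-only controls for which a signal peptide was called."""
    if not predicted_on_controls:
        raise ValueError("need at least one control")
    return sum(p > 0 for p in predicted_on_controls) / len(predicted_on_controls)


# Toy numbers, not the validation data described above.
exact_rate, near_rate = sp_accuracy([24, 24, 21, 30], [24, 25, 21, 24])
fpr = false_positive_rate([0, 0, 0, 18])
```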

Data directory

By default, mhcseqs build downloads FASTA files and writes output CSVs to ~/.cache/mhcseqs/ (override with $MHCSEQS_DATA or --output-dir).

~/.cache/mhcseqs/
├── fasta/                     # Downloaded FASTA source files
│   ├── hla_prot.fasta         # IMGT/HLA (human)
│   └── ipd_mhc_prot.fasta    # IPD-MHC (non-human)
├── mhc-seqs-raw.csv           # Every protein entry from all sources
├── mhc-full-seqs.csv          # One representative per two-field allele (with grooves)
├── mhc-merge-report.txt       # Deduplication decisions
└── mhc-validation-report.txt  # Sanity checks
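
A script that needs to locate the output files could resolve the directory along these lines. This is a sketch under stated assumptions: `resolve_data_dir` is a hypothetical helper, and the precedence shown (explicit directory first, then `$MHCSEQS_DATA`, then the default cache path) is assumed rather than taken from the package.

```python
import os
from pathlib import Path


def resolve_data_dir(output_dir=None, environ=None):
    """Pick the data directory: explicit argument, then $MHCSEQS_DATA, then default.

    The precedence order is an assumption made for this sketch.
    """
    env = os.environ if environ is None else environ
    if output_dir:
        return Path(output_dir).expanduser()
    if env.get("MHCSEQS_DATA"):
        return Path(env["MHCSEQS_DATA"]).expanduser()
    return Path.home() / ".cache" / "mhcseqs"


# File name taken from the directory listing above.
full_csv = resolve_data_dir(environ={}) / "mhc-full-seqs.csv"
```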

Structural decomposition

The parser materializes an explicit domain grammar:

Chain	Grammar
Class I alpha	`signal_peptide? -> g_alpha1 -> g_alpha2 -> c1_alpha3 -> transmembrane? -> cytoplasmic_tail?`
Class II alpha	`signal_peptide? -> g_alpha1 -> c1_alpha2 -> transmembrane? -> cytoplasmic_tail?`
Class II beta	`signal_peptide? -> g_beta1 -> c1_beta2 -> transmembrane? -> cytoplasmic_tail?`

The exported contiguous sequence fields are:

Column	Class I alpha	Class II alpha	Class II beta
`groove1`	α1 domain (~80-95 aa typical)	α1 domain (~75-95 aa typical)	—
`groove2`	α2 domain (~80-100 aa typical)	—	β1 domain (~70-100 aa typical)
`ig_domain`	α3 C-like support domain	α2 C-like support domain	β2 C-like support domain
`tail`	linker + TM + cytoplasmic tail	linker + TM + cytoplasmic tail	linker + TM + cytoplasmic tail

domain_architecture and domain_spans expose the typed domain grammar directly, for example:

class I: signal_peptide>g_alpha1>g_alpha2>c1_alpha3>tail_linker>transmembrane>cytoplasmic_tail
class II beta: signal_peptide>g_beta1>c1_beta2>tail_linker>transmembrane>cytoplasmic_tail
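
Because the architecture is a `>`-delimited string, it can be split into domain names with ordinary string operations. The helpers below are hypothetical and assume only the string format shown in the two examples above; the layout of `domain_spans` is not covered here.

```python
def parse_architecture(architecture: str) -> list[str]:
    """Split a '>'-delimited architecture string into ordered domain names."""
    return [name for name in architecture.split(">") if name]


def has_signal_peptide(architecture: str) -> bool:
    """True if the first domain in the architecture is the signal peptide."""
    return parse_architecture(architecture)[:1] == ["signal_peptide"]


def groove_domains(architecture: str) -> list[str]:
    """Keep only the groove (G-domain) entries, which are prefixed 'g_'."""
    return [name for name in parse_architecture(architecture) if name.startswith("g_")]


class_i = "signal_peptide>g_alpha1>g_alpha2>c1_alpha3>tail_linker>transmembrane>cytoplasmic_tail"
class_ii_beta = "signal_peptide>g_beta1>c1_beta2>tail_linker>transmembrane>cytoplasmic_tail"
class_i_grooves = groove_domains(class_i)
class_ii_beta_grooves = groove_domains(class_ii_beta)
```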

How Parsing Works

The parser is alignment-free and scores whole-sequence parses rather than isolated features. It does not rely on one absolute Cys position to define the mature start.

For each sequence it:

1. Enumerates all plausible Cys-Cys pairs in the Ig/C-like separation range.
2. Scores each pair as a candidate G-domain or C-like anchor using fold-topology evidence, especially the Trp41-like signal around c1+14.
3. Enumerates candidate SP boundaries and whole domain parses, including partial parses when only fragment evidence is available.
4. Chooses the best full parse using factored multiplicative scoring: three structural claims (SP grammar, domain architecture, completeness) each produce a [0,1] factor. Contradictory evidence in any factor gates the score down multiplicatively, while missing evidence is a softer penalty.
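
The first and last steps can be sketched as follows. This is an illustration, not the parser's code: the function names are hypothetical, the 50 to 75 residue separation window is an assumed range, and the factor values are invented to show how a contradicted factor drags a parse down much harder than a factor with missing evidence.

```python
from math import prod


def cys_pairs(sequence, min_sep=50, max_sep=75):
    """List (i, j) index pairs of cysteines whose spacing falls in the window.

    The default window is an assumption for this sketch.
    """
    positions = [i for i, aa in enumerate(sequence) if aa == "C"]
    return [
        (a, b)
        for idx, a in enumerate(positions)
        for b in positions[idx + 1:]
        if min_sep <= b - a <= max_sep
    ]


def parse_score(factors):
    """Multiply per-claim factors, each of which must lie in [0, 1]."""
    for name, value in factors.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"factor {name!r} out of range: {value}")
    return prod(factors.values())


# Invented factor values for three competing parses of the same sequence.
candidates = {
    "well_supported": {"sp_grammar": 0.9, "domain_architecture": 0.9, "completeness": 1.0},
    "missing_evidence": {"sp_grammar": 0.9, "domain_architecture": 0.9, "completeness": 0.7},
    "contradicted_sp": {"sp_grammar": 0.05, "domain_architecture": 0.95, "completeness": 1.0},
}
best = max(candidates, key=lambda name: parse_score(candidates[name]))

# Toy sequence with cysteines at indices 10 and 70.
toy = "A" * 10 + "C" + "A" * 59 + "C" + "A" * 10
pairs = cys_pairs(toy)
```

With a product, a single near-zero factor cannot be compensated by strong evidence elsewhere, which is the gating behavior described in step 4. An additive score would let a contradicted parse recover.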

The strongest evidence types are:

- SP cleavage grammar: hydrophobic h-region, short c-region, von Heijne -3/-1 compatibility, exclusion of impossible -3/-1 property pairs, and mild +1 mature-sequence penalties.
- Domain-fold grammar: canonical G-domain versus C-like disulfide topology, including the IMGT-style Cys11-Cys74 G-domain signature and the Cys23/Trp41/Cys104 C-like grammar.
- Class-specific groove boundaries:
  - class I α1/α2 junction motifs
  - class I α2 -> α3 boundary motifs
  - class II α1 -> α2 and β1 -> β2 boundary motifs
- Soft priors on groove/support-domain lengths and TM support downstream.
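
As an illustration of the -3/-1 part of the SP cleavage grammar, the sketch below tests whether a candidate cleavage site has small, uncharged residues at positions -3 and -1. The residue sets are a common textbook approximation of the von Heijne rule, not the parser's actual tables, and the function names are hypothetical. The parser treats +1 effects as penalties; this sketch only reports a +1 proline as a flag.

```python
# Approximate residue sets for the -3/-1 rule (assumed for this sketch).
MINUS1_ALLOWED = frozenset("AGSCTQ")
MINUS3_ALLOWED = frozenset("AGSCTVIL")


def cleavage_compatible(sequence, cleavage_index):
    """Check -3/-1 compatibility for a site where the mature protein starts
    at sequence[cleavage_index] (so cleavage_index equals the SP length)."""
    if cleavage_index < 3 or cleavage_index >= len(sequence):
        return False
    minus1 = sequence[cleavage_index - 1]
    minus3 = sequence[cleavage_index - 3]
    return minus1 in MINUS1_ALLOWED and minus3 in MINUS3_ALLOWED


def plus1_is_proline(sequence, cleavage_index):
    """Flag a proline at +1, which the parser is described as penalizing."""
    return 0 <= cleavage_index < len(sequence) and sequence[cleavage_index] == "P"


# Toy precursor: start, hydrophobic stretch, short c-region, then mature residues.
toy = "MK" + "L" * 10 + "ATWA" + "GSHS"
compatible_sites = [i for i in range(len(toy)) if cleavage_compatible(toy, i)]
```

A -3/-1 check alone accepts many sites, which is why the text combines it with h-region and c-region context and downstream domain evidence.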

The parser handles:

- full-length proteins with or without signal peptides
- SP-stripped deposits (mature_start = 0)
- common fragments:
  - class I exon 2 only -> alpha1_only
  - class I exon 3 only -> alpha2_only
  - class II exon 2-like fragments -> fragment_fallback
- low-evidence salvage:
  - class I from α3 C-like support only -> inferred_from_alpha3
  - class II beta from β1 groove pair only -> beta1_only_fallback
- true groove absence / insufficient structural evidence -> missing_groove

Groove Status Values

These are the important parser-facing statuses:

Status	Meaning
`ok`	Full decomposition from the main structural grammar
`alpha1_only`	Class I fragment consistent with α1 / exon 2 only
`alpha2_only`	Class I fragment consistent with α2 / exon 3 only
`fragment_fallback`	Short fragment retained as the observable groove half
`inferred_from_alpha3`	Class I salvage parse using a downstream α3 C-like anchor
`beta1_only_fallback`	Class II beta salvage parse using only the β1 groove pair
`missing_groove`	No recoverable groove architecture from the available evidence
`non_classical`	Non-classical class-I lineage flagged post-parse
`short`	Groove half too short to look functionally peptide-binding

Pipeline-only statuses can still appear in CSV outputs:

Status	Meaning
`not_applicable`	Row intentionally excluded from groove functionality, mainly B2M in build outputs

Literature Basis

The parser is built around conserved sequence grammar from the MHC literature:

MHC domain organization is more conserved than short local motifs across vertebrates: Primordial Linkage of β2-Microglobulin to the MHC
IMGT domain numbering and the G-domain versus C-like disulfide grammar: PMC3913909
Classical class-I domain layout and landmarks: PMC2434379
Salmonid class-II alpha/beta cysteine topology and lineage-specific extra cysteines: PMC2386828
Teleost class-II evolutionary divergence while retaining the same modular architecture: PMC4219347

Signal-peptide logic follows the standard SPase grammar:

von Heijne -3/-1 cleavage rule: How signal sequences maintain cleavage specificity
flanking -2/+1 effects: Flanking signal and mature peptide residues influence signal peptide cleavage
strong penalty for +1 Pro: PubMed 1544500
h-region / c-region structural context: Structure of the human signal peptidase complex reveals the determinants for signal peptide cleavage

Key columns

Column	Description
`two_field_allele`	Allele name at two-field resolution
`gene`	MHC gene (e.g., A, DRB1, BF, UA)
`mhc_class`	I or II
`chain`	alpha, beta, or B2M
`species`	Latin binomial from source
`species_category`	One of the 9 species categories listed above
`source`	`imgt`, `ipd_mhc`, or `uniprot`
`source_id`	Database accession for provenance
`groove_status`	See the Groove Status Values tables above
`is_functional`	True if groove parsed and not null/pseudogene
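
These columns lend themselves to simple filtering without pandas. The sketch below works on a small hand-written list of dicts that mimics the documented column names; in practice the rows would come from `mhcseqs.load_sequences_dict()`. The `select` helper, the third allele name, and the Python value types (for example `is_functional` as a bool) are assumptions.

```python
from collections import Counter

# Toy rows using the documented column names; values are illustrative only.
rows = [
    {"two_field_allele": "HLA-A*02:01", "mhc_class": "I", "chain": "alpha",
     "species_category": "human", "groove_status": "ok", "is_functional": True},
    {"two_field_allele": "HLA-DRB1*01:01", "mhc_class": "II", "chain": "beta",
     "species_category": "human", "groove_status": "ok", "is_functional": True},
    {"two_field_allele": "Example-UA*01:01", "mhc_class": "I", "chain": "alpha",
     "species_category": "fish", "groove_status": "alpha1_only", "is_functional": True},
]


def select(rows, **criteria):
    """Keep rows whose columns equal every given key=value criterion."""
    return [row for row in rows if all(row.get(k) == v for k, v in criteria.items())]


human_class_i = select(rows, species_category="human", mhc_class="I")
status_counts = Counter(row["groove_status"] for row in rows)
```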

Dependencies

Python 3.10+
mhcgnomes >= 3.14.0 — allele name parsing with species-directed disambiguation

No alignment tools, BLAST, or structure databases are required.

License

Apache 2.0

Project details

Release history

2.5.9

Apr 18, 2026

2.5.8

Apr 15, 2026

2.5.7

Apr 15, 2026

2.5.6

Apr 15, 2026

2.5.5

Apr 15, 2026

2.5.4

Apr 14, 2026

2.5.3

Apr 14, 2026

2.5.2

Apr 14, 2026

2.5.1

Apr 13, 2026

2.5.0

Apr 12, 2026

2.4.1

Apr 9, 2026

2.4.0

Apr 9, 2026

2.3.6

Apr 8, 2026

2.3.5

Apr 7, 2026

2.3.4

Apr 7, 2026

2.3.3

Apr 7, 2026

2.3.2

Apr 7, 2026

2.3.1

Apr 7, 2026

2.3.0

Apr 5, 2026

2.2.2

Apr 5, 2026

2.2.1

Apr 2, 2026

This version

2.2.0

Mar 30, 2026

1.2.0

Mar 23, 2026

1.1.0

Mar 23, 2026

1.0.0

Mar 23, 2026

0.11.0

Mar 22, 2026

0.10.0

Mar 20, 2026

0.9.0

Mar 17, 2026

0.8.0

Mar 17, 2026

0.6.1

Mar 14, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mhcseqs-2.2.0.tar.gz (843.2 kB)

Uploaded Mar 30, 2026 Source

Built Distribution

mhcseqs-2.2.0-py3-none-any.whl (832.4 kB)

Uploaded Mar 30, 2026 Python 3

File details

Details for the file mhcseqs-2.2.0.tar.gz.

File metadata

Download URL: mhcseqs-2.2.0.tar.gz
Upload date: Mar 30, 2026
Size: 843.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for mhcseqs-2.2.0.tar.gz
Algorithm	Hash digest
SHA256	`68c73c609d5e74c985117d1124af35dfae01d472e1bc39b695ae4eb70f509f2a`
MD5	`b75b999edd0525c1179442eec5fd4d41`
BLAKE2b-256	`2d63588834b40f87fb3d9e3d37bc71d0f17bf78ff467b3d9be95b0641fc8987d`

File details

Details for the file mhcseqs-2.2.0-py3-none-any.whl.

File metadata

Download URL: mhcseqs-2.2.0-py3-none-any.whl
Upload date: Mar 30, 2026
Size: 832.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for mhcseqs-2.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2a3e7c66c1f532f81848c954fbe6aa0f8f262f349567b6e75335b2e47f2a5ea3`
MD5	`153a64b3f7d43edfd7c11647edd2b16e`
BLAKE2b-256	`fbf52ec9ce3ba0a5befd7de2216ad508d388c947388d2bc3adf11b8c33551d5f`
