MHC sequence curation and binding groove extraction

mhcseqs

Self-contained pipeline for downloading and curating MHC (Major Histocompatibility Complex) protein sequences and extracting their binding grooves.

Install

pip install mhcseqs

For development:

git clone https://github.com/pirl-unc/mhcseqs.git
cd mhcseqs
./develop.sh          # uv pip install -e ".[dev]"
./test.sh             # pytest
./lint.sh             # ruff

Quick start

CLI

# Build all output CSVs (writes to ~/.cache/mhcseqs/)
mhcseqs build

# Build to a specific directory instead
mhcseqs build --output-dir output/

# Look up a specific allele
mhcseqs lookup "HLA-A*02:01"

# Check version
mhcseqs --version

Python API

import mhcseqs

# Build the database (downloads to ~/.cache/mhcseqs/, only needed once)
paths = mhcseqs.build()  # BuildPaths dataclass

# Look up any allele → AlleleRecord with everything
r = mhcseqs.lookup("HLA-A*02:01")
r.sequence          # full protein (with signal peptide)
r.mature_sequence   # signal peptide removed (computed property)
r.mature_start      # signal peptide length (24 for HLA-A*02:01)
r.groove1           # α1 domain
r.groove2           # α2 domain
r.ig_domain         # α3 Ig-fold
r.tail              # TM + cytoplasmic
r.species_category  # "human"

# Apply mutations (IEDB-style, e.g. "K66A")
m = mhcseqs.lookup("HLA-A*02:01", mutations=["K66A", "D77S"])
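
To make the mutation notation concrete, here is a minimal sketch of how an IEDB-style string such as "K66A" (wild-type residue, 1-based position, replacement residue) could be applied to a sequence. The helper `apply_mutation` is hypothetical and is not part of mhcseqs. It also assumes positions are numbered against the mature sequence, which this README does not state.

```python
import re

# Hypothetical helper, not part of mhcseqs.
# Matches strings like "K66A": wild-type residue, 1-based position, mutant residue.
_MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def apply_mutation(mature_seq: str, mutation: str) -> str:
    """Return mature_seq with a single point substitution applied."""
    match = _MUTATION_RE.match(mutation)
    if match is None:
        raise ValueError(f"Unrecognized mutation string: {mutation!r}")
    wild_type, position, mutant = match.group(1), int(match.group(2)), match.group(3)
    index = position - 1  # convert the 1-based residue number to a 0-based index
    if not 0 <= index < len(mature_seq):
        raise ValueError(f"Position {position} is outside the sequence")
    if mature_seq[index] != wild_type:
        # Refuse to mutate when the stated wild-type residue does not match.
        raise ValueError(
            f"Expected {wild_type} at position {position}, found {mature_seq[index]}"
        )
    return mature_seq[:index] + mutant + mature_seq[index + 1:]


toy = "GSHSMRYF"  # toy sequence for illustration, not a real allele
mutated = apply_mutation(toy, "H3A")
```

Checking the wild-type residue before substituting catches off-by-one numbering mistakes early, for example a position counted against the full sequence rather than the mature one.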

Load as a DataFrame

import mhcseqs

# As a DataFrame (full sequence + groove decomposition + metadata)
df = mhcseqs.load_sequences_dataframe()

# Or as a list of dicts (no pandas dependency)
rows = mhcseqs.load_sequences_dict()

Current data summary

Three sources are merged into a single dataset:

Source	Entries	Species	Notes
IMGT/HLA	44,630	Human	Downloaded at build time
IPD-MHC	12,380	Non-human mammals, birds, fish	Downloaded at build time
UniProt	20,566	500+ species	Curated diverse MHC, B2M, H-2 references (shipped in package)
Total raw	77,576
After merge/dedup	55,658		One representative per two-field allele
Groove OK	54,121		97.2% of representatives

By species category

Category	Count
human	25,364
bird	9,312
nhp	7,125
fish	4,859
ungulate	1,768
murine	1,587
other_vertebrate	1,137
other_mammal	943
carnivore	484

Species categories: human, nhp (non-human primates), murine (mice, rats, other rodents), ungulate (cattle, pig, horse, sheep, goat), carnivore (dog, cat), other_mammal (marsupials, monotremes, bats, cetaceans, rabbit), bird, fish, other_vertebrate (reptiles, amphibians).

Data directory

By default, mhcseqs build downloads FASTA files and writes output CSVs to ~/.cache/mhcseqs/ (override with $MHCSEQS_DATA or --output-dir).

~/.cache/mhcseqs/
├── fasta/                     # Downloaded FASTA source files
│   ├── hla_prot.fasta         # IMGT/HLA (human)
│   └── ipd_mhc_prot.fasta    # IPD-MHC (non-human)
├── mhc-seqs-raw.csv           # Every protein entry from all sources
├── mhc-full-seqs.csv          # One representative per two-field allele (with grooves)
├── mhc-merge-report.txt       # Deduplication decisions
└── mhc-validation-report.txt  # Sanity checks
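
A script that reads the output CSVs directly could locate them with something like the sketch below. The function `resolve_data_dir` is hypothetical. The precedence it uses (explicit directory first, then `$MHCSEQS_DATA`, then the default) is an assumption, since the README names both overrides without saying which one wins.

```python
import os
from pathlib import Path
from typing import Optional

DEFAULT_DATA_DIR = Path("~/.cache/mhcseqs")


def resolve_data_dir(output_dir: Optional[str] = None) -> Path:
    """Pick the data directory: explicit argument, then $MHCSEQS_DATA, then default.

    The precedence order is an assumption made for this sketch.
    """
    if output_dir:
        return Path(output_dir).expanduser()
    env_value = os.environ.get("MHCSEQS_DATA")
    if env_value:
        return Path(env_value).expanduser()
    return DEFAULT_DATA_DIR.expanduser()


# File name taken from the directory layout above.
full_seqs_csv = resolve_data_dir() / "mhc-full-seqs.csv"
```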

Structural decomposition

Each protein chain is decomposed into up to four contiguous regions (a class II chain contributes only one groove domain, so its other groove column is empty):

Column	Class I alpha	Class II alpha	Class II beta
`groove1`	α1 domain (~90 aa)	α1 domain (~83 aa)	—
`groove2`	α2 domain (~93 aa)	—	β1 domain (~93 aa)
`ig_domain`	α3 Ig-fold (~95 aa)	α2 Ig-fold (~95 aa)	β2 Ig-fold (~95 aa)
`tail`	TM + cytoplasmic	TM + cytoplasmic	TM + cytoplasmic

For a class I chain: mature_protein = groove1 + groove2 + ig_domain + tail
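
The concatenation identity can be illustrated with a small sketch. The `Decomposition` class and `split_by_lengths` function are hypothetical names used only here. The lengths are the approximate class I values from the table, applied to a toy sequence rather than a real allele. An empty string stands in for a groove column that does not apply to a chain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decomposition:
    """Four contiguous regions of a mature chain (hypothetical container)."""

    groove1: str
    groove2: str
    ig_domain: str
    tail: str

    @property
    def mature_protein(self) -> str:
        # Regions are contiguous, so concatenation restores the mature chain.
        return self.groove1 + self.groove2 + self.ig_domain + self.tail


def split_by_lengths(mature: str, groove1_len: int, groove2_len: int, ig_len: int) -> Decomposition:
    """Slice a mature sequence at cumulative boundaries; the remainder is the tail."""
    a = groove1_len
    b = a + groove2_len
    c = b + ig_len
    return Decomposition(mature[:a], mature[a:b], mature[b:c], mature[c:])


# Toy class I-like chain: 90 + 93 + 95 residues followed by a 40-residue tail.
toy = "A" * 90 + "B" * 93 + "C" * 95 + "D" * 40
parts = split_by_lengths(toy, 90, 93, 95)
```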

Groove extraction algorithm

The groove parser is alignment-free: it uses conserved Cys-Cys disulfide pairs in Ig-fold domains as structural landmarks to locate domain boundaries, with no multiple sequence alignment.
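
As a rough illustration of the landmark idea, the sketch below scans a sequence for cysteine pairs whose spacing falls inside a window. It is not the logic in groove.py. The function name `find_cys_pairs` and the 50 to 70 residue window are assumptions chosen for this example.

```python
from typing import List, Tuple


def find_cys_pairs(seq: str, min_gap: int = 50, max_gap: int = 70) -> List[Tuple[int, int]]:
    """Return 0-based (first, second) index pairs of cysteines with a plausible spacing.

    The spacing window is an assumption for this sketch, not a documented constant.
    """
    cys_positions = [i for i, residue in enumerate(seq) if residue == "C"]
    pairs = []
    for n, first in enumerate(cys_positions):
        for second in cys_positions[n + 1:]:
            if min_gap <= second - first <= max_gap:
                pairs.append((first, second))
    return pairs


# Toy sequence with cysteines at indices 100, 163, 202 and 258.
toy = "A" * 100 + "C" + "A" * 62 + "C" + "A" * 38 + "C" + "A" * 55 + "C" + "A" * 20
candidates = find_cys_pairs(toy)
```

Only the two pairs with spacings of 63 and 56 residues survive the window here; the cross pairings between the two domains are either too close or too far apart.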

Signal peptide inference

Signal peptide length is inferred from the Cys pair position, not from sequence motifs. The conserved Ig-fold Cys has a known position in the mature protein (e.g., position 100 for class I α2). The offset between the raw sequence position and the expected mature position gives the signal peptide length:

mature_start = raw_cys_position - expected_mature_cys_position

Gene-specific constants account for groove domain length variation: DQA (109), DMA (120), DPB (114). All others use the defaults (class I: 100, class II alpha: 106, class II beta: 116).
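
The formula and constants above can be put together in a short sketch. The dictionary keys, the function names, and the prefix matching of gene names (so that a name like DQA1 picks up the DQA constant) are assumptions made for illustration. Only the numeric constants come from this README.

```python
# Expected mature-protein position of the conserved Cys, per chain type (from the text above).
DEFAULT_EXPECTED_CYS = {"I": 100, "II_alpha": 106, "II_beta": 116}

# Gene-specific overrides (from the text above); matching by prefix is an assumption.
GENE_EXPECTED_CYS = {"DQA": 109, "DMA": 120, "DPB": 114}


def expected_cys_position(gene: str, chain_type: str) -> int:
    """Look up the expected Cys position, preferring a gene-specific constant."""
    for prefix, position in GENE_EXPECTED_CYS.items():
        if gene.startswith(prefix):
            return position
    return DEFAULT_EXPECTED_CYS[chain_type]


def infer_mature_start(raw_cys_position: int, gene: str, chain_type: str) -> int:
    """mature_start = raw_cys_position - expected_mature_cys_position."""
    return raw_cys_position - expected_cys_position(gene, chain_type)


# A class I chain whose landmark Cys sits at raw position 124 gets a 24-residue
# signal peptide, consistent with the value quoted earlier for HLA-A*02:01.
hla_a_start = infer_mature_start(124, "A", "I")
```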

Groove status values

Status	Meaning
`ok`	Full decomposition: groove1 + groove2 + ig_domain + tail
`alpha1_only`	Single-exon class I fragment — α1 domain (no Cys pair)
`alpha2_only`	Single-exon class I fragment — α2 domain (has Cys pair)
`beta1_only_fallback`	Class II beta with β1 pair only (no β2 Ig pair)
`fragment_fallback`	Short fragment used as raw groove sequence
`inferred_from_alpha3`	Groove boundaries estimated from α3 Cys pair
`not_applicable`	Non-groove gene (B2M, MICA, MICB, HFE, MR1)
`non_classical`	Non-classical MHC lineage (fish L/S/P/H)
`short`	Groove half < 70 aa — unlikely to be functional
`suspect_anchor`	Cys mutation produced implausible mature_start

See the groove.py module docstring for detailed algorithm documentation with ASCII structural diagrams.

Key columns

Column	Description
`two_field_allele`	Allele name at two-field resolution
`gene`	MHC gene (e.g., A, DRB1, BF, UA)
`mhc_class`	I or II
`chain`	alpha, beta, or B2M
`species`	Latin binomial from source
`species_category`	One of 9 categories above
`source`	`imgt`, `ipd_mhc`, or `uniprot`
`source_id`	Database accession for provenance
`groove_status`	See table above
`is_functional`	True if groove parsed and not null/pseudogene
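
These columns lend themselves to simple filtering once the rows are loaded as dicts. The sketch below uses a few made-up rows in place of the real dataset, and assumes that the dict keys match the column names above and that `is_functional` is a Python bool. If the value arrives as a string from the CSV, the comparison would need adjusting.

```python
from collections import Counter

# Toy rows standing in for the real dataset; the values are made up for illustration.
rows = [
    {"two_field_allele": "HLA-A*02:01", "species_category": "human",
     "groove_status": "ok", "is_functional": True},
    {"two_field_allele": "EXAMPLE-1", "species_category": "bird",
     "groove_status": "ok", "is_functional": True},
    {"two_field_allele": "EXAMPLE-2", "species_category": "fish",
     "groove_status": "non_classical", "is_functional": False},
    {"two_field_allele": "EXAMPLE-3", "species_category": "human",
     "groove_status": "short", "is_functional": False},
]

# Keep only rows with a full decomposition that are flagged functional.
usable = [row for row in rows if row["groove_status"] == "ok" and row["is_functional"]]

# Tally the surviving rows per species category.
by_category = Counter(row["species_category"] for row in usable)
```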

Dependencies

Python 3.10+
mhcgnomes >= 3.14.0 — allele name parsing with species-directed disambiguation

No alignment tools, BLAST, or structure databases are required.

License

Apache 2.0
