Python library for parsing MHC nomenclature in the wild
mhcgnomes: Parsing MHC nomenclature in the wild
Documentation site: https://pirl-unc.github.io/mhcgnomes/
MHCgnomes is a parsing library for multi-species MHC nomenclature which aims to correctly parse every name in IEDB, IMGT/HLA, IPD/MHC, and the allele lists for both NetMHCpan and NetMHCIIpan predictors. This allows for standardization between immune databases and tools, which often use different naming conventions.
Usage example
In [1]: import mhcgnomes

In [2]: mhcgnomes.parse("HLA-A0201")
Out[2]: Allele(
    gene=Gene(
        species=Species(name="Homo sapiens", mhc_prefix="HLA"),
        name="A"),
    allele_fields=("02", "01"),
    annotations=(),
    mutations=())

In [3]: mhcgnomes.parse("HLA-A0201").to_string()
Out[3]: 'HLA-A*02:01'

In [4]: mhcgnomes.parse("HLA-A0201").compact_string()
Out[4]: 'A0201'
The problem: MHC nomenclature is nuts
Despite the valiant efforts of groups such as the Comparative MHC Nomenclature Committee, the names of MHC alleles you might encounter in different datasets (or accepted by immunoinformatics tools) are frustratingly ill-specified. It's not uncommon to see dozens of different forms for the same allele.
For example, these all refer to the same MHC protein sequence:
- "HLA-A*02:01"
- "HLA-A02:01"
- "HLA-A:02:01"
- "HLA-A0201"
Additionally, for human alleles, the species prefix is often omitted:
- "A*02:01"
- "A*0201"
- "A02:01"
- "A:02:01"
- "A0201"
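To make the variation above concrete, here is a minimal regular-expression sketch that collapses those nine spellings into one canonical form. It is not how MHCgnomes parses names: `normalize_hla` and its pattern are hypothetical, and the pattern assumes a human gene with a purely alphabetic name followed by exactly two two-digit fields.

```python
import re

# Hypothetical pattern (not from MHCgnomes): optional "HLA-" prefix,
# alphabetic gene name, optional "*" or ":" separator, two 2-digit fields.
_HLA_PATTERN = re.compile(r"^(?:HLA-)?([A-Z]+)[*:]?(\d{2}):?(\d{2})$")


def normalize_hla(name):
    """Return 'HLA-<gene>*<f1>:<f2>', or None if the pattern does not match."""
    match = _HLA_PATTERN.match(name.strip().upper())
    if match is None:
        return None
    gene, first, second = match.groups()
    return f"HLA-{gene}*{first}:{second}"


variants = [
    "HLA-A*02:01", "HLA-A02:01", "HLA-A:02:01", "HLA-A0201",
    "A*02:01", "A*0201", "A02:01", "A:02:01", "A0201",
]
canonical = {normalize_hla(v) for v in variants}
print(canonical)  # {'HLA-A*02:01'}
```

A single pattern like this breaks as soon as gene names contain digits (for example DRB1) or fields grow to three digits, which is part of why a dedicated parser is useful.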
Annotations
Sometimes, alleles are bundled with modifier suffixes which specify the functionality or abundance of the MHC. Here's an example with an allele which is secreted instead of membrane-bound:
- "HLA-A*02:01:01S"
These are collected in the annotations field of an Allele result.
Multi-letter annotations are also used in some non-human systems. In particular,
Ps (pseudogene) and Sp (splice variant) appear as suffixes on allele fields,
e.g. Mamu-B*074:03Sp or Caja-B5*01:01Ps, and are parsed into the
annotations field as Sp or Ps respectively.
Note that Ps can also appear as part of a gene name (prefix or suffix) in
non-human primates, such as Caja-G2Ps*01. In those cases Ps is treated as
part of the gene name, not an allele annotation.
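The distinction between a trailing annotation and a Ps inside a gene name can be sketched with a suffix match that only fires when the suffix directly follows a digit at the end of the string. This is an illustrative sketch, not the MHCgnomes implementation: `split_annotation` is a hypothetical helper, and the single-letter set (N, L, S, C, A, Q) is the standard list of HLA expression suffixes rather than something read out of the library.

```python
import re

# Assumption: an annotation is a known suffix that follows the last digit
# of the final allele field. Ps/Sp come from the text above; N, L, S, C,
# A, Q are the single-letter HLA expression suffixes.
_SUFFIX_PATTERN = re.compile(r"^(.*\d)(Ps|Sp|[NLSCAQ])$")


def split_annotation(name):
    """Split a trailing annotation off an allele name.

    Returns (name_without_suffix, suffix), or (name, None) when no known
    suffix is present at the end of the string.
    """
    match = _SUFFIX_PATTERN.match(name)
    if match is None:
        return name, None
    return match.group(1), match.group(2)


for example in ["HLA-A*02:01:01S", "Mamu-B*074:03Sp",
                "Caja-B5*01:01Ps", "Caja-G2Ps*01"]:
    print(example, "->", split_annotation(example))
```

Because `Caja-G2Ps*01` ends in a digit, the Ps stays with the gene name and no annotation is reported.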
Mutations
MHC proteins are sometimes described in terms of mutations to a known allele.
- "HLA-B*08:01 N80I mutant"
These mutations are collected in the mutations field of an
Allele result.
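As a rough illustration of what such a description contains, the sketch below splits a string like the one above into a base allele name and a list of point mutations. `split_mutations` is a hypothetical helper written for this example, and it assumes mutations are whitespace-separated tokens of the form original residue, position, new residue; it says nothing about how MHCgnomes represents mutations internally.

```python
import re

# Assumed token shape: one amino acid letter, a position, another letter.
_MUTATION = re.compile(r"([A-Z])(\d+)([A-Z])")


def split_mutations(text):
    """Return (base_allele_name, [(original, position, replacement), ...])."""
    tokens = text.split()
    if not tokens:
        raise ValueError("empty allele description")
    base, rest = tokens[0], tokens[1:]
    mutations = []
    for token in rest:
        match = _MUTATION.fullmatch(token)
        if match:  # words such as "mutant" are simply skipped
            mutations.append(
                (match.group(1), int(match.group(2)), match.group(3))
            )
    return base, mutations


print(split_mutations("HLA-B*08:01 N80I mutant"))
# ('HLA-B*08:01', [('N', 80, 'I')])
```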
Beyond humans
To make things worse, several model organisms (like mice and rats) use archaic naming systems, where there is no notion of allele groups or four/six/eight digit alleles but every allele is simply given a name, such as:
- "H2-Kk"
- "RT1-9.5f"
In the above examples, "H2"/"RT1" correspond to species, "K"/"9.5" are the gene names, and "k"/"f" are the allele names.
To make things even worse, the name of a species is subject to variation (e.g. "H2" vs. "H-2") as well as drift over time (e.g. ChLA -> MhcPatr -> Patr).
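Because these names have no separators between gene and allele, splitting them requires knowing the gene names in advance. The sketch below shows the idea with a toy lookup table; `SPECIES_ALIASES`, `GENES`, and `parse_named_allele` are hypothetical names for this example, and the tables contain only the handful of entries needed here rather than a real ontology.

```python
# Toy tables for illustration only; a real ontology is far larger.
SPECIES_ALIASES = {"H2": "H2", "H-2": "H2", "RT1": "RT1"}
GENES = {"H2": ("K", "D", "L"), "RT1": ("9.5",)}


def parse_named_allele(name):
    """Split e.g. 'H2-Kk' into (species_prefix, gene, allele), else None."""
    upper = name.upper()
    # Try longer aliases first so that "H-2" is matched before "H2".
    for alias in sorted(SPECIES_ALIASES, key=len, reverse=True):
        if not upper.startswith(alias + "-"):
            continue
        species = SPECIES_ALIASES[alias]
        rest = name[len(alias) + 1:]
        # Longest gene name first, and require a non-empty allele name.
        for gene in sorted(GENES[species], key=len, reverse=True):
            if rest.upper().startswith(gene.upper()) and len(rest) > len(gene):
                return species, gene, rest[len(gene):]
    return None


print(parse_named_allele("H2-Kk"))     # ('H2', 'K', 'k')
print(parse_named_allele("H-2-Kk"))    # ('H2', 'K', 'k')
print(parse_named_allele("RT1-9.5f"))  # ('RT1', '9.5', 'f')
```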
Serotypes, supertypes, haplotypes, and other named entities
Besides alleles, there are other named MHC-related entities you'll encounter in immunological data. Closely related to alleles are serotypes, which effectively denote a grouping of alleles that are all recognized by the same antibody:
- "HLA-A2"
- "A2"
Supertypes are functional groupings based on shared peptide-binding specificity rather than serological reactivity (Sidney et al. 2008). These are parsed when the "supertype" keyword is present:
- "A2 supertype"
- "HLA-B44 supertype"
Class II heterodimers can be specified using dot notation, which is common in celiac disease literature:
- "DQ2.5" (equivalent to DQA1*05:01/DQB1*02:01)
- "DQ8.5"
In many datasets the exact allele is not known but an experiment might note the genetic background of a model animal, resulting in loose haplotype restrictions such as:
- "H2-k class I"
Yes, good luck disambiguating "H2-k" the haplotype from "H2-K" the gene, especially since capitalization is not stable enough to be relied on for parsing.
In some cases immunological data comes only with a denoted species (e.g. "mouse"), a gene (e.g. "HLA-A"), or an MHC class ("human class I"). MHCgnomes has a structured representation for all of these cases and more.
CLI
After installation, an mhcgnomes CLI is available:
mhcgnomes "HLA-A*02:01" "DQ2.5"
# or:
python -m mhcgnomes "HLA-A*02:01" "DQ2.5"
This prints a table with:
- input string
- parsed result type
- normalized and compact forms
- species/gene/MHC class
- parsed properties from to_record()
You can also use machine-friendly output:
mhcgnomes --format tsv "HLA-A*02:01" "HLA-A2"
mhcgnomes --format json "HLA-A*02:01" "not a real allele"
By default, unparseable values are shown as ParseError rows.
Use strict mode to fail fast:
mhcgnomes --strict "not a real allele"
Parsing strategy
It is a fool's errand to curate all possible MHC allele names, since that list grows daily as the MHC loci of more people (and non-human animals) are sequenced. Instead, MHCgnomes contains an ontology of curated species and genes and attempts to parse any given string into multiple candidates of different result types.
The set of candidate interpretations for each string is then ranked according to heuristic rules. For example, a string will be preferentially interpreted as an Allele rather than a Serotype or Haplotype.
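The ranking step can be pictured as sorting candidates by a per-type priority. The sketch below is a simplified illustration under that assumption: `Candidate`, `PRIORITY`, and `best_candidate` are hypothetical names, and the only ordering taken from the text is that an Allele outranks a Serotype or Haplotype (the latter two are given equal priority here).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    kind: str  # e.g. "Allele", "Serotype", "Haplotype"
    text: str  # normalized form of this interpretation


# Lower number wins. Only "Allele before the others" comes from the text;
# treating Serotype and Haplotype as ties is an assumption of this sketch.
PRIORITY = {"Allele": 0, "Serotype": 1, "Haplotype": 1}


def best_candidate(candidates):
    """Return the highest-priority candidate, or None if there are none."""
    if not candidates:
        return None
    # min() keeps the first of several equally ranked candidates.
    return min(candidates, key=lambda c: PRIORITY.get(c.kind, 99))


options = [
    Candidate("Haplotype", "hypothetical haplotype reading"),
    Candidate("Allele", "hypothetical allele reading"),
]
print(best_candidate(options).kind)  # Allele
```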
How many digits per field?
Originally alleles for many genes were numbered with two digits:
- "HLA-MICB*01"
But as the number of identified alleles increased, the number of fields specifying a distinct protein increased to two. This became conventionally called a "four digit" format, since each field has two digits. Yet, as the number of identified alleles continued to increase, the number of digits per field has often increased from two to three:
- "MICB*002:01"
- "HLA-A00201"
- "A:002:01"
- "A*00201"
These are not always currently treated as equivalent to allele strings with two digits in their first field, but that feature is in the works.
However, if databases such as IPD-MHC or IMGT-HLA recorded an older form of an allele, then MHCgnomes can optionally map it onto the modern version (including capturing differences in numbers of digits per field).
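The digit-count ambiguity can be illustrated with a small sketch that splits a compact digit string into two fields and compares fields numerically. Both helpers are hypothetical and are not part of MHCgnomes; in particular, treating "002" and "02" as the same field by comparing integers is an assumption of this example, and only the 4-digit (2+2) and 5-digit (3+2) layouts shown above are handled.

```python
def split_compact_fields(digits):
    """Split '0201' -> ('02', '01') and '00201' -> ('002', '01')."""
    if not digits.isdigit():
        raise ValueError(f"expected only digits, got {digits!r}")
    if len(digits) == 4:
        return digits[:2], digits[2:]
    if len(digits) == 5:
        return digits[:3], digits[3:]
    raise ValueError(f"unsupported number of digits: {len(digits)}")


def same_numeric_fields(fields_a, fields_b):
    """Compare two field tuples while ignoring leading zeros (assumption)."""
    return tuple(int(f) for f in fields_a) == tuple(int(f) for f in fields_b)


old_style = split_compact_fields("0201")
new_style = split_compact_fields("00201")
print(old_style, new_style, same_numeric_fields(old_style, new_style))
# ('02', '01') ('002', '01') True
```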
Species-directed parsing
species= constrains parsing to a single species. The final parsed object
must match that species exactly, or parsing fails. This is useful when you
know the organism and want to reject cross-species mismatches:
>>> mhcgnomes.parse("BoLA-DRB3*01:01", species="Bos taurus").to_string()
'Bota-DRB3*01:01'
>>> mhcgnomes.parse("HLA-A*02:01", species="Bos taurus", raise_on_error=False) is None
True
>>> mhcgnomes.parse("A*02:01", species="Homo sapiens").species.name
'Homo sapiens'
When the input uses an ancestor prefix (like BoLA for genus-level Bos sp.),
species= rewrites the result to the requested descendant species if valid.
default_species= is a less strict alternative — it provides a fallback
species hint for inputs that don't contain a species prefix, but does not
reject inputs that resolve to a different species:
>>> mhcgnomes.parse("A*02:01", default_species="Homo sapiens").species.name
'Homo sapiens'
>>> mhcgnomes.parse("DMA", default_species="Chelonia mydas").species.name
'Chelonia mydas'
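The difference between the two keyword arguments can be summarized as a small decision function. The sketch below models only the rules described above, using plain strings for species; `resolve_species` and the `ANCESTOR` table are hypothetical and exist only for this illustration. It returns None where the real parser would fail.

```python
# Toy child -> parent relation, taken from the BoLA example above.
ANCESTOR = {"Bos taurus": "Bos sp."}


def resolve_species(parsed, species=None, default_species=None):
    """Model how a parsed species combines with species= / default_species=.

    `parsed` is the species implied by the input string, or None when the
    input had no species prefix.
    """
    if species is not None:
        # Strict mode: accept no prefix, an exact match, or an ancestor
        # prefix that can be rewritten to the requested descendant.
        if parsed is None or parsed == species or ANCESTOR.get(species) == parsed:
            return species
        return None  # cross-species mismatch is rejected
    if parsed is None:
        return default_species  # fallback hint only when nothing was parsed
    return parsed  # default_species never overrides an explicit prefix


print(resolve_species("Bos sp.", species="Bos taurus"))               # Bos taurus
print(resolve_species("Homo sapiens", species="Bos taurus"))          # None
print(resolve_species("Homo sapiens", default_species="Bos taurus"))  # Homo sapiens
```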
Species and gene ontology
MHCgnomes maintains a curated ontology of species prefixes and MHC gene names
in YAML data files under mhcgnomes/data/. The key files are:
| File | Purpose |
|---|---|
| species.yaml | Canonical species entries with MHC prefix, gene names, and class assignments |
| gene_aliases.yaml | Alternative gene spellings that normalize to canonical genes |
| allele_aliases.yaml | Retired or shorthand allele names that normalize to canonical alleles |
| known_alleles.yaml | Curated known allele labels per species/gene |
Species prefix conventions
Each species is identified by a short prefix (usually 2-4 characters) such as
HLA (human), H2 (mouse), Gaga (chicken), or Dare (zebrafish). The
parser uses these prefixes to identify species before parsing gene names and
allele fields.
Prefixes are matched case-insensitively after stripping punctuation. A leading
Mhc prefix (common in bird MHC literature, e.g. MhcTyal-DAB1*01:01) is
automatically stripped as a fallback when normal prefix matching fails.
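The matching rule just described can be sketched in a few lines. This is an illustration of the rule rather than the library's code: `KNOWN_PREFIXES` is a toy table holding only prefixes mentioned in this section, and `match_prefix` is a hypothetical helper that takes an already-isolated prefix string.

```python
import re

# Toy table keyed by normalized prefix; values are display forms.
KNOWN_PREFIXES = {
    "hla": "HLA", "h2": "H2", "gaga": "Gaga", "dare": "Dare", "tyal": "Tyal",
}


def normalize_prefix(raw):
    """Lower-case the prefix and drop anything that is not a letter or digit."""
    return re.sub(r"[^a-z0-9]", "", raw.lower())


def match_prefix(raw):
    """Return the canonical prefix for `raw`, or None if it is unknown."""
    key = normalize_prefix(raw)
    if key in KNOWN_PREFIXES:
        return KNOWN_PREFIXES[key]
    # Fallback: strip a leading "Mhc" only after normal matching failed.
    if key.startswith("mhc") and key[3:] in KNOWN_PREFIXES:
        return KNOWN_PREFIXES[key[3:]]
    return None


print(match_prefix("H-2"))      # H2
print(match_prefix("hla"))      # HLA
print(match_prefix("MhcTyal"))  # Tyal
```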
Some historically important prefixes are not single-species codes. Prefixes
such as DLA, SLA, OLA, BoLA, and CELA are curated as umbrella taxon
nodes in the ontology because the external nomenclature itself is genus- or
clade-level rather than species-specific. For example:
- DLA maps to Canis sp., while Calu maps specifically to Canis lupus
- SLA maps to Sus sp., while Susc maps specifically to Sus scrofa
- BoLA maps to Bos sp., while Bota maps specifically to Bos taurus
- OLA maps to Ovis sp., while Ovar maps specifically to Ovis aries
- CELA maps to Cetacea sp., while Tutr maps specifically to Tursiops truncatus
This distinction matters when interpreting parsed objects: an allele parsed
from BoLA-... is attached to the generic cattle node unless the parse is
explicitly constrained or rewritten to a descendant species.
MHC gene class assignments
Genes in species.yaml are organized by MHC class:
- Ia: Classical class I (associates with B2M, presents peptides)
- Ib: Non-classical class I (in MHC locus, associates with B2M)
- Ic: Related MHC locus genes, no B2M association (e.g. MICA)
- Id: Class I-related genes on other chromosomes
- IIa: Classical class II alpha/beta chains presenting peptides
- IIb: Accessory or non-classical class II proteins
- other: Antigen processing genes (TAP1, TAP2, TAPBP, B2M)
Species prefix tiers
As mhcgnomes supports more species, short prefix codes increasingly collide.
Codes like HLA/SLA/DLA, OrLA, and four-letter codes like Calu all
hit collisions as coverage grows. We support multiple prefix tiers so that
every species is always parseable:
| Tier | Form | Example | When used |
|---|---|---|---|
| Established short prefix | 1–4 letters | HLA, Gaga, Crpo | Published in MHC literature or IPD-MHC. Preferred for display. |
| Novel 4+4 prefix | First 4 of genus + first 4 of species | OryzLati, StruCame | Standard display prefix for species without an established literature prefix. |
| 5+5 long prefix | First 5 of genus + first 5 of species | HomoSapie, OryziLatip | Auto-generated alias for all binomial species. Always parseable. |
| Full latin name | Concatenated genus + species | HomoSapiens, ChrysemysPicta | Always parseable as an alternative. Guaranteed collision-free. |
All tiers are parsed case-insensitively. For example, these all parse to the same allele:
HLA-A*02:01 # established prefix
HomoSapi-A*02:01 # 4+4 novel prefix (auto-generated alias)
HomoSapie-A*02:01 # 5+5 long prefix (auto-generated alias)
HomoSapiens-A*02:01 # full latin name
Homo sapiens-A*02:01 # latin name with space
The 8-letter (4+4) novel prefix space greatly reduces collision probability compared to 4-letter codes, but only the full latin name is truly guaranteed to be unique. Since we don't yet know what naming conventions the scientific community will settle on for newer taxa, we support all tiers simultaneously.
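The generated tiers follow mechanically from the binomial name, as the following sketch shows. `prefix_tiers` is a hypothetical helper written to reproduce the examples in the table above; it is not the function MHCgnomes uses, and it assumes a two-word latin name.

```python
def prefix_tiers(latin_name):
    """Derive the 4+4, 5+5, and full-name prefixes from 'Genus species'."""
    genus, species = latin_name.split()

    def joined(n):
        # Slicing past the end is safe, so "Homo"[:5] is simply "Homo".
        return genus[:n].capitalize() + species[:n].capitalize()

    return {
        "4+4": joined(4),
        "5+5": joined(5),
        "full": genus.capitalize() + species.capitalize(),
    }


print(prefix_tiers("Homo sapiens"))
# {'4+4': 'HomoSapi', '5+5': 'HomoSapie', 'full': 'HomoSapiens'}
print(prefix_tiers("Oryzias latipes"))
# {'4+4': 'OryzLati', '5+5': 'OryziLatip', 'full': 'OryziasLatipes'}
```

Note that a short genus such as Homo contributes only four letters to the 5+5 form, which is why that tier reads HomoSapie.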
Which prefixes are established vs generated: Comments in species.yaml
document which prefixes are attested in MHC literature and which were generated
by mhcgnomes. Established prefixes are never changed; generated prefixes are
subject to replacement if a community convention emerges.
See the Curation Guide for the full prefix conflict resolution policy.
References
- IPD-MHC: nomenclature requirements for the non-human major histocompatibility complex in the next-generation sequencing era
- Comparative MHC nomenclature: report from the ISAG/IUIS-VIC committee 2018
- ISAG/IUIS-VIC Comparative MHC Nomenclature Committee report, 2005
- Nomenclature for factors of the SLA system, update 2008
Development
Local docs
./develop.sh
mkdocs serve
mkdocs build --strict