Python library for parsing MHC nomenclature in the wild

mhcgnomes: Parsing MHC nomenclature in the wild

MHCgnomes is a parsing library for multi-species MHC nomenclature which aims to correctly parse every name in IEDB, IMGT/HLA, IPD/MHC, and the allele lists for both NetMHCpan and NetMHCIIpan predictors. This allows for standardization between immune databases and tools, which often use different naming conventions.

Usage example

In [1]: mhcgnomes.parse("HLA-A0201")
Out[1]: Allele(
    gene=Gene(
        species=Species(name="Homo sapiens", prefix="HLA"),
        name="A"), 
    allele_fields=("02", "01"), 
    annotations=(), 
    mutations=())

In [2]: mhcgnomes.parse("HLA-A0201").to_string()
Out[2]: 'HLA-A*02:01'

In [3]: mhcgnomes.parse("HLA-A0201").compact_string()
Out[3]: 'A0201'

The problem: MHC nomenclature is nuts

Despite the valiant efforts of groups such as the Comparative MHC Nomenclature Committee, the names of MHC alleles you might encounter in different datasets (or that are accepted by immunoinformatics tools) are frustratingly ill-specified. It's not uncommon to see dozens of different forms for the same allele.

For example, these all refer to the same MHC protein sequence:

  • "HLA-A*02:01"
  • "HLA-A02:01"
  • "HLA-A:02:01"
  • "HLA-A0201"

Additionally, for human alleles, the species prefix is often omitted:

  • "A*02:01"
  • "A*0201"
  • "A02:01"
  • "A:02:01"
  • "A0201"
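To make the equivalence concrete, here is a minimal sketch that collapses the nine spellings above into one canonical form using only the standard library. It is a toy illustration, not how MHCgnomes parses names: the function `normalize_hla` and its pattern are hypothetical, and the pattern only handles a letters-only gene name followed by two two-digit fields.

```python
import re

# Toy pattern for the HLA-A examples above: an optional "HLA-" prefix, a gene
# name made of letters, an optional "*" or ":" separator, then two 2-digit
# fields with an optional ":" between them.
_PATTERN = re.compile(
    r"^(?:HLA-)?(?P<gene>[A-Z]+)[*:]?(?P<f1>\d{2}):?(?P<f2>\d{2})$"
)


def normalize_hla(name):
    """Return the 'HLA-GENE*XX:YY' form of a name, or raise ValueError."""
    match = _PATTERN.match(name.strip())
    if match is None:
        raise ValueError(f"unrecognized allele name: {name!r}")
    return f"HLA-{match['gene']}*{match['f1']}:{match['f2']}"


variants = [
    "HLA-A*02:01", "HLA-A02:01", "HLA-A:02:01", "HLA-A0201",
    "A*02:01", "A*0201", "A02:01", "A:02:01", "A0201",
]

# Every spelling collapses to the same canonical string.
normalized = {normalize_hla(v) for v in variants}
print(normalized)
```

A single regular expression like this breaks down quickly once other species, longer fields, or suffixes enter the picture, which is the motivation for a dedicated parser.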

Annotations

Sometimes, alleles are bundled with modifier suffixes which specify the functionality or abundance of the MHC. Here's an example with an allele which is secreted instead of membrane-bound:

  • "HLA-A*02:01:01S"

These are collected in the annotations field of an Allele result.
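As a rough illustration of how such a suffix could be separated from the rest of a name, here is a minimal standard-library sketch. The function `split_annotation` and the lookup table are hypothetical and not part of MHCgnomes; the letters are the expression suffixes used in HLA nomenclature.

```python
import re

# Expression-status suffixes used in HLA nomenclature.
SUFFIX_MEANINGS = {
    "N": "null (not expressed)",
    "L": "low cell-surface expression",
    "S": "secreted",
    "C": "cytoplasmic",
    "A": "aberrant expression",
    "Q": "questionable expression",
}

# A suffix is a single known letter that directly follows the last digit.
_SUFFIX = re.compile(r"^(?P<core>.*\d)(?P<suffix>[NLSCAQ])$")


def split_annotation(name):
    """Split a trailing suffix letter off an allele name, if one is present.

    Returns (core_name, suffix), where suffix is None when there is none.
    """
    match = _SUFFIX.match(name)
    if match is None:
        return name, None
    return match["core"], match["suffix"]


core, suffix = split_annotation("HLA-A*02:01:01S")
print(core, suffix, SUFFIX_MEANINGS[suffix])
```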

Mutations

MHC proteins are sometimes described in terms of mutations to a known allele.

  • "HLA-B*08:01 N80I mutant"

These mutations are collected in the mutations field of an Allele result.
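A mutation such as "N80I" packs three pieces of information into one token: the original amino acid, the residue position, and the replacement. The sketch below pulls those pieces out of a free-text name. `PointMutation` and `find_mutations` are hypothetical names chosen for illustration and do not describe the MHCgnomes internals.

```python
import re
from typing import NamedTuple


class PointMutation(NamedTuple):
    original: str     # amino acid in the reference allele
    position: int     # residue position as written in the name
    replacement: str  # amino acid in the mutant


# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# letter + digits + letter, delimited by word boundaries, e.g. "N80I".
_MUTATION = re.compile(rf"\b([{AMINO_ACIDS}])(\d+)([{AMINO_ACIDS}])\b")


def find_mutations(text):
    """Return every point mutation token found in the text, in order."""
    return [
        PointMutation(original, int(position), replacement)
        for original, position, replacement in _MUTATION.findall(text)
    ]


print(find_mutations("HLA-B*08:01 N80I mutant"))
```

Note that a pattern this loose will also match tokens that merely look like mutations, so a real parser needs more context than a single regular expression provides.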

Beyond humans

To make things worse, several model organisms (like mice and rats) use archaic naming systems with no notion of allele groups or four/six/eight-digit alleles; instead, every allele is simply given a name, such as:

  • "H2-Kk"
  • "RT1-9.5f"

In the examples above, "H2" and "RT1" correspond to species, "K" and "9.5" are the gene names, and "k" and "f" are the allele names.

To make things even worse, the name of a species is subject to variation (e.g. "H2" vs. "H-2") as well as drift over time (e.g. ChLA -> MhcPatr -> Patr).
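Names like "H2-Kk" can only be split correctly if you already know which species prefixes and gene names exist. The sketch below shows the idea with a tiny hand-written lookup. Both tables are hypothetical, cover only the examples in this section, and are far smaller than the curated ontology MHCgnomes actually ships.

```python
# Hypothetical alias table: every accepted spelling maps to one prefix.
SPECIES_ALIASES = {"H2": "H2", "H-2": "H2", "RT1": "RT1"}

# Hypothetical gene table: species prefix -> known gene names.
KNOWN_GENES = {"H2": ["K"], "RT1": ["9.5"]}


def split_named_allele(name):
    """Split a name such as 'H2-Kk' into (species prefix, gene, allele)."""
    # Try longer aliases first so "H-2" is not cut short by a shorter alias.
    for alias in sorted(SPECIES_ALIASES, key=len, reverse=True):
        if name.startswith(alias + "-"):
            prefix = SPECIES_ALIASES[alias]
            rest = name[len(alias) + 1:]
            break
    else:
        raise ValueError(f"unknown species prefix in {name!r}")

    # Try longer gene names first so a short gene cannot shadow a longer one.
    for gene in sorted(KNOWN_GENES[prefix], key=len, reverse=True):
        if rest.startswith(gene) and len(rest) > len(gene):
            return prefix, gene, rest[len(gene):]
    raise ValueError(f"unknown gene in {name!r}")


for example in ["H2-Kk", "H-2-Kk", "RT1-9.5f"]:
    print(example, "->", split_named_allele(example))
```

Because the split depends entirely on the lookup tables, the quality of the curated species and gene lists matters more than the string handling itself.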

Serotypes, haplotypes, and other named entities

Besides alleles, there are also other named MHC-related entities you'll encounter in immunological data. Closely related to alleles are serotypes, which effectively denote a grouping of alleles that are all recognized by the same antibody:

  • "HLA-A2"
  • "A2"

In many datasets the exact allele is not known but an experiment might note the genetic background of a model animal, resulting in loose haplotype restrictions such as:

  • "H2-k class I"

Yes, good luck disambiguating "H2-k" the haplotype from "H2-K" the gene, especially since capitalization is not stable enough to be relied on for parsing.

In some cases immunological data comes only with a denoted species (e.g. "mouse"), a gene (e.g. "HLA-A"), or an MHC class ("human class I"). MHCgnomes has a structured representation for all of these cases and more.

Parsing strategy

It is a fool's errand to curate all possible MHC allele names since that list grows daily as the MHC loci of more people (and non-human animals) are sequenced. Instead, MHCgnomes contains an ontology of curated species and genes and then attempts to parse any given string into multiple candidate interpretations, such as an Allele, a Serotype, or a Haplotype.

The set of candidate interpretations for each string is then ranked according to heuristic rules. For example, a string will be preferentially interpreted as an Allele rather than a Serotype or Haplotype.
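The ranking step can be pictured as picking the candidate whose type has the highest priority. The following is only a sketch of that idea: `Candidate`, `TYPE_PRIORITY`, and `best_candidate` are hypothetical, the candidate values are made-up placeholders, and the actual heuristics in MHCgnomes may weigh more than the type alone.

```python
from dataclasses import dataclass

# Hypothetical priority order: lower numbers win. Only the preference of
# Allele over Serotype and Haplotype comes from the text above; the relative
# order of Serotype and Haplotype is an arbitrary assumption.
TYPE_PRIORITY = {"Allele": 0, "Serotype": 1, "Haplotype": 2}


@dataclass(frozen=True)
class Candidate:
    kind: str   # name of the interpretation type
    value: str  # placeholder for the parsed content


def best_candidate(candidates):
    """Return the highest-priority candidate, or None if there are none."""
    if not candidates:
        return None
    # Types missing from the table rank after all listed types.
    return min(
        candidates,
        key=lambda c: TYPE_PRIORITY.get(c.kind, len(TYPE_PRIORITY)),
    )


candidates = [
    Candidate("Haplotype", "interpretation-1"),
    Candidate("Allele", "interpretation-2"),
    Candidate("Serotype", "interpretation-3"),
]
print(best_candidate(candidates))
```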

How many digits per field?

Originally alleles for many genes were numbered with two digits:

  • "HLA-MICB*01"

But as the number of identified alleles increased, the number of fields specifying a distinct protein increased to two. This became conventionally called a "four digit" format, since each field has two digits. Yet, as the number of identified alleles continued to increase, the number of digits per field has often increased from two to three:

  • "MICB*002:01"
  • "HLA-A00201"
  • "A:002:01"
  • "A*00201"

These are currently not always treated as equivalent to allele strings with two digits in their first field, but that feature is in the works.
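One reason this is tricky: once a field may have either two or three digits, a run of digits without separators no longer has a unique reading. The sketch below enumerates every way to cut a digit string into fields. The helper `field_splits` is hypothetical and only illustrates the ambiguity; it says nothing about how MHCgnomes resolves it.

```python
def field_splits(digits, sizes=(2, 3)):
    """Enumerate every way to cut a digit string into fields of the given sizes."""
    if not digits:
        # Exactly one way to split nothing: no fields at all.
        return [()]
    splits = []
    for size in sizes:
        if len(digits) >= size:
            head, tail = digits[:size], digits[size:]
            # Prepend this field to every valid split of the remainder.
            splits.extend((head,) + rest for rest in field_splits(tail, sizes))
    return splits


print(field_splits("0201"))   # a single reading
print(field_splits("00201"))  # two competing readings
```

Four digits can only be read as two two-digit fields, while five digits can be read with the extra digit in either field, so a parser has to fall back on convention or on a list of known alleles.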

However, if databases such as IPD-MHC or IMGT-HLA recorded an older form of an allele, then MHCgnomes can optionally map it onto the modern version (including capturing differences in numbers of digits per field).
