Download and output HLA sequences from IPD-IMGT/HLA.

hladl

(HLA downloader)

JH @ MGH, 2025

This is a simple CLI to make grabbing specific HLA allele sequences easier. It aims to be similar to hladownload, but without that tool's more advanced features (although that script appears to be out of action due to Biopython version changes since its last update).

Effectively, this script will spit out a cDNA nucleotide or protein amino acid sequence, given an allele identifier and a digit resolution. Sequences are grabbed from the IPD-IMGT/HLA GitHub repo (as is available through the EBI) and stored locally in a gzipped JSON file, allowing them to be output later without internet connectivity.
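To illustrate the local storage idea, here is a minimal sketch of writing a mapping of allele names to sequences into a gzip-compressed JSON file and reading it back offline. The file name, the dictionary layout, and the helper names save_sequences and load_sequences are assumptions made for illustration; the on-disk format hladl actually uses may differ.

```python
import gzip
import json
import tempfile
from pathlib import Path


def save_sequences(path, sequences):
    """Write an {allele: sequence} mapping to a gzip-compressed JSON file."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        json.dump(sequences, handle)


def load_sequences(path):
    """Read the mapping back; no network access is needed at this point."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


# Toy data: short illustrative prefixes only, not full-length sequences.
toy = {"A*02:01": "GSHSMRYFFTSV", "B*08:01": "GSHSMRYFDTAM"}

with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "prot_4.json.gz"  # hypothetical file name
    save_sequences(cache_file, toy)
    restored = load_sequences(cache_file)

print(restored["A*02:01"])
```

Opening the file in text mode ("wt" / "rt") lets json work with strings directly while gzip handles the compression.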

Installation

hladl was made with poetry and typer. It can be installed from PyPI:

pip install hladl

Usage

Getting the data

Sequences can be downloaded to the installed data directory using hladl init. Users specify the sequence type (nucleotide, protein, or both) with the -s flag, and the HLA allele digit resolution (i.e. 2, 4, 6, or 8 digit, being HLA-X*22:44:66:88) with the -d flag, like so:

# Download nucleotide (cDNA) sequences for 4 digit alleles
hladl init -s nuc -d 4
 
# Download protein (AA) sequences for 2 digit alleles
hladl init -s prot -d 2
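To make the digit resolution idea concrete, the following sketch trims a full allele name down to a requested resolution, treating each colon-separated field as two digits. The function name truncate_allele is hypothetical, and this README does not show how hladl itself collapses entries at each resolution, so treat this as an illustration of the naming scheme only. Expression suffixes (such as N or L on the final field) are not handled here.

```python
def truncate_allele(allele, digits):
    """Trim an HLA allele name to 2, 4, 6, or 8 digit resolution.

    Each colon-separated field counts as two digits, so 4 digits keeps
    the first two fields (A*02:01 from A*02:01:01:01).
    """
    if digits not in (2, 4, 6, 8):
        raise ValueError("digits must be one of 2, 4, 6, 8")
    gene, _, fields = allele.partition("*")
    if not fields:
        raise ValueError(f"no '*' separator found in {allele!r}")
    kept = fields.split(":")[: digits // 2]
    return f"{gene}*{':'.join(kept)}"


print(truncate_allele("A*02:01:01:01", 4))  # A*02:01
print(truncate_allele("A*02:01:01:01", 2))  # A*02
```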

The location of the data directory can be determined using the dd command:

hladl dd

# Will produce something like
/path/to/where/its/saving/stuff

Grabbing HLA sequences

Sequences can then be output to stdout using the seq command:

hladl seq -a DRA*01:01
hladl seq -a A*02 -s prot -d 2

Class I MHC protein sequences can also be automatically trimmed to remove leader and transmembrane/intracellular domains, yielding the extracellular domain, by specifying this in the mode option:

hladl seq -a A*02:01 -m ecd -s prot

Users can instead choose to produce a FASTA file of the designated allele using the -om / --output_mode flag, which saves the file to the current directory:

hladl seq -a B*07:02 -om fasta
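For readers unfamiliar with the format, a FASTA record is a header line starting with ">" followed by the sequence, usually wrapped to a fixed width. The sketch below builds such a record; the helper name to_fasta, the header text, and the line width are assumptions for illustration and are not taken from hladl's own output.

```python
def to_fasta(header, sequence, width=60):
    """Format one record as FASTA: a '>' header line, then wrapped sequence lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


# First 30 residues of the A*02:01 extracellular domain, wrapped at 10 for display.
record = to_fasta("A*02:01", "GSHSMRYFFTSVSRPGRGEPRFIAVGYVDD", width=10)
print(record, end="")
```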

Importing `hladl` for use inside other scripts

The main use case I wanted hladl for is importing it into other scripts, to allow easy in-line grabbing of HLA sequences. This can be done by simply importing the relevant components and calling the seq function:

from hladl.get_seq import seq
from hladl.main import data_dir

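# Positional arguments mirror the CLI options: allele, digits, sequence type, mode, output mode, data directory
seq1 = seq('A*02:01', 4, 'prot', 'ecd', 'stdout', data_dir)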
print(seq1)
GSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQRMEPRAPWIEQEGPEYWDGETRKVKAHSQTHRVDLGTLRGYYNQSEAGSHTVQRMYGCDVGSDWRFLRGYHQYAYDGKDYIALKEDLRSWTAADMAAQTTKHKWEAAHVAEQLRAYLEGTCVEWLRRYLENGKETLQRTDAPKTHMTHHAVSDHEATLRCWALSFYPAEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGQEQRYTCHVQHEGLPKPLTLRWEPSSQPTIPI

seq2 = seq('B*08:01', 4, 'nuc', 'full', 'stdout', data_dir)
print(seq2)
ATGCTGGTCATGGCGCCCCGAACCGTCCTCCTGCTGCTCTCGGCGGCCCTGGCCCTGACCGAGACCTGGGCCGGCTCCCACTCCATGAGGTATTTCGACACCGCCATGTCCCGGCCCGGCCGCGGGGAGCCCCGCTTCATCTCAGTGGGCTACGTGGACGACACGCAGTTCGTGAGGTTCGACAGCGACGCCGCGAGTCCGAGAGAGGAGCCGCGGGCGCCGTGGATAGAGCAGGAGGGGCCGGAGTATTGGGACCGGAACACACAGATCTTCAAGACCAACACACAGACTGACCGAGAGAGCCTGCGGAACCTGCGCGGCTACTACAACCAGAGCGAGGCCGGGTCTCACACCCTCCAGAGCATGTACGGCTGCGACGTGGGGCCGGACGGGCGCCTCCTCCGCGGGCATAACCAGTACGCCTACGACGGCAAGGATTACATCGCCCTGAACGAGGACCTGCGCTCCTGGACCGCGGCGGACACCGCGGCTCAGATCACCCAGCGCAAGTGGGAGGCGGCCCGTGTGGCGGAGCAGGACAGAGCCTACCTGGAGGGCACGTGCGTGGAGTGGCTCCGCAGATACCTGGAGAACGGGAAGGACACGCTGGAGCGCGCGGACCCCCCAAAGACACACGTGACCCACCACCCCATCTCTGACCATGAGGCCACCCTGAGGTGCTGGGCCCTGGGCTTCTACCCTGCGGAGATCACACTGACCTGGCAGCGGGATGGCGAGGACCAAACTCAGGACACTGAGCTTGTGGAGACCAGACCAGCAGGAGATAGAACCTTCCAGAAGTGGGCAGCTGTGGTGGTGCCTTCTGGAGAAGAGCAGAGATACACATGCCATGTACAGCATGAGGGGCTGCCGAAGCCCCTCACCCTGAGATGGGAGCCGTCTTCCCAGTCCACCGTCCCCATCGTGGGCATTGTTGCTGGCCTGGCTGTCCTAGCAGTTGTGGTCATCGGAGCTGTGGTCGCTGCTGTGATGTGTAGGAGGAAGAGCTCAGGTGGAAAAGGAGGGAGCTACTCTCAGGCTGCGTGCAGCGACAGTGCCCAGGGCTCTGATGTGTCTCTCACAGCTTGA

Inferring HLA alleles from sequence

Another task that I sometimes need to do when working with HLAs is to figure out what allele a given sequence derives from (most frequently when trying to determine the nature of an HLA found in a TCR-pMHC structure, which can be laborious to locate in the metadata and associated publications).

This can be achieved with the hladl infer command, which uses a tag-based Aho-Corasick string matching approach (inspired by the approach taken in the TCR annotation software Decombinator, and in particular autoDCR, my experimental TCR toolkit derived from it). In effect, it breaks each HLA allele (at a given resolution) into overlapping tag sequences, which populate a trie that is then used to search a given input string, with HLA alleles identified by the greatest number of tag matches.
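The tag-counting idea can be sketched in a few lines. Note that this toy version swaps the Aho-Corasick trie for a plain dictionary lookup over fixed-length tags, which gives the same kind of hit counts for fixed-length tags but without the trie's single-pass search. The function names, the toy reference sequences, and the 4-mer tag length are all assumptions for illustration, and hladl's own counting may differ in detail.

```python
from collections import Counter, defaultdict


def build_tag_index(references, tag_len):
    """Map every overlapping tag (k-mer) to the set of alleles containing it."""
    index = defaultdict(set)
    for allele, seq in references.items():
        for i in range(len(seq) - tag_len + 1):
            index[seq[i:i + tag_len]].add(allele)
    return index


def infer_alleles(query, index, tag_len):
    """Return (alphabetically sorted top-scoring alleles, tag count) for a query."""
    hits = Counter()
    for i in range(len(query) - tag_len + 1):
        for allele in index.get(query[i:i + tag_len], ()):
            hits[allele] += 1
    if not hits:
        return [], 0
    best = max(hits.values())
    return sorted(a for a, n in hits.items() if n == best), best


# Hypothetical toy references, far shorter than real HLA sequences.
refs = {
    "X*01:01": "ACDEFGHIKLMNPQRS",
    "X*01:02": "ACDEFGHIKLMNPQRT",
    "X*02:01": "ACDEYGHIKLMWPQRS",
}
index = build_tag_index(refs, tag_len=4)

print(infer_alleles("ACDEFGHIKLMNPQRT", index, tag_len=4))  # (['X*01:02'], 13)
print(infer_alleles("CDEFGHIKLMNPQ", index, tag_len=4))     # (['X*01:01', 'X*01:02'], 10)
```

The second query omits the one position that separates X*01:01 from X*01:02, so both alleles tie and are returned together.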

The command expects protein sequences by default, but can also infer from cDNA sequences by passing nuc to the --seqtype / -s flag:

# Using the HLA-A*02:01 sequence (produced with hladl seq)
hladl infer -s nuc ATGGCCGTCATGGCGCCCCGAACCCTCGTCCTGCTACTCTCGGGGGCTCTGGCCCTGACCCAGACCTGGGCGGGCTCTCACTCCATGAGGTATTTCTTCACATCCGTGTCCCGGCCCGGCCGCGGGGAGCCCCGCTTCATCGCAGTGGGCTACGTGGACGACACGCAGTTCGTGCGGTTCGACAGCGACGCCGCGAGCCAGAGGATGGAGCCGCGGGCGCCGTGGATAGAGCAGGAGGGTCCGGAGTATTGGGACGGGGAGACACGGAAAGTGAAGGCCCACTCACAGACTCACCGAGTGGACCTGGGGACCCTGCGCGGCTACTACAACCAGAGCGAGGCCGGTTCTCACACCGTCCAGAGGATGTATGGCTGCGACGTGGGGTCGGACTGGCGCTTCCTCCGCGGGTACCACCAGTACGCCTACGACGGCAAGGATTACATCGCCCTGAAAGAGGACCTGCGCTCTTGGACCGCGGCGGACATGGCAGCTCAGACCACCAAGCACAAGTGGGAGGCGGCCCATGTGGCGGAGCAGTTGAGAGCCTACCTGGAGGGCACGTGCGTGGAGTGGCTCCGCAGATACCTGGAGAACGGGAAGGAGACGCTGCAGCGCACGGACGCCCCCAAAACGCATATGACTCACCACGCTGTCTCTGACCATGAAGCCACCCTGAGGTGCTGGGCCCTGAGCTTCTACCCTGCGGAGATCACACTGACCTGGCAGCGGGATGGGGAGGACCAGACCCAGGACACGGAGCTCGTGGAGACCAGGCCTGCAGGGGATGGAACCTTCCAGAAGTGGGCGGCTGTGGTGGTGCCTTCTGGACAGGAGCAGAGATACACCTGCCATGTGCAGCATGAGGGTTTGCCCAAGCCCCTCACCCTGAGATGGGAGCCGTCTTCCCAGCCCACCATCCCCATCGTGGGCATCATTGCTGGCCTGGTTCTCTTTGGAGCTGTGATCACTGGAGCTGTGGTCGCTGCTGTGATGTGGAGGAGGAAGAGCTCAGATAGAAAAGGAGGGAGCTACTCTCAGGCTGCAAGCAGTGACAGTGCCCAGGGCTCTGATGTGTCTCTCACAGCTTGTAAAGTGTGA

Detected top-matching alleles: ['A*02:01']
Number of tags per hit: 108

# Or let's try it on the protein sequence of B*35:08 from PDB file 2AK4
hladl infer GSHSMRYFYTAMSRPGRGEPRFIAVGYVDDTQFVRFDSDAASPRTEPRAPWIEQEGPEYWDRNTQIFKTNTQTYRESLRNLRGYYNQSEAGSHIIQRMYGCDLGPDGRLLRGHDQSAYDGKDYIALNEDLSSWTAADTAAQITQRKWEAARVAEQRRAYLEGLCVEWLRRYLENGKETLQRADPPKTHVTHHPVSDHEATLRCWALGFYPAEITLTWQRDGEDQTQDTELVETRPAGDRTFQKWAAVVVPSGEEQRYTCHVQHEGLPKPLTLRWEP

Detected top-matching alleles: ['B*35:08']
Number of tags per hit: 26

Notes

If you run the hladl seq script without running the appropriate hladl init, it will try to download the appropriate sequences on the fly.
While the IMGTHLA repo does also store unspliced genomic DNA files, these are handled slightly differently, are much larger files, and frankly I don't need them in my pipelines right now, so they're not catered for yet.
Pseudogenes and other aberrant-length entries in the dataset cannot be used for ecd mode.
Note that by default the hladl infer trie uses 20-mer tags. Also note that the output is an (alphabetically sorted) list of all alleles which share the same top number of tag hits. This is because multiple alleles will often be indistinguishable, particularly at the amino acid level or at lower resolutions.
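The on-the-fly download mentioned in the first note follows a common "use the cache if present, otherwise fetch and store" pattern. The sketch below shows that control flow with a stand-in fetch function so that it runs offline; ensure_data, fake_fetch, and the file location are hypothetical names, not part of hladl.

```python
import tempfile
from pathlib import Path


def ensure_data(path, fetch):
    """Return cached text if the file exists; otherwise fetch, store, and return it."""
    path = Path(path)
    if path.exists():
        return path.read_text(encoding="utf-8")
    text = fetch()  # only reached when nothing has been cached yet
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return text


calls = []


def fake_fetch():
    """Stand-in for a real download so the sketch runs offline."""
    calls.append(1)
    return ">X*01:01\nACDEFGHIK\n"


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "toy.fasta"  # hypothetical location
    first = ensure_data(target, fake_fetch)    # triggers the fetch
    second = ensure_data(target, fake_fetch)   # served from disk

print(len(calls))  # 1
```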

Data licensing and information

This tool doesn't host or distribute any of the IPD-IMGT/HLA data; it just facilitates its download and distribution from that resource. Nor am I affiliated with them in any way.

The IMGTHLA data is hosted on their GitHub repo under a Creative Commons Attribution-NoDerivs License, meaning that users "are free to copy, distribute, display and make commercial use of the databases in all legislations", provided that suitable attribution is given.

For further details on the data, please see the following publications:

Barker D, Maccari G, Georgiou X, Cooper M, Flicek P, Robinson J, Marsh SGE. The IPD-IMGT/HLA Database. Nucleic Acids Research (2023), 51(D1): D948-D955
Robinson J, Barker D, Marsh SGE. 25 years of the IPD-IMGT/HLA Database. HLA (2024), 103(6): e15549
Robinson J, Malik A, Parham P, Bodmer JG, Marsh SGE. IMGT/HLA - a sequence database for the human major histocompatibility complex. Tissue Antigens (2000), 55: 280-287

Release history

0.2.0 (this version): Apr 26, 2025
0.1.2: Apr 23, 2025
0.1.1: Apr 17, 2025
0.1.0: Apr 16, 2025

