Download and combine HLA frequency data from multiple studies

HLAfreq

HLAfreq allows you to download and combine HLA allele frequencies from multiple datasets, e.g. combine data from several studies within a country or combine countries. Useful for studying regional diversity in immune genes and, when paired with epitope prediction, estimating a population's ability to mount an immune response to specific epitopes.

Automated download of allele frequency data from allelefrequencies.net.

Full documentation at HLAfreq/docs. Source code is available at BarinthusBio/HLAfreq.

Details

Estimates are combined by modelling allele frequency as a Dirichlet distribution which defines the probability of drawing each allele. When combining studies their estimates are weighted as 2x sample size by default. Sample size is doubled as each person in the study contributes two alleles. Alternative weightings can be used, for example population size when averaging across countries.
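
As a rough illustration of the weighting described above, the sketch below turns each study's frequencies into pseudo-counts weighted by 2x sample size, adds them to a flat prior, and reports the mean of the resulting Dirichlet distribution. The function name `combine_studies`, the prior of 1 per allele, and the example numbers are assumptions made for illustration; this is not HLAfreq's own implementation.

```python
def combine_studies(studies, prior=1.0):
    """Return the Dirichlet posterior mean frequency for each allele.

    studies: list of (frequencies, sample_size) pairs, where frequencies
             maps allele name -> reported frequency in that study.
    prior:   pseudo-count added to every allele (assumed flat prior).
    """
    alleles = sorted({allele for freqs, _ in studies for allele in freqs})
    # Start every concentration parameter at the prior pseudo-count.
    alpha = {allele: prior for allele in alleles}
    for freqs, sample_size in studies:
        weight = 2 * sample_size  # each person contributes two alleles
        for allele, freq in freqs.items():
            alpha[allele] += freq * weight
    total = sum(alpha.values())
    return {allele: alpha[allele] / total for allele in alleles}


# Two made-up studies of different sizes.
studies = [
    ({"A*01:01": 0.6, "A*02:01": 0.4}, 50),
    ({"A*01:01": 0.2, "A*02:01": 0.8}, 150),
]
combined = combine_studies(studies)
print(combined)
```

The larger study dominates the combined estimate because its weight (300 alleles) is three times that of the smaller study (100 alleles). Replacing `2 * sample_size` with another quantity, such as population size, gives an alternative weighting.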

When selecting a panel of HLA alleles to represent a population, allele frequency is not the only thing to consider. Depending on the purpose of the panel, you should include a range of loci and supertypes (grouped alleles sharing binding specificities).

Install

HLAfreq is a Python package available on Windows, macOS, and Linux. We recommend installing with conda.

conda create -n hlafreq -c conda-forge -c bioconda hlafreq
conda activate hlafreq

Troubleshooting

HLAfreq uses pymc to estimate credible intervals, which is the source of most installation difficulty; see the pymc installation guide and tips and tricks.

You may see a warning about g++ and degraded performance:

WARNING (pytensor.configdefaults): g++ not detected!  PyTensor will be unable to compile C-implementations and will default to Python. Performance may be severely degraded. To remove this warning, set PyTensor flags cxx to an empty string.

This means that one of the pymc backends is missing and estimating credible intervals will be very slow. Try one of the fixes below:

  • Set the channel priority to strict, then install as above (using conda-forge then bioconda channels).
conda config --set channel_priority strict
  • Install a conda compiler to handle g++ based on your os.
conda create -n hlafreq -c conda-forge -c bioconda hlafreq cxx-compiler

When running entire scripts on Windows, you may see an error about "Safe importing of main module", multiprocessing, and starting new processes. To fix this, main guard your code with if __name__ == "__main__": after the imports, as demonstrated in examples/quickstart.py.
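
A minimal sketch of the main guard pattern is shown here using only the standard library. The `square` and `main` functions are hypothetical stand-ins; in a real script the body of `main` would hold your HLAfreq calls.

```python
import multiprocessing


def square(x):
    # Stand-in for real work; replace with your own analysis code.
    return x * x


def main():
    # Worker processes are started here, inside the guarded entry point.
    with multiprocessing.Pool(processes=2) as pool:
        return pool.map(square, [1, 2, 3])


if __name__ == "__main__":
    # Runs only when the file is executed as a script, not when a child
    # process re-imports the module (as happens on Windows).
    print(main())
```

Without the guard, each new process would re-run the top-level code on import and try to start processes of its own, which is what triggers the error.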

If you do run into trouble please open an issue.

conda

If you're new to conda see the miniconda installation guide and documentation to get started with conda.

Enter the install command from above into your conda prompt to create and activate a conda environment with HLAfreq installed. Typing python into this activated environment will start a python session where you can enter your python code such as the HLAfreq minimal example below.

If you prefer to write your Python code as scripts using an IDE such as PyCharm or VS Code, you'll need to look up how to configure a conda virtual environment with those tools.

pip

If you don't intend to use credible intervals you can install with pip: pip install HLAfreq. However, if you do import HLAfreq_pymc you may get warnings about degraded performance.

See the pip documentation to get started with pip. If you do have issues with pip, try installing with conda as described above.

Minimal example

Download HLA data using HLAfreq.HLAfreq.makeURL() and HLAfreq.HLAfreq.getAFdata(). All arguments that can be specified in the webpage form are available, see the makeURL() docs for details.

import HLAfreq
base_url = HLAfreq.makeURL("Uganda", locus="A")
aftab = HLAfreq.getAFdata(base_url)

After downloading the data, it must be filtered so that the allele frequencies in each study sum to 1 (within tolerance). Then we must ensure that all studies report alleles at the same resolution. Finally, we can combine frequency estimates. For more details see the combineAF() API documentation.

aftab = HLAfreq.only_complete(aftab)
aftab = HLAfreq.decrease_resolution(aftab, 2)
caf = HLAfreq.combineAF(aftab)
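
To make the two preparation steps more concrete, here is a plain Python sketch of the ideas behind them. The helpers `is_complete` and `reduce_resolution`, the tolerance of 0.01, and the summing of alleles that become identical are all assumptions for illustration; they are not the HLAfreq functions and may differ from how the package behaves.

```python
def is_complete(freqs, tol=0.01):
    # True when a study's allele frequencies sum to 1 within tolerance.
    # The tolerance value is an assumed example, not HLAfreq's default.
    return abs(sum(freqs) - 1.0) <= tol


def reduce_resolution(allele, fields=2):
    # Keep only the first `fields` colon-separated fields,
    # e.g. A*01:01:01 becomes A*01:01.
    return ":".join(allele.split(":")[:fields])


# One made-up study reporting a mix of two- and three-field alleles.
study = {"A*01:01:01": 0.3, "A*01:01:02": 0.2, "A*02:01": 0.5}
print(is_complete(study.values()))

# Alleles that become identical after truncation are summed here (assumption).
collapsed = {}
for allele, freq in study.items():
    key = reduce_resolution(allele)
    collapsed[key] = collapsed.get(key, 0.0) + freq
print(collapsed)
```

Bringing every study to the same resolution first means that the same allele name refers to the same group of alleles in every study before estimates are combined.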

To add credible intervals to estimates see examples/quickstart.py.

Detailed examples

For more detailed walkthroughs see HLAfreq/examples.

Docs

Full documentation at HLAfreq/docs. API documentation for functions is under the submodules on the left.

  • HLAfreq.HLAfreq documents most functions, specifically those that download and combine allele data.
  • HLAfreq.HLAfreq_pymc contains functions using pymc to accurately estimate credible intervals on allele frequency estimates.

For help on specific functions view the docstring, help(function_name).

Run pdoc -d google -o docs/ HLAfreq to generate the documentation in ./docs.

Citation

Wells, D. A., & McAuley, M. (2023). HLAfreq: Download and combine HLA allele frequency data. bioRxiv, 2023-09. https://doi.org/10.1101/2023.09.15.557761
