
Rapid population clustering with autoencoders



Neural ADMIXTURE

Neural ADMIXTURE is an unsupervised global ancestry inference technique based on ADMIXTURE. By using neural networks, Neural ADMIXTURE offers high-quality ancestry assignments with a much shorter running time than ADMIXTURE's. For more information, we recommend reading the corresponding article.

The software can be invoked via the CLI and has an interface similar to ADMIXTURE's (e.g. the output format is completely interchangeable). While the software runs on both CPU and GPU, we recommend using GPUs if available to take advantage of the neural network-based implementation.


System requirements

Hardware requirements

Successful usage of this package requires a computer with enough RAM to handle the large datasets the network has been designed for. For this reason, we recommend using compute clusters whenever available to avoid memory issues.

Software requirements

The package has been tested on both Linux (CentOS 7.9.2009, Ubuntu 18.04.5 LTS) and macOS (Big Sur 11.2.3 on Intel, and Monterey 12.3.1 on M1). Using GPUs is highly recommended for optimal performance; make sure the CUDA drivers are properly installed.

Installation guide

We recommend creating a fresh Python 3.12 environment using conda (or virtualenv) and installing the package neural-admixture there. For conda, for example, launch the following commands:

$ conda create -n nadmenv python=3.12
$ conda activate nadmenv
(nadmenv) $ pip install neural-admixture

Important note: Using GPUs greatly speeds up processing and is recommended for large datasets.

Specify the number of GPUs (--num_gpus) and threads (--threads) available on your machine to optimize performance. For macOS users with an Apple Silicon chip, using --num_gpus 1 will enable MPS acceleration in the software. Note that although MPS acceleration is supported, the RAM available in laptops is typically limited, so larger datasets should be run on CUDA-capable GPUs, for which the software is more optimized.
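As a rough sketch, a sensible --num_gpus value can be probed from Python. The helper below is purely illustrative (it is not part of the neural-admixture CLI) and assumes the PyTorch backend is importable; without it, it falls back to CPU-only:

```python
def suggest_num_gpus() -> int:
    """Illustrative helper: pick a --num_gpus value from available devices."""
    try:
        import torch
    except ImportError:
        return 0  # no PyTorch visible: CPU-only execution
    if torch.cuda.is_available():
        return torch.cuda.device_count()  # one per CUDA device
    if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
        return 1  # Apple Metal (MPS) counts as a single device
    return 0

print(f"--num_gpus {suggest_num_gpus()}")
```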

Usage

Running Neural ADMIXTURE

To train a model from scratch, simply invoke the following commands from the root directory of the project. For more information about all the arguments, run neural-admixture train --help. If a single-head version of the network suffices, use the flag --k instead of --min_k and --max_k. Note that the BED, PGEN and VCF formats are currently supported.

For unsupervised Neural ADMIXTURE (single-head):

$ neural-admixture train --k K --name RUN_NAME --data_path DATA_PATH --save_dir SAVE_PATH --threads X

For unsupervised Neural ADMIXTURE (multi-head):

$ neural-admixture train --min_k K_MIN --max_k K_MAX --name RUN_NAME --data_path DATA_PATH --save_dir SAVE_PATH --threads X

For supervised Neural ADMIXTURE:

$ neural-admixture train --k K --pops_path POPS_PATH --name RUN_NAME --data_path DATA_PATH --save_dir SAVE_PATH --threads X # only single-head support at the moment

As an example, the following ADMIXTURE call

$ ./admixture snps_data.bed 8 -s 42

would be mimicked in Neural ADMIXTURE by running

$ neural-admixture train --k 8 --data_path snps_data.bed --save_dir SAVE_PATH --init_file INIT_FILE --name snps_data --seed 42 --threads X

Some parameters, such as the decoder initialization or the save directories, have no direct equivalent in ADMIXTURE.

Several files will be output to the SAVE_PATH directory (the name parameter will be used to create the whole filenames):

  • A .P file, analogous to ADMIXTURE's.
  • A .Q file, analogous to ADMIXTURE's.
  • A .pt file, containing the weights of the trained network.
  • A .json file, with the configuration of the network.

The last three files are required to run posterior inference with the network, so take care not to delete them accidentally! Logs are printed to stdout by default. To save them to a file, pipe the output through tee:

$ neural-admixture train --k 8 ... | tee run.log
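The .Q file follows ADMIXTURE's plain-text convention: one row per sample and one whitespace-separated column per cluster, with each row summing to one. The snippet below inspects such a matrix with NumPy; a tiny synthetic file stands in for a real run's output so the example is self-contained:

```python
import numpy as np

# Synthesize a tiny .Q-style matrix (5 samples, K=8) so this snippet is
# runnable; a real run writes the file under SAVE_PATH for you.
rng = np.random.default_rng(42)
np.savetxt("demo.Q", rng.dirichlet(np.ones(8), size=5), fmt="%.6f")

Q = np.loadtxt("demo.Q")  # rows: samples, cols: per-cluster fractions
assert Q.shape == (5, 8)
assert np.allclose(Q.sum(axis=1), 1.0, atol=1e-4)  # rows are compositions
print(Q[0])  # ancestry fractions of the first sample
```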

Inference mode (projective analysis)

ADMIXTURE allows reusing computations in the projective analysis mode, in which the P (F, frequencies) matrix is fixed to an already known result and only the assignments are computed. Due to the nature of our algorithm, assignments can be computed for unseen data by simply feeding the data through the encoder. This mode can be run by typing infer instead of train right after the neural-admixture call.

For example, assuming we have a trained Neural ADMIXTURE (named nadm_test) in the path ./outputs, one could run inference on unseen data (./data/unseen_data.bed) via the following command:

$ neural-admixture infer --name nadm_test --save_dir ./outputs --out_name unseen_nadm_test --data_path ./data/unseen_data.bed

For this command to work, the files ./outputs/nadm_test.pt and ./outputs/nadm_test_config.json, which are training outputs, must exist. In this case, only a .Q file will be created, containing the assignments for this data (the value of the --out_name flag is used to generate the output file name). This file is written to the --save_dir directory (in this case, ./outputs).
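Since inference cannot start without those two training outputs, a quick pre-flight check can save a failed run. The helper below is hypothetical but follows the naming scheme shown above:

```python
from pathlib import Path

def missing_artifacts(save_dir: str, run_name: str) -> list:
    """Return the training outputs that `neural-admixture infer` needs
    but cannot find under save_dir (naming as in the example above)."""
    base = Path(save_dir)
    required = [base / f"{run_name}.pt", base / f"{run_name}_config.json"]
    return [str(p) for p in required if not p.exists()]

# An empty list for ("./outputs", "nadm_test") means inference can run.
print(missing_artifacts("./outputs", "nadm_test"))
```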

Supervised Neural ADMIXTURE

The supervised version of the algorithm can be used when all samples have a corresponding population label. This can be very beneficial, especially when dealing with large imbalances in the data (e.g. the data contains 1K samples from Pop1 but only 50 from Pop2).

To use the supervised mode, the --pops_path argument must be passed, pointing to the file where the ancestries are defined. This file must be a single-column, headerless, plain-text file in which row i denotes the ancestry of the i-th sample in the data. Datasets containing samples with missing ancestries are currently not supported.
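For instance, a valid --pops_path file can be written with a few lines of Python (the labels and ordering below are purely illustrative):

```python
# One ancestry label per line, no header; row i must correspond to the
# i-th sample in the genotype file. Labels here are made up.
labels = ["Pop1", "Pop1", "Pop2", "Pop1", "Pop2"]

with open("pops.txt", "w") as fh:
    fh.write("\n".join(labels) + "\n")

# Sanity check: exactly one non-empty label per sample
# (missing ancestries are not supported).
rows = open("pops.txt").read().splitlines()
assert len(rows) == len(labels) and all(rows)
```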

The supervised mode works by adding a scaled classification loss at the bottleneck of the network (Equation 5 of the paper). The scaling factor can have a big impact on performance. If it is too small, the supervised loss has little effect on training, so results will resemble an unsupervised run. If it is too large, the supervision dominates training and makes the network overconfident in its predictions: essentially, one gets only binary assignments. The scaling factor defaults to η = 100 and can be controlled with the parameter --supervised_loss_weight.
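Numerically, the trade-off looks like this. The snippet is a toy sketch of the scaled sum only, not the exact terms of Equation 5:

```python
# Schematically: total = reconstruction loss + eta * classification loss.
def total_loss(recon: float, clf: float, eta: float = 100.0) -> float:
    return recon + eta * clf

# With the default eta = 100, even a modest classification loss dominates:
print(total_loss(0.5, 0.02))           # 0.5 + 100 * 0.02 = 2.5
# With a small eta, supervision barely nudges training:
print(total_loss(0.5, 0.02, eta=0.1))  # 0.5 + 0.1 * 0.02 = 0.502
```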

In short, if you are getting single-ancestry estimates on validation data where you expect admixed estimates, try a smaller value for the supervised loss scaling factor η (--supervised_loss_weight).

Moreover, note that the chosen initialization method has no effect in this setting, as the supervised initialization is always used in the supervised version.

Other options

  • batch_size: number of samples used at every update. If you have memory issues, try setting a lower batch size. Defaults to 800.
  • n_components: dimension of the PCA projection for SVD. Defaults to 8.
  • epochs: maximum number of times the whole training dataset is used to update the weights. Defaults to 250.
  • learning_rate: dictates how large an update to the weights will be. If you find the loss function oscillating, try setting a lower value. If convergence is slow, try setting a higher value. Defaults to 25e-4.
  • seed: RNG seed for replication purposes. Defaults to 42.
  • num_gpus: number of GPUs to use during training. Set to 0 for CPU-only execution. Defaults to 0.

Experiments replication

The datasets All-Chms, Chm-22 and Chm-22-Sim used in the Experiments section of the article can be found on figshare. For descriptions of the datasets, please refer to the corresponding section of the paper. The exact hyperparameters used in the experiments, provided to allow replication, can be found in Supplementary Table 3 of the article.

Demo

To run the software with a small demo dataset, check the instructions in the corresponding folder of the repository.

Troubleshooting

CUDA issues

If you get an error similar to the following (when using GPU):

OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.

simply installing nvcc using conda/mamba should fix it:

$ conda install -c nvidia nvcc

License

NOTICE: This software is available for use free of charge for academic research use only. Academic users may fork this repository and modify and improve to suit their research needs, but also inherit these terms and must include a licensing notice to that effect. Commercial users, for profit companies or consultants, and non-profit institutions not qualifying as "academic research" should contact the authors for a separate license. This applies to this repository directly and any other repository that includes source, executables, or git commands that pull/clone this repository as part of its function. Such repositories, whether ours or others, must include this notice.

Cite

When using this software, please cite the following paper:

@article{dominguezmantes23,
	abstract = {Characterizing the genetic structure of large cohorts has become increasingly important as genetic studies extend to massive, increasingly diverse biobanks. Popular methods decompose individual genomes into fractional cluster assignments with each cluster representing a vector of DNA variant frequencies. However, with rapidly increasing biobank sizes, these methods have become computationally intractable. Here we present Neural ADMIXTURE, a neural network autoencoder that follows the same modeling assumptions as the current standard algorithm, ADMIXTURE, while reducing the compute time by orders of magnitude surpassing even the fastest alternatives. One month of continuous compute using ADMIXTURE can be reduced to just hours with Neural ADMIXTURE. A multi-head approach allows Neural ADMIXTURE to offer even further acceleration by computing multiple cluster numbers in a single run. Furthermore, the models can be stored, allowing cluster assignment to be performed on new data in linear time without needing to share the training samples.},
	author = {Dominguez Mantes, Albert and Mas Montserrat, Daniel and Bustamante, Carlos D. and Gir{\'o}-i-Nieto, Xavier and Ioannidis, Alexander G.},
	doi = {10.1038/s43588-023-00482-7},
	id = {Dominguez Mantes2023},
	isbn = {2662-8457},
	journal = {Nature Computational Science},
	title = {Neural ADMIXTURE for rapid genomic clustering},
	url = {https://doi.org/10.1038/s43588-023-00482-7},
	year = {2023},
	bdsk-url-1 = {https://doi.org/10.1038/s43588-023-00482-7}}

