Neural ADMIXTURE

Rapid population clustering with autoencoders

Neural ADMIXTURE is an unsupervised global ancestry inference technique based on ADMIXTURE. By using neural networks, Neural ADMIXTURE offers high-quality ancestry assignments with a running time much faster than ADMIXTURE's. For more information, we recommend reading our corresponding article.

The software can be invoked via CLI and has a similar interface to ADMIXTURE (e.g. the output format is completely interchangeable). While the software runs on both CPU and GPU, we recommend using GPUs, if available, to take advantage of the neural network-based implementation.

System requirements

Hardware requirements

This package requires a computer with enough RAM to handle the large datasets the network has been designed to work with. For this reason, we recommend using compute clusters whenever available to avoid memory issues.

Software requirements

The package has been tested on both Linux (CentOS 7.9.2009, Ubuntu 18.04.5 LTS) and macOS (Big Sur 11.2.3 on Intel; Monterey 12.3.1 on M1). Using GPUs is highly recommended for optimal performance; make sure CUDA drivers are properly installed.

Installation guide

We recommend creating a fresh Python 3.12 environment using conda (or virtualenv) and installing the package neural-admixture there. For conda, one would launch the following commands:

$ conda create -n nadmenv python=3.12
$ conda activate nadmenv
(nadmenv) $ pip install neural-admixture
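
After installation, you can check that the CLI is available by printing the built-in help:

(nadmenv) $ neural-admixture train --help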

Important note: Using GPUs greatly speeds up processing and is recommended for large datasets.

Specify the number of GPUs (--num_gpus) and CPUs (--num_cpus) available on your machine to optimize performance. For macOS users with an Apple silicon chip, passing --num_gpus 1 enables MPS acceleration. Note that although MPS acceleration is supported, the RAM available on laptops is typically limited, so larger datasets should be run on CUDA-capable GPUs, for which the software is better optimized.
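
For instance, on a machine with a single GPU and 8 CPU cores, a training call could look like the following (the flag values here are purely illustrative):

$ neural-admixture train --k 7 --name example_run --data_path DATA_PATH --save_dir SAVE_PATH --num_gpus 1 --num_cpus 8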

Usage

Running Neural ADMIXTURE

To train a model from scratch, invoke the following commands from the root directory of the project. For more information about all the arguments, run neural-admixture train --help. If training a single-head version of the network suffices, use the flag --k instead of --min_k and --max_k. BED, PGEN and VCF formats are currently supported.

For unsupervised Neural ADMIXTURE (single-head):

$ neural-admixture train --k K --name RUN_NAME --data_path DATA_PATH --save_dir SAVE_PATH

For unsupervised Neural ADMIXTURE (multi-head):

$ neural-admixture train --min_k K_MIN --max_k K_MAX --name RUN_NAME --data_path DATA_PATH --save_dir SAVE_PATH

For supervised Neural ADMIXTURE:

$ neural-admixture train --k K --supervised --populations_path POPS_PATH --name RUN_NAME --data_path DATA_PATH --save_dir SAVE_PATH # only single-head support at the moment

As an example, the following ADMIXTURE call

$ ./admixture snps_data.bed 8 -s 42

would be mimicked in Neural ADMIXTURE by running

$ neural-admixture train --k 8 --data_path snps_data.bed --save_dir SAVE_PATH --init_file INIT_FILE --name snps_data --seed 42

with some parameters, such as the decoder initialization and the save directory, having no direct ADMIXTURE equivalent.

Several files will be written to the SAVE_PATH directory (the --name parameter is used to construct the file names):

  • A .P file, similar to ADMIXTURE.
  • A .Q file, similar to ADMIXTURE.
  • A .pt file, containing the weights of the trained network.
  • A .json file, with the configuration of the network.

The last three files are required to run posterior inference with the network, so take care not to delete them accidentally! Logs are printed to stdout by default. To save them to a file, pipe the output through tee:

$ neural-admixture train --k 8 ... | tee run.log
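
Since the .Q and .P files follow ADMIXTURE's plain-text matrix format, they can be inspected with standard tools. The following is a minimal Python sketch, assuming numpy is installed and using an illustrative file name (adapt the path and run name to your own outputs):

import numpy as np

# Cluster assignments: one row per sample, one column per cluster.
Q = np.loadtxt("SAVE_PATH/snps_data.Q")

print(Q.shape)        # (n_samples, K)
print(Q.sum(axis=1))  # each row holds proportions, so sums should be ~1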

Inference mode (projective analysis)

ADMIXTURE allows reusing computations in the projective analysis mode, in which the P (F, frequencies) matrix is fixed to an already known result and only the assignments are computed. Due to the nature of our algorithm, assignments can be computed for unseen data by simply feeding the data through the encoder. This mode can be run by typing infer instead of train right after the neural-admixture call.

For example, assuming we have a trained Neural ADMIXTURE (named nadm_test) in the path ./outputs, one could run inference on unseen data (./data/unseen_data.bed) via the following command:

$ neural-admixture infer --name nadm_test --save_dir ./outputs --out_name unseen_nadm_test --data_path ./data/unseen_data.bed

For this command to work, the files ./outputs/nadm_test.pt and ./outputs/nadm_test_config.json, which are training outputs, must exist. In this case, only a .Q file will be created, containing the assignments for this data (the value of the out_name flag is used to generate the output file name). This file will be written to the --save_dir directory (in this case, ./outputs).

Supervised Neural ADMIXTURE

The supervised version of the algorithm can be used when all samples have a corresponding population label. This can be very beneficial, especially when dealing with large imbalances in the data (e.g., 1,000 samples from Pop1 but only 50 from Pop2).

In order to use the supervised mode, the --populations_path argument pointing to the file where the ancestries are defined must be passed. This file must be a single-column, headerless, plain-text file in which row i denotes the ancestry of the i-th sample in the data. Datasets containing samples with missing ancestries are currently not supported.
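
For illustration, a valid populations file for five samples could look as follows (the population labels are hypothetical):

Pop1
Pop1
Pop2
Pop1
Pop2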

The supervised mode works by adding a scaled classification loss at the bottleneck of the algorithm (Equation 5 of the paper). The scaling factor can have a large impact on performance. If it is too small, the supervised loss has little effect on training, and results will be similar to an unsupervised run. If it is too large, supervision dominates training and makes the network overconfident in its predictions: essentially, one gets only binary assignments. The scaling factor defaults to η = 100 and can be controlled with the parameter --supervised_loss_weight.
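
Schematically, and glossing over the exact notation of Equation 5, the supervised objective combines the reconstruction loss with the scaled classification loss:

L = L_reconstruction + η · L_classification

so larger values of η (--supervised_loss_weight) push the network toward matching the provided labels at the expense of the reconstruction term.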

As a rule of thumb, if you get single-ancestry estimates on validation data where you expect admixed estimates, try a smaller value for the supervised loss scaling factor η (--supervised_loss_weight).

Moreover, note that the initialization method chosen will have no effect here, as the supervised initialization is always used when running the supervised version.

Other options

  • --batch_size: number of samples used at every update. If you run into memory issues, try a lower batch size. Defaults to 800.
  • --n_components: dimension of the PCA projection for SVD. Defaults to 8.
  • --epochs: maximum number of passes over the whole training dataset to update the weights. Defaults to 250.
  • --learning_rate: dictates how large each update to the weights is. If the loss function oscillates, try a lower value; if convergence is slow, try a higher one. Defaults to 25e-4 (0.0025).
  • --seed: RNG seed for reproducibility. Defaults to 42.
  • --num_gpus: number of GPUs to use during training. Set to 0 for CPU-only execution. Defaults to 0.
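
Putting several of these options together, an illustrative training call might look like this (the values are examples, not tuned recommendations):

$ neural-admixture train --k 8 --name snps_data --data_path snps_data.bed --save_dir SAVE_PATH --batch_size 400 --learning_rate 1e-3 --epochs 100 --seed 42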

Experiments replication

The datasets All-Chms, Chm-22 and Chm-22-Sim used in the Experiments section of the article can be found on figshare. For descriptions of the datasets, please refer to the corresponding section of the paper. The exact hyperparameters used in the experiments, which allow replicating the results, can be found in Supplementary Table 3 of the article.

Demo

To run the software with a small demo dataset, check the instructions in the corresponding folder of the repository.

Troubleshooting

CUDA issues

If you get an error similar to the following (when using GPU):

OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.

simply installing nvcc using conda/mamba should fix it:

$ conda install -c nvidia nvcc
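
You can then confirm that the CUDA compiler is visible in your environment with:

$ nvcc --version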

License

NOTICE: This software is available for use free of charge for academic research use only. Academic users may fork this repository and modify and improve to suit their research needs, but also inherit these terms and must include a licensing notice to that effect. Commercial users, for profit companies or consultants, and non-profit institutions not qualifying as "academic research" should contact the authors for a separate license. This applies to this repository directly and any other repository that includes source, executables, or git commands that pull/clone this repository as part of its function. Such repositories, whether ours or others, must include this notice.

Cite

When using this software, please cite the following paper:

@article{dominguezmantes23,
	abstract = {Characterizing the genetic structure of large cohorts has become increasingly important as genetic studies extend to massive, increasingly diverse biobanks. Popular methods decompose individual genomes into fractional cluster assignments with each cluster representing a vector of DNA variant frequencies. However, with rapidly increasing biobank sizes, these methods have become computationally intractable. Here we present Neural ADMIXTURE, a neural network autoencoder that follows the same modeling assumptions as the current standard algorithm, ADMIXTURE, while reducing the compute time by orders of magnitude surpassing even the fastest alternatives. One month of continuous compute using ADMIXTURE can be reduced to just hours with Neural ADMIXTURE. A multi-head approach allows Neural ADMIXTURE to offer even further acceleration by computing multiple cluster numbers in a single run. Furthermore, the models can be stored, allowing cluster assignment to be performed on new data in linear time without needing to share the training samples.},
	author = {Dominguez Mantes, Albert and Mas Montserrat, Daniel and Bustamante, Carlos D. and Gir{\'o}-i-Nieto, Xavier and Ioannidis, Alexander G.},
	doi = {10.1038/s43588-023-00482-7},
	id = {Dominguez Mantes2023},
	isbn = {2662-8457},
	journal = {Nature Computational Science},
	title = {Neural ADMIXTURE for rapid genomic clustering},
	url = {https://doi.org/10.1038/s43588-023-00482-7},
	year = {2023},
	bdsk-url-1 = {https://doi.org/10.1038/s43588-023-00482-7}}
