
LinkedNN

Neural network for extracting LD features from SNPs


Installation

pip install linkedNN

To test the installation you can apply the pretrained model from the paper to predict from a simulated dataset:

linkedNN --wd Example_data/ --seed 1 --predict

The LD layer by itself can be accessed using:

from linkedNN.models import ld_layer

GPU compatibility: The code should work out of the box on a CPU, but to train on GPUs you need a PyTorch build that matches the CUDA version on your machine:

mamba install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

Usage

The following are explanations of command-line flags for linkedNN.

Preprocessing

The program trains on datasets simulated using msprime or SLiM. Before training, simulated tree sequences are preprocessed to (i) add mutations, (ii) sample SNPs, and (iii) write binary files. Preprocessing can be applied to an individual simulation, a range of simulation IDs, or all simulations in the specified directory; toggle this using the --simid flag. The working directory for linkedNN must itself contain a folder of tree sequences called TreeSeqs/ and a separate folder of the corresponding targets called Targets/, the latter saved in ".npy" format.
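As a concrete illustration, a working directory might be set up as follows. Only the TreeSeqs/ and Targets/ folder names and the ".npy" target format come from the description above; the per-simulation file name and target value below are hypothetical.

```python
import os
import numpy as np  # used to write targets in ".npy" format

# Hypothetical working directory layout for linkedNN preprocessing.
# The folders TreeSeqs/ and Targets/ are required; the file name
# "1.npy" and the target value are illustrative only.
wd = "tempdir"
os.makedirs(os.path.join(wd, "TreeSeqs"), exist_ok=True)
os.makedirs(os.path.join(wd, "Targets"), exist_ok=True)

# Each simulation needs a corresponding target saved as ".npy".
example_target = np.array([1000.0])  # e.g., a made-up population-size target
np.save(os.path.join(wd, "Targets", "1.npy"), example_target)

print(sorted(os.listdir(wd)))  # ['Targets', 'TreeSeqs']
```

Tree sequences produced by your simulation script would go in TreeSeqs/, with one target file per simulation in Targets/.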

Example preprocessing command:

linkedNN --preprocess \
         --wd <path> \
         --seed <int> \
         --num_snps <int> \
         --n <int> \
         --l <int> \
         --hold_out <int> \
         --simid <int>
  • preprocess: runs the preprocessing pipeline.
  • wd: path to output directory.
  • seed: random number seed (> 0). The seed determines the names of the output files, so it's important to use different seeds for different analyses.
  • num_snps: fixed number of SNPs to extract; it is recommended to use the number in your empirical dataset.
  • n: number of diploid individuals; it is recommended to use the n from your empirical dataset.
  • l: chromosome length; it is recommended to use the l from your empirical dataset.
  • hold_out: number of simulations from the full set to hold out for testing.
  • simid: (optional) either (i) an individual simulation ID or (ii) a comma-separated range of IDs; if omitted, all IDs are preprocessed.
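The --simid semantics above can be illustrated with a small sketch. This parsing logic is illustrative only, not linkedNN's actual implementation:

```python
def expand_simid(simid):
    """Interpret a --simid value: a single ID like "3", or a
    comma-separated range like "1,100". Illustrative sketch only;
    not the parser linkedNN actually uses."""
    if simid is None:
        return None  # flag omitted: preprocess all simulations
    parts = [int(p) for p in simid.split(",")]
    if len(parts) == 1:
        return parts  # a single simulation ID
    lo, hi = parts
    return list(range(lo, hi + 1))  # inclusive range of IDs

print(expand_simid("3"))    # [3]
print(expand_simid("1,5"))  # [1, 2, 3, 4, 5]
```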

Training

After preprocessing all simulations, linkedNN can train a model using:

linkedNN --train \
         --wd <path> \
         --seed <int> \
         --batch_size <int>
  • train: runs the training pipeline.
  • batch_size: the size of training mini-batches.

Testing

To predict on held-out test data, run:

linkedNN --predict \
         --wd <path> \
         --seed <int> \
         --batch_size <int>

Empirical applications

To predict from an empirical VCF: keep rare alleles in the file, subset it to a single chromosome, and run the command below.
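The chromosome-subsetting step can be done with standard tools such as bcftools; as a plain-text sketch of what it amounts to, the following stdlib-only function keeps the VCF header plus the records for one chromosome (the function name and the toy records are made up for illustration):

```python
def subset_vcf_chrom(in_lines, chrom):
    """Keep VCF header lines (starting with '#') plus records whose
    CHROM column equals `chrom`. A text-filtering sketch; tools like
    bcftools are the usual choice for real data."""
    kept = []
    for line in in_lines:
        if line.startswith("#") or line.split("\t", 1)[0] == chrom:
            kept.append(line)
    return kept

# Toy VCF with two records on different chromosomes.
vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tT\t.\t.\t.",
    "chr2\t200\t.\tG\tC\t.\t.\t.",
]
print(subset_vcf_chrom(vcf, "chr1"))  # header lines + the chr1 record only
```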

linkedNN --predict \
         --wd <path> \
         --seed <int> \
         --batch_size <int> \
         --empirical <path>
  • empirical: path and prefix of the VCF file (without the ".vcf" extension).

Vignette

Below is a complete example workflow with LinkedNN, to give a sense of what inputs and outputs to expect at each stage of the pipeline.

Simulating training data

LinkedNN expects tree sequences as input, so you can use any program that produces them, e.g., msprime, SLiM, or tsinfer. For this vignette, we will run one hundred small simulations using a script provided in the GitHub repo. Note, however, that on the order of 50,000 simulations and hundreds of training epochs may be required to train successfully.

git clone https://github.com/the-smith-lab/LinkedNN
for i in {1..100}
do
    echo "simulation ID $i"
    python LinkedNN/Misc/sim_demog.py $i 500,1e3 1e2,1e3 1e2,1e3 tempdir/
done

Preprocess

linkedNN --preprocess --wd tempdir/ --seed 2 --num_snps 5000 --n 10 --l 1e8 --hold_out 25

Train

linkedNN --train --wd tempdir/ --seed 2 --batch_size 10 --max_epochs 10

The max_epochs flag, not used in the earlier commands, limits the number of training epochs (default: 1000).

Test

linkedNN --predict --wd tempdir/ --seed 2 --batch_size 10

How to cite:

Source code: github.com/the-smith-lab/LinkedNN
