Deep Neural Architecture for DNA

DNADNA

Deep Neural Architecture for DNA.

The goal of this package is to provide utility functions to improve development of neural networks for population genetics.

dnadna should allow researchers to focus on their research project, be it the analysis of population genetic data or building new methods, without having to spend time on development methodology (unit tests, continuous integration, documentation, etc.). Results will thus be more easily reproduced and shared. Having a common interface will also decrease the risk of bugs.

Installation

Because DNADNA has some non-trivial dependencies (most notably PyTorch) the easiest way to install it is in a conda environment.

Follow the official instructions to install either the full Anaconda distribution or, preferably, the smaller Miniconda distribution.

Once the first release of DNADNA is out we will have an official conda package. In the meantime we have prepared conda environments containing all the required dependencies, which you can install by running:

$ conda env create --name dnadna --file https://gitlab.com/mlgenetics/dnadna/-/raw/master/environment-cuda.yml

Once the environment is built you can activate it and install the latest version of DNADNA from the git repository like:

$ conda activate dnadna
$ pip install git+https://gitlab.com/mlgenetics/dnadna.git

Note: For a lighter-weight environment without GPU support you can replace environment-cuda.yml with environment-cpu.yml.

Alternatively, you may clone the DNADNA git repository and install from there:

$ git clone https://gitlab.com/mlgenetics/dnadna.git
$ cd dnadna
$ conda env create
$ conda activate dnadna
$ pip install .

Or using the lighter CPU-only environment:

$ git clone https://gitlab.com/mlgenetics/dnadna.git
$ cd dnadna
$ conda env create --file environment-cpu.yml
$ conda activate dnadna-cpu
$ pip install .

If you plan to do development on the package, it is advisable to choose the git clone solution and install in "editable" mode by running instead:

$ pip install -e .

Note for internal users/developers

Users of the private repository on https://gitlab.inria.fr should see the development documentation for notes on how to access the private repository.

Docker

A Dockerfile is also included for building a Docker image containing DNADNA. It is based on conda, so it essentially recreates the installation environment explained above, including GPU support. To build the image run:

$ docker build --tag dnadna .

Make sure to run this from the root of the repository, as the entire repository needs to be passed to the Docker build context.

If run without specifying any further commands, it will open a shell with the dnadna conda environment enabled:

$ docker run -ti dnadna

However, you can also run it non-interactively, e.g. by specifying the dnadna command. Here it is also likely a good idea to mount the data directory for your simulation/training files. For example:

$ docker run -t -v /path/to/my/data:/data --workdir /data dnadna dnadna init

Additional notes on the Docker image:

  • By default it logs in as a non-root user named "dnadna" whose UID and GID are both 1001. If you start the container with the --user flag and your own UID it will change the UID and GID of the "dnadna" user. Make sure to specify both a UID and a GID, otherwise the group of all files owned by "dnadna" will be changed to "root". E.g., run: docker run -u $(id -u):$(id -g).

  • The default workdir is /home/dnadna/dnadna which contains the dnadna package source code.

  • If you want to install your own conda packages (e.g. to use this container for development), after starting the container it is best to create a new conda environment cloned from the base environment, then re-install dnadna:

    $ conda create -n dnadna --clone base
    $ conda activate dnadna
    $ pip install -e .
    

    This is not done by default by the image because it would require additional start-up time in the non-development case. However, as a short-cut for the above steps you can run the container with docker run -e DEV=1.

    Afterwards, you can create a snapshot of this container in another terminal, e.g. by running docker commit <my-container> dnadna-dev.

Dependencies

  • python >= 3.6
  • pytorch
  • pandas
  • numpy
  • matplotlib
  • msprime
  • jsonschema
  • pyyaml
  • tqdm

(For a complete list, see setup.cfg, or requirements.txt.)

Quickstart Tutorial

After successful installation you should have a command-line utility called dnadna installed:

$ dnadna --help
usage: dnadna [-h] [COMMAND]

dnadna version ... top-level command.

See dnadna <sub-command> --help for help on individual sub-commands.

optional arguments:
  -h, --help       show this help message and exit
  --plugin PLUGIN  load a plugin module; the plugin may be specified either as the file path to a Python module, or the name of a module importable on the current Python module path (i.e. sys.path); plugins are just Python modules which may load arbitrary code (new simulators, loss functions, etc.) during DNADNA startup); --plugin may be passed multiple times to load multiple plugins
  --trace-plugins  enable tracing of plugin loading; for most commands this is enabled by default, but for other commands is disabled to reduce noise; this forces it to be enabled
  -V, --version    show the dnadna version and exit

sub-commands:
  init          Initialize a new model training configuration and directory
                structure.
  preprocess    Prepare a training run from an existing model training
                configuration.
  train         Train a model on a simulation using a specified pre-processed
                training config.
  predict       Make parameter predictions on existing SNP data using an already
                trained model.
  sumstats      Compute summary statistics on a simulation data set.
  simulation    Run a registered simulation.

This implements a number of different sub-commands for different training and simulation steps. You can either start from an existing simulation dataset (which may first need to be converted to the DNADNA Dataset Format), or use dnadna's simulator interface to create a new simulation dataset.

Here we step through the complete process from configuring and generating a simulation, to running data pre-processing on the simulation, and training a network based off that simulation.

If you already have simulation data in the DNADNA Dataset Format you can skip straight to the initialization step.

Simulation initialization and configuration

To initialize a simulation, we must first generate a config file and output folder for it, using the dnadna simulation init command:

$ dnadna simulation init my_model one_event
Writing sample simulation config to my_model/my_model_simulation_config.yml ...
Edit the config file as needed, then run this simulation with the command:

    dnadna simulation run my_model/my_model_simulation_config.yml

This will create a directory in the current directory, named my_model/, and initialize it with a config file pre-populated with sample parameters for the built-in one_event example simulator.

Before running the simulation we may want to adjust some of the parameters. Open my_model/my_model_simulation_config.yml in your favorite text editor. By default we see that n_scenarios is 20000 with n_replicates of 100 per scenario. This means 2 million simulations, which is a good number for training a real model but will take a very long time for a simple demo. Change these to some lower numbers like 20 and 2. You may also set the seed option to seed the random number generator for reproducible results. The resulting file (with none of the other settings changed) should look like:

# my_model/my_model_simulation_config.yml
data_root: .
n_scenarios: 20
n_replicates: 2
seed: 2
...
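
The total number of simulations is the product of the two settings. As a quick sanity check of the figures quoted above (the helper name total_simulations is only illustrative and is not part of DNADNA):

```python
def total_simulations(n_scenarios: int, n_replicates: int) -> int:
    """Each scenario is simulated n_replicates times."""
    return n_scenarios * n_replicates


# Default values from the sample config, then the reduced demo values.
default_total = total_simulations(20000, 100)
demo_total = total_simulations(20, 2)

print(default_total)  # 2000000
print(demo_total)     # 40
```

With the demo settings, 40 replicate files are expected in total.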

Running the simulation

Now to run the simulation we configured, we run dnadna simulation run, passing it the path to the config file we just edited. If you run this in a terminal it will also display a progress bar:

$ dnadna simulation run my_model/my_model_simulation_config.yml
... INFO;  Running one_event simulator with n_scenarios=20 and n_replicates=2
... INFO;  Simulation complete!
... INFO;  Initialize model training with the command:
... INFO;
... INFO;      dnadna init --simulation-config=my_model/my_model_simulation_config.yml

Model initialization

The main command for initializing DNADNA is dnadna init, which assumes we already have a simulation (such as the one we just generated) in the standard DNADNA Dataset Format. Although this command can be run without any arguments (producing a default config file), if we pass it the path to our simulation config file it will output a config file appropriate for use with that simulation:

$ dnadna init --simulation-config=my_model/my_model_simulation_config.yml
Writing sample preprocessing config to my_model/my_model_preprocessing_config.yml ...
Edit the dataset and/or preprocessing config files as needed, then run preprocessing with the command:

    dnadna preprocess my_model/my_model_preprocessing_config.yml

After running dnadna init, it is expected that the user will manually edit the sample config file that it outputs, in order to exactly specify how they want to train their model, and on which parameters. In fact, the default template is going to be good enough for our demo simulation, except for one bit that will give us trouble.

The option dataset_splits: has a default value meaning 70% of our scenarios will be used for training, and only 30% for validation. Since we only have 20 scenarios in this quick demo, the validation set would be too small. Open the file my_model/my_model_preprocessing_config.yml in your editor and change this so that our dataset is split 50/50 between training and validation:

# my_model/my_model_preprocessing_config.yml
# ...
dataset_splits:
    training: 0.5
    validation: 0.5

Under normal use you would set these ratios however you prefer. You can also include a test set of scenarios to be set aside for testing your model.
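
To see why the default split is tight for 20 scenarios, the following sketch estimates how many scenarios fall into each split. It assumes simple rounding of n_scenarios × ratio; how DNADNA itself assigns scenarios to splits is not described here and may differ.

```python
def split_counts(n_scenarios, ratios):
    """Approximate number of scenarios per split.

    Assumption: each split gets round(n_scenarios * ratio) scenarios.
    """
    return {name: round(n_scenarios * ratio) for name, ratio in ratios.items()}


default_split = split_counts(20, {"training": 0.7, "validation": 0.3})
demo_split = split_counts(20, {"training": 0.5, "validation": 0.5})

print(default_split)  # {'training': 14, 'validation': 6}
print(demo_split)     # {'training': 10, 'validation': 10}
```

Under that assumption the default leaves only 6 validation scenarios, while the 50/50 split gives 10 scenarios to each set.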

Pre-processing

Before training a model, some data pre-processing must be performed on the data set; the output of this pre-processing can depend on the settings in the preprocessing config file that was output by dnadna init. To do this, simply run:

$ dnadna preprocess my_model/my_model_preprocessing_config.yml
... INFO;  Removing scenarios with:
... INFO;   - Missing replicates
... INFO;   - Fewer than 500 SNPs
... INFO;  ...
... INFO;  Using ... CPU for checking scenarios
... INFO;  20 scenarios out of 20 have been kept, representing 40 simulations
... INFO;  Splitting scenarios between training and validation set
... INFO;  Standardizing continuous parameters
... INFO;  Writing preprocessed scenario parameters to: .../my_model/my_model_preprocessed_params.csv
... INFO;  Writing sample training config to: .../my_model/my_model_training_config.yml
... INFO;  Edit the training config file as needed, then start the training run with the command:
... INFO;
... INFO;      dnadna train .../my_model/my_model_training_config.yml

This will produce a <model_name>_training_config.yml file containing the config file prepared for training your model.

Training

To run a model training, after pre-processing use dnadna train, giving it the path to the pre-processed training config file as output by the last step.

In order to make the training run a little faster (just for this example) let's also edit the training config file to limit it to one epoch:

# my_model/my_model_training_config.yml
# ...
# name and parameters of the neural net model to train
network:
    name: CustomCNN

# number of epochs over which to repeat the training process
n_epochs: 1
...

Then run dnadna train on the training config file:

$ dnadna train my_model/my_model_training_config.yml
... INFO;  Preparing training run
... INFO;  20 samples in the validation set and 20 in the training set
... INFO;  Start training
... INFO;  Networks states are saved after each validation step
... INFO;  Starting Epoch #1
... INFO;  Validation at epoch: 1 and batch: 1
... INFO;  Compute all outputs for validation dataset...
... INFO;  Done
... INFO;  training loss = 1.0222865343093872 // validation loss = 1.2975229024887085
... INFO;  Better loss found on validation set: None --> 1.2975229024887085
... INFO;  Saving model to ".../my_model/run_000/my_model_run_000_best_net.pth" ...
... INFO;  Compute all outputs for validation dataset...
... INFO;  Done
... INFO;  --- 3.185938596725464 seconds ---
... INFO;  --- Best loss: 3.892427444458008
... INFO;  Saving model to ".../my_model/run_000/my_model_run_000_last_epoch_net.pth" ...
... INFO;  You can test the model's predictions on a test dataset by running the command:
... INFO;
... INFO;      dnadna predict .../my_model/run_000/my_model_run_000_last_epoch_net.pth <dataset config file or paths to .npz files>

By default this will output a directory for your training run under model_name/run_NNN where NNN is an integer run ID. The run ID starts at 0, and by default the next unused run ID is used. However, you may also pass the --run-id argument to give a custom run ID, which may be either an integer, or an arbitrary string.
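
If you need to find existing runs from a script, the run_NNN naming can be scanned directly. This is a user-side sketch, not DNADNA's own logic: it assumes integer run IDs appear as directories named run_ followed by digits, and it ignores custom string run IDs (the name run_baseline below is hypothetical).

```python
import re
import tempfile
from pathlib import Path

# Assumption: integer run IDs appear as directories named run_<digits>.
RUN_DIR_PATTERN = re.compile(r"run_(\d+)")


def integer_run_ids(model_dir):
    """Return the sorted integer run IDs found directly under model_dir."""
    run_ids = []
    for child in Path(model_dir).iterdir():
        match = RUN_DIR_PATTERN.fullmatch(child.name)
        if child.is_dir() and match:
            run_ids.append(int(match.group(1)))
    return sorted(run_ids)


# Demonstration on a throwaway directory standing in for my_model/.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("run_000", "run_001", "run_baseline"):
        (Path(tmp) / name).mkdir()
    found_ids = integer_run_ids(tmp)

print(found_ids)  # [0, 1]
```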

A successful training run will output a <model_name>_run_<run_id>_last_epoch_net.pth file in the run directory, containing the final trained model in a pickled format, which can be loaded by the torch.load function.
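
A minimal sketch of loading that file from Python follows. The helper names are illustrative only, the three-digit padding is inferred from the run_000 path in the log above, and what torch.load returns (a network object or a dictionary of state) depends on how DNADNA saved it, so inspect the result before relying on it.

```python
from pathlib import PurePosixPath


def last_epoch_net_path(model_dir, model_name, run_id):
    """Build the path of the final network file for an integer run ID."""
    run_name = f"run_{run_id:03d}"  # assumption: zero-padded as in run_000
    file_name = f"{model_name}_{run_name}_last_epoch_net.pth"
    return PurePosixPath(model_dir) / run_name / file_name


def load_trained_net(path):
    """Load the pickled network; requires PyTorch to be installed."""
    import torch  # imported lazily so the path helper works without PyTorch

    # map_location="cpu" lets a model saved on a GPU load on a CPU-only machine.
    return torch.load(str(path), map_location="cpu")


net_path = last_epoch_net_path("my_model", "my_model", 0)
print(net_path)  # my_model/run_000/my_model_run_000_last_epoch_net.pth
```

Because the file is pickled, only load network files that come from a source you trust.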

Under the run directory this will also produce a <model_name>_<run_id>_training_config.yml file containing the final config file prepared for this training run. This contains a complete copy of the "base" training config used to run dnadna train as well as a complete copy of the simulation config. This information is copied in full for the purpose of provenance and reproducibility of a training run.

The expectation is that between multiple training runs, you may modify the "base" config to tune the training, either by modifying the original config file directly, or by copying it and editing the copy. In any case, the final configuration used to perform the training run is saved in the run directory and should not be modified.

Prediction

Given the trained network, we can now use it to make (or confirm) predictions on new datasets. To demonstrate, we'll run the dnadna predict command over part of the existing dataset we just used to train the model, though in practice it could be run on any data that conforms (e.g. in dimensions) to the dataset the model was trained on. The output is a CSV file containing the parameter predictions for each input:

$ dnadna predict my_model/run_000/my_model_run_000_last_epoch_net.pth \
                 my_model/scenario_04/*.npz
path,event_time,recent_size,event_size
.../my_model/scenario_04/my_model_04_0.npz,-0.06392800807952881,-0.11097482591867447,-0.12720556557178497
.../my_model/scenario_04/my_model_04_1.npz,-0.06392764300107956,-0.11097370833158493,-0.12720371782779694
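
The CSV output is easy to post-process with Python's standard csv module. The sketch below parses the sample output shown above from a string; it assumes you have captured the command's output as text, and the function name parse_predictions is illustrative.

```python
import csv
import io

# The sample output from above, captured as text.
sample_output = """path,event_time,recent_size,event_size
.../my_model/scenario_04/my_model_04_0.npz,-0.06392800807952881,-0.11097482591867447,-0.12720556557178497
.../my_model/scenario_04/my_model_04_1.npz,-0.06392764300107956,-0.11097370833158493,-0.12720371782779694
"""


def parse_predictions(text):
    """Map each input path to a dict of {parameter name: predicted value}."""
    predictions = {}
    for row in csv.DictReader(io.StringIO(text)):
        path = row.pop("path")
        predictions[path] = {name: float(value) for name, value in row.items()}
    return predictions


predictions = parse_predictions(sample_output)
for path, values in predictions.items():
    print(path, values["event_time"])
```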

Summary statistics

Compute summary statistics on the simulation data set:

$ dnadna sumstats my_model/my_model_simulation_config.yml
... INFO;  Starting sumstats with ... CPU(s) on 20 scenario(s)
... INFO;  Done

By default, the summary statistics files are output to the sumstats/ sub-directory of your model directory, as CSV files:

$ ls -1 my_model/sumstats/scenario_00/
my_model_00_ld.csv
my_model_00_sel.csv
my_model_00_sfs.csv

The summary statistics config file can either be a stand-alone config file (which references the simulation config file), or it can be included directly in the simulation config file, in which case the command is passed the path to the simulation config file as in the example above.

Data format

DNADNA has a prescribed filesystem layout and file format for the datasets it works on. Some of the details of this layout can be modified in the configuration files, and in a future version will be further customizable by plugins.

The default format assumes that SNP data (SNP matrices and associated SNP position arrays) are stored in NumPy's NPZ format with one file per replicate. They are organized on disk by scenario like:

\_ my_simulation/
    \_ my_simulation_params.csv  # the scenario parameters table
    |_ my_simulation_dataset_config.yml  # the simulation config file
    |_ scenario_000/
        \_ my_simulation_000_00.npz  # scenario 0 replicate 0
        ...
        |_ my_simulation_000_NN.npz
    |_ scenario_001/
        \_ my_simulation_001_00.npz  # scenario 1 replicate 0
        ...
        |_ my_simulation_001_NN.npz
    ...
    |_ scenario_NNN/
        \_ my_simulation_NNN_00.npz
        ...
        |_ my_simulation_NNN_NN.npz
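
The layout above can be expressed as a small path-building helper. This is only a sketch: the zero-padding widths are assumptions taken from the examples in this document (three and two digits in the tree above, two and one in the quickstart output), and the real widths depend on your dataset and configuration.

```python
from pathlib import PurePosixPath


def replicate_path(root, name, scenario_idx, replicate_idx,
                   scenario_width=3, replicate_width=2):
    """Path of one replicate file under the default layout.

    Assumption: indices are zero-padded to the given widths.
    """
    scenario = f"{scenario_idx:0{scenario_width}d}"
    replicate = f"{replicate_idx:0{replicate_width}d}"
    return (PurePosixPath(root)
            / f"scenario_{scenario}"
            / f"{name}_{scenario}_{replicate}.npz")


tree_example = replicate_path("my_simulation", "my_simulation", 1, 0)
demo_example = replicate_path("my_model", "my_model", 4, 1,
                              scenario_width=2, replicate_width=1)

print(tree_example)  # my_simulation/scenario_001/my_simulation_001_00.npz
print(demo_example)  # my_model/scenario_04/my_model_04_1.npz
```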

The file my_simulation_params.csv contains the known target values for each parameter of the simulation, on a per-scenario basis. It is currently a plain CSV file which must contain at a minimum 3 columns:

  • A scenario_idx column giving the scenario number.
  • An n_replicates column which specifies the number of replicates in that scenario.
  • One or more additional columns containing arbitrary parameter names, and their values for each scenario in the dataset.

For example:

scenario_idx,mutation_rate,recombination_rate,event_time,n_replicates
0,1e-08,1e-08,0.3865300203456797,-0.497464473948751,100
1,1e-08,1e-08,0.19344551118300793,0.16419897912977574,100
...
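
Before handing a parameters table to DNADNA, it can be worth checking these requirements yourself. The sketch below is not DNADNA's own validation; it uses the standard csv module on a small hypothetical table and also flags rows whose number of values does not match the header.

```python
import csv
import io

REQUIRED_COLUMNS = ("scenario_idx", "n_replicates")


def check_params_table(text):
    """Return (parameter names, rows), raising ValueError on a malformed table."""
    reader = csv.DictReader(io.StringIO(text))
    fieldnames = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in fieldnames]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    parameter_names = [col for col in fieldnames if col not in REQUIRED_COLUMNS]
    if not parameter_names:
        raise ValueError("at least one parameter column is required")
    rows = []
    for line_number, row in enumerate(reader, start=2):
        # DictReader stores surplus values under the key None and fills
        # absent values with None, so both signal a length mismatch.
        if None in row or any(value is None for value in row.values()):
            raise ValueError(f"line {line_number}: wrong number of values")
        rows.append(row)
    return parameter_names, rows


# Hypothetical table with a single parameter column.
sample_table = "scenario_idx,event_time,n_replicates\n0,0.5,100\n1,0.25,100\n"
parameter_names, rows = check_params_table(sample_table)
print(parameter_names, len(rows))  # ['event_time'] 2
```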

An associated config file here named my_simulation_dataset_config.yml provides further details about how to load the dataset to the DNADNA software. An example dataset config can be generated by running dnadna init.

See The DNADNA Dataset Format for more details.

Definitions:

  • Simulation: data generated by a simulator
  • (Population) Model: general model describing demography and selection (e.g. bottleneck + presence of selection)
  • Scenario: set of parameters characterizing demography and selection with respect to the model
  • Replicate: repetition of a given scenario with a different seed
  • Run: set of hyperparameters used to train and test the network for a given model

Development

See the development documentation for full details on how to set up and use a development environment and contribute to DNADNA.

Detailed usage

For the full usage manual see the DNADNA Documentation.
