Genetic population assignment using neural networks
popfinder
The popfinder Python package performs genetic population assignment using neural networks. Using popfinder, you can load genetic information and sample information to train either a classifier neural network or a regressor neural network. A classifier neural network will try to identify the population of samples of unknown origin. The regressor neural network will try to identify latitudinal and longitudinal coordinates of samples of unknown origin. The regressor module comes with additional functionality that will perform classification of samples of unknown origin using kernel density estimates of predicted locations.
Installation
popfinder can be installed using either the conda or pip package managers. conda is a general package manager capable of installing packages from many sources, whereas pip is strictly a Python package manager. The installation instructions below are based on Windows 10, but similar steps can be used to install popfinder on Linux.
Dependencies
popfinder was developed using Python 3.10 and the following Python packages:
numpy=1.24.0
pandas=1.5.2
pytorch=1.13.1
scikit-learn
dill=0.3.6
seaborn=0.12.1
matplotlib=3.6.2
scikit-allel
zarr=2.13.3
h5py=1.12.2
scipy=1.9.3
Using conda
Follow these steps to get started with conda and use conda to install popfinder.
- Install conda using the Miniconda or Anaconda installer (this tutorial uses Miniconda). To install Miniconda, follow this link and, under Latest Miniconda Installer Links, download Miniconda for your operating system. Open the Miniconda installer and follow the default steps to install conda. For more information, see the conda documentation.
- To use conda, open the command prompt that was installed with the Miniconda installer. To find this prompt, type "anaconda prompt" in the Windows Search Bar. You should see an option appear called Anaconda Prompt (miniconda3). Select this option to open a command line window. All code in the next steps will be typed in this window.
- You can either install popfinder and its dependencies into your base environment, or set up a new conda environment (recommended). Run the code below to set up and activate a new conda environment called "popfinder_env" that uses Python 3.10.
# Create new conda environment
conda create -n popfinder_env python=3.10
# Activate environment
conda activate popfinder_env
You should now see that "(base)" has been replaced with "(popfinder_env)" at the beginning of each prompt.
- Set the package channel for conda. To install the dependencies for popfinder, you need access to the conda-forge package channel. To configure this channel, run the following code in the Anaconda Prompt.
# Set conda-forge package channel
conda config --add channels conda-forge
- Install popfinder using conda install. Installing popfinder will also install its dependencies.
# Install popfinder
conda install popfinder
popfinder should now be installed and ready to use!
Using pip
Use pip to install popfinder to your default Python installation. You can install Python from https://www.python.org/downloads/. You can also find information on how to install pip in the pip documentation.
Install popfinder using pip install. Installing popfinder will also install its dependencies.
# Make sure you are using the latest version of pip
pip install --upgrade pip
# Install popfinder
pip install popfinder
Usage
The following usage examples use the genetic data and sample data found in this folder. The data used in the following example is actual genomic data obtained from RAD-seq analysis of Leach's storm-petrels from 5 unique populations.
Python IDE
Set Up
First, install popfinder using either conda install or pip install. See the installation instructions above for more information.
Then, in a new Python script, import the three classes of popfinder.
from popfinder.dataloader import GeneticData
from popfinder.classifier import PopClassifier
from popfinder.regressor import PopRegressor
Load Data
The dataloader module contains the GeneticData class. This class is used for loading all genetic data and sample data, as well as preprocessing the data in preparation for running the neural networks.
A new instance of the GeneticData class must be initialized with a path to the genetic_data and a path to the sample_data. The genetic data can come in the form of a .vcf, .h5py, or .zarr file, and contains allelic information for each sample. The sample data is a tab-delimited .txt file with the following columns: x, y, pop, and sampleID. The sample IDs in the .txt file must match the sample IDs in the genetic data file. If a sample is from an unknown location, its x, y, and pop columns should contain NA values.
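To make the sample file layout concrete, here is a minimal sketch that parses a small tab-delimited table with the four columns described above and separates rows with NA coordinates and population from the rest. It uses only the Python standard library and does not reproduce how popfinder reads files internally; the sample values and the names SAMPLE_TEXT and read_samples are hypothetical.

```python
import csv
import io

# Hypothetical sample table (tab-delimited); a real file would live on disk.
SAMPLE_TEXT = (
    "x\ty\tpop\tsampleID\n"
    "-60.1\t46.2\tpopA\tS1\n"
    "-53.9\t47.3\tpopB\tS2\n"
    "NA\tNA\tNA\tS3\n"
)

REQUIRED_COLUMNS = {"x", "y", "pop", "sampleID"}


def read_samples(handle):
    """Parse a tab-delimited sample table and split rows by known/unknown origin."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    knowns, unknowns = [], []
    for row in reader:
        # A sample of unknown origin has NA in x, y, and pop.
        if all(row[col] == "NA" for col in ("x", "y", "pop")):
            unknowns.append(row)
        else:
            knowns.append(row)
    return knowns, unknowns


knowns, unknowns = read_samples(io.StringIO(SAMPLE_TEXT))
print([row["sampleID"] for row in knowns])
print([row["sampleID"] for row in unknowns])
```

Checking the column names before handing a file to the loader can catch formatting problems early, for example a file saved with commas instead of tabs.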
Run the code below to create an instance of the GeneticData class.
data_object = GeneticData(genetic_data="tests/test_data/test.vcf",
                          sample_data="tests/test_data/testNA.txt")
When the GeneticData instance is created, the class splits the data into samples of known and unknown origin, and further splits the samples of known origin into training and testing datasets. You can access these datasets using the following class attributes.
# View all loaded data
data_object.data
# View data corresponding to samples of unknown origin
data_object.unknowns
# View data corresponding to samples of known origin
data_object.knowns
# View training dataset
data_object.train
# View testing dataset
data_object.test
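The idea behind the training/testing split can be sketched in a few lines of plain Python. This is an illustration only, not the procedure GeneticData uses: the function name split_known_samples, the 20% test fraction, and the fixed seed are all assumptions made for the example.

```python
import random


def split_known_samples(sample_ids, test_fraction=0.2, seed=0):
    """Shuffle known-origin sample IDs and split them into train and test lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    ids = list(sample_ids)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]


all_ids = [f"S{i}" for i in range(10)]
train_ids, test_ids = split_known_samples(all_ids)
print(len(train_ids), len(test_ids))
```

Samples of unknown origin are never part of either list; they are held aside until the trained model is used for assignment.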
Use the classifier module
The classifier module contains the PopClassifier class. This class is used for training a classifier neural network, using this neural network to perform population assignment, and visualizing the end results.
The only required argument for initializing an instance of this class is an instance of the GeneticData class. In our case, this instance is the data_object we created in the previous step.
Run the code below to create an instance of the PopClassifier class.
classifier = PopClassifier(data_object)
Next, we will train our classifier. This allows the neural network to learn from our data so it can make more accurate predictions.
classifier.train()
We can view the training history of our classifier using the method below. This will generate a plot that shows the loss of the neural network on the training data versus the loss on the validation data. A well-trained model should show converging loss values for the training and validation data by the last epoch.
classifier.plot_training_curve()
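If you prefer a number to reading the plot by eye, one rough way to express "converging" is the average gap between the two loss curves over the last few epochs. The sketch below assumes you have the per-epoch losses as two plain lists; the loss values, the function name final_loss_gap, and the three-epoch window are hypothetical and are not part of popfinder.

```python
def final_loss_gap(train_loss, valid_loss, last_n=3):
    """Mean absolute gap between training and validation loss over the last epochs."""
    if len(train_loss) != len(valid_loss):
        raise ValueError("loss histories must have the same length")
    if not train_loss:
        raise ValueError("loss histories are empty")
    pairs = list(zip(train_loss, valid_loss))[-last_n:]
    return sum(abs(t - v) for t, v in pairs) / len(pairs)


# Hypothetical loss histories, one value per epoch.
train_loss = [1.60, 1.10, 0.80, 0.62, 0.55, 0.52]
valid_loss = [1.65, 1.20, 0.90, 0.70, 0.60, 0.56]
gap = final_loss_gap(train_loss, valid_loss)
print(f"mean gap over the last epochs: {gap:.3f}")
```

A small, stable gap is consistent with convergence; a validation loss that rises while the training loss keeps falling usually points to overfitting.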
Once we are satisfied with the training of our model, we can use the test() method to evaluate our trained model.
classifier.test()
We can visualize the accuracy, precision, and recall of the model by plotting a confusion matrix from the test results. The confusion matrix has the true population of origin along the Y-axis and the predicted population of origin along the X-axis. The scores along the diagonal represent the proportion of times samples from a given population were correctly assigned to that population.
classifier.plot_confusion_matrix()
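To show what the values in such a matrix mean, the following sketch computes a row-normalized confusion matrix from two lists of labels using only the standard library. The labels are made up and the function name confusion_proportions is hypothetical; it is not how popfinder builds its plot, only an illustration of the proportions described above.

```python
from collections import Counter


def confusion_proportions(true_pops, predicted_pops):
    """Return {true_pop: {predicted_pop: proportion}}; each non-empty row sums to 1."""
    labels = sorted(set(true_pops) | set(predicted_pops))
    counts = Counter(zip(true_pops, predicted_pops))
    matrix = {}
    for true_label in labels:
        row_total = sum(counts[(true_label, p)] for p in labels)
        matrix[true_label] = {
            p: (counts[(true_label, p)] / row_total if row_total else 0.0)
            for p in labels
        }
    return matrix


# Hypothetical test-set labels: true population vs. predicted population.
true_pops = ["A", "A", "A", "A", "B", "B", "C", "C"]
pred_pops = ["A", "A", "A", "B", "B", "B", "C", "A"]
matrix = confusion_proportions(true_pops, pred_pops)
for label, row in matrix.items():
    print(label, row)
```

Each diagonal entry, such as matrix["A"]["A"], is the share of samples from that population that were assigned back to it; off-diagonal entries show where the misassigned samples went.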
Finally, we can use our trained and tested model to assign individuals of unknown origin to populations.
classifier.assign_unknowns()
After running the above code, we can either display a dataframe or view a plot of assignment probabilities for each sample.
classifier.plot_assignment()
You can also retrieve information about which SNPs were most influential in training the model using the rank_site_importance() method. This method will return a dataframe containing information about each SNP and the corresponding error when the SNP is randomized during model training and validation. In the dataframe, SNPs that have a higher error value also have greater influence on the model, and by extension play a greater role in population assignment.
classifier.rank_site_importance()
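The general idea of ranking sites by the error they cause when randomized resembles permutation importance. The sketch below illustrates that technique on a toy dataset with a nearest-centroid classifier standing in for the neural network; it shuffles one SNP column at a time at evaluation time only. It is not popfinder's implementation, and the genotypes and all function names are hypothetical.

```python
import random


def fit_centroids(rows, labels):
    """Mean genotype vector per population (a stand-in for a trained model)."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict(centroids, row):
    """Assign a sample to the population with the nearest centroid."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(sorted(centroids), key=lambda lab: sq_dist(centroids[lab]))


def error_rate(centroids, rows, labels):
    """Fraction of samples assigned to the wrong population."""
    wrong = sum(predict(centroids, r) != lab for r, lab in zip(rows, labels))
    return wrong / len(rows)


def permutation_importance(centroids, rows, labels, repeats=20, seed=0):
    """Mean error rate after shuffling each SNP column, one column at a time."""
    rng = random.Random(seed)
    scores = []
    for j in range(len(rows[0])):
        total = 0.0
        for _ in range(repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            total += error_rate(centroids, shuffled, labels)
        scores.append(total / repeats)
    return scores


# Toy genotypes (allele counts). SNP 0 separates the populations; SNP 1 does not.
genotypes = [
    [0, 0], [0, 1], [0, 2], [0, 1],  # population A
    [2, 1], [2, 2], [2, 0], [2, 1],  # population B
]
populations = ["A"] * 4 + ["B"] * 4
model = fit_centroids(genotypes, populations)
scores = permutation_importance(model, genotypes, populations)
print(scores)
```

Shuffling the uninformative SNP leaves every prediction unchanged, while shuffling the informative SNP raises the error, which is the pattern a higher error value in the ranking is meant to capture.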
Use the regressor module
The regressor module contains the PopRegressor class. This class is used for training a regressor neural network, using this neural network to perform population assignment, and visualizing the end results.
The only required argument for initializing an instance of this class is an instance of the GeneticData class. In our case, this instance is the data_object we created in the previous step.
Run the code below to create an instance of the PopRegressor class.
regressor = PopRegressor(data_object)
The regressor module can be used in two different ways: (1) to retrieve predicted latitudinal/longitudinal coordinates of each sample of unknown origin; or (2) to retrieve predicted population classifications of each sample of unknown origin using kernel density estimates.
Option 1
To use the regressor module to retrieve predicted geographic coordinates of each sample, you will follow a similar workflow as with the classifier module. First, you will need to train the model using your training data.
regressor.train()
Next, evaluate the trained model using the test dataset.
regressor.test()
Finally, use the assign_unknown() method to predict locations of samples of unknown origin.
regressor.assign_unknown()
You can view the predicted location in reference to the populations included in your sample data using the plot_location() method.
regressor.plot_location()
Option 2
The second way to use the regressor module is by generating many predicted geographic locations for each sample, then using the kernel density estimates (i.e. contour lines) to classify the population of origin as the one "closest" to the center of the kernel density estimate.
This second option requires training and testing regressor neural networks on many bootstrapped samples. Specify the number of bootstrap samples with the nboots parameter. The greater the number of bootstraps, the greater the number of predictions and the more certain the population classifications. Run the code below to implement this method.
regressor.classify_by_contours(nboots=100)
Once completed, you can view the contour maps for each sample of unknown origin to see how the classifications were made.
regressor.plot_contour_map()
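The classification idea can be sketched without any plotting: estimate a kernel density over the bootstrap predictions for one sample, take the densest predicted point as the "center", and pick the population nearest to it. The sketch below is a simplified illustration under several assumptions, not the method popfinder uses internally: it treats longitude and latitude as planar coordinates, evaluates the density only at the predicted points, uses a fixed bandwidth, and all coordinates and names are hypothetical.

```python
import math


def gaussian_kde_at(point, samples, bandwidth):
    """Unnormalized Gaussian kernel density of `samples` evaluated at `point`."""
    px, py = point
    return sum(
        math.exp(-((px - sx) ** 2 + (py - sy) ** 2) / (2 * bandwidth ** 2))
        for sx, sy in samples
    )


def classify_by_density_peak(predictions, pop_centers, bandwidth=0.5):
    """Pick the population whose center is nearest the densest predicted location."""
    peak = max(predictions, key=lambda p: gaussian_kde_at(p, predictions, bandwidth))
    nearest = min(pop_centers, key=lambda name: math.dist(peak, pop_centers[name]))
    return nearest, peak


# Hypothetical bootstrap predictions (x, y) for a single unknown sample.
predictions = [
    (-60.2, 46.1), (-59.9, 45.9), (-60.0, 46.0), (-60.1, 46.2),
    (-55.0, 47.0),  # an outlying prediction
]
# Hypothetical population locations.
pop_centers = {"popA": (-60.0, 46.0), "popB": (-54.0, 47.0)}

assigned, peak = classify_by_density_peak(predictions, pop_centers)
print(assigned, peak)
```

Because the density peak is driven by where most predictions cluster, a single outlying prediction has little effect on the assignment, which is one reason a larger number of bootstraps tends to give steadier classifications.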
Command Line
You can also run popfinder from the command line. To run the classifier from the command line, use the pop_classifier command. To run the regressor from the command line, use the pop_regressor command. For a full list of methods and arguments for each command, pass the --help flag.
pop_classifier --help
The general workflow for using the command line version of popfinder is similar to using it in the Python IDE. At each step below, the updated model is loaded from and saved to the output_folder. If no output_folder is given, the current working directory is used.
- Load the data.
pop_classifier --load_data --genetic_data="tests/test_data/test.vcf" --sample_data="tests/test_data/testNA.txt"
- Train the model.
pop_classifier --train
- Evaluate the model on the test dataset.
pop_classifier --test
- Perform population assignment with the trained/tested model.
pop_classifier --assign
The output folder will contain results files, such as model evaluation statistics and a dataframe of sample population assignments. You can also generate plots based on the model results that will be saved to the output folder, such as the following:
pop_classifier --plot_assignment
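If you want to run the steps above in sequence from a script, one option is a small Python wrapper around the same commands. The sketch below only uses the flags shown in this section; the helper names build_steps and run_steps are hypothetical, and it defaults to a dry run that prints the commands instead of executing them.

```python
import subprocess


def build_steps(genetic_data, sample_data):
    """Return the pop_classifier invocations from the walkthrough, in order."""
    return [
        ["pop_classifier", "--load_data",
         f"--genetic_data={genetic_data}", f"--sample_data={sample_data}"],
        ["pop_classifier", "--train"],
        ["pop_classifier", "--test"],
        ["pop_classifier", "--assign"],
        ["pop_classifier", "--plot_assignment"],
    ]


def run_steps(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in steps:
        print(" ".join(argv))
        if not dry_run:
            # check=True stops the workflow at the first failing step.
            subprocess.run(argv, check=True)


steps = build_steps("tests/test_data/test.vcf", "tests/test_data/testNA.txt")
run_steps(steps)  # dry run: prints the commands without executing them
```

Calling run_steps(steps, dry_run=False) would execute the commands, which requires popfinder to be installed and pop_classifier to be available on your PATH.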
Reference
TODO: document all classes/methods/command line parameters