# ezancestry

Easily visualize your direct-to-consumer genetics next to 2500+ samples from the 1000 Genomes Project. Evaluate how well a custom set of ancestry-informative SNPs (AISNPs) classifies the genetic ancestry of the 1000 Genomes samples using a machine learning model.

A subset of the single nucleotide polymorphisms (SNPs) in the 1000 Genomes Project samples has been parsed from the publicly available .bcf files.
This subset, the ancestry-informative SNPs (AISNPs), was chosen from two publications, which correspond to the Kidd and Seldin sets listed below.

ezancestry ships with pretrained k-nearest neighbor models for all combinations of the following AISNP sets, population levels, and dimensionality reduction methods:

* Kidd (55 AISNPs)
* Seldin (128 AISNPs)

* continental-level population (superpopulation)
* regional population (population)

* principal component analysis (PCA)
* neighborhood component analysis (NCA)
* uniform manifold approximation and projection (UMAP)
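
Taken together, the three lists above imply 2 x 2 x 3 = 12 pretrained model combinations. A quick way to enumerate them is sketched below. The short labels (`kidd`, `superpopulation`, `pca`, and so on) are illustrative assumptions and may not match the exact option values ezancestry expects.

```python
from itertools import product

# Labels are illustrative assumptions; check ezancestry's own help output
# for the exact option values.
AISNP_SETS = ["kidd", "seldin"]
POPULATION_LEVELS = ["superpopulation", "population"]
ALGORITHMS = ["pca", "nca", "umap"]


def model_combinations():
    """Return every (AISNP set, population level, algorithm) triple."""
    return list(product(AISNP_SETS, POPULATION_LEVELS, ALGORITHMS))


if __name__ == "__main__":
    for combo in model_combinations():
        print(combo)
```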


## Installation

Install ezancestry with pip:

pip install ezancestry


Or clone the repository and run pip install from the directory:

git clone git@github.com:arvkevi/ezancestry.git
cd ezancestry
pip install .
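
To confirm that the installation succeeded, the installed version can be looked up from Python with the standard library. This is a minimal sketch; the `installed_version` helper is hypothetical and not part of ezancestry.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="ezancestry"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(f"ezancestry version: {found}" if found else "ezancestry is not installed")
```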


## Config

The first time ezancestry is run, it generates a conf.ini file and a data/ directory under ${HOME}/.ezancestry. You can edit conf.ini to change the default settings, but doing so is not required to use ezancestry. The settings are a convenience so that you do not have to be verbose when interacting with the software. They are also keyword arguments to each command in the ezancestry API, so you can always override the defaults. The config file is located at ${HOME}/.ezancestry/conf.ini
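
To see which defaults are currently in effect, the file can be read with Python's standard `configparser`. This is a minimal sketch that assumes only the path stated above. It does not assume any particular section or key names and simply returns whatever the file contains; the `read_settings` helper is hypothetical.

```python
import configparser
from pathlib import Path

# Path taken from the text above; the sections and keys inside are not assumed.
DEFAULT_CONF = Path.home() / ".ezancestry" / "conf.ini"


def read_settings(conf_path=DEFAULT_CONF):
    """Return {section: {key: value}} for an INI file, or {} if it is missing."""
    # Interpolation is disabled so values containing '%' are returned verbatim.
    parser = configparser.ConfigParser(interpolation=None)
    # ConfigParser.read() silently skips files that do not exist.
    parser.read(conf_path)
    return {section: dict(parser[section]) for section in parser.sections()}


if __name__ == "__main__":
    for section, values in read_settings().items():
        print(f"[{section}]")
        for key, value in values.items():
            print(f"{key} = {value}")
```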


#### plot

Visualize the output of predict using the plot command. This opens a 3D scatter plot in a browser.

ezancestry plot predictions.csv


#### generate-dependencies

This command downloads all of the data required to build a new nearest neighbors model for a custom set of AISNPs, which includes attempting to download all the .bcf files from the 1000 Genomes Project. If you only want to use the existing models, see predict and plot.

Without any arguments, this command will download all necessary data to build new models and put it in the ${HOME}/.ezancestry/data/ directory.

ezancestry generate-dependencies


Now you are ready to build a new model with build-model.

#### build-model

Test the discriminative power of your custom set of AISNPs.

This command builds all the models needed to visualize and predict the 1000 Genomes samples as well as user-uploaded samples. A model performance evaluation report is generated for a five-fold cross-validation on the training set of the 1000 Genomes samples, along with a report for the holdout set.
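
The evaluation protocol described above (a holdout set plus five-fold cross-validation on the remaining training samples) can be illustrated with a small, dependency-free sketch. Everything here is an assumption for illustration: the toy data, the 1-nearest-neighbor classifier, the 80/20 split, and the helper names are hypothetical and are not ezancestry's implementation.

```python
import random


def nearest_neighbor_predict(train_x, train_y, point):
    """Predict the label of the single closest training point (1-NN)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    best = min(range(len(train_x)), key=lambda i: sq_dist(train_x[i], point))
    return train_y[best]


def accuracy(train_x, train_y, test_x, test_y):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(
        nearest_neighbor_predict(train_x, train_y, x) == y
        for x, y in zip(test_x, test_y)
    )
    return hits / len(test_y)


def evaluate(x, y, n_folds=5, holdout_fraction=0.2, seed=0):
    """Return (per-fold CV accuracies on the training set, holdout accuracy)."""
    order = list(range(len(x)))
    random.Random(seed).shuffle(order)

    # Set aside a holdout set that is never used during cross-validation.
    n_holdout = int(len(order) * holdout_fraction)
    holdout_idx, train_idx = order[:n_holdout], order[n_holdout:]

    # Split the training indices into n_folds roughly equal folds.
    folds = [train_idx[i::n_folds] for i in range(n_folds)]
    cv_scores = []
    for k, val_idx in enumerate(folds):
        # Fit on every fold except the k-th, validate on the k-th.
        fit_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        cv_scores.append(accuracy(
            [x[i] for i in fit_idx], [y[i] for i in fit_idx],
            [x[i] for i in val_idx], [y[i] for i in val_idx],
        ))

    # Final check: fit on the full training set, score on the holdout set.
    holdout_score = accuracy(
        [x[i] for i in train_idx], [y[i] for i in train_idx],
        [x[i] for i in holdout_idx], [y[i] for i in holdout_idx],
    )
    return cv_scores, holdout_score


if __name__ == "__main__":
    # Two well-separated toy clusters standing in for two populations.
    rng = random.Random(1)
    x = [(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(50)]
    x += [(rng.gauss(5, 0.1), rng.gauss(5, 0.1)) for _ in range(50)]
    y = ["A"] * 50 + ["B"] * 50
    cv_scores, holdout_score = evaluate(x, y)
    print("cross-validation accuracies:", cv_scores)
    print("holdout accuracy:", holdout_score)
```

Keeping the holdout samples out of every cross-validation fold means the holdout score is an independent check on the cross-validation estimate.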

Create a custom AISNP file here: ~/.ezancestry/data/aisnps/custom.AISNP.txt. The prefix of the filename, custom, can be whatever you want. Note that this value is used as the aisnps-set keyword argument for other ezancestry commands.

The file should look like this:

id      chromosome      position_hg19
rs731257        7       12669251
rs2946788       11      24010530
rs3793451       9       71659280
rs10236187      7       139447377
rs1569175       2       201021954

ezancestry build-model --aisnps-set=custom
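
Before building a model, it can help to sanity-check the custom AISNP file. The sketch below assumes the file is whitespace-delimited with the three header columns shown above. The `read_aisnps` helper is hypothetical and not part of ezancestry.

```python
from pathlib import Path

EXPECTED_HEADER = ["id", "chromosome", "position_hg19"]


def read_aisnps(path):
    """Parse an AISNP file into a list of (rsid, chromosome, position) tuples.

    Raises ValueError if the header or any row does not match the layout
    shown above. Columns are split on any run of whitespace (an assumption).
    """
    lines = [ln for ln in Path(path).read_text().splitlines() if ln.strip()]
    if not lines or lines[0].split() != EXPECTED_HEADER:
        raise ValueError(f"expected header columns {EXPECTED_HEADER}")
    records = []
    for row_number, line in enumerate(lines[1:], start=1):
        fields = line.split()
        if len(fields) != 3 or not fields[2].isdigit():
            raise ValueError(f"malformed data row {row_number}: {line!r}")
        rsid, chromosome, position = fields
        # Chromosome stays a string so that values such as "X" are preserved.
        records.append((rsid, chromosome, int(position)))
    return records


if __name__ == "__main__":
    custom = Path.home() / ".ezancestry" / "data" / "aisnps" / "custom.AISNP.txt"
    if custom.exists():
        print(f"{len(read_aisnps(custom))} AISNPs found in {custom}")
    else:
        print(f"{custom} does not exist yet")
```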


See the notebook

### Visualization

http://ezancestry.herokuapp.com/


### Contributing

Contributions are welcome! Please feel free to create an issue for discussion or make a pull request.

