tools for genetic genealogy and the analysis of consumer DNA test results
Project description
lineage provides a framework for analyzing genotype (raw data) files from direct-to-consumer DNA testing companies (e.g., 23andMe, Family Tree DNA, and Ancestry), primarily for the purposes of genetic genealogy.
Capabilities
Merge raw data files from different DNA testing companies, identifying discrepant SNPs in the process
Compute centiMorgans (cMs) of shared DNA between individuals using HapMap tables
Plot shared DNA between individuals
Determine genes shared between individuals (i.e., genes transcribed from shared DNA segments)
Find discordant SNPs between child and parent(s)
Remap SNPs between assemblies / builds (e.g., convert SNPs from Build 36 to Build 37)
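To give a feel for the centiMorgan capability listed above, here is a minimal sketch of how a genetic length could be read off a genetic map by linear interpolation. It is not lineage's implementation: the map values are made up, and the names `GENETIC_MAP`, `interpolate_cm`, and `segment_cm` are hypothetical. It assumes the map supplies cumulative cM values at sorted base-pair positions, as HapMap-style tables typically do.

```python
from bisect import bisect_right

# Hypothetical genetic map for one chromosome: (base-pair position, cumulative cM),
# sorted by position. The values are invented for illustration only.
GENETIC_MAP = [
    (1000000, 0.0),
    (2000000, 1.5),
    (4000000, 2.5),
    (5000000, 4.5),
]


def interpolate_cm(genetic_map, pos):
    """Linearly interpolate the cumulative cM value at a base-pair position."""
    positions = [p for p, _ in genetic_map]
    # Positions outside the map are clamped to its first or last entry.
    if pos <= positions[0]:
        return genetic_map[0][1]
    if pos >= positions[-1]:
        return genetic_map[-1][1]
    i = bisect_right(positions, pos)
    (p0, c0), (p1, c1) = genetic_map[i - 1], genetic_map[i]
    return c0 + (c1 - c0) * (pos - p0) / (p1 - p0)


def segment_cm(genetic_map, start, end):
    """Genetic length in cM of the shared segment from start to end."""
    return interpolate_cm(genetic_map, end) - interpolate_cm(genetic_map, start)


print(segment_cm(GENETIC_MAP, 1500000, 4500000))
```

Working with cumulative values keeps the segment length a simple difference of two interpolated points, whatever the spacing of the map entries.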
Dependencies
lineage requires Python 3.4+, pandas, and matplotlib.
On Linux systems, the python3-tk package may also be required:
$ sudo apt-get install python3-tk
Installation
lineage is available on the Python Package Index. Install lineage via pip:
$ pip install lineage
Examples
Initialize the lineage Framework
Import Lineage and instantiate a Lineage object:
>>> from lineage import Lineage
>>> l = Lineage()
Download Example Data
Let’s download some example data from openSNP:
>>> l.download_example_datasets()
Downloading resources/662.23andme.304.csv.gz
Downloading resources/662.23andme.340.csv.gz
Downloading resources/662.ftdna-illumina.341.csv.gz
Downloading resources/663.23andme.305.csv.gz
Downloading resources/4583.ftdna-illumina.3482.csv.gz
Downloading resources/4584.ftdna-illumina.3483.csv.gz
We’ll call these datasets User662, User663, User4583, and User4584.
Load Raw Data
Create an Individual in the context of the lineage framework to interact with the User662 dataset:
>>> user662 = l.create_individual('User662', 'resources/662.ftdna-illumina.341.csv.gz')
Loading resources/662.ftdna-illumina.341.csv.gz
Here we created user662 with the name User662 and loaded a raw data file.
Remap SNPs
Oops! The data we just loaded is Build 36, but we want Build 37 since the other files in the datasets are Build 37… Let’s remap the SNPs:
>>> user662.remap_snps('NCBI36', 'GRCh37')
Remapping chromosome 1...
Remapping chromosome 2...
Remapping chromosome 3...
Remapping chromosome 4...
Remapping chromosome 5...
Remapping chromosome 6...
Remapping chromosome 7...
Remapping chromosome 8...
Remapping chromosome 9...
Remapping chromosome 10...
Remapping chromosome 11...
Remapping chromosome 12...
Remapping chromosome 13...
Remapping chromosome 14...
Remapping chromosome 15...
Remapping chromosome 16...
Remapping chromosome 17...
Remapping chromosome 18...
Remapping chromosome 19...
Remapping chromosome 20...
Remapping chromosome 21...
Remapping chromosome 22...
SNPs can be remapped between Build 36 (NCBI36), Build 37 (GRCh37), and Build 38 (GRCh38).
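Conceptually, remapping moves each SNP from its coordinate in one assembly to the corresponding coordinate in another. The sketch below illustrates the idea only and is not how lineage does it internally: it assumes the assembly mapping is already available as a list of segments, and the segment values and the names `SEGMENTS` and `remap_snp` are hypothetical. Complementing the alleles for reverse-strand segments is likewise an assumption of this sketch.

```python
# Hypothetical mapping segments between two assemblies:
# (old_start, old_end, new_start, new_end, strand). Values are illustrative only.
SEGMENTS = [
    (1000, 1999, 1500, 2499, 1),   # same orientation, shifted by +500
    (3000, 3999, 5000, 5999, -1),  # inverted relative to the old assembly
]

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def remap_snp(segments, pos, genotype):
    """Return (new_pos, new_genotype), or None if pos lies in no segment."""
    for old_start, old_end, new_start, new_end, strand in segments:
        if old_start <= pos <= old_end:
            offset = pos - old_start
            if strand == 1:
                return new_start + offset, genotype
            # Reverse strand: count back from the far end and complement alleles.
            flipped = "".join(COMPLEMENT.get(a, a) for a in genotype)
            return new_end - offset, flipped
    return None  # position has no counterpart in the target assembly


print(remap_snp(SEGMENTS, 1200, "AG"))
print(remap_snp(SEGMENTS, 3100, "AG"))
print(remap_snp(SEGMENTS, 2500, "AA"))
```

A SNP that falls in no segment has no counterpart in the target assembly, so the sketch returns `None` for it rather than guessing a position.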
Merge Raw Data Files
The dataset for User662 consists of three raw data files from two different DNA testing companies. Let’s load the remaining two files.
As each file is loaded, its data is compared to the existing data, and any discrepancies are saved to CSV files. (The discrepancy thresholds can be tuned via parameters.)
>>> user662.load_snps(['resources/662.23andme.304.csv.gz', 'resources/662.23andme.340.csv.gz'],
...                   discrepant_genotypes_threshold=160)
Loading resources/662.23andme.304.csv.gz
3 SNP positions being added differ; keeping original positions
Saving output/User662_discrepant_positions_1.csv
8 genotypes were discrepant; marking those as null
Saving output/User662_discrepant_genotypes_1.csv
Loading resources/662.23andme.340.csv.gz
27 SNP positions being added differ; keeping original positions
Saving output/User662_discrepant_positions_2.csv
156 genotypes were discrepant; marking those as null
Saving output/User662_discrepant_genotypes_2.csv
All output files are saved to the output directory.
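The messages above suggest two rules during a merge: when positions differ, the original position is kept, and when genotypes differ, the genotype is marked as null. The following is a minimal sketch of those two rules using plain dictionaries; it is not lineage's implementation. The function name `merge_snps` and the sample records are hypothetical, and treating allele order as irrelevant (`AG` equals `GA`) is an assumption of this sketch.

```python
def merge_snps(existing, incoming):
    """Merge incoming SNPs into existing ones.

    Both arguments map rsid -> (chrom, pos, genotype), where genotype may be None.
    Returns (merged, discrepant_positions, discrepant_genotypes).
    """
    merged = dict(existing)
    discrepant_positions, discrepant_genotypes = [], []
    for rsid, (chrom, pos, genotype) in incoming.items():
        if rsid not in merged:
            merged[rsid] = (chrom, pos, genotype)  # new SNP: just add it
            continue
        old_chrom, old_pos, old_genotype = merged[rsid]
        if (chrom, pos) != (old_chrom, old_pos):
            discrepant_positions.append(rsid)  # the original position is kept
        if old_genotype is None:
            merged[rsid] = (old_chrom, old_pos, genotype)  # fill a missing call
        elif genotype is not None and sorted(genotype) != sorted(old_genotype):
            discrepant_genotypes.append(rsid)
            merged[rsid] = (old_chrom, old_pos, None)  # mark as null
    return merged, discrepant_positions, discrepant_genotypes


# Made-up records for illustration only.
existing = {"rs1": ("1", 100, "AA"), "rs2": ("1", 200, "AG"), "rs3": ("2", 300, None)}
incoming = {"rs1": ("1", 100, "AG"), "rs2": ("1", 205, "GA"),
            "rs3": ("2", 300, "CC"), "rs4": ("3", 400, "TT")}

merged, bad_positions, bad_genotypes = merge_snps(existing, incoming)
print(bad_positions, bad_genotypes)
```

Nulling a conflicting genotype, rather than picking one source over the other, keeps uncertain calls out of later comparisons.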
Save SNPs
OK, so far we’ve remapped the SNPs to the same build and merged the SNPs from three files, identifying discrepancies along the way. Let’s save the merged dataset, consisting of more than 1M SNPs, to a CSV file:
>>> user662.save_snps()
Saving output/User662.csv
Compare Individuals
Let’s create another Individual for the User663 dataset:
>>> user663 = l.create_individual('User663', 'resources/663.23andme.305.csv.gz')
Loading resources/663.23andme.305.csv.gz
Now we can perform some analysis between the User662 and User663 datasets.
Find Discordant SNPs
First, let’s find discordant SNPs (i.e., SNP data that is not consistent with Mendelian inheritance):
>>> discordant_snps = l.find_discordant_snps(user662, user663, save_output=True)
Saving output/discordant_snps_User662_User663.csv
This method also returns a pandas DataFrame, which can be inspected interactively at the prompt; the same data is saved to the CSV file.
>>> len(discordant_snps.loc[discordant_snps['chrom'] != 'MT'])
37
Not counting mtDNA SNPs, there are 37 discordant SNPs between these two datasets.
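To make "not consistent with Mendelian inheritance" concrete: a child inherits one allele from each parent, so at a biallelic, diploid SNP a parent and child should share at least one allele. Here is a minimal sketch of that check. It is not lineage's implementation, the names `is_discordant` and `find_discordant` are hypothetical, and the sample genotypes are made up.

```python
def is_discordant(parent_genotype, child_genotype):
    """True if a parent/child genotype pair shares no allele."""
    if not parent_genotype or not child_genotype:
        return False  # a missing call cannot be judged
    return not set(parent_genotype) & set(child_genotype)


def find_discordant(parent, child):
    """Return sorted rsids present in both dicts whose genotypes are discordant."""
    shared = parent.keys() & child.keys()
    return sorted(r for r in shared if is_discordant(parent[r], child[r]))


# Made-up genotypes for illustration only.
parent = {"rs1": "AA", "rs2": "AG", "rs3": "CC", "rs4": None}
child = {"rs1": "GG", "rs2": "GG", "rs3": "CT", "rs4": "TT"}

print(find_discordant(parent, child))
```

In this example only `rs1` is flagged: a parent with `AA` cannot pass a `G` allele to a child with `GG`.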
Documentation
Documentation is available here.
Acknowledgements
Thanks to Whit Athey, Ryan Dale, Mike Agostino, Padma Reddy, Binh Bui, Gopal Vashishtha, CS50, and openSNP.
License
Copyright (C) 2016 Andrew Riha
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.