tools for reading, writing, merging, and remapping SNPs
tools for reading, writing, merging, and remapping SNPs 🧬
snps strives to be an easy-to-use and accessible open-source library for working with genotype data
Input / Output
- Read raw data (genotype) files from a variety of direct-to-consumer (DTC) DNA testing sources with a SNPs object
- Read and write VCF files (e.g., convert 23andMe to VCF)
- Merge raw data files from different DNA tests, identifying discrepant SNPs in the process
- Read data in a variety of formats (e.g., files, bytes, compressed with gzip or zip)
- Handle several variations of file types, validated via openSNP parsing analysis
Build / Assembly Detection and Remapping
- Detect the build / assembly of SNPs (supports builds 36, 37, and 38)
- Remap SNPs between builds / assemblies
- Fix several common issues when loading SNPs
- Sort SNPs based on chromosome and position
- Deduplicate RSIDs
- Deduplicate alleles in the non-PAR regions of the X and Y chromosomes for males
- Deduplicate alleles on MT
- Assign PAR SNPs to the X or Y chromosome
- Derive sex from SNPs
- Predict ancestry from SNPs (when installed with ezancestry)
Supported Genotype Files
snps supports VCF files and genotype files from the following DNA testing sources:
- Código 46
- Family Tree DNA
- Genes for Good
- Sano Genetics
Additionally, snps can read a variety of “generic” CSV and TSV files.
snps requires Python 3.7.1+ and the following Python packages:
$ pip install snps
$ pip install snps[ezancestry]
Download Example Data
First, let’s setup logging to get some helpful output:
>>> import logging, sys >>> logger = logging.getLogger() >>> logger.setLevel(logging.INFO) >>> logger.addHandler(logging.StreamHandler(sys.stdout))
Now we’re ready to download some example data from openSNP:
>>> from snps.resources import Resources >>> r = Resources() >>> paths = r.download_example_datasets() Downloading resources/662.23andme.340.txt.gz Downloading resources/662.ftdna-illumina.341.csv.gz
Load Raw Data
Load a 23andMe raw data file:
>>> from snps import SNPs >>> s = SNPs("resources/662.23andme.340.txt.gz") >>> s.source '23andMe' >>> s.count 991786
The SNPs class accepts a path to a file or a bytes object. A Reader class attempts to infer the data source and load the SNPs. The loaded SNPs are normalized and available via a pandas.DataFrame:
>>> df = s.snps >>> df.columns.values array(['chrom', 'pos', 'genotype'], dtype=object) >>> df.index.name 'rsid' >>> df.chrom.dtype.name 'object' >>> df.pos.dtype.name 'uint32' >>> df.genotype.dtype.name 'object' >>> len(df) 991786
snps also attempts to detect the build / assembly of the data:
>>> s.build 37 >>> s.build_detected True >>> s.assembly 'GRCh37'
Merge Raw Data Files
The dataset consists of raw data files from two different DNA testing sources - let’s combine these files. Specifically, we’ll update the SNPs object with SNPs from a Family Tree DNA file.
>>> merge_results = s.merge([SNPs("resources/662.ftdna-illumina.341.csv.gz")]) Merging SNPs('662.ftdna-illumina.341.csv.gz') SNPs('662.ftdna-illumina.341.csv.gz') has Build 36; remapping to Build 37 Downloading resources/NCBI36_GRCh37.tar.gz 27 SNP positions were discrepant; keeping original positions 151 SNP genotypes were discrepant; marking those as null >>> s.source '23andMe, FTDNA' >>> s.count 1006960 >>> s.build 37 >>> s.build_detected True
If the SNPs being merged have a build that differs from the destination build, the SNPs to merge will be remapped automatically. After this example merge, the build is still detected, since the build was detected for all SNPs objects that were merged.
As the data gets added, it’s compared to the existing data, and SNP position and genotype discrepancies are identified. (The discrepancy thresholds can be tuned via parameters.) These discrepant SNPs are available for inspection after the merge via properties of the SNPs object.
>>> len(s.discrepant_merge_genotypes) 151
Additionally, any non-called / null genotypes will be updated during the merge, if the file being merged has a called genotype for the SNP.
Moreover, merge takes a chrom parameter - this enables merging of only SNPs associated with the specified chromosome (e.g., “Y” or “MT”).
Finally, merge returns a list of dict, where each dict has information corresponding to the results of each merge (e.g., SNPs in common).
>>> sorted(list(merge_results.keys())) ['common_rsids', 'discrepant_genotype_rsids', 'discrepant_position_rsids', 'merged'] >>> merge_results["merged"] True >>> len(merge_results["common_rsids"]) 692918
Now, let’s remap the merged SNPs to change the assembly / build:
>>> s.snps.loc["rs3094315"].pos 752566 >>> chromosomes_remapped, chromosomes_not_remapped = s.remap(38) Downloading resources/GRCh37_GRCh38.tar.gz >>> s.build 38 >>> s.assembly 'GRCh38' >>> s.snps.loc["rs3094315"].pos 817186
SNPs can be remapped between Build 36 (NCBI36), Build 37 (GRCh37), and Build 38 (GRCh38).
Ok, so far we’ve merged the SNPs from two files (ensuring the same build in the process and identifying discrepancies along the way). Then, we remapped the SNPs to Build 38. Now, let’s save the merged and remapped dataset consisting of 1M+ SNPs to a tab-separated values (TSV) file:
>>> saved_snps = s.save("out.txt") Saving output/out.txt >>> print(saved_snps) output/out.txt
Moreover, let’s get the reference sequences for this assembly and save the SNPs as a VCF file:
>>> saved_snps = s.save("out.vcf", vcf=True) Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.1.fa.gz Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.2.fa.gz Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.3.fa.gz Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.4.fa.gz Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.5.fa.gz Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.6.fa.gz Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.7.fa.gz Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.8.fa.gz Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.9.fa.gz Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.10.fa.gz Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.11.fa.gz Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.12.fa.gz Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.13.fa.gz Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.14.fa.gz Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.15.fa.gz Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.16.fa.gz Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.17.fa.gz Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.18.fa.gz Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.19.fa.gz Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.20.fa.gz Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.21.fa.gz Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.X.fa.gz Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.Y.fa.gz Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.MT.fa.gz Saving output/out.vcf 1 SNP positions were found to be discrepant when saving VCF
When saving a VCF, if any SNPs have positions outside of the reference sequence, they are marked as discrepant and are available via a property of the SNPs object.
All output files are saved to the output directory.
Documentation is available here.
Release history Release notifications | RSS feed
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.