GATCACAGGTCTATCAC     CCTATTAACCACTCAC     GGGAGCTCTCCATGCAT     TTGGTATTTTCGTCTGG
GGGGTATGCACGCGATA     GCATTGCGAGACGCTG     GAGCCGGAGCACCCTAT     GTCGCAGTATCTGTCTT
TGA                   TTC          CTG     CCT           CAT     CCT
ATTATTTATCGCACCTA     CGT          TCA     ATATTACAGGCGAACAT     ACTTACTAAAGTGTGTT
AATTAATTAATGCTTGT     AGG          ACA     TAATAATAACAATTGAA     TGTCTGCACAGCCACTT
              TCC     ACA          CAG     ACA                                 TCA
TAACAAAAAATTTCCAC     CAA          ACC     CCC                   CCTCCCCCGCTTCTGGC
CACAGCACTTAAACACA     TCT          CTG     CCA                   AACCCCAAAAACAAAGA

snps

tools for reading, writing, merging, and remapping SNPs 🧬

Capabilities

  • Read raw data (genotype) files from a variety of direct-to-consumer (DTC) DNA testing sources

  • Read and write VCF files for Builds 36, 37, and 38 (e.g., convert 23andMe to VCF)

  • Merge raw data files from different DNA tests, identifying discrepant SNPs in the process

  • Remap SNPs between assemblies / builds (e.g., convert SNPs from Build 36 to Build 37, etc.)

Supported Genotype Files

snps supports VCF files and genotype files from the following DNA testing sources:

Dependencies

snps requires Python 3.5+ and the following Python packages:

Installation

snps is available on the Python Package Index. Install snps (and its required Python dependencies) via pip:

$ pip install snps

Examples

Download Example Data

Let’s download some example data from openSNP:

>>> from snps.resources import Resources
>>> r = Resources()
>>> paths = r.download_example_datasets()
Downloading resources/662.23andme.340.txt.gz
Downloading resources/662.ftdna-illumina.341.csv.gz

Load Raw Data

Load a 23andMe raw data file:

>>> from snps import SNPs
>>> s = SNPs('resources/662.23andme.340.txt.gz')

The loaded SNPs are available via a pandas.DataFrame:

>>> df = s.snps
>>> df.columns.values
array(['chrom', 'pos', 'genotype'], dtype=object)
>>> df.index.name
'rsid'
>>> len(df)
991786
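
To make the table layout above concrete, here is a minimal standard-library sketch that parses 23andMe-style text (tab-separated, with `#` comment lines) into records keyed by rsid. It is not how snps reads files internally; the function name `parse_raw_data`, the two `rs000000x` identifiers, and all sample genotypes are made up for illustration.

```python
import csv
import io

# Made-up sample in the tab-separated layout of a 23andMe raw data file.
# Only the rs3094315 position comes from this document; the rest is invented.
SAMPLE = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs3094315\t1\t752566\tAA\n"
    "rs0000001\t1\t800000\tAG\n"
    "rs0000002\tX\t900000\t--\n"
)


def parse_raw_data(text):
    """Return {rsid: {"chrom", "pos", "genotype"}} parsed from 23andMe-style text."""
    # Skip blank lines and comment lines before handing rows to the csv reader.
    rows = (
        line
        for line in io.StringIO(text)
        if line.strip() and not line.startswith("#")
    )
    snps = {}
    for rsid, chrom, pos, genotype in csv.reader(rows, delimiter="\t"):
        snps[rsid] = {
            "chrom": chrom,  # kept as text so that X, Y, and MT fit
            "pos": int(pos),
            # "--" marks a no-call, stored here as a missing genotype
            "genotype": None if genotype == "--" else genotype,
        }
    return snps


table = parse_raw_data(SAMPLE)
```

Keying the records by rsid mirrors the `rsid` index of the DataFrame, so a lookup such as `table["rs3094315"]["pos"]` plays the role of `s.snps.loc["rs3094315"].pos`.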

snps also attempts to detect the build / assembly of the data:

>>> s.build
37
>>> s.build_detected
True
>>> s.assembly
'GRCh37'
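
How snps detects the build is not described here. One plausible approach, sketched below under that assumption, is to compare the position of a well-known SNP with its known coordinate in each build. The lookup table uses only the two rs3094315 positions that appear in the remapping example of this document; `KNOWN_POSITIONS` and `detect_build` are hypothetical names.

```python
# Known coordinates per build for a reference SNP. The two positions for
# rs3094315 are taken from the remapping example in this document.
KNOWN_POSITIONS = {
    "rs3094315": {37: 752566, 38: 817186},
}


def detect_build(positions):
    """Return the build whose known coordinate matches, or None if undetermined.

    `positions` maps rsid -> position for the loaded data.
    """
    for rsid, by_build in KNOWN_POSITIONS.items():
        pos = positions.get(rsid)
        if pos is None:
            continue  # this reference SNP is not in the data; try the next one
        for build, known_pos in by_build.items():
            if pos == known_pos:
                return build
    return None


build = detect_build({"rs3094315": 752566})
```

A robust detector would consult several reference SNPs, since any single one may be absent from a given file.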

Remap SNPs

Let’s remap the SNPs to change the assembly / build:

>>> s.snps.loc["rs3094315"].pos
752566
>>> chromosomes_remapped, chromosomes_not_remapped = s.remap_snps(38)
Downloading resources/GRCh37_GRCh38.tar.gz
>>> s.build
38
>>> s.assembly
'GRCh38'
>>> s.snps.loc["rs3094315"].pos
817186

SNPs can be remapped between Build 36 (NCBI36), Build 37 (GRCh37), and Build 38 (GRCh38).
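
Conceptually, remapping shifts each position according to the mapped segment it falls in. The sketch below illustrates that idea with a toy segment table; it is not the snps implementation. The segment bounds are invented, the first segment's offset is chosen to reproduce the rs3094315 example above (752566 to 817186), and the sketch ignores strand changes that real assembly mappings may involve.

```python
from bisect import bisect_right

# Hypothetical mapping segments, sorted by source start:
# (source_start, source_end, target_start). Bounds are invented; the first
# segment's offset reproduces the rs3094315 example from this document.
SEGMENTS = [
    (700000, 799999, 764620),
    (900000, 999999, 950000),
]


def remap_position(pos, segments=SEGMENTS):
    """Return the remapped position, or None if no segment covers `pos`."""
    starts = [segment[0] for segment in segments]
    # Find the last segment whose start is <= pos.
    i = bisect_right(starts, pos) - 1
    if i < 0:
        return None
    start, end, target_start = segments[i]
    if pos > end:
        return None  # pos falls in a gap between segments
    return target_start + (pos - start)


new_pos = remap_position(752566)
```

Positions outside every segment come back as `None`, which corresponds to SNPs that cannot be remapped.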

Merge Raw Data Files

The dataset consists of raw data files from two different DNA testing sources. Let’s combine these files using a SNPsCollection.

>>> from snps import SNPsCollection
>>> sc = SNPsCollection("resources/662.ftdna-illumina.341.csv.gz", name="User662")
Loading resources/662.ftdna-illumina.341.csv.gz
>>> sc.build
36
>>> chromosomes_remapped, chromosomes_not_remapped = sc.remap_snps(37)
Downloading resources/NCBI36_GRCh37.tar.gz
>>> sc.snp_count
708092

As the data gets added, it’s compared to the existing data, and SNP position and genotype discrepancies are identified. (The discrepancy thresholds can be tuned via parameters.)

>>> sc.load_snps(["resources/662.23andme.340.txt.gz"], discrepant_genotypes_threshold=300)
Loading resources/662.23andme.340.txt.gz
27 SNP positions were discrepant; keeping original positions
151 SNP genotypes were discrepant; marking those as null
>>> len(sc.discrepant_snps)  # SNPs with discrepant positions and genotypes, dropping dups
169
>>> sc.snp_count
1006960
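
The merge behavior reported above (keep the original position, null out conflicting genotypes) can be sketched with plain dictionaries. This is an illustration under stated assumptions rather than the snps code: `merge_snps` is a hypothetical name, genotypes are compared ignoring allele order, a missing genotype is filled from the new data, and the discrepancy thresholds are omitted.

```python
def merge_snps(existing, new):
    """Merge `new` into a copy of `existing`.

    Both arguments map rsid -> {"chrom", "pos", "genotype"}.
    Returns (merged, discrepant_positions, discrepant_genotypes).
    """
    merged = {rsid: dict(rec) for rsid, rec in existing.items()}
    discrepant_positions = []
    discrepant_genotypes = []
    for rsid, rec in new.items():
        if rsid not in merged:
            merged[rsid] = dict(rec)  # SNP only present in the new data
            continue
        current = merged[rsid]
        if (current["chrom"], current["pos"]) != (rec["chrom"], rec["pos"]):
            # Position conflict: record it and keep the original position.
            discrepant_positions.append(rsid)
        old_gt, new_gt = current["genotype"], rec["genotype"]
        if old_gt is None:
            current["genotype"] = new_gt  # assumption: fill in missing calls
        elif new_gt is not None and sorted(old_gt) != sorted(new_gt):
            # Genotype conflict (allele order ignored): mark as null.
            discrepant_genotypes.append(rsid)
            current["genotype"] = None
    return merged, discrepant_positions, discrepant_genotypes


# Made-up records for illustration.
existing = {
    "rs1": {"chrom": "1", "pos": 100, "genotype": "AG"},
    "rs2": {"chrom": "1", "pos": 200, "genotype": "CC"},
    "rs3": {"chrom": "1", "pos": 300, "genotype": None},
}
new = {
    "rs1": {"chrom": "1", "pos": 100, "genotype": "GA"},
    "rs2": {"chrom": "1", "pos": 250, "genotype": "CT"},
    "rs3": {"chrom": "1", "pos": 300, "genotype": "TT"},
    "rs4": {"chrom": "2", "pos": 400, "genotype": "AA"},
}
merged, bad_positions, bad_genotypes = merge_snps(existing, new)
# A SNP can appear in both lists, so a combined count needs de-duplication.
discrepant = set(bad_positions) | set(bad_genotypes)
```

The final set union shows why the combined discrepancy count can be smaller than the sum of the two reported counts.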

Save SNPs

So far we’ve remapped the SNPs to the same build and merged the SNPs from two files, identifying discrepancies along the way. Let’s save the merged dataset, which consists of over 1M SNPs, to a CSV file:

>>> saved_snps = sc.save_snps()
Saving output/User662_GRCh37.csv

Moreover, let’s get the reference sequences for this assembly and save the SNPs as a VCF file:

>>> saved_snps = sc.save_snps(vcf=True)
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.1.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.2.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.3.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.4.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.5.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.6.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.7.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.8.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.9.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.10.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.11.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.12.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.13.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.14.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.15.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.16.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.17.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.18.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.19.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.20.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.21.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.22.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.X.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.Y.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.MT.fa.gz
Saving output/User662_GRCh37.vcf

All output files are saved to the output directory.
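
Saving as VCF needs the reference sequences because each VCF record expresses the genotype relative to the reference base at that position. The sketch below builds one VCF data line from a diploid genotype and a reference base. It is a simplified illustration, not the snps writer: `vcf_record` is a hypothetical name, and the rsid, position, and reference base in the example are made up.

```python
def vcf_record(rsid, chrom, pos, genotype, ref):
    """Build one tab-separated VCF data line for a diploid genotype such as "AG"."""
    # ALT lists the distinct observed alleles that differ from the reference.
    alts = []
    for allele in genotype:
        if allele != ref and allele not in alts:
            alts.append(allele)
    # GT uses index 0 for REF and 1, 2, ... for the entries of ALT.
    order = [ref] + alts
    gt = "/".join(str(order.index(allele)) for allele in genotype)
    alt_field = ",".join(alts) if alts else "."
    # Columns: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
    return "\t".join(
        [chrom, str(pos), rsid, ref, alt_field, ".", ".", ".", "GT", gt]
    )


# Made-up SNP: genotype AG at a position where the reference base is assumed to be A.
line = vcf_record("rs0000001", "1", 12345, "AG", ref="A")
```

Without the reference base there is no way to decide which allele belongs in REF and which in ALT, which is why the FASTA files are downloaded first.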

Documentation

Documentation is available here.

Acknowledgements

Thanks to Mike Agostino, Padma Reddy, Kevin Arvai, openSNP, and Open Humans. Logo composed of nucleotides from GRCh38 mitochondrial DNA.
