snps

tools for reading, writing, merging, and remapping SNPs 🧬

Features

Input / Output

Read raw data (genotype) files from a variety of direct-to-consumer (DTC) DNA testing sources with a SNPs object
Read and write VCF files (e.g., convert 23andMe to VCF)
Merge raw data files from different DNA tests, identifying discrepant SNPs in the process with a SNPsCollection object
Read data in a variety of formats (e.g., files, bytes, compressed with gzip or zip)
Handle several variations of file types, validated via openSNP parsing analysis
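As a rough illustration of reading data from a path or from bytes, with or without compression, here is a minimal sketch that uses only the standard library. The read_text helper is hypothetical and is not part of the snps API; it simply inspects the leading magic bytes to decide whether the data is gzip- or zip-compressed. The sample row is made up.

```python
import gzip
import io
import zipfile


def read_text(source):
    """Return decoded text from a file path or a bytes object.

    Hypothetical helper for illustration; it is not part of the snps API.
    """
    if isinstance(source, bytes):
        data = source
    else:
        with open(source, "rb") as f:
            data = f.read()

    if data[:2] == b"\x1f\x8b":  # gzip magic number
        data = gzip.decompress(data)
    elif data[:4] == b"PK\x03\x04":  # zip local file header
        with zipfile.ZipFile(io.BytesIO(data)) as z:
            # Assumption: the archive holds a single raw data file.
            data = z.read(z.namelist()[0])

    return data.decode("utf-8")


raw = b"rs1\t1\t100\tAG\n"  # made-up row: rsid, chromosome, position, genotype
print(read_text(raw) == read_text(gzip.compress(raw)))  # True
```

The same function handles plain and compressed input, so calling code does not need to know how the data was stored.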

Build / Assembly Detection and Remapping

Detect the build / assembly of SNPs (supports builds 36, 37, and 38)
Remap SNPs between builds / assemblies
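One way to think about build detection is as a lookup: a SNP's position differs between builds, so a known position identifies the build. The sketch below assumes this approach and is not snps's actual implementation. The detect_build helper and the KNOWN_POSITIONS table are hypothetical; the two positions for rs3094315 are the ones shown in the Remap SNPs example of this README.

```python
# Known positions of a reference SNP in each build. The values for builds 37
# and 38 come from the remapping example in this README; a real lookup table
# would cover several SNPs and all three supported builds.
KNOWN_POSITIONS = {
    "rs3094315": {37: 752566, 38: 817186},
}


def detect_build(positions):
    """Return the build whose known positions match, or None.

    `positions` maps rsid -> position. Hypothetical helper, not the snps API.
    """
    for rsid, by_build in KNOWN_POSITIONS.items():
        observed = positions.get(rsid)
        for build, expected in by_build.items():
            if observed == expected:
                return build
    return None


print(detect_build({"rs3094315": 752566}))  # 37
```

If none of the reference SNPs is present, or no position matches, the function returns None rather than guessing.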

Data Cleaning

Fix several common issues when loading SNPs
Sort SNPs based on chromosome and position
Deduplicate RSIDs
Deduplicate alleles in the non-PAR regions of the X and Y chromosomes for males
Assign PAR SNPs to the X or Y chromosome
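Two of the cleaning steps above, sorting and deduplicating RSIDs, can be sketched with plain Python. This is only an illustration on made-up records, not how snps implements them; keeping the first occurrence of a repeated RSID and the chromosome order used here (1-22, then X, Y, MT) are assumptions.

```python
def chrom_key(chrom):
    """Order chromosomes 1-22 numerically, then X, Y, MT (assumed order)."""
    order = {"X": 23, "Y": 24, "MT": 25}
    return int(chrom) if chrom.isdigit() else order.get(chrom, 26)


def clean(records):
    """Drop repeated rsids (keeping the first) and sort by chrom, then pos.

    `records` is a list of (rsid, chrom, pos, genotype) tuples.
    """
    seen = set()
    unique = []
    for rec in records:
        if rec[0] not in seen:
            seen.add(rec[0])
            unique.append(rec)
    return sorted(unique, key=lambda r: (chrom_key(r[1]), r[2]))


records = [
    ("rs3", "X", 500, "AG"),
    ("rs2", "10", 200, "CC"),
    ("rs1", "2", 900, "TT"),
    ("rs2", "10", 200, "CC"),  # duplicate rsid
]
for rec in clean(records):
    print(rec)
```

Sorting with a numeric key matters because a plain string sort would place chromosome "10" before chromosome "2".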

Supported Genotype Files

snps supports VCF files and genotype files from the following DNA testing sources:

Additionally, snps can read a variety of “generic” CSV and TSV files.
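To give a sense of what a “generic” file might look like, the sketch below parses CSV or TSV text with the standard library. The parse_generic helper is hypothetical, and the column layout (rsid, chromosome, position, genotype) is an assumption made for the example, not a statement of which layouts snps accepts.

```python
import csv
import io


def parse_generic(text):
    """Parse CSV or TSV text with a header row into a dict keyed by rsid.

    Assumes the columns are rsid, chromosome, position, genotype, in that
    order; this layout is an assumption made for the example.
    """
    header = text.splitlines()[0]
    # Pick whichever candidate delimiter appears more often in the header.
    delimiter = "\t" if header.count("\t") > header.count(",") else ","
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    next(reader)  # skip the header row
    snps = {}
    for row in reader:
        if not row:
            continue  # ignore blank lines
        rsid, chrom, pos, genotype = row
        snps[rsid] = (chrom, int(pos), genotype)
    return snps


tsv_text = "rsid\tchromosome\tposition\tgenotype\nrs1\t1\t100\tAG\n"
csv_text = "rsid,chromosome,position,genotype\nrs1,1,100,AG\n"
print(parse_generic(tsv_text) == parse_generic(csv_text))  # True
```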

Dependencies

snps requires Python 3.5+ and the following Python packages:

Installation

snps is available on the Python Package Index. Install snps (and its required Python dependencies) via pip:

$ pip install snps

Examples

Download Example Data

First, let’s set up logging to get some helpful output:

>>> import logging, sys
>>> logger = logging.getLogger()
>>> logger.setLevel(logging.INFO)
>>> logger.addHandler(logging.StreamHandler(sys.stdout))

Now we’re ready to download some example data from openSNP:

>>> from snps.resources import Resources
>>> r = Resources()
>>> paths = r.download_example_datasets()
Downloading resources/662.23andme.340.txt.gz
Downloading resources/662.ftdna-illumina.341.csv.gz

Load Raw Data

Load a 23andMe raw data file:

>>> from snps import SNPs
>>> s = SNPs('resources/662.23andme.340.txt.gz')

The SNPs class accepts a path to a file or a bytes object. A Reader class attempts to infer the data source and load the SNPs. The loaded SNPs are available via a pandas.DataFrame:

>>> df = s.snps
>>> df.columns.values
array(['chrom', 'pos', 'genotype'], dtype=object)
>>> df.index.name
'rsid'
>>> len(df)
991786

snps also attempts to detect the build / assembly of the data:

>>> s.build
37
>>> s.build_detected
True
>>> s.assembly
'GRCh37'

Remap SNPs

Let’s remap the SNPs to change the assembly / build:

>>> s.snps.loc["rs3094315"].pos
752566
>>> chromosomes_remapped, chromosomes_not_remapped = s.remap_snps(38)
Downloading resources/GRCh37_GRCh38.tar.gz
>>> s.build
38
>>> s.assembly
'GRCh38'
>>> s.snps.loc["rs3094315"].pos
817186

SNPs can be remapped between Build 36 (NCBI36), Build 37 (GRCh37), and Build 38 (GRCh38).
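Conceptually, remapping shifts each position according to a mapping between the two assemblies. The sketch below assumes the mapping is a sorted list of forward-strand segments for one chromosome and is not snps's actual implementation. The remap_position helper is hypothetical, and the segment coordinates are invented; they were chosen so that the offset reproduces the rs3094315 positions shown above.

```python
from bisect import bisect_right

# (source_start, source_end, target_start) for one chromosome, sorted by
# source_start. These coordinates are invented for the example.
SEGMENTS = [
    (700000, 800000, 764620),
]


def remap_position(pos, segments=SEGMENTS):
    """Return the remapped position, or None if no segment covers `pos`.

    Only forward-strand segments are handled in this sketch.
    """
    starts = [seg[0] for seg in segments]
    i = bisect_right(starts, pos) - 1  # last segment starting at or before pos
    if i < 0:
        return None
    source_start, source_end, target_start = segments[i]
    if pos > source_end:
        return None
    return pos - source_start + target_start


print(remap_position(752566))  # 817186
```

Positions that fall outside every segment return None, which is one reason some SNPs may not survive a remap.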

Merge Raw Data Files

The dataset consists of raw data files from two different DNA testing sources. Let’s combine these files using a SNPsCollection.

>>> from snps import SNPsCollection
>>> sc = SNPsCollection("resources/662.ftdna-illumina.341.csv.gz", name="User662")
Loading resources/662.ftdna-illumina.341.csv.gz
>>> sc.build
36
>>> chromosomes_remapped, chromosomes_not_remapped = sc.remap_snps(37)
Downloading resources/NCBI36_GRCh37.tar.gz
>>> sc.snp_count
708092

As the data gets added, it’s compared to the existing data, and SNP position and genotype discrepancies are identified. (The discrepancy thresholds can be tuned via parameters.)

>>> sc.load_snps(["resources/662.23andme.340.txt.gz"], discrepant_genotypes_threshold=300)
Loading resources/662.23andme.340.txt.gz
27 SNP positions were discrepant; keeping original positions
151 SNP genotypes were discrepant; marking those as null
>>> len(sc.discrepant_snps)  # SNPs with discrepant positions and genotypes, dropping dups
169
>>> sc.snp_count
1006960
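The comparison described above can be sketched with plain dictionaries. This is an illustration under stated assumptions, not the SNPsCollection implementation: the merge helper is hypothetical, the records are made up, allele order is ignored when comparing genotypes, and the discrepancy thresholds are left out.

```python
def merge(existing, new):
    """Merge `new` into `existing`, both mapping rsid -> (chrom, pos, genotype).

    Returns the merged dict plus the rsids with discrepant positions and
    discrepant genotypes. Hypothetical helper, not the snps API.
    """
    merged = dict(existing)
    discrepant_positions, discrepant_genotypes = [], []
    for rsid, (chrom, pos, genotype) in new.items():
        if rsid not in merged:
            merged[rsid] = (chrom, pos, genotype)
            continue
        old_chrom, old_pos, old_genotype = merged[rsid]
        if (chrom, pos) != (old_chrom, old_pos):
            discrepant_positions.append(rsid)  # keep the original position
        # Assumption: allele order is ignored, so "AG" matches "GA".
        if (
            old_genotype is not None
            and genotype is not None
            and sorted(old_genotype) != sorted(genotype)
        ):
            discrepant_genotypes.append(rsid)
            merged[rsid] = (old_chrom, old_pos, None)  # mark as null
        elif old_genotype is None:
            merged[rsid] = (old_chrom, old_pos, genotype)  # fill in a gap
    return merged, discrepant_positions, discrepant_genotypes


existing = {"rs1": ("1", 100, "AG"), "rs2": ("1", 200, "CC")}
new = {"rs1": ("1", 100, "GA"), "rs2": ("1", 250, "CT"), "rs3": ("2", 300, "TT")}
merged, bad_pos, bad_gt = merge(existing, new)
print(len(merged), bad_pos, bad_gt)  # 3 ['rs2'] ['rs2']
```

Here rs2 is discrepant in both position and genotype, so it is counted once when the two lists are combined without duplicates.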

Save SNPs

OK, so far we’ve remapped the SNPs to the same build and merged the SNPs from two files, identifying discrepancies along the way. Let’s save the merged dataset, which consists of over 1M SNPs, to a CSV file:

>>> saved_snps = sc.save_snps()
Saving output/User662_GRCh37.csv

Moreover, let’s get the reference sequences for this assembly and save the SNPs as a VCF file:

>>> saved_snps = sc.save_snps(vcf=True)
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.1.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.2.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.3.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.4.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.5.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.6.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.7.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.8.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.9.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.10.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.11.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.12.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.13.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.14.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.15.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.16.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.17.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.18.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.19.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.20.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.21.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.22.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.X.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.Y.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.MT.fa.gz
Saving output/User662_GRCh37.vcf

All output files are saved to the output directory.
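The reference sequences are needed because a VCF record states each genotype relative to the reference allele, which raw data files do not include. The sketch below builds one VCF data line from a reference base and a genotype. The vcf_record helper is hypothetical and the values are made up; it is not how snps writes VCF files.

```python
def vcf_record(chrom, pos, rsid, ref, genotype):
    """Build one tab-separated VCF data line for a SNP genotype.

    Columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, sample.
    """
    alts = []
    for allele in genotype:
        if allele != ref and allele not in alts:
            alts.append(allele)
    alleles = [ref] + alts  # index 0 is REF, 1.. are the ALT alleles
    indices = sorted(alleles.index(a) for a in genotype)
    gt = "/".join(str(i) for i in indices)
    alt = ",".join(alts) if alts else "."
    return "\t".join([chrom, str(pos), rsid, ref, alt, ".", ".", ".", "GT", gt])


# Made-up SNP: reference base G, genotype AG -> ALT is A, GT is 0/1.
print(vcf_record("1", 100, "rs1", "G", "AG"))
```

A genotype that matches the reference on both alleles yields "." for ALT and 0/0 for GT.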

Documentation

Documentation is available here.

Acknowledgements

Thanks to Mike Agostino, Padma Reddy, Kevin Arvai, openSNP, Open Humans, and Sano Genetics.

snps 1.2.2
