A fast 23andMe raw genome file parser

arv — a fast 23andMe parser for Python

Arv (Norwegian; “heritage” or “inheritance”) is a Python module for parsing raw 23andMe genome files. It lets you look up SNPs by their RSIDs.

from arv import load, unphased_match as match

genome = load("genome.txt")

print("You are a {gender} with {color} eyes and {complexion} skin.".format(
  gender     = "man" if genome.y_chromosome else "woman",
  complexion = "light" if genome["rs1426654"] == "AA" else "dark",
  color      = match(genome["rs12913832"], {"AA": "brown",
                                            "AG": "brown or green",
                                            "GG": "blue"})))

For my genome, this little program produces:

You are a man with blue eyes and light skin.

The parser is insanely fast, having been written in finely tuned C++, exposed via Cython. A 2013 Xeon machine I’ve tested on parses a 24 MB file into a hash table in 70 ms!

Works with Python 2.7+ and 3+. Installable with pip!

$ pip install --upgrade arv

See below for software requirements.

Important disclaimer

It’s very important to tell you that I, the author of arv, am merely a hobbyist! I’m a professional software developer, not a geneticist, medical doctor or anything like that.

Because of that, this software may not only look weird to people in the field, it may also contain serious errors. If you find any problem whatsoever, please submit a GitHub issue!

This is a slightly modified version of what I wrote for the original software called “dna-traits”, and the same goes for this software:

In addition to the GPL v3 license terms, and given that this code deals with health-related issues, I want to stress that the provided code most likely contains errors, or invalid genome reports. Results from this code must be interpreted as HIGHLY SPECULATIVE and may even be downright INCORRECT. Always consult an expert (medical doctor, geneticist, etc.) for guidance. I take NO RESPONSIBILITY whatsoever for any consequences of using this code, including but not limited to loss of life, money, spouses, self-esteem and so on. Use at YOUR OWN RISK.

The intended use is for casual, educational purposes. If this code is used for research purposes, please cross-check key results with other software: The parser code may contain serious errors, for example.

An interesting story about the research part: I once released a pretty good Mersenne Twister PRNG that ended up being used in research. Turned out the engine had bugs, and by the time I had fixed them, the poor researcher already had results (hopefully not published; I don’t know). The guy had to go back and fix his stuff, and I felt terribly bad about it.

So beware!

Installation

The recommended way is to install from PyPI.

$ pip install arv

This will most likely build Arv from source. The package requires Cython, but it doesn’t check if you have a C++ compiler. Currently, it expects that you have clang++ or g++.

If you have problems running pip install arv, please open an issue on GitHub with as much detail as possible (g++/clang++ --version, uname -a, python --version and so on).

If you set the environment variable ARV_DEBUG, it will build with full warnings and debug symbols.

Usage

First you need to download the raw genome file from 23andMe. You’ll find it under the raw genome browser. You may have to unzip it first: The parser works on the pure text files.

Then you load the genome in Python with

>>> genome = arv.load("filename.txt")
>>> genome
<Genome: SNPs=960614, name='filename.txt'>

To see if there are any Y-chromosomes present in the genome,

>>> genome.y_chromosome
True

The genome provides a dict-like interface. To get the genotype of a given SNP, just index the genome with its RSID. The genotype is returned as a string.

>>> genome["rs123"]
'AA'
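If you want to look up several RSIDs at once, a small helper can be handy. The following is only a sketch: `lookup_genotypes` is a hypothetical name, not part of arv, and it assumes that indexing with an unknown RSID raises `KeyError` the way a plain dict does (I have not verified how arv behaves here). A plain dict stands in for the loaded genome so the snippet runs on its own.

```python
def lookup_genotypes(genome, rsids):
    """Return {rsid: genotype string, or None if the RSID is absent}."""
    results = {}
    for rsid in rsids:
        try:
            results[rsid] = str(genome[rsid])
        except KeyError:
            # Assumption: a missing RSID raises KeyError, as a dict would.
            results[rsid] = None
    return results


# Stand-in for a parsed genome; with arv you would pass arv.load("genome.txt").
fake_genome = {"rs123": "AA", "rs4988235": "AA"}
found = lookup_genotypes(fake_genome, ["rs123", "rs4988235", "rs0"])
```

Here `found` maps the two known RSIDs to `'AA'` and the unknown one to `None`.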

You can also access the SNP as an object:

>>> snp = genome.get_snp("rs123")
>>> snp
<SNP: chromosome=7 position=24966446 genotype='AA'>
>>> snp.chromosome
7
>>> snp.position
24966446
>>> snp.genotype
'AA'

The last line actually returns a PyGenotype object, but its repr returns something that looks like a string. This lets you perform a few operations on the nucleotides. For example, you can get its complement with the ~-operator.

>>> type(snp.genotype)
<type '_arv.PyGenotype'>
>>> snp.genotype
'AA'
>>> ~snp.genotype
'TT'

The complement is important due to each SNP’s orientation. All of 23andMe’s SNPs are oriented towards the positive (“plus”) strand, based on the GRCh37 reference human genome assembly. But some SNPs on SNPedia are given with the minus orientation <http://snpedia.com/index.php/Orientation>.
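To make the idea of a complement concrete, here is a minimal pure-Python sketch of what complementing a genotype string means. It is not arv's implementation (that lives in C++ behind the ~ operator); the `complement` function and the `COMPLEMENT` table are names made up for this illustration, and passing unknown symbols through unchanged is an assumption.

```python
# Watson-Crick base pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(genotype):
    """Complement each nucleotide in a genotype string such as 'AG'.

    Assumption: any symbol that is not A, C, G or T is left unchanged.
    """
    return "".join(COMPLEMENT.get(base, base) for base in genotype)


minus_strand = complement("AA")
```

Complementing twice gives back the original genotype, so a plus-oriented genotype can be converted to the minus orientation and back without loss.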

For example, to determine if the human in question is likely lactose tolerant or not, we can look at rs4988235 <http://snpedia.com/index.php/Rs4988235>. SNPedia reports its “Stabilized” orientation to be minus, so we need to use the complement:

>>> genome.get_snp("rs4988235").genotype
'AA'
>>> ~genome.get_snp("rs4988235").genotype
'TT'

By reading a few GWAS research papers, we can build a rule to determine a human’s likelihood of lactose tolerance:

>>> arv.unphased_match(~genome.get_snp("rs4988235").genotype, {
    "TT": "Likely lactose tolerant",
    "TC": "Likely lactose tolerant",
    "CC": "Likely lactose intolerant",
    None: "Unable to determine (genotype not present)"})
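A rule table like the one above can be thought of as an order-insensitive lookup: raw 23andMe data is unphased, so "TC" and "CT" describe the same genotype. The sketch below shows one way such a match could work. It is not arv's actual code; `unphased_lookup` is a hypothetical name, and falling back to the `None` entry for unmatched genotypes is an assumption.

```python
def unphased_lookup(genotype, table):
    """Look up a genotype in a rule table, ignoring allele order.

    Assumption: the entry under the key None is the fallback for
    genotypes that are missing or not listed in the table.
    """
    fallback = table.get(None)
    if genotype is None:
        return fallback
    # Sort the alleles so that "TC" and "CT" normalize to the same key.
    key = "".join(sorted(str(genotype)))
    normalized = {
        "".join(sorted(k)): v for k, v in table.items() if k is not None
    }
    return normalized.get(key, fallback)


rules = {
    "TT": "Likely lactose tolerant",
    "TC": "Likely lactose tolerant",
    "CC": "Likely lactose intolerant",
    None: "Unable to determine (genotype not present)",
}
result = unphased_lookup("CT", rules)
```

With these rules, both "CT" and "TC" resolve to the tolerant entry, and anything not listed falls through to the `None` entry.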

Note that for non-professionals, reading GWAS papers can be a bit tricky, and it is very tempting to jump to conclusions without understanding all of it. But for casual users, it’s very educational and most of all very fun to try your hand at inferring various results.

Command line interface

You can also invoke arv from the command line:

$ python -m arv --help

For example, you can drop into a Python REPL like so:

$ python -m arv --repl genome.txt
genome.txt ... 960614 SNPs, male
Type `genome` to see the parsed 23andMe raw genome file
>>> genome
<Genome: SNPs=960614, name='genome.txt'>
>>> genome["rs123"]
'AA'

If you specify several files, you can access them through the variable genomes.

The example at the top of this document can be run with --example:

$ python -m arv --example genome.txt
genome.txt ... 960614 SNPs, male

genome.txt ... A man with blue eyes and light skin

License

Copyright 2017 Christian Stigen Larsen

Distributed under the GNU GPL v3 or later. See the file COPYING for the full license text. This software makes use of open source software; see LICENSES for details.
