This is a pre-production deployment of Warehouse. Changes made here affect the production instance of PyPI (pypi.python.org).
Help us improve Python packaging - Donate today!

Load numpy arrays from a VCF (variant call file).

Project Description

Load data from a VCF (variant call format) file into numpy arrays, and (optionally) from there into an HDF5 file.

Installation

Installation requires numpy and cython:

$ pip install cython
$ pip install numpy
$ pip install vcfnp

…or:

$ git clone --recursive git://github.com/alimanfoo/vcfnp.git
$ cd vcfnp
$ python setup.py build_ext --inplace

Usage

For usage from Python, see the IPython notebook example, or try:

>>> from __future__ import print_function, division
>>> import numpy as np
>>> import matplotlib
>>> matplotlib.use('TkAgg')
>>> import matplotlib.pyplot as plt
>>> import vcfnp
>>> vcfnp.__version__
'2.0.0'
>>> filename = 'fixture/sample.vcf'
>>> # load data from fixed fields (including INFO)
... v = vcfnp.variants(filename, cache=True).view(np.recarray)
[vcfnp] 2015-01-23 11:10:46.670723 :: caching is enabled
[vcfnp] 2015-01-23 11:10:46.670830 :: cache file available
[vcfnp] 2015-01-23 11:10:46.670866 :: loading from cache file fixture/sample.vcf.vcfnp_cache/variants.npy
>>> # print some simple variant metrics
... print('found %s variants (%s SNPs)' % (v.size, np.count_nonzero(v.is_snp)))
found 9 variants (5 SNPs)
>>> print('QUAL mean (std): %s (%s)' % (np.mean(v.QUAL), np.std(v.QUAL)))
QUAL mean (std): 25.0667 (22.816)
>>> # plot a histogram of variant depth
... fig = plt.figure(1)
>>> ax = fig.add_subplot(111)
>>> ax.hist(v.DP)
(array([ 4.,  0.,  0.,  0.,  0.,  0.,  1.,  2.,  0.,  2.]), array([  0. ,   1.4,   2.8,   4.2,   5.6,   7. ,   8.4,   9.8,  11.2,
        12.6,  14. ]), <a list of 10 Patch objects>)
>>> ax.set_title('DP histogram')
<matplotlib.text.Text object at 0x7f28f18f5c50>
>>> ax.set_xlabel('DP')
<matplotlib.text.Text object at 0x7f28f207c3c8>
>>> plt.show()
>>> # load data from sample columns
... c = vcfnp.calldata_2d(filename, cache=True).view(np.recarray)
>>> # print some simple genotype metrics
... count_phased = np.count_nonzero(c.is_phased)
>>> count_variant = np.count_nonzero(np.any(c.genotype > 0, axis=2))
>>> count_missing = np.count_nonzero(~c.is_called)
>>> print('calls (phased, variant, missing): %s (%s, %s, %s)'
...     % (c.flatten().size, count_phased, count_variant, count_missing))
calls (phased, variant, missing): 27 (14, 12, 2)
>>> # plot a histogram of genotype quality
... fig = plt.figure(2)
>>> ax = fig.add_subplot(111)
>>> ax.hist(c.GQ.flatten())
(array([ 15.,   0.,   1.,   1.,   0.,   1.,   2.,   4.,   2.,   1.]), array([  0. ,   6.1,  12.2,  18.3,  24.4,  30.5,  36.6,  42.7,  48.8,
        54.9,  61. ]), <a list of 10 Patch objects>)
>>> ax.set_title('GQ histogram')
<matplotlib.text.Text object at 0x7f28f1eb1cc0>
>>> ax.set_xlabel('GQ')
<matplotlib.text.Text object at 0x7f28f18d4fd0>
>>> plt.show()

Command line scripts are also provided to facilitate parallelizing the conversion of a VCF file to NPY arrays split by genome region. For example, the following command will create an NPY file containing a variants array for the second 100kb on chromosome 2:

$ vcf2npy \
    --vcf /path/to/my.vcf \
    --fasta /path/to/ref.fa \
    --output-dir /path/to/npy/output \
    --array-type variants \
    --chromosome chr20 \
    --task-size 100000 \
    --task-index 2 \
    --progress 1000

For those with access to a cluster running Sun Grid Engine a script is provided to submit a job array parallelizing the conversion, e.g.:

$ qsub_vcf2npy \
    --vcf /path/to/my.vcf \
    --fasta /path/to/ref.fa \
    --output-dir /path/to/npy/output \
    --array-type variants \
    --chromosome chr20 \
    --task-size 100000 \
    --progress 1000 \
    -l h_vmem=1G \
    -N test_vcfnp \
    -j y \
    -o /path/to/sge/logs \
    -q shortrun.q

It should be straightforward to adapt this script to run on other parallel computing platforms, see the scripts folder for the source code.

A script is also provided to load data from multiple NPY files into a single HDF5 file. E.g., after having converted a VCF file to 100kb variants and calldata_2d NPY splits, run something like:

$ vcfnpy2hdf5 \
    --vcf /path/to/my.vcf \
    --input-dir /path/to/npy/output \
    --output /path/to/my.h5

If you want to group the data by chromosome, do something like the following for each chromosome separately:

$ vcfnpy2hdf5 \
    --vcf /path/to/my.vcf \
    --input-dir /path/to/npy/output \
    --input-filename-template {array_type}.chr20*.npy \
    --output /path/to/my.h5 \
    --group chr20

There is also a script which will process a VCF file in parallel on the local machine and load into an HDF5 file, e.g.:

$ vcf2hdf5_parallel \
    --vcf /path/to/my.vcf \
    --fasta /path/to/refseq.fa

Finally, there is a script fo converting the fixed fields of a VCF file to CSV, e.g.:

$ vcf2csv \
    --vcf /path/to/my.vcf \
    --dialect excel-tab \
    --flatten-filter

Acknowledgments

Based on Erik Garrison’s vcflib.

Release History

Release History

This version
History Node

2.3.0

History Node

2.2.0

History Node

2.1.5

History Node

2.1.4

History Node

2.1.3

History Node

2.1.2

History Node

2.1.1

History Node

2.1.0

History Node

2.0.1

History Node

2.0.0

History Node

1.12

History Node

1.11.5

History Node

1.11.4

History Node

1.11.3

History Node

1.11.2

History Node

1.11.1

History Node

1.11

History Node

1.10.2

History Node

1.10.1

History Node

1.10

History Node

1.9.1

History Node

1.9

History Node

1.8

History Node

1.7

History Node

1.6

History Node

1.5

History Node

1.4

History Node

1.3

History Node

1.2

History Node

1.1

History Node

1.0.1

History Node

1.0

History Node

0.16

History Node

0.15

History Node

0.14

History Node

0.13

History Node

0.12

History Node

0.11.2

History Node

0.11.1

History Node

0.11

History Node

0.10

History Node

0.9

History Node

0.8

History Node

0.7

History Node

0.6

History Node

0.5

History Node

0.4

History Node

0.3

History Node

0.4-SNAPSHOT

Download Files

Download Files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

File Name & Checksum SHA256 Checksum Help Version File Type Upload Date
vcfnp-2.3.0.tar.gz (8.1 MB) Copy SHA256 Checksum SHA256 Source Jul 20, 2016

Supported By

WebFaction WebFaction Technical Writing Elastic Elastic Search Pingdom Pingdom Monitoring Dyn Dyn DNS Sentry Sentry Error Logging CloudAMQP CloudAMQP RabbitMQ Heroku Heroku PaaS Kabu Creative Kabu Creative UX & Design Fastly Fastly CDN DigiCert DigiCert EV Certificate Rackspace Rackspace Cloud Servers DreamHost DreamHost Log Hosting