fast vcf parsing with cython + htslib
cyvcf2
======
cyvcf2 is a cython wrapper around [htslib](https://github.com/samtools/htslib) built for fast parsing of [Variant Call Format](https://en.m.wikipedia.org/wiki/Variant_Call_Format) (VCF) files.
It is targeted toward our use-case in [gemini](http://gemini.rtfd.org) but should also be of general utility.
On a file with 189 samples, [cyvcf](https://github.com/arq5x/cyvcf) takes **21 seconds** to parse and extract all sample information; `cyvcf2` takes **1.4 seconds**.
Attributes like `variant.gt_ref_depths` return a numpy array directly so they are immediately ready for downstream use.
**Note** that the array is backed by the underlying C data, so once `variant` goes out of scope, the array will contain nonsense.
To persist a copy, use: `cpy = np.array(variant.gt_ref_depths)` instead of just `arr = variant.gt_ref_depths`.
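
The difference between a view and a copy can be illustrated with numpy alone. This is only a sketch: `record_buffer` is a hypothetical stand-in for the C data behind a `variant`, not part of cyvcf2, and overwriting it merely simulates that data becoming invalid.

```python
import numpy as np

# Hypothetical stand-in for the C data that backs a variant's arrays.
record_buffer = np.array([10, 12, 7], dtype=np.int32)

# A view shares memory with the buffer, like `arr = variant.gt_ref_depths`.
view = record_buffer[:]
# np.array() copies by default, like `cpy = np.array(variant.gt_ref_depths)`.
copy = np.array(record_buffer)

# Simulate the underlying data being reused once the variant is gone.
record_buffer[:] = [0, 0, 0]

view_after = view.tolist()  # follows the buffer, original values are lost
copy_after = copy.tolist()  # keeps the original values
```

Here the view ends up holding the overwritten values while the copy keeps the original depths, which is why anything kept beyond the lifetime of `variant` should be copied first.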
Example
=======
```Python
from cyvcf2 import VCF

for variant in VCF('some.vcf.gz'):
    variant.gt_types  # numpy array
    variant.gt_ref_depths, variant.gt_alt_depths  # numpy arrays
    variant.gt_phases, variant.gt_quals  # numpy arrays
    variant.gt_bases  # numpy array

    variant.CHROM, variant.start, variant.end, variant.ID, \
        variant.REF, variant.ALT, variant.FILTER, variant.QUAL

    variant.INFO.get('DP')  # int
    variant.INFO.get('FS')  # float
    variant.INFO.get('AC')  # float

    a = variant.gt_phred_ll_homref  # numpy array
    b = variant.gt_phred_ll_het  # numpy array
    c = variant.gt_phred_ll_homalt  # numpy array

    str(variant)

    # Get a numpy array of the depth per sample:
    dp = variant.format('DP', int)
    # or of any other format field:
    sb = variant.format('SB', float)
    n_samples = len(variant.gt_types)  # one entry per sample
    assert sb.shape == (n_samples, 4)  # 4 values per sample
```
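
Because the per-variant arrays are only valid while `variant` is alive, collecting values across a whole file means copying each array as it is read. The following is a minimal sketch of that pattern; the helper name `collect_ref_depths` is hypothetical and not part of cyvcf2, and it assumes only that each item exposes a `gt_ref_depths` array as in the example above.

```python
import numpy as np


def collect_ref_depths(variants):
    """Stack a copy of each variant's per-sample reference depths.

    Returns a 2-D array with one row per variant and one column per sample.
    """
    # np.array() copies, so each row survives after its variant is released.
    rows = [np.array(v.gt_ref_depths) for v in variants]
    if not rows:
        # No variants: return an empty 2-D array instead of failing in vstack.
        return np.empty((0, 0), dtype=np.int32)
    return np.vstack(rows)
```

With cyvcf2 installed, this could be called as `depths = collect_ref_depths(VCF('some.vcf.gz'))`. For large files it holds every row in memory, so it is only suitable when the full matrix is actually needed.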
Installation
============
```
pip install cyvcf2
```
Testing
=======
Tests can be run with:
```
python setup.py test
```
See Also
========
Pysam also [has a cython wrapper to htslib](https://github.com/pysam-developers/pysam/blob/master/pysam/cbcf.pyx), and one block of code here is taken directly from that library. However, the optimizations that we want for gemini are very specific, so we have chosen to create a separate project.