fast vcf parsing with cython + htslib
cyvcf2
======
[Documentation](http://brentp.github.io/cyvcf2/)
If you use cyvcf2, please cite the [paper](https://academic.oup.com/bioinformatics/article/2971439/cyvcf2).
Fast Python **(2 and 3)** parsing of VCF and BCF, including region queries.
[Build status](https://travis-ci.org/brentp/cyvcf2)
cyvcf2 is a cython wrapper around [htslib](https://github.com/samtools/htslib) built for fast parsing of [Variant Call Format](https://en.m.wikipedia.org/wiki/Variant_Call_Format) (VCF) files.
Attributes like `variant.gt_ref_depths` return a numpy array directly so they are immediately ready for downstream use.
**Note** that the array is backed by the underlying C data, so once `variant` goes out of scope, the array will contain nonsense.
To persist a copy, use: `cpy = np.array(variant.gt_ref_depths)` instead of just `arr = variant.gt_ref_depths`.
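
The difference between a view and a copy can be illustrated without a VCF file. The sketch below is a NumPy-only analogy: `record_buffer` is a hypothetical stand-in for the C memory that backs a variant, not a cyvcf2 object, and overwriting it imitates that memory being reused.

```python
import numpy as np

# Hypothetical stand-in for the C memory that backs one variant's data.
record_buffer = np.array([10, 20, 30], dtype=np.int32)

arr = record_buffer[:]  # a view: shares memory with the buffer
cpy = np.array(arr)     # an independent copy, as recommended above

# Imitate the underlying memory being reused for different data.
record_buffer[:] = [1, 2, 3]

print(arr.tolist())  # the view reflects the overwritten buffer
print(cpy.tolist())  # the copy still holds the original values
```

The same reasoning applies whenever per-variant arrays are collected in a list for later use: store `np.array(...)` copies, not the attributes themselves.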
Example
=======
The example below demonstrates most of the commonly used parts of cyvcf2.
```Python
from cyvcf2 import VCF

vcf = VCF('some.vcf.gz')  # or VCF('some.bcf')
n_samples = len(vcf.samples)

for variant in vcf:
    variant.REF, variant.ALT  # e.g. REF='A', ALT=['C', 'T']

    variant.CHROM, variant.start, variant.end, variant.ID, \
        variant.FILTER, variant.QUAL

    # numpy arrays of specific things we pull from the sample fields.
    # gt_types is array of 0,1,2,3==HOM_REF, HET, UNKNOWN, HOM_ALT
    variant.gt_types, variant.gt_ref_depths, variant.gt_alt_depths  # numpy arrays
    variant.gt_phases, variant.gt_quals, variant.gt_bases  # numpy arrays

    ## INFO Field.
    ## extract from the info field by its name:
    variant.INFO.get('DP')  # int
    variant.INFO.get('FS')  # float
    variant.INFO.get('AC')  # float

    # convert back to a string.
    str(variant)

    ## sample info...

    # Get a numpy array of the depth per sample:
    dp = variant.format('DP')
    # or of any other format field:
    sb = variant.format('SB')
    assert sb.shape == (n_samples, 4)  # 4 values per sample

# to do a region-query:
vcf = VCF('some.vcf.gz')
for v in vcf('11:435345-556565'):
    if v.INFO["AF"] > 0.1: continue
    print(str(v))
```
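
As a small illustration of working with `gt_types`, the sketch below tallies genotype classes across variants using the default 0,1,2,3 encoding listed in the example. The helper `tally_genotypes` and the mock records are hypothetical and exist only so the snippet runs without a VCF file; with cyvcf2 installed, passing `VCF('some.vcf.gz')` instead of the mock list should work the same way, assuming `gt_types` holds small non-negative integers.

```python
from types import SimpleNamespace

import numpy as np

# Default gt_types encoding listed in the example above.
GT_NAMES = ("HOM_REF", "HET", "UNKNOWN", "HOM_ALT")

def tally_genotypes(variants):
    """Count genotype classes over all samples of all variants (hypothetical helper)."""
    totals = np.zeros(len(GT_NAMES), dtype=np.int64)
    for variant in variants:
        # bincount returns a new array, so no view of the variant's data is kept.
        totals += np.bincount(variant.gt_types, minlength=len(GT_NAMES))
    return dict(zip(GT_NAMES, totals.tolist()))

# Mock records standing in for cyvcf2 variants (made-up data).
mock_variants = [
    SimpleNamespace(gt_types=np.array([0, 1, 3])),
    SimpleNamespace(gt_types=np.array([1, 1, 2])),
]
counts = tally_genotypes(mock_variants)
print(counts)
```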
Installation
============
## pip
```
pip install cyvcf2
```
## github
```
git clone https://github.com/brentp/cyvcf2
cd cyvcf2
pip install --editable .
```
Testing
=======
Tests can be run with:
```
python setup.py test
```
CLI
=======
Run with `cyvcf2 path_to_vcf`
```
$ cyvcf2 --help
Usage: cyvcf2 [OPTIONS] <vcf_file> or -

  fast vcf parsing with cython + htslib

Options:
  -c, --chrom TEXT                Specify what chromosome to include.
  -s, --start INTEGER             Specify the start of region.
  -e, --end INTEGER               Specify the end of the region.
  --include TEXT                  Specify what info field to include.
  --exclude TEXT                  Specify what info field to exclude.
  --loglevel [DEBUG|INFO|WARNING|ERROR|CRITICAL]
                                  Set the level of log output.  [default:
                                  INFO]
  --silent                        Skip printing of vcf.
  --help                          Show this message and exit.
```
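
The `--chrom`, `--start` and `--end` options describe a region, which in the Python API is written as a string such as `'11:435345-556565'`. How the CLI combines these flags internally is not shown here; the sketch below is only one way to build such a region string in Python. Both `build_region` and `iter_region` are hypothetical helpers, not part of cyvcf2.

```python
def build_region(chrom, start=None, end=None):
    """Format 'chrom' or 'chrom:start-end' (hypothetical helper)."""
    if start is None and end is None:
        return str(chrom)
    if start is None or end is None:
        raise ValueError("give both start and end, or neither")
    if start > end:
        raise ValueError("start must not exceed end")
    return f"{chrom}:{start}-{end}"

def iter_region(vcf, chrom, start=None, end=None):
    """Yield records from a region; `vcf` is any object callable with a region string."""
    yield from vcf(build_region(chrom, start, end))

# With cyvcf2 installed and an indexed file available, usage might look like:
#   from cyvcf2 import VCF
#   for v in iter_region(VCF('some.vcf.gz'), '11', 435345, 556565):
#       print(str(v))
print(build_region("11", 435345, 556565))
```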
See Also
========
Pysam also [has a cython wrapper to htslib](https://github.com/pysam-developers/pysam/blob/master/pysam/cbcf.pyx), and one block of code here is taken directly from that library. However, the optimizations that we want for gemini are very specific, so we have chosen to create a separate project.
Performance
===========
For the performance comparison in the paper, we used [thousand genomes chromosome 22](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz). The full comparison runner is [here](https://github.com/brentp/cyvcf2/blob/master/scripts/compare.sh).