Python library for working with VCF files

vcforge

vcforge is a Python library for working with VCF files.

It provides a variety of tools and methods for importing, exploring, manipulating and analyzing Variant Call Format (VCF) data.

VCF parsing is handled by the cyvcf2 library (https://github.com/brentp/cyvcf2), which is based on the htslib library (https://github.com/samtools/htslib) written in C, resulting in very fast parsing.

vcforge utilizes pandas for producing tables and statistics.

The library is designed for exploratory analysis: it links sample information to the variant data.

Features

  • Import VCF data along with sample information
  • Automatically extract basic information about the variants and optionally assign variant IDs
  • Extract INFO fields for all variants
  • Extract selected FORMAT fields for all samples
  • Split the variant data based on sample information (e.g. breed or population)
  • Get variant statistics (n called, call rate, allele frequency, nucleotide diversity, variant types and subtypes)
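
For readers unfamiliar with the statistics listed above, here is a minimal sketch of two of them, call rate and alternate allele frequency, using the usual textbook definitions for diploid, biallelic variants. It is an illustration only, not vcforge's implementation; the function names and the 0/1/2/None genotype encoding are assumptions made for this example.

```python
# Illustrative only: textbook definitions, not vcforge's internal code.
# Each genotype is the number of ALT alleles (0, 1 or 2) carried by one
# diploid sample at a biallelic variant; None marks a missing call.

def call_rate(genotypes):
    """Fraction of samples with a non-missing genotype."""
    if not genotypes:
        return float("nan")
    called = [g for g in genotypes if g is not None]
    return len(called) / len(genotypes)


def alt_allele_frequency(genotypes):
    """ALT allele count divided by the number of called alleles (2 per sample)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return float("nan")
    return sum(called) / (2 * len(called))


genotypes = [0, 1, 2, None, 1]  # five samples, one missing call
n_called = sum(g is not None for g in genotypes)
print(n_called, call_rate(genotypes), alt_allele_frequency(genotypes))
```

Here four of the five samples are called, so the call rate is 0.8, and the four called samples carry 4 ALT alleles out of 8, giving an allele frequency of 0.5.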

How to install

Install with pip:

pip install vcforge

Dependencies

  • pandas
  • numpy
  • cyvcf2

Why vcforge?

I started this library out of frustration with the lack of Python VCF parsing libraries that let me explore VCF files in a readable, workable format such as pandas DataFrames, and connect sample-level information to the variants. For example, splitting the variants by sample group lets me explore them quickly, without relying on command-line tools whose output files then have to be inspected separately.

How to use

!!! Documentation is still a work in progress !!!

Import the library and read a VCF with its sample information

from vcforge import VCFClass

dataset = VCFClass(vcf_path='variants.vcf.gz', sample_info='samples.tsv', sample_id_column='sample', add_info=True, create_ids_if_none=True, threads=4)
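
The call above expects a sample information table alongside the VCF. The exact layout vcforge requires is not documented here, so the following is only a sketch of what a tab-separated `samples.tsv` might look like, under the assumption that it has one row per sample, an ID column named by `sample_id_column`, and any number of grouping columns. The sample names and the `population` values are hypothetical.

```python
import io

import pandas as pd

# Hypothetical contents of samples.tsv (assumed layout, not a vcforge spec):
# one row per sample, tab-separated, with an ID column called "sample" and
# an example grouping column called "population".
samples_tsv = (
    "sample\tpopulation\n"
    "S1\tPOP_1\n"
    "S2\tPOP_1\n"
    "S3\tPOP_2\n"
)

sample_info = pd.read_csv(io.StringIO(samples_tsv), sep="\t")

# Sample IDs should be unique so each row maps to exactly one VCF sample.
assert sample_info["sample"].is_unique

# Count how many samples fall into each group.
group_sizes = sample_info["population"].value_counts().to_dict()
print(group_sizes)
```

Checking the table with pandas before loading it can catch duplicated or misspelled sample IDs early.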

Show the variant information

var_info = dataset.variants

Split the dataset by sample column(s) and show the genotypes of all variants in one of the resulting populations as a pandas DataFrame

split_dataset_by_breed = dataset.split_by_sample_column(column='population')

split_dataset_by_breed['POP_1'].show_genotypes()
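
To make the idea of splitting by a sample column concrete, here is a small pandas-only sketch. It does not use or reproduce vcforge's internals; the genotype table, the sample names and the `split` dictionary are all hypothetical and only mimic the shape of the result, one genotype table per group.

```python
import pandas as pd

# Toy data, not produced by vcforge: rows are variants, columns are samples,
# values are ALT allele counts.
genotypes = pd.DataFrame(
    {"S1": [0, 1], "S2": [1, 2], "S3": [2, 2]},
    index=["var_1", "var_2"],
)
sample_info = pd.DataFrame(
    {
        "sample": ["S1", "S2", "S3"],
        "population": ["POP_1", "POP_1", "POP_2"],
    }
)

# Build one genotype table per population by selecting the columns that
# belong to the samples in that group.
split = {
    pop: genotypes[group["sample"].tolist()]
    for pop, group in sample_info.groupby("population")
}

print(split["POP_1"])
```

Each entry of `split` keeps all variants but only the samples of one population, which is why per-population statistics can then be computed independently.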

Calculate variant statistics for the selected population and add them to the variant information of its VCFClass object

split_dataset_by_breed['POP_1'].get_var_stats(add_to_info=True)

Keep only variants with a call rate of at least 0.9

var_info = split_dataset_by_breed['POP_1'].variants

selected_variants = var_info[var_info['CALL_RATE'] >= 0.9]
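
The comparison operator decides what happens to variants that sit exactly on the threshold. A minimal sketch on a toy table shows the difference between `>` and `>=`; the values and variant IDs are hypothetical, and only the `CALL_RATE` column name is taken from the example above.

```python
import pandas as pd

# Toy variant table with three variants and their call rates.
var_info = pd.DataFrame(
    {"CALL_RATE": [1.0, 0.9, 0.75]},
    index=["var_1", "var_2", "var_3"],
)

# Strict comparison drops the variant whose call rate is exactly 0.9.
strict = var_info[var_info["CALL_RATE"] > 0.9]

# Inclusive comparison removes only variants with call rate below 0.9.
inclusive = var_info[var_info["CALL_RATE"] >= 0.9]

print(list(strict.index), list(inclusive.index))
```

The index of the filtered table holds the variant IDs, which is what the save step passes as `var_ids`.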

Save the filtered subset as a VCF, including the variant IDs created earlier

split_dataset_by_breed['POP_1'].save_vcf(save_path='pop_1.vcf.gz', add_ids=True, var_ids=selected_variants.index, samples=None)
