vcfstats - powerful statistics for VCF files

Documentation | CHANGELOG

Motivation

Several tools can plot statistics of VCF files, including bcftools and jvarkit. However, none of them can:

  1. plot specific metrics
  2. customize the plots
  3. focus on variants with certain filters

The R package vcfR can do some of the above. However, it has to load the entire VCF into memory, which makes it impractical for large VCF files.

Installation

pip install -U vcfstats

Or run with docker:

docker run \
  -w /vcfstats/workdir \
  -v $(pwd):/vcfstats/workdir \
  --rm justold/vcfstats:latest \
  vcfstats \
  --vcf myfile.vcf \
  -o outputs \
  --formula 'COUNT(1) ~ CONTIG' \
  --title 'Number of variants on each chromosome'

Gallery

Number of variants on each chromosome

vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'COUNT(1) ~ CONTIG' \
    --title 'Number of variants on each chromosome' \
    --config examples/config.toml

Number of variants on each chromosome
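
To make the formula concrete, here is a minimal sketch in plain Python of the tally that COUNT(1) ~ CONTIG describes: one count per record, grouped by the CHROM column. It is not how vcfstats is implemented; the function name count_by_contig and the inline records are made up for illustration. The sketch reads one line at a time, so memory use does not grow with file size.

```python
from collections import Counter
from typing import Iterable


def count_by_contig(lines: Iterable[str]) -> Counter:
    """Tally VCF records per contig, reading one line at a time."""
    counts: Counter = Counter()
    for line in lines:
        # Skip blank lines, meta lines (##...) and the column header (#CHROM ...).
        if not line.strip() or line.startswith("#"):
            continue
        chrom = line.split("\t", 1)[0]
        counts[chrom] += 1
    return counts


# Hypothetical records, not taken from examples/sample.vcf.
demo_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\tPASS\t.",
    "1\t200\t.\tC\tT\t50\tPASS\t.",
    "2\t150\t.\tG\tA\t50\tq10\t.",
]
contig_counts = count_by_contig(demo_vcf)
print(dict(contig_counts))
```

Passing an open file handle instead of the list works the same way, since the function only iterates over lines.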

Changing labels and ticks

vcfstats uses plotnine for plotting. See the plotnine documentation for the expressions you can pass to --ggs to modify the plots.

vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'COUNT(1) ~ CONTIG' \
    --title 'Number of variants on each chromosome (modified)' \
    --config examples/config.toml \
    --ggs 'scale_x_discrete(name ="Chromosome", \
        limits=["1","2","3","4","5","6","7","8","9","10","X"]); \
        ylab("# Variants")'

Number of variants on each chromosome (modified)

Number of variants on the first 5 chromosomes

vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'COUNT(1) ~ CONTIG[1,2,3,4,5]' \
    --title 'Number of variants on each chromosome (first 5)' \
    --config examples/config.toml
# or
vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'COUNT(1) ~ CONTIG[1-5]' \
    --title 'Number of variants on each chromosome (first 5)' \
    --config examples/config.toml
# or, using regions (requires the VCF file to be tabix-indexed)
vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'COUNT(1) ~ CONTIG' \
    --title 'Number of variants on each chromosome (first 5)' \
    --config examples/config.toml -r 1 2 3 4 5

Number of variants on each chromosome (first 5)

Number of substitutions of SNPs

vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'COUNT(1, VARTYPE[snp]) ~ SUBST[A>T,A>G,A>C,T>A,T>G,T>C,G>A,G>T,G>C,C>A,C>T,C>G]' \
    --title 'Number of substitutions of SNPs' \
    --config examples/config.toml

Number of substitutions of SNPs
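
As a rough illustration of what this formula tallies, the sketch below counts REF>ALT substitutions and keeps only single-base SNPs. It is a simplified stand-in rather than vcfstats' own logic: the function name count_snp_substitutions and the sample pairs are hypothetical, and multi-allelic sites are simply skipped.

```python
from collections import Counter

BASES = {"A", "C", "G", "T"}


def count_snp_substitutions(records):
    """Count REF>ALT substitutions, keeping only single-base SNPs.

    `records` is an iterable of (ref, alt) pairs; a comma-separated ALT
    (multi-allelic site) is skipped here to keep the sketch simple.
    """
    counts = Counter()
    for ref, alt in records:
        if ref in BASES and alt in BASES and ref != alt:
            counts[f"{ref}>{alt}"] += 1
    return counts


# Hypothetical (REF, ALT) pairs: three SNPs, one insertion, one multi-allelic site.
demo_records = [("A", "G"), ("A", "G"), ("C", "T"), ("A", "AT"), ("G", "A,T")]
subst_counts = count_snp_substitutions(demo_records)
print(dict(subst_counts))
```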

Only SNPs that pass all filters

vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'COUNT(1, VARTYPE[snp]) ~ SUBST[A>T,A>G,A>C,T>A,T>G,T>C,G>A,G>T,G>C,C>A,C>T,C>G]' \
    --title 'Number of substitutions of SNPs (passed)' \
    --config examples/config.toml \
    --passed

Number of substitutions of SNPs (passed)

Alternative allele frequency on each chromosome

# using a dark theme
vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'AAF ~ CONTIG' \
    --title 'Allele frequency on each chromosome' \
    --config examples/config.toml --ggs 'theme_dark()'

Allele frequency on each chromosome

Using boxplot

vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'AAF ~ CONTIG' \
    --title 'Allele frequency on each chromosome (boxplot)' \
    --config examples/config.toml \
    --figtype boxplot

Allele frequency on each chromosome

Using a density plot or histogram to investigate the distribution

You can plot the distribution of allele frequencies as a density plot or a histogram.

vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'AAF ~ CONTIG[1,2]' \
    --title 'Allele frequency on chromosome 1,2' \
    --config examples/config.toml \
    --figtype density

Allele frequency on chromosome 1,2

Overall distribution of allele frequency

vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'AAF ~ 1' \
    --title 'Overall allele frequency distribution' \
    --config examples/config.toml

Overall allele frequency distribution

Excluding some low/high frequency variants

vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'AAF[0.05, 0.95] ~ 1' \
    --title 'Overall allele frequency distribution (0.05-0.95)' \
    --config examples/config.toml

Overall allele frequency distribution
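
The range filter can be pictured with a small Python sketch that computes an alternate allele frequency from genotype strings and keeps sites within 0.05 to 0.95. This is an assumption-laden illustration, not vcfstats' implementation: the helper names are hypothetical, the frequency is taken as non-reference alleles over called alleles, and both bounds are treated as inclusive.

```python
import re


def alt_allele_frequency(genotypes):
    """Fraction of called alleles that are non-reference.

    `genotypes` is a list of GT strings such as "0/1" or "1|1";
    missing alleles (".") are ignored. Returns None if nothing is called.
    """
    alleles = []
    for gt in genotypes:
        alleles.extend(a for a in re.split(r"[/|]", gt) if a != ".")
    if not alleles:
        return None
    return sum(a != "0" for a in alleles) / len(alleles)


def in_range(value, low, high):
    """Range check; treating both bounds as inclusive is an assumption."""
    return value is not None and low <= value <= high


# Hypothetical sites, each with genotypes for three samples.
sites = {
    "rare": ["0/0", "0/0", "0/0"],
    "common": ["0/1", "1/1", "0/0"],
    "fixed": ["1/1", "1|1", "./."],
}
aafs = {name: alt_allele_frequency(gts) for name, gts in sites.items()}
kept = [name for name, aaf in aafs.items() if in_range(aaf, 0.05, 0.95)]
print(aafs, kept)
```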

Counting types of variants on each chromosome

# equivalent shorthand: --formula 'VARTYPE ~ CHROM'
vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'COUNT(1, group=VARTYPE) ~ CHROM' \
    --title 'Types of variants on each chromosome' \
    --config examples/config.toml

Types of variants on each chromosome

Using a pie chart if there is only one chromosome

# equivalent shorthand: --formula 'VARTYPE ~ CHROM[1]'
vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'COUNT(1, group=VARTYPE) ~ CHROM[1]' \
    --title 'Types of variants on chromosome 1' \
    --config examples/config.toml \
    --figtype pie

Types of variants on chromosome 1

Counting variant types on whole genome

# equivalent shorthand: --formula 'VARTYPE ~ 1'
vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'COUNT(1, group=VARTYPE) ~ 1' \
    --title 'Types of variants on whole genome' \
    --config examples/config.toml

Types of variants on whole genome

Counting types of mutant genotypes (HET, HOM_ALT) for sample 1 on each chromosome

# equivalent shorthand: --formula 'GTTYPEs[HET,HOM_ALT]{0} ~ CHROM'
vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'COUNT(1, group=GTTYPEs[HET,HOM_ALT]{0}) ~ CHROM' \
    --title 'Mutant genotypes on each chromosome (sample 1)' \
    --config examples/config.toml

Mutant genotypes on each chromosome
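
For readers unfamiliar with genotype types, the following sketch shows one way to classify GT strings and count HET and HOM_ALT calls per chromosome for the first sample (index 0, matching {0} in the formula). The function genotype_type and the records are hypothetical, and treating any two different alleles (for example 1/2) as HET is an assumption of this sketch, not a statement about vcfstats.

```python
import re
from collections import Counter


def genotype_type(gt):
    """Classify a GT string as HOM_REF, HET, HOM_ALT or UNKNOWN."""
    alleles = re.split(r"[/|]", gt)
    if "." in alleles:
        return "UNKNOWN"
    if all(a == "0" for a in alleles):
        return "HOM_REF"
    if len(set(alleles)) == 1:
        return "HOM_ALT"
    # Assumption: any mix of different alleles counts as heterozygous.
    return "HET"


# Hypothetical (CHROM, [GT per sample]) records; index 0 is the first sample.
records = [
    ("1", ["0/1", "0/0"]),
    ("1", ["1/1", "0/1"]),
    ("2", ["0/0", "1/1"]),
    ("2", ["0|1", "./."]),
]
sample_index = 0
mutant_counts = Counter()
for chrom, gts in records:
    gt_type = genotype_type(gts[sample_index])
    if gt_type in {"HET", "HOM_ALT"}:
        mutant_counts[(chrom, gt_type)] += 1
print(dict(mutant_counts))
```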

Exploration of mean(genotype quality) and mean(depth) on each chromosome for sample 1

vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'MEAN(GQs{0}) ~ MEAN(DEPTHs{0}, group=CHROM)' \
    --title 'GQ vs depth (sample 1)' \
    --config examples/config.toml

GQ vs depth (sample 1)
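
One reading of this formula is that each chromosome contributes a single point: its mean genotype quality against its mean depth. The sketch below shows that aggregation in plain Python under that assumption; the function mean_by_chrom and the values are made up for illustration and are not taken from vcfstats or the example file.

```python
from collections import defaultdict
from statistics import mean


def mean_by_chrom(records):
    """Return {chrom: (mean_gq, mean_depth)} for (chrom, gq, depth) records."""
    grouped = defaultdict(lambda: ([], []))
    for chrom, gq, depth in records:
        grouped[chrom][0].append(gq)
        grouped[chrom][1].append(depth)
    return {c: (mean(gqs), mean(dps)) for c, (gqs, dps) in grouped.items()}


# Hypothetical per-variant (CHROM, GQ, DP) values for one sample.
records = [("1", 30, 10), ("1", 50, 20), ("2", 99, 40)]
points = mean_by_chrom(records)
print(points)
```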

Exploration of depths for samples 1 and 2

vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'DEPTHs{0} ~ DEPTHs{1}' \
    --title 'Depths between sample 1 and 2' \
    --config examples/config.toml

Depths between sample 1 and 2

See more examples:

https://github.com/pwwang/vcfstats/issues/15#issuecomment-1029367903
