Filter VCF/BCF files with Python expressions.

vembrane: variant filtering using python expressions

vembrane lets you filter variants simultaneously on any INFO field, CHROM, POS, REF, ALT, QUAL, and the annotation field ANN. When filtering on ANN, annotation entries are filtered first. If no annotation entry remains, the entire variant is removed.
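
The two-stage behaviour for ANN can be pictured with a small sketch. This is not vembrane's implementation; the function `filter_variant` and the annotation dictionaries are hypothetical and only illustrate "filter the entries first, then drop the variant if nothing is left":

```python
from typing import Callable, Dict, List, Optional

Annotation = Dict[str, str]


def filter_variant(
    annotations: List[Annotation],
    keep: Callable[[Annotation], bool],
) -> Optional[List[Annotation]]:
    """Return the surviving annotation entries, or None if the variant is dropped."""
    remaining = [ann for ann in annotations if keep(ann)]
    # No entry left: the whole variant goes away.
    return remaining if remaining else None


# Hypothetical annotation entries belonging to one variant
entries = [
    {"Gene_Name": "CDH2", "Annotation_Impact": "HIGH"},
    {"Gene_Name": "CDH2", "Annotation_Impact": "LOW"},
]

kept = filter_variant(entries, lambda ann: ann["Annotation_Impact"] == "HIGH")
dropped = filter_variant(entries, lambda ann: ann["Gene_Name"] == "TP53")
```

Here `kept` holds only the HIGH-impact entry, while `dropped` is None because no entry matched.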

Filter expression

The filter expression can be any valid Python expression that evaluates to bool. However, the available functions and symbols are restricted to the following:

  • all, any
  • abs, len, max, min, round, sum
  • enumerate, filter, iter, map, next, range, reversed, sorted, zip
  • dict, list, set, tuple
  • bool, chr, float, int, ord, str
  • Any function or symbol from math
  • Regular expressions via re
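
One way such a restriction could be realised in plain Python is to evaluate the expression against an explicit namespace with the builtins removed. The sketch below is an assumption about the general technique, not vembrane's actual code; `build_namespace` and `evaluate` are hypothetical names, and exposing the math symbols unqualified (e.g. `sqrt` rather than `math.sqrt`) is likewise an assumption:

```python
import math
import re

# The builtins named in the list above
_ALLOWED_BUILTINS = (
    all, any, abs, len, max, min, round, sum,
    enumerate, filter, iter, map, next, range, reversed, sorted, zip,
    dict, list, set, tuple, bool, chr, float, int, ord, str,
)


def build_namespace(fields):
    """Build the only names an expression may see (hypothetical helper)."""
    namespace = {func.__name__: func for func in _ALLOWED_BUILTINS}
    # Assumption: math symbols are available without the "math." prefix.
    namespace.update(
        {name: obj for name, obj in vars(math).items() if not name.startswith("_")}
    )
    namespace["re"] = re
    # Fields go into the same dict so generator expressions can see them too.
    namespace.update(fields)
    # An empty __builtins__ keeps names such as open() out of reach.
    namespace["__builtins__"] = {}
    return namespace


def evaluate(expression, fields):
    """Evaluate a filter expression against the given fields."""
    return bool(eval(expression, build_namespace(fields)))


ok = evaluate("QUAL >= 30 and sqrt(16) == 4", {"QUAL": 60.0})

try:
    evaluate('open("secret.txt")', {})
    blocked = False
except NameError:
    blocked = True
```

Note that removing builtins narrows what an expression can name but is not a security sandbox; filter expressions should still come from a trusted source.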

Available fields

The following VCF fields can be accessed in the filter expression:

Name     Type                        Interpretation               Example expression
INFO     Dict[str, Any¹]             INFO field -> Value          INFO["DP"] > 0
ANN      Dict[str, Any²]             ANN field -> Value           ANN["Gene_Name"] == "CDH2"
CHROM    str                         Chromosome name              CHROM == "chr2"
POS      int                         Chromosomal position         24 < POS < 42
ID       str                         Variant ID                   ID == "rs11725853"
REF      str                         Reference allele             REF == "A"
ALT      List[str]                   Alternative alleles          "C" in ALT or ALT[0] == "G"
QUAL     float                       Quality                      QUAL >= 60
FILTER
FORMAT   Dict[str, Dict[str, Any¹]]  Format -> (Sample -> Value)  FORMAT["DP"][SAMPLES[0]] > 0
SAMPLES  List[str]                   [Sample]                     "Tumor" in SAMPLES

¹ depends on type specified in VCF header

² for the usual snpeff and vep annotations, custom types have been specified; any unknown ANN field will simply be of type str. If something lacks a custom parser/type, please consider filing an issue in the issue tracker.
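
To make the shapes of these fields concrete, the following sketch evaluates the example expressions from the table against a hand-written record. The record is entirely made up for illustration (it is not read from a VCF file), and plain `eval` stands in for whatever vembrane does internally:

```python
# Hypothetical record mirroring the field types listed in the table
record = {
    "CHROM": "chr2",
    "POS": 30,
    "ID": "rs11725853",
    "REF": "A",
    "ALT": ["C"],
    "QUAL": 60.0,
    "INFO": {"DP": 12},
    "SAMPLES": ["Tumor", "Normal"],
    # FORMAT maps a format key to a per-sample dictionary
    "FORMAT": {"DP": {"Tumor": 7, "Normal": 5}},
}

expressions = [
    'INFO["DP"] > 0',
    'CHROM == "chr2"',
    "24 < POS < 42",
    'ID == "rs11725853"',
    'REF == "A"',
    '"C" in ALT or ALT[0] == "G"',
    "QUAL >= 60",
    'FORMAT["DP"][SAMPLES[0]] > 0',
    '"Tumor" in SAMPLES',
]

# Evaluate each expression with the record's fields as the only visible names
namespace = {"__builtins__": {}, **record}
results = {expr: bool(eval(expr, namespace)) for expr in expressions}
```

With this record every expression evaluates to True, so the variant would pass each of the example filters.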

Examples

  • Only keep annotations and variants where gene equals "CDH2" and its impact is "HIGH":
    vembrane 'ANN["Gene_Name"] == "CDH2" and ANN["Annotation_Impact"] == "HIGH"' variants.bcf
    
  • Only keep variants with quality at least 30:
    vembrane 'QUAL >= 30' variants.vcf
    
  • Only keep annotations and variants where feature (transcript) is ENST00000307301:
    vembrane 'ANN["Feature"] == "ENST00000307301"' variants.bcf
    
  • Only keep annotations and variants where protein position is less than 10:
    vembrane 'ANN["Protein"].start < 10' variants.bcf
    
  • Only keep variants where mapping quality is exactly 60:
    vembrane 'INFO["MQ"] == 60' variants.bcf
    
  • Only keep annotations and variants where consequence contains the word "stream" (matching "upstream" and "downstream"):
    vembrane 're.search("stream", ANN["Consequence"])' variants.vcf
    
  • Only keep annotations and variants where CLIN_SIG contains "pathogenic", "likely_pathogenic" or "drug_response":
    vembrane 'any(entry in ANN["CLIN_SIG"] for entry in ("pathogenic", "likely_pathogenic", "drug_response"))' variants.vcf
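
The last two examples rely on ordinary Python truthiness and membership tests. As a small illustration on made-up annotation values (the dictionary below is hypothetical, and CLIN_SIG is shown already split into a list as described under "Custom ANN types"), `re.search` returns a match object or None, and the `any(...)` expression checks list membership:

```python
import re

# Hypothetical annotation entry
ann = {
    "Consequence": "upstream_gene_variant",
    "CLIN_SIG": ["benign", "drug_response"],
}

# A match object is truthy, None is falsy
stream_hit = bool(re.search("stream", ann["Consequence"]))
no_hit = bool(re.search("stream", "missense_variant"))

# True as soon as one of the listed terms is among the CLIN_SIG entries
clin_hit = any(
    entry in ann["CLIN_SIG"]
    for entry in ("pathogenic", "likely_pathogenic", "drug_response")
)
```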
    

Custom ANN types

vembrane parses the following annotation fields to a custom type:

  • (snpeff) cDNA.pos / cDNA.length, CDS.pos / CDS.length and AA.pos / AA.length are re-exposed as cDNA, CDS and AA respectively, each with the properties start, end and length, e.g. ANN["cDNA"].start
  • (vep) cDNA_position, CDS_position and Protein_position are re-exposed as cDNA, CDS and Protein respectively, each with the properties start, end and length, e.g. ANN["cDNA"].start
  • CLIN_SIG is split at '&' into a list of entries

Any unknown annotation field will be left as is.
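
A rough sketch of what such parsing might look like is given below. The class `PosRange`, the helpers `parse_position` and `parse_clin_sig`, and the accepted input shapes ("pos/length", "start-end", or a single position) are assumptions made for illustration; setting end equal to start for a single position is also a choice of this sketch, not documented behaviour:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PosRange:
    """Hypothetical stand-in for the custom position type."""
    start: Optional[int]
    end: Optional[int]
    length: Optional[int]


def _to_int(text: str) -> Optional[int]:
    text = text.strip()
    return int(text) if text.isdigit() else None


def parse_position(raw: str) -> PosRange:
    """Parse 'pos/length', 'start-end' or 'pos' into a PosRange."""
    pos, _, length = raw.partition("/")
    start, dash, end = pos.partition("-")
    start_value = _to_int(start)
    # Assumption: a single position spans only itself.
    end_value = _to_int(end) if dash else start_value
    return PosRange(start_value, end_value, _to_int(length))


def parse_clin_sig(raw: str) -> List[str]:
    """Split a CLIN_SIG value at '&' into its entries."""
    return raw.split("&") if raw else []


cdna = parse_position("12/345")
protein = parse_position("7-9")
clin_sig = parse_clin_sig("pathogenic&drug_response")
```

With these definitions, `cdna.start < 20` and `"drug_response" in clin_sig` behave like the filter expressions shown in the examples.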

Missing values in annotations

If an annotation field lacks a value, it is replaced with the special value NA. Any comparison with NA evaluates to False; for example, ANN["cDNA"].start > 0 is always False if there was no value in the "cDNA.pos / cDNA.length" field of the (snpeff) ANN entry (otherwise the comparison is carried out with the usual semantics). One way to handle optional values is to check that the field is not None before using it, e.g. ID and "foo" in ID.
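
The comparison semantics described here can be mimicked with a small class whose comparison methods all return False. This is a sketch of the idea only; the class name `NoValue` and its details are assumptions, not vembrane's actual type:

```python
class NoValue:
    """Hypothetical NA marker: every comparison with it is False."""

    def __lt__(self, other):
        return False

    def __le__(self, other):
        return False

    def __gt__(self, other):
        return False

    def __ge__(self, other):
        return False

    def __eq__(self, other):
        return False

    def __ne__(self, other):
        return False

    def __bool__(self):
        return False

    def __repr__(self):
        return "NA"


NA = NoValue()

# Comparisons in either direction come out False
checks = [NA > 0, NA < 0, NA == 0, 0 < NA, NA == NA]

# The guard pattern from the text: a missing ID short-circuits the test
ID = None
guarded = bool(ID and "foo" in ID)
```

Because both `NA > 0` and `NA <= 0` are False, negating a comparison does not select the records with missing values; they have to be handled explicitly.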

Development

pre-commit hooks

Since we enforce code formatting with black by checking for it in CI, we can avoid "fmt" commits by ensuring formatting is done upon committing changes:

  1. Make sure pre-commit is installed on your machine / in your env (it should be available in pip, conda, archlinux repos, ...).
  2. Run pre-commit install. This will activate the pre-commit hooks in your local .git.

Now when calling git commit, your changed code will be formatted with black, checked with flake8, and have trailing whitespace removed and trailing newlines added (if needed).

Authors

  • Marcel Bargull (@mbargull)
  • Jan Forster (@jafors)
  • Till Hartmann (@tedil)
  • Johannes Köster (@johanneskoester)
  • Elias Kuthe (@eqt)
  • Felix Mölder (@felixmoelder)
  • Christopher Schröder (@christopher-schroeder)
