
Filter VCF/BCF files with Python expressions.


vembrane: variant filtering using Python expressions

vembrane lets you filter variants simultaneously on any INFO field, CHROM, POS, REF, ALT, QUAL, and the annotation field ANN. When filtering on ANN, annotation entries are filtered first. If no annotation entry remains, the entire variant is deleted.
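
The two-stage behaviour described above can be sketched in plain Python. This is a minimal illustration, not vembrane's implementation: the record layout (a dict with an `annotations` list) and the names `filter_record`, `predicate`, and `is_high_impact` are assumptions made for the example.

```python
from typing import Callable, Optional


def filter_record(
    record: dict, predicate: Callable[[dict, dict], bool]
) -> Optional[dict]:
    """Keep only annotation entries that pass; drop the record if none remain."""
    kept = [ann for ann in record["annotations"] if predicate(record, ann)]
    if not kept:
        return None  # no annotation entry survived, so the variant is dropped
    return {**record, "annotations": kept}


# Hypothetical record with two annotation entries.
record = {
    "CHROM": "chr2",
    "QUAL": 50.0,
    "annotations": [
        {"Gene_Name": "CDH2", "Annotation_Impact": "HIGH"},
        {"Gene_Name": "CDH2", "Annotation_Impact": "LOW"},
    ],
}


def is_high_impact(rec: dict, ann: dict) -> bool:
    # Combines a record-level field (QUAL) with an annotation-level field.
    return rec["QUAL"] >= 30 and ann["Annotation_Impact"] == "HIGH"


filtered = filter_record(record, is_high_impact)
dropped = filter_record(record, lambda rec, ann: ann["Gene_Name"] == "TP53")
```

Here `filtered` keeps the variant with only its HIGH-impact entry, while `dropped` is `None` because no entry matches.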

vembrane filter

Filter expression

The filter expression can be any valid Python expression that evaluates to bool. However, the available functions and symbols are restricted to the following:

  • all, any
  • abs, len, max, min, round, sum
  • enumerate, filter, iter, map, next, range, reversed, sorted, zip
  • dict, list, set, tuple
  • bool, chr, float, int, ord, str
  • Any function or symbol from math
  • Regular expressions via re
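
One common way to restrict what an expression can reach is to evaluate it against an explicit whitelist of names. The sketch below illustrates that idea only; it is not taken from vembrane's source. The helpers `build_environment` and `evaluate` are hypothetical, and exposing the `math` names directly (e.g. `log10` rather than `math.log10`) is an assumption.

```python
import builtins
import math
import re

# Names copied from the list above.
_ALLOWED_BUILTINS = (
    "all", "any", "abs", "len", "max", "min", "round", "sum",
    "enumerate", "filter", "iter", "map", "next", "range", "reversed",
    "sorted", "zip", "dict", "list", "set", "tuple",
    "bool", "chr", "float", "int", "ord", "str",
)


def build_environment() -> dict:
    """Collect the whitelisted names into a single namespace."""
    env = {name: getattr(builtins, name) for name in _ALLOWED_BUILTINS}
    # Assumption: public math names are exposed directly.
    env.update({k: v for k, v in vars(math).items() if not k.startswith("_")})
    env["re"] = re
    # An empty __builtins__ stops eval() from injecting the full builtins.
    env["__builtins__"] = {}
    return env


def evaluate(expression: str, fields: dict) -> bool:
    """Evaluate a filter expression against one record's fields."""
    env = build_environment()
    # Fields go into the same namespace so that generator expressions
    # inside the filter expression can see them too.
    env.update(fields)
    return bool(eval(expression, env))
```

With this setup, `evaluate('QUAL >= 30 and len(REF) == 1', {"QUAL": 42.0, "REF": "A"})` is true, while an expression calling `open(...)` fails with a `NameError`. A restricted namespace like this narrows what expressions can use but is not a security sandbox.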

Available fields

The following VCF fields can be accessed in the filter expression:

Name | Type | Interpretation | Example expression
INFO | Dict[str, Any¹] | INFO field -> Value | INFO["DP"] > 0
ANN | Dict[str, Any²] | ANN field -> Value | ANN["Gene_Name"] == "CDH2"
CHROM | str | Chromosome name | CHROM == "chr2"
POS | int | Chromosomal position | 24 < POS < 42
ID | str | Variant ID | ID == "rs11725853"
REF | str | Reference allele | REF == "A"
ALT | str | Alternative allele³ | ALT == "C"
QUAL | float | Quality | QUAL >= 60
FILTER | List[str] | Filter tags | "PASS" in FILTER
FORMAT | Dict[str, Dict[str, Any¹]] | Format -> (Sample -> Value) | FORMAT["DP"][SAMPLES[0]] > 0
SAMPLES | List[str] | [Sample] | "Tumor" in SAMPLES
INDEX | int | Index of variant in the file | INDEX < 10
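
To make the types in the table concrete, the following sketch mimics how the fields of one record might look as Python values. All values are invented for illustration, and using `eval` on a plain dict is only a stand-in for how vembrane exposes these names.

```python
# Hypothetical field values for a single record, typed as in the table.
fields = {
    "CHROM": "chr2",
    "POS": 30,
    "ID": "rs11725853",
    "REF": "A",
    "ALT": "C",
    "QUAL": 60.0,
    "FILTER": ["PASS"],
    "INFO": {"DP": 25},
    "SAMPLES": ["Tumor", "Normal"],
    # FORMAT maps a format key to a per-sample dict.
    "FORMAT": {"DP": {"Tumor": 14, "Normal": 11}},
    "INDEX": 0,
}

expression = 'FORMAT["DP"][SAMPLES[0]] > 0 and "PASS" in FILTER and 24 < POS < 42'
result = eval(expression, {"__builtins__": {}, **fields})
```

`FORMAT["DP"][SAMPLES[0]]` first selects the format key, then the sample, which is why the sample name comes second.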

¹ depends on type specified in VCF header

² for the usual snpeff and vep annotations, custom types have been specified; any unknown ANN field will simply be of type str. If something lacks a custom parser/type, please consider filing an issue in the issue tracker.

³ vembrane does not handle multi-allelic records itself. Instead, such files should be preprocessed by either of the following tools (preferably even before annotation):

Examples

  • Only keep annotations and variants where gene equals "CDH2" and its impact is "HIGH":
    vembrane filter 'ANN["Gene_Name"] == "CDH2" and ANN["Annotation_Impact"] == "HIGH"' variants.bcf
    
  • Only keep variants with quality at least 30:
    vembrane filter 'QUAL >= 30' variants.vcf
    
  • Only keep annotations and variants where feature (transcript) is ENST00000307301:
    vembrane filter 'ANN["Feature"] == "ENST00000307301"' variants.bcf
    
  • Only keep annotations and variants where protein position is less than 10:
    vembrane filter 'ANN["Protein"].start < 10' variants.bcf
    
  • Only keep variants where mapping quality is exactly 60:
    vembrane filter 'INFO["MQ"] == 60' variants.bcf
    
  • Only keep annotations and variants where consequence contains the word "stream" (matching "upstream" and "downstream"):
    vembrane filter 're.search("(up|down)stream", ANN["Consequence"])' variants.vcf
    
  • Only keep annotations and variants where CLIN_SIG contains "pathogenic", "likely_pathogenic" or "drug_response":
    vembrane filter 'any(entry in ANN["CLIN_SIG"] for entry in ("pathogenic", "likely_pathogenic", "drug_response"))' variants.vcf
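
The last two expressions can be tried on plain Python values to see what they match. The values below are hypothetical, and treating `ANN["CLIN_SIG"]` as a list of strings is an assumption made for this sketch.

```python
import re

# Hypothetical annotation values.
consequence = "upstream_gene_variant"
clin_sig = ["benign", "likely_pathogenic"]

# Matches "upstream" or "downstream" anywhere in the string.
matches_stream = bool(re.search("(up|down)stream", consequence))
no_match = bool(re.search("(up|down)stream", "intron_variant"))

# True as soon as one of the wanted terms is an element of clin_sig.
wanted = ("pathogenic", "likely_pathogenic", "drug_response")
has_relevant_clin_sig = any(entry in clin_sig for entry in wanted)

# With a list, `in` tests membership; with a plain string it tests substrings.
in_list = "pathogenic" in clin_sig
in_string = "pathogenic" in "likely_pathogenic"
```

Note the difference in the last two lines: membership in a list requires an exact element, whereas a substring test on a string would also match "pathogenic" inside "likely_pathogenic".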
    

Custom ANN types

vembrane parses entries in the annotation field as outlined in Types.md.

Missing values in annotations

If a certain annotation field lacks a value, it will be replaced with the special value of NA. Comparing with this value will always result in False, e.g. ANN["MOTIF_POS"] > 0 will always evaluate to False if there was no value in the "MOTIF_POS" field of ANN (otherwise the comparison will be carried out with the usual semantics).

Since you may want to use the regex module to search for matches, NA also acts as an empty str, such that re.search("nothing", NA) finds no match instead of raising an exception.
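
The semantics above can be approximated with a small str subclass. This is a sketch under assumptions, not vembrane's actual class: the name `NoValue` is hypothetical, and making `!=` return False as well is one reading of "comparing will always result in False".

```python
import re


class NoValue(str):
    """Missing-value marker: an empty str whose comparisons are all False."""

    def __new__(cls):
        return super().__new__(cls, "")

    def __lt__(self, other):
        return False

    def __le__(self, other):
        return False

    def __gt__(self, other):
        return False

    def __ge__(self, other):
        return False

    def __eq__(self, other):
        return False

    def __ne__(self, other):
        # Assumption: inequality is treated like every other comparison.
        return False

    # Defining __eq__ removes the inherited hash, so restore it.
    __hash__ = str.__hash__

    def __repr__(self):
        return "NA"


NA = NoValue()

greater = NA > 0                      # False
equal = NA == 0                       # False
match = re.search("nothing", NA)      # None, because NA behaves like ""
```

Because `==` is always False, an identity check (`value is NA`) is the way to detect the marker.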

Explicitly handling missing/optional values in INFO or FORMAT fields can be done by checking for NA, e.g.: INFO["DP"] is NA.

Handling missing/optional values in fields other than INFO or FORMAT can be done by checking for None, e.g. ID is not None.

vembrane table

In addition to the filter subcommand, vembrane (≥ 0.5) also supports writing tabular data with the table subcommand. In this case, an expression which evaluates to tuple is expected, for example:

vembrane table 'CHROM, POS, 10**(-QUAL/10), ANN["CLIN_SIG"]' variants.vcf > table.tsv
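
As a rough picture of what a tuple-valued expression produces, the sketch below evaluates one per record and writes each resulting tuple as a tab-separated row. The function `write_table` and the sample records are hypothetical; this is not how vembrane is implemented. The term `10**(-QUAL/10)` converts a Phred-scaled quality into an error probability.

```python
import csv
import io


def write_table(expression: str, records: list, out) -> None:
    """Evaluate a tuple-valued expression per record and write TSV rows."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for fields in records:
        # The expression sees the record's fields as plain names.
        row = eval(expression, {"__builtins__": {}, **fields})
        writer.writerow(row)


# Hypothetical records with only the fields the expression needs.
records = [
    {"CHROM": "chr1", "POS": 100, "QUAL": 30.0},
    {"CHROM": "chr2", "POS": 250, "QUAL": 20.0},
]

buffer = io.StringIO()
write_table("CHROM, POS, 10**(-QUAL/10)", records, buffer)
lines = buffer.getvalue().splitlines()
```

Each entry of `lines` holds three tab-separated columns: chromosome, position, and the error probability derived from QUAL.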

Development

pre-commit hooks

Since we enforce code formatting with black by checking for it in CI, we can avoid "fmt" commits by ensuring formatting is done upon committing changes:

  1. Make sure pre-commit is installed on your machine / in your env (it should be available via pip, conda, the Arch Linux repos, ...).
  2. Run pre-commit install. This installs the pre-commit hooks into your local .git directory.

Now, when you call git commit, your changed code will be formatted with black and checked with flake8, and trailing whitespace will be removed and trailing newlines added (if needed).

Authors

  • Marcel Bargull (@mbargull)
  • Jan Forster (@jafors)
  • Till Hartmann (@tedil)
  • Johannes Köster (@johanneskoester)
  • Elias Kuthe (@eqt)
  • Felix Mölder (@felixmoelder)
  • Christopher Schröder (@christopher-schroeder)
