Filter VCF/BCF files with Python expressions.
vembrane: variant filtering using python expressions
vembrane filters variants simultaneously on any INFO field, CHROM, POS, REF, ALT, QUAL, and the annotation field ANN. When filtering on ANN, annotation entries are filtered first. If no annotation entry remains, the entire variant is removed.
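The two-stage behaviour described above (filter annotation entries first, then drop the variant if none survive) can be sketched as follows. This is only an illustration: the `filter_record` helper and the plain-dict annotation layout are hypothetical and are not vembrane's internal types.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical layout: one dict per annotation entry of a single variant.
Annotation = Dict[str, str]


def filter_record(
    annotations: List[Annotation],
    keep: Callable[[Annotation], bool],
) -> Optional[List[Annotation]]:
    """Return the surviving annotation entries, or None if the variant is dropped."""
    remaining = [ann for ann in annotations if keep(ann)]
    # No annotation entry left means the whole variant is removed.
    return remaining if remaining else None


annotations = [
    {"Gene_Name": "CDH2", "Annotation_Impact": "HIGH"},
    {"Gene_Name": "CDH2", "Annotation_Impact": "LOW"},
]
kept = filter_record(annotations, lambda ann: ann["Annotation_Impact"] == "HIGH")
dropped = filter_record(annotations, lambda ann: ann["Gene_Name"] == "TP53")
```

Here `kept` holds only the HIGH-impact entry, while `dropped` is `None` because no entry matched.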
vembrane filter
Filter expression
The filter expression can be any valid Python expression that evaluates to bool. However, the available functions and symbols are restricted to the following:
- all, any
- abs, len, max, min, round, sum
- enumerate, filter, iter, map, next, range, reversed, sorted, zip
- dict, list, set, tuple
- bool, chr, float, int, ord, str
- Any function or symbol from math
- Regular expressions via re
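To give a feel for what such a restriction means in practice, here is a minimal sketch of evaluating an expression against a whitelist of names. It is an assumption-laden illustration, not vembrane's actual implementation: `ALLOWED_NAMES` (a subset of the list above), `make_globals`, and `evaluate` are hypothetical names.

```python
import builtins
import math
import re

# Assumed whitelist for illustration; only a subset of the names listed above.
ALLOWED_NAMES = ("all", "any", "abs", "len", "max", "min", "round", "sum",
                 "float", "int", "str")


def make_globals(fields):
    """Build the namespace an expression may see: whitelist, math, re, and fields."""
    env = {name: getattr(builtins, name) for name in ALLOWED_NAMES}
    env.update({"math": math, "re": re})
    env.update(fields)
    # An empty __builtins__ stops eval from adding the full builtin namespace.
    env["__builtins__"] = {}
    return env


def evaluate(expression, fields):
    """Evaluate a filter expression and coerce the result to bool."""
    return bool(eval(expression, make_globals(fields)))


record = {"QUAL": 45.0, "CHROM": "chr2", "POS": 30}
high_quality_chr2 = evaluate('QUAL >= 30 and CHROM == "chr2"', record)  # True
log_quality = evaluate("math.log10(QUAL) > 2", record)  # False
```

Names outside the whitelist, such as `open`, raise a `NameError` in this sketch. Restricting names this way limits what an expression can reference, but it is not a security sandbox.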
Available fields
The following VCF fields can be accessed in the filter expression:
Name | Type | Interpretation | Example expression |
---|---|---|---|
INFO | Dict[str, Any¹] | INFO field -> Value | INFO["DP"] > 0 |
ANN | Dict[str, Any²] | ANN field -> Value | ANN["Gene_Name"] == "CDH2" |
CHROM | str | Chromosome name | CHROM == "chr2" |
POS | int | Chromosomal position | 24 < POS < 42 |
ID | str | Variant ID | ID == "rs11725853" |
REF | str | Reference allele | REF == "A" |
ALT | str | Alternative allele³ | ALT == "C" |
QUAL | float | Quality | QUAL >= 60 |
FILTER | List[str] | Filter tags | "PASS" in FILTER |
FORMAT | Dict[str, Dict[str, Any¹]] | Format -> (Sample -> Value) | FORMAT["DP"][SAMPLES[0]] > 0 |
SAMPLES | List[str] | [Sample] | "Tumor" in SAMPLES |
INDEX | int | Index of variant in the file | INDEX < 10 |
¹ depends on type specified in VCF header
² for the usual SnpEff and VEP annotations, custom types have been specified; any unknown ANN field will simply be of type str. If something lacks a custom parser/type, please consider filing an issue in the issue tracker.
³ vembrane does not handle multi-allelic records itself. Instead, such files should be preprocessed with one of the following tools (preferably before annotation):
- bcftools norm -m-any […]
- gatk LeftAlignAndTrimVariants […] --split-multi-allelics
- vcfmulti2oneallele […]
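The nested layout of the fields in the table above, in particular FORMAT mapping a format key to a per-sample dict, can be mimicked with plain Python containers. The values below are made up for illustration; the names mirror the table.

```python
# Made-up values for a single record; the layout follows the table of fields.
SAMPLES = ["Normal", "Tumor"]
FORMAT = {"DP": {"Normal": 31, "Tumor": 12}}  # Format -> (Sample -> Value)
INFO = {"DP": 43, "MQ": 60}
FILTER = ["PASS"]

# Expressions of the kind listed in the "Example expression" column.
first_sample_has_depth = FORMAT["DP"][SAMPLES[0]] > 0
every_sample_deep = all(FORMAT["DP"][sample] >= 10 for sample in SAMPLES)
passes = "PASS" in FILTER and INFO["DP"] > 0
```

Indexing FORMAT by key first and sample second is what makes per-sample checks such as `every_sample_deep` a one-liner.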
Examples
- Only keep annotations and variants where gene equals "CDH2" and its impact is "HIGH":
vembrane filter 'ANN["Gene_Name"] == "CDH2" and ANN["Annotation_Impact"] == "HIGH"' variants.bcf
- Only keep variants with quality at least 30:
vembrane filter 'QUAL >= 30' variants.vcf
- Only keep annotations and variants where feature (transcript) is ENST00000307301:
vembrane filter 'ANN["Feature"] == "ENST00000307301"' variants.bcf
- Only keep annotations and variants where protein position is less than 10:
vembrane filter 'ANN["Protein"].start < 10' variants.bcf
- Only keep variants where mapping quality is exactly 60:
vembrane filter 'INFO["MQ"] == 60' variants.bcf
- Only keep annotations and variants where consequence contains "upstream" or "downstream":
vembrane filter 're.search("(up|down)stream", ANN["Consequence"])' variants.vcf
- Only keep annotations and variants where CLIN_SIG contains "pathogenic", "likely_pathogenic" or "drug_response":
vembrane filter 'any(entry in ANN["CLIN_SIG"] for entry in ("pathogenic", "likely_pathogenic", "drug_response"))' variants.vcf
Custom ANN types
vembrane parses entries in the annotation field as outlined in Types.md.
Missing values in annotations
If a certain annotation field lacks a value, it is replaced with the special value NA. Comparing with this value always results in False, e.g. ANN["MOTIF_POS"] > 0 always evaluates to False if there was no value in the "MOTIF_POS" field of ANN (otherwise the comparison is carried out with the usual semantics).
Since you may want to use the regex module to search for matches, NA also acts as an empty str, such that re.search("nothing", NA) returns None (no match) instead of raising an exception.
Explicitly handling missing/optional values in INFO or FORMAT fields can be done by checking for NA, e.g. INFO["DP"] is NA.
Handling missing/optional values in fields other than INFO or FORMAT can be done by checking for None, e.g. ID is not None.
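The NA semantics described above (every comparison is False, yet the value behaves like an empty str for regex searches) can be approximated with a small str subclass. This is a sketch under assumptions, not vembrane's own class: the name `NoValue` is hypothetical, and treating `!=` as False too is an assumption drawn from "always result in False".

```python
import re


class NoValue(str):
    """Illustrative stand-in for NA: an empty str whose comparisons are all False."""

    def __new__(cls):
        return super().__new__(cls, "")

    def _always_false(self, other):
        return False

    # Every rich comparison, including the reflected ones Python falls back to,
    # answers False. Treating != this way is an assumption of this sketch.
    __lt__ = __le__ = __gt__ = __ge__ = __eq__ = __ne__ = _always_false

    # Defining __eq__ clears the inherited hash, so restore it explicitly.
    __hash__ = str.__hash__

    def __repr__(self):
        return "NA"


NA = NoValue()

ANN = {"MOTIF_POS": NA}
INFO = {"DP": NA}

motif_check = ANN["MOTIF_POS"] > 0    # False, no exception
match = re.search("nothing", NA)      # None, since NA acts as an empty str
depth_missing = INFO["DP"] is NA      # identity check against the singleton
```

Using a single shared instance is what makes the `is NA` identity check work in this sketch.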
vembrane table
In addition to the filter subcommand, vembrane (≥ 0.5) also supports writing tabular data with the table subcommand. In this case, an expression which evaluates to tuple is expected, for example:
vembrane table 'CHROM, POS, 10**(-QUAL/10), ANN["CLIN_SIG"]' > table.tsv
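As a rough illustration of what a tuple-valued expression produces, the sketch below evaluates one such expression per record and writes tab-separated rows. The records and the `rows` helper are made up for illustration and this is not how vembrane itself is implemented. The term `10**(-QUAL/10)` converts a Phred-scaled quality into an error probability.

```python
import csv
import io

# Made-up records for illustration only.
records = [
    {"CHROM": "chr1", "POS": 1000, "QUAL": 30.0},
    {"CHROM": "chr2", "POS": 2500, "QUAL": 10.0},
]

expression = "CHROM, POS, 10**(-QUAL/10)"


def rows(expression, records):
    """Evaluate the tuple expression once per record, yielding one row each."""
    for record in records:
        # The record's fields are the only names visible to the expression.
        yield eval(expression, {"__builtins__": {}}, dict(record))


result = list(rows(expression, records))

# Write the rows as tab-separated values into an in-memory buffer.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(result)
table = buffer.getvalue()
```

Each element of the tuple becomes one column, so adding a column is just adding another comma-separated term to the expression.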
Development
pre-commit hooks
Since we enforce code formatting with black by checking for it in CI, we can avoid "fmt" commits by ensuring formatting is done upon committing changes:
- make sure pre-commit is installed on your machine / in your env (should be available in pip, conda, archlinux repos, ...)
- run pre-commit install. This will install the pre-commit hooks into your local .git
Now when calling git commit, your changed code will be formatted with black, checked with flake8, and have trailing whitespace removed and trailing newlines added (if needed).
Authors
- Marcel Bargull (@mbargull)
- Jan Forster (@jafors)
- Till Hartmann (@tedil)
- Johannes Köster (@johanneskoester)
- Elias Kuthe (@eqt)
- Felix Mölder (@felixmoelder)
- Christopher Schröder (@christopher-schroeder)