Filter VCF/BCF files with Python expressions.

vembrane: variant filtering using python expressions

vembrane lets you filter variants simultaneously on any INFO field, CHROM, POS, REF, ALT, QUAL, and the annotation field ANN. When filtering on ANN, annotation entries are filtered first. If no annotation entry remains, the entire variant is removed.
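The two-stage behaviour described above can be pictured with a small Python sketch. This is not vembrane's implementation: the `filter_record` helper and the dict-based record layout are hypothetical and only illustrate that annotation entries are filtered first and that a variant with no surviving entry is dropped.

```python
from typing import Callable, Dict, List, Optional

Annotation = Dict[str, str]


def filter_record(
    annotations: List[Annotation],
    keep: Callable[[Annotation], bool],
) -> Optional[List[Annotation]]:
    """Return the surviving annotation entries, or None if the variant is dropped."""
    remaining = [ann for ann in annotations if keep(ann)]
    # No annotation entry left: the whole variant is removed.
    return remaining or None


# Hypothetical record with two annotation entries.
record = [
    {"Gene_Name": "CDH2", "Annotation_Impact": "HIGH"},
    {"Gene_Name": "CDH2", "Annotation_Impact": "LOW"},
]
kept = filter_record(record, lambda ann: ann["Annotation_Impact"] == "HIGH")
dropped = filter_record(record, lambda ann: ann["Gene_Name"] == "TP53")
```

Here `kept` holds only the HIGH-impact entry, while `dropped` is `None` because no entry matched.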

Filter expression

The filter expression can be any valid python expression that evaluates to bool. However, the available functions and symbols are restricted to the following:

all, any
abs, len, max, min, round, sum
enumerate, filter, iter, map, next, range, reversed, sorted, zip
dict, list, set, tuple
bool, chr, float, int, ord, str
Any function or symbol from math
Regular expressions via re
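One common way to restrict an expression to a fixed set of names is to evaluate it with an explicit whitelist and an empty `__builtins__`. The sketch below illustrates that idea only; it is not taken from vembrane's source. The `ALLOWED` dict and the `evaluate` helper are hypothetical, only a subset of the names above is included, and exposing the math names without a `math.` prefix is an assumption.

```python
import math
import re

# Hypothetical whitelist covering a subset of the names listed above.
ALLOWED = {
    "all": all, "any": any, "abs": abs, "len": len, "max": max, "min": min,
    "round": round, "sum": sum, "sorted": sorted, "zip": zip,
    "bool": bool, "float": float, "int": int, "str": str,
    "re": re,
    # Assumption: math functions are exposed directly, e.g. sqrt(...).
    **{name: obj for name, obj in vars(math).items() if not name.startswith("_")},
}


def evaluate(expression: str, fields: dict) -> bool:
    """Evaluate a filter expression against the given record fields."""
    # An empty __builtins__ hides every builtin that is not whitelisted.
    env = {"__builtins__": {}, **ALLOWED, **fields}
    return bool(eval(expression, env))


passes = evaluate("QUAL >= 30 and len(REF) == 1", {"QUAL": 42.0, "REF": "A"})

try:
    evaluate('open("variants.vcf")', {})
    blocked = False
except NameError:
    # open is not whitelisted, so the name cannot be resolved.
    blocked = True
```

Note that restricting names this way narrows what an expression can reach but is not a security sandbox on its own.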

Available fields

The following VCF fields can be accessed in the filter expression:

Name	Type	Interpretation	Example expression
`INFO`	`Dict[str, Any¹]`	`INFO field -> Value`	`INFO["DP"] > 0`
`ANN`	`Dict[str, Any²]`	`ANN field -> Value`	`ANN["Gene_Name"] == "CDH2"`
`CHROM`	`str`	Chromosome Name	`CHROM == "chr2"`
`POS`	`int`	Chromosomal position	`24 < POS < 42`
`ID`	`str`	Variant ID	`ID == "rs11725853"`
`REF`	`str`	Reference allele	`REF == "A"`
`ALT`	`str`	Alternative allele³	`ALT == "C"`
`QUAL`	`float`	Quality	`QUAL >= 60`
`FILTER`
`FORMAT`	`Dict[str, Dict[str, Any¹]]`	`Format -> (Sample -> Value)`	`FORMAT["DP"][SAMPLES[0]] > 0`
`SAMPLES`	`List[str]`	`[Sample]`	`"Tumor" in SAMPLES`
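To make the types in the table concrete, the sketch below binds each field to a plain Python value and evaluates the example expressions. All values are made up for illustration (one record, one sample named "Tumor"); they do not come from a real file.

```python
# Hypothetical values for a single record with one sample.
CHROM, POS, ID, REF, ALT, QUAL = "chr2", 30, "rs11725853", "A", "C", 60.0
SAMPLES = ["Tumor"]
INFO = {"DP": 35}               # INFO field -> value
FORMAT = {"DP": {"Tumor": 17}}  # format key -> (sample -> value)

checks = [
    INFO["DP"] > 0,
    CHROM == "chr2",
    24 < POS < 42,
    ID == "rs11725853",
    REF == "A",
    ALT == "C",
    QUAL >= 60,
    FORMAT["DP"][SAMPLES[0]] > 0,
    "Tumor" in SAMPLES,
]
```

With these values every expression in `checks` is true.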

¹ depends on type specified in VCF header

² for the usual snpeff and vep annotations, custom types have been specified; any unknown ANN field will simply be of type str. If something lacks a custom parser/type, please consider filing an issue in the issue tracker.

³ vembrane does not handle multi-allelic records itself. Instead, such files should be preprocessed by either of the following tools (preferably even before annotation):

Examples

Only keep annotations and variants where gene equals "CDH2" and its impact is "HIGH":

```
vembrane 'ANN["Gene_Name"] == "CDH2" and ANN["Annotation_Impact"] == "HIGH"' variants.bcf
```

Only keep variants with quality at least 30:
```
vembrane 'QUAL >= 30' variants.vcf
```
Only keep annotations and variants where feature (transcript) is ENST00000307301:
```
vembrane 'ANN["Feature"] == "ENST00000307301"' variants.bcf
```
Only keep annotations and variants where protein position is less than 10:
```
vembrane 'ANN["Protein"].start < 10' variants.bcf
```
Only keep variants where mapping quality is exactly 60:
```
vembrane 'INFO["MQ"] == 60' variants.bcf
```
Only keep annotations and variants where consequence contains the word "stream" (matching "upstream" and "downstream"):
```
vembrane 're.search("stream", ANN["Consequence"])' variants.vcf
```
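The regular expression in this example behaves as in plain Python: `re.search` returns a match object when the pattern occurs anywhere in the string and `None` otherwise, so its result is used for its truthiness. A minimal illustration, with a made-up list of consequence strings:

```python
import re

# Hypothetical consequence values for three annotation entries.
consequences = [
    "upstream_gene_variant",
    "downstream_gene_variant",
    "missense_variant",
]

# Keep the entries in which "stream" occurs anywhere in the string.
matching = [c for c in consequences if re.search("stream", c)]
```

Only the upstream and downstream entries end up in `matching`.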

Only keep annotations and variants where CLIN_SIG contains "pathogenic", "likely_pathogenic" or "drug_response":

```
vembrane 'any(entry in ANN["CLIN_SIG"] for entry in ("pathogenic", "likely_pathogenic", "drug_response"))' variants.vcf
```

Custom ANN types

vembrane parses the following annotation fields to a custom type:

(snpeff) cDNA.pos / cDNA.length, CDS.pos / CDS.length and AA.pos / AA.length are re-exposed as cDNA, CDS and AA respectively, each with the properties start, end and length, e.g. ANN["cDNA"].start
(vep) cDNA_position, CDS_position and Protein_position are re-exposed as cDNA, CDS and Protein respectively, each with the properties start, end and length, e.g. ANN["cDNA"].start
CLIN_SIG is split at '&' into a list of entries

Any unknown annotation field will be left as is.
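The kind of parsing described above might look roughly like the following sketch. It is an illustration under stated assumptions, not vembrane's parser: the names `PosRange`, `parse_pos_length` and `split_clin_sig` are hypothetical, the accepted layouts (`pos/length` and `start-end/length`) are assumed, and setting `end` equal to `start` for a single position is a choice made here for simplicity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PosRange:  # hypothetical name
    start: int
    end: int
    length: Optional[int]


def parse_pos_length(value: str) -> PosRange:
    """Parse 'pos/length' or 'start-end/length' (assumed layouts)."""
    pos, _, length = value.partition("/")
    start, _, end = pos.partition("-")
    start_i = int(start)
    return PosRange(
        start=start_i,
        # Assumption: a single position is treated as a range of size one.
        end=int(end) if end else start_i,
        length=int(length) if length else None,
    )


def split_clin_sig(value: str) -> List[str]:
    """Split a CLIN_SIG value at '&' into a list of entries."""
    return value.split("&")


cdna = parse_pos_length("12/345")
clin_sig = split_clin_sig("pathogenic&drug_response")
```

With such a type, an expression like `cdna.start < 10` compares plain integers, and `"pathogenic" in clin_sig` is an ordinary list membership test.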

Missing values in annotations

If a certain annotation field lacks a value, it will be replaced with the special value of NA. Comparing with this value will always result in False, e.g. ANN["cDNA"].start > 0 will always evaluate to False if there was no value in the "cDNA.pos / cDNA.length" field of (snpeff) ANN (otherwise the comparison will be carried out with the usual semantics).

Explicitly handling missing/optional values in INFO or FORMAT fields can be done by checking for NA, e.g.: INFO["DP"] is NA.

Handling missing/optional values in fields other than INFO or FORMAT can be done by checking for None, e.g. ID is not None.
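A sentinel with the comparison semantics described above can be sketched in a few lines of Python. The class name `NoValue` is hypothetical and this is not vembrane's code; it only shows how an object can make every comparison evaluate to False while still being detectable with `is`.

```python
class NoValue:
    """Sketch of a missing-value sentinel: every comparison is False."""

    def _always_false(self, other: object) -> bool:
        return False

    # All rich comparisons share the same always-False implementation.
    __eq__ = __ne__ = __lt__ = __le__ = __gt__ = __ge__ = _always_false

    def __repr__(self) -> str:
        return "NA"


NA = NoValue()

# Hypothetical record in which the DP value is missing.
INFO = {"DP": NA}

depth_ok = INFO["DP"] > 0       # comparison with NA yields False
is_missing = INFO["DP"] is NA   # explicit check for a missing value
```

Because comparisons never succeed, a filter such as `INFO["DP"] > 0` silently excludes records with a missing value, while `INFO["DP"] is NA` selects them explicitly.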

Development

pre-commit hooks

Since we enforce code formatting with black by checking for it in CI, we can avoid "fmt" commits by ensuring formatting is done upon committing changes:

make sure pre-commit is installed on your machine / in your env (should be available in pip, conda, archlinux repos, ...)
run pre-commit install. This will activate pre-commit hooks in your local .git

Now when calling git commit, your changed code will be formatted with black, checked with flake8, get trailing whitespace removed and trailing newlines added (if needed).

Authors

Marcel Bargull (@mbargull)
Jan Forster (@jafors)
Till Hartmann (@tedil)
Johannes Köster (@johanneskoester)
Elias Kuthe (@eqt)
Felix Mölder (@felixmoelder)
Christopher Schröder (@christopher-schroeder)
