A set of variant file helper utilities built on top of bcftools and htslib.

bcf_extras

Extra variant file helper utilities built on top of bcftools and htslib.

License

bcf_extras is licensed under GPLv3. See the license for more information.

Dependencies

The package requires that bcftools and htslib are installed on your operating system.

For the str extra, TRtools is required as a pip dependency. This should be handled automatically upon installation, but I have encountered installation issues before (especially when copies of htslib interfere with one another). Feel free to file an issue to ask about this.

Installation

One can use pip to install either the base bcf-extras or bcf-extras[str], which has additional dependencies and adds some more utilities related to STR calling.

pip install bcf-extras
# or
pip install "bcf-extras[str]"  # quoted so shells such as zsh do not expand the brackets

What's Included (Base)

copy-compress-index

Creates .vcf.gz files with corresponding tabix indices from one or more VCFs, sorting the VCFs if necessary.

For example, the following would generate sample-1.vcf.gz and sample-1.vcf.gz.tbi:

bcf-extras copy-compress-index sample-1.vcf
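
To make the steps concrete, here is a rough Python sketch of the kind of bcftools/tabix commands that produce the same two files by hand. This is not the package's actual implementation; the helper name compress_index_commands is made up for illustration, and the sketch only builds the command lines rather than running them.

```python
from pathlib import Path
from typing import List


def compress_index_commands(vcf_path: str) -> List[List[str]]:
    """Build (but do not run) sort/compress/index commands for one VCF.

    Hypothetical helper for illustration only; bcf-extras may do this differently.
    """
    src = Path(vcf_path)
    out = src.with_name(src.name + ".gz")  # sample-1.vcf -> sample-1.vcf.gz
    return [
        # Sort the records and write bgzip-compressed VCF output
        ["bcftools", "sort", "-O", "z", "-o", str(out), str(src)],
        # Create the tabix index (.tbi) next to the compressed file
        ["tabix", "-p", "vcf", str(out)],
    ]


for cmd in compress_index_commands("sample-1.vcf"):
    print(" ".join(cmd))
```

Each command list could be passed to subprocess.run(cmd, check=True) on a system where bcftools and tabix are installed.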

add-header-lines

Adds header lines from a text file to a particular position in the VCF header. This is useful for, e.g., inserting missing ##contig lines into many VCFs at once (by combining this command with something like GNU parallel).

For the ##contig lines example, inserting the contents of tests/vcfs/new_lines.txt, we could run the following command on tests/vcfs/ahl.vcf, replacing the file with a new copy:

bcf-extras add-header-lines tests/vcfs/ahl.vcf tests/vcfs/new_lines.txt

There is also a flag, --tmp-dir, for specifying a temporary folder location into which header artifacts will be placed. This is especially useful when running jobs on clusters, which may have specific locations for temporary I/O.

Using GNU parallel, we can do multiple VCFs at once, e.g.:

parallel 'bcf-extras add-header-lines {} tests/vcfs/new_lines.txt --keep-old' ::: /path/to/my/vcfs/*.vcf

The --keep-old flag keeps a copy of each original VCF.
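
As a rough illustration of the header edit itself, the following Python sketch inserts extra lines into VCF text. It assumes the new lines go immediately before the #CHROM column-header line; the position the real command uses is not stated here, and insert_header_lines is a hypothetical name, not part of bcf-extras.

```python
from typing import List


def insert_header_lines(vcf_text: str, new_lines: List[str]) -> str:
    """Return vcf_text with new_lines inserted just before the #CHROM line.

    Assumption for illustration: the insertion point is the #CHROM line.
    """
    out = []
    inserted = False
    for line in vcf_text.splitlines():
        if line.startswith("#CHROM") and not inserted:
            out.extend(new_lines)  # meta lines must precede the column header
            inserted = True
        out.append(line)
    if not inserted:
        raise ValueError("no #CHROM line found; is this a VCF?")
    return "\n".join(out) + "\n"


vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\t.\tA\tT\t.\t.\t.\n"
)
patched = insert_header_lines(vcf, ["##contig=<ID=1>", "##contig=<ID=2>"])
print(patched, end="")
```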

arg-join

Some bioinformatics utilities take in comma-separated file lists rather than the more standard whitespace-separated lists that something like a glob (*.vcf.gz) generates.

This command can be run by itself, e.g.:

bcf-extras arg-join --sep ";" *.vcf
# Outputs e.g. sample1.vcf;sample2.vcf

It can also be embedded in another command, e.g. with mergeSTR, a tool for merging STR caller VCFs that takes a comma-separated list of files as input:

mergeSTR --vcfs $(bcf-extras arg-join *.vcf) --out my_merge

The default separator (specified via --sep) is ,.
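
The behaviour described above amounts to joining a list of paths with a separator. A minimal Python equivalent, useful when building such an argument from a script, might look like this (arg_join here is an illustrative function, not an import from bcf-extras):

```python
def arg_join(paths, sep=","):
    """Join file paths into a single separator-delimited argument string."""
    return sep.join(paths)


joined = arg_join(["sample1.vcf", "sample2.vcf"], sep=";")
print(joined)  # sample1.vcf;sample2.vcf
```

In a script, the list of paths would typically come from something like sorted(glob.glob("*.vcf")).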

filter-gff3

This command filters a GFF3 (or similarly formatted) file by various columns using regular expressions.

It prints the filtered lines to stdout, which can then be redirected to a file or piped to another process.

Currently, you can filter by the seqid, source, type, strand, and phase columns using Python-formatted regular expressions. For example, the following keeps only rows whose type is either gene or exon and stores them in a new file:

bcf-extras filter-gff3 --type '^(gene|exon)$' example.gff3 > example-genes-exons.gff3

For help, run the sub-command with no arguments:

bcf-extras filter-gff3
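
For readers who want to see the idea in code, here is a small Python sketch of column-wise regex filtering over GFF3 lines. It is not the tool's source: the function name, the use of re.search, and the choice to pass comment/directive lines through unchanged are all assumptions made for the example.

```python
import re

# The nine tab-separated GFF3 columns, in order
GFF3_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def filter_gff3_lines(lines, patterns):
    """Yield lines whose columns match every regex in patterns.

    patterns maps a column name (e.g. "type") to a regular expression.
    Lines starting with '#' are yielded unchanged (an assumption).
    """
    compiled = {GFF3_COLUMNS.index(col): re.compile(rx) for col, rx in patterns.items()}
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != len(GFF3_COLUMNS):
            continue  # skip malformed rows
        if all(rx.search(fields[i]) for i, rx in compiled.items()):
            yield line


example = [
    "##gff-version 3",
    "chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=g1",
    "chr1\tsrc\tmRNA\t1\t100\t.\t+\t.\tID=m1",
    "chr1\tsrc\texon\t1\t50\t.\t+\t.\tID=e1",
]
kept = list(filter_gff3_lines(example, {"type": r"^(gene|exon)$"}))
for row in kept:
    print(row)
```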

reformat-vcf-contigs

TODO

What's Included (STR)

parallel-mergeSTR

mergeSTR is a tool by the Gymrek lab used to merge STR call VCFs. It proceeds linearly over a list of files, so it cannot easily take advantage of multiple cores. This utility merges VCFs in a tree fashion to produce a final merged result, and is handy when merging hundreds of STR call VCFs at once.

bcf-extras parallel-mergeSTR *.vcf.gz --out my_merge --ntasks 10

In a dataset of 148 single-sample gangSTR call VCFs, merging with parallel-mergeSTR on 10 cores resulted in a 60% reduction in runtime (roughly a 2.5x speedup) versus running on a single core (~2 hours versus ~5 hours).

Speedup is not linear with number of cores used, so it only makes sense to use this if turnaround time is important and resources are available.
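
To put numbers on that trade-off, the approximate timings quoted above (~5 hours on one core, ~2 hours on 10 cores) can be turned into a few standard metrics. The variable names below are just for this calculation:

```python
serial_hours = 5.0    # approximate single-core runtime quoted above
parallel_hours = 2.0  # approximate 10-core runtime quoted above
cores = 10

time_saved = 1 - parallel_hours / serial_hours  # fraction of wall time saved
speedup = serial_hours / parallel_hours         # how many times faster
efficiency = speedup / cores                    # speedup per core used

print(f"time saved: {time_saved:.0%}")
print(f"speedup:    {speedup:.1f}x")
print(f"efficiency: {efficiency:.0%}")
```

In other words, ten cores buy about 2.5 times the throughput, so each core is only about a quarter as productive as in the single-core run.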

To avoid over-allocating resources on a cluster, the process can be split into a parallelized first step and a second step that uses only one core:

# Runs on multiple cores
bcf-extras parallel-mergeSTR *.vcf.gz --ntasks 10 --out my_merge --step1-only

# Intermediate files generated by the first step will feed into the second step.

# Bottlenecked single process step; the ntasks argument is still needed to 
# calculate the names of the intermediate output files (but sub-processes are 
# not spawned) - thus, the provided value must match the value above.
bcf-extras parallel-mergeSTR --ntasks 10 --out my_merge --step2-only
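
The two-step structure can be pictured with a toy Python sketch. This is only one plausible reading of the description above, not the package's code: the round-robin grouping, the function names, and the stand-in merge function (which just records what was combined instead of calling mergeSTR) are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_groups(items, n_groups):
    """Distribute items round-robin into at most n_groups non-empty groups."""
    n_groups = min(n_groups, len(items))
    return [items[i::n_groups] for i in range(n_groups)]


def merge(files):
    """Stand-in for a real merge: returns a label describing the inputs."""
    return "merge(" + ",".join(files) + ")"


def step1(files, ntasks):
    """Parallel step: merge each group independently, one worker per group."""
    groups = split_into_groups(files, ntasks)
    with ThreadPoolExecutor(max_workers=ntasks) as pool:
        return list(pool.map(merge, groups))


def step2(intermediates):
    """Single-process step: merge the intermediate results into one."""
    return merge(intermediates)


files = [f"s{i}.vcf.gz" for i in range(5)]
intermediates = step1(files, ntasks=2)
final = step2(intermediates)
print(intermediates)
print(final)
```

The sketch also shows why the second step needs the same ntasks value: the number of intermediate results it has to pick up is determined by how the first step grouped the inputs.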
