# bcf_extras

Extra variant file helper utilities built on top of `bcftools` and `htslib`.
## License

`bcf_extras` is licensed under GPLv3. See the license for more information.
## Dependencies

The package requires that `bcftools` and `htslib` are installed on your
operating system.

For the `str` extra, TRTools is required as a pip dependency. This should be
handled automatically upon installation, but I have encountered installation
issues before (especially when copies of `htslib` interfere with one another).
Feel free to file an issue to ask about this.
## Installation

One can use pip to install either the base `bcf-extras` package or
`bcf-extras[str]`, which has additional dependencies and adds some more
utilities related to STR calling:

```bash
pip install bcf-extras
# or (quoted, since some shells treat the brackets specially)
pip install "bcf-extras[str]"
```
## What's Included (Base)

### copy-compress-index

Creates `.vcf.gz` files with corresponding tabix indices from one or more
VCFs, sorting the VCFs if necessary.

For example, the following would generate `sample-1.vcf.gz` and
`sample-1.vcf.gz.tbi`:

```bash
bcf-extras copy-compress-index sample-1.vcf
```
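Conceptually, this amounts to sorting each VCF, bgzip-compressing it, and tabix-indexing the result. A minimal sketch of that pipeline follows; the `compress_index_commands` helper and the exact `bcftools`/`tabix` flags are illustrative, not the package's actual implementation:

```python
# Sketch: build the commands that a sort + compress + index pipeline would run
# for one VCF. A real implementation would pass each list to subprocess.run().

def compress_index_commands(vcf_path: str) -> list[list[str]]:
    gz_path = vcf_path + ".gz"
    return [
        # bcftools sort with -O z writes sorted, bgzip-compressed output
        ["bcftools", "sort", "-O", "z", "-o", gz_path, vcf_path],
        # tabix builds the .tbi index alongside the compressed file
        ["tabix", "-p", "vcf", gz_path],
    ]

cmds = compress_index_commands("sample-1.vcf")
```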
### add-header-lines

Adds header lines from a text file to a particular position in the VCF header.
Useful for e.g. inserting missing `##contig` lines into a bunch of VCFs at
once (taking advantage of this plus something like GNU parallel).

For the `##contig` lines example, inserting the contents of
`tests/vcfs/new_lines.txt`, we could run the following command on
`tests/vcfs/ahl.vcf`, replacing the file with a new copy:

```bash
bcf-extras add-header-lines tests/vcfs/ahl.vcf tests/vcfs/new_lines.txt
```
There is also a flag, `--tmp-dir`, for specifying a temporary folder location
into which header artifacts will be placed. This is especially useful when
running jobs on clusters, which may have specific locations for temporary I/O.

Using GNU parallel, we can process multiple VCFs at once, e.g.:

```bash
parallel 'bcf-extras add-header-lines {} tests/vcfs/new_lines.txt --keep-old' ::: /path/to/my/vcfs/*.vcf
```

The `--keep-old` flag keeps a copy of each original VCF.
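The core idea, splicing new lines into a VCF header at a given position, can be sketched in a few lines of Python. The function below inserts just before the `#CHROM` column header; the actual tool's positioning logic may differ:

```python
# Sketch: insert extra header lines (e.g. ##contig entries) just before the
# #CHROM column header of a VCF. Illustrative only, not bcf-extras' code.

def add_header_lines(vcf_lines: list[str], new_lines: list[str]) -> list[str]:
    out = []
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            out.extend(new_lines)  # splice the new header lines in here
        out.append(line)
    return out

vcf = ["##fileformat=VCFv4.2", "#CHROM\tPOS\tID", "chr1\t100\t."]
merged = add_header_lines(vcf, ["##contig=<ID=chr1>"])
```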
### arg-join

Some bioinformatics utilities take comma-separated file lists rather than the
more standard whitespace-separated lists that something like a glob
(`*.vcf.gz`) generates.

This command can be run by itself, e.g.:

```bash
bcf-extras arg-join --sep ";" *.vcf
# Outputs e.g. sample1.vcf;sample2.vcf
```

It can also be embedded in another command, e.g. with `mergeSTR`, a tool for
merging STR caller VCFs which takes as input a comma-separated list of files:

```bash
mergeSTR --vcfs $(bcf-extras arg-join *.vcf) --out my_merge
```

The default separator (specified via `--sep`) is `,`.
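The behaviour itself is simple: the shell expands the glob into separate arguments, and the tool joins them with the chosen separator. A minimal sketch:

```python
# Sketch of arg-join's behaviour: join the (shell-expanded) file arguments
# with a custom separator, defaulting to a comma.

def arg_join(args: list[str], sep: str = ",") -> str:
    return sep.join(args)

print(arg_join(["sample1.vcf", "sample2.vcf"], sep=";"))  # sample1.vcf;sample2.vcf
```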
### filter-gff3

This command filters a GFF3 (or similarly formatted) file by various columns
using regular expressions. It prints the filtered lines to `stdout`, which can
then be redirected to a file or piped to another process.

Currently, you can filter by the `seqid`, `source`, `type`, `strand`, and
`phase` columns using Python-formatted regular expressions, e.g. the
following, which filters `type` to be either `gene` or `exon` and stores the
result in a new file:

```bash
bcf-extras filter-gff3 --type '^(gene|exon)$' example.gff3 > example-genes-exons.gff3
```

For help, run the sub-command with no arguments:

```bash
bcf-extras filter-gff3
```
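This kind of column filtering can be sketched as follows (column names and order per the GFF3 format; the `filter_gff3` function is illustrative, not the package's actual code):

```python
import re

# The nine GFF3 columns, in order.
GFF3_COLS = ["seqid", "source", "type", "start", "end", "score",
             "strand", "phase", "attributes"]

def filter_gff3(lines, **patterns):
    """Yield lines whose columns match every given regex; comments pass through."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = dict(zip(GFF3_COLS, line.rstrip("\n").split("\t")))
        if all(re.search(p, fields[col]) for col, p in patterns.items()):
            yield line

rows = [
    "chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=g1",
    "chr1\tsrc\tmRNA\t1\t100\t.\t+\t.\tID=m1",
]
kept = list(filter_gff3(rows, type="^(gene|exon)$"))  # keeps only the gene row
```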
### reformat-vcf-contigs

TODO
## What's Included (STR)

### parallel-mergeSTR

`mergeSTR` is a tool by the Gymrek lab used to merge STR call VCFs. It
proceeds linearly over a list of files, which cannot easily take advantage of
multiple cores. This utility merges VCFs in a tree fashion to produce a final
merged result, and is handy when merging hundreds of STR call VCFs at once.

```bash
bcf-extras parallel-mergeSTR *.vcf.gz --out my_merge --ntasks 10
```
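The tree-fashion merge described above can be sketched as repeatedly merging pairs of inputs until one file remains; the pair merges within each level are independent, which is what makes them parallelizable. Here `merge_pair` is a hypothetical stand-in for an actual `mergeSTR` invocation:

```python
# Sketch of a tree-fashion merge: pair up inputs at each level, merge each
# pair, and repeat until a single merged output remains.

def merge_pair(a: str, b: str) -> str:
    # Placeholder for running mergeSTR on two inputs and returning the
    # path of the merged output.
    return f"merge({a},{b})"

def tree_merge(paths: list[str]) -> str:
    level = list(paths)
    while len(level) > 1:
        pairs = [level[i:i + 2] for i in range(0, len(level), 2)]
        # Each pair merge is independent, so a real implementation can farm
        # them out to --ntasks worker processes.
        level = [merge_pair(*p) if len(p) == 2 else p[0] for p in pairs]
    return level[0]

result = tree_merge(["a.vcf", "b.vcf", "c.vcf", "d.vcf"])
```

With four inputs this does two independent merges at the first level, then one final merge, rather than three strictly sequential merges.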
In a dataset of 148 single-sample GangSTR call VCFs, merging with
`parallel-mergeSTR` on 10 cores reduced runtime by roughly 60% versus running
on a single core (~2 hours versus ~5 hours).

Speedup is not linear in the number of cores used, so it only makes sense to
use this if turnaround time is important and resources are available.
To avoid over-allocating resources on a cluster, the process can be split into
a parallelized first step followed by a single-core second step:

```bash
# Runs on multiple cores
bcf-extras parallel-mergeSTR *.vcf.gz --ntasks 10 --out my_merge --step1-only

# Intermediate files generated by the first step feed into the second step.
# Bottlenecked single-process step; the --ntasks argument is still needed to
# calculate the names of the intermediate output files (but sub-processes are
# not spawned), so the provided value must match the one above.
bcf-extras parallel-mergeSTR --ntasks 10 --out my_merge --step2-only
```