bcf_extras

Extra variant file helper utilities built on top of bcftools and htslib.

License

bcf_extras is licensed under GPLv3. See the license for more information.

Dependencies

The package requires that bcftools and htslib are installed on your operating system.

For the str extra, TRtools is required as a pip dependency. This should be handled automatically upon installation, but I have encountered installation issues before (especially when copies of htslib interfere with one another). Feel free to file an issue to ask about this.

Installation

You can use pip to install either the base bcf-extras package or bcf-extras[str], which has additional dependencies and adds utilities related to STR calling.

pip install bcf-extras
# or
pip install bcf-extras[str]
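
Note that some shells (e.g. zsh) interpret square brackets as glob characters, so the extras form may need quoting: pip install 'bcf-extras[str]'.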

What's Included (Base)

copy-compress-index

Creates .vcf.gz files with corresponding tabix indices from one or more VCFs, sorting the VCFs first if necessary.

For example, the following would generate sample-1.vcf.gz and sample-1.vcf.gz.tbi:

bcf-extras copy-compress-index sample-1.vcf
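
Under the hood, this is roughly equivalent to the following bcftools/htslib commands (a sketch; the actual implementation may differ, e.g. by sorting only when needed):

bcftools sort -O z -o sample-1.vcf.gz sample-1.vcf
tabix -p vcf sample-1.vcf.gz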

add-header-lines

Adds header lines from a text file to a particular position in the VCF header. Useful for, e.g., inserting missing ##contig lines into many VCFs at once (in combination with something like GNU parallel).

For the ##contig example, the following command inserts the contents of tests/vcfs/new_lines.txt into tests/vcfs/ahl.vcf, replacing the file with a new copy:

bcf-extras add-header-lines tests/vcfs/ahl.vcf tests/vcfs/new_lines.txt
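
For illustration, such a header-line file might contain standard ##contig definitions like the following (hypothetical contents; the actual tests/vcfs/new_lines.txt may differ):

##contig=<ID=chr1,length=248956422>
##contig=<ID=chr2,length=242193529>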

There is also a flag, --tmp-dir, for specifying a temporary folder location into which header artifacts will be placed. This is especially useful when running jobs on clusters, which may have specific locations for temporary I/O.
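
For example, on a SLURM cluster that exports a per-job scratch directory (the $SLURM_TMPDIR variable here is an assumption; substitute your cluster's equivalent):

bcf-extras add-header-lines tests/vcfs/ahl.vcf tests/vcfs/new_lines.txt --tmp-dir "$SLURM_TMPDIR"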

Using GNU parallel, we can process multiple VCFs at once, e.g.:

parallel 'bcf-extras add-header-lines {} tests/vcfs/new_lines.txt --keep-old' ::: /path/to/my/vcfs/*.vcf

The --keep-old flag keeps a copy of each original VCF.

arg-join

Some bioinformatics utilities take in comma-separated file lists rather than the more standard whitespace-separated lists that something like a glob (*.vcf.gz) generates.

This command can be run by itself, e.g.:

bcf-extras arg-join --sep ";" *.vcf
# Outputs e.g. sample1.vcf;sample2.vcf

It can also be embedded in another command, e.g. with mergeSTR, a tool for merging STR caller VCFs that takes a comma-separated list of files as input:

mergeSTR --vcfs $(bcf-extras arg-join *.vcf) --out my_merge

The default separator (specified via --sep) is ,.

filter-gff3

This command filters a GFF3 (or similarly formatted) file by various columns using regular expressions.

It prints the filtered lines to stdout, which can then be redirected to a file or piped to another process.

Currently, you can filter by the seqid, source, type, strand, and phase columns using Python-formatted regular expressions, e.g. the following, which filters type to be either gene or exon and writes the result to a new file:

bcf-extras filter-gff3 --type '^(gene|exon)$' example.gff3 > example-genes-exons.gff3
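
Filters on different columns can be combined. Assuming each of the columns above maps to a flag of the same name (check the help output described below to confirm), something like the following would keep only gene records on chr1:

bcf-extras filter-gff3 --seqid '^chr1$' --type '^gene$' example.gff3 > example-chr1-genes.gff3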

For help, run the sub-command with no arguments:

bcf-extras filter-gff3

reformat-vcf-contigs

TODO

What's Included (STR)

parallel-mergeSTR

mergeSTR is a tool by the Gymrek lab used to merge STR call VCFs. It proceeds linearly over a list of files, so it cannot easily take advantage of multiple cores. This utility instead merges VCFs in a tree fashion to produce the final merged result, which is handy when merging hundreds of STR call VCFs at once.

bcf-extras parallel-mergeSTR *.vcf.gz --out my_merge --ntasks 10
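
Conceptually, with --ntasks 2 and four input files, this resembles the following manual process (a sketch with hypothetical batch file names, not the tool's exact intermediate naming):

# Step 1: merge batches in parallel
mergeSTR --vcfs sample1.vcf.gz,sample2.vcf.gz --out batch0 &
mergeSTR --vcfs sample3.vcf.gz,sample4.vcf.gz --out batch1 &
wait

# Step 2: compress and index the intermediates, then merge them into the final result
bcf-extras copy-compress-index batch0.vcf batch1.vcf
mergeSTR --vcfs batch0.vcf.gz,batch1.vcf.gz --out my_merge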

In a dataset of 148 single-sample gangSTR call VCFs, merging with parallel-mergeSTR on 10 cores resulted in a 60% speedup versus running on a single core (~2 hours versus ~5 hours).

Speedup is not linear in the number of cores used, so this only makes sense when turnaround time is important and resources are available.

To avoid over-allocating resources on a cluster, the process can be split into a parallelized first step followed by a second step that uses only one core:

# Runs on multiple cores
bcf-extras parallel-mergeSTR *.vcf.gz --ntasks 10 --out my_merge --step1-only

# Intermediate files generated by the first step will feed into the second step.

# Bottlenecked single-process step; the --ntasks argument is still needed to
# calculate the names of the intermediate output files (no sub-processes are
# spawned), so the provided value must match the value used above.
bcf-extras parallel-mergeSTR --ntasks 10 --out my_merge --step2-only
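
On a SLURM cluster, for example, the two steps might be submitted as separate jobs with matching --ntasks values but different core allocations (a hypothetical sketch):

# Step 1 on 10 cores; --parsable makes sbatch print the job ID
jid=$(sbatch --parsable --cpus-per-task=10 --wrap="bcf-extras parallel-mergeSTR *.vcf.gz --ntasks 10 --out my_merge --step1-only")

# Step 2 on 1 core, run only after step 1 finishes successfully
sbatch --dependency=afterok:$jid --cpus-per-task=1 --wrap="bcf-extras parallel-mergeSTR --ntasks 10 --out my_merge --step2-only"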
