Tools to parallelize operations on large BAM files
Parallelizing operations on SAM/BAM files
SAM/BAM files are typically large, so operations on these files are time-intensive. This project provides tools to parallelize operations on SAM/BAM files. The workflow, sketched in code after the list, is as follows:
- Split the SAM/BAM file into n chunks
- Perform the operation on each chunk in a dedicated process and save the resulting SAM/BAM chunk
- Merge the results back into a single SAM/BAM file
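The sketch below illustrates this split/process/merge pattern with Python's multiprocessing module and samtools. It is a conceptual outline only, not the package's actual implementation (which delegates splitting and merging to bundled bash scripts), and the helper names and chunk paths are hypothetical:

```python
import subprocess
from multiprocessing import Pool

def process_chunk(paths):
    in_bam, out_bam = paths
    # Placeholder per-chunk operation: copy the chunk with samtools view
    subprocess.run(['samtools', 'view', '-b', '-o', out_bam, in_bam],
                   check=True)

def run_on_chunks(chunk_pairs, merged_bam):
    # Run the operation on each (input, output) chunk pair in a
    # dedicated process
    with Pool(len(chunk_pairs)) as pool:
        pool.map(process_chunk, chunk_pairs)
    # Merge the processed chunks back into a single BAM (-f overwrites)
    processed = [out for _, out in chunk_pairs]
    subprocess.run(['samtools', 'merge', '-f', merged_bam, *processed],
                   check=True)
```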
Depends on:
- Samtools
Installation

```
pip3 install parallelbam
```

or

- Git clone the project
- cd into the cloned project directory
- Run `sudo python3 setup.py install`

It is better to install within an environment, such as a conda environment, to avoid path conflicts with the included bash scripts.
Usage
There is one main function, named parallelizeBAMoperation. This function takes as mandatory arguments:
- the path to the original BAM file (which should be sorted)
- a callable function that performs the operation on each BAM file chunk
The callable function must accept the following first two arguments, in this order (an example is sketched after the list):

- the path to the input BAM file
- the path to the resulting output BAM file
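For instance, a callable that filters out unmapped reads with samtools would satisfy this signature. This is a hypothetical operation chosen for illustration; any function accepting these two paths will work:

```python
import subprocess

def drop_unmapped(input_bam, output_bam):
    # 'samtools view -F 4' excludes records with the "unmapped" flag set;
    # -b emits BAM, -o names the output file
    subprocess.run(['samtools', 'view', '-b', '-F', '4',
                    '-o', output_bam, input_bam], check=True)
```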
TODO
- The current way of including bash scripts in the package works, but seems awkward. Including the bash code directly in subprocess calls might be simpler.
- Some installations raise a permission error upon calling splitBAM.sh; can the script be made executable during installation? (One possible approach is sketched below.)
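A minimal sketch of one way to clear the permission error, assuming the path to the bundled script can be resolved at runtime (this helper is not part of the package):

```python
import os
import stat

def ensure_executable(script_path):
    # Add execute permission for user, group, and others
    mode = os.stat(script_path).st_mode
    os.chmod(script_path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
```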
```python
from parallelbam.parallelbam import parallelizeBAMoperation, getNumberOfReads
```
As an example, let's create a function that simply copies a BAM file to another location (it does nothing to the BAM file). When this function is passed to parallelizeBAMoperation, the BAM file will simply be split into chunks, which are then merged back into a single BAM file that should be identical to the original. We will split the BAM file into 8 chunks, and our dummy function will be called in a separate process for each chunk.
```python
import shutil

def foo(input_bam, output_bam):
    shutil.copyfile(input_bam, output_bam)

parallelizeBAMoperation('sample.bam',
                        foo, output_dir=None,
                        n_processes=8)
```
To check that the processed BAM file, after merging the 8 chunks, contains the same number of reads as the original, we can call getNumberOfReads.
```python
>>> getNumberOfReads('sample.bam')
11825588
>>> getNumberOfReads('processed.bam')
11825588
```
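The counts can also be cross-checked directly with samtools, since `samtools view -c` prints the number of alignment records (assuming samtools is on PATH):

```python
import subprocess

def count_reads(bam_path):
    # 'samtools view -c' prints the number of records in the file
    result = subprocess.run(['samtools', 'view', '-c', bam_path],
                            capture_output=True, text=True, check=True)
    return int(result.stdout)
```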
File details
Details for the file parallelbam-0.0.19.tar.gz.
File metadata
- Download URL: parallelbam-0.0.19.tar.gz
- Upload date:
- Size: 10.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.9.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8a6eaf419f0baf3627a8b1c79d1f5641ba1d4c60813ab129e4690a15490fd746 |
| MD5 | 68e79f16ed5fc31d45be1736afd94010 |
| BLAKE2b-256 | c0f43eef0ad8d066b1ee4e3b15397416412a17f19ef804dd594fed626d5f45fa |
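To verify a downloaded file against the published SHA256 digest, a generic check such as the following works (the local filename is illustrative):

```python
import hashlib

with open('parallelbam-0.0.19.tar.gz', 'rb') as fh:
    digest = hashlib.sha256(fh.read()).hexdigest()

expected = '8a6eaf419f0baf3627a8b1c79d1f5641ba1d4c60813ab129e4690a15490fd746'
print(digest == expected)
```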
File details
Details for the file parallelbam-0.0.19-py3-none-any.whl.
File metadata
- Download URL: parallelbam-0.0.19-py3-none-any.whl
- Upload date:
- Size: 10.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.9.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 52b3372366023793b8df92f08011f1bc30a9a9e1d8bd2c9acbefc5406185c092 |
| MD5 | c7e2edfe5e7f5ff0756bfc83045a2fab |
| BLAKE2b-256 | e36af3fbeeaa20b28db9d9014333247e9e13e9621ff5fa8c03e44828ba96c057 |