
Tools to parallelize operations on large BAM files

Project description

Parallelizing operations on SAM/BAM files

SAM/BAM files are typically large, so operations on them are time intensive. This project provides tools to parallelize operations on SAM/BAM files. The workflow is as follows (sketched below):

  1. Split the SAM/BAM file into n chunks
  2. Perform the operation on each chunk in a dedicated process and save the resulting SAM/BAM chunk
  3. Merge the results back into a single SAM/BAM file
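
Conceptually, the pipeline looks like the following minimal sketch. This is not the package's actual implementation: it assumes already-split chunk files with hypothetical names, a dummy per-chunk operation, and samtools available on the PATH (samtools cat concatenates BAM files).

import shutil
import subprocess
from multiprocessing import Pool

def process_chunk(paths):
    # Step 2: run the per-chunk operation (here, a plain copy) in a worker process
    chunk_in, chunk_out = paths
    shutil.copyfile(chunk_in, chunk_out)

# Step 1 (assumed already done): chunk files produced by splitting the original BAM
chunks = [(f'chunk_{i}.bam', f'chunk_{i}.out.bam') for i in range(8)]

# One dedicated process per chunk
with Pool(processes=8) as pool:
    pool.map(process_chunk, chunks)

# Step 3: merge the processed chunks back into a single BAM
subprocess.run(['samtools', 'cat', '-o', 'merged.bam']
               + [out for _, out in chunks], check=True)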

Depends on:

  1. Samtools

Installation

pip3 install parallelbam

or

  1. Git clone the project
  2. cd into the cloned project directory
  3. sudo python3 setup.py install

It is better to install within an environment, such as a conda environment, to avoid path conflicts with the included bash scripts.
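
For example, with conda (the environment name is arbitrary):

conda create -n parallelbam-env python=3
conda activate parallelbam-env
pip install parallelbam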

Usage

There is one main function named parallelizeBAMoperation. This function takes as mandatory arguments:

  1. the path to the original BAM file (which should be sorted)
  2. a callable function to perform the operation on each BAM file chunk

The callable function must accept the following first two arguments, in this order:

  1. the path to the input BAM file
  2. the path to the resulting output BAM file
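
That is, any callable of the following shape will work (the name my_operation is arbitrary):

def my_operation(input_bam, output_bam):
    # first argument: path to the input BAM chunk
    # second argument: path where the processed chunk must be written
    ...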

First, import the functions:

from parallelbam.parallelbam import parallelizeBAMoperation, getNumberOfReads

As an example, let's create a function that simply copies a BAM file to another location (i.e., it does nothing to the BAM file). When this function is called through parallelizeBAMoperation, the BAM file will simply be split into chunks which are then merged back into a single BAM file, which should be identical to the original one. We will split the BAM file into 8 chunks, and our dummy function will be called in a separate process for each chunk.

import shutil

def foo(input_bam, output_bam):
    # Copy the input BAM chunk to the output path unchanged
    shutil.copyfile(input_bam, output_bam)


parallelizeBAMoperation('sample.bam',
                        foo, output_dir=None,
                        n_processes=8)

To check that the processed BAM file, after merging the 8 chunks, contains the same number of reads as the original, we can call getNumberOfReads.

getNumberOfReads('sample.bam')
11825588
getNumberOfReads('processed.bam')
11825588

TODO

  1. The current way of including bash scripts in the package, while functional, seems awkward. Perhaps calling the bash code directly through subprocess would be simpler (see the sketch below).
  2. Some installations raise a permission error upon calling splitBAM.sh. Can the script be made executable during installation?

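Regarding item 1, a samtools command can be issued directly through subprocess instead of shipping a bash script. As an illustrative sketch only (count_reads is a hypothetical name, not part of the package):

import subprocess

def count_reads(bam):
    # Count records with 'samtools view -c' instead of a packaged bash script
    result = subprocess.run(['samtools', 'view', '-c', bam],
                            capture_output=True, text=True, check=True)
    return int(result.stdout.strip())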


Download files

Download the file for your platform.

Source Distribution

parallelbam-0.0.19.tar.gz (10.1 kB)


Built Distribution

parallelbam-0.0.19-py3-none-any.whl (10.9 kB)


File details

Details for the file parallelbam-0.0.19.tar.gz.

File metadata

  • Download URL: parallelbam-0.0.19.tar.gz
  • Size: 10.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.9.7

File hashes

Hashes for parallelbam-0.0.19.tar.gz

  • SHA256: 8a6eaf419f0baf3627a8b1c79d1f5641ba1d4c60813ab129e4690a15490fd746
  • MD5: 68e79f16ed5fc31d45be1736afd94010
  • BLAKE2b-256: c0f43eef0ad8d066b1ee4e3b15397416412a17f19ef804dd594fed626d5f45fa


File details

Details for the file parallelbam-0.0.19-py3-none-any.whl.


File hashes

Hashes for parallelbam-0.0.19-py3-none-any.whl

  • SHA256: 52b3372366023793b8df92f08011f1bc30a9a9e1d8bd2c9acbefc5406185c092
  • MD5: c7e2edfe5e7f5ff0756bfc83045a2fab
  • BLAKE2b-256: e36af3fbeeaa20b28db9d9014333247e9e13e9621ff5fa8c03e44828ba96c057

