sam-dealer

High-throughput, zero-IO parallel dispatcher for SAM/BAM/CRAM streams. Distributes reads by index (round-robin) to persistent workers with backpressure management.

Install

sam-dealer relies on lightweight system-level concurrency tools (GNU Parallel and mbuffer) to achieve its performance. The commands below install the dependencies with mamba and then sam-dealer itself with pip. conda works as a (slower) drop-in replacement for mamba. click is a widely used Python package and can also be installed with pip. samtools, parallel, and mbuffer can be installed manually if you are not using mamba/conda.

mamba install -c conda-forge samtools parallel mbuffer click
pip install sam-dealer
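If you install the system tools manually, it can help to confirm that they are visible on PATH before running sam-dealer. The following is a minimal sketch, not part of sam-dealer; the function name `missing_tools` is hypothetical, and the executable names are assumed to match the package names in the install command above.

```python
import shutil

# Executable names assumed from the install command above.
REQUIRED_TOOLS = ("samtools", "parallel", "mbuffer")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing from PATH: " + ", ".join(missing))
else:
    print("All required tools found.")
```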

Example

This command starts 10 persistent wc -l jobs in parallel. Each job receives the input.bam header followed by a continuous stream of records, dealt out in batches of 100,000.

sam-dealer input.bam -N 100000 --jobs 10 "wc -l"
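To see which job a given record ends up in under round-robin batching, the arithmetic can be sketched as follows. This is an illustration of the scheme described in this README, not code from sam-dealer; the function name `job_for_record` and the 0-based numbering of records and jobs are assumptions made for the example.

```python
def job_for_record(record_index, batch_size, jobs):
    """Return the 0-based job that receives the record at 0-based `record_index`."""
    batch_index = record_index // batch_size  # which batch the record falls in
    return batch_index % jobs                 # batches are dealt to jobs in a loop


# With -N 100000 --jobs 10:
print(job_for_record(0, 100_000, 10))          # 0: first record of the first batch
print(job_for_record(99_999, 100_000, 10))     # 0: last record of the first batch
print(job_for_record(100_000, 100_000, 10))    # 1: first record of the second batch
print(job_for_record(1_000_000, 100_000, 10))  # 0: the eleventh batch wraps around
```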

Help

sam-dealer --help

Description

sam-dealer dispatches blocks of consecutive SAM/BAM/CRAM records to parallel jobs. SAM/BAM/CRAM files facilitate fetching genomic regions or individual read names, but not blocks of consecutive records. Yet some bioinformatics tools, such as pairtools parse, which extracts Hi-C pairs from alignments, require name-sorted input, so their input cannot simply be split by genomic region. Even when records can be treated independently, batching over genomic regions is non-trivial and requires careful engineering for load-balancing to avoid job idling.

Existing solutions to this problem introduce problems of their own. Splitting alignment files on disk causes substantial write amplification, requires disk I/O that is slowed further by seeking between many files, and may delay data processing until the file has been completely split, which slows the development cycle. Libraries such as pysam can stream and dispatch records after serializing them as strings, but this is slow and requires a complex manual implementation.

sam-dealer combines standard tools to stream-dispatch SAM/BAM/CRAM records to persistent jobs simply, rapidly, and with low memory pressure. Conceptually, it works as follows:

  • J persistent jobs are started. Each job is a user-specified CLI command that receives its own distinct set of record blocks from the input SAM/BAM/CRAM file.
  • The input file is divided into J input streams, one for each job.
  • Each input stream is formed by round-robin batching of N linearly consecutive records at a time. The first job gets the first N records, the second job gets the next N records, and so on in a loop until all records have been processed.
  • Input streams flow through a memory buffer (one buffer per job) that spills to disk under memory pressure. This allows all jobs to run independently at maximum speed as long as their buffers are not full, without waiting for the previous job to finish consuming its next batch.
  • Importantly, we do not initiate one job per batch. Instead, we initiate J jobs that receive the concatenation of the header and all the batches that belong to the job as a continuous stream via stdin. From the job's perspective, it is streaming in every Jth block of N records over the entire input file.
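The dealing scheme in the list above can be sketched in a few lines of Python. This is a conceptual illustration only, under stated assumptions: it works on SAM text lines, treats leading lines that start with "@" as the header, and collects each job's stream in an in-memory list. sam-dealer itself streams through per-job buffers rather than holding streams in memory, and the function name `deal` is hypothetical.

```python
from itertools import chain, islice


def deal(sam_lines, batch_size, jobs):
    """Split SAM text lines into `jobs` streams, each starting with the header."""
    lines = iter(sam_lines)

    # Collect the leading header lines; stop at the first alignment record.
    header = []
    first_record = None
    for line in lines:
        if line.startswith("@"):
            header.append(line)
        else:
            first_record = line
            break

    # Every job's stream begins with its own copy of the header.
    streams = [list(header) for _ in range(jobs)]
    if first_record is None:
        return streams  # header-only input

    # Deal batches of `batch_size` consecutive records to the jobs in a loop.
    records = chain([first_record], lines)
    job = 0
    while True:
        batch = list(islice(records, batch_size))
        if not batch:
            break
        streams[job].extend(batch)
        job = (job + 1) % jobs
    return streams


example = ["@HD\tVN:1.6"] + [f"r{i}" for i in range(7)]
for stream in deal(example, batch_size=2, jobs=3):
    print(stream)
```

With 7 records, N = 2, and J = 3, the first job receives records r0, r1, and r6, the second r2 and r3, and the third r4 and r5, each preceded by the header line.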

