sam-dealer
High-throughput, zero-IO parallel dispatcher for SAM/BAM/CRAM streams. Distributes reads by index (round-robin) to persistent workers with backpressure management.
Install
sam-dealer relies on lightweight system-level concurrency tools (GNU Parallel and mbuffer) to achieve its performance. The commands below install the dependencies with mamba and then sam-dealer itself with pip. You can also use conda as a (slower) drop-in replacement for mamba. click is a widely used Python package and can be installed with pip instead. samtools, parallel, and mbuffer can be installed manually if you are not using mamba/conda.
mamba install -c conda-forge samtools parallel mbuffer click
pip install sam-dealer
Example
This command starts 10 parallel, persistent wc -l jobs. Each job receives the input.bam header followed by a continuous stream of records, dealt out 100,000 at a time.
sam-dealer input.bam -N 100000 --jobs 10 "wc -l"
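Any command that reads its input from stdin can take the place of wc -l. As a minimal sketch, the script below counts header lines and records separately. It assumes each job receives SAM text (which the wc -l example suggests but does not confirm), and the file name count_records.py is hypothetical.

```python
import sys
from typing import Iterable, Tuple


def count_sam_lines(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (header_lines, record_lines) for an iterable of SAM text lines.

    In SAM text, header lines start with '@'; every other non-empty
    line is treated as an alignment record.
    """
    header = 0
    records = 0
    for line in lines:
        if line.startswith("@"):
            header += 1
        elif line.strip():
            records += 1
    return header, records


if __name__ == "__main__":
    if sys.stdin.isatty():
        # Nothing is piped in, so do not block waiting for input.
        print("usage: pipe SAM text into this script", file=sys.stderr)
    else:
        h, r = count_sam_lines(sys.stdin)
        print(f"header_lines={h} records={r}")
```

Under those assumptions it could be passed as the job command, for example sam-dealer input.bam -N 100000 --jobs 10 "python count_records.py".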
Help
sam-dealer --help
Description
sam-dealer enables parallel dispatch over blocks of consecutive SAM/BAM/CRAM records. SAM/BAM/CRAM files facilitate fetching genomic regions or individual read names, but not blocks of consecutive records. This is a problem for bioinformatics tools that require name-sorted input, such as pairtools parse, which extracts Hi-C pairs from alignments. Even when records can be treated independently, batching over genomic regions is non-trivial and requires careful load-balancing to avoid idle jobs.
Existing solutions to this problem introduce problems of their own. Splitting alignment files on disk results in substantial write amplification, requires disk I/O made slower by the need to seek between many inputs, and may delay data processing until the file has been completely split, slowing the development cycle. Although libraries such as pysam can stream and dispatch records after serializing them as strings, this is slow and requires a complex hand-written implementation.
sam-dealer combines standard tools to stream-dispatch SAM/BAM/CRAM records to persistent jobs simply, rapidly, and with low memory pressure. Conceptually, it works as follows:
- Spin up J persistent jobs (user-specified CLI commands that each receive distinct blocks of records from the input SAM/BAM/CRAM file).
- The input file is divided into J input streams, one for each job.
- Each input stream is formed by round-robin batching of N linearly consecutive records at a time. The first job gets the first N records, the second job gets the next N records, and so on in a loop until all records have been processed.
- Input streams flow through a memory buffer (one buffer per job) that spills to disk under memory pressure. This allows all jobs to run independently at maximum speed as long as their buffers are not full, without waiting for the previous job to finish consuming its next batch.
- Importantly, we do not initiate one job per batch. Instead, we initiate J jobs that receive the concatenation of the header and all the batches that belong to the job as a continuous stream via stdin. From the job's perspective, it is streaming in every Jth block of N records over the entire input file.
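The dealing order described above can be illustrated with a small in-memory sketch. This is not sam-dealer's implementation (which streams through GNU Parallel and mbuffer rather than building lists); the function names deal_blocks and streams_per_job are hypothetical and exist only to show which records end up in which job's stream.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def deal_blocks(records: Iterable[str], n: int, jobs: int) -> Iterator[Tuple[int, List[str]]]:
    """Yield (job_index, block) pairs, dealing blocks of n records round-robin."""
    if n < 1 or jobs < 1:
        raise ValueError("n and jobs must be positive")
    it = iter(records)
    block_index = 0
    while True:
        block = list(islice(it, n))  # next n consecutive records (fewer at the end)
        if not block:
            break
        yield block_index % jobs, block
        block_index += 1


def streams_per_job(header: Iterable[str], records: Iterable[str], n: int, jobs: int) -> List[List[str]]:
    """Build each job's stream: the header followed by every block dealt to that job."""
    header = list(header)
    streams = [list(header) for _ in range(jobs)]
    for job, block in deal_blocks(records, n, jobs):
        streams[job].extend(block)
    return streams


if __name__ == "__main__":
    demo_header = ["@HD"]
    demo_records = [f"read{i}" for i in range(7)]
    for job, stream in enumerate(streams_per_job(demo_header, demo_records, n=2, jobs=3)):
        print(job, stream)
    # 0 ['@HD', 'read0', 'read1', 'read6']
    # 1 ['@HD', 'read2', 'read3']
    # 2 ['@HD', 'read4', 'read5']
```

With N = 2 and J = 3, the fourth block wraps around to the first job, so each job sees the header once and then every third block of two records.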
File details
Details for the file sam_dealer-0.1.0.tar.gz.
File metadata
- Download URL: sam_dealer-0.1.0.tar.gz
- Upload date:
- Size: 3.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 20bcc6178a88643bacb0957edf27a611ab9d1e028583fba84374593d982e0471 |
| MD5 | bea503ac0cee3359480eab5e51b0d31b |
| BLAKE2b-256 | 798b30786a848b59fac7f6b7a01542127ef9f015cc56dad2668bc9c18132c7dc |
File details
Details for the file sam_dealer-0.1.0-py2.py3-none-any.whl.
File metadata
- Download URL: sam_dealer-0.1.0-py2.py3-none-any.whl
- Upload date:
- Size: 5.7 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0f98eafca61ede7849c2d450e71558dba56cacd425806aafc700e8b26fdf2bbd |
| MD5 | 4eab891d4d410bfaf0d326b1e78c1881 |
| BLAKE2b-256 | 23276300b9e63ee562c4ed30550196ccd030cea59005b6228ce772a66d3c668e |