Depth-aware dynamic BAM file downsampling
Project description
samsampleX
A Python-based tool for dynamic BAM file downsampling, unlike existing tools that only downsample uniformly, based on a single global fraction value. Sample reads from a source BAM file to match the depth of coverage distribution of one or more template BAM file(s) through a created BED template.
Features:
- Reproducable, integer seed-based deterministic downsampling
- Uniform sampling mode: downsample to generate an even depth distribution with a depth target.
- Map depth from multiple BAM files to a single BED template via common aggregation statistics (
min,mean,median,max,random). - Downsampling accuracy calculation (
stats): per-window signed depth difference (raw and relative to template mean depth), TSV output, and a stderr summary of the top windows by absolute value for each of those two metrics (signed values shown; row count configurable with--rows, default 10). - Plotting for visual sampling comparisons, with an option to emit a TSV file of the same data instead.
Installation
Requirements
- pysam
- xxHash
- numpy
- matplotlib
- scipy
- Snakemake (benchmarking only)
- pytest (testing only)
Install samsampleX from PyPI
pip install samsampleX
Usage
Mapping
Extract depth of coverage from one or more template BAM file(s) to a single BED template. When multiple BAMs are provided, per-position depths are combined using the selected --mode.
# Single BAM
samsampleX map \
--template-bam template.bam \
--region chr1:1000-2000 \
--out-bed template.bed
# Multiple BAMs (combined per-position using mean)
samsampleX map \
--template-bam a.bam b.bam c.bam \
--region chr1:1000-2000 \
--mode mean \
--out-bed template.bed
| Option | Description | Default |
|---|---|---|
--template-bam FILE [FILE ...] |
Input BAM file(s) (required) | - |
--region REGION |
Target region, samtools-style (required) | - |
--out-bed FILE |
Output BED file (required) | - |
--collapse INT |
Merge consecutive positions with depth diff <= INT | 0 |
--mode MODE |
Combine mode when multiple BAMs: min, mean, median, max, random |
mean |
--seed INT |
Random seed for --mode random |
42 |
Sampling
Downsample BAM based on provided BED template, using selected metric if multiple BEDs provided. Alternatively, use --uniform with a target depth (e.g. 30) for even-depth downsampling without a template BED.
Depth-based sampling (template required):
samsampleX sample \
--source-bam high_depth.bam \
--template-bed template.bed \
--region chr1:1000-2000 \
--out-bam sampled.bam
Target-depth sampling (no template):
samsampleX sample \
--source-bam high_depth.bam \
--uniform 30 \
--region chr1:1000-2000 \
--out-bam sampled.bam
Downsamples toward approximately 30x coverage using per-position ratios derived from source depth.
| Option | Description | Default |
|---|---|---|
--source-bam FILE |
Input BAM to sample reads from (required) | - |
--template-bed FILE |
Template BED file; required unless --uniform is used |
- |
--uniform DEPTH |
Target depth in x (e.g. 30). Depth-aware downsampling without a template BED. | - |
--region REGION |
Target region, samtools-style (required) | - |
--out-bam FILE |
Output BAM file to write reads to (required) | - |
--mode MODE |
Combine mode for multiple templates: min, mean, median, max, random |
random |
--stat STAT |
Statistic for summarising ratio over read span: min, mean, median, max, random |
mean |
--seed INT |
Random seed for reproducibility | 42 |
Plotting
Compare depth of coverage between source, template, and output BAM files. Output either as PNG plot or TSV data.
Green is source, orange is template and blue is output depth.
TSV output contains one column for position, and three for respective depths of source, template and output.
# Generate PNG plot
samsampleX plot \
--source-bam high_depth.bam \
--template-bam template.bam \
--out-bam sampled.bam \
--region chr1:1000-2000 \
--out-png coverage_plot.png
| Option | Description | Default |
|---|---|---|
--source-bam FILE |
Source BAM file (required) | - |
--template-bam FILE |
Template BAM file (mutually exclusive with --template-bed) | - |
--template-bed FILE |
Template BED file (mutually exclusive with --template-bam) | - |
--out-bam FILE |
Output BAM file from sampling (required) | - |
--region REGION |
Target region, samtools-style (required) | - |
--out-png FILE |
Output PNG plot (mutually exclusive with --out-tsv) | - |
--out-tsv FILE |
Output TSV data (mutually exclusive with --out-png) | - |
Mapback
If you do not use HLA*LA and its specific read processing method, fignore this section.
Remap HLA*LA PRG-mapped reads back to canonical chr6 coordinates. This is a preprocessing step for BAM files produced by HLA*LA, which maps reads to a pangenome reference graph (PRG) with synthetic contig names (PRG_1, PRG_2, ...). The mapback subcommand translates these back to chr6 positions using the HLA*LA sequences.txt file and known HLA gene / alt contig boundaries.
The output BAM can then be used as input to sample for depth-aware downsampling on chr6.
# Step 1: remap PRG reads to chr6
samsampleX mapback \
--source-bam hlala_output.bam \
--region chr6:28000000-34000000 \
--genome-build GRCh38 \
--out-bam remapped.bam
# Step 2: sample from the remapped BAM
samsampleX sample \
--source-bam remapped.bam \
--template-bed template.bed \
--region chr6:28000000-34000000 \
--out-bam sampled.bam
| Option | Description | Default |
|---|---|---|
--source-bam FILE |
HLA*LA-remapped BAM file (required) | - |
--region REGION |
Target region on chr6, samtools-style (required) | - |
--out-bam FILE |
Output BAM file (required) | - |
--genome-build BUILD |
Reference genome build: GRCh38 or GRCh37 (required) |
- |
--prg-seq FILE |
Path to HLA*LA sequences.txt |
HLA-LA/graphs/PRG_MHC_GRCh38_withIMGT/sequences.txt |
Stats
Compare depth between a template track and a result track over a region. Each input can be a BAM or BED file (inferred from file extension). Depths are aggregated into non-overlapping windows of --window base pairs.
TSV output is one row per window with a header line:
chrom, start, end, mean_depth_temp, depth_diff, rel_depth_diff, mean_depth_res
Per window, depth_diff is the mean signed depth difference (result − template). rel_depth_diff is depth_diff divided by mean_depth_temp when that mean is positive; otherwise it is not a finite value. mean_depth_temp and mean_depth_res are the mean depths of the template and result in the window.
Standard error prints the command line and, if any window has zero mean template or zero mean result depth, a warning. It then lists up to --rows windows (default 10) with the largest |depth_diff|, and up to --rows with the largest |rel_depth_diff|, in each case printing the signed value (explicit + or -). Fewer lines appear when there are not enough windows or not enough finite relative values.
# BAM vs BAM
samsampleX stats \
--template template.bam \
--result sampled.bam \
--region chr1:1000-2000
# BED vs BAM (e.g. cohort template against sampled output)
samsampleX stats \
--template template.bed \
--result sampled.bam \
--region chr1:1000-2000 \
--window 100 \
--out-tsv per_window.tsv
| Option | Description | Default |
|---|---|---|
--template FILE |
Template track — BAM or BED (reference depths) (required) | - |
--result FILE |
Result track — BAM or BED (comparison depths) (required) | - |
--region REGION |
Target region, samtools-style (required) | - |
--window INT |
Window size in bp for non-overlapping aggregation | 100 |
--out-tsv FILE |
Per-window metrics TSV; use - for stdout |
- |
--rows INT |
Top windows per metric to print on stderr (by ` | depth_diff |
Example
The following commands showcase an example workflow of a short, arbitrary region on chromosome 21. Three 1000 Genomes Project 30X WGS samples are downloaded and mapped to a template, then used to downsample a GIAB 300X WGS sample in the same region. The results are finally displayed on a plot.
cd examples/
# Download reference genome (GRCh38)
wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
# Download three first three 1K Genomes 30X WGS samples from
# https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/1000G_2504_high_coverage.sequence.index
wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239480/NA12718.final.cram -O NA12718.cram && samtools index NA12718.cram
wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239481/NA12748.final.cram -O NA12748.cram && samtools index NA12748.cram
wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239482/NA12775.final.cram -O NA12775.cram && samtools index NA12775.cram
# Convert to BAM, restrict to target region and index
samtools view NA12718.cram chr21:10000000-10010000 -b -o NA12718.bam -T GRCh38_full_analysis_set_plus_decoy_hla.fa && samtools index NA12718.bam
samtools view NA12748.cram chr21:10000000-10010000 -b -o NA12748.bam -T GRCh38_full_analysis_set_plus_decoy_hla.fa && samtools index NA12748.bam
samtools view NA12775.cram chr21:10000000-10010000 -b -o NA12775.bam -T GRCh38_full_analysis_set_plus_decoy_hla.fa && samtools index NA12775.bam
# Run samsampleX workflow
samsampleX map \
--template-bam NA12718.bam NA12748.bam NA12775.bam \
--region chr21:10000000-10010000 \
--mode mean \
--collapse 0 \
--out-bed template.bed
# template.bed should match example-template.bed
# Source BAM+index is provided in the examples directory, created by subsetting to target region from
# https://ftp.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/NIST_HiSeq_HG002_Homogeneity-10953946/NHGRI_Illumina300X_AJtrio_novoalign_bams/HG002.GRCh38.300x.bam
samsampleX sample \
--source-bam HG002.GRCh38.300x.chr21:10000000-10010000.bam \
--template-bed template.bed \
--region chr21:10000000-10010000 \
--seed 42 \
--out-bam sampled.bam
samtools index sampled.bam
samsampleX plot \
--source-bam HG002.GRCh38.300x.chr21:10000000-10010000.bam \
--template-bed template.bed \
--out-bam sampled.bam \
--region chr21:10000000-10010000 \
--out-png plot.png
# plot.png should match example-plot.png
Testing
A pytest test suite is available. Run with the -v flag for a detailed report.
pytest -v
Algorithm rundown
Mapping
- Parse target region from first BAM header
- Compute per-position depth of coverage for each BAM over the region
- If multiple BAMs: combine depths per-position using
--mode(min, mean, median, max, random) - Optionally collapse consecutive similar depths (
--collapse) - Write to BED4 format (
chrom,start,end,depthcolumns)
Sampling
- Target-depth mode (
--uniform DEPTH): Compute source depths from BAM. Per-position coefficient: $ratio(i) = \min(1,; DEPTH ;/; depth_{source}(i))$. Positions with zero source depth get coefficient 0.0. Then follow steps 5–6 below (including--stat). - Template mode: Load template depths from BED file(s); if multiple templates are provided, combine them per-position using the selected
--mode - Compute source depths from BAM
- Calculate per-position sampling coefficient: $ratio(i) = \min(1,; depth_{template}(i) ;/; depth_{source}(i))$
- Positions where the template depth meets or exceeds the source depth get coefficient 1.0 (keep all reads)
- Positions with zero source depth get coefficient 0.0
- Build a cumulative sum of the coefficient array for O(1) range queries
- For each read in the source BAM:
- Hash read name with xxHash32 to produce a deterministic fraction $f_{read} \in [0, 1)$
- Summarise the coefficient over the read's covered positions using
--stat(min, mean, median, max, random; default mean via cumsum for mean).randompicks one overlap ratio from a deterministic index (read span + seed). - Keep the read if $f_{read} < ratio_{read}$
Metrics
Windowed statistics from stats (each value is a mean over positions inside that window, except rel_depth_diff, which scales that window’s mean signed error by the template mean depth):
| Metric | Significance |
|---|---|
mean_depth_temp |
Mean depth for the template in the window |
depth_diff |
Mean signed depth difference (result − template) |
rel_depth_diff |
depth_diff / mean_depth_temp when mean_depth_temp > 0 |
mean_depth_res |
Mean depth for the result in the window |
Benchmarking
Benchmarking is done by a snakemake workflow in the benchmarks directory, and thus snakemake should be installed beforehand (for HPC systems, also install snakemake-executor-plugin-slurm or other plugin compatible with your HPC executor type).
An Apptainer container definition bench.def that contains installs for GATK, samtools, sambamba and samsampleX is included. Build this container using apptainer build bench.sif bench.def before running the workflow.
Configure the benchmarking parameters in config.yaml in the same directory: copy and rename an existing chunk with all parameters and populate the values. All input files are expected to be found in the same directory as config.yaml, BAM files should be indexed using samtools index.
Download the source alignment from [here](https://ftp.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/
HG002_NA24385_son/NIST_HiSeq_HG002_Homogeneity-10953946/
NHGRI_Illumina300X_AJtrio_novoalign_bams/HG002.GRCh38.300x.bam) and the templates from here. The templates in this example config should go in a directory called template-bams on the same level as the Snakefile.
The benchmarking is pattern dependent base on the name of each setup. If a setup starts with templated, the specification of at least one template is required. Otherwise, if a setup starts with uniform, a target_depth value needs to be supplied. Examples for this configuration are below.
# config.yaml
parameters:
seed: 42
source: "HG002.GRCh38.300x.bam"
n_replicates: 10
collapse: 0
cpus_per_task: 2
mem_mb: 16384
runtime: 30
tools:
- gatk-constant-memory
- gatk-high-accuracy
- gatk-chained
- samtools
- sambamba
- samsampleX
benchmarks:
# HLA-A
templated-wgs-hla-a:
region: "chr6:29941260-29945884"
templates:
- template-bams/HG01363.bam
- template-bams/HG01941.bam
- template-bams/HG02122.bam
- template-bams/HG02127.bam
- template-bams/HG02291.bam
- template-bams/HG02678.bam
- template-bams/HG03593.bam
- template-bams/HG03780.bam
- template-bams/HG03965.bam
- template-bams/NA18547.bam
uniform-wgs-hla-a:
region: "chr6:29941260-29945884"
target_depth: 30
# TP53
templated-wegs-tp53:
region: "chr17:7758460-7784220"
templates:
- template-bams/HG002-WEGS-8P5X-R1.bam
# CHR21
uniform-wgs-chr21-30x:
region: "chr21:1-46709982"
target_depth: 30
When executing the workflow, navigate to the benchmarks directory and make sure to use the following arguments:
snakemake -p --use-apptainer --apptainer-args '--bind $(pwd)'
A directory for all intermediate files will be created for each chunk defined in config.yaml and the final benchmark results will be made available in each benchmark directory as summary.tsv.
Development
Get the most up-to-date version of samsampleX from Github and build locally.
git clone https://github.com/sdemiriz/samsampleX.git
cd samsampleX
pip install -e .
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file samsamplex-0.2.0.tar.gz.
File metadata
- Download URL: samsamplex-0.2.0.tar.gz
- Upload date:
- Size: 30.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
799c2c9d3bc496392b2ead8c83ed14375e59a68a371f055f25bedad4f52a4cfb
|
|
| MD5 |
29776cdd749a2525408589b7b3e6050b
|
|
| BLAKE2b-256 |
d88fb3ce8550fcc592eca9e73250166f9cab161bec6ff34c26e04dcd9e5dd961
|
Provenance
The following attestation bundles were made for samsamplex-0.2.0.tar.gz:
Publisher:
publish-pypi.yml on sdemiriz/samsampleX
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
samsamplex-0.2.0.tar.gz -
Subject digest:
799c2c9d3bc496392b2ead8c83ed14375e59a68a371f055f25bedad4f52a4cfb - Sigstore transparency entry: 1735149918
- Sigstore integration time:
-
Permalink:
sdemiriz/samsampleX@f8ec47b01c14a1a8ad255cce8e8189b9a9d0a54c -
Branch / Tag:
refs/tags/v0.2.0 - Owner: https://github.com/sdemiriz
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish-pypi.yml@f8ec47b01c14a1a8ad255cce8e8189b9a9d0a54c -
Trigger Event:
release
-
Statement type:
File details
Details for the file samsamplex-0.2.0-py3-none-any.whl.
File metadata
- Download URL: samsamplex-0.2.0-py3-none-any.whl
- Upload date:
- Size: 25.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ce7670200ca5544650a0028d4174920626f1b0c6c1e1d56a72d2f6ad4408ac73
|
|
| MD5 |
8c44a23743f1cd3644db23cbc2804dce
|
|
| BLAKE2b-256 |
ac6378272f0433b71605e4379cad942f711c421443ccee10185b6180705b2e06
|
Provenance
The following attestation bundles were made for samsamplex-0.2.0-py3-none-any.whl:
Publisher:
publish-pypi.yml on sdemiriz/samsampleX
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
samsamplex-0.2.0-py3-none-any.whl -
Subject digest:
ce7670200ca5544650a0028d4174920626f1b0c6c1e1d56a72d2f6ad4408ac73 - Sigstore transparency entry: 1735149943
- Sigstore integration time:
-
Permalink:
sdemiriz/samsampleX@f8ec47b01c14a1a8ad255cce8e8189b9a9d0a54c -
Branch / Tag:
refs/tags/v0.2.0 - Owner: https://github.com/sdemiriz
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish-pypi.yml@f8ec47b01c14a1a8ad255cce8e8189b9a9d0a54c -
Trigger Event:
release
-
Statement type: