Depth-aware dynamic BAM file downsampling
Project description
samsampleX
A Python-based tool for dynamic BAM file downsampling, unlike existing tools that only downsample uniformly, based on a single global fraction value. Sample reads from a source BAM file to match the depth of coverage distribution of one or more template BAM file(s) through a created BED template.
Features:
- Reproducable, integer seed-based deterministic downsampling
- Uniform sampling mode: retain a fixed fraction of reads, feature parity with existing tools.
- Map depth from multiple BAM files to a single BED template via common aggregation statistics (
min,mean,median,max,random). - Calculation of quality metrics:
- Mean absolute error (MAE): mean per-base absolute difference in depth between two BAMs over the region.
- First-order Wasserstein distance (W1): L1 distance between empirical CDFs of per-base depths.
- Plotting for visual sampling comparisons, with an option to emit a TSV file of the same data instead.
Installation
Requirements
- pysam
- xxHash
- numpy
- matplotlib
- scipy
- Snakemake (benchmarking only)
- pytest (testing only)
Build samsampleX
git clone https://github.com/sdemiriz/samsampleX.git
cd samsampleX
pip install .
Usage
Mapping
Extract depth of coverage from one or more template BAM file(s) to a single BED template. When multiple BAMs are provided, per-position depths are combined using the selected --mode.
# Single BAM
samsampleX map \
--template-bam template.bam \
--region chr1:1000-2000 \
--out-bed template.bed
# Multiple BAMs (combined per-position using mean)
samsampleX map \
--template-bam a.bam b.bam c.bam \
--region chr1:1000-2000 \
--mode mean \
--out-bed template.bed
| Option | Description | Default |
|---|---|---|
--template-bam FILE [FILE ...] |
Input BAM file(s) (required) | - |
--region REGION |
Target region, samtools-style (required) | - |
--out-bed FILE |
Output BED file (required) | - |
--collapse INT |
Merge consecutive positions with depth diff <= INT | 0 |
--mode MODE |
Combine mode when multiple BAMs: min, mean, median, max, random |
mean |
--seed INT |
Random seed for --mode random |
42 |
Sampling
Downsample BAM based on provided BED template, using selected metric if multiple BEDs provided. Alternatively, use --uniform for position-independent uniform sampling similar to existing tools.
Depth-based sampling (template required):
samsampleX sample \
--source-bam high_depth.bam \
--template-bed template.bed \
--region chr1:1000-2000 \
--out-bam sampled.bam
Uniform sampling (no template):
samsampleX sample \
--source-bam high_depth.bam \
--uniform 0.5 \
--region chr1:1000-2000 \
--out-bam sampled.bam
Retains approximately 50% of reads uniformly across the region.
| Option | Description | Default |
|---|---|---|
--source-bam FILE |
Input BAM to sample reads from (required) | - |
--template-bed FILE |
Template BED file; required unless --uniform is used |
- |
--uniform FRACTION |
Uniform sampling: retain fraction of reads. Bypasses template-based downsampling. | - |
--region REGION |
Target region, samtools-style (required) | - |
--out-bam FILE |
Output BAM file to write reads to (required) | - |
--mode MODE |
Combine mode for multiple templates: min, mean, median, max, random |
random |
--stat STAT |
Statistic for summarising ratio over read span: min, mean, median, max, random |
mean |
--seed INT |
Random seed for reproducibility | 42 |
Plotting
Compare depth of coverage between source, template, and output BAM files. Output either as PNG plot or TSV data.
Green is source, orange is template and blue is output depth.
TSV contains one column for position, and three for respective depths of source, template and output.
# Generate PNG plot
samsampleX plot \
--source-bam high_depth.bam \
--template-bam template.bam \
--out-bam sampled.bam \
--region chr1:1000-2000 \
--out-png coverage_plot.png
| Option | Description | Default |
|---|---|---|
--source-bam FILE |
Source BAM file (required) | - |
--template-bam FILE |
Template BAM file (mutually exclusive with --template-bed) | - |
--template-bed FILE |
Template BED file (mutually exclusive with --template-bam) | - |
--out-bam FILE |
Output BAM file from sampling (required) | - |
--region REGION |
Target region, samtools-style (required) | - |
--out-png FILE |
Output PNG plot (mutually exclusive with --out-tsv) | - |
--out-tsv FILE |
Output TSV data (mutually exclusive with --out-png) | - |
Mapback
If you do not use HLA*LA and its specific read processing method, feel free to ignore this section.
Remap HLA*LA PRG-mapped reads back to canonical chr6 coordinates. This is a preprocessing step for BAM files produced by HLA*LA, which maps reads to a pangenome reference graph (PRG) with synthetic contig names (PRG_1, PRG_2, ...). The mapback subcommand translates these back to chr6 positions using the HLA*LA sequences.txt file and known HLA gene / alt contig boundaries.
The output BAM can then be used as input to sample for depth-aware downsampling on chr6.
# Step 1: remap PRG reads to chr6
samsampleX mapback \
--source-bam hlala_output.bam \
--region chr6:28000000-34000000 \
--genome-build GRCh38 \
--out-bam remapped.bam
# Step 2: sample from the remapped BAM
samsampleX sample \
--source-bam remapped.bam \
--template-bed template.bed \
--region chr6:28000000-34000000 \
--out-bam sampled.bam
| Option | Description | Default |
|---|---|---|
--source-bam FILE |
HLA*LA-remapped BAM file (required) | - |
--region REGION |
Target region on chr6, samtools-style (required) | - |
--out-bam FILE |
Output BAM file (required) | - |
--genome-build BUILD |
Reference genome build: GRCh38 or GRCh37 (required) |
- |
--prg-seq FILE |
Path to HLA*LA sequences.txt |
HLA-LA/graphs/PRG_MHC_GRCh38_withIMGT/sequences.txt |
Stats
Compare depth distributions between two inputs over a given region. Each input can be a BAM file or a BED file (auto-detected by extension). Reports mean absolute error (MAE) and Wasserstein-1 distance.
# BAM vs BAM
samsampleX stats \
--a template.bam \
--b sampled.bam \
--region chr1:1000-2000
# BED vs BAM (e.g. combined cohort template against sampled output)
samsampleX stats \
--a template.bed \
--b sampled.bam \
--region chr1:1000-2000
| Option | Description | Default |
|---|---|---|
--a FILE |
First input — BAM or BED file (reference) (required) | - |
--b FILE |
Second input — BAM or BED file (comparison) (required) | - |
--region REGION |
Target region, samtools-style (required) | - |
Example
The following commands showcase an example workflow of a short, arbitrary region on chromosome 21. Three 1000 Genomes Project 30X WGS samples are downloaded and mapped to a template, then used to downsample a GIAB 300X WGS sample in the same region. The results are finally displayed on a plot.
cd examples/
# Download reference genome (GRCh38)
wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
# Download three first three 1K Genomes 30X WGS samples from
# https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/1000G_2504_high_coverage.sequence.index
wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239480/NA12718.final.cram -O NA12718.cram && samtools index NA12718.cram
wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239481/NA12748.final.cram -O NA12748.cram && samtools index NA12748.cram
wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239482/NA12775.final.cram -O NA12775.cram && samtools index NA12775.cram
# Convert to BAM, restrict to target region and index
samtools view NA12718.cram chr21:10000000-10010000 -b -o NA12718.bam -T GRCh38_full_analysis_set_plus_decoy_hla.fa && samtools index NA12718.bam
samtools view NA12748.cram chr21:10000000-10010000 -b -o NA12748.bam -T GRCh38_full_analysis_set_plus_decoy_hla.fa && samtools index NA12748.bam
samtools view NA12775.cram chr21:10000000-10010000 -b -o NA12775.bam -T GRCh38_full_analysis_set_plus_decoy_hla.fa && samtools index NA12775.bam
# Run samsampleX workflow
samsampleX map \
--template-bam NA12718.bam NA12748.bam NA12775.bam \
--region chr21:10000000-10010000 \
--mode mean \
--collapse 0 \
--out-bed template.bed
# template.bed should match example-template.bed
# Source BAM+index is provided in the examples directory, created by subsetting to target region from
# https://ftp.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/NIST_HiSeq_HG002_Homogeneity-10953946/NHGRI_Illumina300X_AJtrio_novoalign_bams/HG002.GRCh38.300x.bam
samsampleX sample \
--source-bam HG002.GRCh38.300x.chr21:10000000-10010000.bam \
--template-bed template.bed \
--region chr21:10000000-10010000 \
--seed 42 \
--out-bam sampled.bam
samtools index sampled.bam
samsampleX plot \
--source-bam HG002.GRCh38.300x.chr21:10000000-10010000.bam \
--template-bed template.bed \
--out-bam sampled.bam \
--region chr21:10000000-10010000 \
--out-png plot.png
# plot.png should match example-plot.png
Testing
A pytest test suite is available. Run with the -v flag for a detailed report.
pytest -v
Algorithm rundown
Mapping
- Parse target region from first BAM header
- Compute per-position depth of coverage for each BAM over the region
- If multiple BAMs: combine depths per-position using
--mode(min, mean, median, max, random) - Optionally collapse consecutive similar depths (
--collapse) - Write to BED4 format (
chrom,start,end,depthcolumns)
Sampling
- Uniform mode (
--uniform FRACTION): Skip template downsampling. For each read, hash the read name with xxHash32 to get $f_{read} \in [0, 1)$; keep if $f_{read} < FRACTION$. Deterministic and position-independent. - Depth-based mode: Load template depths from BED file(s); if multiple templates are provided, combine them per-position using the selected
--mode - Compute source depths from BAM
- Calculate per-position sampling coefficient: $ratio(i) = \min(1,; depth_{template}(i) ;/; depth_{source}(i))$
- Positions where the template depth meets or exceeds the source depth get coefficient 1.0 (keep all reads)
- Positions with zero source depth get coefficient 0.0
- Build a cumulative sum of the coefficient array for O(1) range queries
- For each read in the source BAM:
- Hash read name with xxHash32 to produce a deterministic fraction $f_{read} \in [0, 1)$
- Summarise the coefficient over the read's covered positions using
--stat(min, mean, median, max, random; default mean via cumsum for mean).randompicks one overlap ratio from a deterministic index (read span + seed). - Keep the read if $f_{read} < ratio_{read}$
Metrics
| Metric | Significance |
|---|---|
| Mean Absolute Error | Average absolute per-base depth difference between the two BAMs |
| Wasserstein-1 Distance | L1 distance between empirical CDFs of depth (scales with region length) |
Benchmarking
Benchmarking is done by a snakemake workflow in the benchmarks directory, and thus snakemake should be installed beforehand (for HPC systems, also install snakemake-executor-plugin-slurm or other plugin compatible with your system type).
An Apptainer container definition bench.def that contains installs for GATK, samtools, sambamba and samsampleX is included. Build this container using apptainer build bench.sif bench.def before running the workflow.
Configure the benchmarking parameters in config.yaml in the same directory: copy and rename an existing chunk with all parameters and populate the values. All input files are expected to be found in the same directory as config.yaml, BAM files should be indexed using samtools index.
# config.yaml
benchmarks: # all chunks should be children of this header
wgs-chr21: # arbitrary name for benchmarking instance, parameters will be children
chr: "chr21" # specify contig
start: 1 # region start coordinate
end: 46709982 # region end coordinate
seed: 42 # random seed (base)
n_replicates: 1 # replicate count, will affect seed
# (e.g. seed=42, n=3 will use seeds 43, 44, 45)
collapse: 0 # define smoothing during mapping step
templates: # specify files to use as templates in sampling
- "template.bam" # all files must be in the benchmarks directory
mode: "mean" # how to determine per-position template depths from multiple template files
source: "source.bam" # specify file to downsample
coefficient: 0.1 # coefficient provided to GATK, samtools, sambamba
cpu: 2 # specify hardware resource (used by all steps)
mem_mb: 16384
time: "10:00"
When executing the workflow, navigate to the benchmarks directory and make sure to use the following arguments:
snakemake -p --use-apptainer --apptainer-args '--bind $(pwd)'
A directory for all intermediate files will be created for each chunk defined in config.yaml and the final benchmark results will be made available in the benchmarks directory as benchmark-{chunk_name}.tsv.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file samsamplex-0.1.0.tar.gz.
File metadata
- Download URL: samsamplex-0.1.0.tar.gz
- Upload date:
- Size: 26.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
39f81590fc02b7a4a8eaa8fa97cc14d96db7f3f40bc7a2c3ea01714caab84f19
|
|
| MD5 |
200a9d99b580da01a126ea790b99ec5c
|
|
| BLAKE2b-256 |
208f5e2250c07fc076ba099405269f6b68db09234648788535f25c0f9d63f4ed
|
Provenance
The following attestation bundles were made for samsamplex-0.1.0.tar.gz:
Publisher:
publish-pypi.yml on sdemiriz/samsampleX
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
samsamplex-0.1.0.tar.gz -
Subject digest:
39f81590fc02b7a4a8eaa8fa97cc14d96db7f3f40bc7a2c3ea01714caab84f19 - Sigstore transparency entry: 1453083918
- Sigstore integration time:
-
Permalink:
sdemiriz/samsampleX@284611eecdaf5bda7043d645d3d6090cabbd71a5 -
Branch / Tag:
refs/tags/v0.1.0 - Owner: https://github.com/sdemiriz
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish-pypi.yml@284611eecdaf5bda7043d645d3d6090cabbd71a5 -
Trigger Event:
release
-
Statement type:
File details
Details for the file samsamplex-0.1.0-py3-none-any.whl.
File metadata
- Download URL: samsamplex-0.1.0-py3-none-any.whl
- Upload date:
- Size: 23.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
8bff6e48a133f4c814df87922aca55a176c2d2b24178706129dd6472c5894935
|
|
| MD5 |
bb4e4bb6cfb67d1335d142aaf51372dd
|
|
| BLAKE2b-256 |
ca853961bb0f966dc3bb8d2a4e0b8ba39e398d9849afc4142294156ea4953a17
|
Provenance
The following attestation bundles were made for samsamplex-0.1.0-py3-none-any.whl:
Publisher:
publish-pypi.yml on sdemiriz/samsampleX
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
samsamplex-0.1.0-py3-none-any.whl -
Subject digest:
8bff6e48a133f4c814df87922aca55a176c2d2b24178706129dd6472c5894935 - Sigstore transparency entry: 1453083941
- Sigstore integration time:
-
Permalink:
sdemiriz/samsampleX@284611eecdaf5bda7043d645d3d6090cabbd71a5 -
Branch / Tag:
refs/tags/v0.1.0 - Owner: https://github.com/sdemiriz
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish-pypi.yml@284611eecdaf5bda7043d645d3d6090cabbd71a5 -
Trigger Event:
release
-
Statement type: