Skip to main content

Depth-aware dynamic BAM file downsampling

Project description

samsampleX

A Python-based tool for dynamic BAM file downsampling, unlike existing tools that only downsample uniformly, based on a single global fraction value. Sample reads from a source BAM file to match the depth of coverage distribution of one or more template BAM file(s) through a created BED template.

Features:

  • Reproducable, integer seed-based deterministic downsampling
  • Uniform sampling mode: retain a fixed fraction of reads, feature parity with existing tools.
  • Map depth from multiple BAM files to a single BED template via common aggregation statistics (min, mean, median, max, random).
  • Calculation of quality metrics:
    • Mean absolute error (MAE): mean per-base absolute difference in depth between two BAMs over the region.
    • First-order Wasserstein distance (W1): L1 distance between empirical CDFs of per-base depths.
  • Plotting for visual sampling comparisons, with an option to emit a TSV file of the same data instead.

Installation

Requirements

  • pysam
  • xxHash
  • numpy
  • matplotlib
  • scipy
  • Snakemake (benchmarking only)
  • pytest (testing only)

Build samsampleX

git clone https://github.com/sdemiriz/samsampleX.git
cd samsampleX
pip install .

Usage

Mapping

Extract depth of coverage from one or more template BAM file(s) to a single BED template. When multiple BAMs are provided, per-position depths are combined using the selected --mode.

# Single BAM
samsampleX map \
    --template-bam template.bam \
    --region chr1:1000-2000 \
    --out-bed template.bed

# Multiple BAMs (combined per-position using mean)
samsampleX map \
    --template-bam a.bam b.bam c.bam \
    --region chr1:1000-2000 \
    --mode mean \
    --out-bed template.bed
Option Description Default
--template-bam FILE [FILE ...] Input BAM file(s) (required) -
--region REGION Target region, samtools-style (required) -
--out-bed FILE Output BED file (required) -
--collapse INT Merge consecutive positions with depth diff <= INT 0
--mode MODE Combine mode when multiple BAMs: min, mean, median, max, random mean
--seed INT Random seed for --mode random 42

Sampling

Downsample BAM based on provided BED template, using selected metric if multiple BEDs provided. Alternatively, use --uniform for position-independent uniform sampling similar to existing tools.

Depth-based sampling (template required):

samsampleX sample \
    --source-bam high_depth.bam \
    --template-bed template.bed \
    --region chr1:1000-2000 \
    --out-bam sampled.bam

Uniform sampling (no template):

samsampleX sample \
    --source-bam high_depth.bam \
    --uniform 0.5 \
    --region chr1:1000-2000 \
    --out-bam sampled.bam

Retains approximately 50% of reads uniformly across the region.

Option Description Default
--source-bam FILE Input BAM to sample reads from (required) -
--template-bed FILE Template BED file; required unless --uniform is used -
--uniform FRACTION Uniform sampling: retain fraction of reads. Bypasses template-based downsampling. -
--region REGION Target region, samtools-style (required) -
--out-bam FILE Output BAM file to write reads to (required) -
--mode MODE Combine mode for multiple templates: min, mean, median, max, random random
--stat STAT Statistic for summarising ratio over read span: min, mean, median, max, random mean
--seed INT Random seed for reproducibility 42

Plotting

Compare depth of coverage between source, template, and output BAM files. Output either as PNG plot or TSV data.

Green is source, orange is template and blue is output depth.

TSV contains one column for position, and three for respective depths of source, template and output.

# Generate PNG plot
samsampleX plot \
    --source-bam high_depth.bam \
    --template-bam template.bam \
    --out-bam sampled.bam \
    --region chr1:1000-2000 \
    --out-png coverage_plot.png
Option Description Default
--source-bam FILE Source BAM file (required) -
--template-bam FILE Template BAM file (mutually exclusive with --template-bed) -
--template-bed FILE Template BED file (mutually exclusive with --template-bam) -
--out-bam FILE Output BAM file from sampling (required) -
--region REGION Target region, samtools-style (required) -
--out-png FILE Output PNG plot (mutually exclusive with --out-tsv) -
--out-tsv FILE Output TSV data (mutually exclusive with --out-png) -

Mapback

If you do not use HLA*LA and its specific read processing method, feel free to ignore this section.

Remap HLA*LA PRG-mapped reads back to canonical chr6 coordinates. This is a preprocessing step for BAM files produced by HLA*LA, which maps reads to a pangenome reference graph (PRG) with synthetic contig names (PRG_1, PRG_2, ...). The mapback subcommand translates these back to chr6 positions using the HLA*LA sequences.txt file and known HLA gene / alt contig boundaries.

The output BAM can then be used as input to sample for depth-aware downsampling on chr6.

# Step 1: remap PRG reads to chr6
samsampleX mapback \
    --source-bam hlala_output.bam \
    --region chr6:28000000-34000000 \
    --genome-build GRCh38 \
    --out-bam remapped.bam

# Step 2: sample from the remapped BAM
samsampleX sample \
    --source-bam remapped.bam \
    --template-bed template.bed \
    --region chr6:28000000-34000000 \
    --out-bam sampled.bam
Option Description Default
--source-bam FILE HLA*LA-remapped BAM file (required) -
--region REGION Target region on chr6, samtools-style (required) -
--out-bam FILE Output BAM file (required) -
--genome-build BUILD Reference genome build: GRCh38 or GRCh37 (required) -
--prg-seq FILE Path to HLA*LA sequences.txt HLA-LA/graphs/PRG_MHC_GRCh38_withIMGT/sequences.txt

Stats

Compare depth distributions between two inputs over a given region. Each input can be a BAM file or a BED file (auto-detected by extension). Reports mean absolute error (MAE) and Wasserstein-1 distance.

# BAM vs BAM
samsampleX stats \
    --a template.bam \
    --b sampled.bam \
    --region chr1:1000-2000

# BED vs BAM (e.g. combined cohort template against sampled output)
samsampleX stats \
    --a template.bed \
    --b sampled.bam \
    --region chr1:1000-2000
Option Description Default
--a FILE First input — BAM or BED file (reference) (required) -
--b FILE Second input — BAM or BED file (comparison) (required) -
--region REGION Target region, samtools-style (required) -

Example

Example plot results

The following commands showcase an example workflow of a short, arbitrary region on chromosome 21. Three 1000 Genomes Project 30X WGS samples are downloaded and mapped to a template, then used to downsample a GIAB 300X WGS sample in the same region. The results are finally displayed on a plot.

cd examples/

# Download reference genome (GRCh38)
wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa

# Download three first three 1K Genomes 30X WGS samples from
# https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/1000G_2504_high_coverage.sequence.index
wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239480/NA12718.final.cram -O NA12718.cram && samtools index NA12718.cram
wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239481/NA12748.final.cram -O NA12748.cram && samtools index NA12748.cram
wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239482/NA12775.final.cram -O NA12775.cram && samtools index NA12775.cram

# Convert to BAM, restrict to target region and index
samtools view NA12718.cram chr21:10000000-10010000 -b -o NA12718.bam -T GRCh38_full_analysis_set_plus_decoy_hla.fa && samtools index NA12718.bam
samtools view NA12748.cram chr21:10000000-10010000 -b -o NA12748.bam -T GRCh38_full_analysis_set_plus_decoy_hla.fa && samtools index NA12748.bam
samtools view NA12775.cram chr21:10000000-10010000 -b -o NA12775.bam -T GRCh38_full_analysis_set_plus_decoy_hla.fa && samtools index NA12775.bam

# Run samsampleX workflow
samsampleX map \
    --template-bam NA12718.bam NA12748.bam NA12775.bam \
    --region chr21:10000000-10010000 \
    --mode mean \
    --collapse 0 \
    --out-bed template.bed
# template.bed should match example-template.bed

# Source BAM+index is provided in the examples directory, created by subsetting to target region from
# https://ftp.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/NIST_HiSeq_HG002_Homogeneity-10953946/NHGRI_Illumina300X_AJtrio_novoalign_bams/HG002.GRCh38.300x.bam
samsampleX sample \
    --source-bam HG002.GRCh38.300x.chr21:10000000-10010000.bam \
    --template-bed template.bed \
    --region chr21:10000000-10010000 \
    --seed 42 \
    --out-bam sampled.bam

samtools index sampled.bam

samsampleX plot \
    --source-bam HG002.GRCh38.300x.chr21:10000000-10010000.bam \
    --template-bed template.bed \
    --out-bam sampled.bam \
    --region chr21:10000000-10010000 \
    --out-png plot.png
# plot.png should match example-plot.png

Testing

A pytest test suite is available. Run with the -v flag for a detailed report.

pytest -v

Algorithm rundown

Mapping

  1. Parse target region from first BAM header
  2. Compute per-position depth of coverage for each BAM over the region
  3. If multiple BAMs: combine depths per-position using --mode (min, mean, median, max, random)
  4. Optionally collapse consecutive similar depths (--collapse)
  5. Write to BED4 format (chrom, start, end, depth columns)

Sampling

  1. Uniform mode (--uniform FRACTION): Skip template downsampling. For each read, hash the read name with xxHash32 to get $f_{read} \in [0, 1)$; keep if $f_{read} < FRACTION$. Deterministic and position-independent.
  2. Depth-based mode: Load template depths from BED file(s); if multiple templates are provided, combine them per-position using the selected --mode
  3. Compute source depths from BAM
  4. Calculate per-position sampling coefficient: $ratio(i) = \min(1,; depth_{template}(i) ;/; depth_{source}(i))$
    • Positions where the template depth meets or exceeds the source depth get coefficient 1.0 (keep all reads)
    • Positions with zero source depth get coefficient 0.0
  5. Build a cumulative sum of the coefficient array for O(1) range queries
  6. For each read in the source BAM:
    • Hash read name with xxHash32 to produce a deterministic fraction $f_{read} \in [0, 1)$
    • Summarise the coefficient over the read's covered positions using --stat (min, mean, median, max, random; default mean via cumsum for mean). random picks one overlap ratio from a deterministic index (read span + seed).
    • Keep the read if $f_{read} < ratio_{read}$

Metrics

Metric Significance
Mean Absolute Error Average absolute per-base depth difference between the two BAMs
Wasserstein-1 Distance L1 distance between empirical CDFs of depth (scales with region length)

Benchmarking

Benchmarking is done by a snakemake workflow in the benchmarks directory, and thus snakemake should be installed beforehand (for HPC systems, also install snakemake-executor-plugin-slurm or other plugin compatible with your system type).

An Apptainer container definition bench.def that contains installs for GATK, samtools, sambamba and samsampleX is included. Build this container using apptainer build bench.sif bench.def before running the workflow.

Configure the benchmarking parameters in config.yaml in the same directory: copy and rename an existing chunk with all parameters and populate the values. All input files are expected to be found in the same directory as config.yaml, BAM files should be indexed using samtools index.

# config.yaml
benchmarks:             # all chunks should be children of this header
  wgs-chr21:            # arbitrary name for benchmarking instance, parameters will be children
    chr: "chr21"        # specify contig
    start: 1            # region start coordinate
    end: 46709982       # region end coordinate
    seed: 42            # random seed (base)
    n_replicates: 1     # replicate count, will affect seed 
                        # (e.g. seed=42, n=3 will use seeds 43, 44, 45)
    collapse: 0         # define smoothing during mapping step
    templates:          # specify files to use as templates in sampling
      - "template.bam"  # all files must be in the benchmarks directory

    mode: "mean"        # how to determine per-position template depths from multiple template files
    source: "source.bam" # specify file to downsample

    coefficient: 0.1    # coefficient provided to GATK, samtools, sambamba

    cpu: 2              # specify hardware resource (used by all steps)
    mem_mb: 16384
    time: "10:00"

When executing the workflow, navigate to the benchmarks directory and make sure to use the following arguments:

snakemake -p --use-apptainer --apptainer-args '--bind $(pwd)'

A directory for all intermediate files will be created for each chunk defined in config.yaml and the final benchmark results will be made available in the benchmarks directory as benchmark-{chunk_name}.tsv.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

samsamplex-0.1.0.tar.gz (26.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

samsamplex-0.1.0-py3-none-any.whl (23.6 kB view details)

Uploaded Python 3

File details

Details for the file samsamplex-0.1.0.tar.gz.

File metadata

  • Download URL: samsamplex-0.1.0.tar.gz
  • Upload date:
  • Size: 26.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for samsamplex-0.1.0.tar.gz
Algorithm Hash digest
SHA256 39f81590fc02b7a4a8eaa8fa97cc14d96db7f3f40bc7a2c3ea01714caab84f19
MD5 200a9d99b580da01a126ea790b99ec5c
BLAKE2b-256 208f5e2250c07fc076ba099405269f6b68db09234648788535f25c0f9d63f4ed

See more details on using hashes here.

Provenance

The following attestation bundles were made for samsamplex-0.1.0.tar.gz:

Publisher: publish-pypi.yml on sdemiriz/samsampleX

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file samsamplex-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: samsamplex-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 23.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for samsamplex-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 8bff6e48a133f4c814df87922aca55a176c2d2b24178706129dd6472c5894935
MD5 bb4e4bb6cfb67d1335d142aaf51372dd
BLAKE2b-256 ca853961bb0f966dc3bb8d2a4e0b8ba39e398d9849afc4142294156ea4953a17

See more details on using hashes here.

Provenance

The following attestation bundles were made for samsamplex-0.1.0-py3-none-any.whl:

Publisher: publish-pypi.yml on sdemiriz/samsampleX

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page