Comprehensive BAM file deduplication that automatically handles multiple library schemas

MarkDup

A comprehensive Python tool for deduplicating BAM files with automatic UMI detection and intelligent UMI-based or coordinate-based clustering.

⚠️ Early Development Stage: This tool is currently in alpha development. While functional, it may have bugs and the API may change. Please report any issues you encounter.

🚀 Features

  • 🔬 UMI-based deduplication with intelligent extraction from query names or BAM tags
  • 📍 Coordinate-based deduplication for files without UMIs
  • 🧬 Biological positioning for strand-aware clustering (start-only, end-only, or full fragment)
  • 🔄 Auto-detection of UMI presence and format
  • 🧬 Strand awareness for forward/reverse strand reads
  • 📏 CIGAR handling for reads with indels and complex alignments
  • ⚖️ Frequency balancing to prevent over-clustering of high-frequency UMIs
  • 🎯 Advanced clustering with edit distance and frequency-aware algorithms
  • 🔧 Quality selection with multiple metrics and automatic fallback
  • ⚡ Parallelized processing for multi-core performance
  • 📊 Comprehensive statistics and progress tracking

📦 Installation

From PyPI (Recommended)

pip install markdup

From Source

git clone https://github.com/y9c/markdup.git
cd markdup
pip install .

Using uv (Development)

git clone https://github.com/y9c/markdup.git
cd markdup
uv sync

🚀 Quick Start

Automatic UMI Detection and Processing

# Tool automatically detects UMIs and chooses appropriate method
markdup input.bam output.bam

# With multiple threads
markdup input.bam output.bam --threads 8

# Keep duplicates and mark them
markdup input.bam output.bam --keep-duplicates

Explicit Method Selection

# Default: Auto-detect UMI presence and use appropriate method
markdup input.bam output.bam

# Force coordinate-based deduplication (ignore UMIs)
markdup input.bam output.bam --no-umi

Advanced Positioning Options

# Start-only positioning (e.g., for ChIP-seq)
markdup input.bam output.bam --start-only

# End-only positioning (e.g., for reverse-complemented reads)
markdup input.bam output.bam --end-only

# Full fragment positioning (default, handles both start and end)
markdup input.bam output.bam

UMI Clustering Tuning

# Custom edit distance threshold
markdup input.bam output.bam --min-edit-dist-frac 0.17

# Frequency-aware clustering to prevent over-merging
markdup input.bam output.bam --min-frequency-ratio 0.1

# Custom UMI separator
markdup input.bam output.bam --umi-sep ":"

# Extract UMIs from BAM tags instead of query names
markdup input.bam output.bam --umi-tag UB
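
To illustrate what a separator such as `--umi-sep` implies, the sketch below splits a query name at the last occurrence of the separator and treats the trailing field as the UMI. The helper name `extract_umi_from_name` is hypothetical, and the rule "the UMI is the last separator-delimited field" is an assumption based on the read names shown in this README, not a description of MarkDup's internals.

```python
from typing import Optional


def extract_umi_from_name(query_name: str, sep: str = "_") -> Optional[str]:
    """Return the trailing field of a read name, or None if it cannot be split.

    Hypothetical helper: assumes the UMI is the last `sep`-delimited field.
    """
    head, found, tail = query_name.rpartition(sep)
    # No separator, or an empty field on either side: treat as "no UMI".
    if not found or not head or not tail:
        return None
    return tail


if __name__ == "__main__":
    for name, sep in [("read1_ACGTACGT", "_"), ("read1:ACGTACGT", ":"), ("read1", "_")]:
        print(name, "->", extract_umi_from_name(name, sep))
```

Splitting from the right keeps read names that contain the separator elsewhere (for example `sample_1_ACGT`) from being cut at the wrong place.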

📋 Command Line Interface

Global Options

| Option | Description | Default |
|---|---|---|
| --help | Show help message | - |
| --version | Show version information | - |

Input/Output Options

| Option | Description | Default |
|---|---|---|
| INPUT_BAM | Input BAM file path | Required |
| OUTPUT_BAM | Output BAM file path | Required |
| --force | Overwrite output file if it exists | False |

Deduplication (UMI) Method

| Option | Description | Default |
|---|---|---|
| --no-umi | Force coordinate-based deduplication (ignore detected UMIs) | Auto-detect |
| --umi-sep | Separator for extracting UMIs from read names | _ |
| --umi-tag | BAM tag name for UMI extraction (e.g., 'UB') | None |
| --min-edit-dist-frac | Minimum UMI edit distance as fraction of UMI length | 0.1 |
| --min-frequency-ratio | Minimum frequency ratio for UMI clustering | 0.1 |

Positioning Options

| Option | Description | Default |
|---|---|---|
| --start-only | Group reads by start position only | False |
| --end-only | Group reads by end position only | False |

Filtering Options

| Option | Description | Default |
|---|---|---|
| --fragment-paired | Keep only fragments with both reads present | False |
| --fragment-mapped | Keep only fragments where both reads are mapped | False |

Quality Selection

| Option | Description | Default |
|---|---|---|
| --best-read-by | Select best read by: mapq, avg_base_q | avg_base_q |
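
The two metrics can be pictured with a small sketch. The `Read` type and `best_read` function below are hypothetical and are not MarkDup's API. The fallback rule (use mapping quality when any read in the group lacks base qualities) is an assumption made for illustration, since this README does not spell out how the automatic fallback works.

```python
from statistics import mean
from typing import List, NamedTuple, Optional, Sequence


class Read(NamedTuple):
    name: str
    mapq: int
    base_quals: Optional[Sequence[int]]  # Phred scores; None if unavailable


def best_read(reads: List[Read], metric: str = "avg_base_q") -> Read:
    """Pick one representative read from a duplicate group."""
    if metric not in ("mapq", "avg_base_q"):
        raise ValueError(f"unknown metric: {metric}")
    # Assumed fallback: only rank by average base quality when every read has it.
    if metric == "avg_base_q" and all(r.base_quals for r in reads):
        return max(reads, key=lambda r: mean(r.base_quals))
    return max(reads, key=lambda r: r.mapq)


if __name__ == "__main__":
    group = [Read("a", 60, [30, 30, 30]), Read("b", 50, [40, 40, 40])]
    print(best_read(group).name, best_read(group, "mapq").name)
```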

Processing Options

| Option | Description | Default |
|---|---|---|
| --threads | Number of threads for parallel processing | 1 |
| --window-size | Size of genomic windows for processing | 100000 |
| --keep-duplicates | Keep duplicate reads and mark them | False |
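
As a rough picture of what a genomic window size means, the sketch below cuts a contig into consecutive half-open windows that could be handed to separate workers. The function `iter_windows` is hypothetical. How MarkDup actually partitions the genome, and how it treats fragments that span a window boundary, is not described in this README.

```python
from typing import Iterator, Tuple


def iter_windows(contig_length: int, window_size: int = 100_000) -> Iterator[Tuple[int, int]]:
    """Yield 0-based half-open (start, end) windows covering a contig."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    for start in range(0, contig_length, window_size):
        # The last window is truncated at the contig end.
        yield start, min(start + window_size, contig_length)


if __name__ == "__main__":
    print(list(iter_windows(250_000)))
```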

🧬 Algorithm Details

Automatic Condition Detection

The tool automatically detects and handles:

  1. UMI Presence: Scans read names for UMI patterns
  2. Read Type: Single-end vs. paired-end detection
  3. Strand Orientation: Forward vs. reverse strand handling
  4. CIGAR Complexity: Indel and complex alignment handling
  5. Quality Metrics: Available quality scores and selection criteria

Biological Positioning

MarkDup uses strand-aware positioning to ensure proper grouping regardless of read orientation:

  • Forward strand: Biological start = reference start, Biological end = reference end
  • Reverse strand: Biological start = reference end, Biological end = reference start
  • Strand-aware clustering: Ensures proper grouping regardless of strand orientation
  • CIGAR-aware positioning: Properly handles indels and complex alignments
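
The strand rule above can be sketched in a few lines. This is a minimal illustration, not MarkDup's code: the function names are hypothetical, coordinates are treated as 1-based inclusive as in SAM, and the reference end is derived from the CIGAR operations that consume reference bases (M, D, N, =, X). Whether MarkDup also adjusts positions for soft-clipped bases is not stated in this README, so the sketch does not.

```python
import re
from typing import Tuple

# CIGAR operations that consume reference bases, per the SAM specification.
_REF_CONSUMING = set("MDN=X")
_CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_length(cigar: str) -> int:
    """Number of reference bases covered by an alignment."""
    return sum(int(n) for n, op in _CIGAR_RE.findall(cigar) if op in _REF_CONSUMING)


def biological_span(pos: int, cigar: str, is_reverse: bool) -> Tuple[int, int]:
    """Return (biological_start, biological_end) for one read."""
    ref_start = pos
    ref_end = pos + reference_length(cigar) - 1
    # On the reverse strand the fragment begins at the reference end.
    return (ref_end, ref_start) if is_reverse else (ref_start, ref_end)


if __name__ == "__main__":
    print(biological_span(1001, "50M", is_reverse=False))
    print(biological_span(1001, "50M", is_reverse=True))
```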

UMI-based Deduplication

  1. Fragment Creation: Reads are grouped into fragments (single-end or paired-end)
  2. Position Grouping: Fragments are grouped by biological position and strand
  3. UMI Clustering: Within each position group, UMIs are clustered using:
    • Exact matching for identical UMIs
    • Edit distance clustering for similar UMIs
    • Frequency-aware clustering to prevent unrealistic merging
  4. Quality Selection: The highest quality read from each cluster is selected
  5. Output Generation: Selected reads are written with comprehensive cluster information
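
Step 3 can be pictured with a greedy clustering sketch. Several things here are assumptions rather than MarkDup's implementation: Hamming distance stands in for edit distance (so UMIs must have equal length), a UMI joins a more abundant representative only if it is both close enough and rare enough relative to it, and the parameter names `dist_frac` and `freq_ratio` are hypothetical stand-ins loosely modeled on `--min-edit-dist-frac` and `--min-frequency-ratio`, whose exact semantics are not documented here.

```python
from collections import Counter
from typing import Dict, List


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length UMIs."""
    if len(a) != len(b):
        raise ValueError("UMIs must have equal length")
    return sum(x != y for x, y in zip(a, b))


def cluster_umis(counts: Counter, dist_frac: float = 0.1,
                 freq_ratio: float = 0.1) -> Dict[str, List[str]]:
    """Map each representative UMI to the UMIs merged into it."""
    clusters: Dict[str, List[str]] = {}
    # Visit UMIs from most to least frequent so abundant UMIs become representatives.
    for umi, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        max_dist = max(1, int(dist_frac * len(umi)))
        for rep in clusters:
            close = hamming(umi, rep) <= max_dist
            rare = n <= freq_ratio * counts[rep]  # assumed frequency criterion
            if close and rare:
                clusters[rep].append(umi)
                break
        else:
            # No suitable representative found: start a new cluster.
            clusters[umi] = [umi]
    return clusters


if __name__ == "__main__":
    observed = Counter({"ACGTACGT": 100, "ACGTACGA": 3, "ACGTACGG": 60, "TTTTCCCC": 5})
    for rep, members in cluster_umis(observed).items():
        print(rep, members)
```

In this example `ACGTACGA` (3 reads, one mismatch from a 100-read UMI) is absorbed as a likely sequencing error, while `ACGTACGG` (60 reads) stays separate despite being one mismatch away, which is the kind of over-merging a frequency criterion is meant to prevent.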

Coordinate-based Deduplication

  1. Fragment Creation: Reads are grouped into fragments
  2. Position Grouping: Fragments are grouped by genomic coordinates
  3. Quality Selection: The highest quality read from each group is selected
  4. Output Generation: Selected reads are written
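
The coordinate-based path reduces to "group by a position key, keep the best member". The sketch below is an illustration under stated assumptions, not MarkDup's code: the `Fragment` type, the `mode` values, and the inclusion of strand in the grouping key are all hypothetical choices, with `mode` loosely mirroring the start-only, end-only, and full-fragment positioning options.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Fragment(NamedTuple):
    name: str
    chrom: str
    start: int      # biological start
    end: int        # biological end
    strand: str
    quality: float  # e.g. average base quality or MAPQ


def dedup_by_coordinates(fragments: List[Fragment], mode: str = "full") -> List[Fragment]:
    """Keep the highest-quality fragment per position group."""
    groups: Dict[Tuple, List[Fragment]] = defaultdict(list)
    for f in fragments:
        if mode == "start":
            key = (f.chrom, f.start, f.strand)
        elif mode == "end":
            key = (f.chrom, f.end, f.strand)
        elif mode == "full":
            key = (f.chrom, f.start, f.end, f.strand)
        else:
            raise ValueError(f"unknown mode: {mode}")
        groups[key].append(f)
    # One survivor per group, in order of first appearance.
    return [max(group, key=lambda f: f.quality) for group in groups.values()]


if __name__ == "__main__":
    frags = [
        Fragment("r1", "chr1", 1001, 1050, "+", 30.0),
        Fragment("r2", "chr1", 1001, 1050, "+", 38.0),
        Fragment("r3", "chr1", 1001, 1080, "+", 35.0),
    ]
    print([f.name for f in dedup_by_coordinates(frags)])
    print([f.name for f in dedup_by_coordinates(frags, mode="start")])
```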

📊 Output Format

BAM Tags

| Tag | Description |
|---|---|
| cn | Cluster name with genomic coordinates and UMI (format: chr:start-end:strand:UMI) |
| cs | Cluster size (number of reads in cluster) |

Example Output

read1_UMI123    0    chr1    1001    60    50M    *    0    0    ATGC...    IIII...    cn:Z:chr1:1001-1050:+:UMI123    cs:i:3
read2_UMI123    1024  chr1    1001    50    50M    *    0    0    ATGC...    IIII...    cn:Z:chr1:1001-1050:+:UMI123    cs:i:3
read3_UMI123    1024  chr1    1001    45    50M    *    0    0    ATGC...    IIII...    cn:Z:chr1:1001-1050:+:UMI123    cs:i:3
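
For downstream scripts, the `cn` value can be split back into its parts. The sketch below assumes the four-field format given in the table above and that the UMI itself contains no colon; the names `ClusterName` and `parse_cluster_name` are hypothetical. What the tag looks like for coordinate-based runs without a UMI is not shown in this README, so that case is not handled.

```python
from typing import NamedTuple


class ClusterName(NamedTuple):
    chrom: str
    start: int
    end: int
    strand: str
    umi: str


def parse_cluster_name(cn: str) -> ClusterName:
    """Parse a 'chr:start-end:strand:UMI' string into typed fields."""
    # Split from the right so contig names that contain ':' stay intact.
    chrom, span, strand, umi = cn.rsplit(":", 3)
    start, end = span.split("-")
    return ClusterName(chrom, int(start), int(end), strand, umi)


if __name__ == "__main__":
    print(parse_cluster_name("chr1:1001-1050:+:UMI123"))
```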
