Skip to main content

Comprehensive BAM file deduplication that automatically handles multiple sequencing conditions and edge cases

Project description

MarkDup

Python Version License: MIT Development Status

A comprehensive Python tool for deduplicating BAM files that automatically handles multiple sequencing conditions and edge cases with intelligent UMI clustering and biological positioning.

⚠️ Early Development Stage: This tool is currently in alpha development. While functional, it may have bugs and the API may change. Please report any issues you encounter.

🎯 Key Differentiators

Unlike other deduplication tools, MarkDup automatically handles multiple sequencing conditions and edge cases:

  • 🔬 Multi-condition Support: Works with or without UMIs, single-end or paired-end reads
  • 🧬 Biological Positioning: Automatically handles strand-aware positioning (start-only, end-only, or full fragment)
  • 🎯 Intelligent Clustering: Frequency-aware UMI clustering prevents unrealistic merging
  • ⚡ Edge Case Handling: Automatically detects and handles various sequencing artifacts
  • 🔧 Adaptive Processing: Automatically adjusts algorithms based on input data characteristics

🚀 Features

Core Capabilities

  • 🔬 UMI-based deduplication with quality-based read selection
  • 📍 Coordinate-based deduplication for files without UMIs
  • 🧬 Biological positioning for strand-aware clustering
  • ⚡ Process-based parallelism for multi-core performance
  • 🎯 Advanced clustering with edit distance and frequency-aware algorithms
  • 📊 Comprehensive statistics and progress tracking

Automatic Edge Case Handling

  • 🔄 UMI Detection: Automatically detects UMI presence and format
  • 🧬 Strand Awareness: Automatically handles forward/reverse strand reads
  • 📏 CIGAR Handling: Properly processes reads with indels and complex CIGAR strings
  • 🎯 Position Grouping: Intelligent grouping based on biological vs. reference coordinates
  • ⚖️ Frequency Balancing: Prevents over-clustering of high-frequency UMIs
  • 🔧 Quality Selection: Multiple quality metrics with automatic fallback

📦 Installation

From PyPI (Recommended)

pip install markdup

From Source

git clone https://github.com/y9c/markdup.git
cd markdup
pip install .

Using uv (Development)

git clone https://github.com/y9c/markdup.git
cd markdup
uv sync

🚀 Quick Start

Automatic UMI Detection and Processing

# Tool automatically detects UMIs and chooses appropriate method
markdup input.bam output.bam

# With multiple threads
markdup input.bam output.bam --threads 8

# Keep duplicates and mark them
markdup input.bam output.bam --keep-duplicates

Explicit Method Selection

# Force UMI-based deduplication
markdup input.bam output.bam --method umi

# Force coordinate-based deduplication (no UMIs)
markdup input.bam output.bam --method coordinate

Advanced Positioning Options

# Start-only positioning (e.g., for ChIP-seq)
markdup input.bam output.bam --start-only

# End-only positioning (e.g., for reverse-complemented reads)
markdup input.bam output.bam --end-only

# Full fragment positioning (default, handles both start and end)
markdup input.bam output.bam

UMI Clustering Tuning

# Custom edit distance threshold
markdup input.bam output.bam --min-edit-dist-frac 0.17

# Frequency-aware clustering to prevent over-merging
markdup input.bam output.bam --min-frequency-ratio 0.1

# Custom UMI separator
markdup input.bam output.bam --umi-separator ":"

📋 Command Line Interface

Global Options

Option Description Default
--help Show help message -
--version Show version information -

Input/Output Options

Option Description Default
INPUT_BAM Input BAM file path Required
OUTPUT_BAM Output BAM file path Required
--force Overwrite output file if it exists False

Deduplication Method

Option Description Default
--method Deduplication method: umi or coordinate umi

UMI Options

Option Description Default
--umi-separator Separator for extracting UMIs from read names _
--min-edit-dist-frac Minimum UMI edit distance as fraction of UMI length 0.1
--min-frequency-ratio Minimum frequency ratio for UMI clustering 0.1

Positioning Options

Option Description Default
--start-only Group reads by start position only False
--end-only Group reads by end position only False

Quality Selection

Option Description Default
--best-read-by Select best read by: mapq, avg_base_q avg_base_q

Processing Options

Option Description Default
--threads Number of threads for parallel processing 1
--window-size Size of genomic windows for processing 100000
--keep-duplicates Keep duplicate reads and mark them False

🧬 Algorithm Details

Automatic Condition Detection

The tool automatically detects and handles:

  1. UMI Presence: Scans read names for UMI patterns
  2. Read Type: Single-end vs. paired-end detection
  3. Strand Orientation: Forward vs. reverse strand handling
  4. CIGAR Complexity: Indel and complex alignment handling
  5. Quality Metrics: Available quality scores and selection criteria

UMI-based Deduplication

  1. Fragment Creation: Reads are grouped into fragments (single-end or paired-end)
  2. Biological Positioning: Fragments are positioned using strand-aware coordinates
  3. Position Grouping: Fragments are grouped by biological position and strand
  4. UMI Clustering: Within each position group, UMIs are clustered using:
    • Exact matching for identical UMIs
    • Edit distance clustering for similar UMIs
    • Frequency-aware clustering to prevent unrealistic merging
  5. Quality Selection: The highest quality read from each cluster is selected
  6. Output Generation: Selected reads are written with comprehensive cluster information

Coordinate-based Deduplication

  1. Fragment Creation: Reads are grouped into fragments
  2. Position Grouping: Fragments are grouped by genomic coordinates
  3. Quality Selection: The highest quality read from each group is selected
  4. Output Generation: Selected reads are written

Biological Positioning

  • Forward strand: Biological start = reference start, Biological end = reference end
  • Reverse strand: Biological start = reference end, Biological end = reference start
  • Strand-aware clustering: Ensures proper grouping regardless of strand orientation
  • CIGAR-aware positioning: Properly handles indels and complex alignments

📊 Output Format

BAM Tags

Tag Description
MI Molecular identifier (cluster ID)
RX Consensus UMI sequence
cs Cluster size (number of reads in cluster)
su Strand information
cn Cluster name with genomic coordinates
is_duplicate Duplicate flag (when using --keep-duplicates)

Example Output

read1_UMI123    0    chr1    1001    60    50M    *    0    0    ATGC...    IIII...    MI:Z:1    RX:Z:UMI123    cs:i:3    su:Z:+    cn:Z:chr1:1001-1050:+:UMI123
read2_UMI123    1024  chr1    1001    50    50M    *    0    0    ATGC...    IIII...    MI:Z:1    RX:Z:UMI123    cs:i:3    su:Z:+    cn:Z:chr1:1001-1050:+:UMI123
read3_UMI123    1024  chr1    1001    45    50M    *    0    0    ATGC...    IIII...    MI:Z:1    RX:Z:UMI123    cs:i:3    su:Z:+    cn:Z:chr1:1001-1050:+:UMI123

🎯 Use Cases

Single-cell RNA-seq

# Automatic UMI detection and processing
markdup input.bam output.bam --threads 16

# Custom edit distance for 6-nt UMIs (1 mismatch)
markdup input.bam output.bam --min-edit-dist-frac 0.17 --threads 16

ATAC-seq

# Coordinate-based deduplication with start-only positioning
markdup input.bam output.bam --method coordinate --start-only --threads 8

ChIP-seq

# Coordinate-based deduplication with start-only positioning
markdup input.bam output.bam --method coordinate --start-only --threads 8

WGS/WES

# Coordinate-based deduplication for whole genome/exome sequencing
markdup input.bam output.bam --method coordinate --threads 8

⚡ Performance

Benchmarks

  • Processing Speed: ~100,000 reads/second on 8 cores
  • Memory Usage: Efficient window-based processing
  • Scalability: Linear scaling with thread count
  • File Size: Handles files up to 100GB+ efficiently

Optimization Features

  • Process-based parallelism: Overcomes Python GIL limitations
  • Window-based processing: Memory-efficient for large files
  • Optimized data structures: Fast UMI clustering algorithms
  • I/O optimization: Minimized file access bottlenecks

🧪 Testing

Run Tests

# Run all tests
pytest test/ -v

# Run with coverage
pytest test/ --cov=markdup --cov-report=html

# Run specific test categories
pytest test/test_utils_comprehensive.py -v
pytest test/test_deduplication_comprehensive.py -v

Test Coverage

  • 84+ test cases covering all functionality
  • Unit tests for individual components
  • Integration tests for end-to-end workflows
  • Edge case testing for robust error handling

📚 Documentation

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Setup

git clone https://github.com/y9c/markdup.git
cd markdup
uv sync --extra dev
pre-commit install

Code Quality

  • Ruff: Code formatting and linting
  • Pytest: Comprehensive testing
  • Type hints: Full type annotation coverage
  • Documentation: Comprehensive docstrings

🐛 Known Issues

  • Early Development: Some edge cases may not be fully handled
  • Memory Usage: Large files may require significant memory
  • Error Handling: Some error messages could be more descriptive

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • pysam: BAM file handling
  • rich: Beautiful terminal output
  • jellyfish: String distance calculations
  • click: Command-line interface

📞 Support


Made with ❤️ for the bioinformatics community

Note: This is an early development version. Please report any bugs or issues you encounter.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

markdup-0.0.1.tar.gz (38.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

markdup-0.0.1-py3-none-any.whl (28.1 kB view details)

Uploaded Python 3

File details

Details for the file markdup-0.0.1.tar.gz.

File metadata

  • Download URL: markdup-0.0.1.tar.gz
  • Upload date:
  • Size: 38.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for markdup-0.0.1.tar.gz
Algorithm Hash digest
SHA256 bcdde50c060fd9f1c3c4a97e79dc6a604b518401f58a0dd7c241bba250f09c2d
MD5 486a166fae3e80324a5c98bdcbeca288
BLAKE2b-256 dc20b995d959d138f84d809882c8f01cbf0b89d9fb3ed0e422c89982c318cc55

See more details on using hashes here.

Provenance

The following attestation bundles were made for markdup-0.0.1.tar.gz:

Publisher: publish.yml on y9c/markdup

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file markdup-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: markdup-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 28.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for markdup-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 af608fa6f570561fd954fb68e5bb5fabf8aab7cfff25813df3ed1f2cfb4db031
MD5 0e16bead9983e8034d8b78d123e5aedb
BLAKE2b-256 c4bf1c805cc2a1e1bbf346de85f73904eae4f481eda1c9d6eb1d2c831c4f6477

See more details on using hashes here.

Provenance

The following attestation bundles were made for markdup-0.0.1-py3-none-any.whl:

Publisher: publish.yml on y9c/markdup

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page