Comprehensive BAM file deduplication that automatically handles multiple library schemas
MarkDup
A comprehensive Python tool for deduplicating BAM files with automatic UMI detection and intelligent UMI-based or coordinate-based clustering.
⚠️ Early Development Stage: This tool is currently in alpha development. While functional, it may have bugs and the API may change. Please report any issues you encounter.
🚀 Features
- 🔬 UMI-based deduplication with intelligent extraction from query names or BAM tags
- 📍 Coordinate-based deduplication for files without UMIs
- 🧬 Biological positioning for strand-aware clustering (start-only, end-only, or full fragment)
- 🔄 Auto-detection of UMI presence and format
- 🧬 Strand awareness for forward/reverse strand reads
- 📏 CIGAR handling for reads with indels and complex alignments
- ⚖️ Frequency balancing to prevent over-clustering of high-frequency UMIs
- 🎯 Advanced clustering with edit distance and frequency-aware algorithms
- 🔧 Quality selection with multiple metrics and automatic fallback
- ⚡ Parallelized processing for multi-core performance
- 📊 Comprehensive statistics and progress tracking
📦 Installation
From PyPI (Recommended)
pip install markdup
From Source
git clone https://github.com/y9c/markdup.git
cd markdup
pip install .
Using uv (Development)
git clone https://github.com/y9c/markdup.git
cd markdup
uv sync
🚀 Quick Start
Automatic UMI Detection and Processing
# Tool automatically detects UMIs and chooses appropriate method
markdup input.bam output.bam
# With multiple threads
markdup input.bam output.bam --threads 8
# Keep duplicates and mark them
markdup input.bam output.bam --keep-duplicates
Explicit Method Selection
# Default: Auto-detect UMI presence and use appropriate method
markdup input.bam output.bam
# Force coordinate-based deduplication (ignore UMIs)
markdup input.bam output.bam --no-umi
Advanced Positioning Options
# Start-only positioning (e.g., for ChIP-seq)
markdup input.bam output.bam --start-only
# End-only positioning (e.g., for reverse-complemented reads)
markdup input.bam output.bam --end-only
# Full fragment positioning (default, handles both start and end)
markdup input.bam output.bam
UMI Clustering Tuning
# Custom edit distance threshold
markdup input.bam output.bam --min-edit-dist-frac 0.17
# Frequency-aware clustering to prevent over-merging
markdup input.bam output.bam --min-frequency-ratio 0.1
# Custom UMI separator
markdup input.bam output.bam --umi-sep ":"
# Extract UMIs from BAM tags instead of query names
markdup input.bam output.bam --umi-tag UB
📋 Command Line Interface
Global Options
| Option | Description | Default |
|---|---|---|
| `--help` | Show help message | - |
| `--version` | Show version information | - |
Input/Output Options
| Option | Description | Default |
|---|---|---|
| `INPUT_BAM` | Input BAM file path | Required |
| `OUTPUT_BAM` | Output BAM file path | Required |
| `--force` | Overwrite output file if it exists | False |
Deduplication (UMI) Method
| Option | Description | Default |
|---|---|---|
| `--no-umi` | Force coordinate-based deduplication (ignore detected UMIs) | Auto-detect |
| `--umi-sep` | Separator for extracting UMIs from read names | `_` |
| `--umi-tag` | BAM tag name for UMI extraction (e.g., 'UB') | None |
| `--min-edit-dist-frac` | Minimum UMI edit distance as fraction of UMI length | 0.1 |
| `--min-frequency-ratio` | Minimum frequency ratio for UMI clustering | 0.1 |
Positioning Options
| Option | Description | Default |
|---|---|---|
| `--start-only` | Group reads by start position only | False |
| `--end-only` | Group reads by end position only | False |
Filtering Options
| Option | Description | Default |
|---|---|---|
| `--fragment-paired` | Keep only fragments with both reads present | False |
| `--fragment-mapped` | Keep only fragments where both reads are mapped | False |
Quality Selection
| Option | Description | Default |
|---|---|---|
| `--best-read-by` | Select best read by: `mapq`, `avg_base_q` | `avg_base_q` |
Processing Options
| Option | Description | Default |
|---|---|---|
| `--threads` | Number of threads for parallel processing | 1 |
| `--window-size` | Size of genomic windows for processing | 100000 |
| `--keep-duplicates` | Keep duplicate reads and mark them | False |
🧬 Algorithm Details
Automatic Condition Detection
The tool automatically detects and handles:
- UMI Presence: Scans read names for UMI patterns
- Read Type: Single-end vs. paired-end detection
- Strand Orientation: Forward vs. reverse strand handling
- CIGAR Complexity: Indel and complex alignment handling
- Quality Metrics: Available quality scores and selection criteria
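As a rough illustration of the UMI presence check, the sketch below pulls a trailing token off each read name and tests whether it looks like a nucleotide barcode. The function names, the `[ACGTN]+` pattern, and the 0.9 threshold are all hypothetical choices made for this example; MarkDup's actual detection logic may differ.

```python
import re
from typing import Iterable, Optional

# Assumption for this sketch: a UMI is a non-empty run of A/C/G/T/N.
UMI_PATTERN = re.compile(r"^[ACGTN]+$")


def extract_umi(query_name: str, sep: str = "_") -> Optional[str]:
    """Return the last separator-delimited token if it looks like a UMI."""
    if sep not in query_name:
        return None
    candidate = query_name.rsplit(sep, 1)[1].upper()
    return candidate if UMI_PATTERN.match(candidate) else None


def looks_umi_tagged(query_names: Iterable[str], sep: str = "_",
                     min_fraction: float = 0.9) -> bool:
    """Guess UMI presence from a sample of read names.

    min_fraction is an illustrative threshold, not a MarkDup setting.
    """
    names = list(query_names)
    if not names:
        return False
    hits = sum(extract_umi(name, sep) is not None for name in names)
    return hits / len(names) >= min_fraction
```

With the default separator, `extract_umi("read1_ACGTAC")` yields `"ACGTAC"`, while a name without a separator, or one whose last token is not a nucleotide string, yields `None`.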
Biological Positioning
MarkDup uses strand-aware positioning to ensure proper grouping regardless of read orientation:
- Forward strand: Biological start = reference start, Biological end = reference end
- Reverse strand: Biological start = reference end, Biological end = reference start
- Strand-aware clustering: Ensures proper grouping regardless of strand orientation
- CIGAR-aware positioning: Properly handles indels and complex alignments
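The strand rules above can be sketched in a few lines of Python. The `Alignment` type and the `position_key` helper are hypothetical stand-ins introduced for this example (they are not part of MarkDup's API), and the coordinates are assumed to be reference positions already derived from the CIGAR string.

```python
from typing import NamedTuple, Tuple


class Alignment(NamedTuple):
    """Hypothetical minimal stand-in for an aligned read."""
    chrom: str
    ref_start: int
    ref_end: int
    is_reverse: bool


def biological_span(aln: Alignment) -> Tuple[int, int]:
    """Return (biological start, biological end), swapping on the reverse strand."""
    if aln.is_reverse:
        return aln.ref_end, aln.ref_start
    return aln.ref_start, aln.ref_end


def position_key(aln: Alignment, mode: str = "full") -> tuple:
    """Build a grouping key; mode is 'start', 'end', or 'full' (illustrative names)."""
    start, end = biological_span(aln)
    strand = "-" if aln.is_reverse else "+"
    if mode == "start":
        return (aln.chrom, start, strand)
    if mode == "end":
        return (aln.chrom, end, strand)
    if mode == "full":
        return (aln.chrom, start, end, strand)
    raise ValueError(f"unknown mode: {mode!r}")
```

Two reverse-strand reads that end at the same reference coordinate share a biological start, so in start-only mode they fall into the same group even if their reference starts differ.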
UMI-based Deduplication
- Fragment Creation: Reads are grouped into fragments (single-end or paired-end)
- Position Grouping: Fragments are grouped by biological position and strand
- UMI Clustering: Within each position group, UMIs are clustered using:
  - Exact matching for identical UMIs
  - Edit distance clustering for similar UMIs
  - Frequency-aware clustering to prevent unrealistic merging
- Quality Selection: The highest quality read from each cluster is selected
- Output Generation: Selected reads are written with comprehensive cluster information
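The UMI clustering step can be illustrated with a minimal sketch for a single position group. It assumes Hamming distance as the edit distance and assumes that a rare UMI is absorbed into a more abundant one only when both thresholds pass; the parameter names `max_dist_frac` and `max_count_ratio` are hypothetical and only loosely mirror `--min-edit-dist-frac` and `--min-frequency-ratio`, whose exact semantics in MarkDup may differ.

```python
from collections import Counter
from typing import Dict, Iterable


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("UMIs must have equal length")
    return sum(x != y for x, y in zip(a, b))


def cluster_umis(umis: Iterable[str], max_dist_frac: float = 0.1,
                 max_count_ratio: float = 0.1) -> Dict[str, str]:
    """Map every observed UMI to the representative UMI of its cluster."""
    counts = Counter(umis)  # exact matching: identical UMIs collapse here
    # Most abundant first; ties are broken alphabetically for determinism.
    ordered = sorted(counts, key=lambda u: (-counts[u], u))
    parent: Dict[str, str] = {}
    for umi in ordered:
        parent[umi] = umi
        for rep in ordered:
            if rep == umi:
                break  # only UMIs ranked earlier can absorb this one
            if parent[rep] != rep or len(rep) != len(umi):
                continue  # skip absorbed UMIs and length mismatches
            close = hamming(umi, rep) / len(umi) <= max_dist_frac
            rare = counts[umi] <= max_count_ratio * counts[rep]
            if close and rare:
                parent[umi] = rep
                break
    return parent
```

Under these assumptions, a UMI seen once that differs by one base from a 10-base UMI seen 20 times is merged into it, whereas two equally abundant UMIs one base apart stay separate, which is the over-merging the frequency check is meant to prevent.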
Coordinate-based Deduplication
- Fragment Creation: Reads are grouped into fragments
- Position Grouping: Fragments are grouped by genomic coordinates
- Quality Selection: The highest quality read from each group is selected
- Output Generation: Selected reads are written
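A minimal sketch of the coordinate-based path follows: group reads by position and strand, then keep the best-scoring read in each group. The `Read` type and `dedup_by_coordinate` function are hypothetical and exist only for this example; the two scoring modes borrow the `mapq` and `avg_base_q` names from `--best-read-by`, but the tie-breaking and fallback behaviour shown here are assumptions.

```python
from collections import defaultdict
from typing import Iterable, List, NamedTuple, Tuple


class Read(NamedTuple):
    """Hypothetical minimal read record for this sketch."""
    name: str
    chrom: str
    start: int
    end: int
    strand: str
    mapq: int
    base_quals: Tuple[int, ...]


def avg_base_q(read: Read) -> float:
    """Mean base quality; reads without qualities score 0."""
    if not read.base_quals:
        return 0.0
    return sum(read.base_quals) / len(read.base_quals)


def dedup_by_coordinate(reads: Iterable[Read],
                        best_read_by: str = "avg_base_q") -> List[Tuple[Read, int]]:
    """Return (kept read, group size) for each coordinate group."""
    if best_read_by == "avg_base_q":
        score = avg_base_q
    elif best_read_by == "mapq":
        score = lambda r: r.mapq
    else:
        raise ValueError(f"unknown metric: {best_read_by!r}")
    groups = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.start, read.end, read.strand)].append(read)
    # max() keeps the first read on ties, so input order acts as a tie-breaker.
    return [(max(group, key=score), len(group)) for group in groups.values()]
```

The group size returned alongside each kept read corresponds conceptually to the cluster size reported in the output.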
📊 Output Format
BAM Tags
| Tag | Description |
|---|---|
| `cn` | Cluster name with genomic coordinates and UMI (format: `chr:start-end:strand:UMI`) |
| `cs` | Cluster size (number of reads in cluster) |
Example Output
read1_UMI123 0 chr1 1001 60 50M * 0 0 ATGC... IIII... cn:Z:chr1:1001-1050:+:UMI123 cs:i:3
read2_UMI123 1024 chr1 1001 50 50M * 0 0 ATGC... IIII... cn:Z:chr1:1001-1050:+:UMI123 cs:i:3
read3_UMI123 1024 chr1 1001 45 50M * 0 0 ATGC... IIII... cn:Z:chr1:1001-1050:+:UMI123 cs:i:3
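For downstream analysis it can be handy to split the `cn` value back into its parts. The helper below is a hypothetical sketch (it is not shipped with MarkDup) and assumes the four-field `chr:start-end:strand:UMI` layout shown above; it splits from the right so that reference names containing colons remain intact.

```python
from typing import NamedTuple


class ClusterName(NamedTuple):
    """Parsed fields of a cn tag value (illustrative type)."""
    chrom: str
    start: int
    end: int
    strand: str
    umi: str


def parse_cluster_name(cn: str) -> ClusterName:
    """Parse 'chr:start-end:strand:UMI'; raises ValueError on other layouts."""
    chrom, span, strand, umi = cn.rsplit(":", 3)
    start, end = span.split("-")
    return ClusterName(chrom, int(start), int(end), strand, umi)
```

For the first record in the example output, `parse_cluster_name("chr1:1001-1050:+:UMI123")` returns the chromosome `chr1`, the span 1001 to 1050, the `+` strand, and the UMI `UMI123`.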
📚 Documentation