Comprehensive BAM file deduplication that automatically handles multiple library schemas
MarkDup
A comprehensive Python tool for deduplicating BAM files that automatically handles multiple library schemas with intelligent UMI-based or coordinate-based clustering.
⚠️ Early Development Stage: This tool is currently in alpha development. While functional, it may have bugs and the API may change. Please report any issues you encounter.
🎯 Key Differentiators
Unlike other deduplication tools, MarkDup automatically handles multiple sequencing conditions and edge cases:
- 🔬 Multi-condition Support: Works with or without UMIs, single-end or paired-end reads
- 🧬 Biological Positioning: Automatically handles strand-aware positioning (start-only, end-only, or full fragment)
- 🎯 Intelligent Clustering: Frequency-aware UMI clustering prevents unrealistic merging
- ⚡ Edge Case Handling: Automatically detects and handles various sequencing artifacts
- 🔧 Adaptive Processing: Automatically adjusts algorithms based on input data characteristics
🚀 Features
Core Capabilities
- 🔬 UMI-based deduplication with intelligent UMI extraction from query names or BAM tags
- 📍 Coordinate-based deduplication for files without UMIs
- 🧬 Biological positioning for strand-aware clustering
- ⚡ Process-based parallelism for multi-core performance
- 🎯 Advanced clustering with edit distance and frequency-aware algorithms
- 📊 Comprehensive statistics and progress tracking
- 🔍 Auto-detection of UMI presence and format
Automatic Edge Case Handling
- 🔄 UMI Detection: Automatically detects UMI presence and format
- 🧬 Strand Awareness: Automatically handles forward/reverse strand reads
- 📏 CIGAR Handling: Properly processes reads with indels and complex CIGAR strings
- 🎯 Position Grouping: Intelligent grouping based on biological vs. reference coordinates
- ⚖️ Frequency Balancing: Prevents over-clustering of high-frequency UMIs
- 🔧 Quality Selection: Multiple quality metrics with automatic fallback
- ⚡ Performance Optimized: 3.4x faster UMI extraction + 13-113x faster Levenshtein distance calculation
📦 Installation
From PyPI (Recommended)
pip install markdup
From Source
git clone https://github.com/y9c/markdup.git
cd markdup
pip install .
Using uv (Development)
git clone https://github.com/y9c/markdup.git
cd markdup
uv sync
🚀 Quick Start
Automatic UMI Detection and Processing
# Tool automatically detects UMIs and chooses appropriate method
markdup input.bam output.bam
# With multiple threads
markdup input.bam output.bam --threads 8
# Keep duplicates and mark them
markdup input.bam output.bam --keep-duplicates
Explicit Method Selection
# Force UMI-based deduplication
markdup input.bam output.bam --method umi
# Force coordinate-based deduplication (no UMIs)
markdup input.bam output.bam --method coordinate
Advanced Positioning Options
# Start-only positioning (e.g., for ChIP-seq)
markdup input.bam output.bam --start-only
# End-only positioning (e.g., for reverse-complemented reads)
markdup input.bam output.bam --end-only
# Full fragment positioning (default, handles both start and end)
markdup input.bam output.bam
UMI Clustering Tuning
# Custom edit distance threshold
markdup input.bam output.bam --min-edit-dist-frac 0.17
# Frequency-aware clustering to prevent over-merging
markdup input.bam output.bam --min-frequency-ratio 0.1
# Custom UMI separator
markdup input.bam output.bam --umi-sep ":"
# Extract UMIs from BAM tags instead of query names
markdup input.bam output.bam --umi-tag UB
# Auto-detect UMI method
markdup input.bam output.bam --auto
📋 Command Line Interface
Global Options
| Option | Description | Default |
|---|---|---|
| `--help` | Show help message | - |
| `--version` | Show version information | - |
Input/Output Options
| Option | Description | Default |
|---|---|---|
| `INPUT_BAM` | Input BAM file path | Required |
| `OUTPUT_BAM` | Output BAM file path | Required |
| `--force` | Overwrite output file if it exists | False |
Deduplication Method
| Option | Description | Default |
|---|---|---|
| `--method` | Deduplication method: `umi` or `coordinate` | `umi` |
UMI Options
| Option | Description | Default |
|---|---|---|
| `--umi-sep` | Separator for extracting UMIs from read names | `_` |
| `--umi-tag` | BAM tag name for UMI extraction (e.g., 'UB') | None |
| `--min-edit-dist-frac` | Minimum UMI edit distance as fraction of UMI length | 0.1 |
| `--min-frequency-ratio` | Minimum frequency ratio for UMI clustering | 0.1 |
| `--auto` | Auto-detect UMI method from first 10 reads | False |
Positioning Options
| Option | Description | Default |
|---|---|---|
| `--start-only` | Group reads by start position only | False |
| `--end-only` | Group reads by end position only | False |
Quality Selection
| Option | Description | Default |
|---|---|---|
| `--best-read-by` | Select best read by: `mapq`, `avg_base_q` | `avg_base_q` |
Processing Options
| Option | Description | Default |
|---|---|---|
| `--threads` | Number of threads for parallel processing | 1 |
| `--window-size` | Size of genomic windows for processing | 100000 |
| `--keep-duplicates` | Keep duplicate reads and mark them | False |
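As a rough illustration of what a window size means, the sketch below tiles a contig with half-open, non-overlapping intervals. The helper `genomic_windows` is hypothetical, and MarkDup's actual partitioning may differ, for example in how fragments that span a window boundary are handled.

```python
def genomic_windows(contig_length, window_size=100_000):
    """Yield (start, end) half-open windows that tile one contig."""
    for start in range(0, contig_length, window_size):
        # The last window is clipped to the contig length.
        yield start, min(start + window_size, contig_length)


windows = list(genomic_windows(250_000))
```

Each window could then be handed to a separate worker process, which is one plausible way to combine `--window-size` with `--threads`.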
🧬 Algorithm Details
Automatic Condition Detection
The tool automatically detects and handles:
- UMI Presence: Scans read names for UMI patterns
- Read Type: Single-end vs. paired-end detection
- Strand Orientation: Forward vs. reverse strand handling
- CIGAR Complexity: Indel and complex alignment handling
- Quality Metrics: Available quality scores and selection criteria
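A minimal sketch of how UMI presence might be detected from read names. It assumes the UMI is the last field after the separator (default `_`) and contains only the letters A, C, G, T, or N. The function names are hypothetical and this is not necessarily how MarkDup implements detection; the sample size of 10 mirrors the `--auto` description.

```python
import re

# Assumption: a UMI is a run of nucleotide letters at the end of the read name.
UMI_PATTERN = re.compile(r"^[ACGTN]+$")


def extract_umi(query_name, sep="_"):
    """Return the trailing UMI from a read name, or None if there is none."""
    if sep not in query_name:
        return None
    candidate = query_name.rsplit(sep, 1)[1]
    return candidate if UMI_PATTERN.match(candidate) else None


def looks_like_umi_library(query_names, sep="_", sample_size=10):
    """Guess whether a library carries UMIs from the first few read names."""
    sample = list(query_names)[:sample_size]
    return bool(sample) and all(extract_umi(name, sep) is not None for name in sample)
```

When the guess is negative, a tool following this approach would fall back to coordinate-based deduplication.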
UMI-based Deduplication
- Fragment Creation: Reads are grouped into fragments (single-end or paired-end)
- Biological Positioning: Fragments are positioned using strand-aware coordinates
- Position Grouping: Fragments are grouped by biological position and strand
- UMI Clustering: Within each position group, UMIs are clustered using:
  - Exact matching for identical UMIs
  - Edit distance clustering for similar UMIs
  - Frequency-aware clustering to prevent unrealistic merging
- Quality Selection: The highest quality read from each cluster is selected
- Output Generation: Selected reads are written with comprehensive cluster information
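The clustering step above can be sketched as a greedy pass over the UMIs of one position group. This is an illustration under stated assumptions, not MarkDup's actual algorithm: the parameters `max_dist_frac` and `max_count_ratio` are hypothetical and only loosely inspired by `--min-edit-dist-frac` and `--min-frequency-ratio`, whose exact semantics the text does not define. The assumption here is that a UMI is merged only when it is both similar to and much rarer than an existing cluster representative.

```python
from collections import Counter


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cluster_umis(umis, max_dist_frac=0.1, max_count_ratio=0.1):
    """Greedily merge rare UMIs into similar, much more frequent ones.

    Both thresholds are illustrative assumptions, not MarkDup's definitions.
    """
    counts = Counter(umis)
    clusters = {}  # representative UMI -> list of member UMIs
    # Visit UMIs from most to least frequent so representatives come first.
    for umi, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        max_dist = max(1, round(max_dist_frac * len(umi)))
        for rep in clusters:
            close = levenshtein(umi, rep) <= max_dist
            rare = count <= max_count_ratio * counts[rep]
            if close and rare:
                clusters[rep].append(umi)
                break
        else:
            # No suitable representative found: start a new cluster.
            clusters[umi] = [umi]
    return clusters


observed = ["ACGTACGT"] * 20 + ["ACGTACGG"] * 15 + ["TTTTCCCC"] * 5 + ["ACGTACGA"]
clusters = cluster_umis(observed)
```

In this example `ACGTACGA` (seen once) is absorbed by `ACGTACGT` (seen 20 times), while `ACGTACGG` (seen 15 times) stays separate even though it is also one edit away, which is the kind of over-merging a frequency check is meant to prevent.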
Coordinate-based Deduplication
- Fragment Creation: Reads are grouped into fragments
- Position Grouping: Fragments are grouped by genomic coordinates
- Quality Selection: The highest quality read from each group is selected
- Output Generation: Selected reads are written
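A minimal sketch of the coordinate-based steps, using a simplified in-memory record instead of real BAM reads. The `Fragment` type and both functions are hypothetical. Ranking by mean base quality with MAPQ as a tie-breaker is an assumption based on the `--best-read-by` options, not a description of MarkDup's exact selection rule.

```python
from collections import defaultdict
from typing import NamedTuple


class Fragment(NamedTuple):
    """Hypothetical, simplified stand-in for an aligned fragment."""
    name: str
    chrom: str
    start: int
    end: int
    strand: str
    mapq: int
    base_quals: tuple


def avg_base_q(fragment):
    """Mean base quality, or 0.0 when no qualities are available."""
    quals = fragment.base_quals
    return sum(quals) / len(quals) if quals else 0.0


def deduplicate_by_coordinate(fragments):
    """Keep the best fragment per (chrom, start, end, strand) group."""
    groups = defaultdict(list)
    for fragment in fragments:
        key = (fragment.chrom, fragment.start, fragment.end, fragment.strand)
        groups[key].append(fragment)
    # Rank by mean base quality first, then MAPQ as a tie-breaker.
    return [max(group, key=lambda f: (avg_base_q(f), f.mapq))
            for group in groups.values()]


fragments = [
    Fragment("r1", "chr1", 1000, 1050, "+", 60, (30, 30)),
    Fragment("r2", "chr1", 1000, 1050, "+", 50, (40, 40)),
    Fragment("r3", "chr1", 1000, 1050, "-", 60, (20, 20)),
    Fragment("r4", "chr2", 5, 55, "+", 10, ()),
]
kept = [f.name for f in deduplicate_by_coordinate(fragments)]
```

Here `r1` and `r2` share a position and strand, so only `r2` (higher mean base quality) survives; `r3` is on the opposite strand and is kept as a separate group.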
Biological Positioning
- Forward strand: Biological start = reference start, Biological end = reference end
- Reverse strand: Biological start = reference end, Biological end = reference start
- Strand-aware clustering: Ensures proper grouping regardless of strand orientation
- CIGAR-aware positioning: Properly handles indels and complex alignments
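The positioning rules above can be sketched as follows. The reference end is derived from the CIGAR string by summing the operations that consume reference bases (`M`, `D`, `N`, `=`, `X`, per the SAM specification). The function names and the `mode` argument are hypothetical; mapping `"start"` and `"end"` to `--start-only` and `--end-only` is an assumption about how those flags shape the grouping key.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference


def reference_end(ref_start, cigar):
    """End of the alignment on the reference (exclusive), from the CIGAR."""
    consumed = sum(int(length) for length, op in CIGAR_OP.findall(cigar)
                   if op in REF_CONSUMING)
    return ref_start + consumed


def biological_position(ref_start, ref_end, is_reverse):
    """Return (biological_start, biological_end), swapped on the reverse strand."""
    if is_reverse:
        return ref_end, ref_start
    return ref_start, ref_end


def grouping_key(chrom, ref_start, cigar, is_reverse, mode="full"):
    """Build a strand-aware key for grouping reads into duplicate candidates."""
    ref_end = reference_end(ref_start, cigar)
    bio_start, bio_end = biological_position(ref_start, ref_end, is_reverse)
    strand = "-" if is_reverse else "+"
    if mode == "start":
        return (chrom, strand, bio_start)
    if mode == "end":
        return (chrom, strand, bio_end)
    return (chrom, strand, bio_start, bio_end)
```

Soft clips and insertions do not move the reference end, which is why two reads with different CIGAR strings can still share a biological position.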
📊 Output Format
BAM Tags
| Tag | Description |
|---|---|
| `cn` | Cluster name with genomic coordinates and UMI (format: `chr:start-end:strand:UMI`) |
| `cs` | Cluster size (number of reads in cluster) |
Example Output
read1_UMI123 0 chr1 1001 60 50M * 0 0 ATGC... IIII... cn:Z:chr1:1001-1050:+:UMI123 cs:i:3
read2_UMI123 1024 chr1 1001 50 50M * 0 0 ATGC... IIII... cn:Z:chr1:1001-1050:+:UMI123 cs:i:3
read3_UMI123 1024 chr1 1001 45 50M * 0 0 ATGC... IIII... cn:Z:chr1:1001-1050:+:UMI123 cs:i:3
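For downstream scripts, the `cn` value can be split back into its parts. The sketch below assumes the documented `chr:start-end:strand:UMI` layout and that the UMI itself contains no colon; `ClusterName`, `parse_cluster_name`, and `is_duplicate` are hypothetical helpers, not part of MarkDup. The duplicate check relies on the standard SAM flag bit `0x400` (1024), which matches the flags shown in the example above.

```python
from typing import NamedTuple


class ClusterName(NamedTuple):
    chrom: str
    start: int
    end: int
    strand: str
    umi: str


def parse_cluster_name(cn):
    """Split a `cn` value of the form chr:start-end:strand:UMI."""
    # Split from the right so contig names containing ':' stay intact.
    chrom, span, strand, umi = cn.rsplit(":", 3)
    start, end = span.split("-")
    return ClusterName(chrom, int(start), int(end), strand, umi)


def is_duplicate(flag):
    """True when the SAM duplicate bit (0x400) is set in the flag field."""
    return bool(flag & 0x400)


parsed = parse_cluster_name("chr1:1001-1050:+:UMI123")
```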
📚 Documentation