Comprehensive BAM file deduplication that automatically handles multiple library schemas


MarkDup

A comprehensive Python tool for deduplicating BAM files. It automatically handles multiple library schemas, choosing between UMI-based and coordinate-based clustering.

⚠️ Early Development Stage: This tool is currently in alpha development. While functional, it may have bugs and the API may change. Please report any issues you encounter.

🎯 Key Differentiators

Unlike other deduplication tools, MarkDup automatically handles multiple sequencing conditions and edge cases:

🔬 Multi-condition Support: Works with or without UMIs, single-end or paired-end reads
🧬 Biological Positioning: Automatically handles strand-aware positioning (start-only, end-only, or full fragment)
🎯 Intelligent Clustering: Frequency-aware UMI clustering prevents unrealistic merging
⚡ Edge Case Handling: Automatically detects and handles various sequencing artifacts
🔧 Adaptive Processing: Automatically adjusts algorithms based on input data characteristics

🚀 Features

Core Capabilities

🔬 UMI-based deduplication with intelligent UMI extraction from query names or BAM tags
📍 Coordinate-based deduplication for files without UMIs
🧬 Biological positioning for strand-aware clustering
⚡ Process-based parallelism for multi-core performance
🎯 Advanced clustering with edit distance and frequency-aware algorithms
📊 Comprehensive statistics and progress tracking
🔍 Auto-detection of UMI presence and format

Automatic Edge Case Handling

🔄 UMI Detection: Automatically detects UMI presence and format
🧬 Strand Awareness: Automatically handles forward/reverse strand reads
📏 CIGAR Handling: Properly processes reads with indels and complex CIGAR strings
🎯 Position Grouping: Intelligent grouping based on biological vs. reference coordinates
⚖️ Frequency Balancing: Prevents over-clustering of high-frequency UMIs
🔧 Quality Selection: Multiple quality metrics with automatic fallback
⚡ Performance Optimized: 3.4x faster UMI extraction and 13-113x faster Levenshtein distance calculation

📦 Installation

From PyPI (Recommended)

pip install markdup

From Source

git clone https://github.com/y9c/markdup.git
cd markdup
pip install .

Using uv (Development)

git clone https://github.com/y9c/markdup.git
cd markdup
uv sync

🚀 Quick Start

Automatic UMI Detection and Processing

# Tool automatically detects UMIs and chooses appropriate method
markdup input.bam output.bam

# With multiple threads
markdup input.bam output.bam --threads 8

# Keep duplicates and mark them
markdup input.bam output.bam --keep-duplicates

Explicit Method Selection

# Force UMI-based deduplication
markdup input.bam output.bam --method umi

# Force coordinate-based deduplication (no UMIs)
markdup input.bam output.bam --method coordinate

Advanced Positioning Options

# Start-only positioning (e.g., for ChIP-seq)
markdup input.bam output.bam --start-only

# End-only positioning (e.g., for reverse-complemented reads)
markdup input.bam output.bam --end-only

# Full fragment positioning (default, handles both start and end)
markdup input.bam output.bam

UMI Clustering Tuning

# Custom edit distance threshold
markdup input.bam output.bam --min-edit-dist-frac 0.17

# Frequency-aware clustering to prevent over-merging
markdup input.bam output.bam --min-frequency-ratio 0.1

# Custom UMI separator
markdup input.bam output.bam --umi-sep ":"

# Extract UMIs from BAM tags instead of query names
markdup input.bam output.bam --umi-tag UB

# Auto-detect UMI method
markdup input.bam output.bam --auto
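
The commands above can also be assembled from Python, for example inside a pipeline script. The sketch below only builds the argument list from flags shown in this README; the helper name `build_markdup_command` is hypothetical and is not part of MarkDup.

```python
from typing import List, Optional


def build_markdup_command(
    input_bam: str,
    output_bam: str,
    method: Optional[str] = None,
    threads: Optional[int] = None,
    keep_duplicates: bool = False,
) -> List[str]:
    """Build an argv list for the markdup CLI (hypothetical helper)."""
    cmd = ["markdup", input_bam, output_bam]
    if method is not None:
        # Only the two methods documented in this README are accepted.
        if method not in ("umi", "coordinate"):
            raise ValueError(f"unknown method: {method!r}")
        cmd += ["--method", method]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    if keep_duplicates:
        cmd.append("--keep-duplicates")
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once `markdup` is installed and on the `PATH`.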

📋 Command Line Interface

Global Options

Option	Description	Default
`--help`	Show help message	-
`--version`	Show version information	-

Input/Output Options

Option	Description	Default
`INPUT_BAM`	Input BAM file path	Required
`OUTPUT_BAM`	Output BAM file path	Required
`--force`	Overwrite output file if it exists	False

Deduplication Method

Option	Description	Default
`--method`	Deduplication method: `umi` or `coordinate`	`umi`

UMI Options

Option	Description	Default
`--umi-sep`	Separator for extracting UMIs from read names	`_`
`--umi-tag`	BAM tag name for UMI extraction (e.g., 'UB')	None
`--min-edit-dist-frac`	Minimum UMI edit distance as fraction of UMI length	`0.1`
`--min-frequency-ratio`	Minimum frequency ratio for UMI clustering	`0.1`
`--auto`	Auto-detect UMI method from first 10 reads	False
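
To illustrate what separator-based UMI extraction and detection could look like, here is a minimal sketch. It assumes the UMI is the last separator-delimited field of the read name and consists only of the letters A, C, G, T and N; neither assumption is confirmed by this README, and the function names are hypothetical.

```python
from typing import Iterable, Optional

UMI_ALPHABET = frozenset("ACGTN")  # assumption: UMIs are nucleotide strings


def extract_umi_from_name(query_name: str, sep: str = "_") -> Optional[str]:
    """Return the text after the last separator, or None if absent."""
    _, found, tail = query_name.rpartition(sep)
    if not found or not tail:
        return None
    return tail


def looks_like_umi(candidate: Optional[str]) -> bool:
    """True if the candidate is non-empty and uses only UMI_ALPHABET letters."""
    return bool(candidate) and set(candidate.upper()) <= UMI_ALPHABET


def detect_umis(query_names: Iterable[str], sep: str = "_", sample_size: int = 10) -> bool:
    """Guess UMI presence from the first `sample_size` read names."""
    sample = list(query_names)[:sample_size]
    return bool(sample) and all(
        looks_like_umi(extract_umi_from_name(name, sep)) for name in sample
    )
```

For example, `extract_umi_from_name("read1_ACGTAC")` yields `"ACGTAC"`, and passing `sep=":"` handles names such as `"run:lane:ACGT"`.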

Positioning Options

Option	Description	Default
`--start-only`	Group reads by start position only	False
`--end-only`	Group reads by end position only	False
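
The effect of the positioning options can be pictured as a choice of grouping key. The sketch below is an illustration only: the `Fragment` type and `position_key` function are hypothetical, and the strand-aware start/end mapping follows the Biological Positioning rules described later in this README.

```python
from typing import NamedTuple, Tuple


class Fragment(NamedTuple):
    chrom: str
    ref_start: int    # leftmost reference coordinate
    ref_end: int      # rightmost reference coordinate
    is_reverse: bool


def position_key(frag: Fragment, start_only: bool = False, end_only: bool = False) -> Tuple:
    """Return the key used to group fragments that may be duplicates."""
    if start_only and end_only:
        raise ValueError("start_only and end_only are mutually exclusive")
    strand = "-" if frag.is_reverse else "+"
    # On the reverse strand the biological start is the reference end.
    if frag.is_reverse:
        bio_start, bio_end = frag.ref_end, frag.ref_start
    else:
        bio_start, bio_end = frag.ref_start, frag.ref_end
    if start_only:
        return (frag.chrom, strand, bio_start)
    if end_only:
        return (frag.chrom, strand, bio_end)
    return (frag.chrom, strand, bio_start, bio_end)
```

With start-only positioning, two forward fragments starting at the same base share a key even when their ends differ; with full fragment positioning they do not.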

Quality Selection

Option	Description	Default
`--best-read-by`	Select best read by `mapq` or `avg_base_q`	`avg_base_q`
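
As a rough illustration of the two selection criteria, the sketch below picks one representative from a group of reads. The `Read` type and `select_best` function are hypothetical, and the fallback to mapping quality when base qualities are missing is an assumption rather than documented behaviour.

```python
from statistics import mean
from typing import NamedTuple, Sequence, Tuple


class Read(NamedTuple):
    name: str
    mapq: int
    base_quals: Sequence[int]


def select_best(reads: Sequence[Read], best_read_by: str = "avg_base_q") -> Read:
    """Pick the representative read of a cluster by the chosen metric."""
    if best_read_by == "mapq":
        def key(read: Read) -> Tuple:
            return (read.mapq,)
    elif best_read_by == "avg_base_q":
        def key(read: Read) -> Tuple:
            # Assumed fallback: reads without base qualities rank lowest,
            # and mapping quality breaks ties.
            avg = mean(read.base_quals) if read.base_quals else -1.0
            return (avg, read.mapq)
    else:
        raise ValueError(f"unknown metric: {best_read_by!r}")
    return max(reads, key=key)
```

On ties, `max` returns the first read encountered, so the result depends on input order.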

Processing Options

Option	Description	Default
`--threads`	Number of threads for parallel processing	`1`
`--window-size`	Size of genomic windows for processing	`100000`
`--keep-duplicates`	Keep duplicate reads and mark them	False
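
To give a feel for what `--window-size` controls, the sketch below splits a chromosome into fixed-size windows that could each be handed to a separate worker. The function names are hypothetical, and how MarkDup treats fragments that span a window boundary is not described here.

```python
from typing import Iterator, Tuple


def genomic_windows(chrom_length: int, window_size: int = 100_000) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) windows covering [0, chrom_length)."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    for start in range(0, chrom_length, window_size):
        yield (start, min(start + window_size, chrom_length))


def window_index(position: int, window_size: int = 100_000) -> int:
    """Return the index of the window containing a 0-based position."""
    return position // window_size
```

A 250,000 bp chromosome with the default window size yields three windows, the last one shorter than the others.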

🧬 Algorithm Details

Automatic Condition Detection

The tool automatically detects and handles:

UMI Presence: Scans read names for UMI patterns
Read Type: Single-end vs. paired-end detection
Strand Orientation: Forward vs. reverse strand handling
CIGAR Complexity: Indel and complex alignment handling
Quality Metrics: Available quality scores and selection criteria

UMI-based Deduplication

1. Fragment Creation: Reads are grouped into fragments (single-end or paired-end)
2. Biological Positioning: Fragments are positioned using strand-aware coordinates
3. Position Grouping: Fragments are grouped by biological position and strand
4. UMI Clustering: Within each position group, UMIs are clustered using:
   - Exact matching for identical UMIs
   - Edit distance clustering for similar UMIs
   - Frequency-aware clustering to prevent unrealistic merging
5. Quality Selection: The highest quality read from each cluster is selected
6. Output Generation: Selected reads are written with comprehensive cluster information
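
The UMI clustering step can be sketched as a greedy procedure over the UMIs found at one position. This is a minimal illustration, not MarkDup's implementation: the parameter names `max_dist_frac` and `max_count_ratio` are hypothetical, and the rule used here (merge a rarer UMI into a more frequent one only when the edit distance is at most a fraction of the UMI length and the count ratio is at most a threshold) is one possible reading of the `--min-edit-dist-frac` and `--min-frequency-ratio` options.

```python
from collections import Counter
from typing import Dict, Iterable, List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def cluster_umis(
    umis: Iterable[str],
    max_dist_frac: float = 0.1,      # hypothetical name, see lead-in
    max_count_ratio: float = 0.1,    # hypothetical name, see lead-in
) -> Dict[str, List[str]]:
    """Greedily assign each UMI to a more frequent, similar centre UMI."""
    counts = Counter(umis)
    # Most frequent UMIs first; ties broken alphabetically for determinism.
    ordered = sorted(counts, key=lambda u: (-counts[u], u))
    clusters: Dict[str, List[str]] = {}
    for umi in ordered:
        for centre in clusters:
            close = levenshtein(umi, centre) <= max_dist_frac * len(centre)
            rare = counts[umi] <= max_count_ratio * counts[centre]
            if close and rare:
                clusters[centre].append(umi)
                break
        else:
            # No suitable centre: this UMI starts its own cluster.
            clusters[umi] = [umi]
    return clusters
```

Under this rule, two UMIs that differ by one base but are both abundant stay separate, which is the kind of over-merging the frequency check is meant to prevent.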

Coordinate-based Deduplication

1. Fragment Creation: Reads are grouped into fragments
2. Position Grouping: Fragments are grouped by genomic coordinates
3. Quality Selection: The highest quality read from each group is selected
4. Output Generation: Selected reads are written
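
The coordinate-based flow can be sketched end to end on plain dictionaries. This is a simplified illustration under stated assumptions: reads are represented as dicts with hypothetical keys, the best read is chosen by mapping quality only, and duplicates that are kept receive the standard SAM duplicate flag bit (0x400, decimal 1024).

```python
from collections import defaultdict
from typing import Dict, List

DUPLICATE_FLAG = 0x400  # SAM flag bit marking a PCR or optical duplicate


def dedup_by_coordinate(reads: List[Dict], keep_duplicates: bool = False) -> List[Dict]:
    """Group reads by coordinates and keep the best read of each group."""
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["start"], read["end"], read["strand"])
        groups[key].append(read)

    output = []
    for members in groups.values():
        best = max(members, key=lambda r: r["mapq"])
        for read in members:
            if read is best:
                output.append(read)
            elif keep_duplicates:
                # Copy the read and set the duplicate bit instead of dropping it.
                output.append({**read, "flag": read["flag"] | DUPLICATE_FLAG})
    return output
```

Without `keep_duplicates`, only one read per coordinate group is returned; with it, every read is returned and the non-selected ones carry flag 1024.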

Biological Positioning

Forward strand: Biological start = reference start, Biological end = reference end
Reverse strand: Biological start = reference end, Biological end = reference start
Strand-aware clustering: Ensures proper grouping regardless of strand orientation
CIGAR-aware positioning: Properly handles indels and complex alignments
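
The positioning rules above depend on the reference end of an alignment, which has to be derived from the CIGAR string. The sketch below shows one way to do that with the standard library, using the SAM convention that the M, D, N, = and X operations consume reference bases. The function names are hypothetical, and soft-clipped bases are ignored here, which may differ from what MarkDup does.

```python
import re
from typing import Tuple

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = frozenset("MDN=X")  # operations that advance along the reference


def reference_length(cigar: str) -> int:
    """Number of reference bases covered by an alignment."""
    return sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in REF_CONSUMING)


def biological_span(pos: int, cigar: str, is_reverse: bool) -> Tuple[int, int]:
    """Return (biological start, biological end), 1-based inclusive.

    `pos` is the 1-based leftmost mapping position, as in the SAM POS field.
    """
    ref_start = pos
    ref_end = pos + reference_length(cigar) - 1
    # Reverse-strand reads are read from their rightmost base.
    return (ref_end, ref_start) if is_reverse else (ref_start, ref_end)
```

For a forward read at position 1001 with CIGAR `50M`, this gives the span 1001 to 1050; insertions do not extend the span, while deletions do.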

📊 Output Format

BAM Tags

Tag	Description
`cn`	Cluster name with genomic coordinates and UMI (format: `chr:start-end:strand:UMI`)
`cs`	Cluster size (number of reads in cluster)

Example Output

read1_UMI123    0    chr1    1001    60    50M    *    0    0    ATGC...    IIII...    cn:Z:chr1:1001-1050:+:UMI123    cs:i:3
read2_UMI123    1024  chr1    1001    50    50M    *    0    0    ATGC...    IIII...    cn:Z:chr1:1001-1050:+:UMI123    cs:i:3
read3_UMI123    1024  chr1    1001    45    50M    *    0    0    ATGC...    IIII...    cn:Z:chr1:1001-1050:+:UMI123    cs:i:3
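
Downstream scripts may want to read these tags back. The sketch below parses a `cn` value in the documented `chr:start-end:strand:UMI` format and checks the duplicate bit of a SAM flag (1024 in the example above). The function names are hypothetical, and the parser assumes the UMI itself contains no colon.

```python
from typing import NamedTuple


class ClusterName(NamedTuple):
    chrom: str
    start: int
    end: int
    strand: str
    umi: str


def parse_cluster_name(cn: str) -> ClusterName:
    """Parse a `cn` tag value of the form chr:start-end:strand:UMI."""
    # Split from the right so contig names containing ':' stay intact.
    chrom, span, strand, umi = cn.rsplit(":", 3)
    start, end = span.split("-")
    return ClusterName(chrom, int(start), int(end), strand, umi)


def is_duplicate(flag: int) -> bool:
    """True if the SAM duplicate bit (0x400) is set."""
    return bool(flag & 0x400)
```

Applied to the example output, the first read (flag 0) is the cluster representative and the other two (flag 1024) are marked duplicates.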

📚 Documentation

Release history

0.0.28

Mar 17, 2026

0.0.27

Mar 17, 2026

0.0.26

Mar 13, 2026

0.0.25

Nov 3, 2025

0.0.24

Nov 3, 2025

0.0.23

Nov 3, 2025

0.0.22

Nov 3, 2025

0.0.21

Nov 3, 2025

0.0.20

Nov 1, 2025

0.0.19

Nov 1, 2025

0.0.18

Nov 1, 2025

0.0.17

Nov 1, 2025

0.0.16

Oct 31, 2025

0.0.15

Oct 21, 2025

0.0.14

Oct 21, 2025

0.0.13

Oct 21, 2025

0.0.12

Oct 21, 2025

0.0.11

Oct 21, 2025

0.0.10

Oct 21, 2025

0.0.9

Oct 21, 2025

0.0.8

Oct 21, 2025

0.0.7

Oct 21, 2025

0.0.6

Oct 20, 2025

This version

0.0.5

Oct 20, 2025

0.0.4

Oct 20, 2025

0.0.3

Oct 20, 2025

0.0.1

Oct 20, 2025

Download files


Source Distribution

markdup-0.0.5.tar.gz (38.6 kB)

Uploaded Oct 20, 2025 Source

Built Distribution


markdup-0.0.5-py3-none-any.whl (28.4 kB)

Uploaded Oct 20, 2025 Python 3

File details

Details for the file markdup-0.0.5.tar.gz.

File metadata

Download URL: markdup-0.0.5.tar.gz
Upload date: Oct 20, 2025
Size: 38.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for markdup-0.0.5.tar.gz
Algorithm	Hash digest
SHA256	`5b66186013445fb07d68f4927cd80ac5bc2b1524e2672ebd8c7c03ae44b1124c`
MD5	`12df3720ee753fe5546a2411ad086bc3`
BLAKE2b-256	`285591610749ce228247dd7708c8cac090ae59f0471078bf3ec5d7aca46f204a`
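
A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a generic sketch using `hashlib`; the helper names are hypothetical.

```python
import hashlib
import io
from typing import BinaryIO


def sha256_of_stream(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream in chunks so large files need little memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def sha256_of_file(path: str) -> str:
    """Return the hex SHA256 digest of the file at `path`."""
    with open(path, "rb") as handle:
        return sha256_of_stream(handle)


def sha256_of_bytes(data: bytes) -> str:
    """Convenience wrapper for in-memory data."""
    return sha256_of_stream(io.BytesIO(data))
```

Comparing `sha256_of_file("markdup-0.0.5.tar.gz")` with the SHA256 value in the table confirms that the download is intact.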


Provenance

The following attestation bundles were made for markdup-0.0.5.tar.gz:

Publisher: publish.yml on y9c/markdup

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: markdup-0.0.5.tar.gz
- Subject digest: 5b66186013445fb07d68f4927cd80ac5bc2b1524e2672ebd8c7c03ae44b1124c
- Sigstore transparency entry: 623455315
- Sigstore integration time: Oct 20, 2025
Source repository:
- Permalink: y9c/markdup@4658dff2375c3bd6f21f6244fa70f8272931002f
- Branch / Tag: refs/heads/main
- Owner: https://github.com/y9c
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@4658dff2375c3bd6f21f6244fa70f8272931002f
- Trigger Event: push

File details

Details for the file markdup-0.0.5-py3-none-any.whl.

File metadata

Download URL: markdup-0.0.5-py3-none-any.whl
Upload date: Oct 20, 2025
Size: 28.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for markdup-0.0.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e0ed35f7be37bad7485e7f2e4ce3f5ec764ef4ceb594036ca64e8630d476b970`
MD5	`eaf9d4f292634daa618cdcf4cd6458fa`
BLAKE2b-256	`00c39eccd49fc49b856c9a1e9091f37970af5919b673a8e5a9b97ea9be2082ec`


Provenance

The following attestation bundles were made for markdup-0.0.5-py3-none-any.whl:

Publisher: publish.yml on y9c/markdup

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: markdup-0.0.5-py3-none-any.whl
- Subject digest: e0ed35f7be37bad7485e7f2e4ce3f5ec764ef4ceb594036ca64e8630d476b970
- Sigstore transparency entry: 623455321
- Sigstore integration time: Oct 20, 2025
Source repository:
- Permalink: y9c/markdup@4658dff2375c3bd6f21f6244fa70f8272931002f
- Branch / Tag: refs/heads/main
- Owner: https://github.com/y9c
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@4658dff2375c3bd6f21f6244fa70f8272931002f
- Trigger Event: push
