Universal mtDNA Haplogroup Classifier
eveHap
eveHap is a Python implementation of mtDNA haplogroup classification, inspired by and building upon Haplogrep 3. It brings haplogroup classification into the Python ecosystem with support for modern high-coverage sequencing data as well as ancient DNA with damage filtering.
Features
- Pipeline-Friendly: Explicit file paths required - no hidden auto-downloads
- Offline-Ready: Bundled rsrs and rcrs resources work without internet
- Multiple Input Formats: BAM, CRAM, VCF, FASTA, HSD, and consumer genotyping files (23andMe, AncestryDNA)
- Dual Classification Strategy:
  - Kulczynski scoring for high-coverage modern DNA
  - Tree traversal for low-coverage/ancient DNA
- mitoLeaf Phylotree: Optionally download the complete mitoLeaf phylotree (6400+ haplogroups)
- Ancient DNA Support: Damage pattern detection and filtering for C→T/G→A substitutions
- Quality Metrics: Confidence scores, coverage statistics, and QC warnings
- Flexible Output: TSV, JSON, or human-readable text
Installation
cd evehap
pip install -e .
Quick Start
# Classify using bundled resources (offline-ready, no download required)
evehap classify sample.bam --tree rsrs --reference rcrs
# Classify a VCF file with specific sample
evehap classify samples.vcf.gz --tree rsrs --reference rcrs --sample-id HG00096
# Classify multiple files with output to TSV
evehap classify *.bam --tree rsrs --reference rcrs -o results.tsv
# Classify ancient DNA with damage filtering
evehap classify ancient.bam --tree rsrs --reference rcrs --damage-filter --method traversal
# Download mitoLeaf tree (more haplogroups) and use it
evehap download --outdir ./evehap_data/
evehap classify sample.bam --tree ./evehap_data/mitoleaf_tree.json --reference ./evehap_data/rCRS.fasta
# Estimate ancient DNA damage rates
evehap damage ancient.bam
# Show version and bundled resources
evehap version
Usage
classify
Classify mtDNA haplogroup from input files.
evehap classify [OPTIONS] INPUT_FILES...
Options:
--tree TEXT REQUIRED: 'rsrs', 'rcrs', or path to tree file
--reference TEXT REQUIRED: 'rcrs' or path to reference FASTA
--format TEXT Input format (auto, bam, vcf, fasta, hsd, microarray)
-o, --output PATH Output file (default: stdout)
--output-format [json|tsv|text] Output format
--sample-id TEXT Sample ID for multi-sample VCF/HSD files
--damage-filter Apply ancient DNA damage filtering
--method [auto|kulczynski|traversal]
Classification method
--top-n INTEGER Number of alternatives to report
-q, --quiet Suppress progress output
Tree Options:
| Name | Description | Haplogroups | Requires Download |
|---|---|---|---|
| rsrs | Bundled RSRS-based XML tree | ~5400 | No (bundled) |
| rcrs | Bundled rCRS-based XML tree | ~2400 | No (bundled) |
| mitoleaf_tree.json | mitoLeaf JSON tree | ~6400 | Yes |
# Use bundled RSRS tree (offline-ready)
evehap classify sample.bam --tree rsrs --reference rcrs
# Use bundled rCRS tree
evehap classify sample.bam --tree rcrs --reference rcrs
# Download and use mitoLeaf tree (most comprehensive)
evehap download --outdir ./evehap_data/
evehap classify sample.bam --tree ./evehap_data/mitoleaf_tree.json --reference ./evehap_data/rCRS.fasta
download
Download phylotree and reference resources.
evehap download --outdir ./evehap_data/
Options:
-o, --outdir PATH REQUIRED: Directory to save downloaded files
--category [all|reference|phylotree]
Category to download (default: all)
--resource TEXT Specific resource(s) to download
--force Overwrite existing files
--check-updates Check for updates without downloading
--list List available resources
# Download all resources
evehap download --outdir ./evehap_data/
# List available resources
evehap download --outdir ./evehap_data/ --list
# Download only the mitoLeaf tree
evehap download --outdir ./evehap_data/ --resource mitoleaf-tree
# Check for updates
evehap download --outdir ./evehap_data/ --check-updates
info
Show information about an input file.
evehap info sample.bam
damage
Estimate ancient DNA damage rates from a BAM file.
evehap damage ancient.bam
download
Download reference sequences and phylotree resources.
evehap download --outdir DIR [OPTIONS]
Options:
-o, --outdir PATH REQUIRED: Directory to save downloaded files
--category [all|reference|phylotree] Category of resources (default: all)
--resource TEXT Specific resource(s) to download
--force Overwrite existing files
--check-updates Check for updates without downloading
--list List available resources
Available Resources:
| Resource | Source | Description |
|---|---|---|
| Reference Sequences | | |
| rsrs | phylotree.org | Reconstructed Sapiens Reference Sequence |
| rcrs | phylotree.org | Revised Cambridge Reference Sequence |
| mitoLeaf Phylotree Data | | |
| mitoleaf-tree | forensicgenomics.github.io | Complete phylotree in JSON format (~15 MB) |
| mitoleaf-motifs | forensicgenomics.github.io | Haplogroup defining mutations |
| mitoleaf-representatives | forensicgenomics.github.io | Representative sequences per haplogroup |
Examples:
# Download all resources
evehap download --outdir ./evehap_data/
# Download only reference sequences
evehap download --outdir ./evehap_data/ --category reference
# Download only phylotree data
evehap download --outdir ./evehap_data/ --category phylotree
# Download a specific resource
evehap download --outdir ./evehap_data/ --resource mitoleaf-tree
# Check for updates to installed resources
evehap download --outdir ./evehap_data/ --check-updates
# Force re-download (update all)
evehap download --outdir ./evehap_data/ --force
# List all available resources
evehap download --outdir ./evehap_data/ --list
version
Show version information and installed resources.
evehap version [OPTIONS]
Options:
--check-updates Check for resource updates
Output shows:
- Package and data directory locations
- Default tree and reference paths
- Bundled phylotrees with sizes
- Downloaded resources with sizes and dates
Example:
$ evehap version
eveHap v0.1.0
Package directory: /path/to/evehap
Data directory: /path/to/evehap/data
Default tree: tree-rsrs.xml
Default reference: rCRS.fasta
Bundled phylotrees:
✓ tree-rsrs.xml (1837.7 KB)
✓ tree.xml (2369.2 KB)
Downloaded resources:
[reference]
✓ rsrs (16.7 KB, 2025-01-01)
✓ rcrs (16.5 KB, 2025-01-01)
[phylotree]
✓ mitoleaf-tree (15293.6 KB, 2025-01-01)
...
Python API
from evehap.adapters.bam import BAMAdapter
from evehap.core.classifier import Classifier
from evehap.core.phylotree import Phylotree
# Load phylotree (mitoLeaf JSON or Haplogrep XML - auto-detected)
phylotree = Phylotree.load("data/phylotree/mitoleaf/tree.json")
# Extract profile from BAM
adapter = BAMAdapter(reference_path="data/reference/rCRS.fasta")
profile = adapter.extract_profile("sample.bam")
# Classify
classifier = Classifier(phylotree)
result = classifier.classify(profile)
print(f"Haplogroup: {result.haplogroup}")
print(f"Confidence: {result.confidence:.1%}")
print(f"Quality: {result.quality}")
Supported Input Formats
| Format | Extensions | Description |
|---|---|---|
| BAM/CRAM | .bam, .cram | Aligned sequencing reads |
| VCF | .vcf, .vcf.gz | Variant call format |
| FASTA | .fasta, .fa, .fna | Consensus sequences |
| HSD | .hsd | Haplogrep polymorphism format |
| Microarray | .txt, .csv | 23andMe, AncestryDNA raw data |
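As a rough illustration of how the extensions in the table could map onto the values accepted by --format, here is a minimal sketch. The function name `guess_format` and the mapping itself are assumptions made for this example; eveHap's actual auto-detection is not shown in this README and may also inspect file contents.

```python
from pathlib import Path

# Hypothetical mapping built from the table above; not eveHap's internal code.
# CRAM is mapped to "bam" only because --format lists no separate cram value.
EXTENSION_FORMATS = {
    ".bam": "bam",
    ".cram": "bam",
    ".vcf": "vcf",
    ".fasta": "fasta",
    ".fa": "fasta",
    ".fna": "fasta",
    ".hsd": "hsd",
    ".txt": "microarray",
    ".csv": "microarray",
}


def guess_format(path: str) -> str:
    """Guess the input format from the file extension alone."""
    suffixes = [s.lower() for s in Path(path).suffixes]
    # Treat a trailing .gz as compression so sample.vcf.gz maps to "vcf".
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]
    if not suffixes:
        raise ValueError(f"Cannot guess format without an extension: {path}")
    try:
        return EXTENSION_FORMATS[suffixes[-1]]
    except KeyError:
        raise ValueError(f"Unrecognized extension: {suffixes[-1]}") from None
```

Because .txt and .csv are ambiguous extensions, passing --format explicitly is the safer choice for microarray files.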
Classification Methods
Kulczynski Scoring (default for high-coverage)
Uses the Kulczynski similarity measure to score each haplogroup based on the proportion of expected mutations present in the sample. Best for samples with >80% mtDNA coverage.
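To make the scoring idea concrete, the sketch below computes the general (unweighted) Kulczynski measure between a haplogroup's expected mutations and the mutations observed in a sample. The haplogroup names and mutation labels are made-up toy values, and any per-mutation weighting that eveHap applies is not described here, so treat this as an illustration of the measure and not as eveHap's implementation.

```python
def kulczynski(expected: set, observed: set) -> float:
    """Mean of the two conditional proportions of shared mutations."""
    if not expected or not observed:
        return 0.0
    shared = len(expected & observed)
    return 0.5 * (shared / len(expected) + shared / len(observed))


# Toy haplogroup motifs (hypothetical, not taken from any phylotree).
candidates = {
    "hg1": {"100A", "200C"},
    "hg2": {"100A", "200C", "300T"},
    "hg3": {"500A", "600G", "700T"},
}
sample = {"100A", "200C", "300T", "400G"}

scores = {name: kulczynski(motif, sample) for name, motif in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores[best])
```

The measure rewards a haplogroup both for having its expected mutations found in the sample and for explaining a large share of the sample's mutations, which is why the more specific hg2 outranks hg1 here.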
Tree Traversal (default for low-coverage)
Traverses the phylogenetic tree from root, evaluating support for each branch based on derived/ancestral allele counts. Best for ancient DNA or samples with sparse coverage.
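The traversal idea can be sketched as follows. This is a simplified illustration under stated assumptions: the `Node` class, the rule of descending only when derived calls outnumber ancestral calls, and the toy tree are all hypothetical and are not taken from eveHap's code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    name: str
    # Mutations defining the branch into this node: position -> derived allele.
    mutations: Dict[int, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


def branch_support(node: Node, calls: Dict[int, str]) -> Tuple[int, int]:
    """Count derived and ancestral calls at the node's defining positions."""
    derived = ancestral = 0
    for pos, allele in node.mutations.items():
        if pos not in calls:
            continue  # uncovered positions give no evidence either way
        if calls[pos] == allele:
            derived += 1
        else:
            ancestral += 1
    return derived, ancestral


def traverse(root: Node, calls: Dict[int, str]) -> str:
    """Walk down from the root while some child has net derived support."""
    current = root
    while True:
        best, best_margin = None, 0
        for child in current.children:
            derived, ancestral = branch_support(child, calls)
            if derived - ancestral > best_margin:
                best, best_margin = child, derived - ancestral
        if best is None:
            return current.name
        current = best


# Toy tree with made-up positions and node names.
root = Node("root", children=[
    Node("n1", {100: "T"}, [Node("n1a", {300: "C"}), Node("n1b", {400: "A"})]),
    Node("n2", {200: "G"}),
])
calls = {100: "T", 300: "C", 400: "G"}  # position 200 is not covered
print(traverse(root, calls))
```

Skipping uncovered positions is what makes this style of search tolerant of sparse data: a missing site neither supports nor contradicts a branch.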
Reference Sequences
eveHap supports two mtDNA reference sequences:
RSRS (Reconstructed Sapiens Reference Sequence)
The ancestral human mtDNA sequence reconstructed from phylogenetic analysis. The RSRS-based phylotree has the complete haplogroup structure starting from mtMRCA (mitochondrial Most Recent Common Ancestor).
- Use for: Phylotree structure, haplogroup classification
- Source: phylotree.org/resources/RSRS.fasta
rCRS (Revised Cambridge Reference Sequence)
The standard reference for modern mtDNA sequencing. Most BAM/VCF files are aligned to rCRS. It represents haplogroup H2a2a1 (a European lineage).
- Use for: Sequence alignment, variant calling
- Source: phylotree.org/resources/rCRS.fasta
Note: eveHap automatically handles the translation between rCRS-aligned input files and the RSRS-based phylotree.
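One way to picture this translation is to re-express SNPs called against one reference as differences from another. The sketch below does this for toy sequences; the function name `rebase_snps` is hypothetical, and the sketch assumes SNPs only, full coverage (positions without a call match the old reference), and two references of equal length that share one coordinate system. How eveHap performs the translation internally is not documented here.

```python
from typing import Dict


def rebase_snps(variants: Dict[int, str], old_ref: str, new_ref: str) -> Dict[int, str]:
    """Convert SNPs relative to old_ref into SNPs relative to new_ref.

    variants maps 1-based positions to the base observed in the sample.
    """
    if len(old_ref) != len(new_ref):
        raise ValueError("references must share one coordinate system")
    rebased = {}
    for index, (old_base, new_base) in enumerate(zip(old_ref, new_ref)):
        pos = index + 1
        # Positions without a call are assumed to match the old reference.
        sample_base = variants.get(pos, old_base)
        if sample_base != new_base:
            rebased[pos] = sample_base
    return rebased


# Toy references that differ at position 3 (not real rCRS/RSRS sequence).
old_ref = "ACGTA"
new_ref = "ACATA"
print(rebase_snps({5: "G"}, old_ref, new_ref))
```

Note that a sample identical to the old reference still shows a difference wherever the two references disagree, which is why variants cannot simply be passed through unchanged.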
Ancient DNA Damage Filtering
Ancient DNA exhibits characteristic damage patterns:
- C→T substitutions at 5' read ends
- G→A substitutions at 3' read ends
Use --damage-filter to automatically detect and filter potentially damaged bases.
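The two patterns above can be illustrated with a small masking routine. This is a sketch under assumptions: the function name `mask_damage`, the fixed end window of 3 bases, and the choice to mask with N are illustrative only, and the routine works on gap-free strings in read orientation. The filter that --damage-filter actually applies is not described in this README.

```python
def mask_damage(ref: str, read: str, window: int = 3) -> str:
    """Mask C->T near the 5' end and G->A near the 3' end with 'N'.

    ref and read are equal-length, gap-free strings given 5'->3'
    in read orientation.
    """
    if len(ref) != len(read):
        raise ValueError("ref and read must have the same length")
    length = len(read)
    masked = []
    for i, (ref_base, read_base) in enumerate(zip(ref.upper(), read.upper())):
        near_5p = i < window
        near_3p = i >= length - window
        is_ct = ref_base == "C" and read_base == "T"
        is_ga = ref_base == "G" and read_base == "A"
        if (near_5p and is_ct) or (near_3p and is_ga):
            masked.append("N")
        else:
            masked.append(read_base)
    return "".join(masked)


# C->T at the first base and G->A at the last base are both masked.
print(mask_damage("CACGTTGAG", "TACGTTGAA"))
```

Mismatches of the same type in the interior of a read are left untouched, since damage is concentrated at the read ends.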
Output Formats
TSV (default)
sample_id haplogroup confidence quality coverage depth method warnings
HGDP00582 H2a2a1g 1.0000 high 1.0000 19004.1 kulczynski HETEROPLASMY
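For downstream processing, TSV output like the example above can be read with Python's standard csv module. The sketch assumes the columns are tab-separated and appear under the header names shown; the helper name `read_results` is hypothetical.

```python
import csv
import io

# Example content copied from the TSV output above, with tabs between columns.
tsv_text = (
    "sample_id\thaplogroup\tconfidence\tquality\tcoverage\tdepth\tmethod\twarnings\n"
    "HGDP00582\tH2a2a1g\t1.0000\thigh\t1.0000\t19004.1\tkulczynski\tHETEROPLASMY\n"
)


def read_results(handle):
    """Parse eveHap-style TSV rows and convert numeric columns to floats."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for column in ("confidence", "coverage", "depth"):
            row[column] = float(row[column])
        rows.append(row)
    return rows


results = read_results(io.StringIO(tsv_text))
# Collect samples that carry any QC warning.
flagged = [row["sample_id"] for row in results if row["warnings"]]
print(results[0]["haplogroup"], flagged)
```

For a results file written with -o, open it with `open("results.tsv", newline="")` and pass the handle to `read_results`.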
JSON
{
"sample_id": "HGDP00582",
"haplogroup": "H2a2a1g",
"confidence": 1.0,
"quality": "high",
...
}
Text
Sample: HGDP00582
Haplogroup: H2a2a1g
Confidence: 100.0%
Quality: high
Method: kulczynski
Coverage: 100.0%
Pipeline Integration
eveHap is designed to be pipeline-friendly for batch processing on HPC and cloud environments. All resource paths must be explicitly specified - no hidden auto-downloads.
Using Bundled Resources (Offline-Ready)
# Bundled resources work without internet access
evehap classify sample.bam --tree rsrs --reference rcrs
Using Downloaded Resources
# Download resources to a directory
evehap download --outdir ./evehap_data/
# Use downloaded files explicitly
evehap classify sample.bam \
--tree ./evehap_data/mitoleaf_tree.json \
--reference ./evehap_data/rCRS.fasta
Nextflow
# Using bundled resources
nextflow run pipelines/nextflow/main.nf \
--input 'samples/*.bam' \
--tree rsrs \
--reference rcrs \
--outdir results/
# Using downloaded resources
nextflow run pipelines/nextflow/main.nf \
--input 'samples/*.bam' \
--tree ./evehap_data/mitoleaf_tree.json \
--reference ./evehap_data/rCRS.fasta \
--outdir results/
# SLURM cluster
nextflow run pipelines/nextflow/main.nf \
--input 'samples/*.bam' \
--tree rsrs \
--reference rcrs \
-profile slurm
Snakemake
cd pipelines/snakemake
# Using bundled resources
snakemake --cores 4 --config input_dir=../../samples tree=rsrs reference=rcrs
# SLURM cluster
snakemake --profile profiles/slurm --config tree=rsrs reference=rcrs
See pipelines/README.md for complete documentation.
Testing
# Run all tests
pytest tests/
# Run fast tests only (exclude slow benchmarks)
pytest tests/ -m "not slow"
# Run with verbose output
pytest tests/ -v
Project Structure
evehap/
├── evehap/
│   ├── adapters/              # Input format adapters
│   │   ├── bam.py             # BAM/CRAM adapter
│   │   ├── vcf.py             # VCF adapter
│   │   ├── fasta.py           # FASTA adapter
│   │   ├── hsd.py             # HSD adapter
│   │   └── microarray.py      # 23andMe/Ancestry adapter
│   ├── core/                  # Core classification logic
│   │   ├── classifier.py      # Classifier algorithms
│   │   ├── damage.py          # Ancient DNA damage filtering
│   │   ├── phylotree.py       # Phylotree parser
│   │   └── profile.py         # AlleleProfile data structures
│   ├── output/                # Output formatting
│   │   ├── report.py          # Report generation
│   │   └── result.py          # Result data structures
│   └── cli.py                 # Command-line interface
├── data/
│   ├── phylotree/             # Phylotree data files
│   │   ├── tree-rsrs.xml      # RSRS-based tree (default, complete phylogeny)
│   │   ├── tree.xml           # rCRS-based tree
│   │   └── mitoleaf/          # mitoLeaf data (downloaded)
│   │       ├── tree.json      # Complete phylotree (JSON)
│   │       ├── hgmotifs.json  # Haplogroup motifs
│   │       └── mito_representatives.csv
│   └── reference/             # Reference sequences (downloaded)
│       ├── RSRS.fasta         # Reconstructed Sapiens Reference Sequence
│       └── rCRS.fasta         # Revised Cambridge Reference Sequence
└── tests/                     # Test suite
Requirements
- Python 3.8+
- pysam
- click
- numpy
License
PolyForm Noncommercial License 1.0.0
This software is free for noncommercial use, including academic research, personal projects, and use by nonprofit organizations. See LICENSE for full terms.
For commercial licensing inquiries, please contact the authors.
Citation
If you use eveHap in your research, please cite both eveHap and the foundational Haplogrep 3 work:
Haplogrep 3 (classification methodology and phylotree data):
Schönherr S, Weissensteiner H, Kronenberg F, Forer L. Haplogrep 3 - an interactive haplogroup classification and analysis platform. Nucleic Acids Res. 2023. https://doi.org/10.1093/nar/gkad284
eveHap (this implementation):
eveHap: Universal mtDNA Haplogroup Classifier https://doi.org/10.5281/zenodo.18305868
Acknowledgments
eveHap implements haplogroup classification algorithms originally developed by the Haplogrep team at the Medical University of Innsbruck. We are grateful for their foundational work in making mtDNA haplogroup classification accessible and rigorous.