Detect chromosome-level scaffolds in genome assemblies with inconsistent naming conventions
ChromDetect
Detect chromosome-level scaffolds in genome assemblies with inconsistent naming conventions.
The Problem
Genome assemblies use wildly inconsistent naming conventions for chromosome-level scaffolds:
- Super_scaffold_1, Superscaffold_1, SUPER_1
- chr1, chromosome_1, Chr_1
- LG_1 (linkage groups)
- scaffold_1_cov50 (coverage-annotated)
- HiC_scaffold_1, Scaffold_1_RaGOO
- NC_000001.11, CM000001.1 (NCBI accessions)
This inconsistency makes automated analysis and cross-species comparisons difficult. Existing QC tools like QUAST report metrics but don't classify scaffolds. Scaffolding tools like LACHESIS create assemblies but don't help interpret existing ones.
Why ChromDetect?
| Feature | QUAST | assembly-stats | gfastats | ChromDetect |
|---|---|---|---|---|
| N50/N90 statistics | ✅ | ✅ | ✅ | ✅ |
| Scaffold classification | ❌ | ❌ | ❌ | ✅ |
| Pattern-based detection | ❌ | ❌ | ❌ | ✅ |
| Size-based detection | ❌ | ❌ | ❌ | ✅ |
| Karyotype-aware | ❌ | ❌ | ❌ | ✅ |
| Multiple output formats | ✅ | ❌ | ✅ | ✅ |
| Zero dependencies | ❌ | ✅ | ❌ | ✅ |
ChromDetect fills a gap in the genomics toolkit: automatically identifying which scaffolds represent chromosomes rather than just reporting assembly statistics.
The Solution
ChromDetect uses multiple complementary strategies to identify chromosome-level scaffolds:
- Name-based detection - Regex patterns for 15+ common naming conventions
- Size-based detection - Large scaffolds are typically chromosomes
- N50-based detection - Scaffolds contributing to N50 are typically chromosome-level
- Karyotype-informed detection - Use known chromosome count to adjust classifications
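To make the N50-based strategy concrete, here is a minimal sketch of the standard N50 calculation and of which scaffolds "contribute" to it. It illustrates the general definition only and is not ChromDetect's internal code; the function name `n50_contributors` is hypothetical.

```python
def n50_contributors(lengths):
    """Return (n50, contributing_lengths) for a list of scaffold lengths.

    N50 is the length L such that scaffolds of length >= L together
    cover at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    contributors = []
    for length in ordered:
        running += length
        contributors.append(length)
        if running >= half:
            # This scaffold pushes the running total past 50% of the assembly.
            return length, contributors
    return 0, []  # empty input


n50, top = n50_contributors([100, 80, 50, 30, 20, 10, 10])
print(n50, top)
```

In a chromosome-level assembly the contributing scaffolds are usually a handful of very long sequences, which is why they make reasonable chromosome candidates.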
Installation
pip install chromdetect
Or install from source:
git clone https://github.com/shandley/chromdetect.git
cd chromdetect
pip install -e .
Quick Start
Command Line
# Basic usage - get summary
chromdetect assembly.fasta
# Output JSON for programmatic use
chromdetect assembly.fasta --format json --output results.json
# Use karyotype information for better accuracy
chromdetect assembly.fasta --karyotype 24
# Export only chromosome-level scaffolds as TSV
chromdetect assembly.fasta --chromosomes-only --format tsv > chromosomes.tsv
Python API
from chromdetect import parse_fasta, classify_scaffolds
# Parse and classify
scaffolds = parse_fasta("assembly.fasta.gz")
results, stats = classify_scaffolds(scaffolds, expected_chromosomes=24)
# Print summary
print(f"Found {stats.chromosome_count} chromosomes")
print(f"Total assembly: {stats.total_length / 1e9:.2f} Gb")
print(f"N50: {stats.n50 / 1e6:.1f} Mb")
# Access individual scaffold classifications
for r in results:
    if r.classification == "chromosome":
        print(f"{r.name}: {r.length:,} bp (confidence: {r.confidence:.2f})")
Output Formats
Summary (default)
============================================================
CHROMDETECT ASSEMBLY ANALYSIS
============================================================
Total scaffolds: 1,234
Total length: 2,876,543,210 bp (2.88 Gb)
N50: 45,678,901 bp (45.7 Mb)
N90: 12,345,678 bp
Largest scaffold: 198,765,432 bp
Scaffold Classification:
Chromosomes: 24 (2.85 Gb)
Unlocalized: 15
Unplaced: 1,195
Chromosome N50: 118,234,567 bp (118.2 Mb)
GC content: 41.2%
JSON
{
  "summary": {
    "total_scaffolds": 1234,
    "chromosome_count": 24,
    "n50": 45678901,
    ...
  },
  "scaffolds": [
    {
      "name": "chr1",
      "length": 198765432,
      "classification": "chromosome",
      "confidence": 0.95,
      "detection_method": "name_chr_explicit",
      "chromosome_id": "1"
    },
    ...
  ]
}
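A JSON report with this shape can be consumed with the standard library alone. The sketch below relies only on the fields shown above; the sample data is made up for illustration, the classification value "unplaced" is an assumed spelling, and the helper name `chromosome_names` is hypothetical.

```python
import json


def chromosome_names(report, min_confidence=0.0):
    """List scaffold names classified as chromosomes at or above a confidence cutoff."""
    return [
        s["name"]
        for s in report["scaffolds"]
        if s["classification"] == "chromosome" and s["confidence"] >= min_confidence
    ]


# Made-up sample mirroring the documented structure (not real ChromDetect output).
sample = json.loads("""
{
  "summary": {"total_scaffolds": 3, "chromosome_count": 2},
  "scaffolds": [
    {"name": "chr1", "length": 198765432, "classification": "chromosome", "confidence": 0.95},
    {"name": "chr2", "length": 175432198, "classification": "chromosome", "confidence": 0.93},
    {"name": "scaffold_77", "length": 12345, "classification": "unplaced", "confidence": 0.60}
  ]
}
""")

print(chromosome_names(sample))
print(chromosome_names(sample, min_confidence=0.94))
```

For a real run, replace the inline sample with `json.load(open("results.json"))` on the file written by `--format json --output results.json`.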
TSV
name length classification confidence method chromosome_id
chr1 198765432 chromosome 0.95 name_chr_explicit 1
chr2 175432198 chromosome 0.93 name_chr_explicit 2
...
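The TSV output can be read with Python's `csv` module. This is a minimal sketch that assumes the columns are tab-separated and named exactly as in the header row above; the rows are copied from the example, and the helper name `total_chromosome_length` is hypothetical.

```python
import csv
import io


def total_chromosome_length(tsv_text):
    """Sum the lengths of all rows whose classification is 'chromosome'."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return sum(
        int(row["length"])
        for row in reader
        if row["classification"] == "chromosome"
    )


# Two rows from the example above, joined with real tab characters.
tsv_text = (
    "name\tlength\tclassification\tconfidence\tmethod\tchromosome_id\n"
    "chr1\t198765432\tchromosome\t0.95\tname_chr_explicit\t1\n"
    "chr2\t175432198\tchromosome\t0.93\tname_chr_explicit\t2\n"
)

print(total_chromosome_length(tsv_text))
```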
Options
| Option | Description |
|---|---|
| -f, --format | Output format: summary, json, tsv (default: summary) |
| -o, --output | Write output to file instead of stdout |
| -k, --karyotype | Expected chromosome count for karyotype-informed detection |
| -s, --min-size | Minimum size (bp) to consider chromosome-level (default: 10Mb) |
| -c, --chromosomes-only | Only output chromosome-level scaffolds |
| -q, --quiet | Suppress progress messages |
Supported Naming Conventions
ChromDetect recognizes these naming patterns (case-insensitive):
| Pattern | Examples | Method |
|---|---|---|
| Explicit chromosome | chr1, chromosome_X, Chr_MT | name_chr_explicit |
| Super scaffold | Super_scaffold_1, Superscaffold_X | name_super_scaffold |
| SUPER | SUPER_1, SUPER1 | name_SUPER |
| Linkage group | LG1, LG_X | name_linkage_group |
| NCBI RefSeq | NC_000001.11 | name_ncbi_refseq |
| NCBI GenBank | CM000001.1 | name_ncbi_genbank |
| HiC scaffold | HiC_scaffold_1 | name_hic_scaffold |
| RaGOO | Scaffold_1_RaGOO | name_ragoo |
| Simple numeric | 1, X, MT | name_numeric |
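As an illustration of how name-based detection of this kind can work, the sketch below matches a few of the conventions from the table with case-insensitive regular expressions. The regexes are written for this example and are assumptions, not the patterns shipped in ChromDetect; only the method names are taken from the table, and `match_name` is a hypothetical helper.

```python
import re

# Illustrative (regex, method) pairs. Group 1 captures the chromosome ID.
# These regexes are assumptions made for this example.
PATTERNS = [
    (re.compile(r"^chr(?:omosome)?_?(\d+|[XYZW]|MT?)$", re.IGNORECASE), "name_chr_explicit"),
    (re.compile(r"^super_?scaffold_?(\d+|[XYZW])$", re.IGNORECASE), "name_super_scaffold"),
    (re.compile(r"^LG_?(\d+|[XYZW])$", re.IGNORECASE), "name_linkage_group"),
]


def match_name(name):
    """Return (method, chromosome_id) for the first matching pattern, else None."""
    for regex, method in PATTERNS:
        m = regex.match(name)
        if m:
            return method, m.group(1)
    return None


for name in ["chr1", "chromosome_X", "Chr_MT", "Superscaffold_X", "LG_1", "scaffold_1_cov50"]:
    print(name, match_name(name))
```

Anchoring each pattern with `^` and `$` keeps names such as `chr1_random` from being mistaken for a full chromosome.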
Patterns that indicate unlocalized scaffolds:
*_random, *_unloc*, chrUn_*
Patterns that indicate unplaced scaffolds (contigs/fragments):
*_ctg*, *contig*, *_arrow_*, *_pilon*, *_hap*
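The wildcard patterns above behave like shell globs, so they can be tried out with Python's `fnmatch` module. This is a sketch under assumptions: whether ChromDetect implements them as globs or as regexes is not stated here, case-insensitive matching is assumed to apply to these patterns too, and the function name and returned labels are hypothetical.

```python
from fnmatch import fnmatchcase

UNLOCALIZED = ["*_random", "*_unloc*", "chrUn_*"]
UNPLACED = ["*_ctg*", "*contig*", "*_arrow_*", "*_pilon*", "*_hap*"]


def fragment_class(name):
    """Return 'unlocalized', 'unplaced', or None if no fragment pattern matches."""
    lowered = name.lower()  # assumption: matching is case-insensitive
    if any(fnmatchcase(lowered, p.lower()) for p in UNLOCALIZED):
        return "unlocalized"
    if any(fnmatchcase(lowered, p.lower()) for p in UNPLACED):
        return "unplaced"
    return None


for name in ["chr1_random", "chrUn_KI270302v1", "scaffold_12_ctg3", "chr1"]:
    print(name, fragment_class(name))
```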
How It Works
ChromDetect combines name-based and size-based detection with these priority rules:
- Strong name match (confidence ≥ 0.8) takes priority
- Large scaffold + weak name match = chromosome with boosted confidence
- Large scaffold + no name match = chromosome with reduced confidence
- Small scaffold = unplaced regardless of name
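The rules above can be sketched as a small decision function. This is an illustration only: it assumes the rules are applied in the order listed (so a strong name match wins even for a small scaffold), and the boost amount and fallback confidences are hypothetical values chosen for the example. Only the 0.8 threshold and the 10 Mb default size come from this document.

```python
MIN_SIZE = 10_000_000  # documented --min-size default (10 Mb)
STRONG = 0.8           # documented threshold for a strong name match
BOOST = 0.1            # hypothetical confidence boost
NO_NAME_CONF = 0.5     # hypothetical confidence for large, unnamed scaffolds
UNPLACED_CONF = 0.5    # hypothetical confidence for small scaffolds


def combine(length, name_conf):
    """Combine size and name evidence; name_conf is None when no pattern matched."""
    if name_conf is not None and name_conf >= STRONG:
        return "chromosome", name_conf                    # strong name match wins
    if length >= MIN_SIZE:
        if name_conf is not None:
            return "chromosome", min(1.0, name_conf + BOOST)  # large + weak name
        return "chromosome", NO_NAME_CONF                 # large + no name
    return "unplaced", UNPLACED_CONF                      # small scaffold


print(combine(50_000_000, 0.6))
print(combine(50_000_000, None))
print(combine(1_000, 0.4))
```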
When --karyotype is provided:
- If too many candidates: demote lowest-confidence chromosomes
- If too few candidates: promote largest unplaced scaffolds
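A minimal sketch of the karyotype adjustment, assuming each scaffold is a dict with `name`, `length`, `classification`, and `confidence` keys. The function name is hypothetical, and demoting to "unplaced" is an assumption about what demotion means here.

```python
def adjust_to_karyotype(scaffolds, expected):
    """Demote or promote scaffolds in place so the chromosome count approaches `expected`."""
    chroms = [s for s in scaffolds if s["classification"] == "chromosome"]
    if len(chroms) > expected:
        # Too many candidates: demote the lowest-confidence ones.
        chroms.sort(key=lambda s: s["confidence"])
        for s in chroms[: len(chroms) - expected]:
            s["classification"] = "unplaced"
    elif len(chroms) < expected:
        # Too few candidates: promote the largest unplaced scaffolds.
        unplaced = sorted(
            (s for s in scaffolds if s["classification"] == "unplaced"),
            key=lambda s: s["length"],
            reverse=True,
        )
        for s in unplaced[: expected - len(chroms)]:
            s["classification"] = "chromosome"
    return scaffolds


demo = [
    {"name": "a", "length": 90, "classification": "chromosome", "confidence": 0.9},
    {"name": "b", "length": 80, "classification": "chromosome", "confidence": 0.8},
    {"name": "c", "length": 70, "classification": "chromosome", "confidence": 0.4},
]
adjust_to_karyotype(demo, expected=2)
print([(s["name"], s["classification"]) for s in demo])
```

If fewer unplaced scaffolds exist than are needed, this sketch simply promotes all of them and leaves the count short of the target.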
Use Cases
VGP Assembly Validation
# Validate a VGP curated assembly
chromdetect species.pri.cur.fasta.gz --karyotype 24 --format json
Cross-Species Comparison
from chromdetect import parse_fasta, classify_scaffolds
species_files = ["human.fa", "mouse.fa", "zebrafish.fa"]
karyotypes = [23, 20, 25]
for fasta, n_chr in zip(species_files, karyotypes):
    scaffolds = parse_fasta(fasta)
    results, stats = classify_scaffolds(scaffolds, expected_chromosomes=n_chr)
    print(f"{fasta}: {stats.chromosome_count} chromosomes detected")
Pipeline Integration
# As part of assembly QC pipeline
chromdetect assembly.fasta --format json | jq '.summary.chromosome_count'
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
Adding New Patterns
To add support for a new naming convention:
- Add the regex pattern to chromdetect/patterns.py
- Include a descriptive method name
- Ensure the pattern captures chromosome ID in group 1
- Add tests in tests/test_patterns.py
Example:
# In patterns.py
CHROMOSOME_PATTERNS.append(
    (r'^MyConvention_(\d+)$', 'my_convention'),
)
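Before adding a pattern, it can help to confirm that it captures the chromosome ID in group 1 and rejects near-misses. The check below is a standalone sketch using only `re` and the example pattern above; it does not assume anything about the layout of tests/test_patterns.py, and the helper name `captured_id` is hypothetical.

```python
import re

# The example entry from above: (regex, method name).
pattern, method = (r'^MyConvention_(\d+)$', 'my_convention')


def captured_id(name):
    """Return the chromosome ID captured in group 1, or None if the name does not match."""
    m = re.match(pattern, name)
    return m.group(1) if m else None


print(captured_id("MyConvention_12"))         # matches; group 1 is the ID
print(captured_id("MyConvention_12_random"))  # rejected by the trailing $ anchor
```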
Citation
If you use ChromDetect in your research, please cite:
ChromDetect: Chromosome-level scaffold detection for genome assemblies
https://github.com/shandley/chromdetect
License
MIT License - see LICENSE for details.