Detect chromosome-level scaffolds in genome assemblies with inconsistent naming conventions
ChromDetect
A utility to classify scaffolds in genome assemblies based on naming conventions and size.
What It Does
ChromDetect is a simple utility that classifies scaffolds in genome assemblies as chromosomes, unlocalized, or unplaced sequences. It works by:
- Matching scaffold names against common naming patterns (chr1, Super_scaffold_1, LG_1, NC_*, etc.)
- Using size heuristics (large scaffolds are likely chromosomes)
- Adjusting for expected karyotype if you know the chromosome count
Why Use It?
Genome assemblies use inconsistent naming conventions:
Super_scaffold_1, chr1, LG_1, HiC_scaffold_1, NC_000001.11, scaffold_1_cov50...
If you need to quickly identify which scaffolds are chromosomes—for filtering, statistics, or downstream analysis—ChromDetect automates that classification.
This is a utility tool, not a validator. It doesn't detect misassemblies or verify correctness. For assembly QC, use tools like QUAST or Merqury.
Installation
pip install chromdetect
Or install from source:
git clone https://github.com/shandley/chromdetect.git
cd chromdetect
pip install -e .
Example Data
ChromDetect includes synthetic test assemblies in the examples/ directory:
# Try it immediately after installation
chromdetect examples/synthetic_assembly.fasta
# Compare two assembly versions
chromdetect examples/synthetic_assembly.fasta --compare examples/synthetic_assembly_v2.fasta
Downloading Real Genome Assemblies
For testing with real data, we recommend these small, well-annotated assemblies:
Saccharomyces cerevisiae S288C (Yeast, ~12 Mb, 16 chromosomes):
# Using NCBI datasets CLI (install: conda install -c conda-forge ncbi-datasets-cli)
datasets download genome accession GCF_000146045.2 --include genome
unzip ncbi_dataset.zip
chromdetect ncbi_dataset/data/GCF_000146045.2/GCF_000146045.2_R64_genomic.fna
Caenorhabditis elegans (Nematode, ~100 Mb, 6 chromosomes):
datasets download genome accession GCF_000002985.6 --include genome
unzip ncbi_dataset.zip
chromdetect ncbi_dataset/data/GCF_000002985.6/*.fna
Arabidopsis thaliana (Plant, ~135 Mb, 5 chromosomes):
datasets download genome accession GCF_000001735.4 --include genome
unzip ncbi_dataset.zip
chromdetect ncbi_dataset/data/GCF_000001735.4/*.fna --karyotype 5
For more test data options, see NCBI Datasets or GenomeArk (VGP assemblies).
Quick Start
Command Line
# Basic usage - get summary
chromdetect assembly.fasta
# Output JSON for programmatic use
chromdetect assembly.fasta --format json --output results.json
# Use karyotype information for better accuracy
chromdetect assembly.fasta --karyotype 24
# Export only chromosome-level scaffolds as TSV
chromdetect assembly.fasta --chromosomes-only --format tsv > chromosomes.tsv
# Export as BED or GFF format for pipeline integration
chromdetect assembly.fasta --format bed > scaffolds.bed
chromdetect assembly.fasta --format gff > scaffolds.gff
# Extract chromosome sequences to a new FASTA file
chromdetect assembly.fasta --extract-chromosomes chromosomes.fasta
# Batch process multiple assemblies
chromdetect --batch assemblies_dir/ --output results_dir/
# Compare two assemblies side-by-side
chromdetect assembly_v1.fasta --compare assembly_v2.fasta
# Generate visual HTML report
chromdetect assembly.fasta --format html -o report.html
# Use custom naming patterns
chromdetect assembly.fasta --patterns custom_patterns.yaml
# Use NCBI assembly report for accurate classification
chromdetect assembly.fasta --assembly-report GCF_000001405.assembly_report.txt
Python API
# Simple one-liner classification (recommended for most use cases)
from chromdetect import classify_fasta, compare_fasta_files
results, stats = classify_fasta("assembly.fasta")
print(f"Found {stats.chromosome_count} chromosomes")
print(f"N50: {stats.n50 / 1e6:.1f} Mb")
# Compare two assemblies
comparison = compare_fasta_files("assembly_v1.fasta", "assembly_v2.fasta")
print(f"Shared chromosomes: {len(comparison.shared_chromosomes)}")
print(f"N50 change: {comparison.summary()['n50_difference']:,} bp")
For more control, use the lower-level API:
from chromdetect import (
    parse_fasta, classify_scaffolds, write_fasta, format_bed, format_gff,
    parse_assembly_report
)
# Parse and classify with options
scaffolds = parse_fasta("assembly.fasta.gz")
results, stats = classify_scaffolds(scaffolds, expected_chromosomes=24)
# Print summary
print(f"Found {stats.chromosome_count} chromosomes")
print(f"Total assembly: {stats.total_length / 1e9:.2f} Gb")
# Access individual scaffold classifications
for r in results:
    if r.classification == "chromosome":
        print(f"{r.name}: {r.length:,} bp (confidence: {r.confidence:.2f})")
# Export to BED or GFF format
bed_output = format_bed(results)
gff_output = format_gff(results)
# Use NCBI assembly report for authoritative classification
report = parse_assembly_report("assembly_report.txt")
results, stats = classify_scaffolds(scaffolds, assembly_report=report)
Output Formats
Summary (default)
============================================================
CHROMDETECT ASSEMBLY ANALYSIS
============================================================
Total scaffolds: 1,234
Total length: 2,876,543,210 bp (2.88 Gb)
N50: 45,678,901 bp (45.7 Mb)
N90: 12,345,678 bp
Largest scaffold: 198,765,432 bp
Scaffold Classification:
Chromosomes: 24 (2.85 Gb)
Unlocalized: 15
Unplaced: 1,195
Chromosome N50: 118,234,567 bp (118.2 Mb)
GC content: 41.2%
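The N50 and N90 values in the summary follow the usual assembly-statistics definition: sort scaffolds from largest to smallest and report the length of the scaffold at which the running total first reaches 50% (or 90%) of the assembly length. The sketch below is a minimal standalone illustration of that definition, not ChromDetect's own code; the function name `nx` is hypothetical.

```python
def nx(lengths, fraction=0.5):
    """Return the Nx statistic (N50 when fraction=0.5) for scaffold lengths."""
    if not lengths:
        return 0
    threshold = sum(lengths) * fraction
    running = 0
    # Walk scaffolds from largest to smallest until the threshold is reached.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return 0


lengths = [100, 80, 60, 40, 20]  # toy scaffold lengths in bp
print("N50:", nx(lengths, 0.5))
print("N90:", nx(lengths, 0.9))
```

Because the scaffolds are sorted in descending order, N90 can never exceed N50 for the same assembly.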
JSON
{
  "summary": {
    "total_scaffolds": 1234,
    "chromosome_count": 24,
    "n50": 45678901,
    ...
  },
  "scaffolds": [
    {
      "name": "chr1",
      "length": 198765432,
      "classification": "chromosome",
      "confidence": 0.95,
      "detection_method": "name_chr_explicit",
      "chromosome_id": "1"
    },
    ...
  ]
}
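A JSON report with this shape can be post-processed with the standard library alone. The sketch below assumes only the field names shown above (`scaffolds`, `name`, `classification`, `confidence`); the inline sample data, the `"unplaced"` classification string, and the helper `chromosome_names` are illustrative assumptions rather than part of ChromDetect.

```python
import json

# Inline sample mirroring the documented structure (values are made up).
report_text = """
{
  "summary": {"total_scaffolds": 3, "chromosome_count": 2},
  "scaffolds": [
    {"name": "chr1", "length": 198765432, "classification": "chromosome", "confidence": 0.95},
    {"name": "chr2", "length": 175432198, "classification": "chromosome", "confidence": 0.93},
    {"name": "scaffold_99", "length": 12345, "classification": "unplaced", "confidence": 0.5}
  ]
}
"""


def chromosome_names(report, min_confidence=0.0):
    """List names of scaffolds classified as chromosomes above a confidence cutoff."""
    return [
        s["name"]
        for s in report["scaffolds"]
        if s["classification"] == "chromosome" and s["confidence"] >= min_confidence
    ]


report = json.loads(report_text)
print(chromosome_names(report))
print(chromosome_names(report, min_confidence=0.94))
```

In practice `report_text` would be read from the file written with `--format json --output results.json`.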
TSV
name length classification confidence method chromosome_id
chr1 198765432 chromosome 0.95 name_chr_explicit 1
chr2 175432198 chromosome 0.93 name_chr_explicit 2
...
BED
Standard BED6 format for integration with bedtools, IGV, and other genomics tools:
chr1 0 198765432 chromosome 950 .
chr2 0 175432198 chromosome 930 .
...
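In the example rows, the score column appears to be the confidence scaled to the BED range of 0 to 1000 (0.95 becomes 950). That mapping is inferred from the example, not confirmed by the source. Under that assumption, one scaffold record maps to a BED6 line roughly as follows; `to_bed6` is a hypothetical helper, not a ChromDetect function.

```python
def to_bed6(name, length, classification, confidence):
    """Format one scaffold as a tab-separated BED6 line covering the whole sequence."""
    # Assumed mapping: confidence in [0, 1] scaled to the BED score range 0-1000.
    score = max(0, min(1000, round(confidence * 1000)))
    # BED coordinates are 0-based and half-open, so a full scaffold is [0, length).
    return "\t".join([name, "0", str(length), classification, str(score), "."])


print(to_bed6("chr1", 198765432, "chromosome", 0.95))
print(to_bed6("chr2", 175432198, "chromosome", 0.93))
```

The final column is the strand field, which is left as `.` because a whole-scaffold interval has no orientation.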
GFF3
GFF3 format with classification metadata in attributes:
##gff-version 3
chr1 chromdetect chromosome 1 198765432 0.950 . . ID=chr1;Name=chr1;classification=chromosome;detection_method=name_chr_explicit;chromosome_id=1
...
Options
| Option | Description |
|---|---|
| -f, --format | Output format: summary, json, tsv, bed, gff, html (default: summary) |
| -o, --output | Write output to file instead of stdout |
| -k, --karyotype | Expected chromosome count for karyotype-informed detection |
| -s, --min-size | Minimum size (bp) to consider chromosome-level (default: 10 Mb) |
| -c, --chromosomes-only | Only output chromosome-level scaffolds |
| --extract-chromosomes | Extract chromosome sequences to a FASTA file |
| --batch | Process all FASTA files in a directory |
| --compare | Compare with a second assembly (side-by-side analysis) |
| --patterns | Custom patterns file (YAML or JSON) for scaffold name matching |
| --assembly-report | NCBI assembly report file for authoritative classification |
| --min-confidence | Minimum confidence threshold (0.0-1.0) to include scaffolds |
| --min-length | Minimum scaffold length (bp) to include in output |
| -q, --quiet | Suppress progress messages |
| -v, --verbose | Show detailed processing information |
Supported Naming Conventions
ChromDetect recognizes these naming patterns (case-insensitive):
| Pattern | Examples | Method |
|---|---|---|
| Explicit chromosome | chr1, chromosome_X, Chr_MT | name_chr_explicit |
| Super scaffold | Super_scaffold_1, Superscaffold_X | name_super_scaffold |
| SUPER | SUPER_1, SUPER1 | name_SUPER |
| Linkage group | LG1, LG_X | name_linkage_group |
| NCBI RefSeq | NC_000001.11 | name_ncbi_refseq |
| NCBI GenBank | CM000001.1 | name_ncbi_genbank |
| HiC scaffold | HiC_scaffold_1 | name_hic_scaffold |
| RaGOO | Scaffold_1_RaGOO | name_ragoo |
| Simple numeric | 1, X, MT | name_numeric |
Patterns that indicate unlocalized scaffolds:
*_random, *_unloc*, chrUn_*
Patterns that indicate unplaced scaffolds (contigs/fragments):
*_ctg*, *contig*, *_arrow_*, *_pilon*, *_hap*
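To make the idea of name-based matching concrete, here is a minimal sketch that checks a scaffold name against a few case-insensitive regular expressions and captures the chromosome ID. The regexes are illustrative approximations written for this example, and ChromDetect's actual patterns may differ; `match_name` is a hypothetical helper.

```python
import re

# Illustrative regexes only (assumptions), covering a subset of the table above.
CHROMOSOME_PATTERNS = [
    (re.compile(r"^chr(?:omosome)?_?(\d+|[XYZW]|MT?)$", re.IGNORECASE), "name_chr_explicit"),
    (re.compile(r"^super_?scaffold_?(\d+|[XYZW])$", re.IGNORECASE), "name_super_scaffold"),
    (re.compile(r"^LG_?(\d+|[XYZW])$", re.IGNORECASE), "name_linkage_group"),
    (re.compile(r"^HiC_scaffold_(\d+)$", re.IGNORECASE), "name_hic_scaffold"),
]
# Unlocalized markers: *_random, *_unloc*, chrUn_*
UNLOCALIZED = re.compile(r"_random$|_unloc|^chrUn_", re.IGNORECASE)


def match_name(name):
    """Return (category, method, chromosome_id) for a scaffold name."""
    # Check unlocalized markers first so chr1_random is not read as chr1.
    if UNLOCALIZED.search(name):
        return ("unlocalized", None, None)
    for regex, method in CHROMOSOME_PATTERNS:
        m = regex.match(name)
        if m:
            return ("chromosome", method, m.group(1))
    return ("unknown", None, None)


for name in ["chr1", "Chr_MT", "Super_scaffold_1", "LG_X", "chr1_random", "scaffold_1_cov50"]:
    print(name, match_name(name))
```

Anchoring each regex with `^` and `$` keeps names such as `scaffold_1_cov50` from being mistaken for a chromosome.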
How It Works
ChromDetect combines name-based and size-based detection with these priority rules:
- Strong name match (confidence ≥ 0.8) takes priority
- Large scaffold + weak name match = chromosome with boosted confidence
- Large scaffold + no name match = chromosome with reduced confidence
- Small scaffold = unplaced regardless of name
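The four rules can be read as a decision cascade. The sketch below assumes they are evaluated in the order listed, so a strong name match is accepted before size is considered, and it uses made-up confidence adjustments (+0.2 boost, 0.5 fallback) because the source does not state the actual numbers. The function `classify` is hypothetical.

```python
def classify(name_confidence, length, min_size=10_000_000):
    """Apply the priority rules in listed order.

    name_confidence is 0.0 when no naming pattern matched.
    All numeric adjustments here are illustrative assumptions.
    """
    is_large = length >= min_size
    if name_confidence >= 0.8:
        # Rule 1: a strong name match takes priority.
        return ("chromosome", name_confidence)
    if is_large and name_confidence > 0.0:
        # Rule 2: size supports a weak name match, so boost confidence.
        return ("chromosome", min(1.0, name_confidence + 0.2))
    if is_large:
        # Rule 3: size alone, with reduced confidence.
        return ("chromosome", 0.5)
    # Rule 4: small scaffold without a strong name match.
    return ("unplaced", 0.5)


print(classify(0.95, 16_000))       # strong name, small scaffold
print(classify(0.5, 50_000_000))    # weak name, large scaffold
print(classify(0.0, 50_000_000))    # no name match, large scaffold
print(classify(0.3, 200_000))       # weak name, small scaffold
```

The default `min_size` mirrors the 10 Mb default of `--min-size`.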
When --karyotype is provided:
- If too many candidates: demote lowest-confidence chromosomes
- If too few candidates: promote largest unplaced scaffolds
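The karyotype adjustment can be sketched as a second pass over already classified scaffolds. This is an illustration under stated assumptions, not ChromDetect's implementation: scaffolds are plain dicts, demoted scaffolds become `"unplaced"`, and the function name `adjust_to_karyotype` is hypothetical.

```python
def adjust_to_karyotype(scaffolds, expected):
    """Return a copy of scaffolds with the chromosome count nudged toward `expected`.

    Each scaffold is a dict with name, length, classification, and confidence.
    """
    result = [dict(s) for s in scaffolds]  # copy so the input is not mutated
    chroms = [s for s in result if s["classification"] == "chromosome"]
    if len(chroms) > expected:
        # Too many candidates: demote the lowest-confidence chromosomes.
        chroms.sort(key=lambda s: s["confidence"])
        for s in chroms[: len(chroms) - expected]:
            s["classification"] = "unplaced"
    elif len(chroms) < expected:
        # Too few candidates: promote the largest unplaced scaffolds.
        unplaced = [s for s in result if s["classification"] == "unplaced"]
        unplaced.sort(key=lambda s: s["length"], reverse=True)
        for s in unplaced[: expected - len(chroms)]:
            s["classification"] = "chromosome"
    return result


toy = [
    {"name": "chr1", "length": 90, "classification": "chromosome", "confidence": 0.95},
    {"name": "big_1", "length": 80, "classification": "chromosome", "confidence": 0.50},
    {"name": "scaf_7", "length": 70, "classification": "unplaced", "confidence": 0.40},
    {"name": "scaf_9", "length": 5, "classification": "unplaced", "confidence": 0.40},
]
for expected in (1, 3):
    adjusted = adjust_to_karyotype(toy, expected)
    print(expected, [s["name"] for s in adjusted if s["classification"] == "chromosome"])
```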
Use Cases
VGP Assembly Classification
# Classify scaffolds in a VGP curated assembly
chromdetect species.pri.cur.fasta.gz --karyotype 24 --format json
Multi-Assembly Classification
from chromdetect import classify_fasta
# Classify multiple assemblies independently
species = [
    ("human.fa", 23),
    ("mouse.fa", 20),
    ("zebrafish.fa", 25),
]
for fasta, expected_chr in species:
    results, stats = classify_fasta(fasta)
    print(f"{fasta}: {stats.chromosome_count} chromosomes detected (expected {expected_chr})")
Note: This classifies each assembly independently. ChromDetect does not perform synteny analysis or identify homologous chromosomes across species.
Pipeline Integration
# As part of assembly QC pipeline
chromdetect assembly.fasta --format json | jq '.summary.chromosome_count'
# Export scaffold regions in BED format for downstream analysis
chromdetect assembly.fasta --format bed --chromosomes-only > chromosomes.bed
bedtools getfasta -fi assembly.fasta -bed chromosomes.bed -fo chr_regions.fa
Batch Processing
# Process all assemblies in a directory
chromdetect --batch assemblies/ --format json --output results/
# This creates:
# - results/assembly1.json
# - results/assembly2.json
# - ...
# - results/batch_summary.tsv (overview of all assemblies)
Extract Chromosome Sequences
# Extract only chromosome-level sequences to a new FASTA
chromdetect assembly.fasta --extract-chromosomes chromosomes.fasta
# Combine with other options
chromdetect assembly.fasta \
    --karyotype 24 \
    --extract-chromosomes chromosomes.fasta \
    --format json --output report.json
Using NCBI Assembly Reports
# Download an assembly report from NCBI
# https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.40
# Use it for authoritative scaffold classification
chromdetect GRCh38.fasta --assembly-report GCF_000001405.40_GRCh38.p14_assembly_report.txt
Limitations
ChromDetect uses heuristics and pattern matching—it has inherent limitations:
- Not a validator: ChromDetect classifies scaffolds but cannot detect misassemblies, inversions, or sequence errors. Use QUAST, Merqury, or similar tools for assembly validation.
- Pattern-dependent: Classification relies on naming conventions. Unusual or custom naming schemes may not be recognized without custom patterns.
- Size heuristics are approximate: A 50 Mb scaffold is assumed to be chromosome-level, but could be a misassembly or concatenated contigs.
- No reference comparison: ChromDetect doesn't compare against reference genomes, so it cannot identify missing chromosomes or structural variants.
For critical applications, combine ChromDetect with comprehensive QC tools and manual curation.
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
Adding New Patterns
To add support for a new naming convention:
- Add the regex pattern to chromdetect/patterns.py
- Include a descriptive method name
- Ensure the pattern captures chromosome ID in group 1
- Add tests in tests/test_patterns.py
Example:
# In patterns.py
CHROMOSOME_PATTERNS.append(
    (r'^MyConvention_(\d+)$', 'my_convention'),
)
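Before adding a pattern, it can help to confirm that it compiles, has a capture group, and extracts the expected chromosome ID from sample names. The check below is a standalone sketch that uses only the standard `re` module; `check_pattern` is a hypothetical helper, and the use of `re.IGNORECASE` is an assumption based on the statement that matching is case-insensitive.

```python
import re


def check_pattern(pattern, should_match, should_not_match=()):
    """Validate a candidate chromosome regex.

    should_match maps sample scaffold names to the chromosome ID expected in group 1.
    Raises ValueError on the first problem; returns True otherwise.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    if regex.groups < 1:
        raise ValueError("pattern needs a capture group for the chromosome ID")
    for name, expected_id in should_match.items():
        m = regex.match(name)
        if m is None or m.group(1) != expected_id:
            raise ValueError(f"{name!r} did not yield chromosome ID {expected_id!r}")
    for name in should_not_match:
        if regex.match(name):
            raise ValueError(f"{name!r} matched unexpectedly")
    return True


ok = check_pattern(
    r"^MyConvention_(\d+)$",
    should_match={"MyConvention_7": "7", "myconvention_12": "12"},
    should_not_match=["MyConvention_7_random", "MyConvention_X"],
)
print("pattern ok:", ok)
```

The same sample names make a reasonable starting point for the cases added to tests/test_patterns.py.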
Using Custom Patterns
You can also use custom patterns without modifying the source code:
# custom_patterns.yaml
chromosome_patterns:
  - pattern: "^MyScaffold_(\\d+)$"
    name: "my_scaffold"
  - pattern: "^CustomChr_(\\d+)$"
    name: "custom_chr"
unlocalized_patterns:
  - my_random
fragment_patterns:
  - my_contig
chromdetect assembly.fasta --patterns custom_patterns.yaml
Citation
If you use ChromDetect in your research, please cite it using the metadata from our CITATION.cff file:
@software{chromdetect,
author = {Handley, Scott A.},
title = {ChromDetect: A utility for classifying scaffolds in genome assemblies},
url = {https://github.com/shandley/chromdetect},
version = {0.5.0},
year = {2024}
}
License
MIT License - see LICENSE for details.