Detect chromosome-level scaffolds in genome assemblies with inconsistent naming conventions

ChromDetect

Python 3.9+ | License: MIT

The Problem

Genome assemblies use wildly inconsistent naming conventions for chromosome-level scaffolds:

  • Super_scaffold_1, Superscaffold_1, SUPER_1
  • chr1, chromosome_1, Chr_1
  • LG_1 (linkage groups)
  • scaffold_1_cov50 (coverage-annotated)
  • HiC_scaffold_1, Scaffold_1_RaGOO
  • NC_000001.11, CM000001.1 (NCBI accessions)

This inconsistency makes automated analysis and cross-species comparisons difficult. Existing QC tools like QUAST report metrics but don't classify scaffolds. Scaffolding tools like LACHESIS create assemblies but don't help interpret existing ones.

Why ChromDetect?

Features compared across QUAST, assembly-stats, gfastats, and ChromDetect:

  • N50/N90 statistics
  • Scaffold classification
  • Pattern-based detection
  • Size-based detection
  • Karyotype-aware
  • Multiple output formats
  • Zero dependencies

ChromDetect fills a gap in the genomics toolkit: automatically identifying which scaffolds represent chromosomes rather than just reporting assembly statistics.

The Solution

ChromDetect uses multiple complementary strategies to identify chromosome-level scaffolds:

  1. Name-based detection - Regex patterns for 15+ common naming conventions
  2. Size-based detection - Large scaffolds are typically chromosomes
  3. N50-based detection - Scaffolds at least as long as the N50 are typically chromosome-level
  4. Karyotype-informed detection - Use known chromosome count to adjust classifications
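
To make the N50-based strategy concrete, here is a minimal sketch of how N50 (and N90) can be computed from scaffold lengths and used to shortlist candidates. It is illustrative only and does not reproduce ChromDetect's internals; the helper names `nx` and `n50_candidates` and the toy lengths are assumptions made for this example.

```python
def nx(lengths, x=50):
    """Return the Nx length for a collection of scaffold lengths.

    Scaffolds are sorted longest-first; Nx is the length of the scaffold
    at which the running total first reaches x% of the assembly size.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    threshold = sum(lengths) * x / 100
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length


def n50_candidates(scaffolds):
    """Return names of scaffolds at least as long as the N50.

    `scaffolds` maps scaffold name -> length in bp (hypothetical input shape).
    """
    cutoff = nx(list(scaffolds.values()), 50)
    return [name for name, length in scaffolds.items() if length >= cutoff]


# Toy lengths chosen for readability, not real assembly data.
example = {"chr1": 40, "chr2": 30, "scaf3": 20, "scaf4": 10}
print(nx(list(example.values()), 50))
print(nx(list(example.values()), 90))
print(n50_candidates(example))
```

On its own this heuristic over-calls chromosomes in fragmented assemblies, which is presumably why it is combined with the name-based and karyotype-informed strategies.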

Installation

pip install chromdetect

Or install from source:

git clone https://github.com/shandley/chromdetect.git
cd chromdetect
pip install -e .

Quick Start

Command Line

# Basic usage - get summary
chromdetect assembly.fasta

# Output JSON for programmatic use
chromdetect assembly.fasta --format json --output results.json

# Use karyotype information for better accuracy
chromdetect assembly.fasta --karyotype 24

# Export only chromosome-level scaffolds as TSV
chromdetect assembly.fasta --chromosomes-only --format tsv > chromosomes.tsv

Python API

from chromdetect import parse_fasta, classify_scaffolds

# Parse and classify
scaffolds = parse_fasta("assembly.fasta.gz")
results, stats = classify_scaffolds(scaffolds, expected_chromosomes=24)

# Print summary
print(f"Found {stats.chromosome_count} chromosomes")
print(f"Total assembly: {stats.total_length / 1e9:.2f} Gb")
print(f"N50: {stats.n50 / 1e6:.1f} Mb")

# Access individual scaffold classifications
for r in results:
    if r.classification == "chromosome":
        print(f"{r.name}: {r.length:,} bp (confidence: {r.confidence:.2f})")

Output Formats

Summary (default)

============================================================
CHROMDETECT ASSEMBLY ANALYSIS
============================================================

Total scaffolds:     1,234
Total length:        2,876,543,210 bp (2.88 Gb)
N50:                 45,678,901 bp (45.7 Mb)
N90:                 12,345,678 bp
Largest scaffold:    198,765,432 bp

Scaffold Classification:
  Chromosomes:       24 (2.85 Gb)
  Unlocalized:       15
  Unplaced:          1,195

Chromosome N50:      118,234,567 bp (118.2 Mb)
GC content:          41.2%

JSON

{
  "summary": {
    "total_scaffolds": 1234,
    "chromosome_count": 24,
    "n50": 45678901,
    ...
  },
  "scaffolds": [
    {
      "name": "chr1",
      "length": 198765432,
      "classification": "chromosome",
      "confidence": 0.95,
      "detection_method": "name_chr_explicit",
      "chromosome_id": "1"
    },
    ...
  ]
}
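
A JSON report with this shape can be consumed with the standard library alone. The sketch below assumes only the field names shown above; the sample document, the third scaffold entry (including its `detection_method` value), and the helper `chromosome_names` are hypothetical and exist just to keep the example self-contained.

```python
import json

# Hand-written sample mirroring the field names shown above.
# The values of the last entry are made up for illustration.
sample = """
{
  "summary": {"total_scaffolds": 3, "chromosome_count": 2, "n50": 198765432},
  "scaffolds": [
    {"name": "chr1", "length": 198765432, "classification": "chromosome",
     "confidence": 0.95, "detection_method": "name_chr_explicit", "chromosome_id": "1"},
    {"name": "chr2", "length": 175432198, "classification": "chromosome",
     "confidence": 0.93, "detection_method": "name_chr_explicit", "chromosome_id": "2"},
    {"name": "scaffold_99_ctg1", "length": 54321, "classification": "unplaced",
     "confidence": 0.6, "detection_method": "hypothetical_method", "chromosome_id": null}
  ]
}
"""


def chromosome_names(report, min_confidence=0.9):
    """Return names of scaffolds classified as chromosomes above a confidence cutoff."""
    return [
        s["name"]
        for s in report["scaffolds"]
        if s["classification"] == "chromosome" and s["confidence"] >= min_confidence
    ]


report = json.loads(sample)
print(report["summary"]["chromosome_count"])
print(chromosome_names(report))
```

In practice the report would be read from the file written with `--format json --output results.json` instead of an inline string.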

TSV

name    length    classification    confidence    method    chromosome_id
chr1    198765432    chromosome    0.95    name_chr_explicit    1
chr2    175432198    chromosome    0.93    name_chr_explicit    2
...
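
The TSV output can be loaded with Python's `csv` module. This sketch assumes the columns are tab-separated and named as in the header above; the helper `read_tsv` is hypothetical, and the two rows are copied from the example.

```python
import csv
import io

# Two rows copied from the example above, with explicit tab separators.
sample_tsv = (
    "name\tlength\tclassification\tconfidence\tmethod\tchromosome_id\n"
    "chr1\t198765432\tchromosome\t0.95\tname_chr_explicit\t1\n"
    "chr2\t175432198\tchromosome\t0.93\tname_chr_explicit\t2\n"
)


def read_tsv(handle):
    """Parse a chromdetect-style TSV into dicts with numeric length and confidence."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["length"] = int(row["length"])
        row["confidence"] = float(row["confidence"])
        rows.append(row)
    return rows


rows = read_tsv(io.StringIO(sample_tsv))
total_bp = sum(r["length"] for r in rows)
print(f"{len(rows)} chromosomes, {total_bp:,} bp")
```

For a real file, pass `open("chromosomes.tsv", newline="")` instead of the `io.StringIO` wrapper.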

Options

Option                    Description
-f, --format              Output format: summary, json, tsv (default: summary)
-o, --output              Write output to file instead of stdout
-k, --karyotype           Expected chromosome count for karyotype-informed detection
-s, --min-size            Minimum size (bp) to consider chromosome-level (default: 10 Mb)
-c, --chromosomes-only    Only output chromosome-level scaffolds
-q, --quiet               Suppress progress messages

Supported Naming Conventions

ChromDetect recognizes these naming patterns (case-insensitive):

Pattern                Examples                              Method
Explicit chromosome    chr1, chromosome_X, Chr_MT            name_chr_explicit
Super scaffold         Super_scaffold_1, Superscaffold_X     name_super_scaffold
SUPER                  SUPER_1, SUPER1                       name_SUPER
Linkage group          LG1, LG_X                             name_linkage_group
NCBI RefSeq            NC_000001.11                          name_ncbi_refseq
NCBI GenBank           CM000001.1                            name_ncbi_genbank
HiC scaffold           HiC_scaffold_1                        name_hic_scaffold
RaGOO                  Scaffold_1_RaGOO                      name_ragoo
Simple numeric         1, X, MT                              name_numeric
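
To show what case-insensitive, name-based matching of this kind can look like, here is a small sketch covering four of the conventions above. The regular expressions are written for this example and are not ChromDetect's actual patterns; only the method labels are taken from the table. Each pattern captures the chromosome ID in group 1.

```python
import re

# Illustrative patterns only; the real ones live in chromdetect/patterns.py.
ILLUSTRATIVE_PATTERNS = [
    (re.compile(r"^chr(?:omosome)?[_-]?(\d+|[XYZW]|MT?)$", re.IGNORECASE),
     "name_chr_explicit"),
    (re.compile(r"^super[_-]?scaffold[_-]?(\d+|[XYZW])$", re.IGNORECASE),
     "name_super_scaffold"),
    (re.compile(r"^LG[_-]?(\d+|[XYZW])$", re.IGNORECASE),
     "name_linkage_group"),
    (re.compile(r"^HiC_scaffold_(\d+)$", re.IGNORECASE),
     "name_hic_scaffold"),
]


def match_name(name):
    """Return (method, chromosome_id) for the first matching pattern, else None."""
    for pattern, method in ILLUSTRATIVE_PATTERNS:
        m = pattern.match(name)
        if m:
            return method, m.group(1)
    return None


for name in ["chr1", "chromosome_X", "Chr_MT", "Superscaffold_X", "LG_X", "scaffold_1_cov50"]:
    print(name, "->", match_name(name))
```

Anchoring every pattern with `^` and `$` keeps names such as scaffold_1_cov50 from matching by accident.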

Patterns that indicate unlocalized scaffolds:

  • *_random, *_unloc*, chrUn_*

Patterns that indicate unplaced scaffolds (contigs/fragments):

  • *_ctg*, *contig*, *_arrow_*, *_pilon*, *_hap*
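
The wildcard patterns for unlocalized and unplaced scaffolds can be tried out with Python's `fnmatch` module. This is a sketch under two assumptions: that these patterns are matched case-insensitively like the chromosome patterns, and that unlocalized patterns are checked before unplaced ones. The helper names are hypothetical.

```python
from fnmatch import fnmatchcase

# Glob patterns copied from the two lists above.
UNLOCALIZED_GLOBS = ["*_random", "*_unloc*", "chrUn_*"]
UNPLACED_GLOBS = ["*_ctg*", "*contig*", "*_arrow_*", "*_pilon*", "*_hap*"]


def _matches_any(name, globs):
    """Case-insensitive glob match: lowercase both the name and each pattern."""
    lowered = name.lower()
    return any(fnmatchcase(lowered, g.lower()) for g in globs)


def flag_non_chromosome(name):
    """Return 'unlocalized', 'unplaced', or None if no pattern applies."""
    if _matches_any(name, UNLOCALIZED_GLOBS):
        return "unlocalized"
    if _matches_any(name, UNPLACED_GLOBS):
        return "unplaced"
    return None


for name in ["chr1_random", "chrUn_KI270302v1", "scaffold_12_ctg3", "chr1"]:
    print(name, "->", flag_non_chromosome(name))
```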

How It Works

ChromDetect combines name-based and size-based detection with these priority rules:

  1. Strong name match (confidence ≥ 0.8) takes priority
  2. Large scaffold + weak name match = chromosome with boosted confidence
  3. Large scaffold + no name match = chromosome with reduced confidence
  4. Small scaffold without a strong name match = unplaced, regardless of any weaker name match
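
The priority rules can be sketched as a single function. This is not ChromDetect's implementation: the function name, the size of the confidence boost, and the reduced and unplaced confidence values are hypothetical. Only the 0.8 threshold and the 10 Mb default minimum size come from this README. The sketch checks the rules in the order listed, so a strong name match is decided before size is considered.

```python
STRONG_NAME_CONFIDENCE = 0.8   # threshold from rule 1
DEFAULT_MIN_SIZE = 10_000_000  # 10 Mb, the documented --min-size default

# Hypothetical values, chosen only to make the sketch concrete.
WEAK_NAME_BOOST = 0.25
SIZE_ONLY_CONFIDENCE = 0.5
UNPLACED_CONFIDENCE = 0.5


def combine(name_confidence, length, min_size=DEFAULT_MIN_SIZE):
    """Combine a name-match confidence (0.0 = no match) with scaffold size.

    Returns a (classification, confidence) tuple.
    """
    is_large = length >= min_size
    if name_confidence >= STRONG_NAME_CONFIDENCE:
        # Rule 1: a strong name match takes priority.
        return "chromosome", name_confidence
    if is_large and name_confidence > 0:
        # Rule 2: large scaffold with a weak name match, boosted confidence.
        return "chromosome", min(1.0, name_confidence + WEAK_NAME_BOOST)
    if is_large:
        # Rule 3: large scaffold with no name match, reduced confidence.
        return "chromosome", SIZE_ONLY_CONFIDENCE
    # Rule 4: small scaffold that did not hit rule 1.
    return "unplaced", UNPLACED_CONFIDENCE


print(combine(0.9, 5_000_000))
print(combine(0.5, 50_000_000))
print(combine(0.0, 50_000_000))
print(combine(0.5, 1_000))
```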

When --karyotype is provided:

  • If too many candidates: demote lowest-confidence chromosomes
  • If too few candidates: promote largest unplaced scaffolds
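
The karyotype adjustment can be illustrated the same way. In this sketch, scaffolds are plain dicts using the field names from the JSON output. The function name `apply_karyotype` is hypothetical, and demoting surplus chromosomes to "unplaced" is an assumption, since the README does not say which class they are demoted to.

```python
def apply_karyotype(scaffolds, expected):
    """Return a copy of `scaffolds` with chromosome calls adjusted to `expected`.

    Too many chromosomes: demote the lowest-confidence ones.
    Too few chromosomes: promote the largest unplaced scaffolds.
    """
    result = [dict(s) for s in scaffolds]  # shallow copies, input left untouched
    chromosomes = [s for s in result if s["classification"] == "chromosome"]
    surplus = len(chromosomes) - expected

    if surplus > 0:
        # Assumption: demoted scaffolds become "unplaced".
        for s in sorted(chromosomes, key=lambda s: s["confidence"])[:surplus]:
            s["classification"] = "unplaced"
    elif surplus < 0:
        unplaced = [s for s in result if s["classification"] == "unplaced"]
        largest_first = sorted(unplaced, key=lambda s: s["length"], reverse=True)
        for s in largest_first[:-surplus]:
            s["classification"] = "chromosome"
    return result


# Toy data, not real assembly output.
calls = [
    {"name": "chr1", "length": 300, "classification": "chromosome", "confidence": 0.95},
    {"name": "chr2", "length": 250, "classification": "chromosome", "confidence": 0.90},
    {"name": "big_scaf", "length": 200, "classification": "chromosome", "confidence": 0.40},
    {"name": "scaf_9", "length": 120, "classification": "unplaced", "confidence": 0.50},
    {"name": "scaf_10", "length": 20, "classification": "unplaced", "confidence": 0.50},
]

fewer = apply_karyotype(calls, expected=2)
more = apply_karyotype(calls, expected=4)
print([s["name"] for s in fewer if s["classification"] == "chromosome"])
print([s["name"] for s in more if s["classification"] == "chromosome"])
```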

Use Cases

VGP Assembly Validation

# Validate a VGP curated assembly
chromdetect species.pri.cur.fasta.gz --karyotype 24 --format json

Cross-Species Comparison

from chromdetect import parse_fasta, classify_scaffolds

species_files = ["human.fa", "mouse.fa", "zebrafish.fa"]
karyotypes = [23, 20, 25]

for fasta, n_chr in zip(species_files, karyotypes):
    scaffolds = parse_fasta(fasta)
    results, stats = classify_scaffolds(scaffolds, expected_chromosomes=n_chr)
    print(f"{fasta}: {stats.chromosome_count} chromosomes detected")

Pipeline Integration

# As part of assembly QC pipeline
chromdetect assembly.fasta --format json | jq '.summary.chromosome_count'

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Adding New Patterns

To add support for a new naming convention:

  1. Add the regex pattern to chromdetect/patterns.py
  2. Include a descriptive method name
  3. Ensure the pattern captures chromosome ID in group 1
  4. Add tests in tests/test_patterns.py

Example:

# In patterns.py
CHROMOSOME_PATTERNS.append(
    (r'^MyConvention_(\d+)$', 'my_convention'),
)
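
Before wiring a new pattern into the package, it can help to check the regex on its own. The following sketch uses only the `re` module and the example pattern above; the helper `extract_chromosome_id` and the test function are hypothetical and do not reflect the layout of tests/test_patterns.py.

```python
import re

# The example pattern from above; group 1 captures the chromosome ID.
NEW_PATTERN = r'^MyConvention_(\d+)$'


def extract_chromosome_id(name, pattern=NEW_PATTERN):
    """Return the chromosome ID captured in group 1, or None if there is no match."""
    m = re.match(pattern, name)
    return m.group(1) if m else None


def test_my_convention():
    # Matching name: group 1 holds the chromosome ID.
    assert extract_chromosome_id("MyConvention_12") == "12"
    # Missing ID and trailing suffixes should not match.
    assert extract_chromosome_id("MyConvention_") is None
    assert extract_chromosome_id("MyConvention_12_random") is None


test_my_convention()
```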

Citation

If you use ChromDetect in your research, please cite:

ChromDetect: Chromosome-level scaffold detection for genome assemblies
https://github.com/shandley/chromdetect

License

MIT License - see LICENSE for details.

Related Projects

  • QUAST - Quality assessment tool for genome assemblies
  • Verity - Hi-C-based assembly validation framework
