Detect chromosome-level scaffolds in genome assemblies with inconsistent naming conventions

ChromDetect

A utility to classify scaffolds in genome assemblies based on naming conventions and size.


What It Does

ChromDetect is a simple utility that classifies scaffolds in genome assemblies as chromosomes, unlocalized, or unplaced sequences. It works by:

  1. Matching scaffold names against common naming patterns (chr1, Super_scaffold_1, LG_1, NC_*, etc.)
  2. Using size heuristics (large scaffolds are likely chromosomes)
  3. Adjusting for expected karyotype if you know the chromosome count

Why Use It?

Genome assemblies use inconsistent naming conventions:

Super_scaffold_1, chr1, LG_1, HiC_scaffold_1, NC_000001.11, scaffold_1_cov50...

If you need to quickly identify which scaffolds are chromosomes—for filtering, statistics, or downstream analysis—ChromDetect automates that classification.

This is a utility tool, not a validator. It doesn't detect misassemblies or verify correctness. For assembly QC, use tools like QUAST or Merqury.

Installation

pip install chromdetect

Or install from source:

git clone https://github.com/shandley/chromdetect.git
cd chromdetect
pip install -e .

Example Data

ChromDetect includes synthetic test assemblies in the examples/ directory:

# Try it immediately after installation
chromdetect examples/synthetic_assembly.fasta

# Compare two assembly versions
chromdetect examples/synthetic_assembly.fasta --compare examples/synthetic_assembly_v2.fasta

Downloading Real Genome Assemblies

For testing with real data, we recommend these small, well-annotated assemblies:

Saccharomyces cerevisiae S288C (Yeast, ~12 Mb, 16 chromosomes):

# Using NCBI datasets CLI (install: conda install -c conda-forge ncbi-datasets-cli)
datasets download genome accession GCF_000146045.2 --include genome
unzip ncbi_dataset.zip
chromdetect ncbi_dataset/data/GCF_000146045.2/GCF_000146045.2_R64_genomic.fna

Caenorhabditis elegans (Nematode, ~100 Mb, 6 chromosomes):

datasets download genome accession GCF_000002985.6 --include genome
unzip ncbi_dataset.zip
chromdetect ncbi_dataset/data/GCF_000002985.6/*.fna

Arabidopsis thaliana (Plant, ~135 Mb, 5 chromosomes):

datasets download genome accession GCF_000001735.4 --include genome
unzip ncbi_dataset.zip
chromdetect ncbi_dataset/data/GCF_000001735.4/*.fna --karyotype 5

For more test data options, see NCBI Datasets or GenomeArk (VGP assemblies).

Quick Start

Command Line

# Basic usage - get summary
chromdetect assembly.fasta

# Output JSON for programmatic use
chromdetect assembly.fasta --format json --output results.json

# Use karyotype information for better accuracy
chromdetect assembly.fasta --karyotype 24

# Export only chromosome-level scaffolds as TSV
chromdetect assembly.fasta --chromosomes-only --format tsv > chromosomes.tsv

# Export as BED or GFF format for pipeline integration
chromdetect assembly.fasta --format bed > scaffolds.bed
chromdetect assembly.fasta --format gff > scaffolds.gff

# Extract chromosome sequences to a new FASTA file
chromdetect assembly.fasta --extract-chromosomes chromosomes.fasta

# Batch process multiple assemblies
chromdetect --batch assemblies_dir/ --output results_dir/

# Compare two assemblies side-by-side
chromdetect assembly_v1.fasta --compare assembly_v2.fasta

# Generate visual HTML report
chromdetect assembly.fasta --format html -o report.html

# Use custom naming patterns
chromdetect assembly.fasta --patterns custom_patterns.yaml

# Use NCBI assembly report for accurate classification
chromdetect assembly.fasta --assembly-report GCF_000001405.assembly_report.txt

Python API

# Simple one-liner classification (recommended for most use cases)
from chromdetect import classify_fasta, compare_fasta_files

results, stats = classify_fasta("assembly.fasta")
print(f"Found {stats.chromosome_count} chromosomes")
print(f"N50: {stats.n50 / 1e6:.1f} Mb")

# Compare two assemblies
comparison = compare_fasta_files("assembly_v1.fasta", "assembly_v2.fasta")
print(f"Shared chromosomes: {len(comparison.shared_chromosomes)}")
print(f"N50 change: {comparison.summary()['n50_difference']:,} bp")

For more control, use the lower-level API:

from chromdetect import (
    parse_fasta, classify_scaffolds, write_fasta, format_bed, format_gff,
    parse_assembly_report
)

# Parse and classify with options
scaffolds = parse_fasta("assembly.fasta.gz")
results, stats = classify_scaffolds(scaffolds, expected_chromosomes=24)

# Print summary
print(f"Found {stats.chromosome_count} chromosomes")
print(f"Total assembly: {stats.total_length / 1e9:.2f} Gb")

# Access individual scaffold classifications
for r in results:
    if r.classification == "chromosome":
        print(f"{r.name}: {r.length:,} bp (confidence: {r.confidence:.2f})")

# Export to BED or GFF format
bed_output = format_bed(results)
gff_output = format_gff(results)

# Use NCBI assembly report for authoritative classification
report = parse_assembly_report("assembly_report.txt")
results, stats = classify_scaffolds(scaffolds, assembly_report=report)

Output Formats

Summary (default)

============================================================
CHROMDETECT ASSEMBLY ANALYSIS
============================================================

Total scaffolds:     1,234
Total length:        2,876,543,210 bp (2.88 Gb)
N50:                 45,678,901 bp (45.7 Mb)
N90:                 12,345,678 bp
Largest scaffold:    198,765,432 bp

Scaffold Classification:
  Chromosomes:       24 (2.85 Gb)
  Unlocalized:       15
  Unplaced:          1,195

Chromosome N50:      118,234,567 bp (118.2 Mb)
GC content:          41.2%
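
The N50 and N90 figures in the summary follow the standard definition: sort scaffolds from largest to smallest and report the length of the scaffold at which the running total first reaches 50% (or 90%) of the assembly. A minimal sketch of that definition follows. The helper name `nx` and the toy lengths are illustrative assumptions, not part of ChromDetect's API, and ChromDetect's own implementation may differ in details.

```python
def nx(lengths, percent):
    """Return the Nx statistic (percent=50 gives N50) for scaffold lengths in bp."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * percent:
            return length
    return 0  # empty input


# Made-up lengths for illustration only (total = 300).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print("N50:", nx(toy_lengths, 50))
print("N90:", nx(toy_lengths, 90))
```

Here the running totals are 80, 150, 200, 240, 270, so the 50% mark (150) is reached at the 70 bp scaffold and the 90% mark (270) at the 30 bp scaffold.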

JSON

{
  "summary": {
    "total_scaffolds": 1234,
    "chromosome_count": 24,
    "n50": 45678901,
    ...
  },
  "scaffolds": [
    {
      "name": "chr1",
      "length": 198765432,
      "classification": "chromosome",
      "confidence": 0.95,
      "detection_method": "name_chr_explicit",
      "chromosome_id": "1"
    },
    ...
  ]
}
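
The JSON output can be consumed with the standard library alone. The sketch below parses a small sample shaped like the structure above and lists confidently classified chromosomes. The second scaffold record, its `detection_method` value, the `null` chromosome ID, and the helper name `confident_chromosomes` are assumptions made up for illustration; only the field names of the first record come from the example above.

```python
import json

# Sample shaped like the documented JSON output. The second scaffold is a
# hypothetical record added only to demonstrate filtering.
sample = """
{
  "summary": {"total_scaffolds": 2, "chromosome_count": 1, "n50": 198765432},
  "scaffolds": [
    {"name": "chr1", "length": 198765432, "classification": "chromosome",
     "confidence": 0.95, "detection_method": "name_chr_explicit",
     "chromosome_id": "1"},
    {"name": "scaffold_57", "length": 48210, "classification": "unplaced",
     "confidence": 0.6, "detection_method": "hypothetical_method",
     "chromosome_id": null}
  ]
}
"""


def confident_chromosomes(report, min_confidence=0.8):
    """Return names of scaffolds classified as chromosomes at or above a threshold."""
    return [
        s["name"]
        for s in report["scaffolds"]
        if s["classification"] == "chromosome" and s["confidence"] >= min_confidence
    ]


report = json.loads(sample)
print("Chromosome count:", report["summary"]["chromosome_count"])
print("Confident chromosomes:", confident_chromosomes(report))
```

For a real run, replace `json.loads(sample)` with `json.load(open("results.json"))` on a file written via `--format json --output results.json`.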

TSV

name    length    classification    confidence    method    chromosome_id
chr1    198765432    chromosome    0.95    name_chr_explicit    1
chr2    175432198    chromosome    0.93    name_chr_explicit    2
...
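
The TSV output can be read with Python's `csv` module. This sketch assumes the columns are separated by single tab characters (the rendering above shows them as spaces) and uses the two rows from the example. `read_rows` is a hypothetical helper name.

```python
import csv
import io

# The header and two rows shown above, written with explicit tab separators.
sample_tsv = (
    "name\tlength\tclassification\tconfidence\tmethod\tchromosome_id\n"
    "chr1\t198765432\tchromosome\t0.95\tname_chr_explicit\t1\n"
    "chr2\t175432198\tchromosome\t0.93\tname_chr_explicit\t2\n"
)


def read_rows(handle):
    """Parse ChromDetect-style TSV rows, converting numeric columns."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["length"] = int(row["length"])
        row["confidence"] = float(row["confidence"])
        rows.append(row)
    return rows


rows = read_rows(io.StringIO(sample_tsv))
chromosome_bp = sum(r["length"] for r in rows if r["classification"] == "chromosome")
print(f"{len(rows)} rows, {chromosome_bp:,} bp in chromosomes")
```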

BED

Standard BED6 format for integration with bedtools, IGV, and other genomics tools:

chr1    0    198765432    chromosome    950    .
chr2    0    175432198    chromosome    930    .
...
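
In the example rows, each scaffold spans from 0 to its full length (BED coordinates are 0-based and half-open), and the score column appears to be the confidence scaled to the 0-1000 range that BED allows (0.95 becomes 950). The sketch below reproduces one such line. The scaling rule is inferred from the example rather than confirmed from the source code, and `to_bed6` is a hypothetical helper, not ChromDetect's `format_bed`.

```python
def to_bed6(name, length, classification, confidence):
    """Format one scaffold as a tab-separated BED6 line covering its full length."""
    # Assumed mapping: confidence in [0, 1] scaled to the BED score range [0, 1000].
    score = max(0, min(1000, round(confidence * 1000)))
    fields = [name, "0", str(length), classification, str(score), "."]
    return "\t".join(fields)


print(to_bed6("chr1", 198765432, "chromosome", 0.95))
```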

GFF3

GFF3 format with classification metadata in attributes:

##gff-version 3
chr1    chromdetect    chromosome    1    198765432    0.950    .    .    ID=chr1;Name=chr1;classification=chromosome;detection_method=name_chr_explicit;chromosome_id=1
...
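
GFF3 lines have nine tab-separated columns, with the last holding `key=value` pairs separated by semicolons. A minimal parser for the feature line shown above might look like the following. `parse_gff3_line` is a hypothetical helper, and the sketch ignores comment lines, directives, and multi-valued attributes.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Split one GFF3 feature line into typed fields and an attribute dict."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attr_text = cols
    attributes = {}
    for pair in attr_text.split(";"):
        if pair:
            key, _, value = pair.partition("=")
            # GFF3 percent-encodes reserved characters in attribute values.
            attributes[unquote(key)] = unquote(value)
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),  # GFF3 coordinates are 1-based and inclusive
        "end": int(end),
        "score": None if score == "." else float(score),
        "attributes": attributes,
    }


# The feature line from the example above, rebuilt with explicit tabs.
line = "\t".join([
    "chr1", "chromdetect", "chromosome", "1", "198765432", "0.950", ".", ".",
    "ID=chr1;Name=chr1;classification=chromosome;"
    "detection_method=name_chr_explicit;chromosome_id=1",
])
record = parse_gff3_line(line)
print(record["seqid"], record["end"], record["attributes"]["detection_method"])
```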

Options

Option                   Description
-f, --format             Output format: summary, json, tsv, bed, gff, html (default: summary)
-o, --output             Write output to file instead of stdout
-k, --karyotype          Expected chromosome count for karyotype-informed detection
-s, --min-size           Minimum size (bp) to consider chromosome-level (default: 10 Mb)
-c, --chromosomes-only   Only output chromosome-level scaffolds
--extract-chromosomes    Extract chromosome sequences to a FASTA file
--batch                  Process all FASTA files in a directory
--compare                Compare with a second assembly (side-by-side analysis)
--patterns               Custom patterns file (YAML or JSON) for scaffold name matching
--assembly-report        NCBI assembly report file for authoritative classification
--min-confidence         Minimum confidence threshold (0.0-1.0) to include scaffolds
--min-length             Minimum scaffold length (bp) to include in output
-q, --quiet              Suppress progress messages
-v, --verbose            Show detailed processing information

Supported Naming Conventions

ChromDetect recognizes these naming patterns (case-insensitive):

Pattern               Examples                             Method
Explicit chromosome   chr1, chromosome_X, Chr_MT           name_chr_explicit
Super scaffold        Super_scaffold_1, Superscaffold_X    name_super_scaffold
SUPER                 SUPER_1, SUPER1                      name_SUPER
Linkage group         LG1, LG_X                            name_linkage_group
NCBI RefSeq           NC_000001.11                         name_ncbi_refseq
NCBI GenBank          CM000001.1                           name_ncbi_genbank
HiC scaffold          HiC_scaffold_1                       name_hic_scaffold
RaGOO                 Scaffold_1_RaGOO                     name_ragoo
Simple numeric        1, X, MT                             name_numeric

Patterns that indicate unlocalized scaffolds:

  • *_random, *_unloc*, chrUn_*

Patterns that indicate unplaced scaffolds (contigs/fragments):

  • *_ctg*, *contig*, *_arrow_*, *_pilon*, *_hap*
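
To make the name-based step concrete, here is a minimal sketch of how such patterns could be applied. The regular expressions are simplified stand-ins written for this example and cover only a few table rows. They are not ChromDetect's actual patterns. Checking the unlocalized and unplaced globs before the chromosome regexes, and the `classify_name` helper itself, are also assumptions.

```python
import re
from fnmatch import fnmatchcase

ID = r"(\d+|[A-Z]{1,2})"  # numeric ID, or a short letter ID such as X or MT

# Illustrative subset of the conventions listed above (assumed regexes).
CHROMOSOME_PATTERNS = [
    (re.compile(rf"^chr(?:omosome)?_?{ID}$", re.IGNORECASE), "name_chr_explicit"),
    (re.compile(rf"^super_?scaffold_?{ID}$", re.IGNORECASE), "name_super_scaffold"),
    (re.compile(rf"^LG_?{ID}$", re.IGNORECASE), "name_linkage_group"),
    (re.compile(r"^(NC_\d+\.\d+)$", re.IGNORECASE), "name_ncbi_refseq"),
    (re.compile(r"^HiC_scaffold_(\d+)$", re.IGNORECASE), "name_hic_scaffold"),
]
# Globs from the lists above, lower-cased for case-insensitive matching.
UNLOCALIZED_GLOBS = ["*_random", "*_unloc*", "chrun_*"]
UNPLACED_GLOBS = ["*_ctg*", "*contig*", "*_arrow_*", "*_pilon*", "*_hap*"]


def classify_name(name):
    """Return (classification, method, chromosome_id) based on the name alone."""
    lowered = name.lower()
    if any(fnmatchcase(lowered, glob) for glob in UNLOCALIZED_GLOBS):
        return ("unlocalized", None, None)
    if any(fnmatchcase(lowered, glob) for glob in UNPLACED_GLOBS):
        return ("unplaced", None, None)
    for regex, method in CHROMOSOME_PATTERNS:
        match = regex.match(name)
        if match:
            return ("chromosome", method, match.group(1))
    return ("unknown", None, None)


for example in ["chr1", "Super_scaffold_1", "chr1_random", "scaffold_1_cov50"]:
    print(example, "->", classify_name(example))
```

A name such as `scaffold_1_cov50` matches nothing here, which is where the size heuristics described in the next section would take over.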

How It Works

ChromDetect combines name-based and size-based detection with these priority rules:

  1. Strong name match (confidence ≥ 0.8) takes priority
  2. Large scaffold + weak name match = chromosome with boosted confidence
  3. Large scaffold + no name match = chromosome with reduced confidence
  4. Small scaffold = unplaced regardless of name
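
Read as an ordered list, the rules above can be sketched as a small decision function. This is an interpretation rather than ChromDetect's code: evaluating the rules top to bottom is an assumption, as are the boost amount, the fallback confidence values, and the `combine` helper name. The 10 Mb cutoff is the documented default of `--min-size`.

```python
MIN_CHROMOSOME_SIZE = 10_000_000  # documented default of --min-size (10 Mb)
STRONG_NAME = 0.8                 # threshold stated in rule 1
WEAK_NAME_BOOST = 0.2             # assumed boost for rule 2
SIZE_ONLY_CONFIDENCE = 0.5        # assumed reduced confidence for rule 3
UNPLACED_CONFIDENCE = 0.5         # assumed confidence for rule 4


def combine(name_confidence, length, min_size=MIN_CHROMOSOME_SIZE):
    """Return (classification, confidence) from a name score and a length.

    name_confidence is None when no naming pattern matched.
    """
    is_large = length >= min_size
    if name_confidence is not None and name_confidence >= STRONG_NAME:
        return ("chromosome", name_confidence)            # rule 1
    if is_large and name_confidence is not None:
        boosted = min(1.0, name_confidence + WEAK_NAME_BOOST)
        return ("chromosome", boosted)                    # rule 2
    if is_large:
        return ("chromosome", SIZE_ONLY_CONFIDENCE)       # rule 3
    return ("unplaced", UNPLACED_CONFIDENCE)              # rule 4


print(combine(0.95, 198_765_432))  # strong name match
print(combine(0.5, 50_000_000))    # large scaffold, weak name match
print(combine(None, 50_000_000))   # large scaffold, no name match
print(combine(None, 48_210))       # small scaffold
```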

When --karyotype is provided:

  • If too many candidates: demote lowest-confidence chromosomes
  • If too few candidates: promote largest unplaced scaffolds
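
The karyotype adjustment can be sketched as follows, under stated assumptions: scaffolds are plain dictionaries, demoted chromosomes become "unplaced", and promoted scaffolds keep their existing confidence. `adjust_to_karyotype` is a hypothetical helper written for this example. ChromDetect's actual behavior may differ in any of these details.

```python
def adjust_to_karyotype(scaffolds, expected):
    """Return a copy of scaffolds with the chromosome count nudged toward expected."""
    result = [dict(s) for s in scaffolds]  # copy so the input is not modified
    chromosomes = [s for s in result if s["classification"] == "chromosome"]
    surplus = len(chromosomes) - expected
    if surplus > 0:
        # Too many candidates: demote the lowest-confidence chromosomes.
        chromosomes.sort(key=lambda s: s["confidence"])
        for s in chromosomes[:surplus]:
            s["classification"] = "unplaced"  # assumed target class
    elif surplus < 0:
        # Too few candidates: promote the largest unplaced scaffolds.
        unplaced = [s for s in result if s["classification"] == "unplaced"]
        unplaced.sort(key=lambda s: s["length"], reverse=True)
        for s in unplaced[:-surplus]:
            s["classification"] = "chromosome"
    return result


# Made-up scaffolds for illustration only.
toy = [
    {"name": "chr1", "length": 90, "classification": "chromosome", "confidence": 0.95},
    {"name": "chr2", "length": 80, "classification": "chromosome", "confidence": 0.90},
    {"name": "big_1", "length": 60, "classification": "chromosome", "confidence": 0.60},
    {"name": "frag_1", "length": 50, "classification": "unplaced", "confidence": 0.50},
    {"name": "frag_2", "length": 5, "classification": "unplaced", "confidence": 0.50},
]
for expected in (2, 4):
    adjusted = adjust_to_karyotype(toy, expected)
    names = [s["name"] for s in adjusted if s["classification"] == "chromosome"]
    print(expected, names)
```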

Use Cases

VGP Assembly Classification

# Classify scaffolds in a VGP curated assembly
chromdetect species.pri.cur.fasta.gz --karyotype 24 --format json

Multi-Assembly Classification

from chromdetect import classify_fasta

# Classify multiple assemblies independently
species = [
    ("human.fa", 23),
    ("mouse.fa", 20),
    ("zebrafish.fa", 25),
]

for fasta, expected_chr in species:
    results, stats = classify_fasta(fasta)
    print(f"{fasta}: {stats.chromosome_count} chromosomes detected (expected {expected_chr})")

Note: This classifies each assembly independently. ChromDetect does not perform synteny analysis or identify homologous chromosomes across species.

Pipeline Integration

# As part of assembly QC pipeline
chromdetect assembly.fasta --format json | jq '.summary.chromosome_count'

# Export scaffold regions in BED format for downstream analysis
chromdetect assembly.fasta --format bed --chromosomes-only > chromosomes.bed
bedtools getfasta -fi assembly.fasta -bed chromosomes.bed -fo chr_regions.fa

Batch Processing

# Process all assemblies in a directory
chromdetect --batch assemblies/ --format json --output results/

# This creates:
# - results/assembly1.json
# - results/assembly2.json
# - ...
# - results/batch_summary.tsv  (overview of all assemblies)

Extract Chromosome Sequences

# Extract only chromosome-level sequences to a new FASTA
chromdetect assembly.fasta --extract-chromosomes chromosomes.fasta

# Combine with other options
chromdetect assembly.fasta \
    --karyotype 24 \
    --extract-chromosomes chromosomes.fasta \
    --format json --output report.json

Using NCBI Assembly Reports

# Download an assembly report from NCBI
# https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.40

# Use it for authoritative scaffold classification
chromdetect GRCh38.fasta --assembly-report GCF_000001405.40_GRCh38.p14_assembly_report.txt

Limitations

ChromDetect uses heuristics and pattern matching—it has inherent limitations:

  • Not a validator: ChromDetect classifies scaffolds but cannot detect misassemblies, inversions, or sequence errors. Use QUAST, Merqury, or similar tools for assembly validation.

  • Pattern-dependent: Classification relies on naming conventions. Unusual or custom naming schemes may not be recognized without custom patterns.

  • Size heuristics are approximate: A 50 Mb scaffold is assumed to be chromosome-level, but could be a misassembly or concatenated contigs.

  • No reference comparison: ChromDetect doesn't compare against reference genomes, so it cannot identify missing chromosomes or structural variants.

For critical applications, combine ChromDetect with comprehensive QC tools and manual curation.

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Adding New Patterns

To add support for a new naming convention:

  1. Add the regex pattern to chromdetect/patterns.py
  2. Include a descriptive method name
  3. Ensure the pattern captures chromosome ID in group 1
  4. Add tests in tests/test_patterns.py

Example:

# In patterns.py
CHROMOSOME_PATTERNS.append(
    (r'^MyConvention_(\d+)$', 'my_convention'),
)
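
A test for a new pattern could check the two properties the steps above call for: the regex has a capture group, and that group yields the chromosome ID. The sketch below is self-contained and does not import ChromDetect. The helper `extract_chromosome_id` is hypothetical, and matching case-insensitively is an assumption based on the note that recognized patterns are case-insensitive.

```python
import re

# The example tuple from above: (regex, method name).
PATTERN, METHOD = (r'^MyConvention_(\d+)$', 'my_convention')


def extract_chromosome_id(pattern, scaffold_name):
    """Return capture group 1 if the name matches the pattern, else None."""
    match = re.match(pattern, scaffold_name, flags=re.IGNORECASE)
    return match.group(1) if match else None


def test_my_convention_captures_id():
    # The pattern must define at least one capture group for the chromosome ID.
    assert re.compile(PATTERN).groups >= 1
    assert extract_chromosome_id(PATTERN, "MyConvention_12") == "12"
    assert extract_chromosome_id(PATTERN, "myconvention_3") == "3"
    # Suffixes such as _random must not be swallowed by the chromosome pattern.
    assert extract_chromosome_id(PATTERN, "MyConvention_12_random") is None


test_my_convention_captures_id()
print("pattern checks passed for", METHOD)
```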

Using Custom Patterns

You can also use custom patterns without modifying the source code:

# custom_patterns.yaml
chromosome_patterns:
  - pattern: "^MyScaffold_(\\d+)$"
    name: "my_scaffold"
  - pattern: "^CustomChr_(\\d+)$"
    name: "custom_chr"
unlocalized_patterns:
  - my_random
fragment_patterns:
  - my_contig

Then pass the file to ChromDetect:

chromdetect assembly.fasta --patterns custom_patterns.yaml
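
Since `--patterns` accepts YAML or JSON, it can help to sanity-check a patterns file before using it. The sketch below validates the same structure as the YAML example and serializes it as JSON. It assumes the JSON form uses the same keys as the YAML form, which is not confirmed here, and `validate_patterns` is a hypothetical helper, not part of ChromDetect.

```python
import json
import re

# Same content as custom_patterns.yaml above, as a Python dict.
patterns = {
    "chromosome_patterns": [
        {"pattern": r"^MyScaffold_(\d+)$", "name": "my_scaffold"},
        {"pattern": r"^CustomChr_(\d+)$", "name": "custom_chr"},
    ],
    "unlocalized_patterns": ["my_random"],
    "fragment_patterns": ["my_contig"],
}


def validate_patterns(config):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    for entry in config.get("chromosome_patterns", []):
        name = entry.get("name", "<unnamed>")
        try:
            regex = re.compile(entry.get("pattern", ""))
        except re.error as exc:
            problems.append(f"{name}: invalid regex ({exc})")
            continue
        if regex.groups < 1:
            problems.append(f"{name}: no capture group for the chromosome ID")
    return problems


problems = validate_patterns(patterns)
json_text = json.dumps(patterns, indent=2)  # assumed JSON equivalent of the YAML
print(problems if problems else "patterns look valid")
print(json_text)
```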

Citation

If you use ChromDetect in your research, please cite it using the metadata from our CITATION.cff file:

@software{chromdetect,
  author = {Handley, Scott A.},
  title = {ChromDetect: A utility for classifying scaffolds in genome assemblies},
  url = {https://github.com/shandley/chromdetect},
  version = {0.5.0},
  year = {2024}
}

License

MIT License - see LICENSE for details.

Related Projects

  • QUAST - Quality assessment tool for genome assemblies
  • Verity - Hi-C-based assembly validation framework
