Bioinformatics software for quality control based on species-specific criteria

speccheck

speccheck is a modular command-line tool for collecting, validating, and summarizing quality control (QC) metrics from genomic analysis pipelines. It automatically detects and processes outputs from multiple bioinformatics tools, validates them against customizable criteria, and generates comprehensive reports with optional interactive visualizations.

Features

  • Automatic Module Detection: Supports CheckM, QUAST, Speciator, ARIBA, and Sylph outputs
  • Flexible QC Validation: Define organism-specific quality criteria with pass/fail checks
  • Interactive Reports: Generate HTML dashboards with Plotly visualizations
  • Metadata Integration: Merge external sample metadata into QC reports
  • Rich Logging: Beautiful console output with Rich library
  • Docker Support: Pre-built Docker images available

Installation

From Source

Clone the repository and install with pip:

git clone https://github.com/happykhan/speccheck.git
cd speccheck
pip install -e .

Development Installation

For development with testing and linting tools:

pip install -e '.[dev]'

Note: This project uses modern Python packaging with pyproject.toml (PEP 517/621). See MIGRATION.md for details on the migration from setup.py.

Docker

A Docker image is available for containerized execution:

docker pull happykhan/speccheck

Quick Start

  1. Collect QC data from analysis outputs:
speccheck collect tests/practice_data/Sample_* --output-file results.csv
  2. Generate summary report with visualizations:
speccheck summary qc_results/ --plot
  3. Validate criteria file:
speccheck check --criteria-file criteria.csv

Usage

Command: collect

Collect and validate QC metrics from bioinformatics tool outputs.

speccheck collect [OPTIONS] FILEPATHS...

Options

| Option | Type | Default | Description |
|---|---|---|---|
| FILEPATHS | Positional | Required | File paths (supports wildcards like data/*/*.tsv) |
| --organism | String | Auto-detect | Organism name for criteria matching |
| --sample | String | None | Sample identifier |
| --criteria-file | Path | criteria.csv | CSV file with QC criteria |
| --output-file | Path | qc_results/collected_data.csv | Output CSV path |
| --metadata | Path | None | CSV with additional metadata (requires sample_id column) |
| -v, --verbose | Flag | False | Enable debug logging |
| --version | Flag | - | Show version and exit |

Examples

Basic collection:

speccheck collect data/sample1/*.tsv --sample sample1

With organism specification:

speccheck collect data/ecoli_* --organism "Escherichia coli" --output-file ecoli_qc.csv

With metadata merging:

speccheck collect data/* --metadata sample_info.csv --output-file merged_results.csv

Supported Modules

The collect command automatically detects outputs from:

  • CheckM: Completeness, contamination, genome metrics
  • QUAST: Assembly statistics (N50, contigs, GC content)
  • Speciator: Species identification and confidence
  • ARIBA: Antimicrobial resistance gene detection
  • Sylph: Metagenomic profiling and ANI values

Command: summary

Generate consolidated reports from multiple collected QC files.

speccheck summary [OPTIONS] DIRECTORY

Options

| Option | Type | Default | Description |
|---|---|---|---|
| DIRECTORY | Positional | Required | Directory containing CSV QC reports |
| --output | Path | qc_report | Output directory for summary |
| --species | String | Speciator.speciesName | Column name for species field |
| --sample | String | sample_id | Column name for sample identifier |
| --templates | Path | templates/report.html | HTML template file |
| --plot | Flag | False | Generate interactive plots |
| -v, --verbose | Flag | False | Enable debug logging |
| --version | Flag | - | Show version and exit |

Examples

Basic summary:

speccheck summary qc_results/

With plotting enabled:

speccheck summary qc_results/ --plot --output final_report/

Custom field names:

speccheck summary results/ --sample SampleID --species Species --plot

Output

  • report.csv: Consolidated QC metrics with sorted columns (sample_id, all_checks_passed, .check columns, other fields)
  • report.html: Interactive HTML dashboard (when --plot is enabled)
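
As a rough illustration of what consolidating several collected CSV files involves, the sketch below reads every CSV in a directory and writes one combined table. It is not speccheck's implementation: the helper name `consolidate`, the file names, and the choice to leave missing cells empty are all assumptions made for the example.

```python
import csv
import tempfile
from pathlib import Path


def consolidate(directory):
    """Read every *.csv in `directory` and return (fieldnames, rows).

    Hypothetical helper for illustration only.
    """
    rows, fieldnames = [], []
    for path in sorted(Path(directory).glob("*.csv")):
        with path.open(newline="") as handle:
            reader = csv.DictReader(handle)
            # Keep the union of all column names, in first-seen order.
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            rows.extend(reader)
    return fieldnames, rows


with tempfile.TemporaryDirectory() as tmp:
    # Two small made-up input files standing in for collected QC reports.
    Path(tmp, "a.csv").write_text("sample_id,Checkm.Completeness\nsample1,98.5\n")
    Path(tmp, "b.csv").write_text("sample_id,Quast.N50\nsample2,120000\n")
    fieldnames, rows = consolidate(tmp)

    # Columns missing from a given input file are written as empty cells.
    out_path = Path(tmp, "report_sketch.csv")
    with out_path.open("w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    combined = out_path.read_text()

print(combined)
```

The real report.csv additionally applies the column ordering described under Output Format.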

Command: check

Validate the structure and content of a criteria file.

speccheck check [OPTIONS]

Options

| Option | Type | Default | Description |
|---|---|---|---|
| --criteria-file | Path | criteria.csv | Path to criteria CSV file |
| -v, --verbose | Flag | False | Enable debug logging |
| --version | Flag | - | Show version and exit |

Example

speccheck check --criteria-file config/custom_criteria.csv
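
To give a sense of what structural validation of a criteria file can look like, here is a minimal sketch. It is not the code behind `speccheck check`, which may apply different or additional rules. The column names and operator set come from the Criteria File Format section; the helper name `validate_criteria_rows` is hypothetical.

```python
import csv
import io

# Column names and operators as documented in "Criteria File Format".
REQUIRED_COLUMNS = ["organism", "software", "field", "operator", "threshold"]
VALID_OPERATORS = {">=", "<=", "==", ">", "<"}


def validate_criteria_rows(handle):
    """Return a list of human-readable problems found in a criteria CSV.

    Hypothetical helper for illustration only.
    """
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]

    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        if row["operator"] not in VALID_OPERATORS:
            problems.append(f"line {line_no}: unknown operator {row['operator']!r}")
        try:
            float(row["threshold"])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: threshold is not numeric")
    return problems


# A deliberately broken example: one bad operator, one non-numeric threshold.
example = io.StringIO(
    "organism,software,field,operator,threshold\n"
    "Escherichia coli,Checkm,Completeness,>=,95\n"
    "Escherichia coli,Checkm,Contamination,=<,5\n"
    "Escherichia coli,Quast,N50,>=,high\n"
)
for problem in validate_criteria_rows(example):
    print(problem)
```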

Criteria File Format

The criteria file defines organism-specific QC thresholds in CSV format:

organism,software,field,operator,threshold
Escherichia coli,Checkm,Completeness,>=,95
Escherichia coli,Checkm,Contamination,<=,5
Escherichia coli,Quast,N50,>=,50000

Columns:

  • organism: Species or genus name (use "all" for universal criteria)
  • software: Tool name (CheckM, QUAST, Speciator, ARIBA, Sylph)
  • field: Metric name from tool output
  • operator: Comparison operator (>=, <=, ==, >, <)
  • threshold: Numeric threshold value
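
The sketch below shows one way rows like these could be turned into pass/fail checks. It is an illustration under stated assumptions rather than speccheck's own logic: the function name `evaluate` and the shape of the `metrics` dictionary are hypothetical, and organism matching is reduced to an exact match or "all" (genus-level matching is not attempted).

```python
import operator

# Map the documented operator strings onto Python comparison functions.
OPERATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}


def evaluate(criteria, organism, metrics):
    """Return {"software.field.check": bool} for the criteria that apply.

    `criteria` is a list of dicts with the five documented columns and
    `metrics` maps "software.field" to a numeric value. Both shapes are
    assumptions made for this sketch.
    """
    results = {}
    for rule in criteria:
        if rule["organism"] not in ("all", organism):
            continue  # rule targets a different organism
        key = f"{rule['software']}.{rule['field']}"
        if key not in metrics:
            continue  # metric not reported for this sample
        compare = OPERATORS[rule["operator"]]
        results[f"{key}.check"] = compare(float(metrics[key]), float(rule["threshold"]))
    return results


criteria = [
    {"organism": "Escherichia coli", "software": "Checkm", "field": "Completeness",
     "operator": ">=", "threshold": "95"},
    {"organism": "Escherichia coli", "software": "Checkm", "field": "Contamination",
     "operator": "<=", "threshold": "5"},
]
checks = evaluate(
    criteria,
    "Escherichia coli",
    {"Checkm.Completeness": 89.3, "Checkm.Contamination": 0.8},
)
print(checks)
print("all_checks_passed:", all(checks.values()))
```

With these inputs the completeness check fails (89.3 is below 95) and the contamination check passes, so the overall result is a fail.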

Metadata Integration

Add external sample metadata using the --metadata option:

metadata.csv:

sample_id,location,sequencing_date,batch
sample1,Lab A,2024-01-15,Batch1
sample2,Lab B,2024-01-16,Batch1

Then pass the file to collect:

speccheck collect data/* --metadata metadata.csv --output-file results.csv

Metadata columns are automatically merged with QC metrics based on sample_id.
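
Conceptually this is a join on sample_id. The sketch below shows the idea using only the standard library. It is not speccheck's implementation, and two of its behaviours are assumptions: samples without metadata are kept unchanged, and QC values win when a column name appears in both files. The QC column and the sample3 row are made up for the example.

```python
import csv
import io

qc_csv = io.StringIO(
    "sample_id,Checkm.Completeness\n"
    "sample1,98.5\n"
    "sample2,89.3\n"
    "sample3,97.0\n"
)
metadata_csv = io.StringIO(
    "sample_id,location,sequencing_date,batch\n"
    "sample1,Lab A,2024-01-15,Batch1\n"
    "sample2,Lab B,2024-01-16,Batch1\n"
)

# Index the metadata rows by sample_id for quick lookup.
metadata = {row["sample_id"]: row for row in csv.DictReader(metadata_csv)}

merged = []
for qc_row in csv.DictReader(qc_csv):
    # Samples with no metadata entry get an empty dict (assumed behaviour).
    extra = metadata.get(qc_row["sample_id"], {})
    # Later keys override earlier ones, so QC values take precedence.
    merged.append({**extra, **qc_row})

for row in merged:
    print(row)
```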


Output Format

CSV Column Order

Output files are automatically organized for readability:

  1. Sample identifier (sample_id or Sample)
  2. Overall checks (columns ending with all_checks_passed)
  3. Individual checks (columns ending with .check) - sorted alphabetically
  4. Metrics (remaining columns) - sorted alphabetically
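
The four groups above can be expressed as a small sorting routine. This is a sketch of the documented ordering, not the project's code; the function name `order_columns` is hypothetical, and placing the bare all_checks_passed column ahead of the per-tool ones is an assumption based on the example output.

```python
def order_columns(columns):
    """Order column names following the four documented groups."""
    # 1. Sample identifier.
    ids = [c for c in columns if c in ("sample_id", "Sample")]
    # 2. Overall checks, with the global all_checks_passed column first.
    overall = sorted(
        (c for c in columns if c.endswith("all_checks_passed")),
        key=lambda c: (c != "all_checks_passed", c),
    )
    # 3. Individual checks, alphabetical.
    checks = sorted(c for c in columns if c.endswith(".check"))
    # 4. Everything else, alphabetical.
    placed = set(ids) | set(overall) | set(checks)
    metrics = sorted(c for c in columns if c not in placed)
    return ids + overall + checks + metrics


columns = [
    "Checkm.Contamination",
    "Checkm.Completeness.check",
    "sample_id",
    "Checkm.all_checks_passed",
    "Checkm.Completeness",
    "all_checks_passed",
    "Checkm.Contamination.check",
]
print(order_columns(columns))
```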

Example Output

sample_id,all_checks_passed,Checkm.all_checks_passed,Checkm.Completeness.check,Checkm.Contamination.check,Checkm.Completeness,Checkm.Contamination
sample1,True,True,True,True,98.5,1.2
sample2,False,False,False,True,89.3,0.8
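
A file in this layout is easy to post-process. The sketch below lists, for each failing sample, which individual checks did not pass. It assumes booleans are written as the strings True and False, as in the example above.

```python
import csv
import io

# The example output from above, inlined so the sketch is self-contained.
report = io.StringIO(
    "sample_id,all_checks_passed,Checkm.all_checks_passed,"
    "Checkm.Completeness.check,Checkm.Contamination.check,"
    "Checkm.Completeness,Checkm.Contamination\n"
    "sample1,True,True,True,True,98.5,1.2\n"
    "sample2,False,False,False,True,89.3,0.8\n"
)

failed = {}
for row in csv.DictReader(report):
    if row["all_checks_passed"] == "True":
        continue
    # Collect the individual checks that did not pass for this sample.
    failed[row["sample_id"]] = [
        name for name, value in row.items()
        if name.endswith(".check") and value != "True"
    ]

print(failed)
```

For the two rows shown, only sample2 is reported, with Checkm.Completeness.check as the failing check.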

Development

Running Tests

pytest
pytest --cov=speccheck  # With coverage

Code Quality

pylint speccheck/

Project Structure

speccheck/
├── speccheck/
│   ├── __init__.py
│   ├── main.py              # Core logic
│   ├── collect.py           # File collection & writing
│   ├── criteria.py          # Criteria validation
│   ├── report.py            # Report generation
│   ├── modules/             # Tool-specific parsers
│   │   ├── checkm.py
│   │   ├── quast.py
│   │   ├── speciator.py
│   │   ├── ariba.py
│   │   └── sylph.py
│   └── plot_modules/        # Visualization modules
│       ├── plot_checkm.py
│       ├── plot_quast.py
│       └── ...
├── tests/                   # Pytest test suite
├── templates/               # HTML templates
├── speccheck.py             # CLI entry point
└── setup.py                 # Package configuration

Dependencies

  • Core: rich, typer, pandas, jinja2, plotly
  • Dev: pytest, pytest-cov, pylint, coverage

Version

Check the installed version:

speccheck --version

License

This project is licensed under the GNU General Public License v3.0 (GPLv3). See LICENSE for details.


Contributing

Contributions are welcome! We appreciate bug reports, feature requests, documentation improvements, and code contributions.

Quick Start for Contributors

  1. Fork the repository
  2. Install development dependencies: pip install -e '.[dev]'
  3. Install pre-commit hooks: pre-commit install
  4. Create a feature branch: git checkout -b feature/your-feature
  5. Make your changes and add tests
  6. Run checks: pytest --cov=speccheck && ruff check speccheck/
  7. Submit a pull request

For detailed guidelines, see CONTRIBUTING.md.

Code Quality

This project uses:

  • Black for code formatting
  • Ruff for fast linting
  • Pylint for comprehensive code analysis
  • pytest with coverage reporting
  • pre-commit hooks for automated checks

All PRs must pass CI checks including tests on Python 3.10, 3.11, and 3.12 across Ubuntu, macOS, and Windows.


Citation

If you use speccheck in your research, please cite:

[Citation information to be added]
