A rule-based validation engine for RNA-seq count matrices and sample metadata

BioFlowValidator

A transparent, rule-based validator for RNA-seq differential expression analysis workflows.

BioFlowValidator catches common scientific and computational errors in RNA-seq data before expensive analysis begins, acting as a pre-analysis guard rail for wet-lab biologists, students, and clinical researchers.


Features

  • ✅ 32 validation rules across 5 categories (format, sample, gene ID, normalization, biology)
  • 🔬 Detects: sample mismatches, mixed gene ID namespaces, pre-normalized counts, too few replicates, library size outliers, and more
  • 📊 Human-readable HTML report + machine-readable JSON
  • 🚀 REST API (FastAPI) + React/TypeScript frontend
  • 🐳 Single-command Docker startup
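
To give a feel for two of the detections listed above, here is a minimal sketch of a mixed gene ID namespace check and a pre-normalized count check. The regular expressions, function names, and the "any fractional value" heuristic are assumptions made for illustration; they are not taken from the BioFlowValidator source.

```python
import re

# Hypothetical namespace patterns, chosen for illustration only.
NAMESPACE_PATTERNS = {
    "ensembl": re.compile(r"^ENS[A-Z]*G\d{11}(\.\d+)?$"),
    "entrez": re.compile(r"^\d+$"),
    "symbol": re.compile(r"^[A-Za-z][A-Za-z0-9\-.]*$"),
}


def classify_gene_id(gene_id: str) -> str:
    """Return the first namespace whose pattern matches, else 'unknown'."""
    for name, pattern in NAMESPACE_PATTERNS.items():
        if pattern.match(gene_id):
            return name
    return "unknown"


def namespaces_in(gene_ids: list[str]) -> set[str]:
    """Collect the distinct namespaces seen across all gene IDs."""
    return {classify_gene_id(g) for g in gene_ids}


def looks_pre_normalized(values: list[float]) -> bool:
    """Raw counts are whole numbers, so any fractional value is suspicious."""
    return any(not float(v).is_integer() for v in values)
```

A matrix whose IDs yield more than one namespace from `namespaces_in` would be a candidate for a mixed-namespace finding, and a `True` from `looks_pre_normalized` suggests the values may be TPM, FPKM, or similar rather than raw counts.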

Quick Start

Docker (recommended)

git clone https://github.com/Rashidmstar12/BioFlowValidator.git
cd BioFlowValidator
docker compose up --build

Open http://localhost:3000 in your browser.

Local Development

Backend:

cd backend
pip install -r requirements.txt
uvicorn app.main:app --reload --port 8000

Frontend:

cd frontend
npm install
npm run dev

Open http://localhost:5173.


Inputs

| File | Format | Required |
|---|---|---|
| Count matrix | TSV / CSV / XLSX (genes × samples or samples × genes) | ✅ |
| Sample metadata | TSV / CSV (sample IDs + condition column) | Optional |
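
Because the count matrix may arrive in either orientation, a validator has to decide which axis holds the genes. The sketch below parses TSV or CSV text with the standard library and applies one possible heuristic (the longer axis holds genes). The function names and the heuristic are assumptions for illustration, not the project's actual FileParser logic, and XLSX is not covered.

```python
import csv
import io


def read_matrix(text: str) -> tuple[list[str], list[str], list[list[float]]]:
    """Parse non-empty TSV or CSV text into (row_labels, column_labels, values)."""
    # Pick the delimiter from the header line: tab if present, otherwise comma.
    delimiter = "\t" if "\t" in text.splitlines()[0] else ","
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header, body = rows[0], rows[1:]
    col_labels = header[1:]  # the first header cell labels the ID column
    row_labels = [row[0] for row in body]
    values = [[float(cell) for cell in row[1:]] for row in body]
    return row_labels, col_labels, values


def guess_orientation(n_rows: int, n_cols: int) -> str:
    """Assumed heuristic: the longer axis holds genes."""
    return "genes_x_samples" if n_rows >= n_cols else "samples_x_genes"
```

For a typical experiment with thousands of genes and a handful of samples the heuristic is unambiguous; small test matrices are where a real implementation would likely need to inspect the labels as well.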

Validation Rule Categories

| Category | Rules | Description |
|---|---|---|
| Format | FMT-001 – FMT-008 | Encoding, delimiters, headers, duplicates, non-negatives, matrix orientation |
| Sample | SMP-001 – SMP-005 | Sample ID matching, duplicates, replicates, near-identical replicate diagnostics |
| Gene ID | GEN-001 – GEN-005 | Namespace consistency, duplicates, version suffixes, organism detection |
| Normalization | NRM-001 – NRM-006 | Integer counts, library size ratios, zero genes, duplicate count profiles |
| Biology | BIO-001 – BIO-008 | Single condition, MT fraction, label sanity, batch confounding, ERCC spike-ins |

See docs/validation_rules.md for the full rule reference.
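
As a rough illustration of what checks in the Sample and Normalization categories might compute, the sketch below covers sample ID matching, replicate counting, and a library size ratio. The function names and the minimum of three replicates are assumptions made for this example; the actual rule definitions and thresholds are in docs/validation_rules.md.

```python
from collections import Counter


def sample_id_mismatches(matrix_samples, metadata_samples):
    """Return IDs found on one side only: (only_in_matrix, only_in_metadata)."""
    in_matrix, in_metadata = set(matrix_samples), set(metadata_samples)
    return sorted(in_matrix - in_metadata), sorted(in_metadata - in_matrix)


def conditions_with_few_replicates(conditions, minimum=3):
    """Return conditions with fewer than `minimum` samples (assumed threshold)."""
    counts = Counter(conditions)
    return sorted(c for c, n in counts.items() if n < minimum)


def library_size_ratio(counts_by_sample):
    """Ratio of the largest to the smallest per-sample total count."""
    totals = [sum(values) for values in counts_by_sample.values()]
    if min(totals) == 0:
        return float("inf")  # an empty library makes the ratio unbounded
    return max(totals) / min(totals)
```

A large ratio from `library_size_ratio` points to a sample that was sequenced far more deeply or shallowly than the rest, which is the kind of outlier the Normalization rules are described as flagging.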


Running Tests

cd backend
python -m pytest tests/ -v

Run the dataset benchmark:

python datasets/benchmark.py

API Reference

See docs/api_spec.md or browse the interactive docs at http://localhost:8000/docs.


Repository Structure

BioFlowValidator/
├── backend/           # Python FastAPI application
│   ├── app/
│   │   ├── engine/    # FileParser, RuleRegistry, RuleRunner
│   │   ├── models/    # RuleResult, ValidationReport, ValidationContext
│   │   ├── rules/     # format/, sample/, gene/, normalization/, biology/
│   │   ├── report/    # JSONExporter, HTMLExporter
│   │   └── routers/   # FastAPI route handlers
│   └── tests/         # Unit + integration tests
├── frontend/          # React + TypeScript + Vite SPA
├── datasets/          # Valid + faulty example datasets + benchmark
├── docs/              # API spec, validation rules reference
├── Dockerfile.backend
├── Dockerfile.frontend
└── docker-compose.yml
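
The engine/ and models/ directories name a registry, a runner, and a result type. The interfaces of those classes are not shown here, so the following is only a generic sketch of how a rule registry and runner can fit together. Everything in it, including the `register` decorator, the dictionary-based context, and the `DEMO-001` rule ID, is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RuleResult:
    """Illustrative result record; the project's RuleResult may differ."""
    rule_id: str
    passed: bool
    message: str


# Hypothetical registry: maps a rule ID to its check function.
RULES: dict[str, Callable[[dict], RuleResult]] = {}


def register(rule_id: str):
    """Decorator that stores a check function under the given rule ID."""
    def decorator(func):
        RULES[rule_id] = func
        return func
    return decorator


@register("DEMO-001")  # placeholder ID, not a real BioFlowValidator rule
def non_negative_counts(context: dict) -> RuleResult:
    negatives = [v for row in context["values"] for v in row if v < 0]
    return RuleResult("DEMO-001", not negatives, f"{len(negatives)} negative value(s) found")


def run_all(context: dict) -> list[RuleResult]:
    """Run every registered rule in sorted ID order so output order is stable."""
    return [RULES[rule_id](context) for rule_id in sorted(RULES)]
```

Keeping each rule as a small function behind an ID keeps individual checks independently testable, which fits a layout with one directory per rule category.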

Design Principles

  • Validation only: no analysis, no statistical computation
  • Transparent: every rule has a documented ID, description, and suggestion
  • Auditable: the JSON report includes the file's SHA-256 hash and a timestamp
  • Scientifically conservative: ambiguous cases produce a WARNING, not an ERROR
  • Reproducible: the same inputs always produce identical outputs
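
The auditability and conservatism principles can be sketched in a few lines. The example below builds a report dictionary carrying a SHA-256 hash of the input bytes and a UTC timestamp, and maps ambiguous findings to WARNING. The field names (`file_sha256`, `generated_at`, `findings`) and helper names are assumptions for illustration; the real report schema is described in docs/api_spec.md.

```python
import hashlib
import json
from datetime import datetime, timezone


def severity_for(definite: bool) -> str:
    """Conservative mapping: only definite problems become ERROR."""
    return "ERROR" if definite else "WARNING"


def build_report(file_bytes: bytes, findings: list[dict]) -> dict:
    """Assemble a report dict; the field names here are illustrative."""
    return {
        "file_sha256": hashlib.sha256(file_bytes).hexdigest(),
        "generated_at": datetime.now(timezone.utc).isoformat(),
        # Sort findings by rule ID so their order does not depend on run order.
        "findings": sorted(findings, key=lambda f: f["rule_id"]),
    }


def to_json(report: dict) -> str:
    """Serialize with sorted keys so equal reports give byte-identical JSON."""
    return json.dumps(report, sort_keys=True, indent=2)
```

In this sketch the hash and the findings depend only on the input, while the timestamp naturally differs between runs; how the project treats the timestamp with respect to reproducibility is not stated above.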

License

MIT

Download files

Source Distribution

bioflowvalidator-1.0.0.tar.gz (36.1 kB)

Uploaded Source

Built Distribution

bioflowvalidator-1.0.0-py3-none-any.whl (41.7 kB)

Uploaded Python 3

File details

Details for the file bioflowvalidator-1.0.0.tar.gz.

File metadata

  • Download URL: bioflowvalidator-1.0.0.tar.gz
  • Upload date:
  • Size: 36.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for bioflowvalidator-1.0.0.tar.gz
| Algorithm | Hash digest |
|---|---|
| SHA256 | a996d0aa2e45ef7c51c2401748a60353451d9e373a010edec5347ca9bd9d5034 |
| MD5 | ad4e7ee62ca9952e518a9d67dea01746 |
| BLAKE2b-256 | de3e663b483630d1aafd1cc24f5d279bc914bee872104ab430f6d972981b3691 |
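
A downloaded file can be checked against the SHA256 digest listed above using only the Python standard library. This is a minimal sketch; the function names are illustrative, and the path you pass is wherever you saved the file.

```python
import hashlib
import hmac


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected_hex: str) -> bool:
    """Compare the computed digest with the published one, ignoring case."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())
```

For example, `matches_expected("bioflowvalidator-1.0.0.tar.gz", "a996d0aa2e45ef7c51c2401748a60353451d9e373a010edec5347ca9bd9d5034")` should return `True` for an intact copy of the source distribution.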

Provenance

The following attestation bundles were made for bioflowvalidator-1.0.0.tar.gz:

Publisher: publish.yml on Rashidmstar12/BioFlowValidator

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file bioflowvalidator-1.0.0-py3-none-any.whl.

File hashes

Hashes for bioflowvalidator-1.0.0-py3-none-any.whl
| Algorithm | Hash digest |
|---|---|
| SHA256 | 75bd6a612205e756ebe912b6325262159446ef5cd67981872d3c4fec12da93a7 |
| MD5 | efde47aa45a5fd7cb630d880a37539d9 |
| BLAKE2b-256 | 8e62f9a530b4002d3fd9422f96064158c17c31cbfd9387a356b97f2fc9bc4273 |

Provenance

The following attestation bundles were made for bioflowvalidator-1.0.0-py3-none-any.whl:

Publisher: publish.yml on Rashidmstar12/BioFlowValidator

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
