
SchemaGlow

Human-friendly schema diff and contract drift detection for CSV, JSON, JSONL, Parquet, OpenAPI, Avro, and protobuf.

SchemaGlow compares data files, schema artifacts, directory trees, and saved contract snapshots. It tells you what changed, whether it is safe, and what might break. It is built for pull request review, CI checks, repository-wide drift scans, and baseline contract validation when raw git diffs are not enough.

Why

Most tools in this space validate data contracts, inspect file structure, or diff technical schemas in a format-specific way. SchemaGlow focuses on a narrower workflow:

  • compare two file versions quickly
  • explain changes in plain language
  • classify impact as SAFE, WARNING, or BREAKING
  • export machine-readable and review-friendly reports

Features

  • Compare CSV, JSON, JSONL, Parquet, OpenAPI, Avro, and protobuf sources with one CLI.
  • Infer normalized schema snapshots from both raw data files and schema-definition files.
  • Classify compatibility changes as SAFE, WARNING, or BREAKING.
  • Export diff output as terminal text, JSON, Markdown, or HTML.
  • Save schema snapshots and compare them later without re-reading source files.
  • Scan two directory trees recursively and aggregate drift into one report.
  • Capture baseline contract files and check candidate trees against committed baselines.
  • Detect optional nested expansions, removals, type changes, nullability changes, sample-shape ambiguity, and column-order-only changes.
  • Support ignore rules, strict numeric widening, and rename heuristics with sample overlap.

Installation

pip install schemaglow

For local development:

python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

CLI

schemaglow diff

Compare two files directly.

schemaglow diff old.parquet new.parquet
schemaglow diff baseline.jsonl candidate.jsonl --format json
schemaglow diff old.openapi.yaml new.openapi.yaml
schemaglow diff old.avsc new.avsc
schemaglow diff old.proto new.proto --report html --report-path proto-report.html
schemaglow diff old.csv new.csv --report markdown --report-path schema-report.md
schemaglow diff old.csv new.csv --ignore-fields '(^_loaded_at$|^metadata\.)'
schemaglow diff old.csv new.csv --strict --rename-heuristics

Example text output:

SchemaGlow Report

BREAKING
old: old.csv
new: new.csv
counts: SAFE=1 WARNING=0 BREAKING=1
BREAKING
- removed field: order_total
SAFE
+ column order changed only
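
For CI gating, the JSON report can be parsed directly. The sketch below is illustrative, not an official recipe: it assumes --format json prints a JSON document to stdout and that the report exposes the same SAFE/WARNING/BREAKING counts shown in the text output above under a counts key, which should be verified against the installed version.

import json
import subprocess
import sys

# Run the diff and capture the machine-readable report.
proc = subprocess.run(
    ["schemaglow", "diff", "old.csv", "new.csv", "--format", "json"],
    capture_output=True,
    text=True,
)
report = json.loads(proc.stdout)

# "counts" is an assumed key name, inferred from the text report
# ("counts: SAFE=1 WARNING=0 BREAKING=1"); check the real report schema.
breaking = report.get("counts", {}).get("BREAKING", 0)
if breaking:
    print(f"{breaking} breaking schema change(s) detected", file=sys.stderr)
    sys.exit(1)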

schemaglow inspect

Infer a snapshot from one file and print its normalized field model.

schemaglow inspect data.json
schemaglow inspect data.parquet --format json
schemaglow inspect openapi.yaml --format json
schemaglow inspect schema.proto

schemaglow snapshot

Persist an inferred snapshot to JSON for later comparison.

schemaglow snapshot data.jsonl -o snapshots/baseline.schema.json
schemaglow snapshot schema.avsc -o snapshots/avro.schema.json
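
The on-disk snapshot layout is not documented in this README. As a rough illustration only, a normalized field model carrying type, nullability, order, and sample hints (see the processing flow below) might look something like the following hypothetical structure; the actual JSON written by schemaglow snapshot may differ.

# Hypothetical snapshot shape for illustration; not the real format.
snapshot = {
    "source": "data.jsonl",
    "fields": [
        {"name": "order_id", "type": "integer", "nullable": False, "order": 0},
        {"name": "order_total", "type": "number", "nullable": True, "order": 1,
         "samples": ["19.99", "42.00"]},
    ],
}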

schemaglow compare

Compare two saved schema snapshots.

schemaglow compare old.schema.json new.schema.json
schemaglow compare old.schema.json new.schema.json --format json

schemaglow scan

Compare two directory trees recursively and aggregate the results.

schemaglow scan datasets/baseline datasets/candidate
schemaglow scan specs/old specs/new --format json
schemaglow scan repo-old repo-new --pattern '*.proto' --report markdown --report-path scan.md

schemaglow baseline capture

Capture a repository-local contract baseline composed of saved snapshots.

schemaglow baseline capture data/ -o .schemaglow-baseline
schemaglow baseline capture specs/ -o contracts/api --pattern '*.yaml'

schemaglow baseline check

Compare a candidate tree against committed baseline contract files.

schemaglow baseline check .schemaglow-baseline data/
schemaglow baseline check contracts/api specs/ --format json

Supported Inputs

Format     Typical suffixes     Notes
CSV        .csv                 Header-driven field discovery with scalar inference
JSON       .json                Raw object or array data; OpenAPI JSON is auto-detected
JSONL      .jsonl               One JSON object per line
Parquet    .parquet             Schema extracted with PyArrow
OpenAPI    .yaml, .yml, .json   Local refs, component schemas, request/response schemas
Avro       .avsc                Records, arrays, maps, enums, unions
Protobuf   .proto               Messages, enums, repeated fields, and maps
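
Suffix-based routing plus a document check is enough to reproduce the detection behavior in the table. The sketch below is a minimal approximation, not the library's actual implementation; the top-level "openapi" key test is an assumed heuristic for the auto-detection noted above.

import json
from pathlib import Path

# Suffix map transcribed from the table above.
SUFFIX_FORMATS = {
    ".csv": "csv",
    ".jsonl": "jsonl",
    ".parquet": "parquet",
    ".avsc": "avro",
    ".proto": "protobuf",
    ".yaml": "openapi",
    ".yml": "openapi",
}

def detect_format(path: Path) -> str:
    suffix = path.suffix.lower()
    if suffix == ".json":
        # OpenAPI JSON is auto-detected; checking for a top-level
        # "openapi" key is one plausible way to do that.
        doc = json.loads(path.read_text())
        return "openapi" if isinstance(doc, dict) and "openapi" in doc else "json"
    if suffix in SUFFIX_FORMATS:
        return SUFFIX_FORMATS[suffix]
    raise ValueError(f"unsupported input: {path}")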

Compatibility Rules

SAFE

  • new nullable or optional top-level field
  • column order changed only
  • numeric widening from integer to number unless --strict is enabled
  • no schema change

WARNING

  • new required top-level field
  • nested object shape expanded
  • required to nullable change
  • ambiguous or mixed-type widening
  • sample shape changed while remaining string-typed
  • likely rename detected with --rename-heuristics

BREAKING

  • field removed
  • nullable to required change
  • incompatible type change such as string -> integer
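
These rules map naturally onto a small classifier. The sketch below paraphrases the documented rules; it is not the code in diffing.py, and the ChangeEvent shape is invented for illustration.

from dataclasses import dataclass

@dataclass
class ChangeEvent:
    kind: str            # e.g. "field_removed", "field_added", "type_changed"
    nullable: bool = False
    old_type: str = ""
    new_type: str = ""

def classify(event: ChangeEvent, strict: bool = False) -> str:
    # Transcribed from the compatibility rules above.
    if event.kind in ("field_removed", "nullable_to_required"):
        return "BREAKING"
    if event.kind == "field_added":
        return "SAFE" if event.nullable else "WARNING"
    if event.kind in ("column_order_changed", "no_change"):
        return "SAFE"
    if event.kind == "required_to_nullable":
        return "WARNING"
    if event.kind == "type_changed":
        if event.old_type == "integer" and event.new_type == "number":
            # SAFE widening by default; under --strict it is no longer
            # SAFE (WARNING is assumed here, the README does not say).
            return "WARNING" if strict else "SAFE"
        return "BREAKING"
    return "WARNING"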

Architecture

The package uses a small pipeline that mirrors the product brief.

src/schemaglow/
├── cli.py             # Typer command surface
├── service.py         # File and snapshot orchestration
├── infer.py           # Format detection and schema inference
├── schema_sources.py  # OpenAPI, Avro, and protobuf parsers
├── diffing.py         # Compatibility rules and event generation
├── renderers.py       # Text, JSON, Markdown, and HTML output
└── models.py          # Pydantic models for snapshots and reports

Processing flow:

  1. Detect the input format from the file suffix and schema-document heuristics.
  2. Infer a normalized field map with type, nullability, order, and sample hints.
  3. Compare old and new field sets against compatibility rules.
  4. Aggregate file-level results for scans and baseline checks when needed.
  5. Render the result for humans or CI consumers.
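
As a toy end-to-end illustration of steps 2 through 5 for JSONL inputs only (not the package's own code; real inference also tracks nullability, order, and samples):

import json
from pathlib import Path

def infer_fields(path: Path) -> dict[str, str]:
    # Step 2, heavily simplified: map each field name to a Python type name.
    fields: dict[str, str] = {}
    for line in path.read_text().splitlines():
        if line.strip():
            for key, value in json.loads(line).items():
                fields[key] = type(value).__name__
    return fields

def diff_fields(old: dict[str, str], new: dict[str, str]) -> list[str]:
    # Steps 3-5, collapsed: real classification distinguishes SAFE and
    # WARNING additions by nullability, and treats numeric widening as
    # SAFE; here every addition is a WARNING and every type change BREAKING.
    events = [f"BREAKING - removed field: {name}" for name in old.keys() - new.keys()]
    events += [f"WARNING + added field: {name}" for name in new.keys() - old.keys()]
    events += [
        f"BREAKING ~ type changed: {name}"
        for name in old.keys() & new.keys()
        if old[name] != new[name]
    ]
    return events

# Example: diff_fields(infer_fields(Path("old.jsonl")), infer_fields(Path("new.jsonl")))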

Tools Used

Tool                  Purpose
Python 3.11+          Runtime and packaging baseline
Typer                 CLI commands and help output
Rich                  Terminal rendering
Pydantic              Snapshot and report models
PyArrow               Parquet schema reading and test fixture creation
PyYAML                OpenAPI YAML parsing
Jinja2                HTML report templating
pytest + pytest-cov   Unit and integration tests with coverage
mypy                  Strict type checking
ruff                  Linting and formatting
pip-audit             Dependency vulnerability checks

Testing and Verification

Local verification commands:

ruff format --check .
ruff check .
mypy src
pytest
pip-audit

Manual end-to-end commands using committed sample files are documented in TESTING.md.

The automated test suite covers:

  • CSV inference and numeric widening behavior
  • JSON and JSONL nested shape, nullability, and sample-shape changes
  • OpenAPI, Avro, and protobuf schema parsing
  • nested diff collapsing and rename heuristics
  • snapshot and baseline round-trips
  • CLI integration for inspect, snapshot, compare, diff, scan, and baseline
  • Parquet and directory report generation

Repository Standards
