
Data processing toolkit: YAML/JSON to relational tables, schema comparison, and metadata management


Schema Sentinel


A comprehensive data processing and schema management toolkit for data engineers and analysts. Schema Sentinel provides powerful tools for transforming nested YAML/JSON data into relational structures, generating dynamic schemas, comparing data, and tracking database schema changes.

Perfect for data engineers, analytics teams, and DBAs who work with complex configuration files, API responses, and nested data structures, or who need to track schema changes across environments.

🎯 Key Features

YAML Shredder - Transform Nested Data

  • 🔄 Automatic Schema Generation - Dynamically infer JSON Schema from YAML/JSON files with auto-detection of types and patterns
  • 📊 Relational Table Conversion - Convert deeply nested YAML/JSON into normalized relational tables with automatic relationship mapping
  • 🗄️ Multi-Database DDL Generation - Generate SQL DDL for Snowflake, PostgreSQL, MySQL, and SQLite
  • ⚡ Data Loading - Load transformed data directly into SQLite databases with automatic indexing
  • 🔍 Structure Analysis - Analyze and identify nested structures, arrays, and potential table candidates
  • YAML Comparison - Compare two YAML files by converting them to databases and analyzing structural/data differences
  • 💻 CLI & Python API - Command-line interface and Python API for seamless integration
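The table-conversion idea can be sketched in a few lines of plain Python. This is an illustrative toy, not Schema Sentinel's actual algorithm: it flattens nested dicts and lists into per-table row lists, giving each nested object a synthetic id and parent_id (the table-naming and id schemes here are assumptions for the example):

```python
from itertools import count

def shred(obj, table="CONFIG", tables=None, parent=None, _ids=None):
    """Toy shredder: flatten a nested dict into per-table row lists."""
    if tables is None:
        tables, _ids = {}, {}
    row = {"id": next(_ids.setdefault(table, count(1)))}
    if parent is not None:
        row["parent_id"] = parent  # link back to the enclosing row
    for key, value in obj.items():
        if isinstance(value, dict):
            # Nested object -> child table named PARENT_KEY
            shred(value, f"{table}_{key}".upper(), tables, row["id"], _ids)
        elif isinstance(value, list):
            # Array -> one child row per element
            for item in value:
                child = item if isinstance(item, dict) else {"value": item}
                shred(child, f"{table}_{key}".upper(), tables, row["id"], _ids)
        else:
            row[key] = value  # scalar stays in the current row
    tables.setdefault(table, []).append(row)
    return tables

tables = shred({"app": "web", "db": {"host": "localhost", "ports": [5432, 5433]}})
```

Here `shred` produces a CONFIG table, a CONFIG_DB child table, and a CONFIG_DB_PORTS table for the array, each child row carrying a parent_id, which is the shape of output the DDL generation and SQLite loading steps then consume.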

Schema Comparison (Bonus)

  • 📋 Metadata Extraction - Extract complete schema information from Snowflake databases
  • 💾 Version Control - Store metadata snapshots in SQLite for historical tracking
  • 🔎 Environment Comparison - Compare schemas between dev, staging, and production
  • 📝 Multiple Report Formats - Generate comparison reports in Markdown, HTML, and JSON
  • 🔒 Secure - Best practices for credential management and data security

🎓 Use Cases

YAML Shredder Use Cases

  • Configuration Management - Transform YAML configs into queryable database tables
  • API Response Processing - Convert nested JSON API responses into relational format
  • Data Pipeline Transformation - Normalize complex nested data for analytics
  • Schema Discovery - Automatically infer schemas from example data
  • Multi-Source Integration - Combine data from different YAML/JSON sources
  • Data Versioning - Track changes in configuration files over time
  • Configuration Drift Detection - Compare YAML configs across environments to identify differences

Schema Comparison Use Cases

  • Environment Synchronization - Ensure dev, staging, and production schemas are aligned
  • Change Tracking - Monitor database schema evolution over time
  • Deployment Validation - Verify schema changes after deployments
  • Compliance & Auditing - Maintain schema change history for compliance
  • Migration Planning - Identify schema differences before migrations
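At its core, a schema comparison reduces to set operations over object inventories. The sketch below is a hand-rolled toy over `{table: {column: type}}` dicts, not Schema Sentinel's actual metadata model, but it shows the kind of added/removed/changed report these use cases rely on:

```python
def diff_schemas(old, new):
    """Toy schema diff over {table: {column: type}} snapshots."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = {}
    for table in old.keys() & new.keys():
        if old[table] != new[table]:
            changed[table] = {
                "added_cols": sorted(new[table].keys() - old[table].keys()),
                "removed_cols": sorted(old[table].keys() - new[table].keys()),
                # Columns present in both but with a different declared type
                "retyped": sorted(c for c in old[table].keys() & new[table].keys()
                                  if old[table][c] != new[table][c]),
            }
    return {"added": added, "removed": removed, "changed": changed}

dev = {"USERS": {"ID": "NUMBER", "EMAIL": "VARCHAR"}}
prod = {"USERS": {"ID": "NUMBER", "EMAIL": "TEXT", "CREATED": "TIMESTAMP"},
        "ORDERS": {"ID": "NUMBER"}}
report = diff_schemas(dev, prod)
```

For the sample snapshots above, the report flags ORDERS as a new table and USERS as changed (a new CREATED column and a retyped EMAIL), which is exactly the information a migration plan or deployment validation needs.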

📋 Requirements

  • Python 3.13 or higher
  • uv - Modern Python package manager
  • Snowflake account (optional, only for schema comparison features)

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/Igladyshev/schema-sentinel.git
cd schema-sentinel

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh  # Linux/macOS
# or
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"  # Windows

# Set up environment and install dependencies
./setup.sh

# Or manually:
uv venv
source .venv/bin/activate  # Linux/macOS or .venv\Scripts\activate on Windows
uv pip install -e ".[dev,jupyter]"

Quick Start - YAML Processing

Command Line Interface

Schema Sentinel provides organized command groups for different tasks:

YAML Processing Commands (schema-sentinel yaml)

# Analyze YAML structure
uv run schema-sentinel yaml analyze config.yaml

# Generate JSON schema
uv run schema-sentinel yaml schema config.yaml -o schema.json

# Generate relational tables
uv run schema-sentinel yaml tables config.yaml -o output/ -f csv

# Generate SQL DDL
uv run schema-sentinel yaml ddl config.yaml -o schema.sql -d snowflake

# Load data into SQLite
uv run schema-sentinel yaml load config.yaml -db output.db -r CONFIG

# Complete workflow: analyze → tables → DDL → load
uv run schema-sentinel yaml shred config.yaml -db output.db -r CONFIG

# Compare two YAML files
uv run schema-sentinel yaml compare file1.yaml file2.yaml -o comparison.md
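Conceptually, comparing two YAML files boils down to flattening both documents into path/value pairs and diffing them. The real `yaml compare` command works through intermediate databases, but this stdlib-only sketch (with hypothetical `flatten` and `drift` helpers) shows the idea on dicts as they would come out of a YAML parser:

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into {dotted.path: value} pairs."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = ((str(i), v) for i, v in enumerate(obj))
    else:
        return {prefix: obj}  # scalar leaf
    out = {}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        out.update(flatten(value, path))
    return out

def drift(a, b):
    """Return every path whose value differs (or exists on one side only)."""
    fa, fb = flatten(a), flatten(b)
    return sorted(k for k in fa.keys() | fb.keys() if fa.get(k) != fb.get(k))

# Two configs, as if parsed with a YAML loader
cfg1 = {"service": {"replicas": 2, "image": "app:1.0"}}
cfg2 = {"service": {"replicas": 3, "image": "app:1.0"}}
```

Running `drift(cfg1, cfg2)` pinpoints `service.replicas` as the only differing path, which is the essence of the configuration-drift reports the compare command writes out.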

Schema Management Commands (schema-sentinel schema)

# Extract Snowflake schema metadata
uv run schema-sentinel schema extract MY_DATABASE --env prod

# Compare two schema snapshots
uv run schema-sentinel schema compare snapshot1 snapshot2 -o report.md

Python API

import yaml  # PyYAML

from yaml_shredder import TableGenerator, DDLGenerator, SQLiteLoader

# Load YAML and convert it to relational tables
with open("config.yaml") as f:
    data = yaml.safe_load(f)

table_gen = TableGenerator()
tables = table_gen.generate_tables(data, root_table_name="CONFIG")

# Generate SQL DDL
ddl_gen = DDLGenerator(dialect="sqlite")
ddl = ddl_gen.generate_ddl(tables, table_gen.relationships)

# Load into SQLite
loader = SQLiteLoader("output.db")
loader.load_tables(tables)

YAML Comparison

Python API:

from pathlib import Path
from schema_sentinel.yaml_comparator import YAMLComparator

# Create comparator
comparator = YAMLComparator(output_dir=Path("./temp_dbs"))

# Compare YAML files
report = comparator.compare_yaml_files(
    yaml1_path=Path("config1.yaml"),
    yaml2_path=Path("config2.yaml"),
    output_report=Path("comparison.md"),
    keep_dbs=False,  # Clean up temporary databases
    root_table_name="root"
)

print(report)

Configuration (For Schema Comparison)

For Snowflake schema comparison features, create .env with credentials:

SNOWFLAKE_ACCOUNT=your_account
SNOWFLAKE_USER=your_username
SNOWFLAKE_PASSWORD=your_password
SNOWFLAKE_WAREHOUSE=your_warehouse
SNOWFLAKE_DATABASE=your_database
SNOWFLAKE_ROLE=your_role
SNOWFLAKE_SCHEMAS=PUBLIC,ANALYTICS  # Optional
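However the variables reach the process, it pays to validate them before opening a connection. The sketch below is a hypothetical `load_snowflake_config` helper, not part of the package's API, showing one standard-library way to collect and check these settings:

```python
import os

REQUIRED = ["SNOWFLAKE_ACCOUNT", "SNOWFLAKE_USER", "SNOWFLAKE_PASSWORD"]

def load_snowflake_config(env=os.environ):
    """Collect SNOWFLAKE_* settings, failing fast on missing required keys."""
    missing = [k for k in REQUIRED if not env.get(k)]
    if missing:
        raise RuntimeError(f"Missing credentials: {', '.join(missing)}")
    # Lower-case keys without the prefix: account, user, password, ...
    cfg = {k.removeprefix("SNOWFLAKE_").lower(): v
           for k, v in env.items() if k.startswith("SNOWFLAKE_")}
    # SNOWFLAKE_SCHEMAS is a comma-separated list when present
    if "schemas" in cfg:
        cfg["schemas"] = [s.strip() for s in cfg["schemas"].split(",")]
    return cfg

cfg = load_snowflake_config({
    "SNOWFLAKE_ACCOUNT": "acct", "SNOWFLAKE_USER": "me",
    "SNOWFLAKE_PASSWORD": "pw", "SNOWFLAKE_SCHEMAS": "PUBLIC, ANALYTICS",
})
```

Failing fast with an explicit list of missing variables is friendlier than letting the Snowflake driver surface a generic authentication error later.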

📖 Documentation

YAML Shredder

  • YAML Shredder CLI Guide - Complete CLI reference and examples
  • Notebooks Guide - Jupyter notebooks for data comparison and analysis
  • Generic Table Comparison - See MPM Comparison and Migration.ipynb for examples

General Documentation

๐Ÿ› ๏ธ Development

Setup Development Environment

# Install with development dependencies
uv pip install -e ".[dev,jupyter]"

# Install pre-commit hooks
pre-commit install

# Run tests
make test

# Format code
make format

# Lint code
make lint

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=schema_sentinel --cov-report=html

# Run specific test file
pytest tests/test_metadata.py

Code Quality

# Format code with Ruff
ruff format .

# Lint code
ruff check .

# Type checking
mypy schema_sentinel/

# Run all pre-commit hooks
pre-commit run --all-files

๐Ÿ—๏ธ Architecture

schema-sentinel/
├── schema_sentinel/             # Main package
│   ├── __init__.py              # Package initialization
│   ├── config/                  # Configuration management
│   │   ├── __init__.py
│   │   └── manager.py           # ConfigManager class
│   ├── markdown_utils/          # Markdown report generation
│   │   └── markdown.py
│   └── metadata_manager/        # Core metadata management
│       ├── engine.py            # Database connection engines
│       ├── metadata.py          # Metadata extraction logic
│       ├── changeset.py         # Change detection and tracking
│       ├── enums.py             # Enumerations and constants
│       ├── utils.py             # Utility functions
│       ├── model/               # Data models
│       │   ├── database.py      # Database model
│       │   ├── schema.py        # Schema model
│       │   ├── table.py         # Table model
│       │   ├── column.py        # Column model
│       │   ├── view.py          # View model
│       │   ├── procedure.py     # Stored procedure model
│       │   ├── function.py      # Function model
│       │   ├── constraint.py    # Constraint models
│       │   └── ...              # Other object models
│       └── lookup/              # Reference data
│           └── sql_data_type.py
├── yaml_shredder/               # YAML/JSON processing toolkit
│   ├── __init__.py
│   ├── schema_generator.py      # Auto JSON Schema generation
│   ├── structure_analyzer.py    # Nested structure analysis
│   ├── table_generator.py       # Relational table conversion
│   ├── ddl_generator.py         # SQL DDL generation
│   └── data_loader.py           # SQLite data loading
├── resources/                   # Configuration and templates
│   ├── examples/                # Example files and configurations
│   │   ├── .env.example         # Environment variables template
│   │   ├── example_sqlite_workflow.py  # SQLite workflow example
│   │   └── ...                  # Other example files
│   ├── db.properties            # Database config template
│   ├── datacompy/templates/     # Report templates
│   ├── meta-db/                 # SQLite metadata storage
│   └── migrations-ddl/          # DDL migration procedures
├── tests/                       # Test suite
│   ├── test_config.py           # Configuration tests
│   ├── test_imports.py          # Import tests
│   └── ...                      # Other test files
├── docs/                        # API documentation (pdoc)
├── wiki/                        # Project wiki and guides
└── notebooks/                   # Jupyter notebooks
    ├── MPM Comparison and Migration.ipynb
    └── ...

Supported Database Objects

  • ✅ Databases
  • ✅ Schemas
  • ✅ Tables (with columns, data types, nullability)
  • ✅ Views
  • ✅ Materialized Views
  • ✅ Stored Procedures
  • ✅ Functions (UDFs)
  • ✅ Primary Keys
  • ✅ Foreign Keys
  • ✅ Unique Constraints
  • ✅ Streams
  • ✅ Tasks
  • ✅ Pipes
  • ✅ Stages

๐Ÿค Contributing

We welcome contributions! This is an open source project and we'd love your help to make it better.

How to Contribute

  1. Fork the repository
  2. Create a feature branch from dev (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Add tests for your changes
  5. Ensure tests pass (pytest)
  6. Format code (ruff format .)
  7. Commit changes (git commit -m 'feat: add amazing feature')
  8. Push to branch (git push origin feature/amazing-feature)
  9. Open a Pull Request to merge into dev branch

See CONTRIBUTING.md for detailed guidelines and BRANCHING.md for our branching strategy.

Development Guidelines

  • Follow PEP 8 style guide (enforced by Ruff)
  • Add tests for new features
  • Update documentation
  • Use conventional commit messages
  • Ensure CI passes before requesting review

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

🔒 Security

Security is a top priority. Please see SECURITY.md for:

  • Reporting vulnerabilities
  • Security best practices
  • Credential management guidelines

Never commit credentials or sensitive data to the repository.

🌟 Acknowledgments

📊 Project Status

Current Status: Active Development 🚧

This project is being actively developed and prepared for production use. We're working towards v2.1.0 with:

  • ✅ Modern Python packaging (pyproject.toml)
  • ✅ Comprehensive testing framework
  • ✅ CI/CD pipelines
  • ✅ Documentation
  • 🚧 Enhanced metadata extraction
  • 🚧 Additional database support
  • 🚧 Web UI (planned)

Roadmap

  • v2.1.0 - Current release with uv support, modern tooling
  • v2.2.0 - DuckDB integration, enhanced data comparator, PostgreSQL & MySQL support
  • v2.3.0 - REST API, CLI interface, Oracle & SQL Server support
  • v3.0.0 - Web UI, multi-user support, RBAC, CI/CD integration

📋 See the detailed Future Development Plan for a comprehensive roadmap and planned features.

💬 Support & Community



Made with โค๏ธ for the data engineering community

If you find this project useful, please consider giving it a ⭐️ on GitHub!



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

schema_sentinel-3.0.2.tar.gz (903.7 kB)


Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

schema_sentinel-3.0.2-py3-none-any.whl (70.0 kB)


File details

Details for the file schema_sentinel-3.0.2.tar.gz.

File metadata

  • Download URL: schema_sentinel-3.0.2.tar.gz
  • Upload date:
  • Size: 903.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for schema_sentinel-3.0.2.tar.gz:

  • SHA256: 762d2f26ab2d4acd21bdbd52f63604a99d2d11cfee6183df756b7a44e7d09afb
  • MD5: 0b8997210d98e9667f8936cb51fd7b06
  • BLAKE2b-256: a24b2985abfeed092e37b3decdd6bd9bc007dbaee208ab2ccaeb5f192ec3a0ed

See more details on using hashes here.

Provenance

The following attestation bundles were made for schema_sentinel-3.0.2.tar.gz:

Publisher: publish.yml on Igladyshev/schema-sentinel

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file schema_sentinel-3.0.2-py3-none-any.whl.

File metadata

File hashes

Hashes for schema_sentinel-3.0.2-py3-none-any.whl:

  • SHA256: ac8ae2d06e85365627ad08289b5b12f67acc78b64b828d348892e4352c584852
  • MD5: 665172ee2af80affd64d1ef7b3b5ba24
  • BLAKE2b-256: 16cbd9e40ab7e8d0c4bd66f88d84ece7ce127b3205e9d3eab073fa69f5138264


Provenance

The following attestation bundles were made for schema_sentinel-3.0.2-py3-none-any.whl:

Publisher: publish.yml on Igladyshev/schema-sentinel

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
