
Data processing toolkit: YAML/JSON to relational tables, schema comparison, and metadata management


Schema Sentinel


A comprehensive data processing and schema management toolkit for data engineers and analysts. Schema Sentinel provides powerful tools for transforming nested YAML/JSON data into relational structures, generating dynamic schemas, comparing data, and tracking database schema changes.

Ideal for data engineers, analytics teams, and DBAs who work with complex configuration files, API responses, and nested data structures, or who need to track schema changes across environments.

🎯 Key Features

YAML Shredder - Transform Nested Data

  • 🔄 Automatic Schema Generation - Dynamically infer JSON Schema from YAML/JSON files with auto-detection of types and patterns
  • 📊 Relational Table Conversion - Convert deeply nested YAML/JSON into normalized relational tables with automatic relationship mapping
  • 🗄️ Multi-Database DDL Generation - Generate SQL DDL for Snowflake, PostgreSQL, MySQL, and SQLite
  • ⚡ Data Loading - Load transformed data directly into SQLite databases with automatic indexing
  • 🔍 Structure Analysis - Analyze and identify nested structures, arrays, and potential table candidates
  • 🔀 YAML Comparison - Compare two YAML files by converting them to databases and analyzing structural/data differences
  • 💻 CLI & Python API - Command-line interface and Python API for seamless integration
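
To make the idea concrete, here is a minimal, standard-library sketch of what "relational table conversion" means: scalars stay on a root row, nested objects and lists of objects become child tables, and everything is loaded into SQLite. The `doc` structure and `shred` helper are hypothetical illustrations of the concept, not Schema Sentinel's actual implementation:

```python
import sqlite3

# Hypothetical nested document, like what yaml.safe_load would return.
doc = {
    "service": {"name": "api", "port": 8080},
    "endpoints": [
        {"path": "/users", "method": "GET"},
        {"path": "/users", "method": "POST"},
    ],
}

def shred(doc: dict, root: str = "CONFIG") -> dict[str, list[dict]]:
    """Flatten one level of nesting: scalars stay on the root row,
    nested dicts and lists of dicts become child tables."""
    tables: dict[str, list[dict]] = {root: [{}]}
    for key, value in doc.items():
        if isinstance(value, dict):
            tables[key] = [value]
        elif isinstance(value, list) and all(isinstance(v, dict) for v in value):
            tables[key] = value
        else:
            tables[root][0][key] = value
    return tables

tables = shred(doc)

# Load each non-empty table into an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
for name, rows in tables.items():
    if not rows or not rows[0]:
        continue  # skip tables with no columns (e.g. an empty root row)
    cols = list(rows[0].keys())
    conn.execute(f"CREATE TABLE {name} ({', '.join(cols)})")
    conn.executemany(
        f"INSERT INTO {name} VALUES ({', '.join('?' for _ in cols)})",
        [tuple(r[c] for c in cols) for r in rows],
    )

print(sorted(tables))  # table names discovered in the document
```

The real toolkit handles arbitrary nesting depth, relationship keys, and type inference; this sketch only shows the shape of the transformation.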

Schema Comparison (Bonus)

  • 📋 Metadata Extraction - Extract complete schema information from Snowflake databases
  • 💾 Version Control - Store metadata snapshots in SQLite for historical tracking
  • 🔎 Environment Comparison - Compare schemas between dev, staging, and production
  • 📝 Multiple Report Formats - Generate comparison reports in Markdown, HTML, and JSON
  • 🔒 Secure - Best practices for credential management and data security

🎓 Use Cases

YAML Shredder Use Cases

  • Configuration Management - Transform YAML configs into queryable database tables
  • API Response Processing - Convert nested JSON API responses into relational format
  • Data Pipeline Transformation - Normalize complex nested data for analytics
  • Schema Discovery - Automatically infer schemas from example data
  • Multi-Source Integration - Combine data from different YAML/JSON sources
  • Data Versioning - Track changes in configuration files over time
  • Configuration Drift Detection - Compare YAML configs across environments to identify differences
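
The drift-detection idea can be sketched in a few lines of standard-library Python: flatten each config into dotted key paths, then compare key sets and values. The `dev`/`prod` dicts and `flatten` helper below are hypothetical stand-ins for parsed YAML, not the toolkit's API:

```python
# Hypothetical dev/prod configs, already parsed (e.g. with yaml.safe_load).
dev = {"db": {"host": "localhost", "pool": 5}, "debug": True}
prod = {"db": {"host": "db.internal", "pool": 20}}

def flatten(d: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted key paths, e.g. {'db.host': ...}."""
    out = {}
    for k, v in d.items():
        path = f"{prefix}{k}"
        if isinstance(v, dict):
            out.update(flatten(v, path + "."))
        else:
            out[path] = v
    return out

a, b = flatten(dev), flatten(prod)
drift = {
    "only_in_dev": sorted(a.keys() - b.keys()),
    "only_in_prod": sorted(b.keys() - a.keys()),
    "changed": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
}
print(drift)
```

Schema Sentinel's `yaml compare` command goes further by materializing both files as databases first, but the underlying question — which paths exist only on one side, and which values differ — is the same.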

Schema Comparison Use Cases

  • Environment Synchronization - Ensure dev, staging, and production schemas are aligned
  • Change Tracking - Monitor database schema evolution over time
  • Deployment Validation - Verify schema changes after deployments
  • Compliance & Auditing - Maintain schema change history for compliance
  • Migration Planning - Identify schema differences before migrations

📋 Requirements

  • Python 3.13 or higher
  • uv - Modern Python package manager
  • Snowflake account (optional, only for schema comparison features)

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/Igladyshev/schema-sentinel.git
cd schema-sentinel

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh  # Linux/macOS
# or
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"  # Windows

# Set up environment and install dependencies
./setup.sh

# Or manually:
uv venv
source .venv/bin/activate  # Linux/macOS or .venv\Scripts\activate on Windows
uv pip install -e ".[dev,jupyter]"

Quick Start - YAML Processing

Command Line Interface

Schema Sentinel provides organized command groups for different tasks:

YAML Processing Commands (schema-sentinel yaml)

# Analyze YAML structure
uv run schema-sentinel yaml analyze config.yaml

# Generate JSON schema
uv run schema-sentinel yaml schema config.yaml -o schema.json

# Generate relational tables
uv run schema-sentinel yaml tables config.yaml -o output/ -f csv

# Generate SQL DDL
uv run schema-sentinel yaml ddl config.yaml -o schema.sql -d snowflake

# Load data into SQLite
uv run schema-sentinel yaml load config.yaml -db output.db -r CONFIG

# Complete workflow: analyze → tables → DDL → load
uv run schema-sentinel yaml shred config.yaml -db output.db -r CONFIG

# Compare two YAML files
uv run schema-sentinel yaml compare file1.yaml file2.yaml -o comparison.md

Schema Management Commands (schema-sentinel schema)

# Extract Snowflake schema metadata
uv run schema-sentinel schema extract MY_DATABASE --env prod

# Compare two schema snapshots
uv run schema-sentinel schema compare snapshot1 snapshot2 -o report.md

Python API

import yaml

from yaml_shredder import TableGenerator, DDLGenerator, SQLiteLoader

# Load YAML into a Python structure
with open("config.yaml") as f:
    data = yaml.safe_load(f)

# Convert the nested data into relational tables
table_gen = TableGenerator()
tables = table_gen.generate_tables(data, root_table_name="CONFIG")

# Generate SQL DDL
ddl_gen = DDLGenerator(dialect="sqlite")
ddl = ddl_gen.generate_ddl(tables, table_gen.relationships)

# Load into SQLite
loader = SQLiteLoader("output.db")
loader.load_tables(tables)

YAML Comparison

Python API:

from pathlib import Path
from schema_sentinel.yaml_comparator import YAMLComparator

# Create comparator
comparator = YAMLComparator(output_dir=Path("./temp_dbs"))

# Compare YAML files
report = comparator.compare_yaml_files(
    yaml1_path=Path("config1.yaml"),
    yaml2_path=Path("config2.yaml"),
    output_report=Path("comparison.md"),
    keep_dbs=False,  # Clean up temporary databases
    root_table_name="root"
)

print(report)

Configuration (For Schema Comparison)

For Snowflake schema comparison features, create .env with credentials:

SNOWFLAKE_ACCOUNT=your_account
SNOWFLAKE_USER=your_username
SNOWFLAKE_PASSWORD=your_password
SNOWFLAKE_WAREHOUSE=your_warehouse
SNOWFLAKE_DATABASE=your_database
SNOWFLAKE_ROLE=your_role
SNOWFLAKE_SCHEMAS=PUBLIC,ANALYTICS  # Optional
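
A common pattern is to validate these settings at startup, before opening a connection. The sketch below uses plain `os.environ` with hypothetical placeholder values; it illustrates the pattern and does not reflect Schema Sentinel's internal configuration handling:

```python
import os

# Hypothetical values standing in for a populated .env; in practice these
# come from your shell or a loader such as python-dotenv.
os.environ.setdefault("SNOWFLAKE_ACCOUNT", "my_account")
os.environ.setdefault("SNOWFLAKE_USER", "my_user")
os.environ.setdefault("SNOWFLAKE_PASSWORD", "secret")

REQUIRED = ["SNOWFLAKE_ACCOUNT", "SNOWFLAKE_USER", "SNOWFLAKE_PASSWORD"]
missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    raise RuntimeError(f"Missing required settings: {', '.join(missing)}")

# SNOWFLAKE_SCHEMAS is an optional comma-separated filter.
schemas = [s.strip() for s in os.environ.get("SNOWFLAKE_SCHEMAS", "PUBLIC").split(",")]
print(schemas)
```

Failing fast with a clear message beats a cryptic connection error deep inside a pipeline run.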

📖 Documentation

YAML Shredder

  • YAML Shredder CLI Guide - Complete CLI reference and examples
  • Notebooks Guide - Jupyter notebooks for data comparison and analysis
  • Generic Table Comparison - See MPM Comparison and Migration.ipynb for examples

General Documentation

๐Ÿ› ๏ธ Development

Setup Development Environment

# Install with development dependencies
uv pip install -e ".[dev,jupyter]"

# Install pre-commit hooks
pre-commit install

# Run tests
make test

# Format code
make format

# Lint code
make lint

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=schema_sentinel --cov-report=html

# Run specific test file
pytest tests/test_metadata.py

Code Quality

# Format code with Ruff
ruff format .

# Lint code
ruff check .

# Type checking
mypy schema_sentinel/

# Run all pre-commit hooks
pre-commit run --all-files

๐Ÿ—๏ธ Architecture

schema-sentinel/
├── schema_sentinel/              # Main package
│   ├── __init__.py             # Package initialization
│   ├── config/                  # Configuration management
│   │   ├── __init__.py
│   │   └── manager.py          # ConfigManager class
│   ├── markdown_utils/          # Markdown report generation
│   │   └── markdown.py
│   └── metadata_manager/        # Core metadata management
│       ├── engine.py           # Database connection engines
│       ├── metadata.py         # Metadata extraction logic
│       ├── changeset.py        # Change detection and tracking
│       ├── enums.py            # Enumerations and constants
│       ├── utils.py            # Utility functions
│       ├── model/              # Data models
│       │   ├── database.py     # Database model
│       │   ├── schema.py       # Schema model
│       │   ├── table.py        # Table model
│       │   ├── column.py       # Column model
│       │   ├── view.py         # View model
│       │   ├── procedure.py    # Stored procedure model
│       │   ├── function.py     # Function model
│       │   ├── constraint.py   # Constraint models
│       │   └── ...             # Other object models
│       └── lookup/             # Reference data
│           └── sql_data_type.py
├── yaml_shredder/               # YAML/JSON processing toolkit
│   ├── __init__.py
│   ├── schema_generator.py     # Auto JSON Schema generation
│   ├── structure_analyzer.py   # Nested structure analysis
│   ├── table_generator.py      # Relational table conversion
│   ├── ddl_generator.py        # SQL DDL generation
│   └── data_loader.py          # SQLite data loading
├── resources/                   # Configuration and templates
│   ├── examples/               # Example files and configurations
│   │   ├── .env.example        # Environment variables template
│   │   ├── example_sqlite_workflow.py  # SQLite workflow example
│   │   └── ...                 # Other example files
│   ├── db.properties           # Database config template
│   ├── datacompy/templates/    # Report templates
│   ├── meta-db/                # SQLite metadata storage
│   └── migrations-ddl/         # DDL migration procedures
├── tests/                       # Test suite
│   ├── test_config.py          # Configuration tests
│   ├── test_imports.py         # Import tests
│   └── ...                     # Other test files
├── docs/                        # API documentation (pdoc)
├── wiki/                        # Project wiki and guides
└── notebooks/                   # Jupyter notebooks
    ├── MPM Comparison and Migration.ipynb
    └── ...

Supported Database Objects

  • ✅ Databases
  • ✅ Schemas
  • ✅ Tables (with columns, data types, nullability)
  • ✅ Views
  • ✅ Materialized Views
  • ✅ Stored Procedures
  • ✅ Functions (UDFs)
  • ✅ Primary Keys
  • ✅ Foreign Keys
  • ✅ Unique Constraints
  • ✅ Streams
  • ✅ Tasks
  • ✅ Pipes
  • ✅ Stages
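
Conceptually, each extracted object becomes a small data model, and schema comparison reduces to comparing trees of such models. The dataclasses below are simplified, hypothetical shapes for illustration only; the real models under `schema_sentinel/metadata_manager/model/` are richer:

```python
from dataclasses import dataclass, field

@dataclass
class Column:
    name: str
    data_type: str
    nullable: bool = True

@dataclass
class Table:
    name: str
    columns: list[Column] = field(default_factory=list)

@dataclass
class Schema:
    name: str
    tables: list[Table] = field(default_factory=list)

# A tiny snapshot of one schema.
users = Table("USERS", [Column("ID", "NUMBER", nullable=False), Column("EMAIL", "VARCHAR")])
public = Schema("PUBLIC", [users])

def table_names(schema: Schema) -> set[str]:
    """Comparing two snapshots starts with set arithmetic on names like these."""
    return {t.name for t in schema.tables}

print(table_names(public))
```

Once two snapshots are in this form, "what changed between dev and prod" becomes set differences on names plus field-by-field comparison of matching objects.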

๐Ÿค Contributing

We welcome contributions! This is an open source project and we'd love your help to make it better.

How to Contribute

  1. Fork the repository
  2. Create a feature branch from dev (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Add tests for your changes
  5. Ensure tests pass (pytest)
  6. Format code (ruff format .)
  7. Commit changes (git commit -m 'feat: add amazing feature')
  8. Push to branch (git push origin feature/amazing-feature)
  9. Open a Pull Request to merge into dev branch

See CONTRIBUTING.md for detailed guidelines and BRANCHING.md for our branching strategy.

Development Guidelines

  • Follow PEP 8 style guide (enforced by Ruff)
  • Add tests for new features
  • Update documentation
  • Use conventional commit messages
  • Ensure CI passes before requesting review

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

🔒 Security

Security is a top priority. Please see SECURITY.md for:

  • Reporting vulnerabilities
  • Security best practices
  • Credential management guidelines

Never commit credentials or sensitive data to the repository.

🌟 Acknowledgments

📊 Project Status

Current Status: Active Development 🚧

This project is being actively developed and prepared for production use. We're working towards v2.1.0 with:

  • ✅ Modern Python packaging (pyproject.toml)
  • ✅ Comprehensive testing framework
  • ✅ CI/CD pipelines
  • ✅ Documentation
  • 🚧 Enhanced metadata extraction
  • 🚧 Additional database support
  • 🚧 Web UI (planned)

Roadmap

  • v2.1.0 - Current release with uv support, modern tooling
  • v2.2.0 - DuckDB integration, enhanced data comparator, PostgreSQL & MySQL support
  • v2.3.0 - REST API, CLI interface, Oracle & SQL Server support
  • v3.0.0 - Web UI, multi-user support, RBAC, CI/CD integration

📋 See the detailed Future Development Plan for a comprehensive roadmap and planned features.

💬 Support & Community



Made with ❤️ for the data engineering community

If you find this project useful, please consider giving it a ⭐️ on GitHub!
