Data processing toolkit: YAML/JSON to relational tables, schema comparison, and metadata management
Schema Sentinel
A comprehensive data processing and schema management toolkit for data engineers and analysts. Schema Sentinel provides powerful tools for transforming nested YAML/JSON data into relational structures, generating dynamic schemas, comparing data, and tracking database schema changes.
It is aimed at data engineers, analytics teams, and DBAs who work with complex configuration files, API responses, or nested data structures, or who need to track schema changes across environments.
Key Features
YAML Shredder - Transform Nested Data
- Automatic Schema Generation - Dynamically infer JSON Schema from YAML/JSON files with auto-detection of types and patterns
- Relational Table Conversion - Convert deeply nested YAML/JSON into normalized relational tables with automatic relationship mapping
- Multi-Database DDL Generation - Generate SQL DDL for Snowflake, PostgreSQL, MySQL, and SQLite
- Data Loading - Load transformed data directly into SQLite databases with automatic indexing
- Structure Analysis - Analyze and identify nested structures, arrays, and potential table candidates
- YAML Comparison - Compare two YAML files by converting to databases and analyzing structural/data differences
- CLI & Python API - Command-line interface and Python API for seamless integration
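To make the relational table conversion idea concrete, here is a minimal sketch in plain Python. It is not Schema Sentinel's implementation: the function name shred, the _id and _parent_id columns, and the child table naming scheme are all assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Any, Optional


def shred(data: dict[str, Any], root_table: str = "CONFIG") -> dict[str, list[dict[str, Any]]]:
    """Split a nested dict into flat tables linked by surrogate keys.

    Hypothetical helper for illustration only; not part of Schema Sentinel.
    """
    tables: dict[str, list[dict[str, Any]]] = defaultdict(list)

    def visit(node: dict[str, Any], table: str, parent_id: Optional[int]) -> None:
        row_id = len(tables[table]) + 1  # surrogate key, unique within one table
        row: dict[str, Any] = {"_id": row_id, "_parent_id": parent_id}
        tables[table].append(row)
        for key, value in node.items():
            child_table = f"{table}_{key.upper()}"
            if isinstance(value, dict):
                # Nested object: one row in a child table
                visit(value, child_table, row_id)
            elif isinstance(value, list) and all(isinstance(v, dict) for v in value):
                # Array of objects: one child row per item
                for item in value:
                    visit(item, child_table, row_id)
            else:
                # Scalars (and lists of scalars) stay as columns on this row
                row[key] = value

    visit(data, root_table, None)
    return dict(tables)


config = {
    "name": "demo",
    "owner": {"team": "data"},
    "jobs": [{"id": "a", "cron": "0 * * * *"}, {"id": "b", "cron": "5 * * * *"}],
}
tables = shred(config)
```

In this sketch the example document yields three tables (CONFIG, CONFIG_OWNER, CONFIG_JOBS), and every child row points back to its parent through _parent_id.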
Schema Comparison (Bonus)
- Metadata Extraction - Extract complete schema information from Snowflake databases
- Version Control - Store metadata snapshots in SQLite for historical tracking
- Environment Comparison - Compare schemas between dev, staging, and production
- Multiple Report Formats - Generate comparison reports in Markdown, HTML, and JSON
- Secure - Best practices for credential management and data security
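The snapshot and comparison features above can be pictured with the standard library alone. The sketch below stores column metadata for two environments in SQLite and uses EXCEPT to find what differs. The table name column_snapshot, its columns, and the helper columns_only_in are assumptions for illustration, not the schema Schema Sentinel actually uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE column_snapshot (
        snapshot    TEXT NOT NULL,  -- environment name or snapshot label
        table_name  TEXT NOT NULL,
        column_name TEXT NOT NULL,
        data_type   TEXT NOT NULL,
        PRIMARY KEY (snapshot, table_name, column_name)
    )
    """
)
rows = [
    ("dev", "ORDERS", "ID", "NUMBER"),
    ("dev", "ORDERS", "STATUS", "VARCHAR"),
    ("dev", "ORDERS", "NOTE", "VARCHAR"),
    ("prod", "ORDERS", "ID", "NUMBER"),
    ("prod", "ORDERS", "STATUS", "NUMBER"),
]
conn.executemany("INSERT INTO column_snapshot VALUES (?, ?, ?, ?)", rows)


def columns_only_in(conn: sqlite3.Connection, left: str, right: str) -> list[tuple]:
    """Columns (with type) present in `left` but missing or typed differently in `right`."""
    sql = """
        SELECT table_name, column_name, data_type
        FROM column_snapshot WHERE snapshot = ?
        EXCEPT
        SELECT table_name, column_name, data_type
        FROM column_snapshot WHERE snapshot = ?
        ORDER BY table_name, column_name
    """
    return conn.execute(sql, (left, right)).fetchall()


dev_only = columns_only_in(conn, "dev", "prod")
prod_only = columns_only_in(conn, "prod", "dev")
```

A column that appears in both result lists (STATUS here) changed type between environments, while one that appears in only one list (NOTE) exists on that side alone.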
Use Cases
YAML Shredder Use Cases
- Configuration Management - Transform YAML configs into queryable database tables
- API Response Processing - Convert nested JSON API responses into relational format
- Data Pipeline Transformation - Normalize complex nested data for analytics
- Schema Discovery - Automatically infer schemas from example data
- Multi-Source Integration - Combine data from different YAML/JSON sources
- Data Versioning - Track changes in configuration files over time
- Configuration Drift Detection - Compare YAML configs across environments to identify differences
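As a rough illustration of configuration drift detection, the sketch below flattens two nested configs into dotted paths and diffs them. Schema Sentinel itself compares YAML files by converting them to databases. This simpler in-memory approach, and the helper names flatten and drift, are assumptions used only to show the idea.

```python
from typing import Any


def flatten(node: Any, prefix: str = "") -> dict[str, Any]:
    """Map every leaf value to a dotted path such as 'db.pool.size'."""
    if isinstance(node, dict):
        items = list(node.items())
    elif isinstance(node, list):
        items = [(f"[{i}]", v) for i, v in enumerate(node)]
    else:
        return {prefix: node}
    flat: dict[str, Any] = {}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat


def drift(left: dict[str, Any], right: dict[str, Any]) -> dict[str, Any]:
    """Report paths added, removed, or changed going from `left` to `right`."""
    a, b = flatten(left), flatten(right)
    return {
        "added": sorted(b.keys() - a.keys()),
        "removed": sorted(a.keys() - b.keys()),
        "changed": {k: (a[k], b[k]) for k in sorted(a.keys() & b.keys()) if a[k] != b[k]},
    }


dev = {"db": {"host": "dev-db", "pool": {"size": 5}}, "debug": True}
prod = {"db": {"host": "prod-db", "pool": {"size": 5}, "ssl": True}}
report = drift(dev, prod)
```

Empty dicts and lists produce no paths in this sketch, so a key whose value is an empty container is treated as absent.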
Schema Comparison Use Cases
- Environment Synchronization - Ensure dev, staging, and production schemas are aligned
- Change Tracking - Monitor database schema evolution over time
- Deployment Validation - Verify schema changes after deployments
- Compliance & Auditing - Maintain schema change history for compliance
- Migration Planning - Identify schema differences before migrations
Requirements
- Python 3.13 or higher
- uv - Modern Python package manager
- Snowflake account (optional, only for schema comparison features)
Quick Start
Installation
# Clone the repository
git clone https://github.com/Igladyshev/schema-sentinel.git
cd schema-sentinel
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh # Linux/macOS
# or
powershell -c "irm https://astral.sh/uv/install.ps1 | iex" # Windows
# Set up environment and install dependencies
./setup.sh
# Or manually:
uv venv
source .venv/bin/activate # Linux/macOS or .venv\Scripts\activate on Windows
uv pip install -e ".[dev,jupyter]"
Quick Start - YAML Processing
Command Line Interface
Schema Sentinel provides organized command groups for different tasks:
YAML Processing Commands (schema-sentinel yaml)
# Analyze YAML structure
uv run schema-sentinel yaml analyze config.yaml
# Generate JSON schema
uv run schema-sentinel yaml schema config.yaml -o schema.json
# Generate relational tables
uv run schema-sentinel yaml tables config.yaml -o output/ -f csv
# Generate SQL DDL
uv run schema-sentinel yaml ddl config.yaml -o schema.sql -d snowflake
# Load data into SQLite
uv run schema-sentinel yaml load config.yaml -db output.db -r CONFIG
# Complete workflow: analyze → tables → DDL → load
uv run schema-sentinel yaml shred config.yaml -db output.db -r CONFIG
# Compare two YAML files
uv run schema-sentinel yaml compare file1.yaml file2.yaml -o comparison.md
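For readers curious what schema inference involves, here is a minimal sketch that derives a JSON Schema fragment from one example value. It is not the logic behind the yaml schema command: the function infer_schema is hypothetical, and it makes the simplifying assumption that every array item resembles the first one.

```python
from typing import Any


def infer_schema(value: Any) -> dict[str, Any]:
    """Infer a small JSON Schema fragment from one example value."""
    # bool must be tested before int because bool is a subclass of int
    if isinstance(value, bool):
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "integer"}
    if isinstance(value, float):
        return {"type": "number"}
    if isinstance(value, str):
        return {"type": "string"}
    if value is None:
        return {"type": "null"}
    if isinstance(value, list):
        schema: dict[str, Any] = {"type": "array"}
        if value:
            # Simplification: assume all items look like the first one
            schema["items"] = infer_schema(value[0])
        return schema
    if isinstance(value, dict):
        return {
            "type": "object",
            "properties": {str(k): infer_schema(v) for k, v in value.items()},
            "required": sorted(str(k) for k in value),
        }
    raise TypeError(f"Unsupported value type: {type(value).__name__}")


example = {"name": "demo", "retries": 3, "tags": ["a", "b"], "owner": {"active": True}}
schema = infer_schema(example)
```

A production-grade generator would typically merge the schemas of all array items and of several sample documents rather than trusting a single example.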
Schema Management Commands (schema-sentinel schema)
# Extract Snowflake schema metadata
uv run schema-sentinel schema extract MY_DATABASE --env prod
# Compare two schema snapshots
uv run schema-sentinel schema compare snapshot1 snapshot2 -o report.md
Python API
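from pathlib import Path

import yaml  # PyYAML, assumed to be installed alongside the toolkit

from yaml_shredder import TableGenerator, DDLGenerator, SQLiteLoader

# Read the YAML document into plain Python objects
data = yaml.safe_load(Path("config.yaml").read_text())

# Convert the nested data to relational tables
table_gen = TableGenerator()
tables = table_gen.generate_tables(data, root_table_name="CONFIG")

# Generate SQL DDL for the chosen dialect
ddl_gen = DDLGenerator(dialect="sqlite")
ddl = ddl_gen.generate_ddl(tables, table_gen.relationships)

# Load the tables into a SQLite database file
loader = SQLiteLoader("output.db")
loader.load_tables(tables)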
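The DDL step can be sketched independently of the library. The example below maps Python value types to column types for two dialects and emits a CREATE TABLE statement. The TYPE_MAP contents and the function create_table_sql are assumptions for illustration and do not reflect how DDLGenerator chooses types. Snowflake and MySQL are left out to keep the sketch short.

```python
import sqlite3
from typing import Any

# Assumed type mappings, for illustration only
TYPE_MAP = {
    "sqlite": {int: "INTEGER", float: "REAL", str: "TEXT", bool: "INTEGER"},
    "postgresql": {int: "BIGINT", float: "DOUBLE PRECISION", str: "TEXT", bool: "BOOLEAN"},
}


def create_table_sql(name: str, rows: list[dict[str, Any]], dialect: str = "sqlite") -> str:
    """Build a CREATE TABLE statement from the first non-null value of each column."""
    types = TYPE_MAP[dialect]
    columns: dict[str, str] = {}
    for row in rows:
        for col, value in row.items():
            if col not in columns and value is not None:
                columns[col] = types.get(type(value), types[str])
    # Columns that were always NULL fall back to the text type
    for row in rows:
        for col in row:
            columns.setdefault(col, types[str])
    body = ",\n".join(f'    "{col}" {sql_type}' for col, sql_type in columns.items())
    return f'CREATE TABLE "{name}" (\n{body}\n);'


rows = [{"id": 1, "name": "a", "ratio": 0.5, "active": True}]
ddl = create_table_sql("JOBS", rows)

# Sanity check: SQLite accepts the generated statement
conn = sqlite3.connect(":memory:")
conn.execute(ddl)
```

Passing dialect="postgresql" to the same function produces BIGINT, DOUBLE PRECISION, and BOOLEAN columns under the assumed mapping.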
YAML Comparison
Python API:
from pathlib import Path
from schema_sentinel.yaml_comparator import YAMLComparator
# Create comparator
comparator = YAMLComparator(output_dir=Path("./temp_dbs"))
# Compare YAML files
report = comparator.compare_yaml_files(
    yaml1_path=Path("config1.yaml"),
    yaml2_path=Path("config2.yaml"),
    output_report=Path("comparison.md"),
    keep_dbs=False,  # Clean up temporary databases
    root_table_name="root",
)
print(report)
Configuration (For Schema Comparison)
For the Snowflake schema comparison features, create a .env file with your credentials:
SNOWFLAKE_ACCOUNT=your_account
SNOWFLAKE_USER=your_username
SNOWFLAKE_PASSWORD=your_password
SNOWFLAKE_WAREHOUSE=your_warehouse
SNOWFLAKE_DATABASE=your_database
SNOWFLAKE_ROLE=your_role
SNOWFLAKE_SCHEMAS=PUBLIC,ANALYTICS # Optional
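If you want to validate these settings before connecting, a small check along the following lines may help. It is a sketch, not part of Schema Sentinel: the function read_snowflake_settings is hypothetical, it assumes the variables have already been exported into the process environment, and it assumes every variable except SNOWFLAKE_SCHEMAS is required.

```python
import os
from typing import Any, Mapping

# Assumption: everything except SNOWFLAKE_SCHEMAS is mandatory
REQUIRED = (
    "SNOWFLAKE_ACCOUNT",
    "SNOWFLAKE_USER",
    "SNOWFLAKE_PASSWORD",
    "SNOWFLAKE_WAREHOUSE",
    "SNOWFLAKE_DATABASE",
    "SNOWFLAKE_ROLE",
)


def read_snowflake_settings(env: Mapping[str, str] = os.environ) -> dict[str, Any]:
    """Collect Snowflake settings and fail early when one is missing or empty."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError(f"Missing settings: {', '.join(missing)}")
    settings: dict[str, Any] = {name: env[name] for name in REQUIRED}
    # Optional comma-separated list, e.g. "PUBLIC,ANALYTICS"
    raw = env.get("SNOWFLAKE_SCHEMAS", "")
    settings["SNOWFLAKE_SCHEMAS"] = [s.strip() for s in raw.split(",") if s.strip()]
    return settings


example_env = {name: "placeholder" for name in REQUIRED}
example_env["SNOWFLAKE_SCHEMAS"] = "PUBLIC, ANALYTICS"
settings = read_snowflake_settings(example_env)
```

Passing a plain mapping instead of os.environ keeps the check easy to test and avoids printing real credentials.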
Documentation
YAML Shredder
- YAML Shredder CLI Guide - Complete CLI reference and examples
- Notebooks Guide - Jupyter notebooks for data comparison and analysis
- Generic Table Comparison - See MPM Comparison and Migration.ipynb for examples
General Documentation
- Project Wiki - Comprehensive documentation hub
- Getting Started - Installation and quick start
- Architecture - System design and architecture
- Development Guide - Development environment and guidelines
- Contributing Guide - How to contribute
- Security Guide - Security best practices
- Future Development Plan - Roadmap and upcoming features
- Installation & Setup Guide
- Development Guide - Detailed development instructions
- Security Policy - Security guidelines and reporting
- Changelog - Version history
- Production Checklist - Production readiness guide
Development
Setup Development Environment
# Install with development dependencies
uv pip install -e ".[dev,jupyter]"
# Install pre-commit hooks
pre-commit install
# Run tests
make test
# Format code
make format
# Lint code
make lint
Running Tests
# Run all tests
pytest
# Run with coverage
pytest --cov=schema_sentinel --cov-report=html
# Run specific test file
pytest tests/test_metadata.py
Code Quality
# Format code with Ruff
ruff format .
# Lint code
ruff check .
# Type checking
mypy schema_sentinel/
# Run all pre-commit hooks
pre-commit run --all-files
Architecture
schema-sentinel/
├── schema_sentinel/                    # Main package
│   ├── __init__.py                     # Package initialization
│   ├── config/                         # Configuration management
│   │   ├── __init__.py
│   │   └── manager.py                  # ConfigManager class
│   ├── markdown_utils/                 # Markdown report generation
│   │   └── markdown.py
│   └── metadata_manager/               # Core metadata management
│       ├── engine.py                   # Database connection engines
│       ├── metadata.py                 # Metadata extraction logic
│       ├── changeset.py                # Change detection and tracking
│       ├── enums.py                    # Enumerations and constants
│       ├── utils.py                    # Utility functions
│       ├── model/                      # Data models
│       │   ├── database.py             # Database model
│       │   ├── schema.py               # Schema model
│       │   ├── table.py                # Table model
│       │   ├── column.py               # Column model
│       │   ├── view.py                 # View model
│       │   ├── procedure.py            # Stored procedure model
│       │   ├── function.py             # Function model
│       │   ├── constraint.py           # Constraint models
│       │   └── ...                     # Other object models
│       └── lookup/                     # Reference data
│           └── sql_data_type.py
├── yaml_shredder/                      # YAML/JSON processing toolkit
│   ├── __init__.py
│   ├── schema_generator.py             # Auto JSON Schema generation
│   ├── structure_analyzer.py           # Nested structure analysis
│   ├── table_generator.py              # Relational table conversion
│   ├── ddl_generator.py                # SQL DDL generation
│   └── data_loader.py                  # SQLite data loading
├── resources/                          # Configuration and templates
│   ├── examples/                       # Example files and configurations
│   │   ├── .env.example                # Environment variables template
│   │   ├── example_sqlite_workflow.py  # SQLite workflow example
│   │   └── ...                         # Other example files
│   ├── db.properties                   # Database config template
│   ├── datacompy/templates/            # Report templates
│   ├── meta-db/                        # SQLite metadata storage
│   └── migrations-ddl/                 # DDL migration procedures
├── tests/                              # Test suite
│   ├── test_config.py                  # Configuration tests
│   ├── test_imports.py                 # Import tests
│   └── ...                             # Other test files
├── docs/                               # API documentation (pdoc)
├── wiki/                               # Project wiki and guides
└── notebooks/                          # Jupyter notebooks
    ├── MPM Comparison and Migration.ipynb
    └── ...
Supported Database Objects
- ✅ Databases
- ✅ Schemas
- ✅ Tables (with columns, data types, nullability)
- ✅ Views
- ✅ Materialized Views
- ✅ Stored Procedures
- ✅ Functions (UDFs)
- ✅ Primary Keys
- ✅ Foreign Keys
- ✅ Unique Constraints
- ✅ Streams
- ✅ Tasks
- ✅ Pipes
- ✅ Stages
Contributing
We welcome contributions! This is an open source project and we'd love your help to make it better.
How to Contribute
- Fork the repository
- Create a feature branch from dev (git checkout -b feature/amazing-feature)
- Make your changes
- Add tests for your changes
- Ensure tests pass (pytest)
- Format code (ruff format .)
- Commit changes (git commit -m 'feat: add amazing feature')
- Push to branch (git push origin feature/amazing-feature)
- Open a Pull Request to merge into the dev branch
See CONTRIBUTING.md for detailed guidelines and BRANCHING.md for our branching strategy.
Development Guidelines
- Follow PEP 8 style guide (enforced by Ruff)
- Add tests for new features
- Update documentation
- Use conventional commit messages
- Ensure CI passes before requesting review
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Security
Security is a top priority. Please see SECURITY.md for:
- Reporting vulnerabilities
- Security best practices
- Credential management guidelines
Never commit credentials or sensitive data to the repository.
Acknowledgments
- Built with modern Python tooling: uv, Ruff
- Powered by SQLAlchemy and Snowflake SQLAlchemy
- Inspired by the need for better database change management in data engineering
Project Status
Current Status: Active Development 🚧
This project is being actively developed and prepared for production use. We're working towards v2.1.0 with:
- ✅ Modern Python packaging (pyproject.toml)
- ✅ Comprehensive testing framework
- ✅ CI/CD pipelines
- ✅ Documentation
- 🚧 Enhanced metadata extraction
- 🚧 Additional database support
- 🚧 Web UI (planned)
Roadmap
- v2.1.0 - Current release with uv support, modern tooling
- v2.2.0 - DuckDB integration, enhanced data comparator, PostgreSQL & MySQL support
- v2.3.0 - REST API, CLI interface, Oracle & SQL Server support
- v3.0.0 - Web UI, multi-user support, RBAC, CI/CD integration
See the detailed Future Development Plan for the comprehensive roadmap and planned features.
Support & Community
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Questions: Use the question issue template
Made with ❤️ for the data engineering community
If you find this project useful, please consider giving it a ⭐️ on GitHub!