Skip to main content

Detect schema drift before your SQL queries blow up.

Project description

sqldrift

Detect schema drift before your SQL queries blow up.

sqldrift validates SQL queries against a live database schema to catch missing tables before execution. Built on sqlglot for robust, multi-dialect SQL parsing.

Why?

When AI agents or automated pipelines generate SQL, they may reference tables that have been renamed, dropped, or migrated. Running those queries blindly causes production failures. sqldrift catches the problem before it happens.

Features

  • Schema drift detection — identifies missing tables before query execution
  • CTE-aware — correctly ignores Common Table Expressions
  • Case-insensitive matching (configurable)
  • Schema-qualified name support — handles schema.table references
  • Optimized for scale — class-based validator with O(1) lookups for 4,000+ tables
  • LRU caching — up to 282x speedup for repeated queries
  • Multi-dialect — PostgreSQL, MySQL, BigQuery, and more via sqlglot

Installation

pip install .

For development:

pip install -e ".[dev]"

Quick Start

Simple one-off validation

from sqldrift import validate_query

live_tables = ["users", "orders", "products"]

success, message = validate_query(
    "SELECT * FROM users JOIN orders ON users.id = orders.user_id",
    live_tables,
)
# (True, "Query is safe to execute.")

success, message = validate_query(
    "SELECT * FROM deleted_table",
    live_tables,
)
# (False, "Schema Drift Detected: The following tables were not found: ['deleted_table']")

Reusable validator (recommended for large schemas)

from sqldrift import SchemaValidator

validator = SchemaValidator(live_tables)

# Validate many queries — the table set is built once
success, message = validator.validate("SELECT * FROM users")

Column-Level Drift Detection

While the standard validator checks for missing tables, ColumnValidator checks for missing columns. This requires providing a schema definition with column names.

from sqldrift import ColumnValidator

# Define your schema (table -> columns)
schema = {
    "users": {
        "columns": ["id", "name", "email", "created_at"],
        "types": ["INTEGER", "VARCHAR", "VARCHAR", "TIMESTAMP"], # optional
    },
    "orders": {
        "columns": ["id", "user_id", "total"],
    },
}

validator = ColumnValidator(schema)

# Valid query
success, msg = validator.validate("SELECT name FROM users")
# (True, "All columns exist.")

# Invalid query (column 'tier' missing)
success, msg = validator.validate("SELECT tier FROM users")
# (False, "Column Drift Detected: The following columns were not found: ['tier']")

Features

  • Qualified names: Handles table.column and alias references (u.name) correctly.
  • Suggestions: Offers specific column names if a mismatch is close (e.g. user_id vs userid).
  • Caching: Use CachedColumnValidator for high-performance repeated validation.

Cached validator (best for repeated queries)

from sqldrift import CachedSchemaValidator

cached = CachedSchemaValidator(live_tables, cache_size=256)
success, message = cached.validate(query)  # subsequent identical calls are cached

Where Do Table Lists Come From?

sqldrift does not connect to a database — you provide the list of tables that currently exist in your schema. This keeps the library database-agnostic and flexible.

Common ways to source your table list:

# PostgreSQL via psycopg2
cur.execute("SELECT table_name FROM information_schema.tables WHERE table_schema = 'public'")
live_tables = [row[0] for row in cur.fetchall()]

# MySQL via mysql-connector
cur.execute("SHOW TABLES")
live_tables = [row[0] for row in cur.fetchall()]

# SQLAlchemy (any backend)
from sqlalchemy import inspect
live_tables = inspect(engine).get_table_names()

# Static config (YAML, JSON, etc.)
import json
with open("schema.json") as f:
    live_tables = json.load(f)["tables"]

Then pass the list to any validator:

validator = SchemaValidator(live_tables)
success, msg = validator.validate("SELECT * FROM users JOIN orders ON ...")

Integration Examples

Detailed guides with runnable scripts for specific database integrations:

Integration Guide
AWS Glue Catalog + Athena examples/glue_athena/
GCP BigQuery examples/bigquery/
Azure Synapse Analytics examples/azure_synapse/
AWS Redshift (Data API) examples/redshift/
PostgreSQL examples/postgresql/
MySQL examples/mysql/

API Reference

validate_query(sql_query, live_tables, *, dialect=None)

Standalone function for one-off validation.

Parameter Type Description
sql_query str The SQL query to validate
live_tables list[str] Tables that exist in the schema
dialect str | None SQL dialect ("postgres", "mysql", etc.)

Returns: tuple[bool, str]

SchemaValidator(live_tables, *, case_sensitive=False, preserve_schema=False)

Class-based validator with pre-computed table lookups.

Option Default Description
case_sensitive False Enable case-sensitive table name matching
preserve_schema False Match full schema.table names instead of base name

Methods:

Method Description
validate(sql_query, dialect) Validate a query against the schema
update_schema(live_tables) Hot-swap the schema at runtime
table_exists(table_name) Check if a specific table exists
get_table_count() Return the number of registered tables

CachedSchemaValidator

Extends SchemaValidator with LRU caching. Accepts an additional cache_size parameter (default: 128).

Additional methods: clear_cache(), get_cache_info()

Project Structure

sqldrift/
├── pyproject.toml
├── LICENSE
├── README.md
├── src/
│   └── sqldrift/
│       ├── __init__.py        # Public API
│       ├── validator.py       # Core validate_query function
│       └── optimized.py       # SchemaValidator & CachedSchemaValidator
├── tests/
│   └── test_validator.py      # pytest suite (20 tests)
├── examples/
│   └── usage_examples.py
└── benchmarks/
    └── benchmark.py

Performance

Benchmarked with 4,000 tables:

Method Speedup vs function
SchemaValidator ~2.6x faster
CachedValidator ~282x faster
python benchmarks/benchmark.py

Testing

pip install -e ".[dev]"
pytest

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sqldrift-0.1.5.tar.gz (17.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sqldrift-0.1.5-py3-none-any.whl (13.9 kB view details)

Uploaded Python 3

File details

Details for the file sqldrift-0.1.5.tar.gz.

File metadata

  • Download URL: sqldrift-0.1.5.tar.gz
  • Upload date:
  • Size: 17.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sqldrift-0.1.5.tar.gz
Algorithm Hash digest
SHA256 1e78ba3346af2b8cd650b9acd699c1a698e85e725e6a7a9415491ef0f14b404f
MD5 ac693ad4d26a6dbeee9127355613ccff
BLAKE2b-256 d0768a83c5e5ffa8e7dc8da7b0f9670f8a7b944a2fe64a11a0baae17ad30782a

See more details on using hashes here.

Provenance

The following attestation bundles were made for sqldrift-0.1.5.tar.gz:

Publisher: publish.yml on aruncse01/sqldrift

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sqldrift-0.1.5-py3-none-any.whl.

File metadata

  • Download URL: sqldrift-0.1.5-py3-none-any.whl
  • Upload date:
  • Size: 13.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sqldrift-0.1.5-py3-none-any.whl
Algorithm Hash digest
SHA256 46f04b34b47d16ad8abb044e2199a8c163370ed362fd97a9f485a498ef312df4
MD5 de49ff9bce3ff6839713e9dd2994a330
BLAKE2b-256 725e5f92987d50b3b5a7f6335938edcf1d1a35b45d643f14fb08d533183d4be0

See more details on using hashes here.

Provenance

The following attestation bundles were made for sqldrift-0.1.5-py3-none-any.whl:

Publisher: publish.yml on aruncse01/sqldrift

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page