Skip to main content

SQL Dependency Extractor

Project description

SQLDeps: SQL Dependency Extractor

SQLDeps Logo

Test Documentation Supported Python versions Package version License

A tool that automatically extracts and maps SQL dependencies and outputs using Large Language Models (LLMs).



Overview

SQLDeps analyzes SQL scripts to identify:

  1. Dependencies: Tables and columns that must exist BEFORE query execution
  2. Outputs: Tables and columns permanently CREATED or MODIFIED by the query

It intelligently filters out temporary constructs like CTEs and derived tables, focusing only on the real database objects that matter.

Benefits

  • 🛠️ Change Management: Safely modify schemas by identifying true dependencies
  • 💾 Storage Optimization: Focus resources on essential tables and columns
  • 🚢 Migration Planning: Precisely determine what needs to be migrated
  • 📝 Project Documentation: Create comprehensive maps of database dependencies

Installation

pip install sqldeps

Quick Start

SQLDeps provides both API and CLI interfaces:

  • API: Enables flexibility for Python developers to integrate with their applications
  • CLI: Offers an easy-to-use command-line interface for quick analysis

API Usage

from sqldeps.llm_parsers import create_extractor

# Create extractor with default settings (framework="groq", model="llama-3.3-70b-versatile")
extractor = create_extractor()

# Extract dependencies and outputs from a SQL query
sql_query = """
WITH user_orders AS (
    SELECT o.user_id, COUNT(*) AS order_count
    FROM orders o
    JOIN users u ON o.user_id = u.id
    WHERE u.status = 'active'
    GROUP BY o.user_id
)

CREATE TABLE transactions.user_order_summary AS
SELECT * FROM user_orders;
"""
result = extractor.extract_from_query(sql_query)

# Print the results
print("Dependencies:")
print(result.dependencies)
print("\nOutputs:")
print(result.outputs)

# Or extract from a file
result = extractor.extract_from_file('path/to/query.sql')

# Convert to dictionary or DataFrame
dict_format = result.to_dict()
df_format = result.to_dataframe()

CLI Usage

# Basic example with default settings
sqldeps path/to/query.sql

# Specify framework and output format
sqldeps path/to/query.sql --framework=openai --model=gpt-4o-mini -o query_deps.csv

# Process a folder recursively with database validation
sqldeps data/sql_folder \
    --recursive \
    --framework=deepseek \
    --db-match-schema \
    --db-target-schemas public,sales \
    --db-credentials configs/database.yml \
    -o folder_deps.csv

Example

Given this SQL query:

-- Common Table Expression (CTE) to count user orders for active users
WITH user_orders AS (
    SELECT o.user_id, COUNT(*) AS order_count
    FROM orders o
    JOIN users u ON o.user_id = u.id
    WHERE u.status = 'active'
    GROUP BY o.user_id
)

-- Create a new table from the CTE
CREATE TABLE transactions.user_order_summary AS
SELECT * FROM user_orders;

-- Truncate an existing table before repopulating
TRUNCATE TABLE order_summary;

SQLDeps will extract:

{
  "dependencies": {
    "orders": ["user_id"],
    "users": ["id", "status"],
    "order_summary": []
  },
  "outputs": {
    "transactions.user_order_summary": ["*"],
    "order_summary": []
  }
}

Note how:

  • CTE (user_orders) is correctly excluded
  • Real source tables (orders, users) are included as dependencies
  • Target tables (transactions.user_order_summary, order_summary) are correctly identified as outputs
  • For TRUNCATE operations, the table appears in both dependencies and outputs because it must exist before truncating and the operation modifies the table

Supported Models

All models available on Groq, OpenAI, and DeepSeek.
For up-to-date pricing details, please check Groq, OpenAI, DeepSeek.

API Keys / Configuration

You'll need to set up API keys for your chosen LLM provider. Create a .env file in your project root (see the provided .env.example):

# LLM API Keys
GROQ_API_KEY=your_groq_api_key
OPENAI_API_KEY=your_openai_api_key
DEEPSEEK_API_KEY=your_deepseek_api_key

# Database credentials (for schema validation)
DB_HOST=localhost
DB_PORT=5432
DB_NAME=mydatabase
DB_USER=username
DB_PASSWORD=password

For custom database YAML configuration file (optional):

# database.yml
database:
  host: localhost
  port: 5432
  database: mydatabase
  username: username
  password: password

Advanced Usage

Database Schema Validation

from sqldeps.database import PostgreSQLConnector
from sqldeps.llm_parsers import create_extractor

# Extract dependencies
extractor = create_extractor(framework="openai", model="gpt-4o")
result = extractor.extract_from_file('query.sql')

# Connect to database and validate
conn = PostgreSQLConnector(
    host="localhost",
    port=5432,
    database="mydatabase",
    username="username"
)

# Match extracted dependencies against database schema
validated_schema = extractor.match_database_schema(
    result,
    db_connection=conn,
    target_schemas=["public", "sales"]
)

# View validation results (pandas DataFrame)
print(validated_schema)

Processing Multiple Files

# Extract dependencies from all SQL files in a folder
result = extractor.extract_from_folder('/path/to/sql_folder', recursive=True)

Using Custom Prompts

You can customize the prompts used to instruct the LLM:

# Create extractor with custom prompt
extractor = create_extractor(
    framework="groq",
    model="llama-3.3-70b-versatile",
    prompt_path="path/to/custom_prompt.yml"
)

The custom prompt YAML should include:

system_prompt: |
  Detailed instructions to the model...

user_prompt: |
  Extract SQL dependencies and outputs from this query:
  {sql}

Web Application

SQLDeps includes a Streamlit-based web interface for interactive exploration of single SQL files:

# Install with web app dependencies
pip install "sqldeps[app]"

# Run the app
streamlit run app/main.py

Note: The web application is currently designed for single-file extraction and demonstration purposes. For processing multiple files or entire folders, use the API or CLI.

Documentation

For comprehensive documentation, including API reference and examples, visit https://sqldeps.readthedocs.io.

Contributing

Contributions are welcome!

  • Found a bug? Please open an issue with detailed information.
  • Missing a feature? Feel free to suggest enhancements or submit a pull request.

Check out the issues page or submit a pull request.

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sqldeps-0.0.10.tar.gz (23.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sqldeps-0.0.10-py3-none-any.whl (28.3 kB view details)

Uploaded Python 3

File details

Details for the file sqldeps-0.0.10.tar.gz.

File metadata

  • Download URL: sqldeps-0.0.10.tar.gz
  • Upload date:
  • Size: 23.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.5.9

File hashes

Hashes for sqldeps-0.0.10.tar.gz
Algorithm Hash digest
SHA256 11b16f01b8d6b442eef975fb2d29e5ad4eafe29bc2d413e599f4b2c643547f65
MD5 f4c6a9507a86eca81cfd2a8e60c803a9
BLAKE2b-256 46d66f1454bc6ff84e6f9b0fc947df72d16c16b61fbecaef6398c8e26f9375ba

See more details on using hashes here.

File details

Details for the file sqldeps-0.0.10-py3-none-any.whl.

File metadata

  • Download URL: sqldeps-0.0.10-py3-none-any.whl
  • Upload date:
  • Size: 28.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.5.9

File hashes

Hashes for sqldeps-0.0.10-py3-none-any.whl
Algorithm Hash digest
SHA256 2b4f40e2aafa320bd842d956df650bd962c0057512a8740284314bfaaca40c82
MD5 6fcbcd35633783638a5d5bcafa204d91
BLAKE2b-256 e07ee4a87044be6ddef2703226ac3743f03248be294616702f288472959d4d7e

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page