Skip to main content

A multi-agent orchestration system that converts natural language queries into safe, optimized SQL using LangGraph and OpenRouter LLMs.

Project description

NL2SQL - Multi-Agent Natural Language to SQL System

Python 3.13+ LangGraph License: MIT Code style: black

A sophisticated multi-agent orchestration system that converts natural language queries into safe, optimized SQL using LangGraph and OpenRouter LLMs. Built with security, performance, and explainability at its core.

NL2SQL Demo

Key Features

  • Multi-Agent Architecture: 8 specialized agents working in harmony (Discovery, Security, Schema Formatting, Query Generation, Validation, Explanation)
  • Security-First: Built-in SQL injection prevention, dangerous operation filtering, and security scoring
  • Smart Table Discovery: Multi-signal ranking (keyword matching, semantic search, foreign key graph analysis)
  • Performance Validation: Query optimization checks and performance score reporting
  • Explainability: Human-readable explanations for every generated query
  • Async Architecture: Non-blocking operations throughout the pipeline
  • Schema Caching: Intelligent caching to reduce database introspection overhead
  • Comprehensive Logging: Detailed file-based logging for debugging and monitoring

Architecture

NL2SQL Multi-Agent System Architecture

Quick Start

Prerequisites

  • Python 3.13 or higher
  • OpenRouter API key (or compatible LLM provider)
  • SQLite databases (Spider dataset recommended)

Installation

  1. Clone the repository

    git clone https://github.com/ToheedAsghar/NL2SQL.git
    cd NL2SQL
    
  2. Create virtual environment

    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    
  3. Install dependencies

    pip install -r requirements.txt
    
  4. Download Spider dataset

    The Spider dataset (877MB) is not included in the repository. Download it from the official source:

    # Visit https://yale-lily.github.io/spider
    # Download and extract to ./spider/ directory
    
    # Verify structure
    ls spider/database/
    # Should show: academic, concert_singer, car_1, etc.
    
  5. Configure environment

    Create a .env file in the project root:

    # LLM Provider Configuration
    OPENAI_API_KEY=sk-or-v1-your-openrouter-api-key
    OPENAI_BASE_URL=https://openrouter.ai/api/v1
    OPENAI_MODEL=openai/gpt-4o-mini
    OPENAI_EMBEDDING_MODEL=openai/text-embedding-3-small
    
    # Database Configuration
    DB_TYPE=sqlite
    DB_PATH=./spider/database/concert_singer/concert_singer.sqlite
    
    # Optional: Adjust discovery agent weights
    # KEYWORD_WEIGHT=0.35
    # SEMANTIC_WEIGHT=0.45
    # FK_GRAPH_WEIGHT=0.20
    
  6. Run the application

    python main.py
    

Usage Examples

Interactive Mode

python main.py
Enter your question (or 'quit'/'exit'/'q'): 
> Show me all singers from France

══════════════════════════════════════════════════════════════
 SQL QUERY
══════════════════════════════════════════════════════════════
SELECT s.*
FROM concert_singer.singer AS s
WHERE s.Country = 'France'

══════════════════════════════════════════════════════════════
 EXPLANATION
══════════════════════════════════════════════════════════════
This query retrieves all singers from France by:
1. Selecting all columns from the singer table
2. Filtering records where Country equals 'France'
3. Using table alias 's' for readability

Safety Score: 3.5/4.0
✓ Security: Safe (no SQL injection risks)
✓ Syntax: Valid SQL
✓ Logic: Correct table and column usage
⚠ Performance: Consider adding index on Country column for large datasets

══════════════════════════════════════════════════════════════

Sample Questions (Concert Singer Database)

- Show me all singers from France
- List all concerts held in 2024
- What are the top 5 stadiums by capacity?
- How many concerts did each singer perform?
- Which stadium hosted the most concerts?
- Find singers who have never performed in a concert
- Show me concerts with ticket prices above $100

Configuration

LLM Provider

The system uses OpenRouter by default, but you can configure any OpenAI-compatible API:

OPENAI_API_KEY=your-api-key
OPENAI_BASE_URL=https://api.openai.com/v1  # or your provider
OPENAI_MODEL=gpt-4-turbo
OPENAI_EMBEDDING_MODEL=text-embedding-3-small

Database Configuration

Switch between different databases by updating DB_PATH:

# Concert Singer (music industry)
DB_PATH=./spider/database/concert_singer/concert_singer.sqlite

# Academic (papers and conferences)
DB_PATH=./spider/database/academic/academic.sqlite

# Car Sales
DB_PATH=./spider/database/car_1/car_1.sqlite

Discovery Agent Tuning

Adjust table discovery signal weights in .env:

KEYWORD_WEIGHT=0.35    # Exact keyword matching
SEMANTIC_WEIGHT=0.45   # Semantic similarity
FK_GRAPH_WEIGHT=0.20   # Foreign key relationships

Project Structure

NL2SQL/
├── nl2sql_agents/                     # Installable Python package
│   ├── __init__.py                    # Package version & metadata
│   ├── cli.py                         # Rich CLI entry point
│   ├── py.typed                       # PEP 561 typed marker
│   ├── agents/                        # Agent modules
│   │   ├── __init__.py
│   │   ├── base_agent.py              # Base agent with LLM calling
│   │   ├── query_generator.py         # SQL generation agent
│   │   ├── schema_formatter.py        # Schema formatting agent
│   │   ├── discovery/                 # Table discovery agents
│   │   │   ├── __init__.py
│   │   │   ├── discovery_agent.py     # Multi-signal orchestrator
│   │   │   ├── keyword_agent.py       # Keyword matching
│   │   │   ├── semantic_agent.py      # Embedding-based search
│   │   │   └── fk_graph_agent.py      # Foreign key analysis
│   │   ├── validator/                 # Validation agents
│   │   │   ├── __init__.py
│   │   │   ├── validator_agent.py     # Orchestrator
│   │   │   ├── syntax_validator.py    # SQL syntax check
│   │   │   ├── logic_validator.py     # Logical correctness
│   │   │   ├── security_validator.py  # SQL injection check
│   │   │   └── performance_validator.py # Performance analysis
│   │   └── explainer/                 # Explanation agents
│   │       ├── __init__.py
│   │       ├── explainer_agent.py     # Orchestrator
│   │       ├── explanation_agent.py   # Query explanation
│   │       ├── safety_report_agent.py # Safety scoring
│   │       └── optimization_agent.py  # Optimization tips
│   ├── orchestrator/                  # LangGraph pipeline
│   │   ├── __init__.py
│   │   ├── pipeline.py                # Main graph definition
│   │   └── nodes.py                   # Node implementations
│   ├── filters/                       # Pre-processing filters
│   │   ├── __init__.py
│   │   ├── gate.py                    # Table gating
│   │   └── security_filter.py         # Query security filter
│   ├── db/                            # Database layer
│   │   ├── __init__.py
│   │   └── connector.py               # Async SQLite connector
│   ├── cache/                         # Caching layer
│   │   ├── __init__.py
│   │   └── schema_cache.py            # Schema caching
│   ├── config/                        # Configuration
│   │   ├── __init__.py
│   │   └── settings.py                # LLM provider settings
│   └── models/                        # Data models
│       ├── __init__.py
│       └── schemas.py                 # GraphState and schemas
├── data/                              # Local data assets
│   └── spider/
│       └── database/                  # Spider dataset databases
├── Assets/                            # Static assets
│   ├── architecture.svg
│   ├── nl2sql.gif                     # Demo GIF
│   └── nl2sql-architecture.svg        # Architecture diagram
├── logs/                              # Application logs (auto-created)
│   └── *.log                          # Timestamped log files
├── spider/                            # Spider benchmark files
│   ├── dev.json
│   ├── dev_gold.sql
│   ├── tables.json
│   ├── train_gold.sql
│   ├── train_others.json
│   ├── train_spider.json
│   ├── README.txt
│   └── database/                      # 200+ domain databases
├── main.py                            # Legacy CLI entry point
├── pyproject.toml                     # Package metadata & build config
├── requirements.txt                   # Python dependencies
├── FAME_PLAN.md                       # FAME methodology plan
├── MEMORY_IMPLEMENTATION.md           # Memory system design doc
├── .env                               # Environment configuration
├── .env.example                       # Example env template
├── .gitignore                         # Git ignore rules
├── LICENSE                            # MIT License
└── README.md                          # This file

How It Works

1. Table Discovery

The Discovery Agent uses three signals to rank tables:

  • Keyword Matching (35%): Direct token overlap with query
  • Semantic Search (45%): Embedding similarity using LLM
  • Foreign Key Graph (20%): Relationship centrality analysis

Tables are ranked by weighted score, and top-K are passed forward.

2. Security Filtering

Before processing, queries are checked for:

  • DROP, DELETE, TRUNCATE operations
  • System table access attempts
  • SQL injection patterns
  • Dangerous function calls

Dangerous queries are blocked immediately.

3. Query Generation

The generator uses:

  • One-shot retry: If first generation fails validation, retry with error context
  • Schema-aware prompting: Full context of relevant tables, columns, types
  • Example-based learning: Few-shot examples in system prompt

4. Multi-Stage Validation

Four validators run in parallel:

  • Syntax: SQL parsing and syntax validation
  • Logic: Table/column existence, join correctness
  • Security: SQL injection and dangerous operation check
  • Performance: Query complexity, missing indexes, optimization tips

5. Explainability

The Explainer generates:

  • Step-by-step query breakdown
  • Safety score (0-4 scale across 4 dimensions)
  • Optimization recommendations
  • Human-readable summary

Database Setup

Spider Dataset

The Spider dataset contains 200+ databases across diverse domains:

  • Academic papers and conferences
  • Music (concert_singer)
  • Sports (baseball, football, soccer)
  • Business (car sales, employee management, HR)
  • E-commerce, healthcare, real estate, and more

Download: Visit https://yale-lily.github.io/spider

  1. Download the database ZIP file
  2. Extract to ./spider/ directory in project root
  3. Verify: ls spider/database/ should show ~200 subdirectories
  4. Update .env with desired database path

Custom Databases

To use your own SQLite database:

  1. Place your .sqlite or .db file anywhere accessible
  2. Update .env:
    DB_PATH=/path/to/your/database.sqlite
    
  3. Run the application

The system will automatically:

  • Introspect schema (tables, columns, types, foreign keys)
  • Cache schema for performance
  • Build embedding index for semantic search

Logging

Logs are automatically written to logs/ directory with timestamps:

logs/
├── nl2sql_20260301_143022.log
├── nl2sql_20260301_151134.log
└── nl2sql_20260301_163045.log

Log format includes:

  • Timestamp
  • Logger name (agent identifier)
  • Log level
  • Message (including token usage, scores, timings)

View logs:

tail -f logs/nl2sql_*.log

Development

Running Tests

# Unit tests
pytest tests/

# Integration tests
pytest tests/integration/

# Coverage report
pytest --cov=agents --cov=orchestrator --cov-report=html

Code Style

This project follows PEP 8 with Black formatting:

# Format code
black .

# Lint
ruff check .

# Type checking
mypy agents/ orchestrator/

Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Add tests for new functionality
  5. Ensure all tests pass (pytest)
  6. Format code (black .)
  7. Commit your changes (git commit -m 'Add amazing feature')
  8. Push to the branch (git push origin feature/amazing-feature)
  9. Open a Pull Request

Development Setup

# Install dev dependencies
pip install -r requirements-dev.txt

# Install pre-commit hooks
pre-commit install

Roadmap

  • Multi-database support: PostgreSQL, MySQL, SQL Server
  • Query execution: Safely execute generated queries and return results
  • Result visualization: Charts and graphs for query results
  • Query history: Track and reuse previous queries
  • Fine-tuned models: Domain-specific LLM fine-tuning
  • Web UI: Browser-based interface with real-time feedback
  • API server: REST/GraphQL API for integration
  • Streaming responses: Real-time query generation feedback
  • Multi-turn conversations: Context-aware follow-up queries
  • Natural language results: Convert SQL results back to natural language

License

This project is licensed under the MIT License - see the LICENSE file for details.


Acknowledgments

  • Spider Dataset: Yale University's semantic parsing and text-to-SQL benchmark
  • LangGraph: LangChain's graph-based agent orchestration framework
  • OpenRouter: LLM API aggregation service
  • LangChain: Building blocks for LLM applications

Contact

Toheed Asghar - Project Maintainer


Star History

If you find this project useful, please consider giving it a star! It helps others discover it.


Built with ❤️ using LangGraph and OpenRouter

Report Bug · Request Feature · Documentation

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

nl2sql_agents-0.1.0.tar.gz (203.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

nl2sql_agents-0.1.0-py3-none-any.whl (39.0 kB view details)

Uploaded Python 3

File details

Details for the file nl2sql_agents-0.1.0.tar.gz.

File metadata

  • Download URL: nl2sql_agents-0.1.0.tar.gz
  • Upload date:
  • Size: 203.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for nl2sql_agents-0.1.0.tar.gz
Algorithm Hash digest
SHA256 d9998eae235b57259db9a8ec86fb95a2030757211a35ea0823e35b33cd9d8b1e
MD5 2f65bbb320800f7b2c6827db1b9fdf42
BLAKE2b-256 2cc67d742b39fb36c45edfb84add02c31ca391a86b1dab33c422250cc3b0882a

See more details on using hashes here.

File details

Details for the file nl2sql_agents-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: nl2sql_agents-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 39.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for nl2sql_agents-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 8304a1e4952bc5610c3cb30b083c37c507b85f66db763853ba0d9738d04e538c
MD5 9ef72a8173071bab67cebfb5fc8a5ed7
BLAKE2b-256 b44302393aa218c039b08f3e0210a20ef63d47697287cfe90485376858e46bba

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page