A multi-agent orchestration system that converts natural language queries into safe, optimized SQL using LangGraph and OpenRouter LLMs.

These details have not been verified by PyPI

Project links

Project description

NL2SQL - Multi-Agent Natural Language to SQL System

A sophisticated multi-agent orchestration system that converts natural language queries into safe, optimized SQL using LangGraph and OpenRouter LLMs. Built with security, performance, and explainability at its core.

Key Features

Multi-Agent Architecture: 8 specialized agents working in harmony (Discovery, Security, Schema Formatting, Query Generation, Validation, Explanation)
Security-First: Built-in SQL injection prevention, dangerous operation filtering, and security scoring
Smart Table Discovery: Multi-signal ranking (keyword matching, semantic search, foreign key graph analysis)
Performance Validation: Query optimization checks and performance score reporting
Explainability: Human-readable explanations for every generated query
Async Architecture: Non-blocking operations throughout the pipeline
Schema Caching: Intelligent caching to reduce database introspection overhead
Comprehensive Logging: Detailed file-based logging for debugging and monitoring

Architecture

Quick Start

Prerequisites

Python 3.13 or higher
OpenRouter API key (or compatible LLM provider)
SQLite databases (Spider dataset recommended)

Installation

Clone the repository

git clone https://github.com/ToheedAsghar/NL2SQL.git
cd NL2SQL

Create virtual environment

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Install dependencies
```
pip install -r requirements.txt
```

Download Spider dataset

The Spider dataset (877MB) is not included in the repository. Download it from the official source:

# Visit https://yale-lily.github.io/spider
# Download and extract to ./spider/ directory

# Verify structure
ls spider/database/
# Should show: academic, concert_singer, car_1, etc.

Configure environment

Create a .env file in the project root:

# LLM Provider Configuration
OPENAI_API_KEY=sk-or-v1-your-openrouter-api-key
OPENAI_BASE_URL=https://openrouter.ai/api/v1
OPENAI_MODEL=openai/gpt-4o-mini
OPENAI_EMBEDDING_MODEL=openai/text-embedding-3-small

# Database Configuration
DB_TYPE=sqlite
DB_PATH=./spider/database/concert_singer/concert_singer.sqlite

# Optional: Adjust discovery agent weights
# KEYWORD_WEIGHT=0.35
# SEMANTIC_WEIGHT=0.45
# FK_GRAPH_WEIGHT=0.20

Run the application
```
python main.py
```

Usage Examples

Interactive Mode

python main.py

Enter your question (or 'quit'/'exit'/'q'): 
> Show me all singers from France

══════════════════════════════════════════════════════════════
 SQL QUERY
══════════════════════════════════════════════════════════════
SELECT s.*
FROM concert_singer.singer AS s
WHERE s.Country = 'France'

══════════════════════════════════════════════════════════════
 EXPLANATION
══════════════════════════════════════════════════════════════
This query retrieves all singers from France by:
1. Selecting all columns from the singer table
2. Filtering records where Country equals 'France'
3. Using table alias 's' for readability

Safety Score: 3.5/4.0
✓ Security: Safe (no SQL injection risks)
✓ Syntax: Valid SQL
✓ Logic: Correct table and column usage
⚠ Performance: Consider adding index on Country column for large datasets

══════════════════════════════════════════════════════════════

Sample Questions (Concert Singer Database)

- Show me all singers from France
- List all concerts held in 2024
- What are the top 5 stadiums by capacity?
- How many concerts did each singer perform?
- Which stadium hosted the most concerts?
- Find singers who have never performed in a concert
- Show me concerts with ticket prices above $100

Configuration

LLM Provider

The system uses OpenRouter by default, but you can configure any OpenAI-compatible API:

OPENAI_API_KEY=your-api-key
OPENAI_BASE_URL=https://api.openai.com/v1  # or your provider
OPENAI_MODEL=gpt-4-turbo
OPENAI_EMBEDDING_MODEL=text-embedding-3-small

Database Configuration

Switch between different databases by updating DB_PATH:

# Concert Singer (music industry)
DB_PATH=./spider/database/concert_singer/concert_singer.sqlite

# Academic (papers and conferences)
DB_PATH=./spider/database/academic/academic.sqlite

# Car Sales
DB_PATH=./spider/database/car_1/car_1.sqlite

Discovery Agent Tuning

Adjust table discovery signal weights in .env:

KEYWORD_WEIGHT=0.35    # Exact keyword matching
SEMANTIC_WEIGHT=0.45   # Semantic similarity
FK_GRAPH_WEIGHT=0.20   # Foreign key relationships

Project Structure

NL2SQL/
├── nl2sql_agents/                     # Installable Python package
│   ├── __init__.py                    # Package version & metadata
│   ├── cli.py                         # Rich CLI entry point
│   ├── py.typed                       # PEP 561 typed marker
│   ├── agents/                        # Agent modules
│   │   ├── __init__.py
│   │   ├── base_agent.py              # Base agent with LLM calling
│   │   ├── query_generator.py         # SQL generation agent
│   │   ├── schema_formatter.py        # Schema formatting agent
│   │   ├── discovery/                 # Table discovery agents
│   │   │   ├── __init__.py
│   │   │   ├── discovery_agent.py     # Multi-signal orchestrator
│   │   │   ├── keyword_agent.py       # Keyword matching
│   │   │   ├── semantic_agent.py      # Embedding-based search
│   │   │   └── fk_graph_agent.py      # Foreign key analysis
│   │   ├── validator/                 # Validation agents
│   │   │   ├── __init__.py
│   │   │   ├── validator_agent.py     # Orchestrator
│   │   │   ├── syntax_validator.py    # SQL syntax check
│   │   │   ├── logic_validator.py     # Logical correctness
│   │   │   ├── security_validator.py  # SQL injection check
│   │   │   └── performance_validator.py # Performance analysis
│   │   └── explainer/                 # Explanation agents
│   │       ├── __init__.py
│   │       ├── explainer_agent.py     # Orchestrator
│   │       ├── explanation_agent.py   # Query explanation
│   │       ├── safety_report_agent.py # Safety scoring
│   │       └── optimization_agent.py  # Optimization tips
│   ├── orchestrator/                  # LangGraph pipeline
│   │   ├── __init__.py
│   │   ├── pipeline.py                # Main graph definition
│   │   └── nodes.py                   # Node implementations
│   ├── filters/                       # Pre-processing filters
│   │   ├── __init__.py
│   │   ├── gate.py                    # Table gating
│   │   └── security_filter.py         # Query security filter
│   ├── db/                            # Database layer
│   │   ├── __init__.py
│   │   └── connector.py               # Async SQLite connector
│   ├── cache/                         # Caching layer
│   │   ├── __init__.py
│   │   └── schema_cache.py            # Schema caching
│   ├── config/                        # Configuration
│   │   ├── __init__.py
│   │   └── settings.py                # LLM provider settings
│   └── models/                        # Data models
│       ├── __init__.py
│       └── schemas.py                 # GraphState and schemas
├── data/                              # Local data assets
│   └── spider/
│       └── database/                  # Spider dataset databases
├── Assets/                            # Static assets
│   ├── architecture.svg
│   ├── nl2sql.gif                     # Demo GIF
│   └── nl2sql-architecture.svg        # Architecture diagram
├── logs/                              # Application logs (auto-created)
│   └── *.log                          # Timestamped log files
├── spider/                            # Spider benchmark files
│   ├── dev.json
│   ├── dev_gold.sql
│   ├── tables.json
│   ├── train_gold.sql
│   ├── train_others.json
│   ├── train_spider.json
│   ├── README.txt
│   └── database/                      # 200+ domain databases
├── main.py                            # Legacy CLI entry point
├── pyproject.toml                     # Package metadata & build config
├── requirements.txt                   # Python dependencies
├── FAME_PLAN.md                       # FAME methodology plan
├── MEMORY_IMPLEMENTATION.md           # Memory system design doc
├── .env                               # Environment configuration
├── .env.example                       # Example env template
├── .gitignore                         # Git ignore rules
├── LICENSE                            # MIT License
└── README.md                          # This file

How It Works

1. Table Discovery

The Discovery Agent uses three signals to rank tables:

Keyword Matching (35%): Direct token overlap with query
Semantic Search (45%): Embedding similarity using LLM
Foreign Key Graph (20%): Relationship centrality analysis

Tables are ranked by weighted score, and top-K are passed forward.

2. Security Filtering

Before processing, queries are checked for:

DROP, DELETE, TRUNCATE operations
System table access attempts
SQL injection patterns
Dangerous function calls

Dangerous queries are blocked immediately.

3. Query Generation

The generator uses:

One-shot retry: If first generation fails validation, retry with error context
Schema-aware prompting: Full context of relevant tables, columns, types
Example-based learning: Few-shot examples in system prompt

4. Multi-Stage Validation

Four validators run in parallel:

Syntax: SQL parsing and syntax validation
Logic: Table/column existence, join correctness
Security: SQL injection and dangerous operation check
Performance: Query complexity, missing indexes, optimization tips

5. Explainability

The Explainer generates:

Step-by-step query breakdown
Safety score (0-4 scale across 4 dimensions)
Optimization recommendations
Human-readable summary

Database Setup

Spider Dataset

The Spider dataset contains 200+ databases across diverse domains:

Academic papers and conferences
Music (concert_singer)
Sports (baseball, football, soccer)
Business (car sales, employee management, HR)
E-commerce, healthcare, real estate, and more

Download: Visit https://yale-lily.github.io/spider

Download the database ZIP file
Extract to ./spider/ directory in project root
Verify: ls spider/database/ should show ~200 subdirectories
Update .env with desired database path

Custom Databases

To use your own SQLite database:

Place your .sqlite or .db file anywhere accessible
Update .env:
```
DB_PATH=/path/to/your/database.sqlite
```
Run the application

The system will automatically:

Introspect schema (tables, columns, types, foreign keys)
Cache schema for performance
Build embedding index for semantic search

Logging

Logs are automatically written to logs/ directory with timestamps:

logs/
├── nl2sql_20260301_143022.log
├── nl2sql_20260301_151134.log
└── nl2sql_20260301_163045.log

Log format includes:

Timestamp
Logger name (agent identifier)
Log level
Message (including token usage, scores, timings)

View logs:

tail -f logs/nl2sql_*.log

Development

Running Tests

# Unit tests
pytest tests/

# Integration tests
pytest tests/integration/

# Coverage report
pytest --cov=agents --cov=orchestrator --cov-report=html

Code Style

This project follows PEP 8 with Black formatting:

# Format code
black .

# Lint
ruff check .

# Type checking
mypy agents/ orchestrator/

Contributing

Contributions are welcome! Please follow these steps:

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Make your changes
Add tests for new functionality
Ensure all tests pass (pytest)
Format code (black .)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Development Setup

# Install dev dependencies
pip install -r requirements-dev.txt

# Install pre-commit hooks
pre-commit install

Roadmap

Multi-database support: PostgreSQL, MySQL, SQL Server
Query execution: Safely execute generated queries and return results
Result visualization: Charts and graphs for query results
Query history: Track and reuse previous queries
Fine-tuned models: Domain-specific LLM fine-tuning
Web UI: Browser-based interface with real-time feedback
API server: REST/GraphQL API for integration
Streaming responses: Real-time query generation feedback
Multi-turn conversations: Context-aware follow-up queries
Natural language results: Convert SQL results back to natural language

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Spider Dataset: Yale University's semantic parsing and text-to-SQL benchmark
LangGraph: LangChain's graph-based agent orchestration framework
OpenRouter: LLM API aggregation service
LangChain: Building blocks for LLM applications

Contact

Toheed Asghar - Project Maintainer

GitHub: @toheedasghar
Email: toheedasghar1@gmail.com

Star History

If you find this project useful, please consider giving it a star! It helps others discover it.

Built with ❤️ using LangGraph and OpenRouter

Report Bug · Request Feature · Documentation

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.0

Mar 5, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

nl2sql_agents-0.1.0.tar.gz (203.6 kB view details)

Uploaded Mar 5, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

nl2sql_agents-0.1.0-py3-none-any.whl (39.0 kB view details)

Uploaded Mar 5, 2026 Python 3

File details

Details for the file nl2sql_agents-0.1.0.tar.gz.

File metadata

Download URL: nl2sql_agents-0.1.0.tar.gz
Upload date: Mar 5, 2026
Size: 203.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for nl2sql_agents-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`d9998eae235b57259db9a8ec86fb95a2030757211a35ea0823e35b33cd9d8b1e`
MD5	`2f65bbb320800f7b2c6827db1b9fdf42`
BLAKE2b-256	`2cc67d742b39fb36c45edfb84add02c31ca391a86b1dab33c422250cc3b0882a`

See more details on using hashes here.

File details

Details for the file nl2sql_agents-0.1.0-py3-none-any.whl.

File metadata

Download URL: nl2sql_agents-0.1.0-py3-none-any.whl
Upload date: Mar 5, 2026
Size: 39.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for nl2sql_agents-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8304a1e4952bc5610c3cb30b083c37c507b85f66db763853ba0d9738d04e538c`
MD5	`9ef72a8173071bab67cebfb5fc8a5ed7`
BLAKE2b-256	`b44302393aa218c039b08f3e0210a20ef63d47697287cfe90485376858e46bba`

See more details on using hashes here.

nl2sql-agents 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

NL2SQL - Multi-Agent Natural Language to SQL System

Key Features

Architecture

Quick Start

Prerequisites

Installation

Usage Examples

Interactive Mode

Sample Questions (Concert Singer Database)

Configuration

LLM Provider

Database Configuration

Discovery Agent Tuning

Project Structure

How It Works

1. Table Discovery

2. Security Filtering

3. Query Generation

4. Multi-Stage Validation

5. Explainability

Database Setup

Spider Dataset

Custom Databases

Logging

Development

Running Tests

Code Style

Contributing

Development Setup

Roadmap

License

Acknowledgments

Contact

Star History

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes