NL2SQL - Multi-Agent Natural Language to SQL System
A sophisticated multi-agent orchestration system that converts natural language queries into safe, optimized SQL using LangGraph and OpenRouter LLMs. Built with security, performance, and explainability at its core.
Key Features
- Multi-Agent Architecture: specialized agents working in concert across table discovery, security filtering, schema formatting, query generation, validation, and explanation
- Security-First: Built-in SQL injection prevention, dangerous operation filtering, and security scoring
- Smart Table Discovery: Multi-signal ranking (keyword matching, semantic search, foreign key graph analysis)
- Performance Validation: Query optimization checks and performance score reporting
- Explainability: Human-readable explanations for every generated query
- Async Architecture: Non-blocking operations throughout the pipeline
- Schema Caching: Intelligent caching to reduce database introspection overhead
- Comprehensive Logging: Detailed file-based logging for debugging and monitoring
Architecture
The full pipeline diagram is available at `Assets/nl2sql-architecture.svg` (see Project Structure below).
Quick Start
Prerequisites
- Python 3.13 or higher
- OpenRouter API key (or compatible LLM provider)
- SQLite databases (Spider dataset recommended)
Installation
1. Clone the repository

   ```bash
   git clone https://github.com/ToheedAsghar/NL2SQL.git
   cd NL2SQL
   ```

2. Create a virtual environment

   ```bash
   python -m venv .venv
   source .venv/bin/activate   # On Windows: .venv\Scripts\activate
   ```

3. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

4. Download the Spider dataset

   The Spider dataset (877 MB) is not included in the repository. Download it from the official source:

   ```bash
   # Visit https://yale-lily.github.io/spider
   # Download and extract to the ./spider/ directory
   # Verify the structure:
   ls spider/database/   # Should show: academic, concert_singer, car_1, etc.
   ```

5. Configure the environment

   Create a `.env` file in the project root:

   ```
   # LLM Provider Configuration
   OPENAI_API_KEY=sk-or-v1-your-openrouter-api-key
   OPENAI_BASE_URL=https://openrouter.ai/api/v1
   OPENAI_MODEL=openai/gpt-4o-mini
   OPENAI_EMBEDDING_MODEL=openai/text-embedding-3-small

   # Database Configuration
   DB_TYPE=sqlite
   DB_PATH=./spider/database/concert_singer/concert_singer.sqlite

   # Optional: Adjust discovery agent weights
   # KEYWORD_WEIGHT=0.35
   # SEMANTIC_WEIGHT=0.45
   # FK_GRAPH_WEIGHT=0.20
   ```

6. Run the application

   ```bash
   python main.py
   ```
Usage Examples
Interactive Mode
```bash
python main.py
```

```
Enter your question (or 'quit'/'exit'/'q'):
> Show me all singers from France

══════════════════════════════════════════════════════════════
SQL QUERY
══════════════════════════════════════════════════════════════
SELECT s.*
FROM concert_singer.singer AS s
WHERE s.Country = 'France'

══════════════════════════════════════════════════════════════
EXPLANATION
══════════════════════════════════════════════════════════════
This query retrieves all singers from France by:
1. Selecting all columns from the singer table
2. Filtering records where Country equals 'France'
3. Using table alias 's' for readability

Safety Score: 3.5/4.0
✓ Security: Safe (no SQL injection risks)
✓ Syntax: Valid SQL
✓ Logic: Correct table and column usage
⚠ Performance: Consider adding index on Country column for large datasets
══════════════════════════════════════════════════════════════
```
Sample Questions (Concert Singer Database)
- Show me all singers from France
- List all concerts held in 2024
- What are the top 5 stadiums by capacity?
- How many concerts did each singer perform?
- Which stadium hosted the most concerts?
- Find singers who have never performed in a concert
- Show me concerts with ticket prices above $100
Configuration
LLM Provider
The system uses OpenRouter by default, but you can configure any OpenAI-compatible API:
```
OPENAI_API_KEY=your-api-key
OPENAI_BASE_URL=https://api.openai.com/v1   # or your provider's endpoint
OPENAI_MODEL=gpt-4-turbo
OPENAI_EMBEDDING_MODEL=text-embedding-3-small
```
Database Configuration
Switch between different databases by updating DB_PATH:
```
# Concert Singer (music industry)
DB_PATH=./spider/database/concert_singer/concert_singer.sqlite

# Academic (papers and conferences)
DB_PATH=./spider/database/academic/academic.sqlite

# Car Sales
DB_PATH=./spider/database/car_1/car_1.sqlite
```
Discovery Agent Tuning
Adjust table discovery signal weights in .env:
```
KEYWORD_WEIGHT=0.35    # Exact keyword matching
SEMANTIC_WEIGHT=0.45   # Semantic similarity
FK_GRAPH_WEIGHT=0.20   # Foreign key relationships
```
The default weights sum to 1.0.
Project Structure
```
NL2SQL/
├── nl2sql_agents/                        # Installable Python package
│   ├── __init__.py                       # Package version & metadata
│   ├── cli.py                            # Rich CLI entry point
│   ├── py.typed                          # PEP 561 typed marker
│   ├── agents/                           # Agent modules
│   │   ├── __init__.py
│   │   ├── base_agent.py                 # Base agent with LLM calling
│   │   ├── query_generator.py            # SQL generation agent
│   │   ├── schema_formatter.py           # Schema formatting agent
│   │   ├── discovery/                    # Table discovery agents
│   │   │   ├── __init__.py
│   │   │   ├── discovery_agent.py        # Multi-signal orchestrator
│   │   │   ├── keyword_agent.py          # Keyword matching
│   │   │   ├── semantic_agent.py         # Embedding-based search
│   │   │   └── fk_graph_agent.py         # Foreign key analysis
│   │   ├── validator/                    # Validation agents
│   │   │   ├── __init__.py
│   │   │   ├── validator_agent.py        # Orchestrator
│   │   │   ├── syntax_validator.py       # SQL syntax check
│   │   │   ├── logic_validator.py        # Logical correctness
│   │   │   ├── security_validator.py     # SQL injection check
│   │   │   └── performance_validator.py  # Performance analysis
│   │   └── explainer/                    # Explanation agents
│   │       ├── __init__.py
│   │       ├── explainer_agent.py        # Orchestrator
│   │       ├── explanation_agent.py      # Query explanation
│   │       ├── safety_report_agent.py    # Safety scoring
│   │       └── optimization_agent.py     # Optimization tips
│   ├── orchestrator/                     # LangGraph pipeline
│   │   ├── __init__.py
│   │   ├── pipeline.py                   # Main graph definition
│   │   └── nodes.py                      # Node implementations
│   ├── filters/                          # Pre-processing filters
│   │   ├── __init__.py
│   │   ├── gate.py                       # Table gating
│   │   └── security_filter.py            # Query security filter
│   ├── db/                               # Database layer
│   │   ├── __init__.py
│   │   └── connector.py                  # Async SQLite connector
│   ├── cache/                            # Caching layer
│   │   ├── __init__.py
│   │   └── schema_cache.py               # Schema caching
│   ├── config/                           # Configuration
│   │   ├── __init__.py
│   │   └── settings.py                   # LLM provider settings
│   └── models/                           # Data models
│       ├── __init__.py
│       └── schemas.py                    # GraphState and schemas
├── data/                                 # Local data assets
│   └── spider/
│       └── database/                     # Spider dataset databases
├── Assets/                               # Static assets
│   ├── architecture.svg
│   ├── nl2sql.gif                        # Demo GIF
│   └── nl2sql-architecture.svg           # Architecture diagram
├── logs/                                 # Application logs (auto-created)
│   └── *.log                             # Timestamped log files
├── spider/                               # Spider benchmark files
│   ├── dev.json
│   ├── dev_gold.sql
│   ├── tables.json
│   ├── train_gold.sql
│   ├── train_others.json
│   ├── train_spider.json
│   ├── README.txt
│   └── database/                         # 200+ domain databases
├── main.py                               # Legacy CLI entry point
├── pyproject.toml                        # Package metadata & build config
├── requirements.txt                      # Python dependencies
├── FAME_PLAN.md                          # FAME methodology plan
├── MEMORY_IMPLEMENTATION.md              # Memory system design doc
├── .env                                  # Environment configuration
├── .env.example                          # Example env template
├── .gitignore                            # Git ignore rules
├── LICENSE                               # MIT License
└── README.md                             # This file
```
How It Works
1. Table Discovery
The Discovery Agent uses three signals to rank tables:
- Keyword Matching (35%): Direct token overlap with query
- Semantic Search (45%): Embedding similarity using LLM
- Foreign Key Graph (20%): Relationship centrality analysis
Tables are ranked by weighted score, and top-K are passed forward.
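How the three signals combine can be pictured with a small sketch. This is illustrative only, assuming each sub-agent returns per-table scores normalized to [0, 1]; the function and parameter names are hypothetical, not the package's actual API:

```python
from typing import Dict, List

def rank_tables(
    keyword_scores: Dict[str, float],     # per-table scores in [0, 1]
    semantic_scores: Dict[str, float],
    fk_scores: Dict[str, float],
    weights: tuple = (0.35, 0.45, 0.20),  # keyword, semantic, FK graph
    top_k: int = 5,
) -> List[str]:
    """Combine three discovery signals into one weighted ranking."""
    kw_w, sem_w, fk_w = weights
    tables = set(keyword_scores) | set(semantic_scores) | set(fk_scores)
    scored = {
        t: kw_w * keyword_scores.get(t, 0.0)
           + sem_w * semantic_scores.get(t, 0.0)
           + fk_w * fk_scores.get(t, 0.0)
        for t in tables
    }
    # Highest weighted score first; keep only the top-K candidates.
    return sorted(scored, key=scored.get, reverse=True)[:top_k]
```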
2. Security Filtering
Before processing, queries are checked for:
- DROP, DELETE, TRUNCATE operations
- System table access attempts
- SQL injection patterns
- Dangerous function calls
Dangerous queries are blocked immediately.
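A deny-list filter of this kind can be sketched in a few lines. The patterns below are illustrative assumptions, not the actual rule set in `filters/security_filter.py`:

```python
import re

# Illustrative deny-list; the real filter's rules may differ.
DANGEROUS_PATTERNS = [
    r"\b(DROP|DELETE|TRUNCATE|ALTER|GRANT)\b",  # destructive / DDL keywords
    r"\bsqlite_(master|temp_master)\b",         # system table access
    r";\s*\S",                                  # stacked statements
    r"(--|/\*)",                                # comment-based injection tricks
]

def is_dangerous(text: str) -> bool:
    """Return True if the input matches any deny-list pattern."""
    return any(re.search(p, text, re.IGNORECASE) for p in DANGEROUS_PATTERNS)

print(is_dangerous("Show me all singers from France"))  # False
print(is_dangerous("ignore that; DROP TABLE singer"))   # True
```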
3. Query Generation
The generator uses:
- One-shot retry: if the first generation fails validation, retry once with the validator's error context (sketched after this list)
- Schema-aware prompting: Full context of relevant tables, columns, types
- Example-based learning: Few-shot examples in system prompt
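The one-shot retry can be expressed as a small async helper. The types and callables here are stand-ins for the real agent interfaces, not the package's API:

```python
from dataclasses import dataclass, field
from typing import Awaitable, Callable, List, Optional

@dataclass
class ValidationReport:          # stand-in for the real report type
    passed: bool
    errors: List[str] = field(default_factory=list)

GenerateFn = Callable[[str, Optional[List[str]]], Awaitable[str]]
ValidateFn = Callable[[str], Awaitable[ValidationReport]]

async def generate_with_retry(
    generate: GenerateFn, validate: ValidateFn, question: str
) -> str:
    """Generate SQL, validate it, and retry exactly once on failure."""
    sql = await generate(question, None)
    report = await validate(sql)
    if not report.passed:
        # One-shot retry: feed the validator's errors back into the prompt.
        sql = await generate(question, report.errors)
    return sql
```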
4. Multi-Stage Validation
Four validators run in parallel (see the sketch after this list):
- Syntax: SQL parsing and syntax validation
- Logic: Table/column existence, join correctness
- Security: SQL injection and dangerous operation check
- Performance: Query complexity, missing indexes, optimization tips
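Running independent validators concurrently maps naturally onto `asyncio.gather`. A minimal sketch with trivial stand-in validators, not the package's actual orchestrator:

```python
import asyncio
from typing import Awaitable, Callable, Dict

async def run_validators(
    sql: str, validators: Dict[str, Callable[[str], Awaitable[bool]]]
) -> Dict[str, bool]:
    """Run all validators concurrently; map each name to its verdict."""
    names = list(validators)
    results = await asyncio.gather(*(validators[n](sql) for n in names))
    return dict(zip(names, results))

async def main() -> None:
    async def always_ok(sql: str) -> bool:  # placeholder validator
        return True

    checks = {"syntax": always_ok, "logic": always_ok,
              "security": always_ok, "performance": always_ok}
    print(await run_validators("SELECT 1", checks))

asyncio.run(main())
```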
5. Explainability
The Explainer generates:
- Step-by-step query breakdown
- Safety score (0-4 scale across four dimensions; see the sketch below)
- Optimization recommendations
- Human-readable summary
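One way to picture the 0-4 safety score is as the sum of four per-dimension scores in [0, 1], which reproduces the 3.5/4.0 shown in the sample session above. The rubric here is an assumption; the real `safety_report_agent` may weight dimensions differently:

```python
from dataclasses import dataclass

@dataclass
class SafetyReport:
    security: float     # each dimension scored in [0.0, 1.0]
    syntax: float
    logic: float
    performance: float

    @property
    def total(self) -> float:
        # Assumed rubric: unweighted sum across the four dimensions.
        return self.security + self.syntax + self.logic + self.performance

report = SafetyReport(security=1.0, syntax=1.0, logic=1.0, performance=0.5)
print(f"Safety Score: {report.total:.1f}/4.0")  # Safety Score: 3.5/4.0
```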
Database Setup
Spider Dataset
The Spider dataset contains 200+ databases across diverse domains:
- Academic papers and conferences
- Music (concert_singer)
- Sports (baseball, football, soccer)
- Business (car sales, employee management, HR)
- E-commerce, healthcare, real estate, and more
Download: Visit https://yale-lily.github.io/spider

1. Download the database ZIP file
2. Extract it to the `./spider/` directory in the project root
3. Verify: `ls spider/database/` should show ~200 subdirectories
4. Update `.env` with the desired database path
Custom Databases
To use your own SQLite database:
1. Place your `.sqlite` or `.db` file anywhere accessible
2. Update `.env`: `DB_PATH=/path/to/your/database.sqlite`
3. Run the application
The system will automatically (a minimal introspection sketch follows the list):
- Introspect schema (tables, columns, types, foreign keys)
- Cache schema for performance
- Build embedding index for semantic search
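For SQLite, the introspection step can be sketched with PRAGMA statements. This is a minimal synchronous sketch; the package's `db/connector.py` is async and more complete:

```python
import sqlite3

def introspect(db_path: str) -> dict:
    """Collect tables, columns, types, and foreign keys from a SQLite DB."""
    schema: dict = {}
    with sqlite3.connect(db_path) as conn:
        tables = [r[0] for r in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'")]
        for table in tables:
            cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
            fks = conn.execute(f"PRAGMA foreign_key_list({table})").fetchall()
            schema[table] = {
                # table_info rows: (cid, name, type, notnull, default, pk)
                "columns": [(c[1], c[2]) for c in cols],
                # foreign_key_list rows: (id, seq, table, from, to, ...)
                "foreign_keys": [(f[3], f[2], f[4]) for f in fks],
            }
    return schema
```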
Logging
Logs are automatically written to logs/ directory with timestamps:
```
logs/
├── nl2sql_20260301_143022.log
├── nl2sql_20260301_151134.log
└── nl2sql_20260301_163045.log
```
Log format includes:
- Timestamp
- Logger name (agent identifier)
- Log level
- Message (including token usage, scores, timings)
View logs:
```bash
tail -f logs/nl2sql_*.log
```
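The timestamped-file pattern above can be reproduced with the standard library alone. A minimal sketch assuming the log layout shown; the package's actual logging setup may differ:

```python
import logging
from datetime import datetime
from pathlib import Path

def setup_logging(log_dir: str = "logs") -> None:
    """Write timestamped log files like logs/nl2sql_YYYYMMDD_HHMMSS.log."""
    Path(log_dir).mkdir(exist_ok=True)
    logfile = Path(log_dir) / f"nl2sql_{datetime.now():%Y%m%d_%H%M%S}.log"
    logging.basicConfig(
        filename=logfile,
        level=logging.INFO,
        format="%(asctime)s | %(name)s | %(levelname)s | %(message)s",
    )

setup_logging()
logging.getLogger("discovery_agent").info("ranked 5 candidate tables")
```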
Development
Running Tests
```bash
# Unit tests
pytest tests/

# Integration tests
pytest tests/integration/

# Coverage report
pytest --cov=agents --cov=orchestrator --cov-report=html
```
Code Style
This project follows PEP 8 with Black formatting:
```bash
# Format code
black .

# Lint
ruff check .

# Type checking
mypy agents/ orchestrator/
```
Contributing
Contributions are welcome! Please follow these steps:
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Add tests for new functionality
5. Ensure all tests pass (`pytest`)
6. Format the code (`black .`)
7. Commit your changes (`git commit -m 'Add amazing feature'`)
8. Push to the branch (`git push origin feature/amazing-feature`)
9. Open a Pull Request
Development Setup
```bash
# Install dev dependencies
pip install -r requirements-dev.txt

# Install pre-commit hooks
pre-commit install
```
Roadmap
- Multi-database support: PostgreSQL, MySQL, SQL Server
- Query execution: Safely execute generated queries and return results
- Result visualization: Charts and graphs for query results
- Query history: Track and reuse previous queries
- Fine-tuned models: Domain-specific LLM fine-tuning
- Web UI: Browser-based interface with real-time feedback
- API server: REST/GraphQL API for integration
- Streaming responses: Real-time query generation feedback
- Multi-turn conversations: Context-aware follow-up queries
- Natural language results: Convert SQL results back to natural language
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Spider Dataset: Yale University's semantic parsing and text-to-SQL benchmark
- LangGraph: LangChain's graph-based agent orchestration framework
- OpenRouter: LLM API aggregation service
- LangChain: Building blocks for LLM applications
Contact
Toheed Asghar - Project Maintainer
- GitHub: @toheedasghar
- Email: toheedasghar1@gmail.com
Star History
If you find this project useful, please consider giving it a star! It helps others discover it.
Built with ❤️ using LangGraph and OpenRouter
File details
Details for the file nl2sql_agents-0.1.0.tar.gz.
File metadata
- Download URL: nl2sql_agents-0.1.0.tar.gz
- Upload date:
- Size: 203.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d9998eae235b57259db9a8ec86fb95a2030757211a35ea0823e35b33cd9d8b1e |
| MD5 | 2f65bbb320800f7b2c6827db1b9fdf42 |
| BLAKE2b-256 | 2cc67d742b39fb36c45edfb84add02c31ca391a86b1dab33c422250cc3b0882a |
File details
Details for the file nl2sql_agents-0.1.0-py3-none-any.whl.
File metadata
- Download URL: nl2sql_agents-0.1.0-py3-none-any.whl
- Upload date:
- Size: 39.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8304a1e4952bc5610c3cb30b083c37c507b85f66db763853ba0d9738d04e538c |
| MD5 | 9ef72a8173071bab67cebfb5fc8a5ed7 |
| BLAKE2b-256 | b44302393aa218c039b08f3e0210a20ef63d47697287cfe90485376858e46bba |