Enterprise-ready Natural Language to SQL converter with multi-provider support. Built for production scale (1000+ tables) with Clean Architecture.
Project description
nlp2sql
Enterprise-ready Natural Language to SQL converter with multi-provider support
Convert natural language queries to optimized SQL using multiple AI providers. Built with Clean Architecture principles for enterprise-scale applications handling 1000+ table databases.
Features
- Multiple AI Providers: OpenAI, Anthropic Claude, Google Gemini - no vendor lock-in
- Database Support: PostgreSQL, Amazon Redshift
- Large Schema Handling: Vector embeddings and intelligent filtering for 1000+ tables
- Smart Caching: Query and schema embedding caching for improved performance
- Async Support: Full async/await support
- Clean Architecture: Ports & Adapters pattern for maintainability
Documentation
| Document | Description |
|---|---|
| Architecture | Component diagram and data flow |
| API Reference | Python API and CLI command reference |
| Configuration | Environment variables and schema filters |
| Enterprise Guide | Large-scale deployment and migration |
| Redshift Support | Amazon Redshift setup and examples |
| Contributing | Contribution guidelines |
Installation
# With UV (recommended)
uv add nlp2sql
# With pip
pip install nlp2sql
# With specific providers
pip install nlp2sql[anthropic,gemini]
pip install nlp2sql[all-providers]
# With embeddings
pip install nlp2sql[embeddings-local] # Local embeddings (free)
pip install nlp2sql[embeddings-openai] # OpenAI embeddings
Quick Start
1. Set Environment Variables
# At least one AI provider key required
export OPENAI_API_KEY="your-openai-key"
# export ANTHROPIC_API_KEY="your-anthropic-key"
# export GOOGLE_API_KEY="your-google-key"
2. One-Line Usage
import asyncio
import os
from nlp2sql import generate_sql_from_db
async def main():
result = await generate_sql_from_db(
database_url="postgresql://user:pass@localhost:5432/mydb",
question="Show me all active users",
ai_provider="openai",
api_key=os.getenv("OPENAI_API_KEY")
)
print(result['sql'])
print(f"Confidence: {result['confidence']}")
asyncio.run(main())
3. Pre-Initialized Service (Better Performance)
from nlp2sql import create_and_initialize_service
async def main():
# Initialize once
service = await create_and_initialize_service(
database_url="postgresql://user:pass@localhost:5432/mydb",
ai_provider="openai",
api_key=os.getenv("OPENAI_API_KEY")
)
# Use multiple times
result1 = await service.generate_sql("Count total users")
result2 = await service.generate_sql("Show recent orders")
4. Large Database with Schema Filtering
from nlp2sql import create_and_initialize_service
service = await create_and_initialize_service(
database_url="postgresql://localhost/enterprise",
ai_provider="anthropic", # Best for large schemas (200K context)
api_key=os.getenv("ANTHROPIC_API_KEY"),
schema_filters={
"include_schemas": ["sales", "finance"],
"exclude_system_tables": True
}
)
result = await service.generate_sql("Show revenue by month")
5. CLI Usage
# Generate SQL
nlp2sql query \
--database-url postgresql://user:pass@localhost:5432/mydb \
--question "Show all active users" \
--explain
# Inspect schema
nlp2sql inspect --database-url postgresql://localhost/mydb
# Benchmark providers
nlp2sql benchmark --database-url postgresql://localhost/mydb
How It Works
Question ──► Cache check ──► Schema retrieval ──► Relevance filtering ──► Context building ──► AI generation ──► Validation
│ │ │
SchemaRepository FAISS + TF-IDF hybrid Reuses precomputed
(+ disk cache) + batch scoring relevance scores
- Schema retrieval -- Fetches tables from database via
SchemaRepository(with disk cache for Redshift) - Relevance filtering -- FAISS dense search + TF-IDF sparse search (50/50 hybrid) finds candidate tables; batch scoring refines with precomputed embeddings
- Context building -- Builds optimized schema context within token limits, reusing scores from step 2 (zero additional embedding calls)
- SQL generation -- AI provider (OpenAI, Anthropic, or Gemini) generates SQL from question + schema context
- Validation -- SQL syntax and safety checks before returning results
See Architecture for the detailed flow with method references and design decisions.
Provider Comparison
| Provider | Context Size | Best For |
|---|---|---|
| OpenAI GPT-4 | 128K | Complex reasoning |
| Anthropic Claude | 200K | Large schemas |
| Google Gemini | 1M | High volume, cost efficiency |
See Configuration for detailed provider setup.
Architecture
Clean Architecture (Ports & Adapters) with three layers: core entities, port interfaces, and adapter implementations. The schema management layer uses FAISS + TF-IDF hybrid search for relevance filtering at scale.
nlp2sql/
├── core/ # Business entities (pure Python, no dependencies)
├── ports/ # Interfaces (AIProviderPort, SchemaRepositoryPort, EmbeddingProviderPort)
├── adapters/ # Implementations (OpenAI, Anthropic, Gemini, PostgreSQL, Redshift)
├── services/ # Orchestration (QueryGenerationService)
├── schema/ # Schema management (SchemaManager, SchemaAnalyzer, SchemaEmbeddingManager)
├── config/ # Pydantic Settings configuration
└── exceptions/ # Custom exception hierarchy
See Architecture for the full component diagram, data flow, and design decisions.
Development
# Clone and install
git clone https://github.com/luiscarbonel1991/nlp2sql.git
cd nlp2sql
uv sync
# Start test databases
cd docker && docker-compose up -d
# Run tests
uv run pytest
# Code quality
uv run ruff format .
uv run ruff check .
uv run mypy src/
MCP Server
nlp2sql includes a Model Context Protocol server for AI assistant integration.
{
"mcpServers": {
"nlp2sql": {
"command": "python",
"args": ["/path/to/nlp2sql/mcp_server/server.py"],
"env": {
"OPENAI_API_KEY": "${OPENAI_API_KEY}",
"NLP2SQL_DEFAULT_DB_URL": "postgresql://user:pass@localhost:5432/mydb"
}
}
}
}
Tools: ask_database, explore_schema, run_sql, list_databases, explain_sql
See mcp_server/README.md for complete setup.
Contributing
We welcome contributions. See CONTRIBUTING.md for guidelines.
License
MIT License - see LICENSE.
Author
Luis Carbonel - @luiscarbonel1991
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file nlp2sql-0.2.0rc6.tar.gz.
File metadata
- Download URL: nlp2sql-0.2.0rc6.tar.gz
- Upload date:
- Size: 561.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
97b95a884a9a30a3a88e5208d399429a8fa47ae5c432f167acf6f2a0ada2504c
|
|
| MD5 |
def1a4aff0b27f4b07f45b7ae0c44e6c
|
|
| BLAKE2b-256 |
6f2777452e4d8173decf68a2c20b69cff0d9664fbfa5d050d736f43bac458392
|
Provenance
The following attestation bundles were made for nlp2sql-0.2.0rc6.tar.gz:
Publisher:
publish-pypi.yml on luiscarbonel1991/nlp2sql
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
nlp2sql-0.2.0rc6.tar.gz -
Subject digest:
97b95a884a9a30a3a88e5208d399429a8fa47ae5c432f167acf6f2a0ada2504c - Sigstore transparency entry: 1097182808
- Sigstore integration time:
-
Permalink:
luiscarbonel1991/nlp2sql@7be24f29ae9a6f372b967d2cd0639ba112dd3438 -
Branch / Tag:
refs/tags/v0.2.0rc6 - Owner: https://github.com/luiscarbonel1991
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish-pypi.yml@7be24f29ae9a6f372b967d2cd0639ba112dd3438 -
Trigger Event:
workflow_dispatch
-
Statement type:
File details
Details for the file nlp2sql-0.2.0rc6-py3-none-any.whl.
File metadata
- Download URL: nlp2sql-0.2.0rc6-py3-none-any.whl
- Upload date:
- Size: 82.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
771cd26a932aaf505bf9fd25b3fb600a099bf2f16d2d84d70cc2849d89550e9c
|
|
| MD5 |
18def4372e7ce86ceb6e08a820dff73d
|
|
| BLAKE2b-256 |
d34cf1994def679048d4ea9ac851614d4b928f002dbd09515ed1b741d54227aa
|
Provenance
The following attestation bundles were made for nlp2sql-0.2.0rc6-py3-none-any.whl:
Publisher:
publish-pypi.yml on luiscarbonel1991/nlp2sql
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
nlp2sql-0.2.0rc6-py3-none-any.whl -
Subject digest:
771cd26a932aaf505bf9fd25b3fb600a099bf2f16d2d84d70cc2849d89550e9c - Sigstore transparency entry: 1097182839
- Sigstore integration time:
-
Permalink:
luiscarbonel1991/nlp2sql@7be24f29ae9a6f372b967d2cd0639ba112dd3438 -
Branch / Tag:
refs/tags/v0.2.0rc6 - Owner: https://github.com/luiscarbonel1991
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish-pypi.yml@7be24f29ae9a6f372b967d2cd0639ba112dd3438 -
Trigger Event:
workflow_dispatch
-
Statement type: