Skip to main content

Enterprise-ready Natural Language to SQL converter with multi-provider support. Built for production scale (1000+ tables) with Clean Architecture.

Project description

nlp2sql logo

PyPI Downloads License: MIT Python 3.9+ Code style: black

nlp2sql

Enterprise-ready Natural Language to SQL converter with multi-provider support

Convert natural language queries to optimized SQL using multiple AI providers. Built with Clean Architecture principles for enterprise-scale applications handling 1000+ table databases.

Features

  • Multiple AI Providers: OpenAI, Anthropic Claude, Google Gemini - no vendor lock-in
  • Database Support: PostgreSQL, Amazon Redshift
  • Large Schema Handling: Vector embeddings and intelligent filtering for 1000+ tables
  • Smart Caching: Query and schema embedding caching for improved performance
  • Async Support: Full async/await support
  • Clean Architecture: Ports & Adapters pattern for maintainability

Documentation

Document Description
Architecture Component diagram and data flow
API Reference Python API and CLI command reference
Configuration Environment variables and schema filters
Enterprise Guide Large-scale deployment and migration
Redshift Support Amazon Redshift setup and examples
Contributing Contribution guidelines

Installation

# With UV (recommended)
uv add nlp2sql

# With pip
pip install nlp2sql

# With specific providers
pip install nlp2sql[anthropic,gemini]
pip install nlp2sql[all-providers]

# With embeddings
pip install nlp2sql[embeddings-local]   # Local embeddings (free)
pip install nlp2sql[embeddings-openai]  # OpenAI embeddings

Quick Start

1. Set an API Key

export OPENAI_API_KEY="your-openai-key"
# or ANTHROPIC_API_KEY, GOOGLE_API_KEY

2. Connect and Ask

import asyncio
import nlp2sql
from nlp2sql import ProviderConfig

async def main():
    nlp = await nlp2sql.connect(
        "postgresql://user:pass@localhost:5432/mydb",
        provider=ProviderConfig(provider="openai", api_key="sk-..."),
    )

    result = await nlp.ask("Show me all active users")
    print(result.sql)
    print(result.confidence)
    print(result.is_valid)

asyncio.run(main())

connect() auto-detects the database type from the URL, loads the schema, and builds the FAISS embedding index. Subsequent ask() calls reuse everything from disk cache.

3. Few-Shot Examples

Pass a list of dicts -- connect() handles embedding and indexing automatically:

nlp = await nlp2sql.connect(
    "redshift://user:pass@host:5439/db",
    provider=ProviderConfig(provider="openai", api_key="sk-..."),
    schema="dwh_data_share_llm",
    examples=[
        {
            "question": "Total revenue last month?",
            "sql": "SELECT SUM(revenue) FROM sales WHERE date >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')",
            "database_type": "redshift",
        },
    ],
)

result = await nlp.ask("Show me total sales this quarter")

4. Schema Filtering (Large Databases)

nlp = await nlp2sql.connect(
    "postgresql://localhost/enterprise",
    provider=ProviderConfig(provider="anthropic", api_key="sk-ant-..."),
    schema_filters={
        "include_schemas": ["sales", "finance"],
        "exclude_system_tables": True,
    },
)

5. Custom Model and Temperature

nlp = await nlp2sql.connect(
    "postgresql://localhost/mydb",
    provider=ProviderConfig(
        provider="openai",
        api_key="sk-...",
        model="gpt-4o",
        temperature=0.0,
        max_tokens=4000,
    ),
)

6. CLI

nlp2sql query \
  --database-url postgresql://user:pass@localhost:5432/mydb \
  --question "Show all active users" \
  --explain

nlp2sql inspect --database-url postgresql://localhost/mydb

Advanced: Direct Service Access

For full control over the lifecycle, the lower-level API is still available:

from nlp2sql import create_and_initialize_service, ProviderConfig, DatabaseType

service = await create_and_initialize_service(
    database_url="postgresql://localhost/mydb",
    provider_config=ProviderConfig(provider="openai", api_key="sk-..."),
    database_type=DatabaseType.POSTGRES,
)
result = await service.generate_sql("Count total users", database_type=DatabaseType.POSTGRES)

How It Works

Question ──► Cache check ──► Schema retrieval ──► Relevance filtering ──► Context building ──► AI generation ──► Validation
                                    │                     │                      │
                              SchemaRepository    FAISS + TF-IDF hybrid   Reuses precomputed
                              (+ disk cache)      + batch scoring          relevance scores
  1. Schema retrieval -- Fetches tables from database via SchemaRepository (with disk cache for Redshift)
  2. Relevance filtering -- FAISS dense search + TF-IDF sparse search (50/50 hybrid) finds candidate tables; batch scoring refines with precomputed embeddings
  3. Context building -- Builds optimized schema context within token limits, reusing scores from step 2 (zero additional embedding calls)
  4. SQL generation -- AI provider (OpenAI, Anthropic, or Gemini) generates SQL from question + schema context
  5. Validation -- SQL syntax and safety checks before returning results

See Architecture for the detailed flow with method references and design decisions.

Provider Comparison

Provider Default Model Context Size Best For
OpenAI gpt-4o-mini 128K Cost-effective, fast
Anthropic claude-sonnet-4-20250514 200K Large schemas
Google Gemini gemini-2.0-flash 1M High volume

All models are configurable via ProviderConfig(model="..."). See Configuration for details.

Architecture

Clean Architecture (Ports & Adapters) with three layers: core entities, port interfaces, and adapter implementations. The schema management layer uses FAISS + TF-IDF hybrid search for relevance filtering at scale.

nlp2sql/
├── client.py       # DSL: connect() + NLP2SQL class (recommended entry point)
├── core/           # Pure Python: entities, ProviderConfig, QueryResult, sql_safety, sql_keywords
├── ports/          # Interfaces: AIProviderPort, SchemaRepositoryPort, EmbeddingProviderPort,
│                   #   ExampleRepositoryPort, QuerySafetyPort, QueryValidatorPort, CachePort
├── adapters/       # Implementations: OpenAI, Anthropic, Gemini, PostgreSQL, Redshift,
│                   #   RegexQueryValidator
├── services/       # Orchestration: QueryGenerationService
├── schema/         # Schema management: SchemaManager, SchemaAnalyzer, SchemaEmbeddingManager,
│                   #   ExampleStore
├── config/         # Pydantic Settings (centralized defaults)
└── exceptions/     # Exception hierarchy (NLP2SQLException -> 8 subclasses)

See Architecture for the full component diagram, data flow, and design decisions.

Development

# Clone and install
git clone https://github.com/luiscarbonel1991/nlp2sql.git
cd nlp2sql
uv sync

# Start test databases
cd docker && docker-compose up -d

# Run tests
uv run pytest

# Code quality
uv run ruff format .
uv run ruff check .
uv run mypy src/

MCP Server

nlp2sql includes a Model Context Protocol server for AI assistant integration.

{
  "mcpServers": {
    "nlp2sql": {
      "command": "python",
      "args": ["/path/to/nlp2sql/mcp_server/server.py"],
      "env": {
        "OPENAI_API_KEY": "${OPENAI_API_KEY}",
        "NLP2SQL_DEFAULT_DB_URL": "postgresql://user:pass@localhost:5432/mydb"
      }
    }
  }
}

Tools: ask_database, explore_schema, run_sql, list_databases, explain_sql

See mcp_server/README.md for complete setup.

Contributing

We welcome contributions. See CONTRIBUTING.md for guidelines.

License

MIT License - see LICENSE.

Author

Luis Carbonel - @luiscarbonel1991

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

nlp2sql-0.2.0rc12.tar.gz (577.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

nlp2sql-0.2.0rc12-py3-none-any.whl (97.7 kB view details)

Uploaded Python 3

File details

Details for the file nlp2sql-0.2.0rc12.tar.gz.

File metadata

  • Download URL: nlp2sql-0.2.0rc12.tar.gz
  • Upload date:
  • Size: 577.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for nlp2sql-0.2.0rc12.tar.gz
Algorithm Hash digest
SHA256 53c386dc3e0e3bdf3f9d7f46a0e83355b63cf3e38a53366f7d19e298aaa153d0
MD5 7fd5b4840ba330f39c3309eef5643ac8
BLAKE2b-256 fd908009dc154fcae26d34ad87dfbcb86aa410daac5631f73c734220f04c4ed0

See more details on using hashes here.

Provenance

The following attestation bundles were made for nlp2sql-0.2.0rc12.tar.gz:

Publisher: publish-pypi.yml on luiscarbonel1991/nlp2sql

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file nlp2sql-0.2.0rc12-py3-none-any.whl.

File metadata

  • Download URL: nlp2sql-0.2.0rc12-py3-none-any.whl
  • Upload date:
  • Size: 97.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for nlp2sql-0.2.0rc12-py3-none-any.whl
Algorithm Hash digest
SHA256 adea7c58eecad1557605f59171cb751706b666211e01c63daa39c5bba3234fa0
MD5 84ef019e3c4e9924c255635d327f7720
BLAKE2b-256 2451aad21e95f59555f3e95ba4cdb759d0e96c594b9e8aa7f5f5e1e47f691b8a

See more details on using hashes here.

Provenance

The following attestation bundles were made for nlp2sql-0.2.0rc12-py3-none-any.whl:

Publisher: publish-pypi.yml on luiscarbonel1991/nlp2sql

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page