Skip to main content

Enterprise-ready Natural Language to SQL converter with multi-provider support. Built for production scale (1000+ tables) with Clean Architecture.

Project description

nlp2sql logo

PyPI Downloads License: MIT Python 3.9+ Code style: black

nlp2sql

DSL-first natural language to SQL for PostgreSQL and Redshift

nlp2sql turns a natural language question into SQL through a reusable Python DSL:

  • await connect(...)
  • await nlp.ask(...)
  • optional few-shot examples
  • optional semantic context
  • optional validation and repair

The library is designed for both simple schemas and large warehouses, but all public examples in this repository use the local e-commerce domain shipped with the project itself.

Features

  • DSL-first API: connect() returns an NLP2SQL client with ask(), validate(), explain(), and suggest()
  • Business-aware generation: optional SemanticContext adds canonical tables, metrics, dimensions, rules, and mappings
  • Execution modes: generate only, generate plus validate, and generate plus validate plus repair
  • Few-shot examples: pass example lists directly or use an example repository implementation
  • Large schema support: FAISS plus TF-IDF hybrid retrieval, schema filters, and disk-backed caches
  • Multiple providers: OpenAI, Anthropic, and Gemini
  • Database support: PostgreSQL and Amazon Redshift
  • Async by default: built for services, APIs, notebooks, and workers

Documentation

Document Description
Architecture Runtime flow, services, ports, and diagrams
API Reference Python API, CLI, hooks, and metadata reference
Configuration Environment variables, examples, semantic context, cache behavior
Enterprise Guide Governed usage, scale, and deployment patterns
Redshift Support Redshift-specific notes using public examples
Examples Safe public examples based on the local e-commerce domain
Contributing Contribution guidelines

Installation

# With UV (recommended)
uv add nlp2sql

# With pip
pip install nlp2sql

# With specific providers
pip install nlp2sql[anthropic,gemini]
pip install nlp2sql[all-providers]

# With embeddings
pip install nlp2sql[embeddings-local]
pip install nlp2sql[embeddings-openai]

Quick Start

1. Set a Provider Key

export OPENAI_API_KEY="your-openai-key"
# or ANTHROPIC_API_KEY / GOOGLE_API_KEY

2. Use the DSL

import asyncio

import nlp2sql
from nlp2sql import ProviderConfig


async def main():
    nlp = await nlp2sql.connect(
        "postgresql://testuser:testpass@localhost:5432/testdb",
        provider=ProviderConfig(provider="openai", api_key="sk-..."),
    )

    result = await nlp.ask("Show active users by region")
    print(result.sql)
    print(result.confidence)
    print(result.metadata["sql_intent_plan"])


asyncio.run(main())

connect() loads the schema, initializes retrieval indexes, and returns a reusable NLP2SQL client. ask() returns a typed QueryResult.

3. Add Few-Shot Examples

Pass examples directly to connect(). The library handles indexing for you.

nlp = await nlp2sql.connect(
    "postgresql://testuser:testpass@localhost:5432/testdb",
    provider=ProviderConfig(provider="openai", api_key="sk-..."),
    examples=[
        {
            "question": "Show revenue by source category for the flagship store",
            "sql": (
                "SELECT d.metric_date, mc.source_category, SUM(d.revenue) AS revenue "
                "FROM daily_channel_metrics d "
                "JOIN stores s ON d.store_id = s.id "
                "JOIN marketing_channels mc ON d.channel_id = mc.id "
                "WHERE s.code = 'na_flagship' "
                "GROUP BY d.metric_date, mc.source_category"
            ),
            "database_type": "postgres",
        }
    ],
)

4. Add In-Memory Semantic Context

Use semantic context when the same question could map to multiple plausible tables or dimensions.

from nlp2sql import (
    DimensionDefinition,
    DomainRule,
    MetricDefinition,
    SemanticContext,
    SemanticEntityMapping,
)

semantic_context = SemanticContext(
    domain="ecommerce_channel_performance",
    canonical_tables=["daily_channel_metrics"],
    required_filters=["s.code = 'na_flagship'", "s.region = 'North America'"],
    entity_mappings=[
        SemanticEntityMapping(
            source_term="North America flagship store",
            target="store_scope",
            resolved_value="na_flagship / North America",
            filter_expression="s.code = 'na_flagship' AND s.region = 'North America'",
        )
    ],
    metric_definitions=[
        MetricDefinition(name="revenue", description="Revenue by day and source category."),
        MetricDefinition(name="orders_count", description="Orders by day and source category."),
    ],
    dimension_definitions=[
        DimensionDefinition(name="metric_date", description="Daily grain."),
        DimensionDefinition(name="source_category", description="Channel grouping."),
    ],
    rules=[
        DomainRule(
            name="preserve_source_breakdown",
            description="Keep source_category when the question asks for a source breakdown.",
            required_dimensions=["source_category"],
            preferred_tables=["daily_channel_metrics"],
        )
    ],
)

result = await nlp.ask(
    "Show daily revenue and order count by source category for the North America flagship store",
    semantic_context=semantic_context,
)

5. Validate and Repair

ask() supports execution-aware modes directly.

result = await nlp.ask(
    "Show revenue by source category for the flagship store",
    validate=True,
    repair=True,
)
  • generate_only: generate SQL only
  • generate_and_validate: execute readonly validation when execution is wired
  • generate_validate_repair: retry on semantic or execution failures when repair hooks are available

6. CLI Parity

The CLI exposes the same concepts:

nlp2sql query \
  --database-url postgresql://testuser:testpass@localhost:5432/testdb \
  --question "Show daily revenue by source category for the North America flagship store" \
  --examples-file examples.json \
  --semantic-context-file semantic-context.json \
  --validate \
  --repair \
  --show-semantic-context \
  --show-sql-intent-plan \
  --show-selected-examples

How It Works

flowchart TD
    userCode[UserCodeOrCLI] --> dsl[connectAndAskDSL]
    dsl --> analysis[QueryAnalysisService]
    analysis --> semantic[SemanticResolutionService]
    semantic --> retrieval[SchemaRetrievalAndExampleSelection]
    retrieval --> intent[SqlIntentPlanningService]
    intent --> prompt[PromptAssemblyAndAdapters]
    prompt --> llm[LLMGeneration]
    llm --> semval[SemanticValidation]
    semval --> exec[OptionalExecutionAndRepair]
    exec --> result[QueryResultMetadata]

At runtime the library:

  1. analyzes the question
  2. optionally resolves and merges semantic context
  3. retrieves relevant schema and examples
  4. builds a structured SQL intent plan
  5. assembles the prompt
  6. generates SQL
  7. optionally validates, executes, and repairs
  8. returns a QueryResult with debug metadata

See Architecture for the full breakdown.

Public Example Domain

This repository ships a local e-commerce integration domain used in tests and docs. It includes:

  • stores
  • marketing_channels
  • users
  • products
  • orders
  • order_items
  • daily_channel_metrics

The public examples intentionally stay inside that domain to avoid leaking any private warehouse schema.

To start it locally:

cd docker
docker compose up -d postgres

The default URL is:

postgresql://testuser:testpass@localhost:5432/testdb

Provider Comparison

Provider Default Model Context Size Best For
OpenAI gpt-4o-mini 128K Fast general purpose usage
Anthropic claude-sonnet-4-20250514 200K Larger schemas and long prompts
Gemini gemini-2.0-flash 1M High-volume and very large contexts

All models are configurable through ProviderConfig.

Lower-Level API

connect() is the recommended path. Lower-level entry points still exist for advanced wiring:

from nlp2sql import DatabaseType, ProviderConfig, create_and_initialize_service

service = await create_and_initialize_service(
    database_url="postgresql://testuser:testpass@localhost:5432/testdb",
    provider_config=ProviderConfig(provider="openai", api_key="sk-..."),
    database_type=DatabaseType.POSTGRES,
)

result = await service.generate_sql(
    "Count active users by region",
    database_type=DatabaseType.POSTGRES,
)
print(result["sql"])

Development

git clone https://github.com/luiscarbonel1991/nlp2sql.git
cd nlp2sql
uv sync

# Start the local public e-commerce database
cd docker && docker compose up -d postgres

# Integration tests without llm
cd ..
uv run pytest -m "integration and not llm"

# Optional llm integration tests
uv run pytest -m "integration and llm"

MCP Server

nlp2sql includes a Model Context Protocol server for assistant integration.

{
  "mcpServers": {
    "nlp2sql": {
      "command": "python",
      "args": ["/path/to/nlp2sql/mcp_server/server.py"],
      "env": {
        "OPENAI_API_KEY": "${OPENAI_API_KEY}",
        "NLP2SQL_DEFAULT_DB_URL": "postgresql://testuser:testpass@localhost:5432/testdb"
      }
    }
  }
}

See mcp_server/README.md for details.

Contributing

See CONTRIBUTING.md.

License

MIT License. See LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

nlp2sql-0.2.0rc13.tar.gz (605.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

nlp2sql-0.2.0rc13-py3-none-any.whl (128.2 kB view details)

Uploaded Python 3

File details

Details for the file nlp2sql-0.2.0rc13.tar.gz.

File metadata

  • Download URL: nlp2sql-0.2.0rc13.tar.gz
  • Upload date:
  • Size: 605.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for nlp2sql-0.2.0rc13.tar.gz
Algorithm Hash digest
SHA256 ee8db97a79fdbcd7a8b11c3e702434b85bbdf5be5736b3e84e55ee944fbd7c67
MD5 85dccc2ec2b74bde1607cd535604334a
BLAKE2b-256 ae61ceefa12084be7911f61ce93190e6aabeb72f62124babf6c222e514556056

See more details on using hashes here.

Provenance

The following attestation bundles were made for nlp2sql-0.2.0rc13.tar.gz:

Publisher: publish-pypi.yml on luiscarbonel1991/nlp2sql

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file nlp2sql-0.2.0rc13-py3-none-any.whl.

File metadata

  • Download URL: nlp2sql-0.2.0rc13-py3-none-any.whl
  • Upload date:
  • Size: 128.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for nlp2sql-0.2.0rc13-py3-none-any.whl
Algorithm Hash digest
SHA256 8c7d515b479b6d54f9d29c1da7880a003544a3a997698d8bb9d3532e670af1fe
MD5 a750300d75fc637ef91eeebd8dc49ebf
BLAKE2b-256 c1b4de9b376156888b63c42c110869a16b42a9d4808590cc386765b0d9aea42e

See more details on using hashes here.

Provenance

The following attestation bundles were made for nlp2sql-0.2.0rc13-py3-none-any.whl:

Publisher: publish-pypi.yml on luiscarbonel1991/nlp2sql

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page