
Generate SQL from natural language using DB schema + vector retrieval


NaturalSQL

Spanish documentation: README_SPANISH.md


Lightweight library to convert your SQL database schema into a vector database, so LLMs can use real table/column context to generate more accurate SQL from natural language.

Problem

I wanted to interact with SQL databases through an LLM without pulling in large frameworks with many unrelated features.

Solution

NaturalSQL extracts your database schema, vectorizes it using a configurable backend (chroma or sqlite) and a configurable embedding provider (local or gemini), then performs semantic retrieval to return relevant schema context for your LLM.

Installation

# Base installation
pip install naturalsql

# Chroma + local embeddings
pip install "naturalsql[chroma-local]"

# SQLite + local embeddings
pip install "naturalsql[sqlite-local]"

# SQLite + Gemini embeddings
pip install "naturalsql[sqlite-gemini]"

# Chroma + Gemini embeddings
pip install "naturalsql[chroma-gemini]"

# PostgreSQL driver support
pip install "naturalsql[postgresql]"

# MySQL driver support
pip install "naturalsql[mysql]"

# SQL Server driver support
pip install "naturalsql[sqlserver]"

# All DB drivers
pip install "naturalsql[all-db]"

SQLite is included in the Python standard library. For Gemini, the current SDK is google-genai (import path: google.genai). Use environment variables for API keys (e.g. GEMINI_API_KEY), never hardcode secrets.
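For example, the key can be read from the environment at startup. A minimal sketch (load_gemini_key is a hypothetical helper, not part of the library):

```python
import os

def load_gemini_key(env: dict = os.environ) -> str:
    """Fetch the Gemini API key from the environment, failing loudly if absent."""
    key = env.get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("Set GEMINI_API_KEY before using the gemini provider")
    return key
```

Failing early with a clear message beats passing None into the client and debugging an opaque authentication error later.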

Supported SQL engines

  • PostgreSQL
  • MySQL
  • SQL Server
  • SQLite

Quick start

from naturalsql import NaturalSQL

# 1) Create an instance
nsql = NaturalSQL(
    db_url="postgresql://user:password@localhost:5432/mydb",
    db_type="postgresql",
)

# 2) Build vector DB from schema
result = nsql.build_vector_db()
print(f"Indexed documents: {result['indexed_documents']}")
print(f"From cache: {result['from_cache']}")

# 3) Retrieve relevant tables for a question
tables = nsql.search("Show me sales from last month")

# 4) Use returned tables as LLM context
for table in tables:
    print(table)

Automatic cache

The first search() call loads the embedding model (~2-5s). Subsequent calls reuse the in-memory instance and typically run in ~10-15ms.

Similarly, build_vector_db() reuses existing vector storage when available (forced_reset=False) to avoid unnecessary reindexing.

E2E examples

from naturalsql import NaturalSQL

# A) Chroma + local (practical tuning for Chroma 1.5.x)
nsql = NaturalSQL(
    db_url="postgresql://user:pass@localhost:5432/mydb",
    db_type="postgresql",
    vector_backend="chroma",
    embedding_provider="local",
    vector_distance_threshold=1.6,  # If search() returns [], try 1.4-1.6
)

nsql.build_vector_db(storage_path="./metadata_vdb", forced_reset=False)
tables = nsql.search("sales for the last month", limit=3)
print(tables)

import os
from naturalsql import NaturalSQL

# B) SQLite + Gemini
nsql = NaturalSQL(
    db_url="sqlite:///./app.db",
    db_type="sqlite",
    vector_backend="sqlite",
    embedding_provider="gemini",
    gemini_api_key=os.environ["GEMINI_API_KEY"],
    gemini_embedding_model="gemini-embedding-2-preview",
)

nsql.build_vector_db(storage_path="./metadata_vdb_sqlite", forced_reset=False)
tables = nsql.search("users with recent purchases", limit=3)
print(tables)

API

NaturalSQL(**kwargs)

Creates an instance with DB and embedding configuration.

  • db_url (str | None, default None): database connection URL.
  • db_type (str, default ""): engine: postgresql, mysql, sqlite, or sqlserver.
  • db_normalize_embeddings (bool, default True): normalize embedding vectors.
  • device (str, default "cpu"): embedding device, cpu or cuda.
  • vector_backend (Literal["chroma", "sqlite"], default "chroma"): vector backend.
  • embedding_provider (Literal["local", "gemini"], default "local"): embedding provider.
  • gemini_api_key (str | None, default None): required when embedding_provider="gemini".
  • gemini_embedding_model (str, default "gemini-embedding-2-preview"): Gemini embedding model.
  • vector_distance_threshold (float, default 1.0): maximum distance used by search() filtering. For Chroma 1.5.x, if you get [], try 1.4-1.6 (validated with 1.6).

nsql.build_vector_db(...) -> dict

Connects to the DB, extracts schema, and indexes it in the configured vector backend (chroma or sqlite).

  • storage_path (str, default "./metadata_vdb"): vector storage path.
  • forced_reset (bool, default False): rebuild the vector collection from scratch.

Returns:

  • storage_path (str): storage path used.
  • indexed_documents (int): total number of indexed documents (tables + relationships).
  • from_cache (bool): True if an existing vector store was reused.

nsql.search(...) -> list

Retrieves semantically relevant tables for a natural-language question.

  • request (str, required): natural-language question.
  • storage_path (str, default "./metadata_vdb"): vector storage path.
  • limit (int, default 3): maximum number of tables to return.

build_prompt(relevant_tables, user_question) -> str

Helper function in naturalsql.utils.prompt to create an LLM prompt using relevant tables + user question.

from naturalsql.utils.prompt import build_prompt

prompt = build_prompt(tables, "Show me sales from last month")

How it works internally

Think of NaturalSQL as a 5-step pipeline:

  1. Read DB structure with simple SQL/system queries.
  2. Normalize that structure into one common Python dictionary.
  3. Convert dictionary entries into semantic text documents.
  4. Turn those texts into embedding vectors.
  5. Store vectors in Chroma or SQLite and retrieve best matches for RAG.

1) Schema extraction strategy by DB engine

NaturalSQL does not parse SQL files. It asks the live database for metadata.

  • PostgreSQL: information_schema.columns for columns + joins over information_schema.table_constraints, key_column_usage, constraint_column_usage for foreign keys.
  • MySQL: information_schema.columns and information_schema.key_column_usage (scoped by DATABASE()).
  • SQL Server: INFORMATION_SCHEMA.COLUMNS for columns + sys.foreign_key_columns with sys.tables, sys.schemas, sys.columns for FK relationships.
  • SQLite: sqlite_master (table list), PRAGMA table_info('<table>') for columns, PRAGMA foreign_key_list('<table>') for relationships.

Essential idea: most engines use information_schema; SQLite uses PRAGMA.
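The SQLite path can be illustrated end to end with the standard library. This is a self-contained sketch against an in-memory database, not the library's actual extractor:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(id)
    );
""")

# Table list from sqlite_master (skipping SQLite's internal tables).
tables = sorted(row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' "
    "AND name NOT LIKE 'sqlite_%'"))

# Columns via PRAGMA table_info; each row is
# (cid, name, type, notnull, dflt_value, pk).
columns = {t: [(c[1], c[2]) for c in conn.execute(f"PRAGMA table_info('{t}')")]
           for t in tables}

# Foreign keys via PRAGMA foreign_key_list; each row is
# (id, seq, table, from, to, on_update, on_delete, match).
fks = [(fk[3], fk[2], fk[4])
       for fk in conn.execute("PRAGMA foreign_key_list('orders')")]

print(tables)             # ['orders', 'users']
print(columns["users"])   # [('id', 'INTEGER'), ('email', 'TEXT')]
print(fks)                # [('user_id', 'users', 'id')]
```

The same three questions (which tables, which columns, which foreign keys) map to information_schema queries on the other engines.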

2) Unified schema dictionary (same format for all engines)

After extraction, data is normalized to one shape:

{
  "tables": {
    "public.users": {
      "schema": "public",
      "table": "users",
      "columns": [("id", "integer"), ("email", "text")]
    }
  },
  "relationships": [
    {
      "from_schema": "public",
      "from_table": "orders",
      "from_column": "user_id",
      "to_schema": "public",
      "to_table": "users",
      "to_column": "id"
    }
  ]
}

This is what makes the next steps backend-agnostic.

3) Dictionary to semantic documents (for embedding)

The extractor then generates documents for two kinds of knowledge:

  • kind=table: table + columns + data types
  • kind=relationship: FK direction between tables

Example payload item:

{
  "id": "table::public.users",
  "content": "Schema: public. Table name: users. It has the following columns: id (integer), email (text)",
  "metadata": {
    "kind": "table",
    "schema": "public",
    "table": "users"
  }
}

Relationship example:

{
  "id": "rel::public.orders.user_id->public.users.id",
  "content": "Relationship: public.orders.user_id -> public.users.id",
  "metadata": {
    "kind": "relationship"
  }
}
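Producing those documents from the unified dictionary is a straightforward string-building step. A hypothetical sketch (make_table_doc and make_rel_doc are illustrative names, not the library's API):

```python
def make_table_doc(qualified_name: str, info: dict) -> dict:
    # Render columns as "id (integer), email (text)".
    cols = ", ".join(f"{name} ({dtype})" for name, dtype in info["columns"])
    return {
        "id": f"table::{qualified_name}",
        "content": (f"Schema: {info['schema']}. Table name: {info['table']}. "
                    f"It has the following columns: {cols}"),
        "metadata": {"kind": "table", "schema": info["schema"],
                     "table": info["table"]},
    }

def make_rel_doc(rel: dict) -> dict:
    # Encode the FK direction as "source -> target".
    src = f"{rel['from_schema']}.{rel['from_table']}.{rel['from_column']}"
    dst = f"{rel['to_schema']}.{rel['to_table']}.{rel['to_column']}"
    return {
        "id": f"rel::{src}->{dst}",
        "content": f"Relationship: {src} -> {dst}",
        "metadata": {"kind": "relationship"},
    }

doc = make_table_doc("public.users", {
    "schema": "public", "table": "users",
    "columns": [("id", "integer"), ("email", "text")],
})
print(doc["content"])
```

Flattening structure into prose like this is what lets a general-purpose embedding model place tables and questions in the same vector space.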

4) Embedding generation

NaturalSQL supports two embedding providers:

  • local: sentence-transformers (all-MiniLM-L6-v2)
  • gemini: google-genai (gemini-embedding-2-preview by default)

The system embeds:

  • documents with retrieval-document mode
  • user query with retrieval-query mode

This gives a vector for each schema document and one vector for each user question.

5) Vector storage strategy (Chroma vs SQLite)

Chroma backend

  • Uses chromadb.PersistentClient(path=storage_path).
  • Stores ids, documents, embeddings, metadatas in collection db_schema.
  • Retrieval uses native vector query + metadata filter, for example where={"kind": "table"}.

SQLite backend

  • Creates vectors.db under storage_path.
  • Table schema: id, content, embedding (JSON text), metadata_json (JSON text).
  • Retrieval loads rows and computes cosine distance in Python (NumPy if available, pure Python fallback if not).
  • Also filters by kind from metadata_json.
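The distance computation itself is small. A pure-Python sketch of cosine distance (the library reportedly uses NumPy when available; this shows the fallback-style math):

```python
import math

def cosine_distance(a, b):
    # distance = 1 - cosine similarity; 0 means same direction, 2 opposite.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # zero vector has no direction: treat as unrelated
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance([1, 0], [1, 0]))  # 0.0
print(cosine_distance([1, 0], [0, 1]))  # 1.0
```

For a few hundred schema documents this brute-force scan is fast enough; the trade-off only bites at much larger corpus sizes, as noted under limitations.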

6) Retrieval flow for RAG

When you call search("...", limit=N), the flow is:

  1. Embed query text.
  2. Query kind=table and kind=relationship separately.
  3. Merge both result sets.
  4. Sort by distance (smaller is better).
  5. Apply vector_distance_threshold.
  6. Return top N schema contexts for your prompt.

So the LLM receives real schema context instead of guessing table/column names.
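Steps 3-6 can be sketched with plain lists of (document, distance) pairs (merge_and_rank is an illustrative name and the distances are made up):

```python
def merge_and_rank(table_hits, rel_hits, threshold, limit):
    # Merge table and relationship hits, drop anything beyond the
    # distance threshold, and return the closest `limit` documents.
    merged = table_hits + rel_hits
    kept = [(doc, dist) for doc, dist in merged if dist <= threshold]
    kept.sort(key=lambda pair: pair[1])  # smaller distance = better match
    return [doc for doc, _ in kept[:limit]]

table_hits = [("table::public.sales", 0.42), ("table::public.users", 1.30)]
rel_hits = [("rel::public.sales.user_id->public.users.id", 0.65)]
print(merge_and_rank(table_hits, rel_hits, threshold=1.0, limit=3))
```

With threshold=1.0, the users table (distance 1.30) is filtered out; loosening the threshold, as suggested for Chroma 1.5.x, would let it back in.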

7) Minimal end-to-end internal view

from naturalsql.sql.sqlschema import SQLSchemaExtractor
from naturalsql.controller.controllervector import VectorManager

# 1) extract (assumes `connection` is an already-open DB connection)
extractor = SQLSchemaExtractor(connection, db_type="postgresql")
schema_bundle = extractor.extract_schema()

# 2) normalize -> semantic docs
documents_payload = extractor.formated_for_ia(schema_bundle)

# 3) embed + persist (chroma or sqlite, depending on the `config` object)
vm = VectorManager(storage_path="./metadata_vdb", force_reset=False, config=config)
vm.index_documents(documents_payload)

# 4) retrieve for RAG
context_docs = vm.search_relevant_tables("sales from last month", limit=3)

Backend comparison

  • Storage files: Chroma uses a persistent directory; SQLite uses a single vectors.db file.
  • Vector query: Chroma runs a native ANN query; SQLite computes cosine distance in Python over the stored vectors.
  • Metadata filtering: Chroma applies its native where filter; SQLite filters on metadata_json in Python.
  • Dependency profile: Chroma requires chromadb; SQLite uses the stdlib sqlite3 module plus optional NumPy.
  • Operational style: Chroma behaves like a dedicated vector DB; SQLite is a lightweight embedded local store.

Current limitations

These points are based on current implementation behavior:

  • No hybrid ranking strategy yet: retrieval is distance-based over embeddings only (no lexical/BM25 rerank).
  • Relationship text is concise (A.col -> B.col), so very complex semantics are not explicitly encoded.
  • SQLite backend computes similarity in Python, so very large corpora may be slower than native vector engines.
  • build_vector_db() reports indexed_documents as the total number of indexed documents (tables + relationships), which can be misread as a count of tables only.
  • Schema extraction is structural (tables/columns/FKs). It does not ingest business definitions, comments, or curated ontology unless you add them as extra documents.

In practice, this still provides strong RAG context for SQL generation because table fields and foreign-key topology are the highest-value grounding signals.

Tests

See test/README.md for testing instructions.

License

Apache License 2.0
