Skip to main content

Text-to-analytics toolkit built on PyDough

Project description

pydough-analytics

Community Edition toolkit that combines the PyDough DSL with LLM-based prompting to deliver text-to-analytics workflows. The package provides:

  • Metadata generator that turns relational databases into PyDough knowledge graphs.
  • Prompt construction and llm-powered inference to translate natural language questions into PyDough code.
  • Safe execution layer that materialises PyDough results as SQL and DataFrames.
  • A Typer-based CLI for metadata generation and ad-hoc querying.

Provider Setup — Env (Vertex vs API‑Key)

Below are concise .env examples reflecting the two modes we support and a variant with explicit region.

Do not commit real credentials or API keys to Git. Use placeholders in docs and local .env files.


1) Minimal Vertex (recommended, default) (Gemini & Claude)

Use ADC + Vertex. No API key required. The SDK will use Vertex if you pass project/location in code or set GOOGLE_GENAI_USE_VERTEXAI=true.

# .env — minimal Vertex
GOOGLE_PROJECT_ID="your-gcp-project-id"
GOOGLE_APPLICATION_CREDENTIALS=/abs/path/to/service-account.json
GOOGLE_GENAI_USE_VERTEXAI=true
# Optional: explicit region selection (see #3), defaults noted below
# GOOGLE_REGION="us-east5"    # e.g., Claude default region
# GOOGLE_REGION="us-central1" # e.g., Gemini default region

Defaults / notes

  • Gemini on Vertex: default region on code if not provided is us-central1.
  • Claude on Vertex: default region on code if not provided is us-east5.
  • Vertex can also use credentials via gcloud auth application-default login
  • Ensure IAM role like roles/aiplatform.user and Vertex AI API enabled.

2) API‑Key mode (no Vertex) — Gemini only

If you set GOOGLE_GENAI_USE_VERTEXAI=false, the code will use the Google AI Studio (API‑key) SDK for Gemini. In this mode, GOOGLE_API_KEY is required, and ADC / project / region are not used by the Gemini client.

# .env — API‑key mode (Gemini via Google AI Studio API)
GOOGLE_API_KEY="your-google-api-key"
GOOGLE_GENAI_USE_VERTEXAI=false

# These may exist in your shell and are harmless here, but are not required by API‑key mode:
# GOOGLE_PROJECT_ID="your-gcp-project-id"
# GOOGLE_APPLICATION_CREDENTIALS=/abs/path/to/service-account.json
# GOOGLE_REGION="us-central1"

Notes

  • No IAM or Vertex regional control; intended for quick tests or limited environments.

3) Vertex with explicit region (Gemini & Claude)

Set an explicit region that supports the models you plan to use. if you do not set either one of the they have the next default values:

  • Geminius-central1
  • Claudeus-east5
# .env — Vertex with explicit region
GOOGLE_PROJECT_ID="your-gcp-project-id"
GOOGLE_REGION="us-east5"  # or us-central1, europe-west4, etc., if supported
GOOGLE_APPLICATION_CREDENTIALS=/abs/path/to/service-account.json
GOOGLE_GENAI_USE_VERTEXAI=true
# GOOGLE_API_KEY can be unset in Vertex mode

Recap

  • Switch between modes using GOOGLE_GENAI_USE_VERTEXAI:
    • true → Vertex (ADC). Requires GOOGLE_PROJECT_ID (+ GOOGLE_REGION optional) and credentials.
    • false → API‑key mode for Gemini. Requires GOOGLE_API_KEY.
  • Claude in this repo runs only via Vertex (ADC), so it needs project and a supported region (e.g., us-east5).

Using Local Models (Ollama)

PyDough-CE can also be used with locally hosted open-source models via Ollama. However, it’s important to note that results may vary and are not fully standardized when using local models.

The current PyDough prompting strategy and DSL generation were primarily designed and optimized for high-performing hosted models, specifically Gemini 2.5 Pro and Anthropic Claude (Opus / Sonnet). As a result, local models may occasionally produce PyDough code that is syntactically close but semantically incorrect (for example, small naming mismatches or invalid expressions).

That said, local inference can still be useful for experimentation, development, or environments without external API access.

Recommended local models:

  • gemma3:12b
  • qwen3:8b
  • llama3.1:8b

Running PyDough-CE with Ollama

Make sure Ollama is installed and running locally:

ollama serve

Pull a model (example):

ollama pull gemma3:12b

Set the required environment variables via CLI(or .env file):

export OLLAMA_BASE_URL=http://localhost:11434
export OLLAMA_CONTEXT_LENGTH=32768

A large context window is strongly recommended, as PyDough prompts are relatively large to perform well.


Recommendation

For the most reliable and consistent results, we recommend using Gemini 2.5 Pro or Anthropic Claude Opus 4. Local models via Ollama are supported, but should be considered experimental at this time.

TPCH sample database (download helper)

To make local testing easy, this repo includes a small helper script to download the TPCH demo database.

  • Script location: setup_tpch.sh
  • What it does: If the target file already exists, it prints FOUND and exits. Otherwise it downloads the SQLite DB.
  • Where the DB should live: ./data/databases/TPCH.db (from the repo root). The rest of the docs/CLI examples assume this path.

One-liner (macOS/Linux)

Run from the repo root:

mkdir -p ./data/databases
bash setup_tpch.sh ./data/databases/TPCH.db

If you don't have wget, you can use curl instead:

mkdir -p ./data/databases
curl -L https://github.com/lovasoa/TPCH-sqlite/releases/download/v1.0/TPC-H.db -o ./data/databases/TPCH.db

Verify the file is present:

ls -lh ./data/databases/TPCH.db

Windows (PowerShell)

New-Item -ItemType Directory -Force -Path .\data\databases | Out-Null
Invoke-WebRequest -Uri https://github.com/lovasoa/TPCH-sqlite/releases/download/v1.0/TPC-H.db -OutFile .\data\databases\TPCH.db

Quick Guide

Terminal location:

Run all of the next commands from the pydough-analytics folder (the folder that contains data/, docs/, samples/, src/, etc.). Quick check:

ls data
# → databases  metadata  metadata_markdowns  prompts

1) Generate metadata JSON

pydough-analytics generate-json   --url 'sqlite:///data/databases/TPCH.db'   --graph-name tpch   --json-path ./data/metadata/Tpch_graph.json
  • --url: Connection string.
  • --graph-name: Logical name for this dataset (you’ll reuse it as --db-name in ask).
  • --json-path: Where to save the graph JSON.

2) Export Markdown (used by the LLM)

pydough-analytics generate-md   --graph-name tpch   --json-path ./data/metadata/Tpch_graph.json   --md-path ./data/metadata_markdowns/Tpch.md
  • --graph-name: Logical name for this dataset in the JSON.

  • --json-path: Where to save the graph JSON.

  • --md-path: Where to save the Markdown.

  • Markdown helps the LLM stay grounded during ask.

  • Keep JSON + Markdown in version control for reproducibility.


3) Ask the LLM

The PyDough code is always printed. You can optionally show SQL, a DataFrame preview, and an explanation.

pydough-analytics ask   --question "Give me the name of all the suppliers from the United States"   --url 'sqlite:///data/databases/TPCH.db'   --db-name tpch   --md-path ./data/metadata_markdowns/Tpch.md   --kg-path ./data/metadata/Tpch_graph.json   --show-sql --show-df --show-explanation
  • --question: Natural language question to be answered.

  • --url: Full database connection string (See ## Supported Backends).

  • --db-name: Logical database name, typically matching the graph name.

  • --md-path: Path to the Markdown documentation describing the database.

  • --kg-path: Path to the JSON file containing the metadata.

  • --provider: Optional LLM provider (e.g. openai, anthropic).

  • --model: Optional LLM model identifier (e.g. gpt-4o-mini, claude-3-haiku).

  • --show-sql: Optional Print the SQL query generated (default: False).

  • --show-df: Optional Display the resulting DataFrame (default: False).

  • --show-explanation: Optional Print the reasoning or explanation provided by the LLM (default: False).

  • --as-json: Optional Output the query result as raw JSON (default: False).

  • --rows: Optional Number of rows to display from the resulting DataFrame (default: 20).

  • If you switch from defaults (e.g., to Anthropic), add:

    --provider anthropic --model claude-sonnet-4-5@20250929
    

To run PyDough-CE with Ollama you'll also need to add the provider and model as corresponds:

pydough-analytics ask   --question "Give me the name of all the suppliers from the United States"   --url sqlite:///data/databases/TPCH.db   --db-name tpch   --md-path ./data/metadata_markdowns/tpch.md   --kg-path ./data/metadata/tpch_graph.json   --provider ollama   --model gemma3:12b   --show-sql   --show-df   --show-explanation

Troubleshooting

  • No such file or directory → Check paths and casing; ensure ./data/metadata and ./data/metadata_markdowns exist.
  • Model not found → Model IDs vary by provider (Anthropic direct vs Vertex vs Bedrock). Use the correct one.
  • Vertex AI auth error → Use an absolute path for GOOGLE_APPLICATION_CREDENTIALS and set project/region.

Supported backends

PyDough supports multiple database backends through SQLAlchemy connection strings. Each connection string must include all required parameters (user, password, host, port, database, and optional schema/warehouse where applicable).

SQLite

Used for local files or in-memory databases. All values are required in the URL format.

sqlite:///path/to/mydb.db
  • Uses a local .sqlite or .db file.

To use an in-memory database (for testing):

sqlite:///:memory:

Snowflake

Used for analytical data warehouses. The connection string must include credentials, account, database, schema, and warehouse.

snowflake://user:password@account/db/schema?warehouse=WH&role=PUBLIC

Required values:

  • user: Snowflake username.
  • password: Snowflake password.
  • account: Snowflake account identifier (e.g. xy12345.us-east-1).
  • db: Database name.
  • schema: Schema name.
  • Query parameters (warehouse and role) are required for most configurations.

MySQL

Used for transactional databases. The URL must include user, password, host, port, and database. Internally converted to mysql+mysqlconnector://.

mysql://user:password@host:port/mydb

Converted internally to:

mysql+mysqlconnector://user:password@host:port/mydb

Required values:

  • user: MySQL username.
  • password: MySQL password.
  • host: Server hostname or IP address.
  • port: Port number (default: 3306).
  • mydb: Database name.

PostgreSQL

Used for relational and analytical databases. The URL must include all connection details. Internally converted to postgresql+psycopg2://.

postgres://user:password@host:port/mydb

Converted internally to:

postgresql+psycopg2://user:password@host:port/mydb

Required values:

  • user: PostgreSQL username.
  • password: PostgreSQL password.
  • host: Server hostname or IP address.
  • port: Port number (default: 5432).
  • mydb: Database name.

BodoSQL

A high performance SQL engine that can connect to a variety of data sources. Currently, the BodoSQL backend is only available through the Python API. To use this backend, pass a BodoSQLContext to LLMClient.ask().

For example, the following code assumes TPCH data was written in Apache Iceberg table format to a local directory:

import os
import pandas as pd
from bodosql import BodoSQLContext, FileSystemCatalog

catalog = FileSystemCatalog(os.path.abspath("./data/databases/tpch_db"))
bc = BodoSQLContext(catalog=catalog)

result = client.ask(
    bodosql_context=bc
    question="What are the most common transaction statuses and their respective counts?",
    kg_path="./data/metadata/tpch_graph.json",           # Knowledge Graph JSON
    md_path="./data/metadata_markdowns/tpch.md",         # Markdown doc for the DB
    db_name="TPCH"
)

For a complete list of supported database catalogs, see here.

MCP Server (Optional)

You can optionally run PyDough-Analytics as a Machine Cooperation Protocol (MCP) server to expose its analytics tools programmatically (for example, from Claude Desktop).

Installation

Install the MCP extras from the root folder:

pip install "pydough-analytics[mcp]"

Run the server

  1. Navigate to the main project directory and then to pydough-analytics
cd pydough-analytics
  1. Start the MCP server:
fastmcp run src/pydough_analytics/mcp/mcp_entry.py:server

The server exposes a full suite of tools and resources under the MCP name pydough-analytics.

See more on the README_MCP.md under the mcp folder.

Available Tools

Tool Description
pydough.init_metadata(url, graph_name="DATABASE") Generates metadata JSON and optionally Markdown from a live database.
pydough.open_session(database_url or db_config, metadata_path=..., graph_name="DATABASE") Opens a PyDough session and returns a unique session_id.
pydough.ask(session_id, question, auto_correct=False, max_corrections=1) Runs an LLM-assisted query using the session’s metadata and DB configuration. Returns PyDough code, SQL, and result rows.
pydough.schema_markdown(session_id) Returns the Markdown schema documentation for the active session.
pydough.list_sessions() Lists active sessions with basic diagnostic info.
pydough.close_session(session_id) Closes the session and removes its temporary metadata files.

Resources

Resource URI Description
pydough://metadata/{session_id} Markdown representation of the session’s graph schema.
pydough://result/{session_id} JSON object containing the last query’s PyDough code, SQL, and results.

Environment Variables

The MCP server uses the same environment configuration as the CLI:

  • GOOGLE_PROJECT_ID, GOOGLE_REGION, and GOOGLE_APPLICATION_CREDENTIALS — for Vertex/Claude/Gemini clients.
  • GEMINI_API_KEY or ANTHROPIC_API_KEY — if using direct SDK mode.

When metadata is passed inline (instead of via metadata_path), it’s persisted temporarily in /tmp/pydough_analytics_mcp/ and automatically deleted when close_session() is called.


Example Manifest (Claude Desktop)

{
  "name": "pydough-analytics",
  "command": [
    "fastmcp",
    "run",
    "src/pydough_analytics/mcp/mcp_entry.py:server"
  ]
}

Suggested next steps

  • To expand database coverage updating the CLI to accept engine-specific flags and extending the metadata.
  • Improve the troubleshooting documentation by covering engine-specific errors, connection problems, missing database files, and common CLI usage mistakes with clear resolutions.
  • Provide richer examples and Jupyter notebooks, showing end-to-end pipelines from SQLite, connecting to different databases, and visualizing metadata graphs for more practical learning.

Source folder structure

pydough-analytics/
├── src/                   # Library source code.
│   └── pydough_analytics/
│       ├── commands/      # CLI command implementations.
│       ├── config/        # Default settings and configuration helpers.
│       ├── data/          # Internal data loaders or fixtures.
│       ├── llm/           # Modules for LLM integration.
│       ├── metadata/      # Metadata generation and validation logic.
│       ├── mcp/           # Machine Cooperation Protocol (MCP) server and entrypoint.
│       ├── utils/         # Shared utility functions.
│       ├── __init__.py    # Package entry.
│       ├── __main__.py    # Allows `python -m pydough_analytics` execution.
│       ├── _version.py    # Package version constant.
│       └── cli.py         # Typer CLI entrypoint (`pydough-analytics`).
└── README.md              # Package-specific documentation.

Architecture Overview

+-----------------------------+
| CLI (Typer)                 |
| (generate-json, generate-md,|
|  ask)                       |
+-------------+---------------+
              |
              v
+-----------------------------+       +---------------------------+
| Metadata Generator          | ----> | Metadata JSON             |
| (SQLAlchemy inspector +     |       | (graph definition, V2)    |
|  identifier sanitizer,      |       +---------------------------+
|  type mapping)              |
+-------------+---------------+
              |
              v
+-----------------------------+       +---------------------------+
| Markdown Exporter           | ----> | Markdown Docs             |
| (render schema from graph)  |       | (human-readable overview) |
+-------------+---------------+       +---------------------------+

              |
              v
+-----------------------------+       +---------------------------+
| Ask Command (Typer)         | ----> | LLM Client                |
| (natural language question) |       | (prompt + schema + guide) |
+-------------+---------------+       +-------------+-------------+
              |
              v
+-----------------------------+       +---------------------------+
| AI Providers                | ----> | Gemini / Claude / aisuite |
| (google, anthropic,         |       |                           |
|  other via aisuite)         |       +---------------------------+
+-------------+---------------+
              |
              v
+-----------------------------+       +---------------------------+
| PyDough Executor            | ----> | SQL + DataFrame           |
| (extract code, run on DB,   |       | (results + explanation)   |
|  sanitize, retry on errors) |       +---------------------------+
+-------------+---------------+
              |
              v
+-----------------------------+       +---------------------------+
| MCP Server (FastMCP)        | ----> | External Clients (MCP)    |
| (tools + resources:         |       | Inspector / Claude / IDEs |
|  init_metadata, ask, etc.)  |       |                           |
+-----------------------------+       +---------------------------+

Notes:
- **Engines**: SQLite (built-in), Snowflake, MySQL and PostgreSQL.
- **LLM Providers**: Google Gemini, Anthropic (Claude), aisuite (others).
- **Artifacts**: JSON graph, Markdown docs, generated PyDough code, SQL, result DataFrame.
- **MCP Server**: Exposes PyDough-Analytics functions to external tools through a Machine Cooperation Protocol interface.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pydough_analytics-0.1.7.tar.gz (53.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pydough_analytics-0.1.7-py3-none-any.whl (64.2 kB view details)

Uploaded Python 3

File details

Details for the file pydough_analytics-0.1.7.tar.gz.

File metadata

  • Download URL: pydough_analytics-0.1.7.tar.gz
  • Upload date:
  • Size: 53.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for pydough_analytics-0.1.7.tar.gz
Algorithm Hash digest
SHA256 326f8efa662a0046b4de964f69330d488898d7250d61e846084e1b54f9ab8061
MD5 d1fbbaaec939e753cdec96b5aa718ea0
BLAKE2b-256 c632e0f9ebfc9ceafdc18189a6239189f23ec67a59191e6a069257ce4c3b4c60

See more details on using hashes here.

File details

Details for the file pydough_analytics-0.1.7-py3-none-any.whl.

File metadata

File hashes

Hashes for pydough_analytics-0.1.7-py3-none-any.whl
Algorithm Hash digest
SHA256 41723caef5c610c4cc2a83386179733152b35d1bda5cea30c4e1281186d5152b
MD5 33f55a1c84c7bc54f63f606be7c2ec95
BLAKE2b-256 facd23f96e6cd78c1ac65eae4072c968a796022193c9dbd6e7aa5f6363af7a49

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page