Text-to-analytics toolkit built on PyDough
Project description
pydough-analytics
Community Edition toolkit that combines the PyDough DSL with LLM-based prompting to deliver text-to-analytics workflows. The package provides:
- Metadata generator that turns relational databases into PyDough knowledge graphs.
- Prompt construction and llm-powered inference to translate natural language questions into PyDough code.
- Safe execution layer that materialises PyDough results as SQL and DataFrames.
- A Typer-based CLI for metadata generation and ad-hoc querying.
Provider Setup — Env (Vertex vs API‑Key)
Below are concise .env examples reflecting the two modes we support and a variant with explicit region.
Do not commit real credentials or API keys to Git. Use placeholders in docs and local
.envfiles.
1) Minimal Vertex (recommended, default) (Gemini & Claude)
Use ADC + Vertex. No API key required. The SDK will use Vertex if you pass project/location in code or set GOOGLE_GENAI_USE_VERTEXAI=true.
# .env — minimal Vertex
GOOGLE_PROJECT_ID="your-gcp-project-id"
GOOGLE_APPLICATION_CREDENTIALS=/abs/path/to/service-account.json
GOOGLE_GENAI_USE_VERTEXAI=true
# Optional: explicit region selection (see #3), defaults noted below
# GOOGLE_REGION="us-east5" # e.g., Claude default region
# GOOGLE_REGION="us-central1" # e.g., Gemini default region
Defaults / notes
- Gemini on Vertex: default region on code if not provided is
us-central1. - Claude on Vertex: default region on code if not provided is
us-east5. - Vertex can also use credentials via gcloud auth application-default login
- Ensure IAM role like
roles/aiplatform.userand Vertex AI API enabled.
2) API‑Key mode (no Vertex) — Gemini only
If you set GOOGLE_GENAI_USE_VERTEXAI=false, the code will use the Google AI Studio (API‑key) SDK for Gemini.
In this mode, GOOGLE_API_KEY is required, and ADC / project / region are not used by the Gemini client.
# .env — API‑key mode (Gemini via Google AI Studio API)
GOOGLE_API_KEY="your-google-api-key"
GOOGLE_GENAI_USE_VERTEXAI=false
# These may exist in your shell and are harmless here, but are not required by API‑key mode:
# GOOGLE_PROJECT_ID="your-gcp-project-id"
# GOOGLE_APPLICATION_CREDENTIALS=/abs/path/to/service-account.json
# GOOGLE_REGION="us-central1"
Notes
- No IAM or Vertex regional control; intended for quick tests or limited environments.
3) Vertex with explicit region (Gemini & Claude)
Set an explicit region that supports the models you plan to use. if you do not set either one of the they have the next default values:
- Gemini →
us-central1 - Claude →
us-east5
# .env — Vertex with explicit region
GOOGLE_PROJECT_ID="your-gcp-project-id"
GOOGLE_REGION="us-east5" # or us-central1, europe-west4, etc., if supported
GOOGLE_APPLICATION_CREDENTIALS=/abs/path/to/service-account.json
GOOGLE_GENAI_USE_VERTEXAI=true
# GOOGLE_API_KEY can be unset in Vertex mode
Recap
- Switch between modes using
GOOGLE_GENAI_USE_VERTEXAI:true→ Vertex (ADC). RequiresGOOGLE_PROJECT_ID(+GOOGLE_REGIONoptional) and credentials.false→ API‑key mode for Gemini. RequiresGOOGLE_API_KEY.
- Claude in this repo runs only via Vertex (ADC), so it needs
projectand a supportedregion(e.g.,us-east5).
Using Local Models (Ollama)
PyDough-CE can also be used with locally hosted open-source models via Ollama. However, it’s important to note that results may vary and are not fully standardized when using local models.
The current PyDough prompting strategy and DSL generation were primarily designed and optimized for high-performing hosted models, specifically Gemini 2.5 Pro and Anthropic Claude (Opus / Sonnet). As a result, local models may occasionally produce PyDough code that is syntactically close but semantically incorrect (for example, small naming mismatches or invalid expressions).
That said, local inference can still be useful for experimentation, development, or environments without external API access.
Recommended local models:
gemma3:12bqwen3:8bllama3.1:8b
Running PyDough-CE with Ollama
Make sure Ollama is installed and running locally:
ollama serve
Pull a model (example):
ollama pull gemma3:12b
Set the required environment variables via CLI(or .env file):
export OLLAMA_BASE_URL=http://localhost:11434
export OLLAMA_CONTEXT_LENGTH=32768
A large context window is strongly recommended, as PyDough prompts are relatively large to perform well.
Recommendation
For the most reliable and consistent results, we recommend using Gemini 2.5 Pro or Anthropic Claude Opus 4. Local models via Ollama are supported, but should be considered experimental at this time.
TPCH sample database (download helper)
To make local testing easy, this repo includes a small helper script to download the TPCH demo database.
- Script location:
setup_tpch.sh - What it does: If the target file already exists, it prints
FOUNDand exits. Otherwise it downloads the SQLite DB. - Where the DB should live:
./data/databases/TPCH.db(from the repo root). The rest of the docs/CLI examples assume this path.
One-liner (macOS/Linux)
Run from the repo root:
mkdir -p ./data/databases
bash setup_tpch.sh ./data/databases/TPCH.db
If you don't have wget, you can use curl instead:
mkdir -p ./data/databases
curl -L https://github.com/lovasoa/TPCH-sqlite/releases/download/v1.0/TPC-H.db -o ./data/databases/TPCH.db
Verify the file is present:
ls -lh ./data/databases/TPCH.db
Windows (PowerShell)
New-Item -ItemType Directory -Force -Path .\data\databases | Out-Null
Invoke-WebRequest -Uri https://github.com/lovasoa/TPCH-sqlite/releases/download/v1.0/TPC-H.db -OutFile .\data\databases\TPCH.db
Quick Guide
Terminal location:
Run all of the next commands from the pydough-analytics folder (the folder that contains data/, docs/, samples/, src/, etc.).
Quick check:
ls data
# → databases metadata metadata_markdowns prompts
1) Generate metadata JSON
pydough-analytics generate-json --url 'sqlite:///data/databases/TPCH.db' --graph-name tpch --json-path ./data/metadata/Tpch_graph.json
--url: Connection string.--graph-name: Logical name for this dataset (you’ll reuse it as--db-nameinask).--json-path: Where to save the graph JSON.
2) Export Markdown (used by the LLM)
pydough-analytics generate-md --graph-name tpch --json-path ./data/metadata/Tpch_graph.json --md-path ./data/metadata_markdowns/Tpch.md
-
--graph-name: Logical name for this dataset in the JSON. -
--json-path: Where to save the graph JSON. -
--md-path: Where to save the Markdown. -
Markdown helps the LLM stay grounded during
ask. -
Keep JSON + Markdown in version control for reproducibility.
3) Ask the LLM
The PyDough code is always printed. You can optionally show SQL, a DataFrame preview, and an explanation.
pydough-analytics ask --question "Give me the name of all the suppliers from the United States" --url 'sqlite:///data/databases/TPCH.db' --db-name tpch --md-path ./data/metadata_markdowns/Tpch.md --kg-path ./data/metadata/Tpch_graph.json --show-sql --show-df --show-explanation
-
--question: Natural language question to be answered. -
--url: Full database connection string (See ## Supported Backends). -
--db-name: Logical database name, typically matching the graph name. -
--md-path: Path to the Markdown documentation describing the database. -
--kg-path: Path to the JSON file containing the metadata. -
--provider: Optional LLM provider (e.g. openai, anthropic). -
--model: Optional LLM model identifier (e.g. gpt-4o-mini, claude-3-haiku). -
--show-sql: Optional Print the SQL query generated (default: False). -
--show-df: Optional Display the resulting DataFrame (default: False). -
--show-explanation: Optional Print the reasoning or explanation provided by the LLM (default: False). -
--as-json: Optional Output the query result as raw JSON (default: False). -
--rows: Optional Number of rows to display from the resulting DataFrame (default: 20). -
If you switch from defaults (e.g., to Anthropic), add:
--provider anthropic --model claude-sonnet-4-5@20250929
To run PyDough-CE with Ollama you'll also need to add the provider and model as corresponds:
pydough-analytics ask --question "Give me the name of all the suppliers from the United States" --url sqlite:///data/databases/TPCH.db --db-name tpch --md-path ./data/metadata_markdowns/tpch.md --kg-path ./data/metadata/tpch_graph.json --provider ollama --model gemma3:12b --show-sql --show-df --show-explanation
Troubleshooting
- No such file or directory → Check paths and casing; ensure
./data/metadataand./data/metadata_markdownsexist. - Model not found → Model IDs vary by provider (Anthropic direct vs Vertex vs Bedrock). Use the correct one.
- Vertex AI auth error → Use an absolute path for
GOOGLE_APPLICATION_CREDENTIALSand set project/region.
Supported backends
PyDough supports multiple database backends through SQLAlchemy connection strings. Each connection string must include all required parameters (user, password, host, port, database, and optional schema/warehouse where applicable).
SQLite
Used for local files or in-memory databases. All values are required in the URL format.
sqlite:///path/to/mydb.db
- Uses a local .sqlite or .db file.
To use an in-memory database (for testing):
sqlite:///:memory:
Snowflake
Used for analytical data warehouses. The connection string must include credentials, account, database, schema, and warehouse.
snowflake://user:password@account/db/schema?warehouse=WH&role=PUBLIC
Required values:
user: Snowflake username.password: Snowflake password.account: Snowflake account identifier (e.g. xy12345.us-east-1).db: Database name.schema: Schema name.- Query parameters (
warehouseandrole) are required for most configurations.
MySQL
Used for transactional databases. The URL must include user, password, host, port, and database. Internally converted to mysql+mysqlconnector://.
mysql://user:password@host:port/mydb
Converted internally to:
mysql+mysqlconnector://user:password@host:port/mydb
Required values:
user: MySQL username.password: MySQL password.host: Server hostname or IP address.port: Port number (default: 3306).mydb: Database name.
PostgreSQL
Used for relational and analytical databases. The URL must include all connection details. Internally converted to postgresql+psycopg2://.
postgres://user:password@host:port/mydb
Converted internally to:
postgresql+psycopg2://user:password@host:port/mydb
Required values:
user: PostgreSQL username.password: PostgreSQL password.host: Server hostname or IP address.port: Port number (default: 5432).mydb: Database name.
BodoSQL
A high performance SQL engine that can connect to a variety of data sources.
Currently, the BodoSQL backend is only available through the Python API.
To use this backend, pass a BodoSQLContext to LLMClient.ask().
For example, the following code assumes TPCH data was written in Apache Iceberg table format to a local directory:
import os
import pandas as pd
from bodosql import BodoSQLContext, FileSystemCatalog
catalog = FileSystemCatalog(os.path.abspath("./data/databases/tpch_db"))
bc = BodoSQLContext(catalog=catalog)
result = client.ask(
bodosql_context=bc
question="What are the most common transaction statuses and their respective counts?",
kg_path="./data/metadata/tpch_graph.json", # Knowledge Graph JSON
md_path="./data/metadata_markdowns/tpch.md", # Markdown doc for the DB
db_name="TPCH"
)
For a complete list of supported database catalogs, see here.
MCP Server (Optional)
You can optionally run PyDough-Analytics as a Machine Cooperation Protocol (MCP) server to expose its analytics tools programmatically (for example, from Claude Desktop).
Installation
Install the MCP extras from the root folder:
pip install "pydough-analytics[mcp]"
Run the server
- Navigate to the main project directory and then to pydough-analytics
cd pydough-analytics
- Start the MCP server:
fastmcp run src/pydough_analytics/mcp/mcp_entry.py:server
The server exposes a full suite of tools and resources under the MCP name pydough-analytics.
See more on the README_MCP.md under the mcp folder.
Available Tools
| Tool | Description |
|---|---|
| pydough.init_metadata(url, graph_name="DATABASE") | Generates metadata JSON and optionally Markdown from a live database. |
| pydough.open_session(database_url or db_config, metadata_path=..., graph_name="DATABASE") | Opens a PyDough session and returns a unique session_id. |
| pydough.ask(session_id, question, auto_correct=False, max_corrections=1) | Runs an LLM-assisted query using the session’s metadata and DB configuration. Returns PyDough code, SQL, and result rows. |
| pydough.schema_markdown(session_id) | Returns the Markdown schema documentation for the active session. |
| pydough.list_sessions() | Lists active sessions with basic diagnostic info. |
| pydough.close_session(session_id) | Closes the session and removes its temporary metadata files. |
Resources
| Resource URI | Description |
|---|---|
pydough://metadata/{session_id} |
Markdown representation of the session’s graph schema. |
pydough://result/{session_id} |
JSON object containing the last query’s PyDough code, SQL, and results. |
Environment Variables
The MCP server uses the same environment configuration as the CLI:
GOOGLE_PROJECT_ID,GOOGLE_REGION, andGOOGLE_APPLICATION_CREDENTIALS— for Vertex/Claude/Gemini clients.GEMINI_API_KEYorANTHROPIC_API_KEY— if using direct SDK mode.
When metadata is passed inline (instead of via metadata_path), it’s persisted temporarily in
/tmp/pydough_analytics_mcp/ and automatically deleted when close_session() is called.
Example Manifest (Claude Desktop)
{
"name": "pydough-analytics",
"command": [
"fastmcp",
"run",
"src/pydough_analytics/mcp/mcp_entry.py:server"
]
}
Suggested next steps
- To expand database coverage updating the CLI to accept engine-specific flags and extending the metadata.
- Improve the troubleshooting documentation by covering engine-specific errors, connection problems, missing database files, and common CLI usage mistakes with clear resolutions.
- Provide richer examples and Jupyter notebooks, showing end-to-end pipelines from SQLite, connecting to different databases, and visualizing metadata graphs for more practical learning.
Source folder structure
pydough-analytics/
├── src/ # Library source code.
│ └── pydough_analytics/
│ ├── commands/ # CLI command implementations.
│ ├── config/ # Default settings and configuration helpers.
│ ├── data/ # Internal data loaders or fixtures.
│ ├── llm/ # Modules for LLM integration.
│ ├── metadata/ # Metadata generation and validation logic.
│ ├── mcp/ # Machine Cooperation Protocol (MCP) server and entrypoint.
│ ├── utils/ # Shared utility functions.
│ ├── __init__.py # Package entry.
│ ├── __main__.py # Allows `python -m pydough_analytics` execution.
│ ├── _version.py # Package version constant.
│ └── cli.py # Typer CLI entrypoint (`pydough-analytics`).
└── README.md # Package-specific documentation.
Architecture Overview
+-----------------------------+
| CLI (Typer) |
| (generate-json, generate-md,|
| ask) |
+-------------+---------------+
|
v
+-----------------------------+ +---------------------------+
| Metadata Generator | ----> | Metadata JSON |
| (SQLAlchemy inspector + | | (graph definition, V2) |
| identifier sanitizer, | +---------------------------+
| type mapping) |
+-------------+---------------+
|
v
+-----------------------------+ +---------------------------+
| Markdown Exporter | ----> | Markdown Docs |
| (render schema from graph) | | (human-readable overview) |
+-------------+---------------+ +---------------------------+
|
v
+-----------------------------+ +---------------------------+
| Ask Command (Typer) | ----> | LLM Client |
| (natural language question) | | (prompt + schema + guide) |
+-------------+---------------+ +-------------+-------------+
|
v
+-----------------------------+ +---------------------------+
| AI Providers | ----> | Gemini / Claude / aisuite |
| (google, anthropic, | | |
| other via aisuite) | +---------------------------+
+-------------+---------------+
|
v
+-----------------------------+ +---------------------------+
| PyDough Executor | ----> | SQL + DataFrame |
| (extract code, run on DB, | | (results + explanation) |
| sanitize, retry on errors) | +---------------------------+
+-------------+---------------+
|
v
+-----------------------------+ +---------------------------+
| MCP Server (FastMCP) | ----> | External Clients (MCP) |
| (tools + resources: | | Inspector / Claude / IDEs |
| init_metadata, ask, etc.) | | |
+-----------------------------+ +---------------------------+
Notes:
- **Engines**: SQLite (built-in), Snowflake, MySQL and PostgreSQL.
- **LLM Providers**: Google Gemini, Anthropic (Claude), aisuite (others).
- **Artifacts**: JSON graph, Markdown docs, generated PyDough code, SQL, result DataFrame.
- **MCP Server**: Exposes PyDough-Analytics functions to external tools through a Machine Cooperation Protocol interface.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file pydough_analytics-0.1.7.tar.gz.
File metadata
- Download URL: pydough_analytics-0.1.7.tar.gz
- Upload date:
- Size: 53.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
326f8efa662a0046b4de964f69330d488898d7250d61e846084e1b54f9ab8061
|
|
| MD5 |
d1fbbaaec939e753cdec96b5aa718ea0
|
|
| BLAKE2b-256 |
c632e0f9ebfc9ceafdc18189a6239189f23ec67a59191e6a069257ce4c3b4c60
|
File details
Details for the file pydough_analytics-0.1.7-py3-none-any.whl.
File metadata
- Download URL: pydough_analytics-0.1.7-py3-none-any.whl
- Upload date:
- Size: 64.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
41723caef5c610c4cc2a83386179733152b35d1bda5cea30c4e1281186d5152b
|
|
| MD5 |
33f55a1c84c7bc54f63f606be7c2ec95
|
|
| BLAKE2b-256 |
facd23f96e6cd78c1ac65eae4072c968a796022193c9dbd6e7aa5f6363af7a49
|