AI-powered SQL Agent for data engineering (Compiled Version)
Project description
Datus — Open-Source Data Engineering Agent
What is Datus?
Datus is an open-source data engineering agent that builds evolvable context for your data system — turning natural language into accurate SQL through domain-aware reasoning, semantic search, and continuous learning.
Data engineering is shifting from "building tables and pipelines" to "delivering scoped, domain-aware agents for analysts and business users." Datus makes that shift concrete.
Key Features
Build Evolvable Context, Not Static Pipelines
Traditional data engineering ends at data delivery. Datus goes further — it builds a living knowledge base that captures schema metadata, reference SQL, semantic models, metrics, and domain knowledge into a unified context layer. This context is what makes LLM-generated SQL accurate and trustworthy, and it improves with every interaction through a continuous learning loop. → Contextual Data Engineering
From Exploration to Domain-Specific Agents
Datus provides a complete journey for data engineers: start with a Claude-Code-like CLI to explore your data interactively, use Plan Mode to review before executing, and build up context over time. When a domain matures, package it into a Subagent — a scoped chatbot with curated context, tools, and business rules — and deliver it to analysts via web, API, or MCP. → Subagent docs
Metrics and Semantic Layer
Go beyond raw SQL with pluggable semantic adapters. Define business metrics in YAML via MetricFlow integration, and let Datus generate SQL from metric queries — bridging the gap between business language and database dialect. Use Dashboard Copilot to turn existing BI dashboards into conversational analytics. → Semantic Adapters docs
Measure and Improve
Built-in evaluation framework supporting BIRD and Spider 2.0-Snow datasets. Benchmark your agent's SQL accuracy, compare configurations, and track improvements as context evolves. → Benchmark docs
Open Platform
- 10+ LLM providers (OpenAI, Claude, Gemini, DeepSeek, Qwen, Kimi, OpenRouter, and more) with per-node model assignment — mix models within a single workflow
- 11 databases — Built-in SQLite & DuckDB, plus pluggable adapters for PostgreSQL, MySQL, Snowflake, StarRocks, ClickHouse, and more
- MCP Protocol — Both an MCP server (exposing Datus tools to Claude Desktop, Cursor, etc.) and an MCP client (consuming external tools via
.mcpin the CLI). → MCP docs - Skills — Extend Datus with agentskills.io-style packaged tools, configurable permissions, and marketplace support. → Skills docs
Getting Started
Install
Requirements: Linux or macOS. Python 3.12 is installed automatically when you use the one-liner.
One-liner (Linux / macOS)
Stable install from PyPI:
curl -fsSL https://raw.githubusercontent.com/datus-ai/datus-agent/main/install.sh | sh
This creates a dedicated venv at ~/.datus/venv, installs datus-agent from PyPI into it, and drops datus, datus-cli, datus-api, datus-mcp, datus-agent, datus-gateway, and datus-pip shims into ~/.local/bin. Open a new shell (or source ~/.zshrc) to pick up PATH, then run datus to launch the REPL — use /model to configure an LLM, /datasource to add a datasource, and (optionally) /init to generate AGENTS.md for the current project.
To install additional Python packages into the global venv later, use datus-pip install <package> (it is a shim for ~/.datus/venv/bin/pip).
Dev install from GitHub source (picks up unreleased changes):
curl -fsSL https://raw.githubusercontent.com/datus-ai/datus-agent/main/install-dev.sh | sh
# or pin to a branch / tag / commit
curl -fsSL https://raw.githubusercontent.com/datus-ai/datus-agent/main/install-dev.sh | DATUS_REF=feature/foo sh
Pin a PyPI version (stable installer only):
curl -fsSL https://raw.githubusercontent.com/datus-ai/datus-agent/main/install.sh | DATUS_VERSION=0.2.6 sh
Other variables supported by both installers: DATUS_HOME (default ~/.datus), DATUS_BIN_DIR (default ~/.local/bin), DATUS_FORCE=1 to recreate the venv, DATUS_NO_MODIFY_PATH=1 to skip shell rc edits.
Manual install
pip install datus-agent
datus
After the REPL starts, run /model to configure an LLM, /datasource to add a datasource, and (optionally) /init to generate AGENTS.md for the current project. For detailed guidance, see the Quickstart Guide.
Four Ways to Use Datus
| Interface | Command | Use Case |
|---|---|---|
| CLI (Interactive REPL) | datus-cli --datasource demo |
Data engineers exploring data, building context, creating subagents |
| Web Chatbot (Streamlit) | datus-cli --web --datasource demo |
Analysts chatting with subagents via browser (http://localhost:8501) |
| API Server (FastAPI) | datus-api --datasource demo |
Applications consuming data services via REST (http://localhost:8000) |
| MCP Server | datus-mcp --datasource demo |
MCP-compatible clients (Claude Desktop, Cursor, etc.) |
Tip: Use
datus-cli --print --datasource demofor JSON streaming to stdout — useful for piping into other tools.
Architecture
Workflow Engine
Datus uses a configurable node-based workflow engine. Each workflow is a plan of nodes executed in sequence, parallel, or as sub-workflows:
workflow:
plan: planA
planA:
- schema_linking # Find relevant tables
- parallel: # Run in parallel
- gen_sql # SQL generation
- reasoning # Chain-of-thought reasoning
- selection # Pick the best result
- execute_sql # Run the query
- output # Format and return
Node Types
| Category | Nodes |
|---|---|
| Core | schema_linking, execute_sql, reasoning, reflect, output |
| Agentic | gen_sql, chat, explore, gen_semantic_model, gen_metrics, gen_sql_summary, gen_skill, gen_table, compare |
| Control Flow | parallel, selection, subworkflow |
| Utility | date_parser, doc_search, fix |
RAG Knowledge Base
The knowledge base is powered by LanceDB and organizes context into multiple layers:
- Schema Metadata — Table and column descriptions, relationships
- Reference SQL — Curated query examples with summaries
- Reference Templates — Parameterized Jinja2 SQL templates for stable, reusable queries
- Semantic Models — Business logic and metric definitions
- Metrics — Executable business metrics via semantic layer integration
- Platform Docs — Ingested from GitHub repos, websites, or local files
Build the knowledge base with:
datus-agent bootstrap-kb --datasource demo --components metadata,reference_sql
Configuration
Datus is configured via agent.yml. Launch datus and use /model plus /datasource to populate it interactively, or copy conf/agent.yml.example and edit it by hand.
| Section | Purpose |
|---|---|
agent.models |
LLM provider definitions (API keys, model IDs, base URLs) |
agent.nodes |
Per-node model assignment and tuning parameters |
agent.services.datasources |
Database connections (SQLite, DuckDB, Snowflake, etc.) |
agent.storage |
Embedding models, vector DB, and RAG configuration |
agent.workflow |
Execution plans with sequential, parallel, and sub-workflow steps |
agent.agentic_nodes |
Configuration for agentic nodes (semantic model gen, metrics gen) |
agent.document |
Platform documentation sources (GitHub repos, websites, local files) |
API keys are injected via environment variables using ${ENV_VAR} syntax.
Supported LLM Providers
| Provider | Type | Notes |
|---|---|---|
| OpenAI | openai |
GPT-4o, GPT-4, etc. |
| Anthropic Claude | claude |
Direct API |
| Google Gemini | gemini |
Gemini 2.0+ |
| DeepSeek | deepseek |
DeepSeek-Chat, DeepSeek-Coder |
| Alibaba Qwen | qwen |
Qwen series |
| Moonshot Kimi | kimi |
Kimi models |
| MiniMax | minimax |
MiniMax models |
| GLM (Zhipu) | glm |
GLM-4 series |
| OpenAI Codex | codex |
OAuth-based Codex models (gpt-5.3-codex, o3-codex) |
| OpenRouter | openrouter |
300+ models via a single API key |
Embedding models: OpenAI, Sentence-Transformers, FastEmbed, Hugging Face.
Per-node model assignment lets you use different providers for different workflow steps (e.g., a cheaper model for schema linking, a stronger model for SQL generation).
Supported Databases
| Database | Type | Package |
|---|---|---|
| SQLite | sqlite |
Built-in |
| DuckDB | duckdb |
Built-in |
| PostgreSQL | postgresql |
datus-postgresql |
| MySQL | mysql |
datus-mysql |
| Snowflake | snowflake |
datus-snowflake |
| StarRocks | starrocks |
datus-starrocks |
| ClickHouse | clickhouse |
datus-clickhouse |
| ClickZetta | clickzetta |
datus-clickzetta |
| Hive | hive |
datus-hive |
| Spark | spark |
datus-spark |
| Trino | trino |
datus-trino |
See Database Adapters documentation for details.
How It Works
Explore — Chat with your database, test queries, and ground prompts with @table or @file references.
datus-cli --datasource demo
/Check the top 10 banks by assets lost @table duckdb-demo.main.bank_failures
Build Context — Generate semantic models, import SQL history, define metrics. Each piece becomes reusable context for future queries.
/gen_semantic_model xxx # Generate semantic model from tables
/gen_sql_summary # Index SQL history for retrieval
Create a Subagent — Package mature context into a scoped, domain-aware chatbot with curated tools and business rules.
.subagent add mychatbot # Create a new subagent
Deliver — Serve the subagent to analysts via web (localhost:8501/?subagent=mychatbot), REST API, or MCP — with feedback collection (upvotes, issue reports) built in.
Measure — Run benchmarks against BIRD or Spider 2.0-Snow to track SQL accuracy as context evolves.
Iterate — Analyst feedback loops back: engineers fix SQL, add rules, refine semantic models, and extend with Skills or MCP tools. The agent gets more accurate over time.
→ End-to-end tutorial · CLI docs · Knowledge Base docs · Subagent docs
Development
uv sync # Install dependencies
uv run pytest tests/unit_tests/ -q # Run CI tests (no external deps)
uv run ruff format . && uv run ruff check --fix . # Lint & format
Enable --save_llm_trace on CLI commands or set save_llm_trace: true per model in agent.yml to persist LLM inputs/outputs for debugging. → LLM Trace docs
See CLAUDE.md for full development conventions, architecture patterns, and testing rules.
License
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file datus_agent-0.3.4rc1.tar.gz.
File metadata
- Download URL: datus_agent-0.3.4rc1.tar.gz
- Upload date:
- Size: 4.7 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
37f43b2e334de02f79b778e22b19cfc7ef2c230a2156a1d3cb09dd65c592b9a2
|
|
| MD5 |
c837096ef804fccd449e8584ab440b2c
|
|
| BLAKE2b-256 |
4b0a2bd2b2f3ef5b386a66eb717dcbc25607255fbfb0ab820ef7cde6d85bc1f5
|
File details
Details for the file datus_agent-0.3.4rc1-py3-none-any.whl.
File metadata
- Download URL: datus_agent-0.3.4rc1-py3-none-any.whl
- Upload date:
- Size: 5.0 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2577264e839482bffec34d61f11eaee5e26013cd70452f505737d2e17268375d
|
|
| MD5 |
68ff9d92f82e9f53fe99a5cb76e70483
|
|
| BLAKE2b-256 |
3a099edcd887b564b653b60c300b548f94790412908042f542a31950fba1a38a
|