Skip to main content

AI-powered SQL Agent for data engineering (Compiled Version)

Project description

Datus — Open-Source Data Engineering Agent

License Website Docs Quick Start Release Notes Slack


What is Datus?

Datus is an open-source data engineering agent that builds evolvable context for your data system — turning natural language into accurate SQL through domain-aware reasoning, semantic search, and continuous learning.

Data engineering is shifting from "building tables and pipelines" to "delivering scoped, domain-aware agents for analysts and business users." Datus makes that shift concrete.

Datus Architecture

Key Features

Build Evolvable Context, Not Static Pipelines

Traditional data engineering ends at data delivery. Datus goes further — it builds a living knowledge base that captures schema metadata, reference SQL, semantic models, metrics, and domain knowledge into a unified context layer. This context is what makes LLM-generated SQL accurate and trustworthy, and it improves with every interaction through a continuous learning loop. → Contextual Data Engineering

From Exploration to Domain-Specific Agents

Datus provides a complete journey for data engineers: start with a Claude-Code-like CLI to explore your data interactively, use Plan Mode to review before executing, and build up context over time. When a domain matures, package it into a Subagent — a scoped chatbot with curated context, tools, and business rules — and deliver it to analysts via web, API, or MCP. → Subagent docs

Metrics and Semantic Layer

Go beyond raw SQL with pluggable semantic adapters. Define business metrics in YAML via MetricFlow integration, and let Datus generate SQL from metric queries — bridging the gap between business language and database dialect. Use Dashboard Copilot to turn existing BI dashboards into conversational analytics. → Semantic Adapters docs

Measure and Improve

Built-in evaluation framework supporting BIRD and Spider 2.0-Snow datasets. Benchmark your agent's SQL accuracy, compare configurations, and track improvements as context evolves. → Benchmark docs

Open Platform

  • 10+ LLM providers (OpenAI, Claude, Gemini, DeepSeek, Qwen, Kimi, OpenRouter, and more) with per-node model assignment — mix models within a single workflow
  • 11 databases — Built-in SQLite & DuckDB, plus pluggable adapters for PostgreSQL, MySQL, Snowflake, StarRocks, ClickHouse, and more
  • MCP Protocol — Both an MCP server (exposing Datus tools to Claude Desktop, Cursor, etc.) and an MCP client (consuming external tools via .mcp in the CLI). → MCP docs
  • Skills — Extend Datus with agentskills.io-style packaged tools, configurable permissions, and marketplace support. → Skills docs

Getting Started

Install

Requirements: Linux or macOS. Python 3.12 is installed automatically when you use the one-liner.

One-liner (Linux / macOS)

Stable install from PyPI:

curl -fsSL https://raw.githubusercontent.com/datus-ai/datus-agent/main/install.sh | sh

This creates a dedicated venv at ~/.datus/venv, installs datus-agent from PyPI into it, and drops datus, datus-cli, datus-api, datus-mcp, datus-agent, datus-gateway, and datus-pip shims into ~/.local/bin. Open a new shell (or source ~/.zshrc) to pick up PATH, then run datus to launch the REPL — use /model to configure an LLM, /datasource to add a datasource, and (optionally) /init to generate AGENTS.md for the current project.

To install additional Python packages into the global venv later, use datus-pip install <package> (it is a shim for ~/.datus/venv/bin/pip).

Dev install from GitHub source (picks up unreleased changes):

curl -fsSL https://raw.githubusercontent.com/datus-ai/datus-agent/main/install-dev.sh | sh
# or pin to a branch / tag / commit
curl -fsSL https://raw.githubusercontent.com/datus-ai/datus-agent/main/install-dev.sh | DATUS_REF=feature/foo sh

Pin a PyPI version (stable installer only):

curl -fsSL https://raw.githubusercontent.com/datus-ai/datus-agent/main/install.sh | DATUS_VERSION=0.2.6 sh

Other variables supported by both installers: DATUS_HOME (default ~/.datus), DATUS_BIN_DIR (default ~/.local/bin), DATUS_FORCE=1 to recreate the venv, DATUS_NO_MODIFY_PATH=1 to skip shell rc edits.

Manual install

pip install datus-agent
datus

After the REPL starts, run /model to configure an LLM, /datasource to add a datasource, and (optionally) /init to generate AGENTS.md for the current project. For detailed guidance, see the Quickstart Guide.

Four Ways to Use Datus

Interface Command Use Case
CLI (Interactive REPL) datus-cli --datasource demo Data engineers exploring data, building context, creating subagents
Web Chatbot (Streamlit) datus-cli --web --datasource demo Analysts chatting with subagents via browser (http://localhost:8501)
API Server (FastAPI) datus-api --datasource demo Applications consuming data services via REST (http://localhost:8000)
MCP Server datus-mcp --datasource demo MCP-compatible clients (Claude Desktop, Cursor, etc.)

Tip: Use datus-cli --print --datasource demo for JSON streaming to stdout — useful for piping into other tools.

Architecture

Workflow Engine

Datus uses a configurable node-based workflow engine. Each workflow is a plan of nodes executed in sequence, parallel, or as sub-workflows:

workflow:
  plan: planA
  planA:
    - schema_linking     # Find relevant tables
    - parallel:          # Run in parallel
      - generate_sql     # SQL generation
      - reasoning        # Chain-of-thought reasoning
    - selection          # Pick the best result
    - execute_sql        # Run the query
    - output             # Format and return

Node Types

Category Nodes
Core schema_linking, generate_sql, execute_sql, reasoning, reflect, output
Agentic chat, explore, gen_semantic_model, gen_metrics, gen_ext_knowledge, gen_sql_summary, gen_skill, gen_table, compare
Control Flow parallel, selection, subworkflow
Utility date_parser, doc_search, fix

RAG Knowledge Base

The knowledge base is powered by LanceDB and organizes context into multiple layers:

  • Schema Metadata — Table and column descriptions, relationships
  • Reference SQL — Curated query examples with summaries
  • Reference Templates — Parameterized Jinja2 SQL templates for stable, reusable queries
  • Semantic Models — Business logic and metric definitions
  • Metrics — Executable business metrics via semantic layer integration
  • External Knowledge — Domain rules and concepts beyond raw schema
  • Platform Docs — Ingested from GitHub repos, websites, or local files

Build the knowledge base with:

datus-agent bootstrap-kb --datasource demo --components metadata,reference_sql,ext_knowledge

Configuration

Datus is configured via agent.yml. Launch datus and use /model plus /datasource to populate it interactively, or copy conf/agent.yml.example and edit it by hand.

Section Purpose
agent.models LLM provider definitions (API keys, model IDs, base URLs)
agent.nodes Per-node model assignment and tuning parameters
agent.services.datasources Database connections (SQLite, DuckDB, Snowflake, etc.)
agent.storage Embedding models, vector DB, and RAG configuration
agent.workflow Execution plans with sequential, parallel, and sub-workflow steps
agent.agentic_nodes Configuration for agentic nodes (semantic model gen, metrics gen)
agent.document Platform documentation sources (GitHub repos, websites, local files)

API keys are injected via environment variables using ${ENV_VAR} syntax.

Supported LLM Providers

Provider Type Notes
OpenAI openai GPT-4o, GPT-4, etc.
Anthropic Claude claude Direct API
Google Gemini gemini Gemini 2.0+
DeepSeek deepseek DeepSeek-Chat, DeepSeek-Coder
Alibaba Qwen qwen Qwen series
Moonshot Kimi kimi Kimi models
MiniMax minimax MiniMax models
GLM (Zhipu) glm GLM-4 series
OpenAI Codex codex OAuth-based Codex models (gpt-5.3-codex, o3-codex)
OpenRouter openrouter 300+ models via a single API key

Embedding models: OpenAI, Sentence-Transformers, FastEmbed, Hugging Face.

Per-node model assignment lets you use different providers for different workflow steps (e.g., a cheaper model for schema linking, a stronger model for SQL generation).

Supported Databases

Database Type Package
SQLite sqlite Built-in
DuckDB duckdb Built-in
PostgreSQL postgresql datus-postgresql
MySQL mysql datus-mysql
Snowflake snowflake datus-snowflake
StarRocks starrocks datus-starrocks
ClickHouse clickhouse datus-clickhouse
ClickZetta clickzetta datus-clickzetta
Hive hive datus-hive
Spark spark datus-spark
Trino trino datus-trino

See Database Adapters documentation for details.

How It Works

How It Works

Explore — Chat with your database, test queries, and ground prompts with @table or @file references.

datus-cli --datasource demo
/Check the top 10 banks by assets lost @table duckdb-demo.main.bank_failures

Build Context — Generate semantic models, import SQL history, define metrics. Each piece becomes reusable context for future queries.

/gen_semantic_model xxx        # Generate semantic model from tables
/gen_sql_summary               # Index SQL history for retrieval

Create a Subagent — Package mature context into a scoped, domain-aware chatbot with curated tools and business rules.

.subagent add mychatbot        # Create a new subagent

Deliver — Serve the subagent to analysts via web (localhost:8501/?subagent=mychatbot), REST API, or MCP — with feedback collection (upvotes, issue reports) built in.

Measure — Run benchmarks against BIRD or Spider 2.0-Snow to track SQL accuracy as context evolves.

Iterate — Analyst feedback loops back: engineers fix SQL, add rules, refine semantic models, and extend with Skills or MCP tools. The agent gets more accurate over time.

End-to-end tutorial · CLI docs · Knowledge Base docs · Subagent docs

Development

uv sync                                           # Install dependencies
uv run pytest tests/unit_tests/ -q                # Run CI tests (no external deps)
uv run ruff format . && uv run ruff check --fix . # Lint & format

Enable --save_llm_trace on CLI commands or set save_llm_trace: true per model in agent.yml to persist LLM inputs/outputs for debugging. → LLM Trace docs

See CLAUDE.md for full development conventions, architecture patterns, and testing rules.

License

Apache 2.0

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

datus_agent-0.3.3.tar.gz (4.6 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

datus_agent-0.3.3-py3-none-any.whl (5.0 MB view details)

Uploaded Python 3

File details

Details for the file datus_agent-0.3.3.tar.gz.

File metadata

  • Download URL: datus_agent-0.3.3.tar.gz
  • Upload date:
  • Size: 4.6 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for datus_agent-0.3.3.tar.gz
Algorithm Hash digest
SHA256 1a063c34ecf7c633893aec458a1a2bbda037d9b7ee53dc21826e7c9ef633a03f
MD5 7493f0182c87b6724a089e949c5caf5a
BLAKE2b-256 a0ab31d2a6f7bfb87d7eb7bb57f9b5b886cd2c3ef621e417cc1452a16feb02d2

See more details on using hashes here.

File details

Details for the file datus_agent-0.3.3-py3-none-any.whl.

File metadata

  • Download URL: datus_agent-0.3.3-py3-none-any.whl
  • Upload date:
  • Size: 5.0 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for datus_agent-0.3.3-py3-none-any.whl
Algorithm Hash digest
SHA256 b109d3ab4225a6c4b6cd4175b1c179ffb3ec7db66e0afb9252a65b5f74c6f12e
MD5 e821cb2ea6d268f184046ebfa2293b8a
BLAKE2b-256 dee8c47f93c7e565574bf5cb2f4203e8066fc1c27f7cd926421f5c278d6beba7

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page