DataAgent - A powerful multi-modal Data Agent workflow template framework
🚀 DataAgent
中文 · English
Data + AI Agent: Enterprise Data Task Solution
🚀 DataAgent is a next-generation enterprise data intelligence platform for Data + AI scenarios, reimagining the entire data engineering pipeline through the Agent paradigm. Deeply integrating NL2SQL, unified semantic layers, and multi-agent collaboration, it delivers end-to-end data analysis and feature mining across financial risk control, AI for Science, and other core domains.
🌟 Why DataAgent
🏆 Scenario Advantages
| Scenario | Traditional Approach | The DataAgent Edge | Typical Applications |
|---|---|---|---|
| 📊 Financial Q&A | Business request → data team queue → manual SQL → manual verification; a single metric query typically comes back the next business day (T+1) | NL2SQL four-stage pipeline (Perception→Generation→Validation→Reflection), natural language to instant answers. Semantic metric mapping, 74%+ execution accuracy on the BIRD dev benchmark, sub-second response | ✅ Enterprise financial analytics assistant |
| 🔬 AI for Science | Multi-source scientific data scattered everywhere; cross-database correlation requires manual exports; literature and data cannot be jointly queried | Multi-source federated queries + structured/unstructured joint retrieval, natural-language-driven scientific data exploration | ✅ Scientific data exploration platform |
⚡ Core Capabilities
| Capability | Description |
|---|---|
| 🧠 NL2SQL Intelligent Engine | Four-stage pipeline: Perceptor→Generator→Validator→Reflector; multi-strategy fusion: Prompt / ICL / Skeleton / DC; supports SQLite / MySQL / PostgreSQL / Hive; 74%+ execution accuracy on BIRD benchmark |
| 🔬 Automated Feature Engineering | Agents autonomously explore relationships across hundreds of tables, auto-discover latent feature combinations with importance ranking and visualization — 10x+ efficiency boost |
| 🏭 Full-Pipeline Data Factory | Data ingestion→Schema perception→Feature mining→Model training→Report generation — one YAML config runs the complete data engineering pipeline |
| 🧩 Unified Semantic Layer | Prioritizes GaussVector as an enhanced vector retrieval foundation in the semantic layer, turning tables, columns, metric definitions, and business descriptions into retrievable schema signals for NL2SQL and multi-source semantic alignment |
| 🔌 Plugin Tool Ecosystem | Local functions / MCP (stdio+sse) / A2A — three tool types with unified registration and invocation. Auto-discovery and on-demand loading. Built-in data analysis SKILLs |
| 📡 Native Multi-Agent Collaboration | Full A2A 1.0 protocol support: automatic agent discovery, capability mapping, standardized communication. Naturally supports distributed collaboration for complex business tasks |
| 🧩 YAML as Agent | Model, tools, memory, workflow, scenario prompts — all declaratively orchestrated. From idea to running Agent in minutes |
| 🛡️ Enterprise Security Sandbox | Workspace isolation + path whitelisting + full audit trail, meeting financial-grade compliance requirements |
| ⚡ Out of the Box | 20+ industry scenario example configs — zero code to start, up and running in minutes |
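The four-stage NL2SQL loop named in the table (Perceptor→Generator→Validator→Reflector) can be pictured with a small, self-contained sketch. This is not DataAgent's implementation: the function names `perceive`, `validate`, `nl2sql`, and `toy_generate` are hypothetical, SQLite stands in for the target database, and a hard-coded stub replaces the LLM call.

```python
import sqlite3
from typing import Callable, Optional


def perceive(conn: sqlite3.Connection) -> str:
    """Perception: collect table DDL so the generator sees the schema."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND sql IS NOT NULL"
    ).fetchall()
    return "\n".join(row[0] for row in rows)


def validate(conn: sqlite3.Connection, sql: str) -> Optional[str]:
    """Validation: let SQLite plan the query; return the error text or None."""
    try:
        conn.execute(f"EXPLAIN QUERY PLAN {sql}")
        return None
    except sqlite3.Error as exc:
        return str(exc)


def nl2sql(
    conn: sqlite3.Connection,
    question: str,
    generate: Callable[[str, str, Optional[str]], str],
    max_rounds: int = 3,
) -> str:
    """Run generate -> validate, feeding errors back until the SQL is valid."""
    schema = perceive(conn)
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        sql = generate(question, schema, feedback)  # Generation
        feedback = validate(conn, sql)              # Validation
        if feedback is None:
            return sql
        # Reflection: the error message is passed to the next generation round.
    raise ValueError(f"no valid SQL after {max_rounds} rounds: {feedback}")


def toy_generate(question: str, schema: str, feedback: Optional[str]) -> str:
    """Stand-in for an LLM: the first draft is wrong, the retry is corrected."""
    if feedback is None:
        return "SELECT SUM(amt) FROM sales"
    return "SELECT SUM(amount) FROM sales"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [("north", 10.0), ("south", 5.5)])

sql = nl2sql(conn, "What is the total sales amount?", toy_generate)
total = conn.execute(sql).fetchone()[0]
print(sql, total)
```

The point of the sketch is the control flow: validation errors are not fatal but become input to the next generation attempt, and the loop is bounded so a persistently wrong generator fails loudly.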
📋 Environment Requirements
| Dependency | Version |
|---|---|
| 🐍 Python | >= 3.11 |
| 📦 Package Manager | uv (recommended) or pip |
📚 Documentation
Full documentation lives under docs/ (中文 · English). Build and preview locally:
uv sync --extra mkdoc
uv run mkdocs serve -f docs/mkdocs.yml
| Document | Description |
|---|---|
| 📖 Installation | Install with uv / pip, environment variables, and verification |
| 📖 Quick Start | Run an end-to-end DataAgent workflow in minutes |
| 🗄️ Database Installation | Deploy Elasticsearch, PostgreSQL, MySQL; prioritize GaussVector integration, import scenario data, and connect Semantic Service |
| ⚙️ Features | Core capabilities, modules, tools, and model support |
| 🧩 Semantic Service | MetaVisor enriched metadata for NL2SQL, prioritizing GaussVector-oriented semantic-layer indexing, candidate schema recall, and schema perception enhancement |
| 🔗 openJiuwen | openJiuwen integration and usage guide |
| 🏗️ Architecture | System architecture; context, planning engine, and action modules |
| 📡 API Design | A2A northbound interface and Python SDK |
| 📋 Application Cases | Build a dedicated NL2SQL Agent; build a data analysis Agent |
| 📝 Notes | Development, testing, and documentation maintenance |
| 🗓️ Milestone | Release planning and roadmap |
🚴 Installation
1️⃣ Clone the project
git clone https://gitcode.com/datagallery/DataAgent.git
cd DataAgent
2️⃣ Install dependencies (uv recommended)
# Install dependencies
uv sync
# Activate virtual environment
source .venv/bin/activate # Linux / macOS
.venv\Scripts\activate # Windows
3️⃣ Or use pip
pip install -e .
4️⃣ Configure environment variables
# Copy environment template
cp .env.example .env
# Edit .env file with your actual configuration values
⚡ Quick Start
🎮 Interactive quick start
uv run -m dataagent quickstart
Follow the prompts to enter model configuration and start chatting with the Agent!
📁 Start with config file
# Terminal interactive mode
uv run -m dataagent --config dataagent/core/flex/examples/quickstart.yaml
🔍 Config check
# Check environment variable references in config
uv run -m dataagent config check dataagent/core/flex/examples/quickstart.yaml
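To illustrate what checking environment variable references could involve, here is a minimal sketch that scans config text for the `$env{NAME}` placeholders used in this README's YAML example and reports the ones that are unset. It is an assumption about the general idea, not the actual `config check` command; the helper names `find_env_refs`, `missing_env_refs`, and `resolve_env_refs` are hypothetical.

```python
import os
import re
from typing import Mapping

# Matches placeholders of the form $env{NAME}.
ENV_REF = re.compile(r"\$env\{([A-Za-z_][A-Za-z0-9_]*)\}")


def find_env_refs(text: str) -> list[str]:
    """Return referenced variable names in first-seen order, without duplicates."""
    return list(dict.fromkeys(ENV_REF.findall(text)))


def missing_env_refs(text: str, env: Mapping[str, str] = os.environ) -> list[str]:
    """Return referenced names that are unset or empty in `env`."""
    return [name for name in find_env_refs(text) if not env.get(name)]


def resolve_env_refs(text: str, env: Mapping[str, str] = os.environ) -> str:
    """Substitute every placeholder, failing if any variable is missing."""
    missing = missing_env_refs(text, env)
    if missing:
        raise KeyError(f"unset environment variables: {', '.join(missing)}")
    return ENV_REF.sub(lambda match: env[match.group(1)], text)


sample = (
    'base_url: "$env{DEEPSEEK_BASE_URL}"\n'
    'api_key: "$env{DEEPSEEK_API_KEY}"\n'
)
fake_env = {"DEEPSEEK_BASE_URL": "https://api.deepseek.com"}
missing = missing_env_refs(sample, fake_env)
print(missing)
```

Passing the environment as a mapping keeps the check testable without touching the real process environment.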
📖 Usage
🐍 Python SDK
import asyncio
from dataagent import DataAgent

async def main() -> None:
    agent = DataAgent.from_config("path/to/config.yaml")
    # Single-turn conversation
    response = await agent.chat("Analyze sales data trends for the past week")
    print(response)
    # Streaming conversation
    async for chunk in agent.astream(input={"user_query": "Generate user report"}):
        print(chunk, end="", flush=True)

# chat() and astream() are awaited, so they must run inside an event loop
asyncio.run(main())
📝 YAML Config Example
AGENT_CONFIG:
  name: "My Data Agent"
  version: "1.0"
  description: "Data Analysis Agent"
  backend: "langgraph"
  type: "react"
MODEL:
  chat_model:
    provider: "deepseek"
    model_type: "chat"
    params:
      model: "deepseek-chat"
      temperature: 0.7
      base_url: "$env{DEEPSEEK_BASE_URL}"
      api_key: "$env{DEEPSEEK_API_KEY}"
WORKSPACE:
  path: "/tmp/dataagent_workspace"
  allow_path:
    - "/tmp/dataagent_workspace"
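The `allow_path` list in the WORKSPACE section suggests a path whitelist. As a rough sketch of how such a check can be written (an assumption about the concept, not DataAgent's actual sandbox code; `is_path_allowed` is a hypothetical helper), the key points are to resolve paths before comparing them and to compare by path component rather than by string prefix.

```python
from pathlib import Path
from typing import Iterable


def is_path_allowed(candidate: str, allow_path: Iterable[str]) -> bool:
    """Return True if `candidate` lies inside one of the whitelisted roots."""
    # resolve() collapses ".." segments and follows symlinks where they exist,
    # so traversal tricks are normalized before the comparison.
    target = Path(candidate).resolve()
    for root in allow_path:
        # is_relative_to compares whole path components, so
        # "/tmp/dataagent_workspace_evil" does not match "/tmp/dataagent_workspace".
        if target.is_relative_to(Path(root).resolve()):
            return True
    return False


allowed_roots = ["/tmp/dataagent_workspace"]
inside = is_path_allowed("/tmp/dataagent_workspace/reports/out.csv", allowed_roots)
escaped = is_path_allowed("/tmp/dataagent_workspace/../secrets.txt", allowed_roots)
lookalike = is_path_allowed("/tmp/dataagent_workspace_evil/out.csv", allowed_roots)
print(inside, escaped, lookalike)
```

A plain `str.startswith` check would wrongly accept the look-alike directory, which is why the sketch compares resolved `Path` objects instead.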
🌐 A2A 1.0 Server Mode
# Start A2A server
uv run -m dataagent serve-a2a \
--config path/to/config.yaml \
--host 0.0.0.0 \
--port 9999 \
--auth-token your_token
# Service endpoints
# ├── 🌟 AgentCard: http://localhost:9999/.well-known/agent.json
# ├── 📡 JSON-RPC: http://localhost:9999/a2a/jsonrpc
# └── 🔌 REST: http://localhost:9999/a2a/rest
⚙️ Configuration
🔐 Environment Variables
| Variable | Description | Example |
|---|---|---|
| `DEEPSEEK_API_KEY` | DeepSeek API Key | sk-xxx |
| `DEEPSEEK_BASE_URL` | DeepSeek API Base URL | https://api.deepseek.com |
| `BAILIAN_API_KEY` | Alibaba Cloud Bailian API Key | sk-xxx |
| `OPENAI_API_KEY` | OpenAI API Key | sk-xxx |
📌 For more configuration options, refer to .env.example
📄 License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
File details
Details for the file dg_dataagent-0.1.0-py3-none-any.whl.
File metadata
- Download URL: dg_dataagent-0.1.0-py3-none-any.whl
- Upload date:
- Size: 742.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9f628e6606accba5b09dd08de5dbac3091f7c83a6e710f4a676ccf4323db9259 |
| MD5 | 16d6c33a491d8a86c6d3ef5cc8913e30 |
| BLAKE2b-256 | cc49418cdc8262e4d181e6826d8622386f53a1da64d0711f5394b4cb2d6a7c75 |