DataAgent - A powerful multi-modal Data Agent workflow template framework

🚀 DataAgent

Data + AI Agent: Enterprise Data Task Solution

🚀 DataAgent is a next-generation enterprise data intelligence platform for Data + AI scenarios, reimagining the entire data engineering pipeline through the Agent paradigm. Deeply integrating NL2SQL, unified semantic layers, and multi-agent collaboration, it delivers end-to-end data analysis and feature mining across financial risk control, AI for Science, and other core domains.

🌟 Why DataAgent

🏆 Scenario Advantages

| Scenario | Traditional Approach | The DataAgent Edge | Typical Applications |
| --- | --- | --- | --- |
| 📊 Financial Q&A | Business request → data team queue → manual SQL → manual verification; T+1 is the norm for a single metric query | NL2SQL four-stage pipeline (Perception→Generation→Validation→Reflection), natural language to instant answers. Semantic metric mapping, 74%+ execution accuracy on BIRD DEV benchmark, sub-second response | ✅ Enterprise financial analytics assistant |
| 🔬 AI for Science | Multi-source scientific data scattered everywhere; cross-database correlation requires manual exports; literature and data cannot be jointly queried | Multi-source federated queries + structured/unstructured joint retrieval, natural-language-driven scientific data exploration | ✅ Scientific data exploration platform |

⚡ Core Capabilities

| Capability | Description |
| --- | --- |
| 🧠 NL2SQL Intelligent Engine | Four-stage pipeline: Perceptor→Generator→Validator→Reflector; multi-strategy fusion: Prompt / ICL / Skeleton / DC; supports SQLite / MySQL / PostgreSQL / Hive; 74%+ execution accuracy on BIRD benchmark |
| 🔬 Automated Feature Engineering | Agents autonomously explore relationships across hundreds of tables and auto-discover latent feature combinations with importance ranking and visualization; 10x+ efficiency boost |
| 🏭 Full-Pipeline Data Factory | Data ingestion→Schema perception→Feature mining→Model training→Report generation; one YAML config runs the complete data engineering pipeline |
| 🧩 Unified Semantic Layer | Prioritizes GaussVector as an enhanced vector retrieval foundation in the semantic layer, turning tables, columns, metric definitions, and business descriptions into retrievable schema signals for NL2SQL and multi-source semantic alignment |
| 🔌 Plugin Tool Ecosystem | Local functions / MCP (stdio+sse) / A2A: three tool types with unified registration and invocation. Auto-discovery and on-demand loading. Built-in data analysis SKILLs |
| 📡 Native Multi-Agent Collaboration | Full A2A 1.0 protocol support: automatic agent discovery, capability mapping, standardized communication. Naturally supports distributed collaboration for complex business tasks |
| 🧩 YAML as Agent | Model, tools, memory, workflow, and scenario prompts are all declaratively orchestrated. From idea to running Agent in minutes |
| 🛡️ Enterprise Security Sandbox | Workspace isolation + path whitelisting + full audit trail, meeting financial-grade compliance requirements |
| Out of the Box | 20+ industry scenario example configs; zero code to start, up and running in minutes |
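
To make the four-stage NL2SQL idea concrete, here is a minimal, self-contained sketch of a perceive → generate → validate → reflect loop against SQLite. It is an illustration only, not DataAgent's implementation: the function names (`perceive`, `validate`, `nl2sql`), the `fake_generator` stand-in for a model call, and the use of `EXPLAIN QUERY PLAN` as the validation step are all assumptions made for this example.

```python
import sqlite3
from typing import Callable, Optional

# Hypothetical signature: (question, schema, feedback) -> SQL string.
SqlGenerator = Callable[[str, str, Optional[str]], str]


def perceive(conn: sqlite3.Connection) -> str:
    """Perception: collect CREATE TABLE statements as schema context."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND sql IS NOT NULL"
    ).fetchall()
    return "\n".join(row[0] for row in rows)


def validate(conn: sqlite3.Connection, sql: str) -> Optional[str]:
    """Validation: ask SQLite to plan the query; return the error text or None."""
    try:
        conn.execute(f"EXPLAIN QUERY PLAN {sql}")
        return None
    except sqlite3.Error as exc:
        return str(exc)


def nl2sql(conn: sqlite3.Connection, question: str,
           generate: SqlGenerator, max_rounds: int = 3) -> str:
    schema = perceive(conn)
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        sql = generate(question, schema, feedback)  # Generation
        feedback = validate(conn, sql)              # Validation
        if feedback is None:
            return sql
        # Reflection: the error message is passed to the next generation round.
    raise ValueError(f"no valid SQL after {max_rounds} rounds: {feedback}")


def fake_generator(question: str, schema: str, feedback: Optional[str]) -> str:
    """Stand-in for a model call: the first attempt is wrong, the retry is fixed."""
    if feedback is None:
        return "SELECT SUM(amount) FROM sale"  # misspelled table name
    return "SELECT SUM(amount) FROM sales"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO sales (amount) VALUES (?)", [(10.0,), (32.0,)])

sql = nl2sql(conn, "What is the total sales amount?", fake_generator)
total = conn.execute(sql).fetchone()[0]
print(sql, total)
```

In this sketch the first generated query fails validation with a "no such table" error, and the second round succeeds once that error is fed back. A real generator would call a chat model with the schema and feedback in the prompt.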

📋 Environment Requirements

| Dependency | Version |
| --- | --- |
| 🐍 Python | >= 3.11 |
| 📦 Package Manager | uv (recommended) or pip |

📚 Documentation

Full documentation lives under docs/ (Chinese · English). Build and preview locally:

uv sync --extra mkdoc
uv run mkdocs serve -f docs/mkdocs.yml

| Document | Description |
| --- | --- |
| 📖 Installation | Install with uv / pip, environment variables, and verification |
| 📖 Quick Start | Run an end-to-end DataAgent workflow in minutes |
| 🗄️ Database Installation | Deploy Elasticsearch, PostgreSQL, MySQL; prioritize GaussVector integration, import scenario data, and connect Semantic Service |
| ⚙️ Features | Core capabilities, modules, tools, and model support |
| 🧩 Semantic Service | MetaVisor enriched metadata for NL2SQL, prioritizing GaussVector-oriented semantic-layer indexing, candidate schema recall, and schema perception enhancement |
| 🔗 openJiuwen | openJiuwen integration and usage guide |
| 🏗️ Architecture | System architecture; context, planning engine, and action modules |
| 📡 API Design | A2A northbound interface and Python SDK |
| 📋 Application Cases | Build a dedicated NL2SQL Agent; build a data analysis Agent |
| 📝 Notes | Development, testing, and documentation maintenance |
| 🗓️ Milestone | Release planning and roadmap |

🚴 Installation

1️⃣ Clone the project

git clone https://gitcode.com/datagallery/DataAgent.git
cd DataAgent

2️⃣ Install dependencies (uv recommended)

# Install dependencies
uv sync

# Activate virtual environment
source .venv/bin/activate  # Linux / macOS
.venv\Scripts\activate     # Windows

3️⃣ Or use pip

pip install -e .

4️⃣ Configure environment variables

# Copy environment template
cp .env.example .env

# Edit .env file with your actual configuration values

⚡ Quick Start

🎮 Interactive quick start

uv run -m dataagent quickstart

Follow the prompts to enter model configuration and start chatting with the Agent!

📁 Start with config file

# Terminal interactive mode
uv run -m dataagent --config dataagent/core/flex/examples/quickstart.yaml

🔍 Config check

# Check environment variable references in config
uv run -m dataagent config check dataagent/core/flex/examples/quickstart.yaml
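
As a rough illustration of what checking environment variable references involves, the sketch below scans config text for references written in the `$env{NAME}` form used in this README's YAML example and reports the ones that are unset. It is not the code behind `config check`; the helper names `find_env_refs` and `missing_env_refs` and the exact name pattern are assumptions for this example.

```python
import re
from collections.abc import Mapping

# Matches references such as $env{DEEPSEEK_API_KEY}; the allowed name
# characters are an assumption for this sketch.
ENV_REF = re.compile(r"\$env\{([A-Za-z_][A-Za-z0-9_]*)\}")


def find_env_refs(config_text: str) -> list[str]:
    """Return the distinct referenced variable names, in first-seen order."""
    return list(dict.fromkeys(ENV_REF.findall(config_text)))


def missing_env_refs(config_text: str, environ: Mapping[str, str]) -> list[str]:
    """Return the referenced names that are unset or empty in `environ`."""
    return [name for name in find_env_refs(config_text) if not environ.get(name)]


sample = '''
params:
  base_url: "$env{DEEPSEEK_BASE_URL}"
  api_key: "$env{DEEPSEEK_API_KEY}"
'''
fake_env = {"DEEPSEEK_BASE_URL": "https://api.deepseek.com"}
print(missing_env_refs(sample, fake_env))  # → ['DEEPSEEK_API_KEY']
```

In real use you would pass `os.environ` and the contents of the config file instead of the inline sample.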

📖 Usage

🐍 Python SDK

import asyncio

from dataagent import DataAgent


async def main() -> None:
    agent = DataAgent.from_config("path/to/config.yaml")

    # Single-turn conversation
    response = await agent.chat("Analyze sales data trends for the past week")
    print(response)

    # Streaming conversation
    async for chunk in agent.astream(input={"user_query": "Generate user report"}):
        print(chunk, end="", flush=True)


# `await` and `async for` are only valid inside a coroutine, so run main() in an event loop
asyncio.run(main())

📝 YAML Config Example

AGENT_CONFIG:
  name: "My Data Agent"
  version: "1.0"
  description: "Data Analysis Agent"
  backend: "langgraph"
  type: "react"

MODEL:
  chat_model:
    provider: "deepseek"
    model_type: "chat"
    params:
      model: "deepseek-chat"
      temperature: 0.7
      base_url: "$env{DEEPSEEK_BASE_URL}"
      api_key: "$env{DEEPSEEK_API_KEY}"

WORKSPACE:
  path: "/tmp/dataagent_workspace"
  allow_path:
    - "/tmp/dataagent_workspace"

🌐 A2A 1.0 Server Mode

# Start A2A server
uv run -m dataagent serve-a2a \
  --config path/to/config.yaml \
  --host 0.0.0.0 \
  --port 9999 \
  --auth-token your_token

# Service endpoints
# ├── 🌟 AgentCard: http://localhost:9999/.well-known/agent.json
# ├── 📡 JSON-RPC:  http://localhost:9999/a2a/jsonrpc
# └── 🔌 REST:      http://localhost:9999/a2a/rest
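
Once the server is running, a client can start by reading the AgentCard endpoint listed above. The sketch below builds and sends that request with the standard library only. The README shows the `--auth-token` flag but not how the token is presented by clients, so the `Authorization: Bearer` header is an assumption; `agent_card_request` and `fetch_agent_card` are hypothetical helper names.

```python
import json
import urllib.request


def agent_card_request(base_url: str, token: str) -> urllib.request.Request:
    """Build a GET request for the AgentCard endpoint."""
    url = base_url.rstrip("/") + "/.well-known/agent.json"
    return urllib.request.Request(
        url,
        headers={
            # Assumption: the server accepts the token as a Bearer credential.
            "Authorization": f"Bearer {token}",
            "Accept": "application/json",
        },
    )


def fetch_agent_card(base_url: str, token: str, timeout: float = 5.0) -> dict:
    """Send the request and decode the JSON body (needs a running server)."""
    request = agent_card_request(base_url, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Building the request does not touch the network, so this runs anywhere.
req = agent_card_request("http://localhost:9999", "your_token")
print(req.get_method(), req.full_url)
```

With the server from the command above running, `fetch_agent_card("http://localhost:9999", "your_token")` would return the card as a dictionary, provided the assumed header scheme matches the server.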

⚙️ Configuration

🔐 Environment Variables

| Variable | Description | Example |
| --- | --- | --- |
| DEEPSEEK_API_KEY | DeepSeek API Key | sk-xxx |
| DEEPSEEK_BASE_URL | DeepSeek API Base URL | https://api.deepseek.com |
| BAILIAN_API_KEY | Alibaba Cloud Bailian API Key | sk-xxx |
| OPENAI_API_KEY | OpenAI API Key | sk-xxx |

📌 For more configuration options, see .env.example

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
