databot
A lightweight, open-source AI agent platform for data engineering and platform operations.
~8,000 lines of core code, built for data engineers who need an intelligent assistant for monitoring pipelines, diagnosing data quality issues, querying infrastructure, and managing the full big-data stack.
DeepWiki: https://deepwiki.com/asb108/databot/1-overview
Features
Core
- Multi-Provider LLM: Anthropic, OpenAI, DeepSeek, Gemini, local vLLM via LiteLLM
- Streaming Responses: Token-by-token streaming via SSE for real-time feedback
- Persistent Memory: SQLite-backed sessions and key-value memory (zero external deps)
- Plugin System: Extend with custom tools, channels, and providers via Python entry points
Data Tools
- SQL Queries: Execute read-only queries against MySQL, Trino, Presto, ClickHouse, StarRocks, Hive, and more
- Airflow Integration: Check DAG status, view task logs, trigger runs via REST API
- Data Quality: Row counts, null checks, freshness, source-target comparison with SQL injection protection
- Data Lineage: Upstream/downstream dependencies via NetworkX graphs or Marquez REST API
- Spark Management: Submit batch jobs, manage interactive sessions via Livy/YARN/K8s
- Kafka Ecosystem: Topics, consumer groups, Schema Registry, Kafka Connect management
- Data Catalog: Browse Iceberg REST, AWS Glue, or Databricks Unity Catalog
Connectors
- Connector Framework: Unified BaseConnector abstraction for SQL, REST, Spark, Kafka, and Catalog
- Connector Registry: Centralized lifecycle management with auto-discovery, health checks, and connect/disconnect
- Connector Factory: Declarative connector instantiation from YAML config
Intelligence
- Multi-Agent Architecture: Router + Delegator pattern with 6 specialist agents (SQL, Pipeline, Quality, Catalog, Streaming, General)
- RAG (Retrieval-Augmented Generation): ChromaDB-backed vector store for schema-aware, context-grounded answers
- MCP Server: Expose tools and connectors via Model Context Protocol for Claude Desktop, Cursor, and VS Code
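The RAG path above is easy to picture with a minimal ChromaDB sketch. The collection name, schema document, and query below are illustrative stand-ins; databot's rag/ module manages its own vector store internally:

    # Minimal ChromaDB sketch of schema-aware retrieval. The collection
    # name and documents are illustrative, not databot's internals.
    import chromadb  # pip install chromadb

    client = chromadb.Client()  # in-memory client, fine for a sketch
    schemas = client.create_collection("table_schemas")
    schemas.add(
        ids=["pricing.rate_cards"],
        documents=["pricing.rate_cards(rate_id BIGINT, sku STRING, price DECIMAL(10,2), updated_at TIMESTAMP)"],
    )
    hits = schemas.query(query_texts=["where are SKU prices stored?"], n_results=1)
    print(hits["documents"][0])  # schema context handed to the LLM as grounding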
Channels & Gateway
- Google Chat: Webhook (send-only) and App (bidirectional) modes
- Slack: Bot with slash commands and thread-aware conversations
- Discord: Bot with prefix commands
- REST Gateway: FastAPI gateway with auth middleware, rate limiting, SSE streaming endpoint
- Scheduled Tasks: Cron-based proactive monitoring with channel alerts
Operations
- Observability: OpenTelemetry tracing for tool calls and LLM interactions
- Security: Read-only SQL, workspace sandboxing, command allowlist, API key auth, rate limiting
- Shell & Filesystem: Execute commands, read/write files with workspace sandboxing
Quick Start
Install
# From PyPI
pip install databot-ai
# From source (recommended for development)
git clone https://github.com/asb108/databot.git
cd databot
pip install -e ".[all]"
Initialize
databot onboard
Configure
Edit ~/.databot/config.yaml:
providers:
  default: anthropic
  anthropic:
    api_key: ${ANTHROPIC_API_KEY}
    model: claude-sonnet-4-5-20250929

channels:
  gchat:
    enabled: true
    mode: webhook
    webhook_url: ${GCHAT_WEBHOOK_URL}

tools:
  sql:
    connections:
      clickzetta:
        driver: clickzetta
        host: ${CZ_HOST}
        schema_name: data_warehouse
        virtual_cluster: ${CZ_VC}
        read_only: true
        max_rows: 1000
  airflow:
    base_url: ${AIRFLOW_URL}
    username: ${AIRFLOW_USER}
    password: ${AIRFLOW_PASSWORD}

security:
  restrict_to_workspace: true
  allowed_commands: ["kubectl", "airflow", "trino-cli"]
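The ${VAR} placeholders are resolved from environment variables at load time. A minimal sketch of that substitution, assuming simple expand-then-parse behavior; databot's Pydantic-based loader may differ in details:

    # Sketch of ${VAR} expansion before YAML parsing. An assumption
    # about the loader's behavior, not databot's actual config code.
    import os
    import yaml  # pip install pyyaml

    def load_config(path: str) -> dict:
        with open(path) as f:
            raw = f.read()
        # expandvars replaces ${ANTHROPIC_API_KEY} etc. from the environment
        return yaml.safe_load(os.path.expandvars(raw))

    config = load_config(os.path.expanduser("~/.databot/config.yaml"))
    print(config["providers"]["default"])  # -> "anthropic"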
Chat
# Single message
databot agent -m "How many rows in pricing.rate_cards?"
# Interactive mode
databot agent
# Start gateway (always-on with cron + Google Chat)
databot gateway
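The gateway serves the same agent over HTTP with SSE streaming. A client sketch, assuming port 18790 from the Docker example below; the endpoint path and payload shape here are illustrative guesses, so check the gateway's actual routes:

    # Hypothetical SSE client for the databot gateway. Port 18790 comes
    # from the Docker example; the /v1/chat/stream path and JSON payload
    # are assumptions, not documented routes.
    import httpx  # pip install httpx

    def stream_chat(message: str, api_key: str) -> None:
        with httpx.stream(
            "POST",
            "http://localhost:18790/v1/chat/stream",
            json={"message": message},
            headers={"X-API-Key": api_key},  # gateway supports X-API-Key auth
            timeout=None,
        ) as response:
            response.raise_for_status()
            for line in response.iter_lines():  # SSE frames: "data: <chunk>"
                if line.startswith("data: "):
                    print(line[len("data: "):], end="", flush=True)

    stream_chat("How many rows in pricing.rate_cards?", api_key="...")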
Architecture
databot/
  agents/         # Multi-agent framework (Router, Delegator, Specialists)
  channels/       # Messaging channels (CLI, Google Chat, Slack, Discord)
  cli/            # Typer CLI commands
  config/         # Pydantic config schema + YAML loader
  connectors/     # Connector framework (SQL, REST, Spark, Kafka, Catalog)
  core/           # Agent loop, message bus, context builder, streaming
  cron/           # Scheduled task execution
  mcp/            # MCP server (Model Context Protocol)
  memory/         # Persistent key-value memory
  middleware/     # Gateway middleware (API key auth, rate limiting)
  observability/  # OpenTelemetry tracing
  plugins/        # Plugin discovery and loading
  providers/      # LLM provider abstraction (LiteLLM)
  rag/            # RAG module (ChromaDB vector store)
  session/        # SQLite-backed conversation history
  tools/          # Pluggable tools (SQL, Airflow, DQ, lineage, spark, kafka, catalog, ...)
Tools
| Tool | Description |
|---|---|
| sql | Execute SQL queries against configured databases (connector-backed) |
| airflow | Check DAG status, view logs, trigger runs (connector-backed) |
| data_quality | Row counts, null checks, freshness, source-target comparison |
| lineage | Upstream/downstream dependencies via graphs or Marquez API |
| spark | Submit batch jobs, manage sessions via Livy/YARN/K8s connectors |
| kafka | Topics, consumer groups, Schema Registry, Kafka Connect |
| catalog | Browse Iceberg REST, AWS Glue, Unity Catalog |
| shell | Execute shell commands (sandboxed) |
| read_file | Read file contents |
| write_file | Write/create files |
| edit_file | Find-and-replace edits |
| list_dir | List directory contents |
| web_fetch | Fetch URL content |
| web_search | Search the web (Brave API) |
| cron | Manage scheduled tasks |
CLI Reference
| Command | Description |
|---|---|
| databot onboard | Initialize config and workspace |
| databot agent -m "..." | Send a single message |
| databot agent | Interactive chat mode |
| databot gateway | Start always-on service (API + channels + cron) |
| databot mcp | Start MCP server (stdio transport, for Claude Desktop / Cursor) |
| databot mcp --transport sse | Start MCP server over HTTP (SSE transport) |
| databot status | Show status and configuration |
| databot cron list | List scheduled jobs |
| databot cron add --name "..." --schedule "..." --message "..." | Add a cron job |
| databot cron remove --id "..." | Remove a cron job |
MCP Server
Databot implements the Model Context Protocol so external LLM clients can discover and invoke its tools.
Claude Desktop / Cursor
Add to your MCP config (claude_desktop_config.json or Cursor settings):
{
  "mcpServers": {
    "databot": {
      "command": "databot",
      "args": ["mcp"]
    }
  }
}
SSE (HTTP) Transport
databot mcp --transport sse --port 18791
Exposed Resources
- All registered tools are exposed as MCP tools
- Each connector is exposed as an MCP resource (connector://<name>)
- Health overview available at databot://health
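For a quick smoke test of the stdio transport, here is a client sketch using the official mcp Python SDK (pip install mcp); the tools and resources printed depend on what your databot instance registers:

    # Spawns `databot mcp` over stdio and lists its tools and
    # resources, using the official MCP Python SDK.
    import asyncio

    from mcp import ClientSession, StdioServerParameters
    from mcp.client.stdio import stdio_client

    async def main() -> None:
        server = StdioServerParameters(command="databot", args=["mcp"])
        async with stdio_client(server) as (read, write):
            async with ClientSession(read, write) as session:
                await session.initialize()
                tools = await session.list_tools()
                print("tools:", [t.name for t in tools.tools])
                resources = await session.list_resources()
                print("resources:", [str(r.uri) for r in resources.resources])

    asyncio.run(main())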
Connector Configuration
Connectors are declared in config.yaml under the connectors key:
connectors:
  instances:
    my_warehouse:
      type: sql
      driver: trino
      host: trino.internal
      port: 8080
      catalog: hive
      schema_name: analytics
    airflow:
      type: rest_api
      base_url: http://airflow:8080/api/v1
      auth:
        type: basic
        username: ${AIRFLOW_USER}
        password: ${AIRFLOW_PASSWORD}
    spark_livy:
      type: spark
      mode: livy
      base_url: http://livy:8998
    kafka_prod:
      type: kafka
      base_url: http://kafka-rest:8082
      schema_registry_url: http://schema-registry:8081
      connect_url: http://kafka-connect:8083
    iceberg:
      type: catalog
      protocol: iceberg
      base_url: http://iceberg-rest:8181
Docker
# Build
docker build -t databot .
# Initialize (first time)
docker run -v ~/.databot:/root/.databot --rm databot onboard
# Run gateway
docker run -v ~/.databot:/root/.databot -p 18790:18790 databot gateway
Kubernetes
See the k8s/ directory for example Kubernetes deployment manifests.
Security
- Read-only SQL by default: Write operations blocked unless explicitly enabled
- Workspace sandboxing: Filesystem and shell restricted to workspace directory
- Command allowlist: Only whitelisted shell commands can execute
- API Key Authentication: Gateway endpoints protected with Bearer / X-API-Key auth
- Rate Limiting: Configurable per-IP request rate limiting with sliding window
- SQL Injection Protection: Identifier validation on data quality checks
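The identifier validation behind the data quality checks follows a standard pattern: table and column names cannot go through bind parameters, so they are checked against a strict allowlist before being interpolated into SQL. A sketch of the general technique, not databot's exact code:

    # Identifier allowlisting: the usual defense when table/column
    # names must be interpolated into SQL, since placeholders only
    # cover values. A sketch of the technique, not databot's code.
    import re

    IDENTIFIER_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*$")

    def validate_identifier(name: str) -> str:
        """Allow only dotted identifiers like pricing.rate_cards."""
        if not IDENTIFIER_RE.match(name):
            raise ValueError(f"invalid SQL identifier: {name!r}")
        return name

    def row_count_query(table: str) -> str:
        return f"SELECT COUNT(*) FROM {validate_identifier(table)}"

    print(row_count_query("pricing.rate_cards"))    # SELECT COUNT(*) FROM pricing.rate_cards
    # row_count_query("rate_cards; DROP TABLE x")   # raises ValueError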
Plugins
Databot supports plugins via Python entry points. Third-party packages can add custom tools, channels, and LLM providers.
Creating a Plugin
- Create a Python package with your custom tool:
# my_databot_plugin/tools.py
from databot.tools.base import BaseTool

class MyCustomTool(BaseTool):
    @property
    def name(self) -> str:
        return "my_tool"

    @property
    def description(self) -> str:
        return "Description of what this tool does"

    def parameters(self):
        return {
            "type": "object",
            "properties": {
                "param1": {"type": "string", "description": "First parameter"}
            },
            "required": ["param1"],
        }

    async def execute(self, param1: str) -> str:
        return f"Executed with {param1}"
- Register it in your pyproject.toml:
[project.entry-points."databot.tools"]
my_tool = "my_databot_plugin.tools:MyCustomTool"
- Install your package and databot will auto-discover it.
Entry Point Groups
| Group | Base Class | Description |
|---|---|---|
| databot.tools | BaseTool | Custom tools for the agent |
| databot.channels | BaseChannel | Messaging integrations (Slack, Discord, etc.) |
| databot.providers | LLMProvider | LLM provider adapters |
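Discovery itself is standard entry-point machinery; roughly, the loader in databot/plugins/ does something like the following sketch using importlib.metadata (the real code likely adds validation and error handling):

    # Sketch of entry-point discovery with importlib.metadata
    # (Python 3.10+ keyword-style selection). Illustrative, not
    # databot's actual plugin loader.
    from importlib.metadata import entry_points

    def discover_tools() -> dict[str, type]:
        tools = {}
        for ep in entry_points(group="databot.tools"):
            tools[ep.name] = ep.load()  # imports the class, e.g. MyCustomTool
        return tools

    for name, cls in discover_tools().items():
        print(name, "->", cls.__name__)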
License
MIT