# Databricks dbldatagen MCP Server
A Model Context Protocol (MCP) server for generating synthetic test data using dbldatagen on Databricks. Enables AI assistants to analyze source tables, generate realistic synthetic data, run SQL queries, and manage notebooks — all through natural language.
## Features
- Schema Analysis — Inspect column types, nullable flags, metadata, and detect primary keys
- Data Profiling — Deep profiling including distributions, cardinality, null ratios, and pattern detection
- Synthetic Data Generation — Content-aware generation using dbldatagen DataAnalyzer with preserved columns, fixed values, and schema casting
- SQL Execution — Run any SQL query on Databricks (SELECT, DESCRIBE, CREATE, etc.)
- Notebook Operations — Import, sync, and export notebooks to/from Databricks workspace
- Windows Support — Full Windows compatibility with optimized async handling
## Architecture

```
┌────────────────────────────────────────────────────────────┐
│                   AI Assistant (VS Code)                   │
└─────────────────────────────┬──────────────────────────────┘
                              │ MCP Protocol (stdio)
                              ▼
┌────────────────────────────────────────────────────────────┐
│            databricks-dbldatagen-mcp (FastMCP)             │
│                                                            │
│  tools/generate_data.py ────┐                              │
│  tools/analyze_schema.py ───┤                              │
│  tools/profile.py ──────────┼──► @mcp.tool decorators      │
│  tools/sql.py ──────────────┤                              │
│  tools/notebook_ops.py ─────┘                              │
│                                                            │
│  core/analyzer.py ────────── DataProfiler                  │
│  auth.py ─────────────────── Authentication & caching      │
│  identity.py ─────────────── User-agent tagging            │
│  app.py ──────────────────── FastMCP instance + patches    │
│  server.py ───────────────── Entry point                   │
└─────────────────────────────┬──────────────────────────────┘
                              │ Databricks SDK / REST API
                              ▼
                    ┌───────────────────┐
                    │    Databricks     │
                    │     Workspace     │
                    └───────────────────┘
```
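The `@mcp.tool` decorators in the diagram follow a standard decorator-registration pattern: each tool function is registered under its name so the server can dispatch incoming MCP calls to it. A minimal stdlib sketch of that pattern (the `registry`, `tool`, and `analyze_schema` names here are illustrative, not the server's actual code):

```python
from typing import Callable, Dict

# Illustrative tool registry; FastMCP keeps an equivalent mapping internally.
registry: Dict[str, Callable] = {}

def tool(fn: Callable) -> Callable:
    """Register a function as a named tool and return it unchanged."""
    registry[fn.__name__] = fn
    return fn

@tool
def analyze_schema(table: str) -> dict:
    # A real tool would call the Databricks SDK here.
    return {"table": table, "columns": []}

# The server dispatches an incoming MCP tool call by name:
result = registry["analyze_schema"]("catalog.schema.my_table")
```

This is why adding a new tool module under `tools/` is enough to expose it: decorating the function registers it with the shared FastMCP instance in `app.py`.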
## Quick Start

### 1. Install

```shell
git clone <repo-url>
cd databricks-dbldatagen-mcp

# Create venv and install
python -m venv .venv
.venv\Scripts\activate   # Windows
pip install -e .
```
### 2. Configure authentication

Create a `.env` file in the project root:

```
DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
DATABRICKS_TOKEN=your-token
DATABRICKS_CLUSTER_ID=your-cluster-id
```
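A `.env` file of this shape is plain `KEY=value` lines. As a sketch of how such a file ends up in the process environment, here is a minimal stdlib loader (an assumption for illustration only; the server may use a dedicated library such as python-dotenv):

```python
import os
import tempfile

def load_env(path: str) -> dict:
    """Parse KEY=value lines, skipping blanks and # comments."""
    values = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    os.environ.update(values)  # make settings visible to the Databricks SDK
    return values

# Demo: write a sample .env to a temp dir and load it.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, ".env")
    with open(path, "w") as f:
        f.write("DATABRICKS_HOST=https://example.cloud.databricks.com\n")
        f.write("# comment line\n")
        f.write("DATABRICKS_TOKEN=example-token\n")
    cfg = load_env(path)
```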
### 3. Add to VS Code

Create `.vscode/mcp.json`:

```json
{
  "servers": {
    "databricks-dbldatagen": {
      "type": "stdio",
      "command": "${workspaceFolder}\\.venv\\Scripts\\python.exe",
      "args": ["-m", "databricks_dbldatagen_mcp.server"],
      "env": {
        "PYTHONPATH": "${workspaceFolder}",
        "FASTMCP_LOG_LEVEL": "WARNING"
      }
    }
  }
}
```
### 4. Use

Open VS Code Copilot Chat and start prompting:
- "Analyze the schema of catalog.schema.my_table"
- "Generate 5000 rows of synthetic data from source_table into target_table"
- "Run SELECT * FROM my_table LIMIT 10"
## Available Tools

| Tool | Description |
|---|---|
| `generate_data` | Generate synthetic data matching source table structure and distributions |
| `analyze_schema` | Get table schema — columns, types, nullable flags, primary keys, row count |
| `profile_table_data` | Deep profile — distributions, cardinality, null ratios, patterns per column |
| `sql` | Execute any SQL query on Databricks |
| `run_job` | Run a Databricks notebook as a one-time job and return its output |
| `import_notebook` | Download a notebook from Databricks to a local file |
| `sync_notebook` | Push local notebook changes back to Databricks |
| `export_notebook` | Upload a local notebook to the Databricks workspace |

See `TOOLS.md` for detailed usage with all parameters and examples.
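To make the profiling output concrete, here is a small pure-Python sketch of the kind of per-column statistics `profile_table_data` reports. This is illustrative only — the actual tool profiles Databricks tables server-side; the `profile` function and sample rows below are assumptions for the example:

```python
def profile(rows, columns):
    """Compute null ratio and cardinality per column over sample rows."""
    stats = {}
    for col in columns:
        values = [r[col] for r in rows]
        non_null = [v for v in values if v is not None]
        stats[col] = {
            "null_ratio": round(1 - len(non_null) / len(values), 3),
            "cardinality": len(set(non_null)),
        }
    return stats

rows = [
    {"id": 1, "country": "US"},
    {"id": 2, "country": "DE"},
    {"id": 3, "country": None},
    {"id": 4, "country": "US"},
]
stats = profile(rows, ["id", "country"])
```

Statistics like these are what make the generation step "content-aware": a synthetic `country` column can reuse the observed value set and null ratio rather than emitting random strings.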
## Environment Variables

| Variable | Description | Required |
|---|---|---|
| `DATABRICKS_HOST` | Workspace URL | Yes |
| `DATABRICKS_TOKEN` | Personal access token | Yes* |
| `DATABRICKS_CLUSTER_ID` | Cluster ID for generation/SQL | Yes |
| `DATABRICKS_CONFIG_PROFILE` | Profile from `~/.databrickscfg` | Alternative |
| `DATABRICKS_CLIENT_ID` | OAuth client ID | For OAuth |
| `DATABRICKS_CLIENT_SECRET` | OAuth client secret | For OAuth |

\* Required unless using OAuth or a config profile.
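These variables describe three alternative credential sources: a personal access token, OAuth client credentials, or a named profile from `~/.databrickscfg`. The selection logic below is a hedged sketch of one plausible precedence order — the server relies on the Databricks SDK's unified authentication, whose actual resolution order may differ:

```python
def resolve_auth(env: dict) -> str:
    """Illustrative precedence: PAT, then OAuth, then config profile."""
    if env.get("DATABRICKS_TOKEN"):
        return "pat"
    if env.get("DATABRICKS_CLIENT_ID") and env.get("DATABRICKS_CLIENT_SECRET"):
        return "oauth"
    if env.get("DATABRICKS_CONFIG_PROFILE"):
        return "profile"
    raise ValueError("no Databricks credentials configured")

mode = resolve_auth({"DATABRICKS_TOKEN": "dapi-example"})
```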
## Development

```shell
pip install -e ".[dev]"
pytest tests/ -v
black databricks_dbldatagen_mcp/
ruff check databricks_dbldatagen_mcp/ --fix
```
## License

MIT License — see `LICENSE` for details.