
Databricks MCP Server for Synthetic Data Generation with dbldatagen

Databricks dbldatagen MCP Server

A Model Context Protocol (MCP) server for generating synthetic test data using dbldatagen on Databricks. Enables AI assistants to analyze source tables, generate realistic synthetic data, run SQL queries, and manage notebooks — all through natural language.

Features

  • Schema Analysis — Inspect column types, nullable flags, metadata, and detect primary keys
  • Data Profiling — Deep profiling including distributions, cardinality, null ratios, and pattern detection
  • Synthetic Data Generation — Content-aware generation using dbldatagen DataAnalyzer with preserved columns, fixed values, and schema casting
  • SQL Execution — Run any SQL query on Databricks (SELECT, DESCRIBE, CREATE, etc.)
  • Notebook Operations — Import, sync, and export notebooks to/from Databricks workspace
  • Windows Support — Full Windows compatibility with optimized async handling
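The profiling feature computes per-column statistics such as null ratios and cardinality. As an illustration only (the server's actual implementation builds on dbldatagen's DataAnalyzer, and this sketch is not its code), the underlying math for a single column looks like:

```python
from collections import Counter

def profile_column(values):
    """Compute simple per-column statistics of the kind a data
    profiler reports: null ratio, cardinality, and top values."""
    total = len(values)
    nulls = sum(1 for v in values if v is None)
    counts = Counter(v for v in values if v is not None)
    return {
        "null_ratio": nulls / total if total else 0.0,  # fraction of NULLs
        "cardinality": len(counts),                     # distinct non-null values
        "top_values": counts.most_common(3),            # most frequent values
    }

stats = profile_column(["US", "US", "DE", None, "FR", "US"])
```

Distribution and pattern detection add more machinery on top, but reduce to the same pass over column values.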

Architecture

┌────────────────────────────────────────────────────────────┐
│                   AI Assistant (VS Code)                   │
└───────────────────────────┬────────────────────────────────┘
                            │ MCP Protocol (stdio)
                            ▼
┌────────────────────────────────────────────────────────────┐
│            databricks-dbldatagen-mcp (FastMCP)             │
│                                                            │
│  tools/generate_data.py ────┐                              │
│  tools/analyze_schema.py ───┤                              │
│  tools/profile.py ──────────┼──► @mcp.tool decorators      │
│  tools/sql.py ──────────────┤                              │
│  tools/notebook_ops.py ─────┘                              │
│                                                            │
│  core/analyzer.py ──────────── DataProfiler                │
│  auth.py ───────────────────── Authentication & caching    │
│  identity.py ───────────────── User-agent tagging          │
│  app.py ────────────────────── FastMCP instance + patches  │
│  server.py ─────────────────── Entry point                 │
└───────────────────────────┬────────────────────────────────┘
                            │ Databricks SDK / REST API
                            ▼
                  ┌───────────────────┐
                  │    Databricks     │
                  │    Workspace      │
                  └───────────────────┘
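The layout above shows each tools/ module registering its functions on a shared FastMCP instance via @mcp.tool decorators. A schematic, pure-Python illustration of that registration pattern (this mimics the decorator idea only; the real FastMCP API also derives JSON schemas for tool parameters from type hints):

```python
class ToolRegistry:
    """Minimal stand-in for a FastMCP-style app object."""

    def __init__(self, name):
        self.name = name
        self.tools = {}

    def tool(self, func):
        """Decorator: register a function as a callable tool by name."""
        self.tools[func.__name__] = func
        return func

mcp = ToolRegistry("databricks-dbldatagen-mcp")

@mcp.tool
def sql(query: str) -> str:
    # The real tool would submit the query via the Databricks SDK;
    # this stub just echoes it.
    return f"executed: {query}"

# The MCP runtime dispatches incoming tool calls by name:
result = mcp.tools["sql"]("SELECT 1")
```

Keeping each tool in its own module and registering it on one shared app instance is what lets server.py expose everything through a single stdio entry point.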

Quick Start

1. Install

git clone <repo-url>
cd databricks-dbldatagen-mcp

# Create venv and install
python -m venv .venv
.venv\Scripts\activate       # Windows
# source .venv/bin/activate  # macOS/Linux
pip install -e .

2. Configure authentication

Create a .env file in the project root:

DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
DATABRICKS_TOKEN=your-token
DATABRICKS_CLUSTER_ID=your-cluster-id
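Before starting the server, it is worth checking that all three variables are present. A minimal sketch of loading and validating a .env file like the one above (illustrative only; real .env loaders such as python-dotenv also handle quoting, comments, and interpolation):

```python
REQUIRED = ("DATABRICKS_HOST", "DATABRICKS_TOKEN", "DATABRICKS_CLUSTER_ID")

def load_dotenv_text(text):
    """Parse KEY=VALUE lines from a .env-style string."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env

def missing_vars(env):
    """Return the required variables that are absent or empty."""
    return [k for k in REQUIRED if not env.get(k)]

env = load_dotenv_text(
    "DATABRICKS_HOST=https://example.cloud.databricks.com\n"
    "DATABRICKS_TOKEN=dapi-xxxx\n"
)
```

Here missing_vars(env) would flag DATABRICKS_CLUSTER_ID, which the server needs to run generation jobs and SQL.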

3. Add to VS Code

Create .vscode/mcp.json:

{
  "servers": {
    "databricks-dbldatagen": {
      "type": "stdio",
      "command": "${workspaceFolder}\\.venv\\Scripts\\python.exe",
      "args": ["-m", "databricks_dbldatagen_mcp.server"],
      "env": {
        "PYTHONPATH": "${workspaceFolder}",
        "FASTMCP_LOG_LEVEL": "WARNING"
      }
    }
  }
}

4. Use

Open VS Code Copilot Chat and start prompting:

  • "Analyze the schema of catalog.schema.my_table"
  • "Generate 5000 rows of synthetic data from source_table into target_table"
  • "Run SELECT * FROM my_table LIMIT 10"
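Under the hood, each prompt resolves to an MCP tools/call request sent to the server over stdio as JSON-RPC 2.0. A sketch of what the assistant emits for the last prompt (the "query" parameter name is an assumption; see TOOLS.md for the server's actual parameter names):

```python
import json

def make_tool_call(request_id, tool_name, arguments):
    """Build an MCP tools/call request (MCP uses JSON-RPC 2.0)."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    })

msg = make_tool_call(1, "sql", {"query": "SELECT * FROM my_table LIMIT 10"})
```

The server's reply travels back the same way, so the assistant can summarize query results in the chat.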

Available Tools

Tool                 Description
generate_data        Generate synthetic data matching source table structure and distributions
analyze_schema       Get table schema — columns, types, nullable flags, primary keys, row count
profile_table_data   Deep profile — distributions, cardinality, null ratios, patterns per column
sql                  Execute any SQL query on Databricks
run_job              Run a Databricks notebook as a one-time job and return output
import_notebook      Download a notebook from Databricks to the local filesystem
sync_notebook        Push local notebook changes back to Databricks
export_notebook      Upload a local notebook to the Databricks workspace

See TOOLS.md for detailed usage with all parameters and examples.

Environment Variables

Variable                    Description                      Required
DATABRICKS_HOST             Workspace URL                    Yes
DATABRICKS_TOKEN            Personal access token            Yes*
DATABRICKS_CLUSTER_ID       Cluster ID for generation/SQL    Yes
DATABRICKS_CONFIG_PROFILE   Profile from ~/.databrickscfg    Alternative
DATABRICKS_CLIENT_ID        OAuth client ID                  For OAuth
DATABRICKS_CLIENT_SECRET    OAuth client secret              For OAuth

* Required unless using OAuth or config profile.
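Since the three credential options are mutually exclusive alternatives, the server has to pick one at startup. A plausible precedence sketch (illustrative only; the Databricks SDK's actual unified-auth resolution order may differ):

```python
def resolve_auth(env):
    """Pick an authentication method from environment variables,
    preferring a personal access token, then OAuth, then a profile."""
    if env.get("DATABRICKS_TOKEN"):
        return "pat"
    if env.get("DATABRICKS_CLIENT_ID") and env.get("DATABRICKS_CLIENT_SECRET"):
        return "oauth"
    if env.get("DATABRICKS_CONFIG_PROFILE"):
        return "profile"
    raise ValueError("no Databricks credentials configured")

method = resolve_auth({
    "DATABRICKS_CLIENT_ID": "abc",
    "DATABRICKS_CLIENT_SECRET": "xyz",
})
```

Failing fast with a clear error when nothing is configured saves a confusing timeout on the first tool call.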

Development

pip install -e ".[dev]"
pytest tests/ -v
black databricks_dbldatagen_mcp/
ruff check databricks_dbldatagen_mcp/ --fix

License

MIT License — see LICENSE for details.

Download files

Download the file for your platform.

Source Distribution

databricks_dbldatagen_mcp-0.1.0.tar.gz (54.4 kB)


Built Distribution


databricks_dbldatagen_mcp-0.1.0-py3-none-any.whl (44.4 kB)


File details

Hashes for databricks_dbldatagen_mcp-0.1.0.tar.gz:

Algorithm     Hash digest
SHA256        183a410b37874eff736ea8b677ceb2fdd6a4a52336bbd1cf7953b9eba4387c21
MD5           cc3ed1ab63ea523131a68bff834c0ed6
BLAKE2b-256   f01ac2838074f5f656a485e3beaa6f69aaf0de7623ec27ae378ef561ffd47c17

File details

Hashes for databricks_dbldatagen_mcp-0.1.0-py3-none-any.whl:

Algorithm     Hash digest
SHA256        5b878e5860f5345c56dad656daee9071be9073bf60436f3381c066507803d4e8
MD5           a993dbd7726591fbf8f3c071cefd062a
BLAKE2b-256   b74f48a4406ea85e5d581fb351182fb249c22d3ba6871aa8ce99581d51bb0e0d
