Skip to main content

MCP server for chDB — the in-process SQL OLAP engine powered by ClickHouse. Lets AI agents query Parquet, CSV, JSON, and pandas DataFrames with one tool.

Project description

chdb-mcp

PyPI CI License Python

An MCP server for chDB, the in-process SQL OLAP engine powered by ClickHouse. Lets agents (Claude Desktop, Cursor, VS Code, Codex CLI, Cline, …) query Parquet, CSV, JSON, and pandas DataFrames with one tool — no separate server, no Docker.

Why chdb-mcp?

  • Full ClickHouse engine, in-process. 1000+ functions (windowFunnel, quantilesTDigest, geoToH3, the -If/-State/-Merge combinators), typed JSON with O(1) sub-column reads, native vectors, MergeTree storage.
  • Drop-in pandas API. import datastore as pd covers ~300 pandas-shaped methods compiled to ClickHouse SQL. v1.0 adds dataframe_query() for zero-copy Python(df).
  • ~80 formats and 12+ source connectors in core. Parquet, CSV, JSON, Avro, ORC, Arrow, Protobuf, plus s3(), mongodb(), postgresql(), mysql(), iceberg(), deltaLake() — no INSTALL/LOAD chain.
  • Federate to remote ClickHouse in one statement. (v0.5) remoteSecure('cluster:9440', 'db.table', ...) joins local Parquet with a production ClickHouse cluster in one optimised plan.
  • Same SQL as your warehouse. Copy-paste ClickHouse production queries into the agent prompt — no dialect bridge.

Install

pip install chdb-mcp

Connect

Claude Desktop — add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):

{ "mcpServers": { "chdb": { "command": "chdb-mcp" } } }

Cursor / VS Code — same JSON in ~/.cursor/mcp.json etc.; one-click badges land in v0.2.

Codex CLI / Claude Code / Copilot / Droid — use the cross-IDE bundle chdb-agent-plugin.

Tools (v0.1)

Tool Description
query(sql, format) Run any read-only SQL on the in-process session
list_databases() Enumerate visible databases
list_tables(database) List tables in a database
describe_table(database, table) Column types for a table
query_file(path, sql, format) Query a Parquet/CSV/JSON file via the {file} placeholder
get_sample_data(database, table, limit) First N rows of a table
list_functions(pattern) List ClickHouse SQL functions (optional substring filter)

Read-only by default — SET readonly=2 blocks INSERT/CREATE/DROP/ALTER while keeping file()/url()/s3() usable. Set CHDB_MCP_WRITE=1 to drop the guard. See Security model.

In query_file, {file} is replaced with file('path', 'format') before execution:

query_file(
    path="/data/sales.parquet",
    sql="SELECT region, sum(revenue) FROM {file} GROUP BY region",
    format="Parquet",
)

Configuration

Variable Default Effect
CHDB_MCP_WRITE unset If 1, allows INSERT/CREATE/DROP/ALTER
CHDB_MCP_MAX_RESULT_BYTES 1048576 Per-tool result cap. Enforced engine-side (max_result_bytes + result_overflow_mode='break') plus a final Python slice.
CHDB_MCP_QUERY_TIMEOUT_SEC 30 Wall-clock cap per query (chDB max_execution_time). 0 disables.
CHDB_MCP_FILE_ALLOWLIST empty (unrestricted) :-separated path prefixes. Opt-in isolation switch — when set, query_file() rejects paths outside the prefixes, and query() rejects external table functions (file/url/s3/remote/hdfs/mongodb/...). When unset, no filesystem gating happens — the host process is trusted.
CHDB_MCP_SESSION_PATH empty Persistent session directory (default: ephemeral)

Security model

chDB is in-process. There is no privilege boundary between the MCP server and the host Python interpreter, so the server can't make stronger isolation guarantees than the host already gives it. The model below reflects that.

Trust tiers

  1. Default (no CHDB_MCP_FILE_ALLOWLIST) — no filesystem gating. query() and query_file() can reach anything the host process can reach (any file(), url(), s3(), remote()...). Appropriate when the agent is trusted, or when the surrounding host application enforces the security boundary itself.
  2. Opt-in allowlist (CHDB_MCP_FILE_ALLOWLIST=/data:/tmp/foo) — best-effort defense in depth:
    • query_file() rejects paths whose resolved (symlink-followed) form isn't under any listed prefix.
    • Both query() and query_file() reject SQL containing any table function that isn't on the safe-by-construction list (numbers/values/view/merge/dictionary/generateRandom/...). The "known" set is snapshotted from system.table_functions at session start, so the gate stays in sync with whatever the running chDB build actually exposes — including new external-source variants (paimon*, prometheusQuery*, iceberg*Azure/S3/HDFS), RCE-class functions (executable, python), and *Cluster siblings, without a hand-maintained denylist that goes stale.
    • For query_file(), the scan runs on the user SQL before the {file} placeholder substitution, so a UNION ALL SELECT … FROM file('/etc/passwd', …) smuggled into the query body is caught even though the explicit path is gated.
    • The scanner is comment- and string-aware (single-pass mask covering line comments, block comments, single-quoted strings with '' / \' / \\ escapes), and it normalizes backtick- and double-quote-wrapped identifiers (`file` / "file") before matching so quoted function names can't bypass it.
    • This is not a sandbox: a determined caller can still try to exfiltrate via undiscovered functions, settings, or future chDB features. Strong enough for casual agent mistakes, not for adversarial input.
  3. Hard isolation — for adversarial input, wrap the server in OS-level confinement: macOS App Sandbox, Linux user namespaces / seccomp, or Docker with a read-only filesystem mount. Nothing at the MCP layer can substitute for this.

What's protected

  • Accidental writesSET readonly=2 is applied at session start. CHDB_MCP_WRITE=1 lifts it. (Note: ClickHouse's readonly=2 still permits TEMPORARY TABLE writes and runtime SET changes — by design, not a bug.)
  • Runaway result sizesCHDB_MCP_MAX_RESULT_BYTES is enforced engine-side (max_block_size + max_result_bytes + result_overflow_mode='break'), not just as a post-hoc string slice. Large queries no longer materialize multi-MiB in chDB before truncation.
  • Runaway wall-clockCHDB_MCP_QUERY_TIMEOUT_SEC (default 30s) caps each query via chDB's max_execution_time.
  • SQL-identifier injectionlist_tables / describe_table / get_sample_data arguments are whitelist-regex'd ([A-Za-z_][A-Za-z0-9_]* only) and backtick-quoted before interpolation.
  • SQL string-literal escapelist_functions(pattern) and query_file(path, format) arguments are passed through quote_string, which escapes both single quotes (''') and backslashes (\\\) so that ClickHouse's \' escape form cannot break out of the literal.

What's NOT protected

  • SQL audit. Only the readonly guard — no allow/deny list of statements. Treat the agent as having full SELECT access to anything chDB can reach (subject to the allowlist when set).
  • Setting tampering. Under readonly=2, the agent can still SET max_memory_usage = … to raise resource caps. Lock this down at the host or via OS-level resource limits if it matters.
  • Memory / CPU caps. chDB's max_memory_usage applies, but there's no ulimit/cgroups equivalent imposed by the MCP layer.

For agents acting on untrusted input, run in a throwaway container.

Roadmap

  • v0.5query_remote_clickhouse() federation tool
  • v1.0attach_file(), dataframe_query() (zero-copy Python(df)), HTTP/SSE transport with Bearer auth, .mcpb bundle for Claude Desktop one-click install

Troubleshooting

macOS: "Server disconnected" in Claude Desktop

If ~/Library/Logs/Claude/mcp-server-chdb.log shows PermissionError: Operation not permitted on pyvenv.cfg, your venv sits under a TCC-protected directory (~/Downloads, ~/Documents, ~/Desktop) — Claude Desktop subprocesses can't read those paths.

Fix: install elsewhere. Recommended is uvx (zero-config, isolated under ~/.local/share/uv/):

{ "mcpServers": { "chdb": { "command": "uvx", "args": ["chdb-mcp"] } } }

Or build a venv yourself under ~/.local/share/chdb-mcp/.venv and point Claude Desktop at its chdb-mcp binary.

query_file returns "path is not under any prefix"

The allowlist resolves symlinks on both sides (so /tmp matches /private/tmp on macOS). If you still hit this, check the resolved form printed in the error against python -c "from pathlib import Path; print(Path('YOUR_PATH').resolve())".

"Cannot execute query in readonly mode"

SET readonly=2 blocks DDL/DML by design. Rewrite as a pure SELECT, or restart with CHDB_MCP_WRITE=1.

Per-server logs

~/Library/Logs/Claude/mcp-server-chdb.log   # startup diagnostics + stderr
~/Library/Logs/Claude/mcp.log                # all servers' JSON-RPC traffic

Development

git clone https://github.com/chdb-io/chdb-mcp && cd chdb-mcp
pip install -e ".[dev]"
pytest && ruff check src tests

License

Apache 2.0 — see LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

chdb_mcp-0.2.0.tar.gz (25.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

chdb_mcp-0.2.0-py3-none-any.whl (20.2 kB view details)

Uploaded Python 3

File details

Details for the file chdb_mcp-0.2.0.tar.gz.

File metadata

  • Download URL: chdb_mcp-0.2.0.tar.gz
  • Upload date:
  • Size: 25.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for chdb_mcp-0.2.0.tar.gz
Algorithm Hash digest
SHA256 2ae0f330eba3bb3c1f0eec4ee3b3fe52e0d79fcf0b968e1a06b5bd2098d88023
MD5 28161a7435fff31775824cf92b87e292
BLAKE2b-256 8dc1e5191cde16c4623d444c4b43150cd5b50eda8ed5fdd0b31f5f664dccdd1d

See more details on using hashes here.

Provenance

The following attestation bundles were made for chdb_mcp-0.2.0.tar.gz:

Publisher: publish.yml on chdb-io/chdb-mcp

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file chdb_mcp-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: chdb_mcp-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 20.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for chdb_mcp-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 5d0f9e4dd57b6a9b1e983b36313cfc891db37372b9f47f27a1140a889aec188a
MD5 1381f9a3f949e82f20a74988fcbeb261
BLAKE2b-256 5123b8f924b819f3e11aba228064e9ad78975d10d9eee20bf76d9c5096716a56

See more details on using hashes here.

Provenance

The following attestation bundles were made for chdb_mcp-0.2.0-py3-none-any.whl:

Publisher: publish.yml on chdb-io/chdb-mcp

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page