Agent-friendly Spark Connect CLI: read-only querying + async long-job control. No JVM, no Kerberos on the client.
spark-connect-cli (scq)
An agent-friendly Spark Connect CLI — read-only querying plus async control for long-running jobs.
Built for LLM agents and humans who live in a shell. Unlike spark-sql /
spark-submit, the client is a thin pure-Python gRPC client: no JVM, and
no Kerberos on the client side — the Spark Connect server authenticates with
its own keytab, so you just point at sc://host:15002 and go.
Why
- JSON-first, read-only by default. Safe for an agent to call for
exploration; writes/DDL are blocked unless you opt in (--allow-ddl).
- Long jobs don't block you. A multi-minute Spark job shouldn't trap an agent
in a 30-minute tool call. scq submits the job, hands back a durable job id,
and returns immediately. Poll it whenever you like; the handle survives a
client/container restart because it lives in an on-disk registry.
- Stable exit codes so a caller can branch without scraping text.
Install
pip install spark-connect-cli # once published
# or, from source:
pip install -e .
Quick start
export SPARK_REMOTE=sc://localhost:15002 # your Spark Connect endpoint
scq databases
scq tables mydb --like '%orders%'
scq describe mydb.orders
scq query "SELECT id, name FROM mydb.orders LIMIT 10"
Output is JSONEachRow (one JSON object per line) by default; pick another with
--format json|csv|tsv|table.
Read-only guard
scq query allows only SELECT/SHOW/DESCRIBE/EXPLAIN/WITH. Anything else exits
with code 3 unless you pass --allow-ddl.
| exit | meaning |
|---|---|
| 0 | success |
| 1 | query error (bad SQL) |
| 2 | connection error |
| 3 | blocked by the read-only guard |
| 4 | job-control error (no such job, …) |
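The guard and the exit-code table can be pictured with a small sketch. It is not scq's actual implementation: the `ExitCode` names, `first_keyword`, and `guard` are hypothetical, and a first-keyword check is only one plausible way to enforce the allow-list (it does not handle leading comments or parenthesised queries, for example).

```python
import re
from enum import IntEnum


class ExitCode(IntEnum):
    """Exit codes from the table above; the member names are hypothetical."""
    SUCCESS = 0
    QUERY_ERROR = 1
    CONNECTION_ERROR = 2
    BLOCKED = 3
    JOB_CONTROL_ERROR = 4


# Statement kinds the README lists as allowed without --allow-ddl.
READ_ONLY_KEYWORDS = {"SELECT", "SHOW", "DESCRIBE", "EXPLAIN", "WITH"}


def first_keyword(sql: str) -> str:
    """Return the first alphabetic token of the statement, upper-cased."""
    match = re.match(r"\s*([A-Za-z]+)", sql)
    return match.group(1).upper() if match else ""


def guard(sql: str, allow_ddl: bool = False) -> ExitCode:
    """Assumed guard logic: pass allow-listed statements, block the rest."""
    if allow_ddl or first_keyword(sql) in READ_ONLY_KEYWORDS:
        return ExitCode.SUCCESS
    return ExitCode.BLOCKED


if __name__ == "__main__":
    for statement in ("select 1", "DROP TABLE mydb.orders"):
        print(statement, "->", guard(statement).name)
```

A caller that shells out to scq could compare the process return code against the same values instead of parsing stderr.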
Async jobs (Layer A)
Long work runs detached and is tracked by a file-based registry under
$SCQ_JOBS_DIR (default ~/.spark-connect-cli/jobs).
# submit — returns a job id immediately, does NOT block
scq sync ods.orders --to clickhouse
# {"job_id": "j-20260625-...", "state": "running", "message": "... poll with ..."}
scq jobs list # all jobs + state
scq jobs status j-20260625-... # full status (rows, timings, pid, exit code)
scq jobs logs j-20260625-... --tail 40
scq jobs cancel j-20260625-... # kills the whole process group
Design: each job is a directory with meta.json (state machine:
submitted → running → succeeded|failed|cancelled) and out.log. The worker
runs in its own process group, so cancel kills the entire tree (no orphans).
A running job whose process has vanished is reconciled to failed on the next
status read, so status never lies.
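The design above can be sketched in a few functions. This is an illustration under assumptions, not the project's code: the `pid` and `state` keys in meta.json, the helper names, and the use of SIGTERM are all hypothetical, and the calls are POSIX-only.

```python
import json
import os
import signal
import subprocess
from pathlib import Path


def spawn_detached(argv, log_path):
    """Start a worker in its own session, and therefore its own process group."""
    with open(log_path, "ab") as log:
        return subprocess.Popen(
            argv,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # child calls setsid() before exec
        )


def cancel(pid: int) -> None:
    """Signal the worker's whole process group so no child is orphaned."""
    os.killpg(os.getpgid(pid), signal.SIGTERM)


def pid_alive(pid: int) -> bool:
    """Probe a pid with signal 0, which checks existence without killing."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def read_status(job_dir: Path) -> dict:
    """Load meta.json and reconcile a vanished 'running' job to 'failed'."""
    meta_path = job_dir / "meta.json"
    meta = json.loads(meta_path.read_text())
    if meta.get("state") == "running" and not pid_alive(int(meta["pid"])):
        meta["state"] = "failed"
        tmp_path = meta_path.with_suffix(".tmp")
        tmp_path.write_text(json.dumps(meta))
        os.replace(tmp_path, meta_path)  # atomic swap on the same filesystem
    return meta
```

Writing the reconciled state through a temporary file and `os.replace` means a concurrent reader never sees a half-written meta.json.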
Hive → ClickHouse sync
scq sync is one job kind built on the async subsystem. It uses Spark direct
write: a Spark Connect job reads the Hive table and writes to ClickHouse over
JDBC. The write runs on the executors, so rows never pass through this process or
the agent.
Modes control write parallelism — single (one connection, small tables),
parallel (N partitions, large tables), auto (picks by row count).
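As a rough illustration of how auto might pick between the two explicit modes, here is a minimal sketch. The function name, the default threshold of one million rows, and the error handling are assumptions; the README does not state the actual cut-off.

```python
def pick_mode(row_count: int, requested: str = "auto",
              threshold: int = 1_000_000) -> str:
    """Resolve the write mode; `threshold` is a hypothetical cut-off."""
    if requested in ("single", "parallel"):
        return requested  # an explicit choice always wins
    if requested != "auto":
        raise ValueError(f"unknown mode: {requested!r}")
    # Small tables use one connection, large tables use N partitions.
    return "single" if row_count < threshold else "parallel"


if __name__ == "__main__":
    print(pick_mode(5_000))        # small table
    print(pick_mode(50_000_000))   # large table
```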
Requires:
- clickhouse-jdbc on the Spark Connect server classpath (/opt/spark/jars/),
- cluster→ClickHouse network egress,
- a JDBC URL with credentials via --ch-jdbc / $SCQ_CH_JDBC,
- the target ClickHouse table created beforehand with a suitable engine
(Spark append won't build a usable MergeTree table for you; create it first,
e.g. with the chsql skill).
Introspection
scq meta db.table # one JSON: schema, created time, location,
# partitions, file count/size, mtime range
scq meta db.table --count # also run an exact count(*)
scq exec stages?status=active # read-only Spark REST passthrough
scq exec executors
scq exec stages/<id>/<attempt>/taskSummary?quantiles=0.5,0.95,1.0 # skew: max/median
scq exec auto-discovers the running Spark app via the YARN ResourceManager and
proxies its monitoring REST API (GET-only). Set the RM base with $SCQ_YARN_RM.
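The skew check mentioned for taskSummary (max over median) could be computed from the response along these lines. The sketch assumes the payload carries a `quantiles` array and per-metric arrays such as `executorRunTime` aligned with it, as in Spark's monitoring REST API; the sample values are invented for illustration and are not real scq output.

```python
import json


def skew_ratio(summary: dict, metric: str = "executorRunTime") -> float:
    """Return max/median for one metric of a taskSummary response."""
    quantiles = summary["quantiles"]
    values = summary[metric]
    median = values[quantiles.index(0.5)]
    maximum = values[quantiles.index(1.0)]
    if median == 0:
        return float("inf") if maximum > 0 else 1.0
    return maximum / median


if __name__ == "__main__":
    # Invented sample shaped like a response for quantiles=0.5,0.95,1.0.
    sample = json.loads(
        '{"quantiles": [0.5, 0.95, 1.0],'
        ' "executorRunTime": [100.0, 400.0, 800.0]}'
    )
    print(skew_ratio(sample))  # 8.0: the slowest task ran 8x the median
```

A ratio near 1 suggests evenly sized tasks; a large ratio points at a few straggler partitions.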
Reading scq exec executors — the maxMemory field is Spark's
storage/cache pool ((heap − 300 MB reserved) × 0.6), not the executor's
total memory: a 512 MB executor reports ~93 MB, a 1536 MB driver ~741 MB. The
real heap is spark.executor.memory (+ off-heap overhead). The driver row has
0 cores and runs no tasks. With dynamic allocation, idle executors are released —
so the list may show only the driver when nothing is running.
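The arithmetic behind maxMemory can be checked with a short sketch. The 300 MB reserve and the 0.6 factor come from the formula above; the function names are hypothetical. Plugging in the nominal 512 MB gives a larger figure than the ~93 MB quoted, which is consistent with the JVM exposing somewhat less usable heap than the configured size, though that explanation is an assumption rather than something stated here.

```python
RESERVED_MB = 300       # reserved memory from the formula above
MEMORY_FRACTION = 0.6   # the 0.6 multiplier from the formula above


def pool_mb(heap_mb: float) -> float:
    """Storage/cache pool implied by a given usable heap size."""
    return max(heap_mb - RESERVED_MB, 0.0) * MEMORY_FRACTION


def implied_heap_mb(max_memory_mb: float) -> float:
    """Invert the formula: usable heap implied by a reported maxMemory."""
    return max_memory_mb / MEMORY_FRACTION + RESERVED_MB


if __name__ == "__main__":
    print(f"{pool_mb(1536):.1f}")        # 741.6, the driver figure above
    print(f"{pool_mb(512):.1f}")         # 127.2 with the nominal heap
    print(f"{implied_heap_mb(93):.1f}")  # 455.0 usable heap implied by ~93 MB
```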
Configuration
| env | default | meaning |
|---|---|---|
| SPARK_REMOTE | sc://localhost:15002 | Spark Connect endpoint |
| SCQ_JOBS_DIR | ~/.spark-connect-cli/jobs | job registry (put on a persistent volume) |
| SCQ_MAX_ROWS | 10000 | default row cap for query |
| SCQ_CONNECT_TIMEOUT | 10 | seconds to wait for the endpoint's TCP socket before failing with exit 2 (keeps a dead endpoint from hanging the caller) |
| SCQ_CH_JDBC | (none) | ClickHouse JDBC URL for sync path A |
| SCQ_YARN_RM | http://namenode.hive-net:8088 | YARN RM base for scq exec |
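A minimal sketch of how these variables might be read, and of the TCP probe that SCQ_CONNECT_TIMEOUT describes, follows. The environment variable names and defaults are taken from the table; `load_config`, `endpoint_reachable`, and the dictionary keys are hypothetical, and scq's real connection check may work differently.

```python
import os
import socket
from urllib.parse import urlparse


def load_config(env=None) -> dict:
    """Read the documented variables, falling back to the table's defaults."""
    env = os.environ if env is None else env
    return {
        "remote": env.get("SPARK_REMOTE", "sc://localhost:15002"),
        "jobs_dir": os.path.expanduser(
            env.get("SCQ_JOBS_DIR", "~/.spark-connect-cli/jobs")),
        "max_rows": int(env.get("SCQ_MAX_ROWS", "10000")),
        "connect_timeout": float(env.get("SCQ_CONNECT_TIMEOUT", "10")),
        "ch_jdbc": env.get("SCQ_CH_JDBC"),  # no default
    }


def endpoint_reachable(remote: str, timeout: float) -> bool:
    """Try a plain TCP connect to the sc://host:port endpoint."""
    parsed = urlparse(remote)
    address = (parsed.hostname, parsed.port or 15002)
    try:
        with socket.create_connection(address, timeout=timeout):
            return True
    except OSError:  # refused, timed out, unresolvable host, ...
        return False


if __name__ == "__main__":
    cfg = load_config()
    ok = endpoint_reachable(cfg["remote"], cfg["connect_timeout"])
    # Exit code 2 is the documented "connection error" code.
    raise SystemExit(0 if ok else 2)
```

Probing the socket first bounds the wait on a dead endpoint to the configured timeout instead of whatever the gRPC channel would otherwise allow.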
Use with an LLM agent
SKILL.md ships a ready-made skill (discover-before-query workflow, async-job
etiquette, type-mapping table). Drop it into your agent's skills directory and
the agent drives scq through a shell/Bash tool.
Roadmap
- Clarify in SKILL.md that scq exec executors maxMemory is the storage pool, not total memory (already noted above).
- scq cluster: optional read-only passthrough to the YARN ResourceManager REST (apps / queues / nodes), rounding out the introspection plane.
- Vendored/offline install path (bundle wheels) for air-gapped deployments.
License
MIT