Agent-friendly Spark Connect CLI: read-only querying + async long-job control. No JVM, no Kerberos on the client.

spark-connect-cli (scq)

An agent-friendly Spark Connect CLI — read-only querying plus async control for long-running jobs.

Built for LLM agents and humans who live in a shell. Unlike spark-sql / spark-submit, the client is a thin pure-Python gRPC client: no JVM, and no Kerberos on the client side — the Spark Connect server authenticates with its own keytab, so you just point at sc://host:15002 and go.

Why

  • JSON-first, read-only by default. Safe for an agent to call for exploration; writes/DDL are blocked unless you opt in (--allow-ddl).
  • Long jobs don't block you. A multi-minute Spark job shouldn't trap an agent in a 30-minute tool call. scq submits the job, hands back a durable job id, and returns immediately. Poll it whenever you like; the handle survives a client/container restart because it lives in an on-disk registry.
  • Stable exit codes so a caller can branch without scraping text.

Install

pip install spark-connect-cli
# or, from source:
pip install -e .

Quick start

export SPARK_REMOTE=sc://localhost:15002   # your Spark Connect endpoint

scq databases
scq tables mydb --like '%orders%'
scq describe mydb.orders
scq query "SELECT id, name FROM mydb.orders LIMIT 10"

Output is JSONEachRow (one JSON object per line) by default; pick another with --format json|csv|tsv|table.
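For callers that consume the default output from Python, parsing could look like the minimal sketch below. The sample rows are invented for illustration; only the one-JSON-object-per-line shape comes from the description above.

```python
import json
from typing import Iterable, Iterator


def parse_json_each_row(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line of JSONEachRow text."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines, e.g. a trailing newline
            yield json.loads(line)


# Hypothetical stdout of: scq query "SELECT id, name FROM mydb.orders LIMIT 2"
sample = '{"id": 1, "name": "alpha"}\n{"id": 2, "name": "beta"}\n'
rows = list(parse_json_each_row(sample.splitlines()))
```

Because each line is a complete JSON document, a caller can process rows as they arrive instead of buffering the whole result.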

Read-only guard

scq query allows only SELECT/SHOW/DESCRIBE/EXPLAIN/WITH. Anything else exits with code 3 unless you pass --allow-ddl.
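To make the rule concrete, here is one way such a guard could be written. This is an illustrative sketch, not scq's actual implementation: it assumes the check is made on the first keyword after leading whitespace and SQL comments, and the function name is_read_only is hypothetical.

```python
import re

# Leading keywords listed above as allowed for `scq query`.
READ_ONLY_KEYWORDS = {"SELECT", "SHOW", "DESCRIBE", "EXPLAIN", "WITH"}

# Matches leading whitespace, `-- line` comments and `/* block */` comments.
_LEADING_NOISE = re.compile(r"\A(?:\s+|--[^\n]*(?:\n|\Z)|/\*.*?\*/)*", re.DOTALL)


def is_read_only(sql: str) -> bool:
    """Return True if the first keyword of `sql` is in the allow-list."""
    body = _LEADING_NOISE.sub("", sql, count=1)
    match = re.match(r"[A-Za-z]+", body)
    return bool(match) and match.group(0).upper() in READ_ONLY_KEYWORDS
```

A first-keyword check is deliberately conservative and simple; a production guard would likely also need to handle multiple statements in one string.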

exit  meaning
0     success
1     query error (bad SQL)
2     connection error
3     blocked by the read-only guard
4     job-control error (no such job, …)
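A caller can branch on these codes without reading stderr. The sketch below mirrors the table in an enum; the action labels returned by next_action are hypothetical and only show the branching pattern, and run_scq assumes scq is on PATH.

```python
import subprocess
from enum import IntEnum


class ScqExit(IntEnum):
    """Exit codes from the table above."""
    SUCCESS = 0
    QUERY_ERROR = 1
    CONNECTION_ERROR = 2
    BLOCKED_BY_GUARD = 3
    JOB_CONTROL_ERROR = 4


def next_action(code: int) -> str:
    """Map an exit code to a (hypothetical) caller decision."""
    actions = {
        ScqExit.SUCCESS: "use-output",
        ScqExit.QUERY_ERROR: "fix-sql",
        ScqExit.CONNECTION_ERROR: "check-endpoint",
        ScqExit.BLOCKED_BY_GUARD: "ask-before-allow-ddl",
        ScqExit.JOB_CONTROL_ERROR: "check-job-id",
    }
    return actions.get(code, "unknown")


def run_scq(args: list) -> tuple:
    """Run scq and return (exit code, stdout). Not called in this sketch."""
    proc = subprocess.run(["scq", *args], capture_output=True, text=True)
    return proc.returncode, proc.stdout
```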

Async jobs (Layer A)

Long work runs detached and is tracked by a file-based registry under $SCQ_JOBS_DIR (default ~/.spark-connect-cli/jobs).

# submit — returns a job id immediately, does NOT block
scq sync ods.orders --to clickhouse
# {"job_id": "j-20260625-...", "state": "running", "message": "... poll with ..."}

scq jobs list                       # all jobs + state
scq jobs status j-20260625-...      # full status (rows, timings, pid, exit code)
scq jobs logs   j-20260625-... --tail 40
scq jobs cancel j-20260625-...      # kills the whole process group
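An agent that wants to wait for a job can poll in short, separate calls rather than holding one long tool call open. The sketch below assumes that scq jobs status prints JSON containing a "state" key, as the submit output does; that output shape, the function names, and the interval and timeout defaults are assumptions.

```python
import json
import subprocess
import time
from typing import Callable

TERMINAL_STATES = {"succeeded", "failed", "cancelled"}


def scq_job_state(job_id: str) -> str:
    """Read a job's state via `scq jobs status` (assumed to print JSON)."""
    proc = subprocess.run(
        ["scq", "jobs", "status", job_id],
        capture_output=True, text=True, check=True,
    )
    return json.loads(proc.stdout)["state"]


def wait_for_job(
    job_id: str,
    get_state: Callable[[str], str] = scq_job_state,
    interval_s: float = 15.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal state, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state(job_id)
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {state} after {timeout_s}s")
        sleep(interval_s)
```

Passing get_state and sleep as parameters keeps the loop testable without a Spark cluster.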

Design: each job is a directory with meta.json (state machine: submitted → running → succeeded|failed|cancelled) and out.log. The worker runs in its own process group, so cancel kills the entire tree (no orphans). A running job whose process has vanished is reconciled to failed on the next status read, so status never lies.
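The reconcile-on-read and process-group ideas can be sketched as follows. This is not scq's source: the meta.json keys "state" and "pid" are assumed field names, and cancel_group assumes the worker was started in its own session (for example with subprocess.Popen(..., start_new_session=True)).

```python
import json
import os
import signal
import tempfile
from pathlib import Path
from typing import Callable


def pid_alive(pid: int) -> bool:
    """Signal 0 checks that a process exists without actually signalling it."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but belongs to another user
    return True


def reconcile(job_dir: Path, is_alive: Callable[[int], bool] = pid_alive) -> dict:
    """Return the job's meta, marking it failed if its process has vanished."""
    meta_path = job_dir / "meta.json"
    meta = json.loads(meta_path.read_text())
    if meta["state"] == "running" and not is_alive(meta["pid"]):
        meta["state"] = "failed"
        tmp_path = job_dir / "meta.json.tmp"
        tmp_path.write_text(json.dumps(meta))
        os.replace(tmp_path, meta_path)  # atomic swap, readers never see a partial file
    return meta


def cancel_group(pid: int) -> None:
    """Send SIGTERM to the worker's whole process group (Unix only)."""
    os.killpg(os.getpgid(pid), signal.SIGTERM)


# Demo with a throwaway registry entry; the pid value is arbitrary because
# liveness is injected here instead of being probed.
with tempfile.TemporaryDirectory() as tmp:
    job_dir = Path(tmp)
    (job_dir / "meta.json").write_text(json.dumps({"state": "running", "pid": 12345}))
    reconciled = reconcile(job_dir, is_alive=lambda pid: False)
    on_disk = json.loads((job_dir / "meta.json").read_text())
```

Writing to a temporary file and renaming it keeps a concurrent status read from seeing half-written JSON.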

Hive → ClickHouse sync

scq sync is one job kind built on the async subsystem. It uses Spark direct write: a Spark Connect job reads the Hive table and writes to ClickHouse over JDBC. The write runs on the executors, so rows never pass through this process or the agent.

Modes control write parallelism — single (one connection, small tables), parallel (N partitions, large tables), auto (picks by row count).
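As a rough illustration of how a mode could translate into a Spark JDBC write, consider the sketch below. The row-count cutoff, the default parallelism of 8, and both function names are hypothetical; this document does not state the values scq uses. write_to_clickhouse expects a PySpark DataFrame and is not executed here.

```python
from typing import Optional

# Hypothetical cutoff for "auto"; scq's real threshold is not documented here.
AUTO_PARALLEL_MIN_ROWS = 1_000_000


def resolve_partitions(mode: str, row_count: Optional[int] = None,
                       parallelism: int = 8) -> int:
    """Translate a sync mode into a number of write partitions."""
    if mode == "single":
        return 1
    if mode == "parallel":
        return parallelism
    if mode == "auto":
        if row_count is None:
            raise ValueError("auto mode needs a row count")
        return parallelism if row_count >= AUTO_PARALLEL_MIN_ROWS else 1
    raise ValueError(f"unknown mode: {mode!r}")


def write_to_clickhouse(df, jdbc_url: str, table: str, partitions: int) -> None:
    """Append a PySpark DataFrame over JDBC; each partition opens one connection."""
    (
        df.repartition(partitions)
        .write.format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", table)
        .mode("append")
        .save()
    )
```

Each write partition holds its own JDBC connection, so the partition count is effectively the number of concurrent inserts ClickHouse will see.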

Requires:

  • clickhouse-jdbc on the Spark Connect server classpath (/opt/spark/jars/),
  • cluster→ClickHouse network egress,
  • a JDBC URL with credentials via --ch-jdbc / $SCQ_CH_JDBC,
  • the target ClickHouse table created beforehand with a suitable engine (Spark append won't build a usable MergeTree table for you — create it first, e.g. with the chsql skill).

Introspection

scq meta db.table            # one JSON: schema, created time, location,
                             # partitions, file count/size, mtime range
scq meta db.table --count    # also run an exact count(*)

scq exec stages?status=active            # read-only Spark REST passthrough
scq exec executors
scq exec stages/<id>/<attempt>/taskSummary?quantiles=0.5,0.95,1.0   # skew: max/median
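The max/median skew check on a taskSummary response can be computed as in this sketch. It assumes the response carries a "quantiles" list with parallel per-metric lists such as "executorRunTime", as Spark's monitoring REST API returns; the sample numbers are invented.

```python
def skew_ratio(task_summary: dict, metric: str = "executorRunTime") -> float:
    """Return max/median for one metric of a taskSummary response."""
    by_quantile = dict(zip(task_summary["quantiles"], task_summary[metric]))
    median, maximum = by_quantile[0.5], by_quantile[1.0]
    if median == 0:
        return float("inf") if maximum > 0 else 1.0
    return maximum / median


# Invented sample: the slowest task ran far longer than the median task.
sample_summary = {
    "quantiles": [0.5, 0.95, 1.0],
    "executorRunTime": [1200.0, 4800.0, 36000.0],
}
ratio = skew_ratio(sample_summary)
```

A ratio near 1 means tasks are evenly sized; a large ratio points at a few oversized partitions.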

scq exec auto-discovers the running Spark app via the YARN ResourceManager and proxies its monitoring REST API (GET-only). Set the RM base with $SCQ_YARN_RM.
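The discovery step can be approximated with the YARN ResourceManager REST API and the RM web proxy. The sketch below only builds URLs and parses a response; the function names and the rule of picking the first running Spark application are assumptions about behaviour that this document does not spell out.

```python
from typing import Optional


def running_apps_url(rm_base: str) -> str:
    """YARN RM REST endpoint that lists running applications."""
    return f"{rm_base.rstrip('/')}/ws/v1/cluster/apps?states=RUNNING"


def pick_spark_app(apps_response: dict) -> Optional[str]:
    """Return the id of the first SPARK app, or None if there is none."""
    # YARN returns {"apps": null} when nothing matches the filter.
    apps = (apps_response.get("apps") or {}).get("app") or []
    for app in apps:
        if app.get("applicationType") == "SPARK":
            return app["id"]
    return None


def spark_rest_url(rm_base: str, app_id: str, path: str) -> str:
    """Spark monitoring REST URL reached through the RM web proxy."""
    base = rm_base.rstrip("/")
    return f"{base}/proxy/{app_id}/api/v1/applications/{app_id}/{path.lstrip('/')}"


# Invented RM response and host name, for illustration only.
sample_apps = {"apps": {"app": [
    {"id": "application_1_0001", "applicationType": "MAPREDUCE"},
    {"id": "application_1_0002", "applicationType": "SPARK"},
]}}
app_id = pick_spark_app(sample_apps)
url = spark_rest_url("http://rm.example:8088/", app_id, "stages?status=active")
```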

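Reading scq exec executors: the maxMemory field is Spark's storage/cache pool, (heap − 300 MB reserved) × 0.6, not the executor's total memory. A 512 MB executor reports ~93 MB and a 1536 MB driver ~741 MB. Here "heap" is the heap size the JVM itself reports, which can be lower than the configured value; that is likely why the 512 MB executor shows ~93 MB rather than the ~127 MB that the nominal figure would give. The configured heap is spark.executor.memory (+ off-heap overhead). The driver row has 0 cores and runs no tasks. With dynamic allocation, idle executors are released, so the list may show only the driver when nothing is running.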
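The formula can be checked with a few lines of arithmetic. The 455 MB and 1535 MB inputs below are back-solved from the two reported figures, on the assumption that "heap" means the size the JVM reports rather than the configured value; 0.6 is the default spark.memory.fraction.

```python
RESERVED_MB = 300.0    # fixed reservation in the formula above
MEMORY_FRACTION = 0.6  # default spark.memory.fraction


def pool_mb(jvm_heap_mb: float, fraction: float = MEMORY_FRACTION) -> float:
    """(heap - 300 MB reserved) * fraction, floored at zero."""
    return max(jvm_heap_mb - RESERVED_MB, 0.0) * fraction


executor_pool = round(pool_mb(455.0), 1)   # assumed JVM-reported heap of a 512 MB executor
driver_pool = round(pool_mb(1535.0), 1)    # assumed JVM-reported heap of a 1536 MB driver
nominal_pool = round(pool_mb(512.0), 1)    # what the nominal 512 MB would give instead
```

Comparing executor_pool with nominal_pool shows how much the gap between configured and JVM-reported heap moves the maxMemory figure.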

Configuration

env           default                        meaning
SPARK_REMOTE  sc://localhost:15002           Spark Connect endpoint
SCQ_JOBS_DIR  ~/.spark-connect-cli/jobs      job registry (put on a persistent volume)
SCQ_MAX_ROWS  10000                          default row cap for query
SCQ_CH_JDBC   (none)                         ClickHouse JDBC URL for sync path A
SCQ_YARN_RM   http://namenode.hive-net:8088  YARN RM base for scq exec
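A wrapper script that needs the same settings could resolve them as in this sketch. The ScqConfig class and load_config function are hypothetical helpers, not part of scq; the variable names and defaults are the ones listed above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ScqConfig:
    spark_remote: str
    jobs_dir: str
    max_rows: int
    ch_jdbc: Optional[str]
    yarn_rm: str


def load_config(env: Mapping[str, str] = os.environ) -> ScqConfig:
    """Resolve settings from the environment, falling back to the listed defaults."""
    return ScqConfig(
        spark_remote=env.get("SPARK_REMOTE", "sc://localhost:15002"),
        jobs_dir=os.path.expanduser(env.get("SCQ_JOBS_DIR", "~/.spark-connect-cli/jobs")),
        max_rows=int(env.get("SCQ_MAX_ROWS", "10000")),
        ch_jdbc=env.get("SCQ_CH_JDBC") or None,  # no default: must be supplied for sync
        yarn_rm=env.get("SCQ_YARN_RM", "http://namenode.hive-net:8088"),
    )


defaults = load_config({})
```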

Use with an LLM agent

SKILL.md ships a ready-made skill (discover-before-query workflow, async-job etiquette, type-mapping table). Drop it into your agent's skills directory and the agent drives scq through a shell/Bash tool.

License

MIT
