Local-first Big Data query diagnostics for Apache Impala.
Query Doctor
Last reviewed: 2026-05-13
Language: English | Russian
Query Doctor is a local-first Apache Impala query diagnostic tool. It helps data engineers explain slow, suspicious, or resource-heavy queries without pasting raw operational data into a chat tool.
It runs near the operator's own credentials, collects bounded read-only context from Cloudera Manager or direct Impala daemon endpoints, extracts deterministic facts in Python, and can generate validated reports without treating an LLM as a source of truth. Trusted reports default to English; Russian output uses the same language-specific prompt, normalizer, and validator boundary.
Core rule:
Python owns facts. LLM owns wording only.
Query Doctor is not a free-form chat wrapper over raw profiles, and it is not a SQL execution tool.
What It Does
- Scans completed Recent queries, Running queries, or one explicit Known Query ID for Apache Impala.
- Works with Cloudera Manager when available, or with direct Impala daemon profile/query-list endpoints for vanilla, Ambari-style, or otherwise non-Cloudera-Manager clusters.
- Optionally collects bounded Prometheus runtime metric summaries for direct Impala workflows and bounded read-only Impala metadata through impala-shell.
- Ranks suspicious cases and action candidates from deterministic analyzer facts, not LLM scoring.
- Generates trusted reports only after deterministic normalization, sanitization, and validation.
- Provides a read-only Query Optimizer workflow for pasted SQL review, plus an explicit details-page optimizer action for server-owned analyzed cases.
- Keeps raw SQL, raw profiles, raw metadata, local paths, secrets, subprocess output, model/runtime internals, and raw artifact filenames out of browser and trusted report surfaces.
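The ranking point above can be illustrated with a minimal sketch. The field names (`case_id`, `severity`, `wall_time_s`) and the sort order are hypothetical and are not Query Doctor's schema. The only idea shown is that ordering is computed from analyzer facts with a stable tie-break, so the same inputs always produce the same ranking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseFacts:
    # Hypothetical analyzer output; not the project's real schema.
    case_id: str
    severity: int        # assumed scale: 0 = informational ... 3 = critical
    wall_time_s: float   # measured wall-clock time in seconds


def rank_cases(cases):
    """Order by severity, then wall time, then case ID as a stable tie-break."""
    return sorted(cases, key=lambda c: (-c.severity, -c.wall_time_s, c.case_id))


cases = [
    CaseFacts("q-b", severity=2, wall_time_s=40.0),
    CaseFacts("q-a", severity=2, wall_time_s=40.0),
    CaseFacts("q-c", severity=3, wall_time_s=5.0),
]
print([c.case_id for c in rank_cases(cases)])  # ['q-c', 'q-a', 'q-b']
```

No model call participates in the ordering, which is what keeps the ranking reproducible.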
Supported Scope
| Area | Supported today | Not currently supported |
|---|---|---|
| Query engine | Apache Impala | Other engines are roadmap seams only. |
| Cloudera Manager | Full Recent discovery/profile/metrics/events context for Impala workflows | Generic cluster diagnosis beyond the Query Doctor flow. |
| Direct Impala | Bounded Recent scans, Running scans, and one Known Query ID through impalad daemon endpoints | Cloudera Manager events, broad log scraping, or SQL execution. |
| Runtime metrics | Optional bounded Prometheus summaries for configured direct Impala workflows | Raw time-series output or arbitrary PromQL from users. |
| Metadata | Read-only allowlisted metadata statements through impala-shell | User SQL execution or unbounded metadata crawling. |
| Reports and optimizer | Python-owned facts, validation, and explicit selected-case actions | LLM output as trusted evidence or automatic batch LLM jobs. |
Future Big Data SQL/lakehouse engines, broader providers, prepared event/log sources, and Cluster Doctor workflows remain roadmap seams, not current support.
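As a rough illustration of a bounded metric summary as opposed to raw time-series output, the sketch below reduces a list of samples to a fixed-size dictionary. The function name, the `max_points` cap, and the summary fields are assumptions made for this example; they do not describe Query Doctor's Prometheus integration.

```python
from statistics import fmean


def summarize_series(samples, max_points=1000):
    """Reduce (timestamp, value) samples to a fixed-size summary.

    max_points is an assumed cap: samples beyond it are ignored, so the
    work done stays bounded however much data the source returned.
    """
    values = [float(value) for _, value in samples[:max_points]]
    if not values:
        return {"count": 0, "min": None, "max": None, "mean": None}
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": fmean(values),
    }


print(summarize_series([(0, 1.0), (15, 3.0)]))
```

The summary has the same shape for two samples or two million, so nothing downstream ever handles the raw series.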
Install
Install the current public package from PyPI:
python3 -m venv .venv
. .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install query-doctor
For local development from a checkout, use an editable install:
python3 -m venv .venv
. .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e .
For contributor tooling, install the development extra:
python -m pip install -e ".[dev]"
pre-commit install
In a network-restricted environment, install from a prebuilt wheel or make sure the build dependencies are already present locally, then install the checkout:
python -m pip install .
Local JSON configuration is documented in docs/configuration.md.
The preferred workstation path is ~/.qdcreds/query-doctor-config.json;
secrets still stay in environment variables or local env files.
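The split between JSON settings and environment-held secrets might look like the sketch below. Only the config path and the `privacy_mode` default come from this README; the function names and the `QD_EXAMPLE_PASSWORD` variable are hypothetical, and docs/configuration.md remains the authoritative description of the real schema.

```python
import json
import os
from pathlib import Path

# Preferred workstation path named in this README.
DEFAULT_CONFIG = Path.home() / ".qdcreds" / "query-doctor-config.json"


def load_local_config(path=DEFAULT_CONFIG):
    """Return non-secret settings; a missing file yields the defaults."""
    settings = {"privacy_mode": True}  # README: privacy_mode defaults to true
    path = Path(path)
    if path.is_file():
        loaded = json.loads(path.read_text(encoding="utf-8"))
        if not isinstance(loaded, dict):
            raise ValueError("config file must contain a JSON object")
        settings.update(loaded)
    return settings


def read_secret(env_name):
    """Read a secret from the environment instead of the JSON file."""
    value = os.environ.get(env_name)
    if not value:
        raise RuntimeError(f"environment variable {env_name} is not set")
    return value


if __name__ == "__main__":
    print(load_local_config().get("privacy_mode"))
```

A caller would load settings once with `load_local_config()` and fetch credentials at the point of use, for example `read_secret("QD_EXAMPLE_PASSWORD")`, so that secrets are never written to the JSON file.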
Quickstart Smoke
Run the deterministic local checks first. They do not call Cloudera Manager, Impala, Ollama, or the network:
query-doctor-demo-preflight
query-doctor-demo --out /tmp/query-doctor-demo-pack --overwrite
query-doctor-web --batch-summary /tmp/query-doctor-demo-pack/batch_summary.json
Open the localhost URL printed by query-doctor-web. The synthetic demo pack is
local-only and contains no real SQL, profiles, metadata, hostnames, users, or
credentials.
The synthetic demo follows the same safety shape as real local workflows:
flowchart LR
DemoPack[Synthetic demo pack] --> Web[Local web UI]
Web --> Ranked[Ranked cases]
Ranked --> Details[Details page]
Details --> Facts[Analyzer-owned facts]
Details --> Report[Explicit trusted report action]
Details --> Optimizer[Explicit optimizer action]
Console Scripts
After installation, use the packaged entry points:
query-doctor-analyze --help
query-doctor-batch-recent --help
query-doctor-cleanup-generated --help
query-doctor-cm-events --help
query-doctor-cm-sample-smoke --help
query-doctor-collect-cm-profiles --help
query-doctor-collect-impala-context --help
query-doctor-corpus-smoke --help
query-doctor-demo --help
query-doctor-demo-preflight --help
query-doctor-optimize-query --help
query-doctor-pipeline --help
query-doctor-report --help
query-doctor-web --help
Root-level compatibility launchers have been removed. Use the query-doctor-*
commands, or python -m query_doctor.cli.<command_module> when running directly
from a checkout without installing console scripts.
Main Workflows
Web UI
query-doctor-web --help
The local web UI exposes:
- Diagnose: the primary screen for Recent query triage. "Finished queries" is the default target; "Running now" is available as lower-confidence live context.
- Known Query ID: a secondary mode inside Diagnose for one explicit Impala query ID. It uses Cloudera Manager by default or direct Impala daemon profile endpoints when query_profile_source=impala is configured.
- Details pages with deterministic findings, evidence context, and explicit LLM Report / Query LLM optimizer actions.
- Help: curated in-product workflow, safety, and documentation guidance.
The pasted-SQL Query Optimizer remains a read-only compatibility route and
test surface. It does not execute SQL and does not echo submitted SQL after
submit, but it is not promoted as a primary navigation item while profile-backed
diagnosis is the main product workflow.
Validated reports and details-page optimizer drafts are generated only by explicit user action for selected cases.
CLI And Headless Use
The packaged CLI entry points cover analyzer runs, batch Recent scans, profile collection, metadata collection, reports, optimizer review, demo generation, and cleanup. They are intended for local diagnosis, automation in a controlled environment, and CI-style smoke checks.
For team workflows, prefer a pinned project version and shared conventions such as a reports repository, scheduled headless scans under a controlled service account, a team jumpbox, or a shared local LLM endpoint. Query Doctor itself remains local-first and single-user unless a future shared-deploy design adds authentication, authorization, tenant/job isolation, audit logging, TLS trust, and resource limits.
Analyzer
query-doctor-analyze CASE_DIR
The analyzer reads collected local case files and writes deterministic facts. It does not call Cloudera Manager, Impala, Ollama, or the report writer.
Pipeline
query-doctor-pipeline CASE_DIR --stop-after-analysis
Pipeline mode runs analyzer-first, can optionally collect bounded metadata when configured, and generates reports only when requested.
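Bounded, allowlisted metadata collection can be pictured as building statements from a fixed set of kinds and a validated table name, never from free text. The allowlist below is an assumption chosen for illustration (these are ordinary Impala metadata statements), and the function name is hypothetical; the project's actual allowlist may differ.

```python
import re

# Assumed allowlist for illustration only; not the project's real list.
ALLOWED_KINDS = (
    "DESCRIBE",
    "SHOW TABLE STATS",
    "SHOW COLUMN STATS",
    "SHOW PARTITIONS",
)

_IDENT = r"[A-Za-z_][A-Za-z0-9_]*"
_TABLE = re.compile(rf"{_IDENT}(?:\.{_IDENT})?")


def build_metadata_statement(kind, table):
    """Return one read-only statement, or raise ValueError if not allowed."""
    if kind not in ALLOWED_KINDS:
        raise ValueError(f"statement kind not allowlisted: {kind!r}")
    if not _TABLE.fullmatch(table):
        raise ValueError(f"unexpected table name: {table!r}")
    return f"{kind} {table}"


print(build_metadata_statement("SHOW TABLE STATS", "sales.orders"))
```

Because the table name must match a strict identifier pattern, input such as `t; DROP TABLE x` is rejected before any statement is built.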
Query Optimizer
query-doctor-optimize-query --help
The Query Optimizer accepts one safe read-only SELECT or WITH statement for
analysis. It never executes SQL, never echoes pasted SQL back after submit, and
trusts SQL drafts only when Python-owned recipes and validation prove the
supported transform.
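The "one safe read-only SELECT or WITH statement" rule can be approximated with a deliberately conservative gate. The sketch below is not Query Doctor's checker: it does no real SQL parsing, the keyword list is an assumption, and it will reject some harmless queries (for example, one with a semicolon or the word "insert" inside a string literal). It only shows the shape of the check.

```python
import re

# Assumed list of keywords that indicate a write or DDL statement.
WRITE_KEYWORDS = {
    "INSERT", "UPSERT", "UPDATE", "DELETE", "CREATE", "DROP",
    "ALTER", "TRUNCATE", "REFRESH", "INVALIDATE", "COMPUTE", "LOAD",
}


def is_single_read_only_statement(sql):
    """Naive gate: exactly one statement, SELECT or WITH, no write keywords."""
    text = sql.strip()
    if text.endswith(";"):
        text = text[:-1].rstrip()
    if not text or ";" in text:
        return False  # empty input or more than one statement
    words = re.findall(r"[A-Za-z_]+", text.upper())
    if not words or words[0] not in {"SELECT", "WITH"}:
        return False
    # A WITH clause may precede a write statement in some dialects,
    # so the leading keyword alone is not enough.
    return not WRITE_KEYWORDS.intersection(words)


print(is_single_read_only_statement("WITH t AS (SELECT 1) SELECT * FROM t;"))
```

Erring toward rejection is the intended trade-off: a false rejection costs the user a retry, while a false acceptance would weaken the read-only guarantee.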
Cloudera Manager (CM) Events And Cluster Context
query-doctor-cm-events --help
The CM Events CLI is a read-only Cluster Doctor seam for Cloudera Manager event
summaries. It can write normalized event summaries plus schema-versioned
raw-free cluster_event_context.json and cluster_context.json artifacts.
Recent scan can also collect one bounded Cluster Event Context from Cloudera
Manager Events per scan window and show only raw-free cluster context status in
the web UI. These artifacts are not yet a Cluster Doctor web workflow or report
path.
Demo Preflight
query-doctor-demo-preflight
The demo preflight is deterministic and local. It checks git hygiene, safety-sensitive changed areas, browser/trusted-output denylist patterns, and focused test suggestions without LLM, network, Cloudera Manager, or Impala access.
Supported Deployment
Query Doctor is supported as a single-user, local-first tool run by an operator with their own local Cloudera Manager, Kerberos, Impala, Prometheus, and LLM credentials. Use localhost or a tightly controlled local bind for the web UI.
Do not deploy the current web UI as a shared service for a team or company. Shared deployments need a separate design for authentication, authorization, tenant/job isolation, audit logging, TLS/reverse-proxy trust, and resource limits before they are supported.
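A launcher script could guard against an accidental wide bind with a check like the one below. This is a sketch using the standard `ipaddress` module; the function name is hypothetical, and treating the literal name `localhost` as loopback assumes a conventional hosts configuration.

```python
import ipaddress


def is_loopback_bind(host):
    """Return True only for loopback bind addresses."""
    if host == "localhost":  # assumes localhost resolves to loopback
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # not an IP literal; refuse rather than guess


for candidate in ("127.0.0.1", "::1", "0.0.0.0"):
    print(candidate, is_loopback_bind(candidate))
```

`0.0.0.0` fails the check because it listens on every interface, which is the shared-service exposure this section warns against.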
Why Not A Chat Wrapper?
Query Doctor is built for operational diagnostics, where unsupported certainty is worse than saying "unknown." A chat wrapper over raw profiles would make it too easy for model wording to become accidental evidence.
Instead:
- collectors gather bounded, read-only, redacted inputs;
- analyzers extract deterministic facts;
- reports use LLMs only to phrase those facts;
- validators reject unsupported claims and unsafe output;
- browser surfaces show trusted summaries, not raw operational artifacts.
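One narrow slice of the validator step can be sketched as follows: reject any draft that quotes a number the analyzer never produced. The fact keys and function names are hypothetical, and a real validator would check far more than numbers; this only illustrates how model wording is kept from becoming evidence.

```python
import re

_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unsupported_numbers(draft, facts):
    """Return numbers quoted in the draft that no analyzer fact supports."""
    allowed = set()
    for value in facts.values():
        allowed.update(_NUMBER.findall(str(value)))
    return sorted(set(_NUMBER.findall(draft)) - allowed)


def validate_draft(draft, facts):
    """Accept the draft only if every number traces back to a fact."""
    return not unsupported_numbers(draft, facts)


# Hypothetical analyzer facts for one case.
facts = {"scan_bytes": 1048576, "peak_memory_mb": 512.5}
print(validate_draft("The query scanned 1048576 bytes.", facts))
print(unsupported_numbers("It used 90% of available memory.", facts))
```

The second draft is rejected because "90" appears nowhere in the facts, however plausible the sentence sounds.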
Safety Model
- Python/analyzer-owned facts are the only trusted diagnostic evidence.
- Raw LLM output is untrusted unless normalized, sanitized, and validated.
- Browser-visible UI and trusted reports must not expose raw SQL, raw profiles, raw metadata, local paths, secrets, subprocess output, model/runtime internals, or raw artifact filenames.
- External collection must be explicit, bounded, read-only, redacted, and safe by default.
- Local config privacy_mode defaults to true; disabling it can relax local artifact identifier/host masking, but browser-visible UI and trusted reports still do not show raw SQL, profiles, or metadata. Local config no_llm=true keeps report and optimizer actions on deterministic Python-owned output.
- Impala metadata collection is allowlisted and read-only.
- Query Optimizer accepts only a single safe read-only statement and never executes pasted SQL.
See docs/safety-contract.md for the full contract. For a public, reviewer-oriented overview, see docs/security-model.md.
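As a rough picture of an output-side denylist check, the sketch below scans text bound for the browser or a trusted report and names the categories it trips. The two patterns are illustrative assumptions only; they are not the project's denylist and would miss many real cases.

```python
import re

# Illustrative patterns only; not the project's real denylist.
DENYLIST = {
    "local_path": re.compile(r"(?:/home/|/Users/|/tmp/|[A-Za-z]:\\)\S*"),
    "secret_assignment": re.compile(r"(?i)\b(?:password|token|secret)\s*[=:]\s*\S+"),
}


def denylist_hits(text):
    """Return the sorted names of denylist categories found in the text."""
    return sorted(name for name, pattern in DENYLIST.items() if pattern.search(text))


print(denylist_hits("Join order looks skewed on the largest table."))
print(denylist_hits("see /home/operator/case.json, password=hunter2"))
```

An output gate built this way would refuse to render any text with a non-empty result instead of trying to repair it.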
Licensing
Query Doctor is licensed under the GNU Affero General Public License version 3
or later (AGPL-3.0-or-later). See LICENSE.
Commercial licensing is available for proprietary, hosted, embedded, or enterprise use cases where AGPL obligations are not a fit. See COMMERCIAL-LICENSE.md.
Documentation
Start with docs/README.md. It separates current user docs, operations guides, architecture contracts, internal audits, and historical planning notes.
The canonical documentation language is English. Russian localized companion pages live under docs/i18n/ru/ when they are useful for long operator-facing explanations. If English and Russian pages diverge, the English page is the source of truth until the localized companion is updated.
High-value references:
- docs/local-smoke.md: local validation and smoke checks.
- docs/credentials.md: local credentials layout.
- docs/public-release-readiness.md: public release readiness checklist.
- docs/release-checklist.md: maintainer release and visibility-change checklist.
- docs/repository-hardening.md: repository security, CI hardening, release automation, and strong-test backlog.
- docs/architecture.md: current and future component boundary diagrams.
- docs/contributor-architecture.md: contributor-oriented architecture map.
- docs/roadmap.md: implemented scope and planned seams.
- docs/query-optimizer-contract.md: optimizer trust boundary.
- docs/cluster-doctor-contract.md: future Cluster Doctor contract.
Development Checks
Before committing:
pre-commit run --all-files
scripts/local_gate.sh
python -m ruff check query_doctor tests
python -m ruff format --check query_doctor tests scripts
python3 -m pytest -q
git diff --check
query-doctor-demo-preflight
git status --short
Stage only explicit files. Do not commit generated cases, reports, local configs, credentials, raw profiles, raw metadata, or temporary outputs.
Public Status
This repository is public. v0.1.0 is the initial public GitHub release
baseline, and v0.1.1 is the first PyPI release (query-doctor on PyPI). The
public license is AGPL-3.0-or-later, with commercial licensing available
separately.
PyPI publishing uses GitHub OIDC Trusted Publishing. The repository-side
testpypi and pypi environments require maintainer approval and do not use
stored package-index API tokens.
Before cutting a new tag, publishing to a package index, or announcing a public release, run the public-release guard from a clean working tree:
query-doctor-demo-preflight --public-release
Use docs/release-checklist.md for the full release and visibility-change checklist.