ContextDoctor
A static analyzer for RAG systems and context engineering workflows. Think ESLint, but for your context, not your JavaScript.
ContextDoctor inspects your documents, chunks, and knowledge bases and flags the structural, chunking, and context-quality problems that quietly wreck retrieval quality, before you ever call an LLM.
- One braggable number. A Context Health Score (0-100 plus an A-F grade), Lighthouse-style, with a README badge.
- Fully offline. No API keys. No cloud. No OpenAI / Anthropic / Gemini calls. No model downloads.
- Fast & deterministic. Pure static analysis. Same input, same report, every time.
- Zero runtime dependencies. Just Python 3.11+ and the standard library.
- Opinionated but extensible. Ten sharp rules (CTX001-CTX010) with actionable fixes, plus a plugin API for your own.
- Six output formats. Terminal, JSON, Markdown, self-contained HTML, SARIF (GitHub code scanning), and a badge.
- Meets you where you are. GitHub Action, pre-commit hook, and one-line LangChain / LlamaIndex integration.
- Reads what you have. Markdown, text, HTML, JSON, JSONL, CSV/TSV, and (optional) PDF.
- Try it with zero install. A browser playground runs the whole analyzer in WebAssembly; nothing is uploaded.
Why does this exist? Most "my RAG is bad" problems are not model problems; they're context problems: chunks that are too big or too small, duplicated passages crowding out diverse results, tables shredded across chunk boundaries, and related facts scattered so far apart that no retriever can reassemble them. ContextDoctor helps you answer "why is my RAG system performing poorly?" statically, in seconds, for free.
Where it fits
RAG evaluation tools like RAGAS, TruLens, DeepEval, and Phoenix are runtime, LLM-as-judge, post-retrieval tools: they need a running pipeline, test queries, and API calls, and they measure the answer. None of them check whether your knowledge base was worth retrieving from in the first place. ContextDoctor owns that missing pre-retrieval, pre-index layer. It's complementary: lint with ContextDoctor before you index, evaluate with RAGAS/DeepEval after you answer.
The Context Health Score
Every run produces a single 0-100 score with an A-F grade that is easy to track over time, gate in CI, and show off:
Context Health Score
69/100 D poor
Drop a live badge in your README (--format badge prints the snippet):

Install
pip install contextdoctor # from PyPI (once published)
# or, from source:
git clone https://github.com/pranavbelhekar01/ContextLint
cd ContextLint
pip install -e ".[dev]"
Requires Python 3.11+. No other runtime dependencies.
Quick start
contextdoctor analyze ./docs
That's it. Point it at a file or a directory of Markdown, plain text, or JSON chunk exports, and you get a report like this:
ContextDoctor · static analysis for RAG
────────────────────────────────────────────────────────────────────
root: examples/messy_docs
files: 4   chunks: 15   generated: 2026-07-01T06:19:10Z
Summary   1 error   4 warning   0 info
Chunk statistics
         chars    tokens
min      10       2
median   705      176
mean     907.5    226.8
p95      2166.2   541.8
max      4199     1050
overlap 35.48% · duplicated 6.67%
Context Fragmentation Index (experimental)
CFI 0.030   0=coherent 1=fragmented
Findings
✖ CTX004 [broken-table]
  A markdown table in chunks_export.json is split between chunk 2 and chunk 3.
  → Keep tables intact within a single chunk. A table split across chunks
    loses its header row and column meaning...
  • chunks_export.json [chunk 2] (table continues)
  • chunks_export.json [chunk 3] (table continued)
▲ CTX001 [chunk-too-large]
  1 chunk(s) exceed the recommended maximum of 2000 characters (largest: 4199).
  → Split oversized chunks...
... (CTX002, CTX003, CTX005) ...
What it checks
| Rule | Name | Severity | What it catches |
|---|---|---|---|
| CTX001 | chunk-too-large |
warning | Chunks bigger than max_chunk_chars โ they dilute relevance and blow the context budget. |
| CTX002 | chunk-too-small |
warning | Chunks smaller than min_chunk_chars โ fragments too small to carry standalone meaning. |
| CTX003 | duplicate-content |
warning | Exact (hash) and near (Jaccard / MinHash) duplicate chunks that crowd out diverse results. |
| CTX004 | broken-table |
error | Markdown tables split across a chunk boundary, losing their header row. |
| CTX005 | heading-fragmentation |
warning | A single section spanning too many chunks โ a signal to use parent-child retrieval. |
| CTX006 | high-context-fragmentation |
warning ยท experimental | High Context Fragmentation Index (CFI) โ related information scattered across distant chunks. |
| CTX007 | secret-detected |
error | API keys, tokens, or private keys embedded in the corpus โ you're about to index a secret into your vector DB. |
| CTX008 | pii-detected |
warning | Emails, phone numbers, SSNs, or card numbers in the content (values are redacted, never echoed). |
| CTX009 | encoding-artifacts |
warning | Mojibake (รยฉ, รขโฌโข), replacement chars (๏ฟฝ), or control characters from a broken extraction step. |
| CTX010 | exceeds-embedding-limit |
warning | Chunks likely over your embedding model's token limit โ the tail is silently truncated and never embedded. |
Every finding includes a severity, a description, a concrete recommendation, and file/chunk references wherever possible.
List them anytime:
contextdoctor rules
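To make the CTX003 near-duplicate check more concrete, here is a minimal sketch of word-shingle Jaccard similarity using the documented defaults (shingle size 5, cutoff 0.85). The function names are illustrative assumptions, not ContextDoctor's internals, and the sketch compares every pair exactly instead of using MinHash.

```python
def shingles(text: str, size: int = 5) -> set[tuple[str, ...]]:
    """Return the set of word n-grams (shingles) in a chunk."""
    words = text.lower().split()
    if len(words) < size:
        # Short chunks become a single shingle (or nothing if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def near_duplicates(chunks: list[str], threshold: float = 0.85,
                    size: int = 5) -> list[tuple[int, int]]:
    """Return index pairs whose shingle Jaccard similarity meets the threshold."""
    sets = [shingles(c, size) for c in chunks]
    pairs = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            union = sets[i] | sets[j]
            if union and len(sets[i] & sets[j]) / len(union) >= threshold:
                pairs.append((i, j))
    return pairs


chunks = [
    "the quick brown fox jumps over the lazy dog today",
    "the quick brown fox jumps over the lazy dog today",
    "completely different text about vector databases and retrieval quality",
]
print(near_duplicates(chunks))
```

The pairwise loop is quadratic in the number of chunks, which is presumably why a MinHash approximation is useful on larger corpora.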
The Context Fragmentation Index (CFI), experimental
The CFI is ContextDoctor's flagship experimental signal. It asks a simple question: when the same named thing is discussed in multiple chunks, how far apart are those chunks? Information about one entity scattered across the whole corpus is much harder for a retriever to reassemble than information kept close together.
How it's computed (v0.1):
- Extract lightweight, local entities per chunk (proper nouns / acronyms); no models, no network.
- For every entity that appears in ≥ min_entity_freq distinct chunks, record the chunk indices where it appears.
- Compute the mean gap between consecutive appearances and normalise by the corpus size (N − 1) to get a per-entity fragmentation in [0, 1].
- The CFI is the occurrence-weighted mean of per-entity fragmentation.
Scale: 0.0 = highly coherent · 1.0 = highly fragmented.
The CFI is experimental and deliberately simple. It's a signal to inspect, not a hard pass/fail: treat a high CFI as "go look at how this topic is spread out," not "this corpus is broken." It is clearly labelled experimental everywhere it appears.
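The four steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not ContextDoctor's implementation: the entity regex (capitalised words and all-caps acronyms) is a stand-in for the real extractor, and "occurrence-weighted" is read here as weighting each entity by the number of chunks it appears in.

```python
import re
from collections import defaultdict

# Assumed entity heuristic: capitalised words or acronyms of 2+ letters.
ENTITY = re.compile(r"\b(?:[A-Z][a-z]+|[A-Z]{2,})\b")


def context_fragmentation_index(chunks: list[str], min_entity_freq: int = 2) -> float:
    """Occurrence-weighted mean of per-entity fragmentation, in [0, 1]."""
    n = len(chunks)
    if n < 2:
        return 0.0
    positions: dict[str, list[int]] = defaultdict(list)
    for index, text in enumerate(chunks):
        for entity in set(ENTITY.findall(text)):  # count each chunk once
            positions[entity].append(index)

    weighted_sum = 0.0
    total_weight = 0
    for indices in positions.values():
        # At least two appearances are needed to measure a gap.
        if len(indices) < max(2, min_entity_freq):
            continue
        gaps = [b - a for a, b in zip(indices, indices[1:])]
        fragmentation = (sum(gaps) / len(gaps)) / (n - 1)
        weighted_sum += fragmentation * len(indices)
        total_weight += len(indices)
    return weighted_sum / total_weight if total_weight else 0.0


scattered = ["Alpha service starts.", "nothing here", "more filler", "Alpha again"]
adjacent = ["Alpha one", "Alpha two", "x", "y"]
print(context_fragmentation_index(scattered), context_fragmentation_index(adjacent))
```

In the scattered corpus the only entity appears in the first and last chunk, so its gap equals N − 1 and the index reaches its maximum; in the adjacent corpus the gap is one chunk out of three.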
See it in action:
contextdoctor analyze ./examples/fragmented_kb
# CFI 0.750 → CTX006 fires
Inputs
ContextDoctor understands many input types and traverses directories recursively (skipping hidden files):
- Markdown (.md, .markdown): chunked by ContextDoctor's structure-aware chunker.
- Plain text (.txt): chunked the same way.
- HTML (.html, .htm): tags/scripts/styles stripped, then chunked.
- JSON exports (.json): read as pre-existing chunks, so metrics reflect your chunking, not ours.
- JSONL / NDJSON (.jsonl, .ndjson): one chunk per line.
- CSV / TSV (.csv, .tsv): one chunk per row, rendered as header: value.
- PDF (.pdf): optional; install with pip install "contextdoctor[pdf]" (keeps the core dependency-free).
Supported JSON shapes (auto-detected):
["chunk one", "chunk two"] // list of strings
[{"text": "..."}, {"content": "..."}] // list of objects
{"chunks": [{"page_content": "..."}]} // container object
Recognised text keys: text, content, chunk, page_content, body,
passage. Recognised container keys: chunks, documents, nodes, data,
items, passages.
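As a rough illustration of how the three shapes could be auto-detected, the sketch below walks a parsed JSON value using the recognised key names listed above. The function name and the first-match-wins order are assumptions; the real loader may resolve ambiguous inputs differently.

```python
import json

TEXT_KEYS = ("text", "content", "chunk", "page_content", "body", "passage")
CONTAINER_KEYS = ("chunks", "documents", "nodes", "data", "items", "passages")


def extract_chunk_texts(data) -> list[str]:
    """Return chunk texts from any of the three supported JSON shapes."""
    if isinstance(data, dict):
        # Container object: descend into the first recognised list-valued key.
        for key in CONTAINER_KEYS:
            if isinstance(data.get(key), list):
                return extract_chunk_texts(data[key])
        return []
    texts = []
    if isinstance(data, list):
        for item in data:
            if isinstance(item, str):  # list of strings
                texts.append(item)
            elif isinstance(item, dict):  # list of objects
                for key in TEXT_KEYS:
                    if isinstance(item.get(key), str):
                        texts.append(item[key])
                        break
    return texts


export = json.loads('{"chunks": [{"page_content": "first"}, {"text": "second"}]}')
print(extract_chunk_texts(export))
```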
Output formats
contextdoctor analyze ./docs # rich terminal report (default)
contextdoctor analyze ./docs --format json # machine-readable JSON
contextdoctor analyze ./docs --format markdown -o report.md
contextdoctor analyze ./docs --format html -o report.html # self-contained visual report
contextdoctor analyze ./docs --format sarif -o results.sarif # GitHub code scanning
contextdoctor analyze ./docs --format badge # shields.io endpoint JSON + snippet
The HTML report is a single self-contained file (inline CSS + SVG, no JS, no network): open it, screenshot the score card, share it.
Compare two chunking strategies
Answer "is recursive or semantic chunking better for my corpus?" statically, with no LLM:
contextdoctor compare recursive_export.json semantic_export.json
ContextDoctor compare
metric          before   after   Δ
──────────────────────────────────────
health score    71       88      +17
findings        6        2       -4
duplicate %     9.10     1.20    -7.90
CFI             0.41     0.22    -0.19
→ 'after' is healthier.
CI usage
Fail the build when issues are found:
contextdoctor analyze ./docs --fail-on error # exit 1 on any error-level finding
contextdoctor analyze ./docs --fail-on warning # exit 1 on any warning or worse
GitHub Action (findings appear inline on the PR via SARIF):
# .github/workflows/context.yml
name: ContextDoctor
on: [pull_request]
jobs:
  contextdoctor:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pranavbelhekar01/ContextLint@v0.1 # composite action (action.yml)
        with:
          path: ./knowledge_base
          fail-on: error
      - uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: contextdoctor.sarif
pre-commit (.pre-commit-hooks.yaml is shipped):
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/pranavbelhekar01/ContextLint
    rev: v0.1.0
    hooks:
      - id: contextdoctor
Configuration
ContextDoctor is opinionated but tunable. It auto-discovers a .contextdoctor.json
or a [tool.contextdoctor] table in pyproject.toml near your target, or you can
pass one explicitly with --config.
.contextdoctor.json:
{
  "chunk_size": 1200,
  "chunk_overlap": 120,
  "max_chunk_chars": 2000,
  "min_chunk_chars": 200,
  "near_duplicate_threshold": 0.85,
  "max_chunks_per_heading": 5,
  "cfi_warning_threshold": 0.6,
  "min_entity_freq": 2
}
Or in pyproject.toml:
[tool.contextdoctor]
max_chunk_chars = 1500
cfi_warning_threshold = 0.5
Common thresholds can also be overridden on the command line:
contextdoctor analyze ./docs --chunk-size 800 --max-chunk-chars 1500 --cfi-threshold 0.5
| Key | Default | Meaning |
|---|---|---|
| chunk_size | 1200 | Target chunk size (chars) when chunking raw .md/.txt. |
| chunk_overlap | 120 | Overlap (chars) carried between chunks. |
| max_chunk_chars | 2000 | CTX001 threshold. |
| min_chunk_chars | 200 | CTX002 threshold. |
| shingle_size | 5 | Word n-gram size for similarity/overlap. |
| near_duplicate_threshold | 0.85 | CTX003 near-duplicate Jaccard cutoff. |
| duplicate_pct_warning | 10.0 | Corpus-wide duplicate % that warns. |
| max_chunks_per_heading | 5 | CTX005 threshold. |
| min_entity_freq | 2 | Min distinct chunks an entity needs for CFI. |
| cfi_warning_threshold | 0.6 | CTX006 threshold. |
| embedding_token_limit | 512 | CTX010 threshold; set to your embedding model's context. |
| detect_secrets / detect_pii / detect_encoding_artifacts | true | Toggle CTX007 / CTX008 / CTX009. |
| select / ignore | [] | Only-run / skip rule ids (also --select / --ignore). |
| severity | {} | Per-rule severity override, e.g. {"CTX006": "info"}. |
Python API
Everything the CLI does is available programmatically:
from contextdoctor import analyze_path, Config

report = analyze_path("./docs", Config(max_chunk_chars=1500))
print(report.health_score, report.health_grade)  # 82 B
print(report.counts_by_severity())  # {"info": 0, "warning": 4, "error": 1}
for f in report.findings:
    print(f.rule_id, f.severity.value, f.message)
print(report.metrics["fragmentation"]["cfi"])  # experimental CFI

from contextdoctor.reports import render_html, render_json

# Write the self-contained HTML report, closing the file when done.
with open("report.html", "w", encoding="utf-8") as fh:
    fh.write(render_html(report))
Lint the chunks your pipeline actually produced
analyze_chunks() is a framework-agnostic bridge: hand it the exact chunks your
splitter emitted, before you embed them:
from contextdoctor import analyze_chunks

# LangChain (raw_docs is the list of Documents you loaded earlier)
from langchain_text_splitters import RecursiveCharacterTextSplitter
docs = RecursiveCharacterTextSplitter(chunk_size=800).split_documents(raw_docs)
report = analyze_chunks([d.page_content for d in docs])

# LlamaIndex (documents is the list of Documents you loaded earlier)
from llama_index.core.node_parser import SentenceSplitter
nodes = SentenceSplitter(chunk_size=512).get_nodes_from_documents(documents)
report = analyze_chunks([n.get_content() for n in nodes])

# Gate the pipeline on the score from whichever report you built above.
if report.health_score < 80:
    raise SystemExit(f"Context health too low: {report.health_score}/100")
This is the assertion you can put in your ingestion pipeline's tests: fail the build if your chunking regresses.
Adopting on an existing corpus
Turning a linter on a large, pre-existing knowledge base usually floods you with issues. Two mechanisms make adoption incremental:
Baseline: freeze today's findings; fail only on new ones:
contextdoctor baseline ./docs # writes .contextdoctor-baseline.json
contextdoctor analyze ./docs --baseline .contextdoctor-baseline.json --fail-on warning
# -> pre-existing findings are suppressed; only regressions surface (and count against the score)
Inline disable pragmas: opt a specific file out of a rule (file-scoped), for the legitimate cases (e.g. a doc that shows an example API key):
<!-- contextdoctor: disable=CTX007 --> # disable one or more rules for this file
<!-- contextdoctor: disable=CTX003,CTX008 --> # comma-separated
<!-- contextdoctor: disable-all --> # disable everything for this file
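Pragmas of this form are straightforward to recognise with a regular expression. The sketch below is one possible parser, written only from the three examples above; the function name and the "*" sentinel for disable-all are hypothetical, and the real parser may accept more variants.

```python
import re

# Matches <!-- contextdoctor: disable=ID[,ID...] --> and <!-- contextdoctor: disable-all -->
PRAGMA = re.compile(
    r"<!--\s*contextdoctor:\s*(disable-all|disable=([A-Z0-9_,\s]+?))\s*-->"
)


def disabled_rules(text: str) -> set[str]:
    """Return rule ids disabled for a file; {"*"} means every rule."""
    disabled: set[str] = set()
    for match in PRAGMA.finditer(text):
        if match.group(1) == "disable-all":
            return {"*"}
        ids = match.group(2).split(",")
        disabled.update(rule.strip() for rule in ids if rule.strip())
    return disabled


doc = "<!-- contextdoctor: disable=CTX003,CTX008 -->\n# Support log\n..."
print(sorted(disabled_rules(doc)))
```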
Playground
Want to try it without installing anything? The browser playground runs the entire ContextDoctor engine in WebAssembly (Pyodide): paste your chunks, get a score and a full report, and nothing is uploaded. It works because the core has zero dependencies. Deploy your own to GitHub Pages with the included workflow, or run it locally:
python -m http.server -d playground 8000 # then open http://localhost:8000
Custom rules & plugins
ContextDoctor is extensible. A plugin is just an Analyzer subclass that declares
the rules it emits, and those rules then flow through everything: the health
score, all report formats, SARIF, contextdoctor rules, and --select /
--ignore, exactly like the built-in CTX* rules.
The lowest-friction path is a single local file:
# my_rules.py
from contextdoctor.analyzers import AnalysisContext, Analyzer
from contextdoctor.models import AnalyzerResult, Location, Severity
from contextdoctor.rules import Rule

class TodoAnalyzer(Analyzer):
    name = "todo"
    provides_rules = [Rule(id="MYP001", name="unfinished-content", category="custom",
                           default_severity=Severity.WARNING,
                           description="Placeholder text found.",
                           recommendation="Finish or remove it before indexing.")]

    def analyze(self, ctx: AnalysisContext) -> AnalyzerResult:
        findings = [
            self._finding("MYP001", "TODO marker in chunk",
                          locations=[Location(file=c.source_file, chunk_id=c.id)])
            for c in ctx.chunks if "TODO" in c.text
        ]
        return self._result(findings=findings)
contextdoctor analyze ./docs --plugin ./my_rules.py
Three ways to load, in increasing order of packaging effort:
| How | Spec |
|---|---|
| Local .py file | --plugin ./my_rules.py or {"plugins": ["./my_rules.py"]} |
| Importable module | --plugin my_pkg.rules or my_pkg.rules:TodoAnalyzer |
| Published package (auto-discovered) | entry point contextdoctor.analyzers in pyproject.toml |
# a distributable plugin package advertises itself; no config needed by users
[project.entry-points."contextdoctor.analyzers"]
my-rules = "contextdoctor_plugin_myrules:TodoAnalyzer"
A complete, working example lives in
examples/plugin/ (rule PLH001, flagging unfinished
content). Plugin loading is best-effort and offline: a broken plugin warns and
is skipped, and built-in CTX* ids can't be silently overridden.
How it works
contextdoctor/
├── cli.py             # argparse CLI: analyze / compare / rules
├── config.py          # thresholds + config discovery (.json / pyproject.toml)
├── engine.py          # discover → chunk → analyze → filter → score → Report
├── scoring.py         # the Context Health Score
├── baseline.py        # freeze findings; report only new ones
├── plugins.py         # load custom analyzers/rules (files, modules, entry points)
├── models.py          # Chunk, Document, Finding, Report, Severity
├── chunking/          # structure-aware chunker (paragraphs, tables, code fences)
├── parsers/           # discovery + md/txt/html/json/jsonl/csv/pdf loaders + pragmas
├── analyzers/         # one module per concern:
│   ├── chunk_stats.py      # CTX001 / CTX002 / CTX010 + distribution + overlap
│   ├── duplicates.py       # CTX003 (hash + Jaccard/MinHash)
│   ├── tables.py           # CTX004
│   ├── headings.py         # CTX005
│   ├── content_quality.py  # CTX007 / CTX008 / CTX009 (secrets, PII, encoding)
│   └── fragmentation.py    # CTX006, the experimental CFI
├── rules/             # rule catalogue (id, severity, description, recommendation)
├── reports/           # terminal / json / markdown / html / sarif / badge
└── utils/             # text, hashing (MinHash), NLP, ANSI, secret/PII patterns
The pipeline is a straight line: discover files → build chunks → run each analyzer over the shared corpus → collect findings + metrics → render. No step touches the network.
Development
pip install -e ".[dev]"
pytest -q # run the test suite
ruff check . # lint
ruff format . # format
The project targets Python 3.11, 3.12, and 3.13, and is tested on Linux, macOS, and Windows in CI.
Adding a rule
- Add the rule metadata to contextdoctor/rules/registry.py.
- Emit findings for it from a new or existing analyzer in contextdoctor/analyzers/ (subclass Analyzer, use self._finding(...)).
- Register the analyzer in contextdoctor/analyzers/__init__.py.
- Add tests and an example that triggers it.
Examples
The examples/ directory ships datasets you can run immediately:
- examples/clean_docs/: well-structured docs; scores 100/100.
- examples/messy_docs/: triggers CTX001-CTX005 and CTX010 (oversized/tiny chunks, duplicates, a broken table, heading fragmentation, embedding-limit).
- examples/risky_docs/: a support log that leaked secrets, PII, and mojibake into the KB (CTX007-CTX009). Values are always redacted.
- examples/fragmented_kb/: a scattered knowledge base that triggers the experimental CFI (CTX006), with its own .contextdoctor.json.
contextdoctor analyze ./examples/messy_docs
contextdoctor analyze ./examples/risky_docs
contextdoctor analyze ./examples/fragmented_kb
Roadmap
ContextDoctor is at v0.1. Ideas on the table:
- More rules: boilerplate/nav-chrome detection, orphaned references, language mixing.
- More parsers: .rst, DOCX, and richer HTML (readability-style main-content extraction).
- A refined, better-validated CFI (the current one is intentionally experimental).
- Line-scoped disable pragmas (today's pragmas are file-scoped) and autofix suggestions.
- A VS Code extension surfacing findings inline as you edit docs.
Contributions and issues welcome.
License
MIT. Fully offline, forever. No LLM was called to produce your report.