A static analyzer for RAG chunks
Project description
ChunkLint
Static analysis for RAG chunks before they enter a vector database.
ChunkLint checks chunk files and in-memory Python chunk objects for structural problems that usually show up later as bad retrieval, incomplete answers, or expensive re-indexing work. It is intentionally simple: no LLM calls, no embeddings, no eval dataset, and no vector database connection required.
The rules are deterministic heuristics. They are built to catch likely chunking mistakes with low runtime cost, but they are not a substitute for a labeled retrieval-quality benchmark.
Why It Exists
A normal RAG pipeline often moves directly from splitting to embedding:
docs = loader.load()
chunks = splitter.split_documents(docs)
vectorstore.add_documents(chunks)
If those chunks are malformed, missing context, duplicated, or split in the middle of an important sentence, the vector database stores that damage. You usually discover it much later when retrieval gives poor context to the model.
ChunkLint adds a quality gate between chunking and embedding:
from chunklint.adapters.langchain import export_documents, lint_documents
docs = loader.load()
chunks = splitter.split_documents(docs)
report = lint_documents(chunks)
export_documents(chunks, "chunks.json")
if report.has_high_issues:
raise RuntimeError(
"ChunkLint found high-severity findings. "
"Run: chunklint scan chunks.json --verbose"
)
vectorstore.add_documents(chunks)
What It Catches
ChunkLint focuses on obvious static chunk quality issues:
- Empty or whitespace-only chunks
- Missing stable chunk IDs
- Missing source metadata
- Missing heading, title, or section context
- Chunks that likely start mid-sentence
- Chunks that likely end mid-sentence
- Tiny chunks with little usable context
- Very large chunks that may mix topics
- Markdown table fragments without headers
- Unclosed markdown code fences
- Near-duplicate chunks
- Common PDF extraction noise
What It Is Not
ChunkLint is not a RAG observability platform, an LLM judge, a retriever eval tool, a vector database scanner, a PDF parser, or a chunk-size optimizer. It is a fast static analyzer that catches preventable chunk problems before indexing.
Install
From PyPI, once published:
pip install chunklint
For local development from this repository:
git clone https://github.com/prabhath004/chunklint.git
cd chunklint
python -m pip install -e .
Optional framework extras:
pip install "chunklint[langchain]"
pip install "chunklint[llamaindex]"
Developer tools:
python -m pip install -e ".[dev]"
Quickstart
Use the included bad chunk fixture:
chunklint scan examples/bad_chunks.json
Fail CI when high-severity issues exist:
chunklint scan examples/bad_chunks.json --fail-on high
Write a JSON report:
chunklint scan examples/bad_chunks.json --format json --out report.json
If the console script is not on your PATH, run the module directly:
python -m chunklint.cli scan examples/bad_chunks.json
CLI
scan
Scans JSON or JSONL chunk exports.
chunklint scan chunks.json
chunklint scan chunks.jsonl
chunklint scan chunks.json --fail-on high
chunklint scan chunks.json --fail-on high,medium
chunklint scan chunks.json --fail-on-at-or-above medium
chunklint scan chunks.json --format json
chunklint scan chunks.json --format json --out report.json
chunklint scan chunks.json --config chunklint.yml
chunklint scan chunks.json --quiet
chunklint scan chunks.json --verbose
chunklint scan chunks.json --raw --max-issues 50
Exit codes:
0: scan completed and did not fail the selected severity gate1: selected severity gate failed, such as--fail-on high2: invalid input or invalid config3: unexpected internal error
Gating findings in CI
ChunkLint exposes two mutually exclusive gate flags:
--fail-onis an exact-severity gate.--fail-on highfails only on high findings;--fail-on mediumfails only on medium findings. You can pass a comma list (--fail-on high,medium) to gate on several specific severities at once.--fail-on-at-or-aboveis a threshold gate.--fail-on-at-or-above mediumfails on medium and high findings;--fail-on-at-or-above lowfails on anything.
Pick --fail-on when you want surgical control (e.g., block high without
caring about medium yet). Pick --fail-on-at-or-above when you want the
usual "block this and worse" CI behavior. The two flags cannot be combined,
so the intent stays explicit in your workflow file.
In text output, the gate shows an overall high/medium/low lint summary plus
root causes for the selected severity. JSON output still contains the full
machine-readable scan. Use --quiet when you only want the exit code.
init
Creates a default config file.
chunklint init
Create a config at a custom path:
chunklint init config/chunklint.yml
Overwrite an existing config:
chunklint init --force
rules
Lists rule IDs, default severities, and whether each rule checks one chunk or the full chunk set.
chunklint rules
More CLI detail is in docs/cli.md.
Python SDK
The generic SDK accepts dictionaries, Chunk models, and supported framework
objects.
from chunklint import lint
chunks = [
{
"id": "chunk_1",
"text": "Refund Policy. Customers can request refunds within 30 days.",
"source": "refund_policy.md",
"metadata": {"heading": "Refund Policy"},
},
{
"id": "chunk_2",
"text": "except enterprise customers may request refunds within 90 days.",
"source": "refund_policy.md",
"metadata": {"heading": "Refund Policy"},
},
]
report = lint(chunks)
if report.has_high_issues:
raise RuntimeError(f"ChunkLint found {report.high} high-severity findings.")
The returned report includes counts and issue objects:
print(report.chunks_scanned)
print(report.issues_found)
print(report.high, report.medium, report.low)
print(report.ok)
More SDK detail is in docs/sdk.md.
LangChain
LangChain Document objects are mapped from page_content and metadata.
from chunklint.adapters.langchain import export_documents, lint_documents
docs = loader.load()
chunks = splitter.split_documents(docs)
report = lint_documents(chunks)
export_documents(chunks, "chunks.json")
if report.has_high_issues:
raise RuntimeError(
"ChunkLint found high-severity findings. "
"Run: chunklint scan chunks.json --verbose"
)
vectorstore.add_documents(chunks)
LlamaIndex
LlamaIndex nodes are mapped from node.get_content(), node.node_id,
node.metadata, and node.ref_doc_id.
from chunklint.adapters.llamaindex import export_nodes, lint_nodes
nodes = parser.get_nodes_from_documents(documents)
report = lint_nodes(nodes)
export_nodes(nodes, "chunks.json")
if report.has_high_issues:
raise RuntimeError(
"ChunkLint found high-severity findings. "
"Run: chunklint scan chunks.json --verbose"
)
index = VectorStoreIndex(nodes)
Input Format
JSON input should be an array of chunk objects:
[
{
"id": "chunk_1",
"text": "Refund Policy. Customers can request refunds within 30 days.",
"source": "refund_policy.pdf",
"metadata": {
"page": 2,
"heading": "Refund Policy"
}
}
]
JSONL input should have one chunk object per line:
{"id":"chunk_1","text":"Refund Policy. Customers can request refunds within 30 days.","source":"refund_policy.pdf","metadata":{"page":2,"heading":"Refund Policy"}}
{"id":"chunk_2","text":"except enterprise customers may request refunds within 90 days.","source":"refund_policy.pdf","metadata":{"page":2,"heading":"Refund Policy"}}
Supported text keys for dictionary inputs include text, page_content, and
content. Source can be supplied as a top-level source or through metadata
keys such as source, file_name, path, or document_id.
Config
ChunkLint automatically loads chunklint.yml or chunklint.yaml from the
current working directory. You can also pass a config explicitly:
chunklint scan chunks.json --config path/to/chunklint.yml
Example config:
version: 1
thresholds:
min_words: 30
max_words: 700
duplicate_similarity: 0.92
max_line_break_ratio: 0.35
heading_keys:
- heading
- title
- section
- heading_path
- document_title
- file_name
rules:
starts_mid_sentence:
enabled: true
severity: high
connector_words:
- except
- however
- therefore
- because
- although
- which
- that
- and
- but
- or
- also
- then
ignore_start_words:
- iphone
- ebay
- npm
- openai
too_short:
enabled: true
severity: low
near_duplicate:
enabled: true
severity: low
Full rule and config reference: docs/rules.md.
Rules
| Rule | Default | Scope | Purpose |
|---|---|---|---|
missing_text |
high | chunk | Flags empty chunks. |
missing_id |
medium | chunk | Flags chunks without stable IDs. |
missing_source |
medium | chunk | Flags chunks without traceable source metadata. |
missing_heading |
medium | chunk | Flags chunks without heading/title/section metadata. Page labels are not treated as headings. |
starts_mid_sentence |
high | chunk | Flags likely mid-sentence starts using continuation punctuation, configurable connector words, lowercase starts, and false-positive exclusions for headings, code, tables, and known product/tool names. |
ends_mid_sentence |
medium | chunk | Flags likely mid-sentence endings using missing punctuation, continuation punctuation, and trailing connector words while skipping headings, tables, code, URLs, and colon labels. |
broken_chunk_boundary |
high | cross-chunk | Compares adjacent chunks and flags likely sentence splits across chunk boundaries. |
too_short |
low | chunk | Flags chunks below min_words; raises to medium when heading context is missing. |
too_long |
medium | chunk | Flags chunks above max_words. |
broken_markdown_table |
high | chunk | Flags table fragments without markdown separator/header context. |
broken_code_block |
medium | chunk | Flags odd counts of triple-backtick fences. |
near_duplicate |
low | cross-chunk | Flags chunks above the duplicate similarity threshold. |
pdf_noise |
low | cross-chunk | Flags page labels, repeated headers/footers, hyphenation, and line-break artifacts. |
Reports
Terminal output is designed for humans:
ChunkLint Report
Chunks scanned: 3
Raw findings: 6
Actionable root causes: 2
High: 2
Medium: 2
Low: 2
Text output groups related rules into root causes, recommends next steps, and
keeps examples hidden by default. --verbose shows examples with snippets, and
--examples-per-rule controls how many examples are shown. Use --raw when
you need row-level findings, and use --raw --max-issues 0 to show every raw
row.
JSON output is designed for CI, logs, or downstream tools. It includes summary, rule groups, root causes, recommendations, and raw issues:
{
"schema_version": 1,
"summary": {
"chunks_scanned": 3,
"issues_found": 6,
"high": 2,
"medium": 2,
"low": 2
},
"groups": [],
"root_causes": [],
"recommendations": [],
"issues": [
{
"chunk_id": "chunk_2",
"source": "refund_policy.pdf",
"rule_id": "starts_mid_sentence",
"severity": "high",
"reason": "Chunk starts with connector word \"except\".",
"why_it_matters": "This chunk likely depends on a previous sentence and may lose the main rule.",
"fix": "Use sentence-aware splitting or increase overlap.",
"snippet": "except enterprise customers may request refunds within 90 days."
}
]
}
schema_version is the public JSON contract version. Adding fields is a
non-breaking change; renaming or removing existing fields bumps the
version. Consumers that pin to a major version should read this key first.
CI
Use ChunkLint as a pre-embedding quality gate:
name: ChunkLint
on:
pull_request:
paths:
- "docs/**"
- "scripts/generate_chunks.py"
- "chunklint.yml"
jobs:
chunklint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.11"
- name: Install dependencies
run: pip install chunklint
- name: Generate chunks
run: python scripts/generate_chunks.py
- name: Run ChunkLint
run: chunklint scan chunks.json --fail-on high
The same workflow is available at examples/github_action.yml.
Project Layout
chunklint/
cli.py # Typer CLI commands
config.py # YAML config model and defaults
engine.py # Rule orchestration and report creation
loader.py # JSON/JSONL load and export helpers
models.py # Pydantic models for chunks, issues, reports
normalizer.py # Converts dicts/framework objects into Chunk models
reporter.py # Rich terminal output and JSON report serialization
adapters/ # LangChain and LlamaIndex adapters
rules/ # One file per rule family
utils/ # Text, metadata, and severity helpers
docs/ # User and contributor documentation
examples/ # Demo inputs and integration snippets
tests/ # Pytest coverage for CLI, loader, adapters, and rules
Test Folder Guide
The tests/ folder is the safety net for the first release:
| File | What it verifies |
|---|---|
tests/test_loader.py |
JSON array loading, JSONL loading, and chunk export behavior. |
tests/test_cli.py |
CLI exit codes, --fail-on, JSON report writing, and config initialization. |
tests/test_adapters.py |
LangChain and LlamaIndex adapter normalization, linting, and export helpers. |
tests/test_missing_rules.py |
missing_text, missing_id, missing_source, and missing_heading. |
tests/test_boundary_rules.py |
Mid-sentence start/end detection, stronger continuation signals, false-positive exclusions, and boundary-rule config options. |
tests/test_table_rule.py |
Broken markdown table detection and valid table pass-through. |
tests/test_code_rule.py |
Unclosed markdown code-fence detection. |
tests/test_duplicate_rule.py |
Near-duplicate detection across chunks. |
tests/test_pdf_noise_rule.py |
PDF page-label noise and repeated footer/header detection. |
tests/test_reporter.py |
Grouped report summaries, JSON report shape, and recommendation generation. |
tests/test_config.py |
YAML config loading, auto-discovery, deep-merge overrides, and rule disabling. |
tests/test_json_schema.py |
Pins the public JSON output contract so downstream consumers can rely on it. |
tests/test_sdk_quiet.py |
Validates the quiet/SDK code path that returns a report without printing. |
More detail is in docs/testing.md.
Run tests:
python -m pytest
Run linting when dev dependencies are installed:
python -m ruff check .
Project Status
ChunkLint v0.1.0 is the first public release on PyPI. The CLI, Python SDK,
framework adapters, rules, and config schema are usable today, and the JSON
output is independently versioned via schema_version: 1. While the project
is still on 0.x, minor versions may introduce breaking changes as the API
gets real-world feedback. The CHANGELOG.md file tracks notable changes
between releases.
License
MIT. See LICENSE.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file chunklint-0.1.0.tar.gz.
File metadata
- Download URL: chunklint-0.1.0.tar.gz
- Upload date:
- Size: 51.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a3f3b62446b8882013d95a30bd90ab997b81d914d60c52708311944e43ee1239
|
|
| MD5 |
f1c3785ad41853ec0c27055dc6452d26
|
|
| BLAKE2b-256 |
9656050939ec17fcdb1f3a9dfc642d4b6fc95b0b1ff0f22eb4724f1b3b9cedcb
|
File details
Details for the file chunklint-0.1.0-py3-none-any.whl.
File metadata
- Download URL: chunklint-0.1.0-py3-none-any.whl
- Upload date:
- Size: 35.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2f16e0bddf2c41a21e278a0f359e28d3df9e05a866b35032e051471e8c7903ce
|
|
| MD5 |
240fd869a0559a41a65136d859760a66
|
|
| BLAKE2b-256 |
794c808f491437c28ce4e639059f8821acb7ffac7cd94cc9040cfa505561aa23
|