A static analyzer for RAG chunks

ChunkLint

Static analysis for RAG chunks before they enter a vector database.

ChunkLint checks chunk files and in-memory Python chunk objects for structural problems that usually show up later as bad retrieval, incomplete answers, or expensive re-indexing work. It is intentionally simple: no LLM calls, no embeddings, no eval dataset, and no vector database connection required.

The rules are deterministic heuristics. They are built to catch likely chunking mistakes with low runtime cost, but they are not a substitute for a labeled retrieval-quality benchmark.

Why It Exists

A normal RAG pipeline often moves directly from splitting to embedding:

docs = loader.load()
chunks = splitter.split_documents(docs)
vectorstore.add_documents(chunks)

If those chunks are malformed, missing context, duplicated, or split in the middle of an important sentence, the vector database indexes them as-is. The damage usually surfaces much later, when retrieval hands the model poor context.

ChunkLint adds a quality gate between chunking and embedding:

from chunklint.adapters.langchain import export_documents, lint_documents

docs = loader.load()
chunks = splitter.split_documents(docs)

report = lint_documents(chunks)
export_documents(chunks, "chunks.json")

if report.has_high_issues:
    raise RuntimeError(
        "ChunkLint found high-severity findings. "
        "Run: chunklint scan chunks.json --verbose"
    )

vectorstore.add_documents(chunks)

What It Catches

ChunkLint focuses on obvious static chunk quality issues:

Empty or whitespace-only chunks
Missing stable chunk IDs
Missing source metadata
Missing heading, title, or section context
Chunks that likely start mid-sentence
Chunks that likely end mid-sentence
Tiny chunks with little usable context
Very large chunks that may mix topics
Markdown table fragments without headers
Unclosed markdown code fences
Near-duplicate chunks
Common PDF extraction noise

What It Is Not

ChunkLint is not a RAG observability platform, an LLM judge, a retriever eval tool, a vector database scanner, a PDF parser, or a chunk-size optimizer. It is a fast static analyzer that catches preventable chunk problems before indexing.

Install

From PyPI:

pip install chunklint

For local development from this repository:

git clone https://github.com/prabhath004/chunklint.git
cd chunklint
python -m pip install -e .

Optional framework extras:

pip install "chunklint[langchain]"
pip install "chunklint[llamaindex]"

Developer tools:

python -m pip install -e ".[dev]"

Quickstart

Use the included bad chunk fixture:

chunklint scan examples/bad_chunks.json

Fail CI when high-severity issues exist:

chunklint scan examples/bad_chunks.json --fail-on high

Write a JSON report:

chunklint scan examples/bad_chunks.json --format json --out report.json

If the console script is not on your PATH, run the module directly:

python -m chunklint.cli scan examples/bad_chunks.json

CLI

`scan`

Scans JSON or JSONL chunk exports.

chunklint scan chunks.json
chunklint scan chunks.jsonl
chunklint scan chunks.json --fail-on high
chunklint scan chunks.json --fail-on high,medium
chunklint scan chunks.json --fail-on-at-or-above medium
chunklint scan chunks.json --format json
chunklint scan chunks.json --format json --out report.json
chunklint scan chunks.json --config chunklint.yml
chunklint scan chunks.json --quiet
chunklint scan chunks.json --verbose
chunklint scan chunks.json --raw --max-issues 50

Exit codes:

0: scan completed and did not fail the selected severity gate
1: selected severity gate failed, such as --fail-on high
2: invalid input or invalid config
3: unexpected internal error
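For wrapper scripts, it can help to separate "the gate tripped" from "the tool could not run". The sketch below only encodes the four exit codes listed above; the enum and function names are hypothetical and are not part of ChunkLint.

```python
from enum import IntEnum


class ScanExit(IntEnum):
    """Exit codes from the list above (the names are illustrative)."""

    OK = 0              # scan completed and the gate did not fail
    GATE_FAILED = 1     # selected severity gate failed
    INVALID_INPUT = 2   # invalid input or invalid config
    INTERNAL_ERROR = 3  # unexpected internal error


def describe(returncode: int) -> str:
    """Map a process return code to a short label, or UNKNOWN."""
    try:
        return ScanExit(returncode).name
    except ValueError:
        return "UNKNOWN"


def is_tool_failure(returncode: int) -> bool:
    """True when the scan itself broke, as opposed to finding issues."""
    return returncode not in (ScanExit.OK, ScanExit.GATE_FAILED)
```

A CI wrapper might report `GATE_FAILED` as a content problem for the docs author and anything where `is_tool_failure` is true as a pipeline problem for whoever maintains the workflow.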

Gating findings in CI

ChunkLint exposes two mutually exclusive gate flags:

--fail-on is an exact-severity gate. --fail-on high fails only on high findings; --fail-on medium fails only on medium findings. You can pass a comma list (--fail-on high,medium) to gate on several specific severities at once.
--fail-on-at-or-above is a threshold gate. --fail-on-at-or-above medium fails on medium and high findings; --fail-on-at-or-above low fails on anything.

Pick --fail-on when you want surgical control (e.g., block high without caring about medium yet). Pick --fail-on-at-or-above when you want the usual "block this and worse" CI behavior. The two flags cannot be combined, so the intent stays explicit in your workflow file.
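The difference between the two gates can be sketched in a few lines. This is an illustration of the semantics described above, not ChunkLint's implementation; the function names and the `counts` dictionary are assumptions made for the example.

```python
# Severities in ascending order, as used by the threshold gate.
SEVERITY_ORDER = ["low", "medium", "high"]


def exact_gate(counts: dict, fail_on: str) -> bool:
    """Mimic --fail-on: fail only when a listed severity has findings."""
    selected = {name.strip() for name in fail_on.split(",")}
    return any(counts.get(name, 0) > 0 for name in selected)


def threshold_gate(counts: dict, minimum: str) -> bool:
    """Mimic --fail-on-at-or-above: fail on `minimum` and anything worse."""
    start = SEVERITY_ORDER.index(minimum)
    return any(counts.get(name, 0) > 0 for name in SEVERITY_ORDER[start:])


counts = {"high": 0, "medium": 2, "low": 5}

print(exact_gate(counts, "high"))           # False: no high findings
print(exact_gate(counts, "high,medium"))    # True: medium findings exist
print(threshold_gate(counts, "medium"))     # True: medium is at the threshold
print(threshold_gate(counts, "high"))       # False: nothing at or above high
```

With the same counts, `--fail-on high` passes while `--fail-on-at-or-above medium` fails, which is the practical reason to choose one flag deliberately.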

In text output, the gate shows an overall high/medium/low lint summary plus root causes for the selected severity. JSON output still contains the full machine-readable scan. Use --quiet when you only want the exit code.

`init`

Creates a default config file.

chunklint init

Create a config at a custom path:

chunklint init config/chunklint.yml

Overwrite an existing config:

chunklint init --force

`rules`

Lists rule IDs, default severities, and whether each rule checks one chunk or the full chunk set.

chunklint rules

More CLI detail is in docs/cli.md.

Python SDK

The generic SDK accepts dictionaries, Chunk models, and supported framework objects.

from chunklint import lint

chunks = [
    {
        "id": "chunk_1",
        "text": "Refund Policy. Customers can request refunds within 30 days.",
        "source": "refund_policy.md",
        "metadata": {"heading": "Refund Policy"},
    },
    {
        "id": "chunk_2",
        "text": "except enterprise customers may request refunds within 90 days.",
        "source": "refund_policy.md",
        "metadata": {"heading": "Refund Policy"},
    },
]

report = lint(chunks)

if report.has_high_issues:
    raise RuntimeError(f"ChunkLint found {report.high} high-severity findings.")

The returned report includes counts and issue objects:

print(report.chunks_scanned)
print(report.issues_found)
print(report.high, report.medium, report.low)
print(report.ok)

More SDK detail is in docs/sdk.md.

LangChain

LangChain Document objects are mapped from page_content and metadata.

from chunklint.adapters.langchain import export_documents, lint_documents

docs = loader.load()
chunks = splitter.split_documents(docs)

report = lint_documents(chunks)
export_documents(chunks, "chunks.json")

if report.has_high_issues:
    raise RuntimeError(
        "ChunkLint found high-severity findings. "
        "Run: chunklint scan chunks.json --verbose"
    )

vectorstore.add_documents(chunks)

LlamaIndex

LlamaIndex nodes are mapped from node.get_content(), node.node_id, node.metadata, and node.ref_doc_id.

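from llama_index.core import VectorStoreIndex  # llama-index 0.10+; older releases import from llama_index
from chunklint.adapters.llamaindex import export_nodes, lint_nodes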

nodes = parser.get_nodes_from_documents(documents)
report = lint_nodes(nodes)
export_nodes(nodes, "chunks.json")

if report.has_high_issues:
    raise RuntimeError(
        "ChunkLint found high-severity findings. "
        "Run: chunklint scan chunks.json --verbose"
    )

index = VectorStoreIndex(nodes)

Input Format

JSON input should be an array of chunk objects:

[
  {
    "id": "chunk_1",
    "text": "Refund Policy. Customers can request refunds within 30 days.",
    "source": "refund_policy.pdf",
    "metadata": {
      "page": 2,
      "heading": "Refund Policy"
    }
  }
]

JSONL input should have one chunk object per line:

{"id":"chunk_1","text":"Refund Policy. Customers can request refunds within 30 days.","source":"refund_policy.pdf","metadata":{"page":2,"heading":"Refund Policy"}}
{"id":"chunk_2","text":"except enterprise customers may request refunds within 90 days.","source":"refund_policy.pdf","metadata":{"page":2,"heading":"Refund Policy"}}

Supported text keys for dictionary inputs include text, page_content, and content. Source can be supplied as a top-level source or through metadata keys such as source, file_name, path, or document_id.
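If you build chunk exports by hand, a small pre-check can confirm that each record carries a usable text key and a resolvable source. The sketch below follows the key names listed above; the lookup order, the array-versus-JSONL detection, and the function names are assumptions for illustration, not ChunkLint's loader.

```python
import json

TEXT_KEYS = ("text", "page_content", "content")
SOURCE_METADATA_KEYS = ("source", "file_name", "path", "document_id")


def parse_chunks(raw: str) -> list:
    """Parse either a JSON array or JSONL (one object per line)."""
    stripped = raw.strip()
    if stripped.startswith("["):
        return json.loads(stripped)
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]


def resolve_text(chunk: dict):
    """Return the first string found under a supported text key."""
    for key in TEXT_KEYS:
        value = chunk.get(key)
        if isinstance(value, str):
            return value
    return None


def resolve_source(chunk: dict):
    """Prefer a top-level source, then fall back to metadata keys."""
    if chunk.get("source"):
        return str(chunk["source"])
    metadata = chunk.get("metadata") or {}
    for key in SOURCE_METADATA_KEYS:
        if metadata.get(key):
            return str(metadata[key])
    return None


jsonl = (
    '{"id": "a", "page_content": "Refund Policy.", "metadata": {"file_name": "refund.md"}}\n'
    '{"id": "b", "text": "Shipping.", "source": "shipping.md"}\n'
)
for chunk in parse_chunks(jsonl):
    print(chunk["id"], resolve_text(chunk), resolve_source(chunk))
```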

Config

ChunkLint automatically loads chunklint.yml or chunklint.yaml from the current working directory. You can also pass a config explicitly:

chunklint scan chunks.json --config path/to/chunklint.yml

Example config:

version: 1

thresholds:
  min_words: 30
  max_words: 700
  duplicate_similarity: 0.92
  max_line_break_ratio: 0.35

heading_keys:
  - heading
  - title
  - section
  - heading_path
  - document_title
  - file_name

rules:
  starts_mid_sentence:
    enabled: true
    severity: high
    connector_words:
      - except
      - however
      - therefore
      - because
      - although
      - which
      - that
      - and
      - but
      - or
      - also
      - then
    ignore_start_words:
      - iphone
      - ebay
      - npm
      - openai

  too_short:
    enabled: true
    severity: low

  near_duplicate:
    enabled: true
    severity: low

Full rule and config reference: docs/rules.md.
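The test guide below mentions deep-merge overrides, which suggests a project config only needs the keys it changes. The following sketch shows one common way such a merge can work on plain dictionaries. The base values are copied from the example config above and are not confirmed defaults, and the `deep_merge` helper is hypothetical rather than ChunkLint's own code.

```python
import copy

# Values copied from the example config; treat them as placeholders.
BASE_CONFIG = {
    "thresholds": {"min_words": 30, "max_words": 700},
    "rules": {
        "too_short": {"enabled": True, "severity": "low"},
        "near_duplicate": {"enabled": True, "severity": "low"},
    },
}


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where nested override keys replace base keys.

    Nested dicts are merged key by key. Any other value, including a
    list, replaces the base value wholesale (an assumption of this sketch).
    """
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


user_config = {
    "thresholds": {"min_words": 50},
    "rules": {"near_duplicate": {"enabled": False}},
}
effective = deep_merge(BASE_CONFIG, user_config)
print(effective["thresholds"])
print(effective["rules"]["near_duplicate"])
```

Under this scheme, raising `min_words` leaves `max_words` untouched, and disabling one rule does not reset its severity.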

Rules

Rule	Default	Scope	Purpose
`missing_text`	high	chunk	Flags empty or whitespace-only chunks.
`missing_id`	medium	chunk	Flags chunks without stable IDs.
`missing_source`	medium	chunk	Flags chunks without traceable source metadata.
`missing_heading`	medium	chunk	Flags chunks without heading/title/section metadata. Page labels are not treated as headings.
`starts_mid_sentence`	high	chunk	Flags likely mid-sentence starts using continuation punctuation, configurable connector words, lowercase starts, and false-positive exclusions for headings, code, tables, and known product/tool names.
`ends_mid_sentence`	medium	chunk	Flags likely mid-sentence endings using missing punctuation, continuation punctuation, and trailing connector words while skipping headings, tables, code, URLs, and colon labels.
`broken_chunk_boundary`	high	cross-chunk	Compares adjacent chunks and flags likely sentence splits across chunk boundaries.
`too_short`	low	chunk	Flags chunks below `min_words`; raises to medium when heading context is missing.
`too_long`	medium	chunk	Flags chunks above `max_words`.
`broken_markdown_table`	high	chunk	Flags table fragments without markdown separator/header context.
`broken_code_block`	medium	chunk	Flags odd counts of triple-backtick fences.
`near_duplicate`	low	cross-chunk	Flags chunks above the duplicate similarity threshold.
`pdf_noise`	low	cross-chunk	Flags page labels, repeated headers/footers, hyphenation, and line-break artifacts.
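To give a feel for what deterministic heuristics of this kind look like, here is a rough sketch of three of them: a mid-sentence start check, an odd-fence check, and a near-duplicate check. It is not ChunkLint's code. The function names are hypothetical, connector matching is assumed to be case-insensitive, and similarity is computed with the standard library's `difflib.SequenceMatcher` as a stand-in, since the actual similarity measure is not documented here.

```python
from difflib import SequenceMatcher

CONNECTOR_WORDS = {"except", "however", "therefore", "because", "although",
                   "which", "that", "and", "but", "or", "also", "then"}
IGNORE_START_WORDS = {"iphone", "ebay", "npm", "openai"}
FENCE = "`" * 3  # triple backtick, built this way to keep the example readable


def starts_mid_sentence(text: str) -> bool:
    """Flag a leading connector word or a lowercase first character."""
    words = text.split()
    if not words:
        return False
    first = words[0].strip(".,;:!?\"'()").lower()
    if first in IGNORE_START_WORDS:
        return False  # known lowercase product or tool name
    if first in CONNECTOR_WORDS:
        return True
    return text.lstrip()[0].islower()


def has_unclosed_fence(text: str) -> bool:
    """An odd number of triple-backtick fences means one is unclosed."""
    return text.count(FENCE) % 2 == 1


def near_duplicates(texts: list, threshold: float = 0.92) -> list:
    """Return index pairs whose normalized similarity exceeds threshold."""
    cleaned = [" ".join(t.lower().split()) for t in texts]
    pairs = []
    for i in range(len(cleaned)):
        for j in range(i + 1, len(cleaned)):
            ratio = SequenceMatcher(None, cleaned[i], cleaned[j]).ratio()
            if ratio > threshold:
                pairs.append((i, j))
    return pairs
```

The pairwise comparison is quadratic in the number of chunks, which is fine for a sketch but is one reason real tools tend to use cheaper fingerprints on large corpora.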

Reports

Terminal output is designed for humans:

ChunkLint Report

Chunks scanned: 3
Raw findings: 6
Actionable root causes: 2

High:   2
Medium: 2
Low:    2

Text output groups related rules into root causes, recommends next steps, and keeps examples hidden by default. --verbose shows examples with snippets, and --examples-per-rule controls how many examples are shown. Use --raw when you need row-level findings, and use --raw --max-issues 0 to show every raw row.

JSON output is designed for CI, logs, or downstream tools. It includes summary, rule groups, root causes, recommendations, and raw issues:

{
  "schema_version": 1,
  "summary": {
    "chunks_scanned": 3,
    "issues_found": 6,
    "high": 2,
    "medium": 2,
    "low": 2
  },
  "groups": [],
  "root_causes": [],
  "recommendations": [],
  "issues": [
    {
      "chunk_id": "chunk_2",
      "source": "refund_policy.pdf",
      "rule_id": "starts_mid_sentence",
      "severity": "high",
      "reason": "Chunk starts with connector word \"except\".",
      "why_it_matters": "This chunk likely depends on a previous sentence and may lose the main rule.",
      "fix": "Use sentence-aware splitting or increase overlap.",
      "snippet": "except enterprise customers may request refunds within 90 days."
    }
  ]
}

schema_version is the public JSON contract version. Adding fields is a non-breaking change; renaming or removing existing fields bumps the version. Consumers that depend on a specific schema version should check this key before reading anything else.
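A downstream consumer might check the version before touching any other field. The sketch below relies only on the keys shown in the sample report above; the `summarize_report` helper and its return shape are hypothetical.

```python
import json

SUPPORTED_SCHEMA_VERSION = 1


def summarize_report(raw: str) -> dict:
    """Validate schema_version, then count raw issues per rule."""
    report = json.loads(raw)
    version = report.get("schema_version")
    if version != SUPPORTED_SCHEMA_VERSION:
        raise ValueError(f"Unsupported ChunkLint schema_version: {version!r}")

    by_rule = {}
    for issue in report.get("issues", []):
        rule_id = issue["rule_id"]
        by_rule[rule_id] = by_rule.get(rule_id, 0) + 1

    return {"high": report["summary"]["high"], "by_rule": by_rule}


sample = json.dumps({
    "schema_version": 1,
    "summary": {"chunks_scanned": 3, "issues_found": 6,
                "high": 2, "medium": 2, "low": 2},
    "issues": [
        {"chunk_id": "chunk_2", "rule_id": "starts_mid_sentence", "severity": "high"},
    ],
})
print(summarize_report(sample))
```

Failing loudly on an unknown version is a deliberate choice here: a silent best-effort parse could misread renamed fields after a schema bump.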

CI

Use ChunkLint as a pre-embedding quality gate:

name: ChunkLint

on:
  pull_request:
    paths:
      - "docs/**"
      - "scripts/generate_chunks.py"
      - "chunklint.yml"

jobs:
  chunklint:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: pip install chunklint

      - name: Generate chunks
        run: python scripts/generate_chunks.py

      - name: Run ChunkLint
        run: chunklint scan chunks.json --fail-on high

The same workflow is available at examples/github_action.yml.

Project Layout

chunklint/
  cli.py              # Typer CLI commands
  config.py           # YAML config model and defaults
  engine.py           # Rule orchestration and report creation
  loader.py           # JSON/JSONL load and export helpers
  models.py           # Pydantic models for chunks, issues, reports
  normalizer.py       # Converts dicts/framework objects into Chunk models
  reporter.py         # Rich terminal output and JSON report serialization
  adapters/           # LangChain and LlamaIndex adapters
  rules/              # One file per rule family
  utils/              # Text, metadata, and severity helpers
docs/                 # User and contributor documentation
examples/             # Demo inputs and integration snippets
tests/                # Pytest coverage for CLI, loader, adapters, and rules

Test Folder Guide

The tests/ folder is the safety net for the first release:

File	What it verifies
`tests/test_loader.py`	JSON array loading, JSONL loading, and chunk export behavior.
`tests/test_cli.py`	CLI exit codes, `--fail-on`, JSON report writing, and config initialization.
`tests/test_adapters.py`	LangChain and LlamaIndex adapter normalization, linting, and export helpers.
`tests/test_missing_rules.py`	`missing_text`, `missing_id`, `missing_source`, and `missing_heading`.
`tests/test_boundary_rules.py`	Mid-sentence start/end detection, stronger continuation signals, false-positive exclusions, and boundary-rule config options.
`tests/test_table_rule.py`	Broken markdown table detection and valid table pass-through.
`tests/test_code_rule.py`	Unclosed markdown code-fence detection.
`tests/test_duplicate_rule.py`	Near-duplicate detection across chunks.
`tests/test_pdf_noise_rule.py`	PDF page-label noise and repeated footer/header detection.
`tests/test_reporter.py`	Grouped report summaries, JSON report shape, and recommendation generation.
`tests/test_config.py`	YAML config loading, auto-discovery, deep-merge overrides, and rule disabling.
`tests/test_json_schema.py`	Pins the public JSON output contract so downstream consumers can rely on it.
`tests/test_sdk_quiet.py`	Validates the quiet/SDK code path that returns a report without printing.

More detail is in docs/testing.md.

Run tests:

python -m pytest

Run linting when dev dependencies are installed:

python -m ruff check .

Project Status

ChunkLint v0.1.0 is the first public release on PyPI. The CLI, Python SDK, framework adapters, rules, and config schema are usable today, and the JSON output is independently versioned via schema_version: 1. While the project is still on 0.x, minor versions may introduce breaking changes as the API gets real-world feedback. The CHANGELOG.md file tracks notable changes between releases.

License

MIT. See LICENSE.

Release history

1.0.0 (yanked): May 26, 2026
0.1.0 (this version): May 26, 2026

Download files

Source Distribution

chunklint-0.1.0.tar.gz (51.2 kB view details)

Uploaded May 26, 2026 Source

Built Distribution

chunklint-0.1.0-py3-none-any.whl (35.8 kB view details)

Uploaded May 26, 2026 Python 3

File details

Details for the file chunklint-0.1.0.tar.gz.

File metadata

Download URL: chunklint-0.1.0.tar.gz
Upload date: May 26, 2026
Size: 51.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for chunklint-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`a3f3b62446b8882013d95a30bd90ab997b81d914d60c52708311944e43ee1239`
MD5	`f1c3785ad41853ec0c27055dc6452d26`
BLAKE2b-256	`9656050939ec17fcdb1f3a9dfc642d4b6fc95b0b1ff0f22eb4724f1b3b9cedcb`

File details

Details for the file chunklint-0.1.0-py3-none-any.whl.

File metadata

Download URL: chunklint-0.1.0-py3-none-any.whl
Upload date: May 26, 2026
Size: 35.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for chunklint-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2f16e0bddf2c41a21e278a0f359e28d3df9e05a866b35032e051471e8c7903ce`
MD5	`240fd869a0559a41a65136d859760a66`
BLAKE2b-256	`794c808f491437c28ce4e639059f8821acb7ffac7cd94cc9040cfa505561aa23`
