Semantic chunking utilities for scientific code and documentation corpora.

Chunky

Chunky is a Python package for intelligently chunking scientific and technical repositories. It provides a modular pipeline that powers the Nancy Brain knowledge base and MCP services, while remaining useful as a standalone library for retrieval systems that need deterministic, metadata-rich chunks.

Citation

@software{chunky,
  author       = {Amber Malpas},
  title        = {AmberLee2427/chunky: v2.0.0},
  month        = mar,
  year         = 2026,
  publisher    = {Zenodo},
  version      = {v2.0.0},
  doi          = {10.5281/zenodo.18891712},
  url          = {https://doi.org/10.5281/zenodo.18891712},
}

Highlights

Deterministic sliding-window fallback that keeps progress even on unknown file types.
Registry-driven architecture so language-specific chunkers can be added without touching callers.
Rich metadata (chunk_id, line_start, line_end, character spans) ready for downstream RAG and citation tooling.
Optional forward-merge for tiny chunks via ChunkerConfig(min_chunk_chars=...), preserving content that would otherwise be dropped by downstream filters.
Language-aware chunkers for Python, Markdown, YAML/JSON config, plain text, Fortran, reStructuredText (.rst), notebook exports (.nb.txt), and (via Tree-sitter) C/C++/HTML/Bash.
Batteries-included tooling: Hatchling builds, Ruff linting, pytest coverage, Sphinx docs, and automated releases to PyPI + Read the Docs.
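
The registry-driven design mentioned in the highlights can be pictured with a small stand-alone sketch. This is an illustration only and is not Chunky's actual code: ToyRegistry, register, resolve, and both toy chunkers are hypothetical names invented here. The point is that callers ask the registry for a handler by file suffix and get a fallback when nothing is registered.

```python
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical type: a chunker maps raw text to a list of chunk strings.
Chunker = Callable[[str], List[str]]


def paragraph_chunker(text: str) -> List[str]:
    """Stand-in for a language-aware chunker: split on blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def whole_file_chunker(text: str) -> List[str]:
    """Stand-in for the fallback: emit the whole text as one chunk."""
    return [text] if text else []


class ToyRegistry:
    """Maps file suffixes to chunkers; unknown suffixes get the fallback."""

    def __init__(self, fallback: Chunker) -> None:
        self._by_suffix: Dict[str, Chunker] = {}
        self._fallback = fallback

    def register(self, suffix: str, chunker: Chunker) -> None:
        self._by_suffix[suffix.lower()] = chunker

    def resolve(self, path: Path) -> Chunker:
        return self._by_suffix.get(path.suffix.lower(), self._fallback)


registry = ToyRegistry(fallback=whole_file_chunker)
registry.register(".txt", paragraph_chunker)


def chunk(path_name: str, text: str) -> List[str]:
    """Callers never name a chunker; the registry picks one."""
    return registry.resolve(Path(path_name))(text)
```

Adding support for a new file type in this sketch is a single register call, and chunk itself does not change.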

Quick Start

from pathlib import Path

from chunky import ChunkPipeline, ChunkerConfig

pipeline = ChunkPipeline()
config = ChunkerConfig(
    max_chars=1000,
    min_chunk_chars=80,  # forward-merge tiny chunks into their successor
    lines_per_chunk=40,
    line_overlap=5,
)

chunks = pipeline.chunk_file(Path("path/to/file.py"), config=config)

for chunk in chunks[:2]:
    print(chunk.chunk_id, chunk.metadata["line_start"], chunk.metadata["line_end"])

See the v2 implementation spec and v2.1 forward-merge spec for release behavior details. The original semantic chunker design draft is retained as archival context.

Documentation lives on Read the Docs: https://chunky.readthedocs.io

Built-in Chunkers

PythonSemanticChunker — splits modules on top-level functions/classes and groups leftover module context.
MarkdownHeadingChunker — emits chunks per heading while keeping the optional intro section.
JSONYamlChunker — slices configs by top-level keys/items and falls back gracefully when parsing fails.
PlainTextChunker — groups blank-line-separated paragraphs before falling back to sliding windows.
FortranChunker — captures subroutine/function/program blocks.
RSTChunker — detects reStructuredText section headings and chunks by section.
NotebookChunker — groups nb4llm notebook cells (.nb.txt) into markdown+code context chunks.
Tree-sitter chunkers (optional extra) for C/C++, HTML, Bash, and other structural languages, with gap-filling so uncaptured lines are still emitted.
SlidingWindowChunker — deterministic line windows with overlap when no specialised handler is available.
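
To make the sliding-window fallback concrete, here is a minimal sketch of deterministic line windows with overlap. It is not the SlidingWindowChunker source. The function name sliding_windows and the 1-based, inclusive line numbering are assumptions made for illustration, and the defaults mirror the lines_per_chunk=40 and line_overlap=5 values from the Quick Start.

```python
from typing import List, Tuple


def sliding_windows(
    lines: List[str], lines_per_chunk: int = 40, line_overlap: int = 5
) -> List[Tuple[int, int, str]]:
    """Return (line_start, line_end, text) windows.

    Line numbers are 1-based and inclusive (an assumption of this sketch).
    """
    if lines_per_chunk <= 0:
        raise ValueError("lines_per_chunk must be positive")
    if not 0 <= line_overlap < lines_per_chunk:
        raise ValueError("line_overlap must be in [0, lines_per_chunk)")

    step = lines_per_chunk - line_overlap  # how far each window advances
    windows: List[Tuple[int, int, str]] = []
    start = 0
    while start < len(lines):
        end = min(start + lines_per_chunk, len(lines))
        windows.append((start + 1, end, "\n".join(lines[start:end])))
        if end == len(lines):
            break  # the last window already reaches the end of the file
        start += step
    return windows


lines = [f"line {i}" for i in range(1, 101)]
spans = [(start, end) for start, end, _ in sliding_windows(lines)]
```

Because the windows depend only on the input lines and the two integers, the same file always yields the same spans, which is what makes the fallback deterministic.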

Chunk Identifiers

Each chunk defaults to an ID of the form <doc_id>#chunk-0000. Supply a logical document identifier via Document.metadata["doc_id"] (or override the key with ChunkerConfig.doc_id_key) and customise the suffix using ChunkerConfig.chunk_id_template (both {doc_id} and {index} are available).
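
As a rough illustration of how such a template expands, the sketch below uses plain str.format. The default template string is an assumption inferred from the <doc_id>#chunk-0000 example above, and make_chunk_id is a hypothetical helper, not part of Chunky's API.

```python
# Assumed shape of the default template, inferred from "<doc_id>#chunk-0000".
DEFAULT_TEMPLATE = "{doc_id}#chunk-{index:04d}"


def make_chunk_id(doc_id: str, index: int, template: str = DEFAULT_TEMPLATE) -> str:
    """Expand a chunk ID template using the {doc_id} and {index} fields."""
    return template.format(doc_id=doc_id, index=index)


default_id = make_chunk_id("docs/guide.md", 0)
custom_id = make_chunk_id("docs/guide.md", 12, template="{doc_id}::part-{index}")
```

Zero-padding the index keeps IDs sortable as strings, which is convenient when chunks are stored in a key-value index.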

Forward-Merge for Tiny Chunks

Use ChunkerConfig.min_chunk_chars to preserve small leading/trailing context by merging tiny chunks into their successor (or predecessor at end-of-file) instead of dropping them in downstream pipelines.

You can also apply this utility directly:

from chunky import merge_small_chunks
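
The merging behaviour described above can be sketched on plain strings. This is a simplified stand-in and not the merge_small_chunks implementation, whose signature is not shown here: forward_merge is a hypothetical function, and a real implementation would also have to combine chunk metadata such as line spans.

```python
from typing import List


def forward_merge(chunks: List[str], min_chunk_chars: int) -> List[str]:
    """Merge chunks shorter than min_chunk_chars into the next chunk.

    A tiny chunk at the end has no successor, so it is appended to the
    previous chunk instead.
    """
    merged: List[str] = []
    carry = ""  # tiny text waiting to be prepended to the next chunk
    for chunk in chunks:
        candidate = carry + chunk
        if len(candidate) < min_chunk_chars:
            carry = candidate  # still too small, keep carrying forward
        else:
            merged.append(candidate)
            carry = ""
    if carry:
        if merged:
            merged[-1] += carry  # end of file: merge into the predecessor
        else:
            merged.append(carry)  # everything was tiny: keep it as one chunk
    return merged


pieces = [
    "import os\n",
    "def main():\n    return os.getcwd()\n",
    "def other():\n    return 42\n",
    "# end\n",
]
result = forward_merge(pieces, min_chunk_chars=20)
```

No text is lost in this sketch: joining the merged chunks reproduces the original input exactly.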

Installation

Install from PyPI:

pip install chunky-files

Or install from source using the pyproject.toml metadata:

# clone the repo (if you haven't already)
git clone https://github.com/AmberLee2427/chunky.git
cd chunky

# install the library
pip install .

For development and documentation builds, install the optional extras:

pip install -e ".[dev,docs]"

-e performs an editable install, so local changes take effect immediately. .[dev,docs] installs the tooling declared under the dev and docs extras in pyproject.toml.

To enable the Tree-sitter-powered chunkers for C/C++/HTML/Bash (and other supported grammars), install the tree extra. The quotes stop shells such as zsh from interpreting the square brackets:

pip install "chunky-files[tree]"

This extra pins tree-sitter==0.20.1 alongside the bundled tree-sitter-languages so the shipped grammar binaries load correctly.

Tooling

Code style: Ruff (ruff check src tests or ruff check src tests --fix)
Tests: Pytest (pytest --cov=chunky)
Docs: Sphinx + MyST + Furo (sphinx-build -b html docs docs/_build/html)
Packaging: Hatchling build backend
Versioning: bump-my-version (driven by tags and the release workflow)

Workflows

CI tests run on Linux, macOS, and Windows for Python 3.8 through 3.12.
Pushing a tag that matches the form vX.Y.Z triggers the release workflow. It validates that the tag matches the version in pyproject.toml, builds the distribution, and publishes to PyPI using the PYPI_API_TOKEN secret.
Read the Docs builds the documentation automatically for pushes to the default branch. Local builds use sphinx-build -b html docs docs/_build/html.

Release checklist:

Review and update CHANGELOG.md, keeping the [Unreleased] section accurate.
Run bump-my-version bump <part> to update version metadata and append a dated entry in the changelog.
Build distributions locally (rm -rf dist && python -m build) and verify metadata with python -m twine check dist/*.
Commit the changes and push to main.
Tag the commit (git tag vX.Y.Z && git push origin vX.Y.Z) to trigger the Release workflow.
Verify the PyPI publish job and Read the Docs build succeed.

Contributing

Know your audience: most contributors will be scientific coders. Write docs assuming limited familiarity with packaging internals.
Use Ruff for style checks and keep NumPy-style docstrings on all non-test functions.
Target test coverage above 70% and ensure existing CI jobs pass before opening a PR.
In pull requests, summarise code changes, testing/validation, doc updates, and provide a brief TL;DR when the description runs long.
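
For contributors who have not written NumPy-style docstrings before, the layout looks roughly like this. The function count_lines is a made-up example and does not exist in Chunky. The section names (Parameters, Returns, Examples) follow the numpydoc convention.

```python
def count_lines(text: str, skip_blank: bool = False) -> int:
    """Count the lines in a block of text.

    Parameters
    ----------
    text : str
        The text to inspect.
    skip_blank : bool, optional
        If True, lines containing only whitespace are ignored.
        Default is False.

    Returns
    -------
    int
        The number of lines counted.

    Examples
    --------
    >>> count_lines("a\\n\\nb")
    3
    >>> count_lines("a\\n\\nb", skip_blank=True)
    2
    """
    lines = text.splitlines()
    if skip_blank:
        lines = [line for line in lines if line.strip()]
    return len(lines)
```

Each section is a heading underlined with dashes, and every parameter is listed as name : type followed by an indented description.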

License

Chunky is released under the MIT License.

Glossary

Term	Meaning
PR	GitHub pull request – a request to merge one branch or fork with another
Release	Publishing a tagged version of the project to PyPI
Changelog	A document describing changes between releases (CHANGELOG.md in this repo)
PyPI	Python Package Index – where published distributions live
Ruff	A fast Python linter/formatter used for style enforcement
origin	The upstream GitHub repository
fork	A downstream copy of the origin repo used for contributing
master/main	The default branch
CI	Continuous Integration – automated checks that run on every push/PR
GitHub Workflows	GitHub’s automation runner configured via YAML files
`pyproject.toml`	Core metadata and build configuration for the package
bump-my-version	CLI used to bump version numbers consistently
Read the Docs	Hosted documentation service that builds from the repo

Release history

Version	Released
2.2.0 (this version)	Mar 7, 2026
2.1.1	Mar 7, 2026
2.1.0	Mar 7, 2026
2.0.1	Mar 6, 2026
2.0.0	Mar 6, 2026
1.0.0	Oct 8, 2025
0.4.1	Oct 3, 2025
0.4.0	Oct 3, 2025
0.3.0	Oct 3, 2025
0.2.2	Oct 1, 2025
0.2.0	Oct 1, 2025

Download files

Source Distribution

chunky_files-2.2.0.tar.gz (296.7 kB)

Uploaded Mar 7, 2026 Source

Built Distribution

chunky_files-2.2.0-py3-none-any.whl (27.6 kB)

Uploaded Mar 7, 2026 Python 3

File details

Details for the file chunky_files-2.2.0.tar.gz.

File metadata

Download URL: chunky_files-2.2.0.tar.gz
Upload date: Mar 7, 2026
Size: 296.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for chunky_files-2.2.0.tar.gz
Algorithm	Hash digest
SHA256	`b4c45686f3c3ebc3b2b54d5803a67bd43e46f837ef6581fbf7119aaf20739549`
MD5	`cec4845d63cc1ecf3c26982a05558bd7`
BLAKE2b-256	`d8a7cb49d1ef9dcb1bbcd0b9fffca0a791d200505876d46f64e62cd05cb6063e`

File details

Details for the file chunky_files-2.2.0-py3-none-any.whl.

File metadata

Download URL: chunky_files-2.2.0-py3-none-any.whl
Upload date: Mar 7, 2026
Size: 27.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for chunky_files-2.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`84f723e5f018f601362af0b83d431b675556f4d0d381dfca74f3ad5ac45458d4`
MD5	`5cc1584463ca1912dbe144e4dcf93148`
BLAKE2b-256	`a4cea6381a24ff01f6485db1df214d0d34c49f5e558ef476421cc72727127a80`
