Semantic chunking utilities for scientific code and documentation corpora.

Chunky

Chunky is a Python package for intelligently chunking scientific and technical repositories. It provides a modular pipeline that powers the Nancy Brain knowledge base and MCP services, and it also works as a standalone library for retrieval systems that need deterministic, metadata-rich chunks.

Highlights

  • Deterministic baseline chunking with overlap-aware windowing that works for any text file.
  • Registry-driven architecture so language-specific chunkers can be added without touching callers.
  • Rich metadata (chunk_id, line_start, line_end, character spans) ready for downstream RAG and citation tooling.
  • Batteries-included tooling: Hatchling builds, Ruff linting, pytest coverage, Sphinx docs, and automated releases to PyPI + Read the Docs.
  • Language-aware chunkers for Python, Markdown, YAML/JSON config, and plain text with automatic fallback to sliding windows.
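
As a rough illustration of the registry-driven idea in the highlights above, the sketch below maps file extensions to chunker callables and falls back to a default when no specialised handler is registered. The names used here (ChunkerRegistry, register, resolve, and the two splitter functions) are hypothetical and are not taken from Chunky's API.

```python
from pathlib import Path
from typing import Callable, Dict, List

# In this sketch a "chunker" is any callable that turns text into string chunks.
Chunker = Callable[[str], List[str]]


def split_paragraphs(text: str) -> List[str]:
    """Fallback chunker: split on blank lines and drop empty blocks."""
    return [block.strip() for block in text.split("\n\n") if block.strip()]


def split_on_headings(text: str) -> List[str]:
    """Start a new chunk at every line that begins with '#'."""
    chunks: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


class ChunkerRegistry:
    """Hypothetical registry that maps file extensions to chunkers."""

    def __init__(self, fallback: Chunker) -> None:
        self._fallback = fallback
        self._by_extension: Dict[str, Chunker] = {}

    def register(self, extension: str, chunker: Chunker) -> None:
        self._by_extension[extension.lower()] = chunker

    def resolve(self, path: Path) -> Chunker:
        # Unknown extensions drop to the fallback, so callers never branch on type.
        return self._by_extension.get(path.suffix.lower(), self._fallback)


registry = ChunkerRegistry(fallback=split_paragraphs)
registry.register(".md", split_on_headings)
```

With this layout, supporting a new file type means one extra register call; code that calls resolve does not change.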

Quick Start

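from pathlib import Path

from chunky import ChunkPipeline, ChunkerConfig

# Create a pipeline and set the chunk size and overlap, both measured in lines.
pipeline = ChunkPipeline()
config = ChunkerConfig(lines_per_chunk=80, line_overlap=10)

# Chunk a single file; replace the placeholder path with a real one.
chunks = pipeline.chunk_file(Path("path/to/file.py"), config=config)

# Print the ID and line range of the first two chunks.
for chunk in chunks[:2]:
    print(chunk.chunk_id, chunk.metadata["line_start"], chunk.metadata["line_end"])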
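
To make the two configuration values above concrete, here is a minimal sketch of overlap-aware line windowing. It is not Chunky's implementation: the function name and the dictionary layout are assumptions, and only the parameter names lines_per_chunk and line_overlap and the line_start/line_end keys come from the example above. The sketch assumes 1-based, inclusive line numbers, which may differ from what Chunky reports.

```python
from typing import Dict, List


def sliding_window_chunks(
    text: str, lines_per_chunk: int = 80, line_overlap: int = 10
) -> List[Dict[str, object]]:
    """Split text into fixed-size line windows that share line_overlap lines."""
    if lines_per_chunk <= 0:
        raise ValueError("lines_per_chunk must be positive")
    if not 0 <= line_overlap < lines_per_chunk:
        raise ValueError("line_overlap must be in [0, lines_per_chunk)")

    lines = text.splitlines()
    step = lines_per_chunk - line_overlap  # how far each window advances
    chunks: List[Dict[str, object]] = []
    start = 0
    while start < len(lines):
        end = min(start + lines_per_chunk, len(lines))
        chunks.append(
            {
                "text": "\n".join(lines[start:end]),
                "line_start": start + 1,  # 1-based (assumption)
                "line_end": end,  # inclusive (assumption)
            }
        )
        if end == len(lines):
            break  # the last window already reached the end of the file
        start += step
    return chunks
```

Because the windows depend only on the input text and the two integers, the same file always yields the same chunks, which is what makes this kind of baseline deterministic.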

See the design notes for the roadmap toward language-aware and embedding-driven chunkers.

Documentation lives on Read the Docs: https://chunky.readthedocs.io

Built-in Chunkers

  • PythonSemanticChunker — splits modules on top-level functions/classes and groups leftover module context.
  • MarkdownHeadingChunker — emits chunks per heading while keeping the optional intro section.
  • JSONYamlChunker — slices configs by top-level keys/items and falls back gracefully when parsing fails.
  • PlainTextChunker — groups blank-line-separated paragraphs; other files drop to the sliding-window fallback.
  • SlidingWindowChunker — deterministic line windows with overlap when no specialised handler is available.
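
For a sense of how splitting a module on top-level functions and classes can work, the sketch below uses the standard-library ast module to locate top-level definitions and report their line ranges. It is an illustration under stated assumptions, not PythonSemanticChunker itself: the function name top_level_spans is hypothetical, and it ignores decorators and the leftover module context for brevity. Returning an empty list on a syntax error stands in for handing the file to a fallback chunker.

```python
import ast
from typing import List, Tuple


def top_level_spans(source: str) -> List[Tuple[str, int, int]]:
    """Return (name, first_line, last_line) for each top-level def or class."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Unparseable source: signal the caller to use another strategy.
        return []

    spans: List[Tuple[str, int, int]] = []
    for node in tree.body:  # tree.body holds only module-level statements
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # end_lineno is available on Python 3.8 and later.
            spans.append((node.name, node.lineno, node.end_lineno))
    return spans


example = '''import math

def area(r):
    return math.pi * r ** 2

class Circle:
    def __init__(self, r):
        self.r = r
'''

for name, first, last in top_level_spans(example):
    print(name, first, last)
```

Lines not covered by any span (here the import on line 1) are the kind of leftover module context that a real chunker would need to group separately.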

Installation

Install from PyPI:

pip install chunky-files

Or install from source using the pyproject.toml metadata:

# clone the repo (if you haven't already)
git clone https://github.com/AmberLee2427/chunky.git
cd chunky

# install the library
pip install .

For development and documentation builds, install the optional extras:

pip install -e ".[dev,docs]"

The -e flag performs an editable install, so local changes take effect without reinstalling. The .[dev,docs] target installs the package from the current directory together with the tooling declared under the dev and docs extras in pyproject.toml.

Tooling

  • Code style: Ruff (ruff check src tests, or ruff check src tests --fix to apply automatic fixes)
  • Tests: Pytest (pytest --cov=chunky)
  • Docs: Sphinx + MyST + Furo (sphinx-build -b html docs docs/_build/html)
  • Packaging: Hatchling build backend
  • Versioning: bump-my-version (driven by tags and the release workflow)

Workflows

  • CI tests run on Linux, macOS, and Windows for Python 3.8 through 3.12.
  • Pushing a tag that matches the form vX.Y.Z triggers the release workflow. It validates that the tag matches the version in pyproject.toml, builds the distribution, and publishes to PyPI using the PYPI_API_TOKEN secret.
  • Read the Docs builds the documentation automatically for pushes to the default branch. Local builds use sphinx-build -b html docs docs/_build/html.

Release checklist:

  1. Review and update CHANGELOG.md, keeping the [Unreleased] section accurate.
  2. Run bump-my-version bump <part> to update version metadata and append a dated entry in the changelog.
  3. Build distributions locally (rm -rf dist && python -m build) and verify metadata with python -m twine check dist/*.
  4. Commit the changes and push to main.
  5. Tag the commit (git tag vX.Y.Z && git push origin vX.Y.Z) to trigger the Release workflow.
  6. Verify the PyPI publish job and Read the Docs build succeed.

Contributing

  • Know your audience: most contributors will be scientific coders. Write docs assuming limited familiarity with packaging internals.
  • Use Ruff for style checks and keep numpy-style docstrings on all non-test functions.
  • Target test coverage above 70% and ensure existing CI jobs pass before opening a PR.
  • In pull requests, summarise code changes, testing/validation, doc updates, and provide a brief TL;DR when the description runs long.
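
For contributors who have not written numpy-style docstrings before, the following sketch shows the general layout: a one-line summary, then Parameters and Returns sections with underlined headings. The function is a made-up helper for illustration and is not part of Chunky.

```python
def count_nonblank_lines(text: str) -> int:
    """Count the lines in a string that contain non-whitespace characters.

    Parameters
    ----------
    text : str
        The text to inspect. Any common line ending is accepted.

    Returns
    -------
    int
        The number of lines with at least one non-whitespace character.
    """
    return sum(1 for line in text.splitlines() if line.strip())
```

Each section heading is underlined with hyphens of the same length, and each parameter is written as "name : type" followed by an indented description.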

License

Chunky is released under the MIT License.

Glossary

  • PR: GitHub pull request, a request to merge one branch or fork into another.
  • Release: Publishing a tagged version of the project to PyPI.
  • Changelog: A document describing changes between releases (CHANGELOG.md in this project).
  • PyPI: Python Package Index, where published distributions live.
  • Ruff: A fast Python linter/formatter used for style enforcement.
  • origin: The Git remote name used in this guide for the upstream GitHub repository.
  • fork: A downstream copy of the origin repo used for contributing.
  • master/main: The default branch.
  • CI: Continuous Integration, automated checks that run on every push/PR.
  • GitHub Workflows: GitHub's automation jobs, configured via YAML files.
  • pyproject.toml: Core metadata and build configuration for the package.
  • bump-my-version: CLI used to bump version numbers consistently.
  • Read the Docs: Hosted documentation service that builds from the repo.
