chunking-ai

A fast and light-weight library for ingesting and chunking files

Installation

pip install ...

Install the following system dependencies for the chunking library:

pandoc: for parsing markup languages (e.g. epub, html, rtf, rst, docx...)
libreoffice: for file conversion (e.g. doc, ppt, xls to docx, pptx, xlsx...)
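
Before parsing files, it can help to confirm that both system dependencies are reachable. The following is a minimal sketch using only the Python standard library; it is not part of chunking. The executable names are assumptions (LibreOffice is often exposed as `libreoffice` or `soffice`, depending on the platform).

```python
import shutil

# Candidate executable names per dependency. These names are assumptions:
# LibreOffice is commonly installed as either `libreoffice` or `soffice`.
SYSTEM_DEPENDENCIES = {
    "pandoc": ["pandoc"],
    "libreoffice": ["libreoffice", "soffice"],
}


def find_system_dependencies(candidates=SYSTEM_DEPENDENCIES):
    """Map each dependency to the first executable found on PATH, or None."""
    found = {}
    for name, executables in candidates.items():
        paths = (shutil.which(exe) for exe in executables)
        found[name] = next((p for p in paths if p), None)
    return found


for name, path in find_system_dependencies().items():
    print(f"{name}: {path or 'NOT FOUND'}")
```

A `NOT FOUND` entry means the corresponding formats listed above are unlikely to parse until the tool is installed.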

Features

Library responsibility:

Graph: Files ----> Parse ----> Chunk ----> Index ----> Search ----> Represent retrieval context ----> LLM

We are responsible for:

Parse, Chunk
Represent retrieval context

How is chunking different? We chunk the document structure rather than the raw text. Taking the structure (sections, headings, and so on) into account produces more sensible chunks.
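
To illustrate the idea of chunking structure rather than text, here is a small self-contained sketch. It is not chunking's implementation or API: `Section`, `parse_markdown`, and `chunk_sections` are hypothetical names, and the parser only understands Markdown `#` headings (it does not skip fenced code blocks). Each chunk carries the heading path it belongs to, so a chunk never straddles two sections.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """A node in the document tree: a heading plus the text directly under it."""
    title: str
    level: int
    lines: list = field(default_factory=list)
    children: list = field(default_factory=list)


def parse_markdown(text):
    """Build a section tree from ATX headings ('#', '##', ...)."""
    root = Section(title="", level=0)
    stack = [root]
    for line in text.splitlines():
        stripped = line.lstrip()
        hashes = len(stripped) - len(stripped.lstrip("#"))
        if 0 < hashes <= 6 and stripped[hashes:hashes + 1] == " ":
            node = Section(title=stripped[hashes:].strip(), level=hashes)
            # Pop back to the nearest ancestor with a shallower level.
            while stack[-1].level >= hashes:
                stack.pop()
            stack[-1].children.append(node)
            stack.append(node)
        elif line.strip():
            stack[-1].lines.append(line.strip())
    return root


def chunk_sections(node, path=()):
    """Yield one chunk per section that has body text, with its heading path."""
    here = path + (node.title,) if node.title else path
    if node.lines:
        yield {"path": " > ".join(here), "text": " ".join(node.lines)}
    for child in node.children:
        yield from chunk_sections(child, here)


doc = """# Guide
Intro text.
## Install
Run the installer.
## Usage
Call the parser.
"""

chunks = list(chunk_sections(parse_markdown(doc)))
for chunk in chunks:
    print(f"[{chunk['path']}] {chunk['text']}")
```

A plain text splitter with a fixed window could cut this document in the middle of the "Install" section; the tree-based version keeps each section's text together and labels it with its position in the outline.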

Maintain sectional layout structure during parsing and chunking (*).
Supported formats (refer ... for suitable parsers for each format):
- Text: html, md, txt, epub, latex, org, rtf, rst
- Office documents: pdf, docx, pptx, xlsx
- Images: jpg, png
- Audio: wav, mp3
- Video: mp4
- Code: ipynb, (coming soon: py, js)
- Data interchange: csv, json, yaml, toml
Content linking across files.
Support LLM integration for correct content parsing.
Task description for agent-oriented RAG strategies.
Fast.
- At least 100MB/s parsing
- At least 100MB/s splitting
Extensible.
- Easy to add a new strategy
- Easy to change the configuration of the current strategy
Traceable: trace from chunk to source.
Developer-friendly.
- Evaluation
- Benchmark
- Config selector
- Docker
- CLI to chunk
Complete.
- All common file types
- All chunking strategies

(*) Because structure can be represented in many complex ways across file types, errors can occur. chunking treats structure preservation as best effort. Refer to xxx for difficult cases. File an issue if you encounter a problem.
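
The "Traceable" feature above (trace from chunk to source) can be pictured with the following sketch. It is an assumption-laden illustration, not the library's Chunk interface: `TracedChunk`, `split_paragraphs`, and `resolve` are hypothetical names. The point is only that each chunk records where it came from, so it can be mapped back to the exact span of the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TracedChunk:
    """Hypothetical chunk record; not the library's Chunk interface."""
    text: str
    source: str  # file the chunk came from
    start: int   # character offset in the source (inclusive)
    end: int     # character offset in the source (exclusive)


def split_paragraphs(content, source):
    """Split on blank lines, recording each paragraph's offsets in `content`."""
    chunks = []
    cursor = 0
    for block in content.split("\n\n"):
        text = block.strip()
        if text:
            start = content.index(text, cursor)
            chunks.append(TracedChunk(text, source, start, start + len(text)))
        cursor += len(block) + 2  # skip the block and its "\n\n" separator
    return chunks


def resolve(chunk, content):
    """Trace a chunk back to the exact span it was cut from."""
    return content[chunk.start:chunk.end]


content = "First paragraph.\n\nSecond paragraph.\n"
traced = split_paragraphs(content, source="notes.txt")
for chunk in traced:
    print(chunk.source, chunk.start, chunk.end, repr(resolve(chunk, content)))
```

Storing offsets rather than only the text means a retrieval result can be highlighted in, or re-read from, the original file.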

Usage

Add LLM support

By default, chunking uses the llm library, with the alias chunking-llm, to interact with LLMs. Set up the desired LLM provider according to its docs, then point the alias chunking-llm at that model. For example, using a Gemini model (as of April 2025):

# Install the Gemini plugin for llm
$ llm install llm-gemini

# Set the Gemini API key
$ llm keys set gemini

# Alias LLM to 'chunking-llm' (you can see other model ids by running `llm models`)
$ llm aliases set chunking-llm gemini-2.5-flash-preview-04-17

# Check the LLM is working correctly
$ llm -m chunking-llm "Explain quantum mechanics in 100 words"
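
The same check can be run from Python. The sketch below shells out to the `llm` command shown above rather than assuming anything about chunking's internals; `build_llm_command` and `ask` are hypothetical helper names introduced for illustration.

```python
import shutil
import subprocess

ALIAS = "chunking-llm"  # the alias configured with `llm aliases set` above


def build_llm_command(prompt, alias=ALIAS):
    """Return the argv equivalent of: llm -m <alias> "<prompt>"."""
    return ["llm", "-m", alias, prompt]


def ask(prompt, alias=ALIAS, timeout=60):
    """Send one prompt through the `llm` CLI and return its stdout."""
    if shutil.which("llm") is None:
        raise RuntimeError("the `llm` CLI is not on PATH")
    result = subprocess.run(
        build_llm_command(prompt, alias),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if llm exits non-zero
        timeout=timeout,
    )
    return result.stdout.strip()
```

Calling `ask("Explain quantum mechanics in 100 words")` should return the model's reply if the alias and API key are set up; a `CalledProcessError` usually indicates a missing key or an unknown alias.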

Cookbook

TBD.

Application in agent-oriented RAG strategies.

Examples

TBD. Code snippets of prominent features:

Drop-in replacement for file parsing.
Showcase of maintaining sectional layout structure. Showcase of the Chunk interface.
Showcase of a notable chunking strategy.
Use as a tool for an agent.

Contributing

Ensure that you have git and git-lfs installed. git will be used for version control and git-lfs will be used for test data.

# Clone the repository
git clone git@github.com:chunking-ai/chunking.git
cd chunking

# Fetch the test data
git submodule update --init --recursive

# Install development dependencies
pip install -e ".[dev]"

# Initialize pre-commit hooks
pre-commit install

License

Apache 2.0.

Release history

0.0.2: May 26, 2025
0.0.1: May 18, 2025 (this version)

Download files

Source Distribution

chunking_ai-0.0.1.tar.gz (85.4 kB)

Uploaded May 18, 2025 Source

Built Distribution

chunking_ai-0.0.1-py3-none-any.whl (98.9 kB)

Uploaded May 18, 2025 Python 3

File details

Details for the file chunking_ai-0.0.1.tar.gz.

File metadata

Download URL: chunking_ai-0.0.1.tar.gz
Upload date: May 18, 2025
Size: 85.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.16

File hashes

Hashes for chunking_ai-0.0.1.tar.gz
Algorithm	Hash digest
SHA256	`c7462a758406bcf3b4d3b71bc51de41043128ede495e66d96403c11527a2f3e7`
MD5	`fd58b59693eaaab4817df4dacb147a26`
BLAKE2b-256	`98764582e8b49001e3da6a109a292f2fd1eda6c23b9a3a342ac25bddda019b9c`

File details

Details for the file chunking_ai-0.0.1-py3-none-any.whl.

File metadata

Download URL: chunking_ai-0.0.1-py3-none-any.whl
Upload date: May 18, 2025
Size: 98.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.16

File hashes

Hashes for chunking_ai-0.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`199dd1167d739ed46729ee29f210bf240a4ed7b36394de45f5cfdcd8ef5d1566`
MD5	`673b6a4a2f261abd2644a8dd8511c707`
BLAKE2b-256	`a7bfe1183ed9696f7ccc98c6ec8ec80bc0b772393556ac35bfd2623e252515cb`
