A fast and light-weight library for ingesting and chunking files

Installation

pip install ...

Install the following system dependencies for the chunking library:

  • pandoc: for parsing markup languages (e.g. epub, html, rtf, rst, docx...)
  • libreoffice: for file conversion (e.g. doc, ppt, xls to docx, pptx, xlsx...)
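Before running the library, it can help to confirm that both system dependencies are on the PATH. The following is a minimal sketch, not part of the library: the function name `find_missing_tools` and the `REQUIRED_TOOLS` table are hypothetical, and the assumption that LibreOffice may be exposed as either `libreoffice` or `soffice` depends on your platform.

```python
import shutil

# Candidate executable names per dependency (assumption: LibreOffice is
# installed as either `libreoffice` or `soffice`, depending on the platform).
REQUIRED_TOOLS = {
    "pandoc": ("pandoc",),
    "libreoffice": ("libreoffice", "soffice"),
}


def find_missing_tools(required=REQUIRED_TOOLS):
    """Return the names of dependencies with no executable found on PATH."""
    missing = []
    for name, candidates in required.items():
        # shutil.which returns None when the executable cannot be found.
        if not any(shutil.which(candidate) for candidate in candidates):
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing system dependencies:", ", ".join(missing))
    else:
        print("All system dependencies found.")
```

An empty result means every listed tool resolved to an executable; it does not verify the installed versions.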

Features

Library responsibility:

Graph: Files ----> Parse ----> Chunk ----> Index ----> Search ----> Represent retrieval context ----> LLM

We are responsible for:

  • Parse, Chunk
  • Represent retrieval context

How is chunking different? We chunk the document structure rather than the raw text, so that chunk boundaries follow the document's own sections and the resulting chunks are more sensible.

  • Maintain sectional layout structure during parsing and chunking (*).
  • Supported formats (refer to ... for suitable parsers for each format):
    • Text: html, md, txt, epub, latex, org, rtf, rst
    • Office documents: pdf, docx, pptx, xlsx
    • Images: jpg, png
    • Audio: wav, mp3
    • Video: mp4
    • Code: ipynb, (coming soon: py, js)
    • Data interchange: csv, json, yaml, toml
  • Content linking across files.
  • Support LLM integration for correct content parsing.
  • Task description for agent-oriented RAG strategies.
  • Fast.
    • At least 100MB/s parsing
    • At least 100MB/s splitting
  • Extensible.
    • Easy to add new strategy
    • Easy to change the configuration of the current strategy
  • Traceable: trace from chunk to source.
  • Developer-friendly
    • Evaluation
    • Benchmark
    • Config selector
    • Docker
    • CLI to chunk
  • Complete.
    • All common file types
    • All chunking strategies

(*) Because structures are represented in complex and varied ways across file types, errors can occur. chunking treats structure preservation as best effort. Refer to xxx for difficult cases. File an issue if you encounter a problem.
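To illustrate the idea of chunking by structure while staying traceable to the source, here is a minimal sketch for Markdown input. It is not the library's implementation or API: the `Chunk` class, `chunk_by_heading`, and the heading regex are hypothetical, and the sketch ignores cases such as `#` lines inside code blocks.

```python
import re
from dataclasses import dataclass

# ATX-style Markdown heading: 1 to 6 '#' characters, whitespace, then a title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:
    """Hypothetical chunk: the text under one heading plus its origin."""
    path: tuple       # heading trail, outermost first
    text: str
    start_line: int   # 1-based line numbers in the source, for tracing
    end_line: int


def _emit(chunks, trail, body):
    """Append a chunk for the collected body lines, skipping empty bodies."""
    non_blank = [(n, line) for n, line in body if line.strip()]
    if not non_blank:
        return
    first, last = non_blank[0][0], non_blank[-1][0]
    text = "\n".join(line for n, line in body if first <= n <= last)
    chunks.append(Chunk(tuple(title for _, title in trail), text, first, last))


def chunk_by_heading(source):
    """Split Markdown into one chunk per section, keeping the heading trail."""
    chunks, trail, body = [], [], []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = HEADING.match(line)
        if match is None:
            body.append((lineno, line))
            continue
        _emit(chunks, trail, body)
        level, title = len(match.group(1)), match.group(2)
        # A new heading closes every open heading at the same or deeper level.
        while trail and trail[-1][0] >= level:
            trail.pop()
        trail.append((level, title))
        body = []
    _emit(chunks, trail, body)
    return chunks


if __name__ == "__main__":
    doc = "# Guide\nIntro.\n\n## Install\npip install x\n\n## Use\nRun it.\n"
    for chunk in chunk_by_heading(doc):
        print(chunk.path, chunk.start_line, chunk.end_line, repr(chunk.text))
```

Each chunk carries its heading trail and source line range, so a retrieved chunk can be placed back in its section and traced to the original file.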

Usage

Add LLM support

By default, chunking uses llm (repo) with the alias chunking-llm to interact with an LLM. Set up the desired LLM provider according to its documentation, then point the alias chunking-llm at that model. For example, using a Gemini model (as of April 2025):

# Install the llm-gemini plugin
$ llm install llm-gemini

# Set the Gemini API key
$ llm keys set gemini

# Alias LLM to 'chunking-llm' (you can see other model ids by running `llm models`)
$ llm aliases set chunking-llm gemini-2.5-flash-preview-04-17

# Check the LLM is working correctly
$ llm -m chunking-llm "Explain quantum mechanics in 100 words"
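The alias can also be checked from Python without sending a prompt. This is a sketch under assumptions: it relies on `llm aliases list --json` printing a JSON object that maps each alias to a model id, and the helper names `read_llm_aliases` and `alias_target` are hypothetical.

```python
import json
import subprocess


def read_llm_aliases():
    """Run `llm aliases list --json` and return its raw output.

    Assumes the `llm` CLI is installed and that the command prints a JSON
    object of the form {"alias": "model-id", ...}.
    """
    result = subprocess.run(
        ["llm", "aliases", "list", "--json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def alias_target(aliases_json, alias="chunking-llm"):
    """Return the model id the alias points to, or None if it is not set."""
    aliases = json.loads(aliases_json)
    return aliases.get(alias)


if __name__ == "__main__":
    target = alias_target(read_llm_aliases())
    if target is None:
        print("Alias 'chunking-llm' is not set; run `llm aliases set` first.")
    else:
        print("chunking-llm ->", target)
```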

Cookbook

TBD.

  • Application in agent-oriented RAG strategies.

Examples

TBD. Code snippets of prominent features:

  • Drop-in replacement for file parsing.
  • Showcase of maintaining sectional layout structure. Showcase of the Chunk interface.
  • Showcase of a notable chunking strategy.
  • Use as a tool for an agent.

Contributing

Ensure that you have git and git-lfs installed. git will be used for version control and git-lfs will be used for test data.

# Clone the repository
git clone git@github.com:chunking-ai/chunking.git
cd chunking

# Fetch the test data
git submodule update --init --recursive

# Install development dependencies
pip install -e ".[dev]"

# Initialize pre-commit hooks
pre-commit install

License

Apache 2.0.
