A fast and lightweight library for ingesting and chunking files

Chunking

A fast and lightweight Python library for intelligent document parsing, chunking, and analysis.

📊 Library Responsibility

flowchart LR
    Files([Documents]) -->|Input| Parse
    Parse -->|Structured Data| Chunk
    Chunk -->|Semantic Units| Index
    Index -->|Indexed Data| Search
    Search -->|Retrieved Chunks| Context[Represent Context]
    Context -->|Context + Query| LLM([Language Model])

    classDef primary fill:#4CAF50,stroke:#388E3C,color:black,stroke-width:2px,font-weight:bold
    classDef secondary fill:#64B5F6,stroke:#1976D2,color:black,stroke-width:1px
    classDef document fill:#E1BEE7,stroke:#9C27B0,color:black,stroke-width:1px,stroke-dasharray: 5 5

    class Parse,Chunk,Context primary
    class Index,Search secondary
    class Files,LLM document

    subgraph ChunkingLibrary["Chunking Library Focus"]
        Parse
        Chunk
        Context
    end

The Chunking library focuses on the critical first steps in the document processing pipeline: Parsing documents into a unified format, Chunking content intelligently, and Representing retrieval context for LLMs.

🌟 Features

Universal Document Processing: Parse PDFs, Word documents, PowerPoint, Excel, Markdown, HTML, images, audio, video, and more with a unified API
Intelligent Structure Preservation: Keeps document hierarchy, tables, lists, and formatting intact
Modular Architecture: Choose parsers and processing steps based on your needs
Multimodal Support: Extract and process text, images, tables, and other elements
Smart Chunking Strategies: Split content based on headers, semantic meaning, or custom rules
LLM-Ready: Output chunks ready for embedding or use with language models
Extensible: Easy to add custom parsers and processors

🚀 Quick Start

from chunking import parse

# Parse a file or directory into chunks
chunks = parse("path/to/document.pdf")

# Print the text
print(chunks.render())

# Or get a hierarchical structure
for depth, chunk in chunks.walk():
    print("  " * depth + f"{chunk.ctype}: {chunk.content[:30]}...")
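
To make the (depth, chunk) pairs from walk() concrete, here is a minimal, self-contained sketch of a pre-order traversal over a toy tree. The Node class and its fields are hypothetical stand-ins for illustration; they are not the library's actual Chunk implementation, which may differ.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Node:
    """Hypothetical stand-in for a chunk: a type label, text, and children."""
    ctype: str
    content: str
    children: list["Node"] = field(default_factory=list)

    def walk(self, depth: int = 0) -> Iterator[tuple[int, "Node"]]:
        # Yield this node first, then each subtree one level deeper (pre-order).
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


doc = Node("root", "report.pdf", [
    Node("header", "Introduction", [Node("para", "Chunking splits documents.")]),
    Node("header", "Methods"),
])

# Indent each entry by its depth, mirroring the Quick Start loop.
outline = ["  " * depth + f"{node.ctype}: {node.content[:30]}" for depth, node in doc.walk()]
print("\n".join(outline))
```

Yielding the depth alongside each node lets callers rebuild indentation or heading levels without tracking parents themselves.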

📦 Installation

pip install chunking-ai

For specialized parsers, install with extras:

pip install chunking-ai[pdf,ocr,audio]  # Install with PDF, OCR, and audio support

External Dependencies

Some parsers require external tools to be installed on your system:

pandoc: Required for parsing markup languages (EPUB, HTML, RTF, RST, DOCX, etc.)

# Ubuntu/Debian
sudo apt-get install pandoc

# macOS
brew install pandoc

libreoffice: Required for file conversion (DOC → DOCX, PPT → PPTX, XLS → XLSX, etc.)

# Ubuntu/Debian
sudo apt-get install libreoffice

# macOS
brew install --cask libreoffice
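
Before running parsers that depend on these tools, it can help to check that they are reachable. The sketch below uses only the Python standard library. The executable names are assumptions: LibreOffice is commonly exposed as soffice or libreoffice, and on macOS the cask install may not put either on PATH.

```python
import shutil
from typing import Optional

# Executable names are assumptions, not taken from the library itself.
TOOLS = {
    "pandoc": ["pandoc"],
    "libreoffice": ["soffice", "libreoffice"],
}


def find_tools(tools: dict[str, list[str]] = TOOLS) -> dict[str, Optional[str]]:
    """Return the resolved path of each tool, or None if it is not on PATH."""
    found = {}
    for name, candidates in tools.items():
        paths = (shutil.which(exe) for exe in candidates)
        # Take the first candidate executable that resolves to a real path.
        found[name] = next((p for p in paths if p), None)
    return found


for name, path in find_tools().items():
    print(f"{name}: {path or 'not found on PATH'}")
```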

🔍 Supported File Types

Documents: PDF, DOCX, PPTX, XLSX, EPUB
Markup: HTML, Markdown
Data: CSV, JSON, YAML, TOML
Media: Images (JPEG, PNG), Audio (MP3, WAV), Video (MP4)
Code: Various programming languages
Directories: Process entire folders of mixed documents
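
When preparing a mixed folder, a quick way to see which category each file falls into is a lookup by extension. This is an illustrative sketch only: the table is built from the list above, the specific extensions (for example .md and .jpg) are assumptions, and it does not reflect how the library itself selects a parser.

```python
from pathlib import Path

# Extension-to-category table; the exact extensions are assumed for illustration.
CATEGORIES = {
    "document": {".pdf", ".docx", ".pptx", ".xlsx", ".epub"},
    "markup": {".html", ".md"},
    "data": {".csv", ".json", ".yaml", ".toml"},
    "media": {".jpeg", ".jpg", ".png", ".mp3", ".wav", ".mp4"},
}


def categorize(path: str) -> str:
    """Return the category for a path, "directory" for folders, else "unknown"."""
    p = Path(path)
    if p.is_dir():
        return "directory"
    suffix = p.suffix.lower()  # compare case-insensitively
    for category, extensions in CATEGORIES.items():
        if suffix in extensions:
            return category
    return "unknown"


for name in ["report.PDF", "notes.md", "config.toml", "archive.zip"]:
    print(f"{name}: {categorize(name)}")
```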

🧩 Components

Parsers

Parsers read different file formats and convert them into a unified chunk structure:

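from chunking.controller import get_controller
from chunking.parser import FastPDF, Markdown, RapidOCRImageText

# Wrap each input file in a root chunk (file names here are placeholders)
ctrl = get_controller()

# Parse a PDF with a faster parser
pdf_chunks = FastPDF.run(ctrl.as_root_chunk("document.pdf"))

# Parse Markdown into a structured tree
md_chunks = Markdown.run(ctrl.as_root_chunk("notes.md"))

# Extract text from images using OCR
img_chunks = RapidOCRImageText.run(ctrl.as_root_chunk("scan.png"))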

Chunking Strategies

Split content into meaningful chunks with different strategies:

from chunking.split import ChunkByCharacters, FlattenToMarkdown, LumberChunker

# `doc` is an already-parsed document (a Chunk or ChunkGroup; see Processing Pipeline)
# Split by character count or word count
chunks = ChunkByCharacters.run(doc, chunk_size=1000)

# Flatten hierarchical content while preserving structure
chunks = FlattenToMarkdown.run(doc, max_size=500)

# Use an LLM for semantic chunking (requires LLM setup)
chunks = LumberChunker.run(doc, chunk_size=800)
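
For intuition about what a chunk_size limit means, the following standalone sketch splits a plain string into fixed-size pieces. It is not the library's ChunkByCharacters implementation, which operates on chunk objects and may treat boundaries differently; the function name is hypothetical.

```python
def chunk_by_characters(text: str, chunk_size: int = 1000) -> list[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Step through the text in strides of chunk_size; the last piece may be shorter.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


pieces = chunk_by_characters("abcdefghij", chunk_size=4)
print(pieces)
```

Fixed-size splitting is fast and predictable but ignores sentence and section boundaries, which is the gap that the structure-aware and LLM-based strategies aim to close.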

Processing Pipeline

Build custom processing pipelines:

from chunking.controller import get_controller
from chunking.parser.pdf import FastPDF
from chunking.split import MarkdownSplitByHeading, Propositionizer

# Get a controller and parse a document
ctrl = get_controller()
chunk = ctrl.as_root_chunk("document.pdf")
FastPDF.run(chunk)

# Process the parsed document
sections = MarkdownSplitByHeading.run(chunk, min_chunk_size=200)
propositions = Propositionizer.run(sections)
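
To illustrate the idea behind heading-based splitting with a minimum size, here is a standalone sketch that works on a raw Markdown string. It is an assumption-laden approximation, not the library's MarkdownSplitByHeading: the function name is hypothetical, the policy of merging short sections into the following one is a guess, and it does not skip "#" lines inside fenced code blocks.

```python
import re

# ATX headings: one to six "#" characters followed by a space at line start.
HEADING = re.compile(r"^#{1,6} ", re.MULTILINE)


def split_by_heading(markdown: str, min_chunk_size: int = 0) -> list[str]:
    """Split at headings, merging sections shorter than min_chunk_size forward."""
    starts = [m.start() for m in HEADING.finditer(markdown)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text before the first heading
    bounds = starts + [len(markdown)]
    sections = [markdown[a:b] for a, b in zip(bounds, bounds[1:]) if markdown[a:b].strip()]

    merged: list[str] = []
    buffer = ""
    for section in sections:
        buffer += section
        if len(buffer) >= min_chunk_size:
            merged.append(buffer)
            buffer = ""
    if buffer:
        # A short trailing section joins the previous chunk, if there is one.
        if merged:
            merged[-1] += buffer
        else:
            merged.append(buffer)
    return merged


text = "# A\nshort\n# B\n" + "x" * 50 + "\n# C\ntail\n"
print(len(split_by_heading(text)), len(split_by_heading(text, min_chunk_size=20)))
```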

🔧 Advanced Usage

Custom Parsers

Create your own parsers for specialized formats:

from chunking.base import BaseOperation, Chunk, ChunkGroup, CType

class MyCustomParser(BaseOperation):
    @classmethod
    def run(cls, chunks: Chunk | ChunkGroup, **kwargs) -> ChunkGroup:
        # Custom parsing logic here
        return processed_chunks

LLM Integration

Add LLM support for semantic splitting and processing:

Add LLM support

By default, chunking uses the llm tool with the alias chunking-llm to interact with LLMs. Set up the desired LLM provider according to its docs, then point the alias chunking-llm at that model. For example, using a Gemini model (as of April 2025):

# Install the Gemini plugin for llm
$ llm install llm-gemini
# Set the Gemini API key
$ llm keys set gemini
# Alias LLM to 'chunking-llm' (you can see other model ids by running `llm models`)
$ llm aliases set chunking-llm gemini-2.5-flash-preview-04-17
# Check the LLM is working correctly
$ llm -m chunking-llm "Explain quantum mechanics in 100 words"

Once LLM is set up, you can use LLM-based chunkers:

from chunking.split import LumberChunker, AgenticChunker, Propositionizer

chunks = LumberChunker.run(doc)  # Semantically split content
chunks = Propositionizer.run(doc)  # Convert to atomic propositions
chunks = AgenticChunker.run(doc)  # Group chunks by topic
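
Because the LLM-based chunkers depend on the chunking-llm alias, a preflight check can give a clearer error than a failed model call. The sketch below shells out to the llm CLI; the helper name is hypothetical, and the parsing assumes each line of `llm aliases list` begins with the alias name followed by a colon.

```python
import shutil
import subprocess


def alias_configured(alias: str = "chunking-llm") -> bool:
    """Return True if the llm CLI is on PATH and lists the given alias."""
    if shutil.which("llm") is None:
        return False
    result = subprocess.run(
        ["llm", "aliases", "list"], capture_output=True, text=True, check=False
    )
    # Assumed output format: "<alias> : <model id>" on each line.
    names = {line.split(":")[0].strip() for line in result.stdout.splitlines()}
    return result.returncode == 0 and alias in names


if not alias_configured():
    print("Alias 'chunking-llm' not found; run `llm aliases set chunking-llm <model>` first.")
```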

📊 Example Applications

Create knowledge bases from document collections
Build RAG (Retrieval-Augmented Generation) systems
Extract structured data from unstructured documents
Generate document summaries with preserved structure
Create question-answering systems over documents

🤝 Contributing

Contributions are welcome! Ensure that you have git and git-lfs installed. git will be used for version control and git-lfs will be used for test data.

# Clone the repository
git clone git@github.com:chunking-ai/chunking.git
cd chunking

# Fetch the test data
git submodule update --init --recursive

# Install development dependencies
pip install -e ".[dev]"

# Initialize pre-commit hooks
pre-commit install

📄 License

Apache 2.0 License. See the LICENSE file for details.

Release history

0.0.2: May 26, 2025

0.0.1: May 18, 2025

