agentset chunker

Project description

Agentset Chunker: Document Chunking and Processing

Agentset Chunker is a tool for processing and chunking documents so that their content is easier to manage, analyze, and reuse. It is built for Retrieval-Augmented Generation (RAG) systems, where files are split into chunks before retrieval. It supports a range of document formats, including text files, PDFs, DOCX files, and HTML. Its core job is to break documents into smaller, manageable chunks that can then be used for tasks such as information retrieval, summarization, and data extraction.

Key Features

Document Chunking: Breaks down documents into smaller, meaningful chunks based on configurable strategies.
Multi-Format Support: Handles a variety of document types, including:
- Text: TXT, Markdown (MD), JSON
- Office: PDF, DOCX, DOC
- Web: HTML, HTM
URL Handling: Can process documents directly from URLs.
Configurable Chunking: Offers flexible chunking strategies (e.g., by title, basic) with adjustable chunk size and overlap.
Extensible: Designed for future expansion to include more file types and processing techniques.

How It Works

The Agentset Chunker project operates through the following key components:

Chunker Function (chunker.py):
- The core of the system, responsible for parsing and chunking documents.
- It determines the file type based on its extension or URL.
- It dispatches the document to the appropriate chunking module (e.g., the PDF module or the DOCX module).
- It applies the selected chunking strategy and options.
- Strategies:
  - Strategy.BY_TITLE: Chunks based on titles or headings in the document (if available).
  - Strategy.BASIC: Chunks based on a simple character count with overlap.
- Options:
  - ocr_force: Forces the use of OCR (Optical Character Recognition) when extracting text.
  - max_characters: Maximum number of characters per chunk.
  - overlap: Number of characters overlapping between chunks.
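The character-count strategy with overlap can be illustrated with a small standalone function. This is a minimal sketch of the general technique, not the library's actual implementation: the name `basic_chunks` and its defaults are assumptions chosen for illustration, and only the parameter names `max_characters` and `overlap` come from the options listed above.

```python
def basic_chunks(text: str, max_characters: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of at most max_characters, sharing `overlap` characters.

    Hypothetical helper for illustration; not part of agentset_chunker.
    """
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be >= 0 and smaller than max_characters")

    step = max_characters - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_characters])
        # Stop once a window reaches the end, so no redundant tail chunk is emitted.
        if start + max_characters >= len(text):
            break
    return chunks


print(basic_chunks("abcdefghij", max_characters=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk starts `max_characters - overlap` characters after the previous one, so neighbouring chunks share exactly `overlap` characters of context.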
Chunking Modules (chunking/):
- There are separate modules for each file type:
  - pdf.py: Chunks PDF files.
  - docx.py: Chunks DOCX files.
  - doc.py: Chunks DOC files.
  - txt.py: Chunks TXT files.
  - md.py: Chunks MD files.
  - json.py: Chunks JSON files.
  - html.py: Chunks HTML and HTM files.
- Each module handles the specific parsing and chunking logic for its file type.
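Choosing a module from a file extension or URL could look roughly like the sketch below. It is an illustration under stated assumptions rather than the code in chunker.py: `HANDLERS` and `detect_module` are hypothetical names, and the mapping simply mirrors the module list above.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical mapping from file extension to the chunking module that handles it.
HANDLERS = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".doc": "doc",
    ".txt": "txt",
    ".md": "md",
    ".json": "json",
    ".html": "html",
    ".htm": "html",
}


def detect_module(source: str) -> str:
    """Return the module name for a local path or an http(s) URL."""
    parsed = urlparse(source)
    # For URLs, look only at the path so query strings do not hide the extension.
    path = parsed.path if parsed.scheme in ("http", "https") else source
    suffix = PurePosixPath(path).suffix.lower()
    if suffix not in HANDLERS:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
    return HANDLERS[suffix]


print(detect_module("https://site.com/example.pdf?download=1"))  # pdf
print(detect_module("notes/README.MD"))  # md
```

Lower-casing the suffix makes the lookup case-insensitive, and an unknown extension raises an error instead of silently falling back to a default parser.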
Connector (connector/):
- Handles downloading files from URLs to temporary local storage.
- Manages any necessary HTTP requests.
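Downloading a URL to temporary local storage can be done with the standard library alone. The sketch below is an assumption about the general approach, not the connector's real code; `download_to_temp` is a hypothetical helper, and the project may use a different HTTP client.

```python
import shutil
import tempfile
from pathlib import PurePosixPath
from urllib.parse import urlparse
from urllib.request import urlopen


def download_to_temp(url: str, timeout: float = 30.0) -> str:
    """Stream `url` into a temporary file and return the file's path.

    The original extension is kept so that later extension-based
    type detection still works. The caller is responsible for
    deleting the file when done.
    """
    suffix = PurePosixPath(urlparse(url).path).suffix
    with urlopen(url, timeout=timeout) as response, \
            tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        shutil.copyfileobj(response, tmp)  # copy in blocks, without loading it all in memory
        return tmp.name
```

A call such as `download_to_temp("https://site.com/example.pdf")` would return a path ending in `.pdf` that can be handed to the matching chunking module.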
Document Representation (langchain-core/documents/base.py):
- Uses the langchain_core.documents.Document class to represent each chunk.
- Each Document has page_content (the text of the chunk) and metadata (additional information).

Getting Started

Prerequisites

Before you can use Agentset Chunker, you need to have the following installed:

Python: 3.12 or higher.
pip: Python package manager.
System Dependencies:
- Poppler: For PDF processing.
- LibreOffice: For DOC and DOCX processing.
- Pandoc: For Markdown processing.
- Tesseract OCR: For Optical Character Recognition (OCR).
  - Note: Installation instructions for these dependencies vary by operating system. Please refer to each project's documentation for guidance.
Python dependencies:
- Run `uv pip install -r pyproject.toml`.

Installation

Clone the Repository:

```
git clone https://github.com/agentset-ai/agentset-chunker.git
cd agentset-chunker
```

Create a Virtual Environment (Recommended):

```
python3 -m venv .venv

# Linux/macOS
source .venv/bin/activate

# Windows (run this line instead of the one above)
.venv\Scripts\activate
```

Install Dependencies:
```
pip install .
```
If you use uv, you can run `uv pip install .` instead.

Usage

Example

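```python
import json

from agentset_chunker import chunker, Strategy

# Chunk a remote PDF, splitting on its titles/headings.
chunks = chunker(
    "https://site.com/example.pdf",
    strategy=Strategy.BY_TITLE,
)

# Each chunk is a Document; keep its text and metadata for serialization.
records = [{"page_content": c.page_content, "metadata": c.metadata} for c in chunks]

with open("./chunked.json", "w", encoding="utf-8") as file:
    json.dump(records, file, ensure_ascii=False, indent=2)
```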
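Once the file is written, the chunks can be read back and inspected with the standard library. This is a minimal sketch, assuming chunked.json holds a list of objects that each carry a `page_content` string, as the example above writes; `summarize_chunks` is a hypothetical helper and not part of agentset_chunker.

```python
import json
from pathlib import Path


def summarize_chunks(path: str) -> dict:
    """Return simple size statistics for a JSON file of chunk records."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    lengths = [len(record["page_content"]) for record in records]
    return {
        "count": len(records),
        "longest": max(lengths, default=0),  # 0 when the file holds no chunks
        "total_characters": sum(lengths),
    }
```

Calling `summarize_chunks("./chunked.json")` gives a quick check that chunk sizes are in line with the limit you configured.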

File Types

Supported File Types

HTML: htm, html
Text: pdf, docx, doc, txt, md, json

Planned Support

Image Types
Video Types

To-Do

Add image support.
Add video support.
Expand Chunking Strategies: Implement more advanced chunking strategies (e.g., semantic chunking).
Improve Error Handling: Add more specific error handling for different types of issues.
Add Tests: Write unit tests for the code.

Contributing

Contributions are welcome! If you find any issues or have suggestions for improvements, please feel free to open an issue or submit a pull request.

Project details

This version

1.0.1

Mar 8, 2025

Download files

Source Distribution

agentset_chunker-1.0.1.tar.gz (7.1 kB view details)

Uploaded Mar 8, 2025 Source

Built Distribution

agentset_chunker-1.0.1-py3-none-any.whl (10.9 kB view details)

Uploaded Mar 8, 2025 Python 3

File details

Details for the file agentset_chunker-1.0.1.tar.gz.

File metadata

Download URL: agentset_chunker-1.0.1.tar.gz
Upload date: Mar 8, 2025
Size: 7.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.7

File hashes

Hashes for agentset_chunker-1.0.1.tar.gz
Algorithm	Hash digest
SHA256	`9b559c7e662180fc0a8d338f2849a416d910de14354a2bdf96130607cc22c4e3`
MD5	`4f59eee96bf7662c021b01cdb3c0e4b9`
BLAKE2b-256	`9c7264fea65b4ff9bca0ea342410ad173951dc9b7a31a97e93ddf1ddb24764a7`

File details

Details for the file agentset_chunker-1.0.1-py3-none-any.whl.

File metadata

Download URL: agentset_chunker-1.0.1-py3-none-any.whl
Upload date: Mar 8, 2025
Size: 10.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.7

File hashes

Hashes for agentset_chunker-1.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4ea4cd4fef72d726f572eea8b58ac4d12b5f86004f3b312b4bd4e305446461c8`
MD5	`4bdf1129b756f40ee40ba8c73e1d3b54`
BLAKE2b-256	`c473b6c7d9b2e79d71dedd56b1cb4191d5ea20b1f77bcb1e17f38b8d178022f0`
