Agentset Chunker: Document Chunking and Processing
Agentset Chunker parses documents and splits them into smaller chunks so that their content is easier to manage, analyze, and reuse. It is built for Retrieval-Augmented Generation (RAG) pipelines and supports a wide range of document formats, including text files, PDFs, DOCX files, HTML, and more. The resulting chunks can be used for tasks such as information retrieval, summarization, and data extraction.
Key Features
- Document Chunking: Breaks down documents into smaller, meaningful chunks based on configurable strategies.
- Multi-Format Support: Handles a variety of document types, including:
  - Text: TXT, Markdown (MD), JSON
  - Office: PDF, DOCX, DOC
  - Web: HTML, HTM
- URL Handling: Can process documents directly from URLs.
- Configurable Chunking: Offers flexible chunking strategies (e.g., by title, basic) with adjustable chunk size and overlap.
- Extensible: Designed for future expansion to include more file types and processing techniques.
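The chunk size and overlap settings listed above can be illustrated with a minimal sketch of character-based chunking. This is not the project's implementation: `basic_chunks` is a hypothetical helper, and its parameter names simply mirror the `max_characters` and `overlap` options this README describes.

```python
def basic_chunks(text: str, max_characters: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of at most max_characters characters.

    Each window after the first starts `overlap` characters before the
    end of the previous one, so neighbouring chunks share some context.
    """
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be non-negative and smaller than max_characters")

    step = max_characters - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    for chunk in basic_chunks("abcdefghij", max_characters=4, overlap=1):
        print(chunk)
```

A larger overlap preserves more context across chunk boundaries at the cost of producing more chunks with duplicated text.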
How It Works
The Agentset Chunker project operates through the following key components:
- Chunker Function (`chunker.py`):
  - The core of the system, responsible for parsing and chunking documents.
  - It determines the file type based on its extension or URL.
  - It dispatches the document to the appropriate chunking module (e.g., the PDF or DOCX module).
  - It applies the selected chunking strategy and options.
  - Strategies:
    - `Strategy.BY_TITLE`: Chunks based on titles or headings in the document (if available).
    - `Strategy.BASIC`: Chunks based on a simple character count with overlap.
  - Options:
    - `ocr_force`: Option for using OCR (Optical Character Recognition).
    - `max_characters`: Maximum number of characters per chunk.
    - `overlap`: Number of characters overlapping between chunks.
- Chunking Modules (`chunking/`):
  - There are separate modules for each file type:
    - `pdf.py`: Chunks PDF files.
    - `docx.py`: Chunks DOCX files.
    - `doc.py`: Chunks DOC files.
    - `txt.py`: Chunks TXT files.
    - `md.py`: Chunks MD files.
    - `json.py`: Chunks JSON files.
    - `html.py`: Chunks HTML and HTM files.
  - Each module handles the specific parsing and chunking logic for its file type.
- Connector (`connector/`):
  - Handles downloading files from URLs to temporary local storage.
  - Manages any necessary HTTP requests.
- Document Representation (`langchain-core/documents/base.py`):
  - Uses the `langchain_core.documents.Document` class to represent each chunk.
  - Each `Document` has `page_content` (the text of the chunk) and `metadata` (additional information).
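The file-type detection and dispatch step described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the code in `chunker.py`: the `CHUNKER_MODULES` mapping and the `is_url`, `detect_extension`, and `pick_module` helpers are hypothetical names, and the sketch returns a module name instead of calling a real chunking module.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical mapping from file extension to the chunking module
# that would handle it, mirroring the modules listed above.
CHUNKER_MODULES = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".doc": "doc",
    ".txt": "txt",
    ".md": "md",
    ".json": "json",
    ".html": "html",
    ".htm": "html",
}


def is_url(source: str) -> bool:
    """Treat only http(s) sources as URLs; everything else is a local path."""
    return urlparse(source).scheme in ("http", "https")


def detect_extension(source: str) -> str:
    """Return the lower-cased extension, ignoring any URL query string."""
    path = urlparse(source).path if is_url(source) else source
    return PurePosixPath(path).suffix.lower()


def pick_module(source: str) -> str:
    """Look up the module name for a source, or raise for unknown types."""
    extension = detect_extension(source)
    try:
        return CHUNKER_MODULES[extension]
    except KeyError:
        raise ValueError(f"Unsupported file type: {extension or '(none)'}") from None


if __name__ == "__main__":
    print(pick_module("https://site.com/example.pdf"))
    print(pick_module("notes/README.MD"))
```

In a fuller version, a URL source would first be downloaded to temporary storage (the connector's job) before the selected module parses it.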
Getting Started
Prerequisites
Before you can use Agentset Chunker, you need to have the following installed:
- Python: 3.12 or higher.
- pip: Python package manager.
- System Dependencies:
  - Poppler: For PDF processing.
  - LibreOffice: For DOC and DOCX processing.
  - Pandoc: For Markdown processing.
  - Tesseract OCR: For Optical Character Recognition (OCR).
  - Note: Installation instructions for these dependencies vary by operating system. Refer to each project's documentation for guidance.
- Python dependencies: run `uv pip install -r pyproject.toml`.
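Before installing, it can help to confirm that the system dependencies are reachable on your `PATH`. The sketch below is a convenience check and not part of Agentset Chunker. The executable names (`pdftoppm` for Poppler, `soffice` for LibreOffice, `pandoc`, `tesseract`) are assumptions about typical installations and may differ on your system.

```python
import shutil

# Assumed executable names for each system dependency; adjust as needed.
REQUIRED_TOOLS = {
    "Poppler": "pdftoppm",
    "LibreOffice": "soffice",
    "Pandoc": "pandoc",
    "Tesseract OCR": "tesseract",
}


def find_missing_tools(tools: dict[str, str] = REQUIRED_TOOLS) -> list[str]:
    """Return the names of tools whose executable is not found on PATH."""
    return [name for name, exe in tools.items() if shutil.which(exe) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing system dependencies: " + ", ".join(missing))
    else:
        print("All system dependencies were found on PATH.")
```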
Installation
- Clone the Repository:
    git clone https://github.com/agentset-ai/agentset-chunker.git
    cd agentset-chunker
- Create a Virtual Environment (Recommended):
    python3 -m venv .venv
    source .venv/bin/activate   # On Linux/macOS
    .venv\Scripts\activate      # On Windows
- Install Dependencies:
    pip install .
  If you use `uv`, you can run `uv pip install .` instead.
Usage
Example
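    import json

    from agentset_chunker import chunker, Strategy

    # Chunk a remote PDF, splitting on its titles and headings.
    res = chunker(
        "https://site.com/example.pdf",
        strategy=Strategy.BY_TITLE,
    )

    # Write each chunk's attributes (page_content, metadata) to a JSON file.
    with open("./chunked.json", "w") as file:
        json.dump([c.__dict__ for c in res], file)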
File Types
Supported File Types
- HTML: `htm`, `html`
- Text: `pdf`, `docx`, `doc`, `txt`, `md`, `json`
Planned Support
- Image Types
- Video Types
To-Do
- Add image support
- Add video support
- Expand Chunking Strategies: Implement more advanced chunking strategies (e.g., semantic chunking).
- Improve Error Handling: Add more specific error handling for different types of issues.
- Add tests: Write unit tests for the code.
Contributing
Contributions are welcome! If you find any issues or have suggestions for improvements, please feel free to open an issue or submit a pull request.
File details
Details for the file agentset_chunker-1.0.1.tar.gz.
File metadata
- Download URL: agentset_chunker-1.0.1.tar.gz
- Upload date:
- Size: 7.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9b559c7e662180fc0a8d338f2849a416d910de14354a2bdf96130607cc22c4e3 |
| MD5 | 4f59eee96bf7662c021b01cdb3c0e4b9 |
| BLAKE2b-256 | 9c7264fea65b4ff9bca0ea342410ad173951dc9b7a31a97e93ddf1ddb24764a7 |
File details
Details for the file agentset_chunker-1.0.1-py3-none-any.whl.
File metadata
- Download URL: agentset_chunker-1.0.1-py3-none-any.whl
- Upload date:
- Size: 10.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4ea4cd4fef72d726f572eea8b58ac4d12b5f86004f3b312b4bd4e305446461c8 |
| MD5 | 4bdf1129b756f40ee40ba8c73e1d3b54 |
| BLAKE2b-256 | c473b6c7d9b2e79d71dedd56b1cb4191d5ea20b1f77bcb1e17f38b8d178022f0 |