fine-chunker 🚀
Semantic text chunking using a fine-tuned ModernBERT model with recursive splitting and support for extremely long documents.
This library divides your text into meaningful segments based on semantic boundaries rather than just character counts or newline characters. It uses a token-classification approach where the model predicts the ideal points to "cut" the text.
Both the library and the model are still in early development, so expect some rough edges.
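To make the token-classification idea concrete, here is a minimal sketch. It is not the library's code: it assumes the model yields one "start of chunk" probability per token together with each token's character offset. The function name `split_at_boundaries`, the offsets, and the scores are all hypothetical illustration values.

```python
from typing import List


def split_at_boundaries(
    text: str,
    token_starts: List[int],
    boundary_probs: List[float],
    threshold: float = 0.5,
) -> List[str]:
    """Cut `text` before every token whose boundary score reaches `threshold`.

    `token_starts` holds the character offset at which each token begins;
    `boundary_probs` holds a hypothetical per-token "start of chunk" score.
    """
    cut_points = [
        start
        for start, prob in zip(token_starts, boundary_probs)
        if prob >= threshold and start > 0  # a cut at offset 0 would yield an empty chunk
    ]
    edges = [0] + cut_points + [len(text)]
    pieces = (text[a:b].strip() for a, b in zip(edges, edges[1:]))
    return [piece for piece in pieces if piece]


text = "Cats purr. Dogs bark. Stocks fell today."
# Offsets of the first token of each sentence, paired with made-up scores.
token_starts = [0, 11, 22]
boundary_probs = [0.9, 0.2, 0.8]

chunks = split_at_boundaries(text, token_starts, boundary_probs)
print(chunks)  # ['Cats purr. Dogs bark.', 'Stocks fell today.']
```

Lowering `threshold` in this sketch produces more cuts, which matches the "higher = fewer chunks" behavior described for `threshold_start`.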
Key Features
- Fine-tuned ModernBERT: Uses a fine-tuned ModernBERT encoder model optimized for detecting semantic boundaries. Model details are available at: jboksa/modbert-chunker-base
- Recursive Splitting: Automatically drills down into large chunks with decreasing thresholds to ensure everything fits your target size while remaining semantically coherent.
- Long Text Support: Uses a sliding-window scheme to process long documents (books, reports, etc.) while preserving context across windows.
- Hugging Face Integration: Zero configuration required - models and tokenizers are fetched automatically from the Hub.
- Hardware Agnostic: Runs smoothly on CUDA (GPU) or CPU.
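The sliding-window idea behind long text support can be sketched as follows. This is an illustration under assumptions, not the library's implementation: the 8000-token window size comes from the How it Works section, while the fixed 500-token overlap and the name `window_spans` are hypothetical. The library describes its windows as "semantic", so its real boundaries are likely chosen differently.

```python
from typing import List, Tuple


def window_spans(n_tokens: int, window: int = 8000, overlap: int = 500) -> List[Tuple[int, int]]:
    """Return (start, end) token index pairs that cover `n_tokens` tokens.

    Consecutive windows share `overlap` tokens so that a boundary near the
    edge of one window is seen with context in the next one.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if n_tokens <= 0:
        return []
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end >= n_tokens:
            break
        start = end - overlap  # step forward, keeping `overlap` tokens of context
    return spans


spans = window_spans(20_000)
print(spans)  # [(0, 8000), (7500, 15500), (15000, 20000)]
```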
Installation
Basic Installation
To install the fine-chunker package, you can use pip:
pip install fine-chunker
Or using uv:
uv add fine-chunker
Optional Dependencies
Depending on your use case, you may want to install additional dependencies:
- With PyTorch (GPU support): If you plan to use PyTorch with GPU support, install the package with the torch extras:
pip install fine-chunker[torch]
- With PyTorch (CPU-only): If you plan to use PyTorch but only need CPU support, install the package with the torch-cpu extras:
pip install fine-chunker[torch-cpu]
- With ONNX Runtime: If you plan to use ONNX for inference, install the package with the onnx extras:
pip install fine-chunker[onnx]
Development Installation
If you want to contribute to the development of fine-chunker, you can install the package with development dependencies:
pip install fine-chunker[dev]
This will include tools for building, testing, and debugging the package.
Quick Start
from fine_chunker import Chunker
text = """
1 Introduction Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15]. Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states ht, as a function of the previous hidden state ht−1 and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains. Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network. In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
"""
chunker = Chunker.from_pretrained(device="cpu", use_onnx=True, max_chunk_size=850)
chunks = chunker.chunk(text)
for chunk in chunks:
    print(f"\nChunk {chunk.index} | size={len(chunk.content)}")
    print(chunk.content)
Result:
Chunk 0 | size=431
1 Introduction Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15].
Chunk 1 | size=759
Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time , they generate a sequence of hidden states ht , as a function of the previous hidden state ht −1 and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths , as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.
Chunk 2 | size=731
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks , allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network. In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
Advanced Usage
You can fine-tune the chunking behavior using several parameters:
chunker = Chunker.from_pretrained(
    device="cuda",
    threshold_start=0.5,  # Starting sensitivity (higher = fewer chunks)
    threshold_step=0.1,   # How much to lower the threshold when a chunk is too big
    max_chunk_size=1000,  # Target maximum characters per chunk
    min_chunk_size=350,   # Minimum characters (merges small fragments)
    max_depth=3,          # How many times to try splitting a single big chunk
)
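The interaction between `threshold_start`, `threshold_step`, and `max_depth` can be illustrated with a small sketch. It assumes the threshold drops by one step at each level of re-splitting and that the initial pass counts as depth 0. The library may count depth differently, and `threshold_schedule` is a hypothetical helper, not part of fine-chunker.

```python
from typing import List


def threshold_schedule(start: float = 0.5, step: float = 0.1, max_depth: int = 3) -> List[float]:
    """List the threshold used at each depth, clamped so it never goes below 0."""
    # round() hides floating-point noise such as 0.19999999999999996
    return [round(max(start - depth * step, 0.0), 6) for depth in range(max_depth + 1)]


schedule = threshold_schedule()
print(schedule)  # [0.5, 0.4, 0.3, 0.2]
```

Under these assumptions, an oversized chunk that still cannot be split at the lowest threshold is left as it is once the depth limit is reached.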
How it Works
- Windowing: If the text is extremely long, it's divided into semantic windows of ~8000 tokens.
- Prediction: The ModernBERT model identifies "start of chunk" tokens.
- Recursive Refinement: If a resulting chunk is larger than max_chunk_size, the library re-scans just that fragment with a lower sensitivity threshold.
- Stability Merge: Finally, very small fragments are merged with their neighbors to maintain a consistent chunk size for your RAG or LLM application.
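The refinement and merge steps above can be sketched in plain Python. This is a simplified illustration, not the library's implementation: it works on sentences rather than tokens, takes hypothetical precomputed boundary scores instead of calling the model, and merges a small fragment into the chunk that follows it (or precedes it, for the last one). The names `recursive_chunks` and `merge_small` are assumptions.

```python
from typing import List, Sequence


def recursive_chunks(
    units: Sequence[str],
    scores: Sequence[float],
    threshold: float = 0.5,
    step: float = 0.1,
    max_size: int = 25,
    max_depth: int = 3,
    _depth: int = 0,
) -> List[str]:
    """Split `units` where scores[i] >= threshold, re-splitting oversized groups."""
    if not units:
        return []
    groups = [[0]]
    for i in range(1, len(units)):
        if scores[i] >= threshold:
            groups.append([i])  # unit i starts a new chunk
        else:
            groups[-1].append(i)
    chunks: List[str] = []
    for group in groups:
        piece = " ".join(units[i] for i in group)
        if len(piece) > max_size and len(group) > 1 and _depth < max_depth:
            # Re-scan only this fragment with a lower threshold.
            chunks.extend(recursive_chunks(
                [units[i] for i in group], [scores[i] for i in group],
                threshold - step, step, max_size, max_depth, _depth + 1,
            ))
        else:
            chunks.append(piece)
    return chunks


def merge_small(chunks: Sequence[str], min_size: int) -> List[str]:
    """Merge fragments shorter than `min_size` into a neighbor (may exceed max_size)."""
    merged: List[str] = []
    for chunk in chunks:
        if merged and len(merged[-1]) < min_size:
            merged[-1] = merged[-1] + " " + chunk
        else:
            merged.append(chunk)
    if len(merged) > 1 and len(merged[-1]) < min_size:
        last = merged.pop()  # a trailing small fragment joins the previous chunk
        merged[-1] = merged[-1] + " " + last
    return merged


units = ["Alpha one.", "Alpha two.", "Beta one.", "Beta two.", "Gamma."]
scores = [1.0, 0.1, 0.45, 0.2, 0.9]  # made-up "start of chunk" scores

refined = recursive_chunks(units, scores)
print(refined)  # ['Alpha one. Alpha two.', 'Beta one. Beta two.', 'Gamma.']
final = merge_small(refined, min_size=10)
print(final)    # ['Alpha one. Alpha two.', 'Beta one. Beta two. Gamma.']
```

In this example the first pass at threshold 0.5 produces a 41-character chunk, which exceeds `max_size`. The second pass at 0.4 accepts the 0.45 boundary and splits it in two. The 6-character "Gamma." fragment is then merged back into its neighbor.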
Author
Developed by Jerzy Boksa.
Contact: devjerzy@gmail.com
Model hosted at: jboksa/modbert-chunker-base