
Semantic text chunking using a fine-tuned ModernBERT model, with recursive splitting and long-text support.

fine-chunker 🚀

Semantic text chunking using a fine-tuned ModernBERT model with recursive splitting and support for extremely long documents.

This library divides your text into meaningful segments based on semantic boundaries rather than just character counts or newline characters. It uses a token-classification approach where the model predicts the ideal points to "cut" the text.
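The idea of predicting cut points can be illustrated without the model. The sketch below is a toy, not the library's implementation: `split_at_boundaries` is a hypothetical helper, and the token offsets and probabilities are made up by hand to stand in for what a token classifier might return.

```python
def split_at_boundaries(text, token_starts, boundary_probs, threshold=0.5):
    """Cut `text` before every token whose boundary probability meets the threshold.

    token_starts   -- character offset of each token, in ascending order
    boundary_probs -- one "a chunk starts here" probability per token
    """
    cut_points = [
        start
        for start, prob in zip(token_starts, boundary_probs)
        if prob >= threshold and start > 0  # offset 0 is always a chunk start
    ]
    edges = [0] + cut_points + [len(text)]
    return [text[a:b] for a, b in zip(edges, edges[1:]) if a < b]


text = "Cats purr. Dogs bark. Birds sing."
token_starts = [0, 5, 11, 16, 22, 28]               # one offset per word
boundary_probs = [0.9, 0.1, 0.8, 0.05, 0.3, 0.02]   # invented scores

chunks = split_at_boundaries(text, token_starts, boundary_probs)
finer_chunks = split_at_boundaries(text, token_starts, boundary_probs, threshold=0.25)
```

With these invented scores, the default threshold yields two pieces and the lower threshold yields three, which is the lever the recursive splitting described later relies on.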

The library and the model are still in early development, so expect some rough edges.

Key Features

  • Fine-tuned ModernBERT: Uses a ModernBERT encoder fine-tuned to detect semantic boundaries. Model details are available at: jboksa/modbert-chunker-base
  • Recursive Splitting: Automatically drills down into large chunks with decreasing thresholds to ensure everything fits your target size while remaining semantically coherent.
  • Long Text Support: Implements an intelligent sliding window system to process documents of any length (books, reports, etc.) without losing context.
  • Hugging Face Integration: Zero configuration required - models and tokenizers are fetched automatically from the Hub.
  • Hardware Agnostic: Runs smoothly on CUDA (GPU) or CPU.
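The long-text support listed above can be pictured as overlapping token windows. This is a minimal sketch under stated assumptions: the window size of 8000 tokens is the approximate figure this README gives under How it Works, the overlap of 500 tokens is an invented value, and `sliding_windows` is a hypothetical helper that is not part of fine-chunker.

```python
def sliding_windows(n_tokens, window=8000, overlap=500):
    """Return (start, end) token ranges covering n_tokens, each sharing `overlap` tokens with the next."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    if n_tokens <= 0:
        return []
    step = window - overlap
    ranges = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        ranges.append((start, end))
        if end >= n_tokens:  # the last window reaches the end of the document
            break
        start += step
    return ranges


windows = sliding_windows(20_000)
short_doc = sliding_windows(100)
```

The overlap means a boundary that falls near the edge of one window is seen again, with context on both sides, in the next window. How the library reconciles predictions in the overlapping region is not documented here.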

Installation

Basic Installation

To install the fine-chunker package, you can use pip:

pip install fine-chunker

Or using uv:

uv add fine-chunker

Optional Dependencies

Depending on your use case, you may want to install additional dependencies:

  1. With PyTorch (GPU support): To run PyTorch on a GPU, install the torch extra (the quotes keep shells such as zsh from expanding the brackets):

    pip install "fine-chunker[torch]"

  2. With PyTorch (CPU-only): To run PyTorch on CPU only, install the torch-cpu extra:

    pip install "fine-chunker[torch-cpu]"

  3. With ONNX Runtime: To use ONNX Runtime for inference, install the onnx extra:

    pip install "fine-chunker[onnx]"

Development Installation

To contribute to fine-chunker, install the package with its development dependencies:

pip install "fine-chunker[dev]"

This includes tools for building, testing, and debugging the package.

Quick Start

from fine_chunker import Chunker

text = """
1 Introduction Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15]. Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states ht, as a function of the previous hidden state ht−1 and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains. Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network. In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
    """

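# Load the model and tokenizer from the Hugging Face Hub.
# use_onnx=True expects the onnx extra to be installed.
chunker = Chunker.from_pretrained(device="cpu", use_onnx=True, max_chunk_size=850)
chunks = chunker.chunk(text)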

for chunk in chunks:
    print(f"\nChunk {chunk.index} | size={len(chunk.content)}")
    print(chunk.content)

Result:

Chunk 0 | size=431
1 Introduction Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15].

Chunk 1 | size=759
Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time , they generate a sequence of hidden states ht , as a function of the previous hidden state ht −1 and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths , as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.

Chunk 2 | size=731
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks , allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network. In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.

Advanced Usage

You can tune the chunking behavior with several parameters:

chunker = Chunker.from_pretrained(
    device="cuda",
    threshold_start=0.5,  # Starting sensitivity (higher = fewer chunks)
    threshold_step=0.1,   # How much to lower threshold when a chunk is too big
    max_chunk_size=1000,  # Target maximum characters per chunk
    min_chunk_size=350,   # Minimum characters (merges small fragments)
    max_depth=3           # How many times to try splitting a single big chunk
)
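To see how threshold_start, threshold_step, and max_depth might interact, the sketch below lists the thresholds that would be tried on a chunk that stays too large. It assumes the threshold drops by threshold_step once per recursion level and never goes below zero. That reading follows the parameter comments above but is not confirmed against the library's source, and `threshold_schedule` is a hypothetical helper.

```python
def threshold_schedule(threshold_start=0.5, threshold_step=0.1, max_depth=3):
    """Thresholds for the first pass plus up to max_depth re-splits (assumed linear decay)."""
    return [
        round(max(threshold_start - depth * threshold_step, 0.0), 4)
        for depth in range(max_depth + 1)
    ]


schedule = threshold_schedule()
print(schedule)  # [0.5, 0.4, 0.3, 0.2]
```

Under this assumption, a larger threshold_step reaches fine-grained splits in fewer levels, at the cost of jumping past intermediate boundaries.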

How it Works

  1. Windowing: If the text is extremely long, it's divided into semantic windows of ~8000 tokens.
  2. Prediction: The ModernBERT model identifies "start of chunk" tokens.
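  3. Recursive Refinement: If a resulting chunk is larger than max_chunk_size, the library re-scans just that fragment with a lower threshold (reduced by threshold_step each time, for up to max_depth attempts), so that weaker boundaries also become cut points.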
  4. Stability Merge: Finally, very small fragments are merged with their neighbors to maintain a consistent chunk size for your RAG or LLM application.
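Steps 3 and 4 above can be sketched with a stand-in for the model. Everything here is illustrative: `toy_boundary_scores` replaces ModernBERT with a regex that gives paragraph breaks a high score and sentence breaks a low one, the scores are invented, and none of the function names belong to fine-chunker.

```python
import re


def toy_boundary_scores(text):
    """Stand-in for the model: return (char_offset, probability) pairs."""
    scores = []
    for match in re.finditer(r"(\n\n|(?<=[.!?]) )", text):
        prob = 0.9 if match.group() == "\n\n" else 0.35  # invented scores
        scores.append((match.end(), prob))
    return scores


def split_once(text, threshold):
    """Cut at every boundary whose score meets the threshold."""
    cuts = [
        offset
        for offset, prob in toy_boundary_scores(text)
        if prob >= threshold and 0 < offset < len(text)
    ]
    edges = [0] + cuts + [len(text)]
    return [text[a:b] for a, b in zip(edges, edges[1:])]


def recursive_split(text, max_chunk_size, threshold, step, depth_left):
    """Re-split oversized pieces with a lower threshold until depth runs out."""
    result = []
    for piece in split_once(text, threshold):
        if len(piece) > max_chunk_size and depth_left > 0:
            result.extend(
                recursive_split(piece, max_chunk_size, threshold - step, step, depth_left - 1)
            )
        else:
            result.append(piece)
    return result


def merge_small(pieces, min_chunk_size):
    """Fold fragments shorter than min_chunk_size into a neighbor."""
    merged = []
    for piece in pieces:
        if merged and len(merged[-1]) < min_chunk_size:
            merged[-1] += piece  # previous fragment was too small: absorb this one
        else:
            merged.append(piece)
    if len(merged) > 1 and len(merged[-1]) < min_chunk_size:
        tail = merged.pop()  # a small trailing fragment joins its predecessor
        merged[-1] += tail
    return merged


doc = "Alpha one. Alpha two. Alpha three.\n\nBeta one.\n\nGamma one. Gamma two."
pieces = recursive_split(doc, max_chunk_size=25, threshold=0.5, step=0.1, depth_left=3)
chunks = merge_small(pieces, min_chunk_size=12)
```

In this toy version the first paragraph is too long at threshold 0.5, survives unchanged at 0.4, and is finally split at sentence breaks once the threshold drops below 0.35. Note that the merge step here can push a chunk past max_chunk_size. How the library balances the two limits is not specified in this README.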

Author

Developed by Jerzy Boksa.

Contact: devjerzy@gmail.com

Model hosted at: jboksa/modbert-chunker-base
