

Project description

Meta-Chunking: Learning Text Segmentation and Semantic Completion via Logical Perception

arXiv Paper · Hugging Face Daily Papers · Apache 2.0 License

Meta-Chunking leverages the capabilities of LLMs to flexibly partition documents into logically coherent, independent chunks. Our approach is grounded in a core principle: allowing variability in chunk size to more effectively capture and maintain the logical integrity of content. This dynamic adjustment of granularity ensures that each segmented chunk contains a complete and independent expression of ideas, thereby avoiding breaks in the logical chain during the segmentation process. This not only enhances the relevance of document retrieval but also improves content clarity.

Note: Perplexity is a metric that measures a language model's ability to predict text; it reflects the model's uncertainty when generating the next token or sentence given a specific context. Our initial intuition is to split the text at points where the model is certain and keep it intact where the model is uncertain, which benefits subsequent retrieval and generation. In effect, perplexity-based chunking uses the hallucinations of language models to perceive text boundaries (relative to the boundaries of the models themselves), ensuring that chunks are not split at points where the model hallucinates and thereby avoiding the introduction of further hallucinations during retrieval and question answering with LLMs.
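To make the intuition concrete, here is a minimal, self-contained sketch of perplexity-based boundary selection. All names are hypothetical illustrations, not the library's actual API, and it assumes per-token log-probabilities for each sentence have already been obtained from a causal LM:

```python
import math

def split_by_perplexity(sentences, logprobs, threshold=2.0):
    """Group sentences into chunks, allowing a boundary only after
    sentences where the model's perplexity is below `threshold`
    (i.e. the model is certain, so splitting there is unlikely to
    break a logical chain).

    `logprobs[i]` holds the token log-probabilities the LM assigned
    to sentence i given its preceding context.
    """
    chunks, current = [], []
    for sent, lps in zip(sentences, logprobs):
        current.append(sent)
        ppl = math.exp(-sum(lps) / len(lps))  # per-sentence perplexity
        if ppl < threshold:                   # model is certain: safe split point
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With toy numbers, a high-perplexity sentence stays attached to its successor, while low-perplexity sentences become clean boundaries.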

Todo

We intend to develop this project into a plug-and-play chunking library that incorporates various cutting-edge chunking strategies for LLMs. While you can use Llama_index for traditional chunking methods, it may be difficult for this library to keep up with the latest chunking technologies. Therefore, we will regularly reconstruct methods from excellent chunking papers into interfaces and add them to the library, making it easier for your system to integrate advanced chunking strategies.

Currently, all methods are maintained in the tools folder. The eval.ipynb file demonstrates usage examples of the different chunking method interfaces, while each of the other files implements a specific LLM chunking method.

  • Release PPL Chunking and Margin Sampling Chunking
  • 1. Refactor methods in Meta-Chunking into several interface formats for easy invocation.
    • PPL Chunking: introduces a KV caching mechanism so that PPL Chunking scales to both short and long documents (🚀 A Swift and Accurate Text Chunking Technique 🌟).
    • Margin Sampling Chunking: makes a binary decision on whether two consecutive sentences should be split, based on the probability margin obtained through margin sampling.
    • Dynamic combination: to accommodate diverse chunking requirements, a dynamic combination strategy assists chunking, striking a balance between fine-grained and coarse-grained text chunking.
  • 2. Integrating LumberChunker: refactoring it into an interface for convenient invocation, and combining it with our margin sampling method so that, unlike the original project, small local models can also be used.
  • 3. Integrating Dense X Retrieval: Refactoring it into an interface for convenient invocation.
  • ......
  • Our follow-up work
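The margin sampling decision described above can be sketched as follows. The scoring step (prompting an LLM for the probabilities of answering "yes" vs. "no" to "should these two sentences be split?") is abstracted away, and all names are hypothetical rather than the library's actual interface:

```python
def margin_sampling_decision(p_yes, p_no, threshold=0.0):
    """Binary segmentation decision for a pair of consecutive sentences.

    `p_yes` / `p_no` are the probabilities an LM assigns to answering
    "yes" / "no" to splitting between the two sentences; the margin
    between them drives the decision.
    """
    margin = p_yes - p_no
    return margin > threshold  # True → insert a chunk boundary

def margin_chunk(sentences, pair_probs, threshold=0.0):
    """Chunk sentences given per-pair (p_yes, p_no) scores for each
    adjacent sentence pair."""
    chunks, current = [], [sentences[0]]
    for sent, (p_yes, p_no) in zip(sentences[1:], pair_probs):
        if margin_sampling_decision(p_yes, p_no, threshold):
            chunks.append(" ".join(current))
            current = [sent]
        else:
            current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```

Because the decision only compares two answer probabilities, even a small local model can supply the scores, which is what makes the LumberChunker combination above possible.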

Highlights

  • Introduces the concept of Meta-Chunk, which operates at a granularity between sentences and paragraphs.

  • Proposes two implementation strategies: Margin Sampling (MSP) Chunking and Perplexity (PPL) Chunking.

  • Puts forward a Meta-Chunk dynamic combination strategy designed to strike an effective balance between fine-grained and coarse-grained text segmentation.
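The dynamic combination idea can be illustrated with a greedy merge: fine-grained meta-chunks are packed into coarser chunks while staying within a budget. This is only a sketch under simplifying assumptions (character length as a stand-in for a token budget; hypothetical function name, not the library's API):

```python
def combine_chunks(meta_chunks, max_len=120):
    """Dynamically merge fine-grained meta-chunks into coarser chunks,
    greedily packing neighbours while the combined length stays within
    `max_len` characters."""
    combined, current = [], ""
    for chunk in meta_chunks:
        if current and len(current) + 1 + len(chunk) > max_len:
            combined.append(current)   # budget exceeded: close the chunk
            current = chunk
        else:
            current = f"{current} {chunk}".strip() if current else chunk
    if current:
        combined.append(current)
    return combined
```

Raising `max_len` yields coarser chunks and lowering it yields finer ones, which is the dial the dynamic combination strategy turns.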

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lmchunker-0.5.1.tar.gz (12.8 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

lmchunker-0.5.1-py3-none-any.whl (16.3 kB)

Uploaded Python 3

File details

Details for the file lmchunker-0.5.1.tar.gz.

File metadata

  • Download URL: lmchunker-0.5.1.tar.gz
  • Upload date:
  • Size: 12.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.1 CPython/3.10.15 Linux/5.4.0-100-generic

File hashes

Hashes for lmchunker-0.5.1.tar.gz
Algorithm Hash digest
SHA256 78e9955bef6dcd4615745f934a621e5b656fcbfb124c07b7c396ed2c7045a35f
MD5 f296534ad7182ed512be892825e7e686
BLAKE2b-256 c229e66a6b22e0539b4feb128a3c322ee89baa84ee42a1b4864dd3e89c92f2ca

See more details on using hashes here.
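To verify a downloaded file against the digests above, Python's standard hashlib module is enough. A small sketch (the helper names are illustrative, not part of any tool):

```python
import hashlib

def sha256_of_file(path, chunk_size=8192):
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()

def verify(path, expected_hex):
    """Compare a file's digest against the value published on PyPI."""
    return sha256_of_file(path) == expected_hex.lower()
```

For example, `verify("lmchunker-0.5.1.tar.gz", "78e9955b...")` should return True for the sdist, using the full SHA256 value from the table above.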

File details

Details for the file lmchunker-0.5.1-py3-none-any.whl.

File metadata

  • Download URL: lmchunker-0.5.1-py3-none-any.whl
  • Upload date:
  • Size: 16.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.1 CPython/3.10.15 Linux/5.4.0-100-generic

File hashes

Hashes for lmchunker-0.5.1-py3-none-any.whl
Algorithm Hash digest
SHA256 a55532469242bab301d9b6cf5a54a6c0ff0b2687bf556c099e9fcc41ce168e4f
MD5 c4578e6102f0a03858296812f1c02d32
BLAKE2b-256 8638a9e9ad9502eb3f0d6a3a03d6b57dccb2c465d6372665fd770e4316b4351d

See more details on using hashes here.
