We intend to develop this project into a plug-and-play chunking library that incorporates various cutting-edge chunking strategies for LLMs.
Project description
Meta-Chunking: Learning Text Segmentation and Semantic Completion via Logical Perception
Meta-Chunking leverages the capabilities of LLMs to flexibly partition documents into logically coherent, independent chunks. Our approach is grounded in a core principle: allowing variability in chunk size to more effectively capture and maintain the logical integrity of content. This dynamic adjustment of granularity ensures that each segmented chunk contains a complete and independent expression of ideas, thereby avoiding breaks in the logical chain during the segmentation process. This not only enhances the relevance of document retrieval but also improves content clarity.
Note: Perplexity is a metric that measures how well a language model predicts text. It reflects the model's degree of uncertainty when generating the next token or sentence given a specific context. Our initial intuition is to split the text at points of certainty and keep it intact at points of uncertainty, which is more beneficial for subsequent retrieval and generation. In effect, perplexity-based chunking uses the hallucinations of language models to perceive text boundaries (relative to the boundaries of the models), so that chunks are not split at points where the language model hallucinates. This avoids introducing further hallucinations when LLMs perform retrieval and question answering.
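The intuition in the note above can be sketched in a few lines. The snippet below is a minimal illustration, not the lmchunker API: it assumes you already have per-token log-probabilities from some language model, and the names `perplexity`, `split_points`, and `threshold` are hypothetical. It treats a sentence as a candidate boundary when its perplexity is a local minimum, that is, a point where the model is comparatively certain.

```python
import math


def perplexity(token_logprobs):
    """Perplexity as exp of the mean negative log-likelihood of the tokens."""
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def split_points(sentence_ppls, threshold=0.0):
    """Return indices of sentences whose perplexity is a local minimum.

    A sentence i qualifies when both neighbours have a perplexity that is
    higher by more than `threshold` (an assumed, tunable parameter).
    """
    points = []
    for i in range(1, len(sentence_ppls) - 1):
        drop_from_left = sentence_ppls[i - 1] - sentence_ppls[i]
        rise_to_right = sentence_ppls[i + 1] - sentence_ppls[i]
        if drop_from_left > threshold and rise_to_right > threshold:
            points.append(i)
    return points


if __name__ == "__main__":
    # Made-up per-sentence perplexities, purely for illustration.
    ppls = [3.0, 1.2, 2.5, 2.4, 0.9, 1.8]
    print(split_points(ppls))
```

Raising `threshold` keeps only the more pronounced minima, which yields fewer and larger chunks.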
Todo
We intend to develop this project into a plug-and-play chunking library that incorporates various cutting-edge chunking strategies for LLMs. While you can use Llama_index for traditional chunking methods, it may be difficult for that library to keep up with the latest chunking technologies. Therefore, we will regularly refactor methods from excellent chunking papers into interfaces and add them to the library, making it easier for your system to integrate advanced chunking strategies.
Currently, all methods are maintained in the tools folder. The eval.ipynb file demonstrates usage examples of the different chunking method interfaces, while each of the other files implements a specific LLM chunking method.
- Release PPL Chunking and Margin Sampling Chunking
- 1. Refactor methods in Meta-Chunking into several interface formats for easy invocation.
    - PPL Chunking: Strategically introduce the KV caching mechanism to achieve PPL Chunking for both short and long documents (🚀 A Swift and Accurate Text Chunking Technique🌟).
    - Margin Sampling Chunking: Make a binary decision on whether consecutive sentences should be split, based on the probability obtained through margin sampling.
    - Dynamic combination: To accommodate diverse chunking requirements, introduce a dynamic combination strategy that balances fine-grained and coarse-grained text chunking.
- 2. Integrate LumberChunker: Refactor it into an interface for convenient invocation, and combine it with our margin sampling method to overcome the original project's inability to use small local models.
- 3. Integrate Dense X Retrieval: Refactor it into an interface for convenient invocation.
- ......
- Our follow-up work
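As a rough illustration of the Margin Sampling Chunking item above, the sketch below turns two answer scores into a split decision. It is an assumption-laden example rather than the library's interface: it supposes the model was asked a yes/no question about splitting two consecutive sentences and that you have the raw logits of the two answer tokens. The names `split_margin`, `should_split`, and `threshold` are hypothetical.

```python
import math


def split_margin(logit_split, logit_keep):
    """Probability margin between the 'split' and 'keep together' answers.

    The two logits are normalised with a numerically stable softmax, and the
    margin is P(split) - P(keep), a value in the range [-1, 1].
    """
    top = max(logit_split, logit_keep)
    exp_split = math.exp(logit_split - top)
    exp_keep = math.exp(logit_keep - top)
    total = exp_split + exp_keep
    return exp_split / total - exp_keep / total


def should_split(logit_split, logit_keep, threshold=0.0):
    """Split when the margin exceeds `threshold` (an assumed, tunable value)."""
    return split_margin(logit_split, logit_keep) > threshold


if __name__ == "__main__":
    # Made-up logits, purely for illustration.
    print(should_split(2.0, 0.0))
    print(should_split(0.0, 2.0))
```

A higher `threshold` makes the decision more conservative, so sentences stay together unless the model clearly prefers a split.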
Highlights
- Introduces the concept of Meta-Chunk, which operates at a granularity between sentences and paragraphs.
- Proposes two implementation strategies: Margin Sampling (MSP) Chunking and Perplexity (PPL) Chunking.
- Puts forward a Meta-Chunk with dynamic combination strategy designed to achieve an effective balance between fine-grained and coarse-grained text segmentation.
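One simple way to picture dynamic combination is as a greedy merge of adjacent meta-chunks up to a target length. The sketch below is only an assumption about how such a merge could look, not the method as implemented in the project. The function name `combine_chunks` and the character-based `target_len` are hypothetical; a real system might count tokens instead.

```python
def combine_chunks(chunks, target_len):
    """Greedily merge adjacent chunks while the result fits in `target_len`.

    Length is measured in characters, and chunks are joined with one space.
    A single chunk longer than `target_len` is kept whole, never cut.
    """
    merged = []
    current = ""
    for chunk in chunks:
        candidate = chunk if not current else current + " " + chunk
        if current and len(candidate) > target_len:
            # Adding this chunk would overflow: close the current group.
            merged.append(current)
            current = chunk
        else:
            current = candidate
    if current:
        merged.append(current)
    return merged


if __name__ == "__main__":
    fine_grained = ["aa", "bbb", "c", "dddddd"]
    print(combine_chunks(fine_grained, target_len=6))
```

A small `target_len` keeps the fine-grained meta-chunks almost unchanged, while a large one approaches coarse-grained, paragraph-like chunks.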
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file lmchunker-0.5.1.tar.gz.
File metadata
- Download URL: lmchunker-0.5.1.tar.gz
- Upload date:
- Size: 12.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.1 CPython/3.10.15 Linux/5.4.0-100-generic
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 78e9955bef6dcd4615745f934a621e5b656fcbfb124c07b7c396ed2c7045a35f |
| MD5 | f296534ad7182ed512be892825e7e686 |
| BLAKE2b-256 | c229e66a6b22e0539b4feb128a3c322ee89baa84ee42a1b4864dd3e89c92f2ca |
File details
Details for the file lmchunker-0.5.1-py3-none-any.whl.
File metadata
- Download URL: lmchunker-0.5.1-py3-none-any.whl
- Upload date:
- Size: 16.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.1 CPython/3.10.15 Linux/5.4.0-100-generic
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a55532469242bab301d9b6cf5a54a6c0ff0b2687bf556c099e9fcc41ce168e4f |
| MD5 | c4578e6102f0a03858296812f1c02d32 |
| BLAKE2b-256 | 8638a9e9ad9502eb3f0d6a3a03d6b57dccb2c465d6372665fd770e4316b4351d |