

Project description

Meta-Chunking: Learning Text Segmentation and Semantic Completion via Logical Perception

arXiv Paper · Hugging Face Daily Papers · Apache 2.0 License

Meta-Chunking leverages the capabilities of LLMs to flexibly partition documents into logically coherent, independent chunks. Our approach is grounded in a core principle: allowing variability in chunk size to more effectively capture and maintain the logical integrity of content. This dynamic adjustment of granularity ensures that each segmented chunk contains a complete and independent expression of ideas, thereby avoiding breaks in the logical chain during the segmentation process. This not only enhances the relevance of document retrieval but also improves content clarity.

Note: Perplexity is a metric that measures a language model's ability to predict text; it reflects the model's uncertainty when generating the next token or sentence given a specific context. Our initial intuition is to split the text at points where the model is certain and keep it intact where the model is uncertain, which benefits subsequent retrieval and generation. In effect, perplexity-based chunking uses the hallucinations of language models to perceive text boundaries (relative to the boundaries of the models themselves), ensuring that chunks are not split at points where the model hallucinates and thereby avoiding the introduction of further hallucinations during retrieval and question answering with LLMs.
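To make the intuition concrete, here is a minimal, self-contained sketch of perplexity-based boundary selection. All names are hypothetical illustrations, not the library's actual API, and it assumes per-token log-probabilities for each sentence have already been obtained from a causal LM:

```python
import math

def split_by_perplexity(sentences, logprobs, threshold=2.0):
    """Group sentences into chunks, allowing a boundary only after
    sentences where the model's perplexity is below `threshold`
    (i.e. the model is certain, so splitting there is unlikely to
    break a logical chain).

    `logprobs[i]` holds the token log-probabilities the LM assigned
    to sentence i given its preceding context.
    """
    chunks, current = [], []
    for sent, lps in zip(sentences, logprobs):
        current.append(sent)
        ppl = math.exp(-sum(lps) / len(lps))  # per-sentence perplexity
        if ppl < threshold:                   # model is certain: safe split point
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With toy numbers, a high-perplexity sentence stays attached to its successor, while low-perplexity sentences become clean boundaries.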

Todo

We intend to develop this project into a plug-and-play chunking library that incorporates various cutting-edge chunking strategies for LLMs. While you can use Llama_index for traditional chunking methods, it may be difficult for this library to keep up with the latest chunking technologies. Therefore, we will regularly reconstruct methods from excellent chunking papers into interfaces and add them to the library, making it easier for your system to integrate advanced chunking strategies.

Currently, all methods are maintained in the tools folder. The eval.ipynb file demonstrates usage examples of the different chunking method interfaces, while each of the other files implements a specific LLM chunking method.

  • Release PPL Chunking and Margin Sampling Chunking
  • 1. Refactor methods in Meta-Chunking into several interface formats for easy invocation.
    • PPL Chunking: introduces a KV caching mechanism so that PPL Chunking scales to both short and long documents (🚀 A Swift and Accurate Text Chunking Technique 🌟).
    • Margin Sampling Chunking: makes a binary decision on whether two consecutive sentences should be split, based on the probability margin obtained through margin sampling.
    • Dynamic combination: to accommodate diverse chunking requirements, a dynamic combination strategy assists chunking, striking a balance between fine-grained and coarse-grained text chunking.
  • 2. Integrating LumberChunker: refactoring it into an interface for convenient invocation, and combining it with our margin sampling method so that, unlike the original project, small local models can also be used.
  • 3. Integrating Dense X Retrieval: Refactoring it into an interface for convenient invocation.
  • ......
  • Our follow-up work
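The margin sampling decision described above can be sketched as follows. The scoring step (prompting an LLM for the probabilities of answering "yes" vs. "no" to "should these two sentences be split?") is abstracted away, and all names are hypothetical rather than the library's actual interface:

```python
def margin_sampling_decision(p_yes, p_no, threshold=0.0):
    """Binary segmentation decision for a pair of consecutive sentences.

    `p_yes` / `p_no` are the probabilities an LM assigns to answering
    "yes" / "no" to splitting between the two sentences; the margin
    between them drives the decision.
    """
    margin = p_yes - p_no
    return margin > threshold  # True → insert a chunk boundary

def margin_chunk(sentences, pair_probs, threshold=0.0):
    """Chunk sentences given per-pair (p_yes, p_no) scores for each
    adjacent sentence pair."""
    chunks, current = [], [sentences[0]]
    for sent, (p_yes, p_no) in zip(sentences[1:], pair_probs):
        if margin_sampling_decision(p_yes, p_no, threshold):
            chunks.append(" ".join(current))
            current = [sent]
        else:
            current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```

Because the decision only compares two answer probabilities, even a small local model can supply the scores, which is what makes the LumberChunker combination above possible.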

Highlights

  • Introduces the concept of Meta-Chunk, which operates at a granularity between sentences and paragraphs.

  • Proposes two implementation strategies: Margin Sampling (MSP) Chunking and Perplexity (PPL) Chunking.

  • Puts forward a Meta-Chunk dynamic combination strategy designed to strike an effective balance between fine-grained and coarse-grained text segmentation.
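The dynamic combination idea can be illustrated with a greedy merge: fine-grained meta-chunks are packed into coarser chunks while staying within a budget. This is only a sketch under simplifying assumptions (character length as a stand-in for a token budget; hypothetical function name, not the library's API):

```python
def combine_chunks(meta_chunks, max_len=120):
    """Dynamically merge fine-grained meta-chunks into coarser chunks,
    greedily packing neighbours while the combined length stays within
    `max_len` characters."""
    combined, current = [], ""
    for chunk in meta_chunks:
        if current and len(current) + 1 + len(chunk) > max_len:
            combined.append(current)   # budget exceeded: close the chunk
            current = chunk
        else:
            current = f"{current} {chunk}".strip() if current else chunk
    if current:
        combined.append(current)
    return combined
```

Raising `max_len` yields coarser chunks and lowering it yields finer ones, which is the dial the dynamic combination strategy turns.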

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lmchunker-0.5.1.tar.gz (12.8 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

lmchunker-0.5.1-py3-none-any.whl (16.3 kB)

Uploaded Python 3

File details

Details for the file lmchunker-0.5.1.tar.gz.

File metadata

  • Download URL: lmchunker-0.5.1.tar.gz
  • Upload date:
  • Size: 12.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.1 CPython/3.10.15 Linux/5.4.0-100-generic

File hashes

Hashes for lmchunker-0.5.1.tar.gz
Algorithm Hash digest
SHA256 78e9955bef6dcd4615745f934a621e5b656fcbfb124c07b7c396ed2c7045a35f
MD5 f296534ad7182ed512be892825e7e686
BLAKE2b-256 c229e66a6b22e0539b4feb128a3c322ee89baa84ee42a1b4864dd3e89c92f2ca

See more details on using hashes here.
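To verify a downloaded file against the digests above, Python's standard hashlib module is enough. A small sketch (the helper names are illustrative, not part of any tool):

```python
import hashlib

def sha256_of_file(path, chunk_size=8192):
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()

def verify(path, expected_hex):
    """Compare a file's digest against the value published on PyPI."""
    return sha256_of_file(path) == expected_hex.lower()
```

For example, `verify("lmchunker-0.5.1.tar.gz", "78e9955b...")` should return True for the sdist, using the full SHA256 value from the table above.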

File details

Details for the file lmchunker-0.5.1-py3-none-any.whl.

File metadata

  • Download URL: lmchunker-0.5.1-py3-none-any.whl
  • Upload date:
  • Size: 16.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.1 CPython/3.10.15 Linux/5.4.0-100-generic

File hashes

Hashes for lmchunker-0.5.1-py3-none-any.whl
Algorithm Hash digest
SHA256 a55532469242bab301d9b6cf5a54a6c0ff0b2687bf556c099e9fcc41ce168e4f
MD5 c4578e6102f0a03858296812f1c02d32
BLAKE2b-256 8638a9e9ad9502eb3f0d6a3a03d6b57dccb2c465d6372665fd770e4316b4351d

See more details on using hashes here.
