Context aware chunking using perplexity
Context Aware Chunker
When performing semantic search using vector similarity, one of the key decisions is the size of the chunks you index.
Chunk size affects retrieval accuracy, the amount of contextual information retained at inference time, and the quality of the final result.
One of the easiest ways to boost accuracy is to keep highly correlated information in a single atomic chunk rather than spreading it across several, since a search that retrieves only some of those chunks will miss part of the relevant information.
How does this package work?
The idea is quite simple. Language models are very good at recognizing when two pieces of text belong together.
When they do, the perplexity of the combined text remains low; when they don't, the perplexity is much higher.
Based on this, we can merge adjacent groups of text whose combination has low perplexity, producing chunks of highly correlated information.
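The merging idea can be sketched in a few lines. This is a minimal illustration and not the package's implementation: `toy_perplexity` and `merge_by_perplexity` are hypothetical names, and the scorer is a smoothed unigram model standing in for a real language model such as T5. The threshold value is likewise an assumption chosen for this toy data.

```python
import math
from collections import Counter
from typing import Callable, List


def toy_perplexity(context: str, continuation: str) -> float:
    """Perplexity of `continuation` under an add-one-smoothed unigram
    model estimated from `context`. A stand-in for a real language model."""
    ctx = context.lower().split()
    cont = continuation.lower().split()
    if not cont:
        return 1.0
    counts = Counter(ctx)
    vocab = set(ctx) | set(cont)
    total = len(ctx) + len(vocab)  # add-one smoothing denominator
    nll = -sum(math.log((counts[w] + 1) / total) for w in cont)
    return math.exp(nll / len(cont))


def merge_by_perplexity(
    segments: List[str],
    score: Callable[[str, str], float],
    threshold: float,
) -> List[str]:
    """Greedily append each segment to the current chunk while its
    perplexity given that chunk stays at or below `threshold`."""
    chunks: List[str] = []
    for seg in segments:
        if chunks and score(chunks[-1], seg) <= threshold:
            chunks[-1] = chunks[-1] + " " + seg
        else:
            chunks.append(seg)
    return chunks


segments = [
    "the cat sat on the mat",
    "the cat sat on the rug",
    "stock prices fell sharply today",
]
chunks = merge_by_perplexity(segments, toy_perplexity, threshold=10.0)
for chunk in chunks:
    print(chunk)
```

Because the scorer is passed in as a function, the same loop would work with a perplexity computed by a neural model; only the threshold would need retuning.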
Usage
WARNING: This is an alpha release. It is suitable only for testing, not for production.
Installation
pip install context_aware_chunker
Python Code
from context_aware_chunker.chunking_models import T5ChunkerModel
from context_aware_chunker.text_splitter import SentenceSplitter

text = "<INSERT TEXT HERE>"

# Finds sentence segments in unstructured text.
splitter = SentenceSplitter()

# Decides which sentence segments to merge and which to keep separate.
# If you have more GPU power, you can try a larger model.
chunking_agent = T5ChunkerModel('t5-small')

# merge_sentences sets how many sentences go into each initial split.
# The default is 1; try increasing it and compare the results.
split_content = splitter.split_text(text, merge_sentences=1)
chunks = chunking_agent.chunk(split_content)
for chunk in chunks:
    print(chunk)
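To make the effect of merge_sentences concrete, here is a rough sketch of grouping sentences n at a time before chunking. It is an assumption about the general behavior, not the package's code: `split_into_groups` is a hypothetical helper, and the regex-based sentence splitting is a simplification that a real splitter would likely improve on.

```python
import re
from typing import List


def split_into_groups(text: str, merge_sentences: int = 1) -> List[str]:
    """Split text into sentences, then join every `merge_sentences`
    consecutive sentences into one segment (hypothetical helper)."""
    if merge_sentences < 1:
        raise ValueError("merge_sentences must be at least 1")
    # Naive split: break on whitespace that follows '.', '!' or '?'.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i:i + merge_sentences])
        for i in range(0, len(sentences), merge_sentences)
    ]


sample = "Chunks matter. Size is a trade-off! Is bigger better? It depends."
print(split_into_groups(sample, merge_sentences=1))
print(split_into_groups(sample, merge_sentences=2))
```

A larger value yields fewer, longer initial segments, so the chunking model makes fewer merge decisions but with coarser boundaries.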