
Context aware chunking using perplexity


Context Aware Chunker

When performing semantic search using vector similarity, one of the key issues that arises is the size of the chunk you are using.

Chunk size affects several things, including the accuracy of retrieval, the amount of contextual information retained at inference time, and the accuracy of the final result.

One of the easiest ways to boost accuracy is to keep highly correlated information in a single atomic chunk rather than splitting it across several chunks, since some of those pieces might be missed during semantic search.

How does this package work?

The idea is quite simple. Language models are good at judging whether two pieces of text belong together.

When the pieces belong together, the perplexity of the combined text stays low; when they do not, the perplexity is much higher.

Based on this, we can merge two groups of text whenever the perplexity stays low, producing chunks of highly correlated information.
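To make the idea concrete, here is a minimal sketch of perplexity-based merging. It is not this package's implementation. The names `perplexity`, `merge_by_perplexity`, `toy_score`, and the `threshold` value are all hypothetical, and `toy_score` is only a word-overlap stand-in for a real language model, which would supply the per-token log-probabilities instead.

```python
import math
from typing import Callable, List, Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity is exp of the negative mean log-probability of the tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def toy_score(context: str, candidate: str) -> float:
    """Stand-in scorer (an assumption, NOT a real language model).

    Words of `candidate` already seen in `context` get probability 0.5,
    unseen words get 0.05, so related text scores a lower perplexity.
    """
    seen = set(context.lower().split())
    log_probs = [
        math.log(0.5 if word in seen else 0.05)
        for word in candidate.lower().split()
    ]
    return perplexity(log_probs)


def merge_by_perplexity(
    segments: List[str],
    score: Callable[[str, str], float],
    threshold: float,
) -> List[str]:
    """Greedily append each segment to the current chunk while the
    perplexity of the segment given that chunk stays at or below threshold."""
    chunks: List[str] = []
    for segment in segments:
        if chunks and score(chunks[-1], segment) <= threshold:
            chunks[-1] = chunks[-1] + " " + segment
        else:
            chunks.append(segment)
    return chunks


segments = ["the cat sat", "the cat slept", "stocks fell sharply"]
for chunk in merge_by_perplexity(segments, toy_score, threshold=10.0):
    print(chunk)
```

In this sketch the first two segments share words, so the second scores a low perplexity and is merged, while the third is unrelated and starts a new chunk. With a real model, the threshold would need tuning for the model and the text.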

Usage

WARNING: This is an alpha release. It is suitable for testing only, not for production.

Installation

pip install context_aware_chunker

Python Code

from context_aware_chunker.chunking_models import T5ChunkerModel
from context_aware_chunker.text_splitter import SentenceSplitter

text = "<INSERT TEXT HERE>"

# Finds the sentences in unstructured text
splitter = SentenceSplitter()

# Decides which sentence segments to merge and which to keep separate.
# If you have more GPU power, you can try a larger model.
chunking_agent = T5ChunkerModel('t5-small')

# merge_sentences sets how many sentences go into each split.
# The default is 1; try increasing it and compare the results.
split_content = splitter.split_text(text, merge_sentences=1)

chunks = chunking_agent.chunk(split_content)

for chunk in chunks:
    print(chunk)
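
As a rough illustration of what merge_sentences controls, the sketch below groups consecutive sentences into segments of a fixed size. This is an assumption about the behaviour, not the code of SentenceSplitter: the helper name `group_sentences` is hypothetical, and the regex split on sentence-ending punctuation is a simplification.

```python
import re
from typing import List


def group_sentences(text: str, merge_sentences: int = 1) -> List[str]:
    """Split text into sentences, then join every `merge_sentences`
    consecutive sentences into one segment (hypothetical helper)."""
    if merge_sentences < 1:
        raise ValueError("merge_sentences must be at least 1")
    # Naive split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        " ".join(sentences[i:i + merge_sentences])
        for i in range(0, len(sentences), merge_sentences)
    ]


sample = "A one. B two! C three? D four."
print(group_sentences(sample, merge_sentences=2))
```

Larger groups give the chunking model more context per segment but make the chunk boundaries coarser.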
