An SDK that makes it easy to do contextual chunking
Overview
Chonktxt is an SDK that makes contextual chunking easy. It is inspired by Anthropic's contextual retrieval article.
Set a PDF file or a text string as the source document, pass in your chunks, and Chonktxt returns each original chunk together with a contextualized version of it.
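One common way to use the two pieces together, following the approach described in Anthropic's contextual retrieval article, is to prepend the generated context to the original chunk before embedding or indexing it. The sketch below is illustrative only: `build_index_text` is a hypothetical helper that is not part of Chonktxt, and it assumes each result is a dict with the keys `original_chunk` and `contextualized_chunk`, as in the Usage output.

```python
def build_index_text(record: dict) -> str:
    """Prepend the generated context to the original chunk (hypothetical helper)."""
    context = record.get("contextualized_chunk", "").strip()
    original = record["original_chunk"].strip()
    # Fall back to the bare chunk if no context was generated.
    return f"{context}\n\n{original}" if context else original


# Sample record shaped like the Chonktxt output (contents shortened for illustration).
records = [
    {
        "original_chunk": "Chain-of-Thought (Wei et al., 2022) 28.0 ±3.1",
        "contextualized_chunk": "Baseline results for the Chain-of-Thought agent.",
    }
]

# These strings are what you would pass to your embedding model or keyword index.
index_texts = [build_index_text(r) for r in records]
print(index_texts[0])
```

Keeping the original chunk in the indexed text means exact values and names stay searchable, while the prepended context supplies the surrounding meaning.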
Installation
pip install chonktxt
Usage
import os
from chonktxt import Chonktxt
# Read the API key from the environment instead of hard-coding it
client = Chonktxt(anthropic_api_key=os.environ["ANTHROPIC_API_KEY"])
# Use a PDF file as the source document
client.use_doc_pdf("./large-pdf.pdf")
# Or use text as the source document
client.use_doc_txt("This is a very very long text")
# Get contextualized chunks
contextualized_chunks, token_counts = client.contextualize_chunks(
    chunks=[
        "Chain-of-Thought (Wei et al., 2022) 28.0 ±3.1 34.9 ±3.2 15.0 ±2.5 77.8 ±2.8 88.9 ±2.2",
    ],
    # Use 1 thread the first time you process a given document, so that
    # Anthropic has cached the document before any parallel requests run.
    # Feel free to increase this number on subsequent calls.
    parallel_threads=1,
)
print(contextualized_chunks)
# Output:
# [
#     {
#         "original_chunk": "Chain-of-Thought (Wei et al., 2022) 28.0 ±3.1 34.9 ±3.2 15.0 ±2.5 77.8 ±2.8 88.9 ±2.2",
#         "contextualized_chunk": "This chunk presents the performance of the Chain-of-Thought baseline agent on various benchmarks, including reading comprehension (MGSM), math (GSM8K, GSM-Hard, SVAMP, ASDiv), and multi-task (MMLU) domains."
#     },
# ]
# On its own, the original chunk is just a row of numbers; the contextualized chunk explains what those numbers are.
# Print a token usage summary
print(f"Total input tokens without caching: {token_counts['input']}")
print(f"Total output tokens: {token_counts['output']}")
print(f"Total input tokens written to cache: {token_counts['cache_creation']}")
print(f"Total input tokens read from cache: {token_counts['cache_read']}")
total_tokens = token_counts['input'] + token_counts['cache_read'] + token_counts['cache_creation']
savings_percentage = (token_counts['cache_read'] / total_tokens) * 100 if total_tokens > 0 else 0
print(f"Total input token savings from prompt caching: {savings_percentage:.2f}% of all input tokens used were read from cache.")
print("Tokens read from cache come at a 90 percent discount!")
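To turn the token counts above into a rough cost figure, the percentages can be weighted by price. The sketch below is a back-of-the-envelope estimate, not part of Chonktxt: `estimate_input_cost` is a hypothetical helper, the base price is a placeholder you must replace with your model's actual input price, and the multipliers are assumptions (cache reads at 10% of the base price, matching the 90 percent discount mentioned above, and cache writes at a 25% premium, which you should check against Anthropic's current pricing).

```python
def estimate_input_cost(
    token_counts: dict,
    base_price_per_mtok: float,
    cache_write_multiplier: float = 1.25,  # assumption: cache writes cost 25% more
    cache_read_multiplier: float = 0.10,   # assumption: cache reads cost 10% of base
) -> dict:
    """Estimate input cost with and without prompt caching (hypothetical helper)."""
    per_token = base_price_per_mtok / 1_000_000
    uncached = token_counts["input"]
    written = token_counts["cache_creation"]
    read = token_counts["cache_read"]

    # With caching, each token category is billed at its own rate.
    with_cache = per_token * (
        uncached
        + written * cache_write_multiplier
        + read * cache_read_multiplier
    )
    # Without caching, every input token would be billed at the base rate.
    without_cache = per_token * (uncached + written + read)
    return {"with_cache": with_cache, "without_cache": without_cache}


# Made-up counts and a placeholder price of 3.00 per million input tokens.
sample_counts = {"input": 1_000, "output": 2_000, "cache_creation": 50_000, "cache_read": 450_000}
costs = estimate_input_cost(sample_counts, base_price_per_mtok=3.0)
print(f"Input cost with caching: {costs['with_cache']:.4f}")
print(f"Input cost without caching: {costs['without_cache']:.4f}")
```

Output tokens are billed separately and are left out of this estimate, since caching does not affect them.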