An SDK that makes it easy to do contextual chunking

Overview

Chonktxt is an SDK that makes contextual chunking easy. It is inspired by Anthropic's article on contextual retrieval.

Pass in a PDF file or plain text as the source document, together with your chunks, and Chonktxt returns each original chunk paired with its contextualized version.
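
One common way to use the returned pair, following the approach described in Anthropic's contextual retrieval article, is to prepend the generated context to the original chunk before embedding or indexing it. The sketch below is only an illustration: it assumes each result is a dictionary with the `original_chunk` and `contextualized_chunk` keys shown in the Usage output, the sample values are shortened and illustrative, and `to_index_text` is a hypothetical helper that is not part of Chonktxt.

```python
def to_index_text(item: dict) -> str:
    """Prepend the generated context to the original chunk text.

    Hypothetical helper, not part of Chonktxt. Assumes `item` has the
    `original_chunk` and `contextualized_chunk` keys shown under Usage.
    """
    return f"{item['contextualized_chunk']}\n\n{item['original_chunk']}"


# Shortened, illustrative sample in the shape of one returned item.
sample = {
    "original_chunk": "Chain-of-Thought (Wei et al., 2022) 28.0 34.9 15.0",
    "contextualized_chunk": "This chunk presents the performance of the "
                            "Chain-of-Thought baseline agent.",
}

index_text = to_index_text(sample)
print(index_text)
```

The combined string, rather than the bare chunk, is what you would then pass to your embedding model or keyword index.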

Installation

pip install chonktxt

Usage

from chonktxt import Chonktxt

# Replace the placeholder with your own Anthropic API key
client = Chonktxt(anthropic_api_key="YOUR_ANTHROPIC_API_KEY")

# Use a PDF file as the source document
client.use_doc_pdf("./large-pdf.pdf")

# Or use text as the source document
client.use_doc_txt("This is a very very long text")

# Get contextualized chunks
contextualized_chunks, token_counts = client.contextualize_chunks(
    chunks=[
        "Chain-of-Thought (Wei et al., 2022) 28.0 ±3.1 34.9 ±3.2 15.0 ±2.5 77.8 ±2.8 88.9 ±2.2",
    ],

    # Use 1 thread the first time you process a given document, so that
    # Anthropic has cached the document before any parallel requests start.
    # Feel free to increase this number on subsequent calls.
    parallel_threads=1
)

print(contextualized_chunks)
# output:
# [
#     {
#         "original_chunk": "Chain-of-Thought (Wei et al., 2022) 28.0 ±3.1 34.9 ±3.2 15.0 ±2.5 77.8 ±2.8 88.9 ±2.2",
#         "contextualized_chunk": "This chunk presents the performance of the Chain-of-Thought baseline agent on various benchmarks, including reading comprehension (MGSM), math (GSM8K, GSM-Hard, SVAMP, ASDiv), and multi-task (MMLU) domains."
#     },
# ]

# As the output above shows, the original chunk is just a row of numbers with no meaning on its own,
# while the contextualized chunk describes what those numbers represent.

# Print a token usage summary
print(f"Total input tokens without caching: {token_counts['input']}")
print(f"Total output tokens: {token_counts['output']}")
print(f"Total input tokens written to cache: {token_counts['cache_creation']}")
print(f"Total input tokens read from cache: {token_counts['cache_read']}")

total_tokens = token_counts['input'] + token_counts['cache_read'] + token_counts['cache_creation']
savings_percentage = (token_counts['cache_read'] / total_tokens) * 100 if total_tokens > 0 else 0
print(f"Total input token savings from prompt caching: {savings_percentage:.2f}% of all input tokens used were read from cache.")
print("Tokens read from cache come at a 90 percent discount!")
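
The token counts above can also be turned into a rough cost estimate. The following is a minimal sketch, not part of Chonktxt: `input_cost_ratio` is a hypothetical helper, and the two multipliers are assumptions based on Anthropic's published prompt-caching pricing (cache writes billed at 1.25x the base input price for the default cache lifetime, cache reads at 0.1x). Check the current pricing before relying on these numbers.

```python
# Assumed multipliers relative to the base input-token price.
CACHE_WRITE_MULTIPLIER = 1.25  # assumption: cache writes cost 25% more
CACHE_READ_MULTIPLIER = 0.10   # assumption: cache reads cost 10% (90% off)


def input_cost_ratio(token_counts: dict) -> float:
    """Return the input cost with caching as a fraction of the cost
    of sending the same number of input tokens without caching.

    Hypothetical helper; expects the `input`, `cache_creation`, and
    `cache_read` keys used in the usage summary above.
    """
    uncached = token_counts["input"]
    written = token_counts["cache_creation"]
    read = token_counts["cache_read"]

    total = uncached + written + read
    if total == 0:
        return 1.0  # nothing was sent, so there is nothing to save

    billed = (
        uncached
        + written * CACHE_WRITE_MULTIPLIER
        + read * CACHE_READ_MULTIPLIER
    )
    return billed / total


# Illustrative numbers only, not measured results.
example = {"input": 1_000, "output": 500, "cache_creation": 10_000, "cache_read": 90_000}
ratio = input_cost_ratio(example)
print(f"Input cost with caching: {ratio:.1%} of the uncached cost")
```

A ratio below 1.0 means caching reduced the input bill. On the very first call for a document the ratio can exceed 1.0, because most tokens are cache writes rather than cache reads.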
