Cache-aware prompt structure optimizer + local cost/usage logger for LLM apps.

ContextOps

Cache-aware prompt structure optimizer + LLM-as-judge eval + local cost/usage logger.

Stop paying for the same tokens twice. ContextOps reorders your prompt sections so stable content (system prompt, tools) sits at the top — and variable content (query, history) sits at the bottom — maximizing cache hit rate on Anthropic / OpenAI / DeepSeek / any provider that does prefix caching.

No cloud, no SaaS, no SDK lock-in. Just pip install contextops-tool and go.


⚡ Quickstart

pip install contextops-tool

from contextops import optimize, Prompt

p = Prompt(
    query="What's the weather in Berlin?",
    history=[{"role": "user", "content": "Hi!"}],
    documents="Berlin weather API docs...",
    system="You are a helpful weather assistant.",
    tools='[{"name": "get_weather"}]',
    model="gpt-4o",
)

result = optimize(p)
print(result.diff())                            # history → documents → ... → query
print(f"Cache hit: {result.estimated_cache_hit_rate:.1%}")
print(f"Saves ~${result.estimated_cost_savings_usd:.4f} per 1k calls")

Output:

Section order: history → documents → ... → system → ... → query
Cache hit: 71.0%
Saves ~$0.1006 per 1k calls

That's it. Same prompt, same tokens, ~70% cache hit rate instead of ~5%.


🤔 Why?

LLM providers (Anthropic, OpenAI, DeepSeek, Google) cache the prefix of your prompt. If the prefix is stable across calls, the cached tokens are billed at a discount: as little as about 10% of the normal input price, though the exact rate varies by provider and model.

The trick: keep the prefix stable by putting variable content (query, history) at the end.

ContextOps knows the canonical ordering by stability:

system → tools → role → context → documents → history → query
  ↑ stable                                                ↑ variable
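
To see why this ordering helps, here is a minimal sketch (not ContextOps's actual implementation) that sorts sections into the canonical order above and measures how much of the rendered prompt two consecutive calls share as a prefix. The helpers `render` and `shared_prefix_len` and the `[name]` section markers are hypothetical, and character counts stand in for tokens.

```python
import os

# Canonical stability order from the list above (most stable first).
CANONICAL = ["system", "tools", "role", "context", "documents", "history", "query"]


def render(sections: dict[str, str], reorder: bool) -> str:
    """Join sections in insertion order, or in canonical order if reorder=True."""
    names = list(sections)
    if reorder:
        names.sort(key=CANONICAL.index)
    return "\n".join(f"[{name}]\n{sections[name]}" for name in names)


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two rendered prompts have in common."""
    return len(os.path.commonprefix([a, b]))


# Two calls that differ only in the query, both written query-first.
call_1 = {"query": "Weather in Berlin?", "system": "You are a weather assistant.", "tools": "[]"}
call_2 = {"query": "Weather in Paris?", "system": "You are a weather assistant.", "tools": "[]"}

unordered = shared_prefix_len(render(call_1, False), render(call_2, False))
reordered = shared_prefix_len(render(call_1, True), render(call_2, True))
print(f"shared prefix: {unordered} chars unordered, {reordered} chars reordered")
```

With the query first, the two prompts diverge almost immediately. With the query last, the whole system and tools block is shared, which is the part a prefix cache can reuse.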

Estimated impact on a typical workload:

| Setup | Cache hit rate | Effective $/1M input |
|---|---|---|
| Random order | ~5% | $X (full price) |
| ContextOps optimized | ~78% | ~$0.3·X (10% on cached prefix) |
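
The effective price in the table follows from a simple blend of cached and uncached tokens. The sketch below assumes cached tokens cost 10% of the full input price and ignores any cache-write surcharge a provider may add. The function name `effective_input_price` is illustrative and not part of the ContextOps API.

```python
def effective_input_price(full_price: float, hit_rate: float, cached_fraction: float = 0.10) -> float:
    """Blended input price when `hit_rate` of the tokens are served from cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    uncached = (1.0 - hit_rate) * full_price
    cached = hit_rate * cached_fraction * full_price
    return uncached + cached


X = 1.0  # normalize the full input price to 1
print(round(effective_input_price(X, 0.05), 3))  # random order
print(round(effective_input_price(X, 0.78), 3))  # optimized order
```

At a 78% hit rate this gives 0.22·X + 0.78·0.1·X = 0.298·X, which is the ~$0.3·X figure above. At a 5% hit rate the result is about 0.955·X, close to full price.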

🧰 What's in the box

| Feature | Description |
|---|---|
| Cache-aware reordering | Moves stable sections to the top, variable to the bottom. Same total tokens, much higher cache hit rate. |
| Token counting | tiktoken-based, model-aware (gpt-4o, claude-*, qwen*, fallback to cl100k_base). |
| Cost estimation | Per-model pricing baked in; estimates $/1k calls before vs after reorder. |
| LLM-as-judge eval | Built-in metrics: faithfulness, relevance, completeness, conciseness. |
| A/B testing | Run two prompts over a golden dataset, get structural + quality deltas. |
| Local SQLite logger | Every LLM call goes to ~/.contextops/calls.db. Zero cloud. |
| Dataset loaders | .json, .jsonl, .csv golden QA datasets. |
| Rich CLI | optimize / stats / recent / compare / eval / reset with tables and progress bars. |
| LiteLLM auto-log (optional) | One line to auto-log every litellm call. pip install "contextops-tool[integrations]" |
| Bench harness | 1000+ prompts through Ollama, LM Studio, or OpenRouter. |

📦 Install

# Core (optimizer + logger + eval + CLI)
pip install contextops-tool

# With LiteLLM auto-callback for real LLM logging
pip install "contextops-tool[integrations]"

# With dev tooling (pytest, ruff, mypy)
pip install "contextops-tool[dev]"

# Everything
pip install "contextops-tool[all]"

From source:

git clone https://github.com/QuickLeopard/contextops.git
cd contextops
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,integrations]"
pytest                                          # 35 tests
python -m contextops_bench smoke               # offline smoke

Requires Python 3.10+.

macOS / Linux gotcha: python not found

If you see zsh: command not found: python, use python3 instead. Or install Python 3.12 via Homebrew and add it to PATH:

brew install python@3.12
export PATH="/opt/homebrew/opt/python@3.12/bin:$PATH"
python --version   # should print 3.12.x

If you want a fully automated setup, run the bootstrap script after cloning:

git clone https://github.com/QuickLeopard/contextops.git
cd contextops
./scripts/bootstrap.sh   # installs Python, creates venv, runs tests + smoke

No python -m venv?

Some Linux distros ship Python without venv:

# Debian / Ubuntu
sudo apt install python3.12-venv

# Fedora (venv ships with the Python package itself, no separate -venv package)
sudo dnf install python3.12

# Or use virtualenv instead
pip install virtualenv
virtualenv .venv

📖 Usage

1. Optimize a prompt (Python)

from contextops import optimize, Prompt

p = Prompt(
    query="What's the weather in Berlin?",
    documents="API docs...",
    system="You are a helpful weather assistant.",
    tools="[]",
    model="gpt-4o",
)

result = optimize(p)
print(result.diff())                  # before → after
print(result.optimized_sections)      # [(Section, content), ...]

2. Compare two prompts (Python)

from contextops.eval import compare

# bad_prompt and good_prompt are two prompts you have already built
report = compare(baseline=bad_prompt, optimized=good_prompt)
print(report["delta"])
# {"tokens": 0, "cache_hit_rate": 0.65, "cost_savings_per_1k_usd": 4.21}

3. A/B eval with LLM-as-judge

from contextops import evaluate_ab, load_dataset, Prompt, LiteLLMJudge

dataset = load_dataset("evals/sample_dataset.jsonl")

baseline = Prompt(system="...", query="", documents="{ctx}", model="gpt-4o-mini")
optimized = Prompt(system="...", documents="{ctx}", query="", model="gpt-4o-mini")

def my_llm(prompt_str: str) -> str:
    # call_my_llm is a placeholder: swap in your own client call
    return call_my_llm(prompt_str)

report = evaluate_ab(
    baseline, optimized,
    run_fn=my_llm,
    dataset=dataset,
    metrics=["faithfulness", "relevance", "completeness"],
    judge=LiteLLMJudge(),
    on_render=lambda p, item: p.system.replace("{ctx}", item.context),
)

print(report["structural"])   # tokens / cache / cost deltas
print(report["quality"])      # per-metric judge deltas
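
For intuition about what a per-metric delta means, here is a minimal sketch that averages judge scores per metric and subtracts baseline from optimized. It is an illustration under assumptions, not how `evaluate_ab` is necessarily implemented: the `metric_deltas` helper is hypothetical and the scores are made-up sample values.

```python
from statistics import mean


def metric_deltas(
    baseline_scores: list[dict[str, float]],
    optimized_scores: list[dict[str, float]],
) -> dict[str, float]:
    """Mean optimized score minus mean baseline score, per metric."""
    metrics = baseline_scores[0].keys()
    return {
        m: mean(s[m] for s in optimized_scores) - mean(s[m] for s in baseline_scores)
        for m in metrics
    }


# Made-up judge scores for two dataset items (illustrative only).
baseline_scores = [
    {"faithfulness": 0.8, "relevance": 0.6},
    {"faithfulness": 0.6, "relevance": 0.8},
]
optimized_scores = [
    {"faithfulness": 0.9, "relevance": 0.7},
    {"faithfulness": 0.7, "relevance": 0.7},
]
print(metric_deltas(baseline_scores, optimized_scores))
```

A positive delta means the optimized prompt scored higher on that metric. A delta near zero is the expected outcome for a pure reorder, since the content of the prompt is unchanged.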

4. CLI

# Optimize a prompt inline
contextops optimize \
    --system "You are a weather assistant." \
    --query "What's the weather in Berlin?" \
    --documents "API docs..." \
    --model gpt-4o

# Load a prompt from a JSON file
contextops optimize --from-json my_prompt.json

# Side-by-side comparison
contextops compare baseline.json optimized.json

# A/B eval with offline echo judge
contextops eval \
    --baseline evals/baseline_prompt.json \
    --optimized evals/optimized_prompt.json \
    --dataset evals/sample_dataset.jsonl \
    --metrics relevance,completeness,faithfulness \
    --echo --run-fn echo \
    --output report.json

# Real LLM-as-judge
pip install "contextops-tool[integrations]"
contextops eval \
    --baseline evals/baseline_prompt.json \
    --dataset evals/sample_dataset.jsonl \
    --judge-model gpt-4o-mini \
    --metrics relevance,completeness,faithfulness,conciseness

# Local call stats
contextops stats
contextops recent --limit 50

# Reset the local database
contextops reset

5. Auto-log every LiteLLM call

from contextops.integrations import install_callback
install_callback()

import litellm
litellm.completion(model="gpt-4o", messages=[{"role": "user", "content": "hi"}])
# → automatically logged to ~/.contextops/calls.db
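
Because the log is a plain SQLite file, you can inspect it with the standard library alone. The sketch below makes no assumption about the schema: it opens the file read-only and lists whatever tables exist. The `list_tables` helper is hypothetical and not part of ContextOps.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path: Path) -> list[str]:
    """Return the table names in a SQLite file without assuming any schema."""
    # Open read-only so inspection can never modify the log.
    uri = db_path.resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


if __name__ == "__main__":
    db = Path.home() / ".contextops" / "calls.db"
    if db.exists():
        print(list_tables(db))
    else:
        print(f"No log found at {db}")
```

From there, `PRAGMA table_info(<table>)` shows the columns of any table you want to query further.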

🆚 Comparison

| Tool | What it does | Where ContextOps is different |
|---|---|---|
| DSPy | Auto-rewrites prompt text using a dataset | We reorder sections: no dataset, no model rewrite |
| RAGAS / DeepEval | Evaluate answer quality via LLM-judge | We measure structure + cost, complementary not competing |
| Langfuse | Cloud LLM observability | We stay local-first: SQLite, no signup |
| prompt-cache / token-optimizer | Cache responses, compress tokens | We focus on provider cache (Anthropic / OpenAI), not response cache |
| vaibkumr/prompt-optimizer | Compress text (LLMLingua-style) | We reorder, never change tokens or text |

🧪 Bench harness

1000+ prompts through Ollama, LM Studio, or OpenRouter:

# Smoke (10 prompts, <30s, no LLM, for CI)
python -m contextops_bench smoke

# Local (100 prompts via Ollama)
python -m contextops_bench local --provider ollama --model llama3.1:8b --n 100

# Cloud (1000 prompts via OpenRouter, 3 models, parallel)
export OPENROUTER_API_KEY=sk-or-v1-...
python -m contextops_bench cloud --provider openrouter \
    --model openai/gpt-4o-mini,anthropic/claude-3.5-haiku,meta-llama/llama-3.1-8b-instruct \
    --n 1000 --parallel 4

Each run writes:

  • bench/results/<label>.csv — every observation (prompt_id, model, tokens, cache hit, cost, latency, error, section order)
  • bench/results/<label>.summary.json — aggregated stats with optimized vs baseline deltas
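
The per-observation CSV is easy to slice yourself. Below is a minimal sketch that averages one numeric column per model using only the standard library. The exact header names in the real file are not documented here, so `model` and `latency` are assumptions: check the first line of your CSV and adjust. The `mean_by_model` helper is hypothetical.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def mean_by_model(csv_file, value_column: str, group_column: str = "model") -> dict[str, float]:
    """Average a numeric column per group, skipping rows where it is blank."""
    groups: dict[str, list[float]] = defaultdict(list)
    for row in csv.DictReader(csv_file):
        raw = (row.get(value_column) or "").strip()
        if raw:
            groups[row[group_column]].append(float(raw))
    return {group: mean(values) for group, values in groups.items()}


# Tiny in-memory stand-in for a results file; header names are assumed.
sample = io.StringIO(
    "prompt_id,model,latency\n"
    "p1,model-a,1.0\n"
    "p2,model-a,3.0\n"
    "p3,model-b,\n"
)
print(mean_by_model(sample, "latency"))
```

For a real run, pass an open file instead of the sample, for example `open(path, newline="")` on the CSV your run produced.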

See docs/ACCEPTANCE.md for the formal pass criteria.


🗺️ Roadmap

  • v0.1 — reorder, token count, SQLite logger, CLI
  • v0.2 — LLM-as-judge eval + A/B testing + dataset loaders
  • v0.2+bench — bench harness for Ollama / LM Studio / OpenRouter + acceptance criteria
  • 🔜 v0.3 — RAG curator (multi-signal retrieval + strict threshold)
  • 🔜 v1.0 — Access-aware context + audit trail (on-prem / enterprise)

📚 Documentation


🤝 Contributing

PRs welcome. See CONTRIBUTING.md for workflow, conventions, and release process.

Good first contributions:

  • New metric in contextops/judge.py (e.g. safety, format_compliance)
  • New provider in contextops_bench/clients.py (e.g. vllm, tgi)
  • Better pricing tables for non-USD regions
  • Translations of docs/ and README.md

📜 License

MIT.


✨ Credits

Built with:
