
State Pack

The CDN for AI inference costs.

Agents pay per token. State Pack makes that cost invisible, the same way BlackBerry made per-message SMS costs invisible. Not by changing the model. Not by changing the API. By caching state at the infrastructure layer.

Proven on the OpenAI API

Metric                              Naive      State Pack   Saving
Input tokens (20-step agent loop)   18,990     1,320        93%
Cost per loop (gpt-4o-mini)         $0.00361   $0.00091     74%
Cost per loop (gpt-4o)              ~$0.190    ~$0.048      74%
Latency per step                    -          44ms         -
Base cache hit (shared agents)      0.951s     0.003s       99%

Real numbers. Real API. Real dollars.

The Math at Scale

1,000 agents. 40-step loops. GPT-4o pricing.

Metric              Naive     State Pack   Saving
Cost per cycle      $144.40   $36.32       $108.08
At 100 cycles/day   $14,440   $3,632       $10,808
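
The table is straight multiplication over the per-cycle figures. A few lines of Python replay the arithmetic (nothing here is measured, it just reproduces the numbers above):

naive_per_cycle = 144.40   # 1,000 agents x 40-step loops, GPT-4o
pack_per_cycle  = 36.32

saving_per_cycle = naive_per_cycle - pack_per_cycle   # 108.08
cycles_per_day   = 100

print(f"daily naive:  ${naive_per_cycle * cycles_per_day:,.2f}")    # $14,440.00
print(f"daily pack:   ${pack_per_cycle * cycles_per_day:,.2f}")     # $3,632.00
print(f"daily saving: ${saving_per_cycle * cycles_per_day:,.2f}")   # $10,808.00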

If 1,000 agents share the same system prompt, the base KV cache is computed once and served to all. Agents 2-1000 pay 0 tokens for context setup.

How It Works

naive:       [system + history + delta] -> model   (cost grows every step)
state pack:  [delta only]               -> model   (cost stays flat)
  1. CREATE - run base prompt once, serialize KV cache to content-addressed blob
  2. INFER - load cached state, process delta tokens only, emit verifiable receipt
  3. MERGE - fold deltas back into base on threshold (keeps savings compounding)

Every artifact is SHA-256 addressed. Every operation emits a tamper-evident receipt. Same inputs always produce same outputs. Fully auditable.
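
A minimal sketch of that addressing scheme (illustrative only; the real implementation lives in store.py and src/main.rs, and these helper names are not the shipped API):

import hashlib, json, time

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# CREATE: the serialized KV cache is stored under its own hash,
# so identical base prompts always resolve to the same blob.
kv_blob = b"...serialized KV tensors..."   # stand-in for the real .pt blob
blob_id = sha256_hex(kv_blob)              # the blob's content address

# INFER: every operation emits a receipt binding inputs to outputs.
receipt = {
    "base_sha256":   blob_id,
    "delta_sha256":  sha256_hex(b"Step 1: clause affects indemnity."),
    "output_sha256": sha256_hex(b"<model output>"),
    "timestamp":     time.time(),
}
# Hashing the receipt itself makes any later tampering detectable.
receipt_id = sha256_hex(json.dumps(receipt, sort_keys=True).encode())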

Prove It Against Your Own API Key

git clone https://github.com/mauludsadiq/State-Pack.git
cd State-Pack
export OPENAI_API_KEY=sk-...
PYTHONPATH=. python3 examples/openai_benchmark.py

Session Server: 1,000 Agents, 1 Base

PYTHONPATH=. python3 -m state_pack.session_server --store my_store --model gpt2

# Agent 1 - computes base (0.951s)
curl -X POST http://localhost:8001/sessions \
  -H 'Content-Type: application/json' \
  -d '{"base_text": "You are a legal research agent..."}'

# Agent 2 - cache hit (0.003s)
curl -X POST http://localhost:8001/sessions \
  -H 'Content-Type: application/json' \
  -d '{"base_text": "You are a legal research agent..."}'

# Run a step
curl -X POST http://localhost:8001/sessions/{id}/step \
  -H 'Content-Type: application/json' \
  -d '{"delta_text": "Step 1: clause affects indemnity."}'

Python SDK

from state_pack.llm import StatePackLLM

llm = StatePackLLM.from_pretrained('gpt2', store='my_store', merge_every=10)
llm.set_base('You are a research agent...\n\n')

steps = ['Step 1: ...', 'Step 2: ...']   # your agent's per-step deltas
for delta in steps:
    output = llm(delta)   # only the delta tokens are processed

print(llm.stats())
# tokens_saved: 17785, savings_pct: 95.31, speedup: 3.958
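
Here merge_every=10 sets the MERGE threshold from the lifecycle above: every tenth delta is folded back into the base, keeping the cached prefix compact so the savings keep compounding.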

HTTP API

PYTHONPATH=. python3 -m state_pack.server --store my_store --model gpt2

curl -X POST http://localhost:8000/packets \
  -H 'Content-Type: application/json' \
  -d '{"base_text": "You are a research agent..."}'

curl -X POST http://localhost:8000/infer \
  -H 'Content-Type: application/json' \
  -d '{"base_sha256": "<sha>", "delta_text": "Step 1."}'

Architecture

state_pack/
  session_server.py  In-memory KV cache, base dedup, 1000-agent scale
  server.py          HTTP API (FastAPI, 43ms/step)
  llm.py             Drop-in LLM wrapper with automatic KV reuse
  store.py           In-process packet store (no subprocess)
  serialize.py       KV cache to .pt blob (float16, 50% smaller)
  client.py          High-level SDK
  agent_loop.py      Drop-in agent loop
  openai_integration.py  Benchmark against OpenAI API

src/main.rs          Rust CLI - content addressing, receipts, protocol

Model Support

Model                      Status
GPT-2                      Verified
Llama (tiny)               Verified
Any HuggingFace CausalLM   Works
OpenAI API                 Verified (93% token reduction)

Roadmap

  • Phase 1 - Python SDK (serialize, client, agent_loop)
  • Phase 2 - HTTP API (FastAPI, PacketStore, 43ms/step)
  • Phase 3 - float16 blobs (50% smaller), DynamicCache compat
  • Phase 4 - Session server (in-memory KV, base dedup, 99% cache hit)
  • OpenAI integration (93% token reduction, 74% cost reduction, live API)
  • GPU/multi-device KV portability
  • LangChain/LangGraph native integration
  • Rust HTTP server (protocol layer in Rust, Python inference sidecar)

License

MIT
