
Drop-in Tessera integration for LlamaIndex. One line of config routes your existing OpenAI / Anthropic / Mistral / Groq / Cohere LLM through Tessera's auto-route + auto-cache + auto-compress + auto-batch proxy. Free tier: 60M tokens/mo. Production: 20% of measured savings.


tessera-llamaindex

Drop-in cost optimization for LlamaIndex. One line of config routes your existing OpenAI / Anthropic / MistralAI / Groq / Cohere LLM through the Tessera optimization proxy, which auto-routes to cheaper-equivalent models, serves exact-match and provider prompt-cache hits, compresses prompts behind a per-stack quality canary, and applies batch arbitrage to async-tolerant calls. Free Dev tier: 60M tokens/month, no card. Production: 20% of measured savings, $0 if we save you nothing.

Companion to tessera-sdk (vanilla provider SDKs), tessera-langchain (LangChain integration), tessera-vercel-ai (Vercel AI SDK integration), tessera-mastra (Mastra Agent framework integration), tessera-pydantic-ai (Pydantic AI integration), tessera-crewai (CrewAI multi-agent integration), and tessera-autogen (AutoGen 0.4+ multi-agent integration). Same proxy, same mechanic stack, LlamaIndex-shaped API.

License: Apache-2.0


Install

pip install tessera-llamaindex
# Plus whichever LlamaIndex provider package you use:
pip install llama-index-llms-openai          # or llama-index-llms-anthropic / -mistralai / -groq / -cohere

Get a free Tessera API key (60M tokens/mo, no card) — tesseraai.io/dev.


Quickstart

from llama_index.llms.openai import OpenAI
from tessera_llamaindex import tessera_openai_config

llm = OpenAI(
    model="gpt-4o",
    api_key="sk-...",                              # your OpenAI key, unchanged
    **tessera_openai_config(api_key="tsr_..."),    # one line, routes through Tessera
)

# Existing LlamaIndex code (queries, RAG pipelines, agents, sub-question
# engines, multi-step reasoning) runs unchanged.
response = llm.complete("Summarize this document in 3 bullets.")

The same pattern applies to the other four providers. For example, Anthropic:

from llama_index.llms.anthropic import Anthropic
from tessera_llamaindex import tessera_anthropic_config

llm = Anthropic(
    model="claude-sonnet-4-5-20250929",
    api_key="sk-ant-...",
    **tessera_anthropic_config(api_key="tsr_..."),
)

Provider support — verified constructor signatures

Field names are runtime-verified against installed LlamaIndex 0.6+ provider packages:

| Provider | Tessera config function | LlamaIndex class | URL param | Headers approach |
|---|---|---|---|---|
| OpenAI | tessera_openai_config | llama_index.llms.openai.OpenAI | api_base | default_headers |
| Anthropic | tessera_anthropic_config | llama_index.llms.anthropic.Anthropic | base_url | default_headers |
| Mistral | tessera_mistral_config | llama_index.llms.mistralai.MistralAI | endpoint | additional_kwargs.http_headers |
| Groq | tessera_groq_config | llama_index.llms.groq.Groq | api_base | default_headers (via OpenAILike inheritance) |
| Cohere | tessera_cohere_config | llama_index.llms.cohere.Cohere | base_url | additional_kwargs.headers |

Generic dispatcher: tessera_config(provider, api_key=...) returns the right kwargs dict regardless of provider.
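To make the table concrete, here is a hypothetical sketch of the kwargs dict a dispatcher like tessera_config could return for each provider. It is not the package's source: the provider strings ("openai", "anthropic", "mistral", "groq", "cohere"), the per-provider path under api.tesseraai.io, and the X-Tessera-Key header name are all assumptions. Only the parameter names come from the table.

```python
# Hypothetical sketch of the kwargs shapes listed in the provider table.
# NOT the tessera_llamaindex source: provider strings, the proxy path layout,
# and the header name are illustrative assumptions.
from typing import Any, Dict

PROXY_BASE = "https://api.tesseraai.io"  # host from this README; paths are assumed
KEY_HEADER = "X-Tessera-Key"  # hypothetical header name


def sketch_tessera_config(provider: str, api_key: str) -> Dict[str, Any]:
    """Return constructor kwargs for one LlamaIndex LLM class (sketch)."""
    headers = {KEY_HEADER: api_key}
    url = f"{PROXY_BASE}/{provider}"  # hypothetical per-provider route
    if provider in ("openai", "groq"):
        # OpenAI, and Groq via OpenAILike: api_base + default_headers
        return {"api_base": url, "default_headers": headers}
    if provider == "anthropic":
        return {"base_url": url, "default_headers": headers}
    if provider == "mistral":
        # MistralAI has no top-level default_headers argument
        return {"endpoint": url, "additional_kwargs": {"http_headers": headers}}
    if provider == "cohere":
        return {"base_url": url, "additional_kwargs": {"headers": headers}}
    raise ValueError(f"unsupported provider: {provider!r}")


print(sorted(sketch_tessera_config("mistral", "tsr_test")))
# → ['additional_kwargs', 'endpoint']
```

In real code you would call tessera_config itself and unpack the result into the LlamaIndex constructor with **, exactly as in the Quickstart.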

Each constructor shape is locked into CI via tests/test_e2e.py: if a future LlamaIndex release changes the kwargs an __init__ accepts, the test fails before we ship.
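One way such a constructor-shape check can work is to compare the kwargs a config function returns against the parameters an __init__ declares by name. The sketch below assumes that approach (the actual tests/test_e2e.py may instead instantiate each class). missing_named_params and FakeLLM are hypothetical names, with FakeLLM standing in for a LlamaIndex LLM class.

```python
# Hypothetical sketch of a constructor-shape regression check; the real
# tests/test_e2e.py may work differently (e.g. by instantiating each class).
import inspect
from typing import Any, Dict, List


def missing_named_params(cls: type, kwargs: Dict[str, Any]) -> List[str]:
    """Return kwarg names that cls's constructor does not declare by name.

    Names absorbed only by a catch-all **kwargs count as missing, so a
    renamed or removed parameter is still flagged.
    """
    params = inspect.signature(cls).parameters  # __init__ signature minus self
    named = {
        name
        for name, p in params.items()
        if p.kind in (inspect.Parameter.POSITIONAL_OR_KEYWORD,
                      inspect.Parameter.KEYWORD_ONLY)
    }
    return sorted(k for k in kwargs if k not in named)


class FakeLLM:  # hypothetical stand-in for a LlamaIndex LLM class
    def __init__(self, model: str, api_base: str = "", **kwargs: Any) -> None:
        self.model = model
        self.api_base = api_base


print(missing_named_params(FakeLLM, {"api_base": "x", "default_headers": {}}))
# → ['default_headers']
```

An empty list means every kwarg maps to a named parameter; anything else is the kind of drift the CI check is meant to catch before release.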


What Tessera does on every request

Same mechanic stack as the main tessera-sdk. Each mechanic is opt-in per workload, observable per request, and is bypassed whenever its quality canary score drops below the per-stack floor of 0.95.
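The bypass rule can be sketched as a filter over the opted-in mechanics. This is a hypothetical illustration: the canary_scores input and the choice to bypass a mechanic that has no score yet are assumptions. Only the 0.95 floor and the m-numbered mechanic IDs come from this README.

```python
# Minimal sketch of the "bypass below the quality floor" rule. Mechanic IDs
# follow this README's table; where the scores come from is hypothetical.
from typing import Dict, Set

QUALITY_FLOOR = 0.95  # per-stack floor stated in this README


def active_mechanics(opted_in: Set[str], canary_scores: Dict[str, float]) -> Set[str]:
    """Keep an opted-in mechanic only if its latest canary score meets the floor.

    Assumption: a mechanic with no canary score yet is treated as failing.
    """
    return {m for m in opted_in if canary_scores.get(m, 0.0) >= QUALITY_FLOOR}


print(sorted(active_mechanics({"m1", "m2", "m3"}, {"m1": 0.97, "m3": 0.93})))
# → ['m1']
```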

| Mechanic | What it does | Typical savings |
|---|---|---|
| Auto-route (m1) | Routes to a cheaper-equivalent model, gated by a daily promptfoo canary on your eval set | 15–35% on routed calls |
| Auto-cache (m2) | SHA-256 cache keyed on the canonical request body, 7-day TTL | 5–40% depending on prompt repetition |
| Auto-compress (m3) | Per-role heuristic compression (system and user toggles are independent) | 5–15% on prompt tokens |
| Prompt cache (m6) | Injects provider-native cache headers (OpenAI: 50% off cache reads; Anthropic: 90% off cache reads) | 50–90% on cached prefixes |
| Context prune (m7) | Conservative trimming of long conversations and RAG attachments | 5–25% on multi-turn workloads |
| Output-length ceiling (m9) | A daily computation fits the p90 completion length per workload and caps output there | 5–15% on completion cost |
| Batch arbitrage (m10) | Routes async-tolerant calls to provider Batch APIs (50% off) | 50% on batch-eligible traffic |
| Per-provider circuit breaker | Reliability primitive: a rolling 5xx-rate state machine per upstream | n/a (keeps the savings stack honest) |
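The auto-cache row describes an exact-match cache keyed by a SHA-256 of the canonical request body with a 7-day TTL. The following is a minimal in-memory sketch that assumes "canonical" means JSON serialized with sorted keys and compact separators. Tessera's real canonicalization and storage are not documented here, and cache_key and ExactCache are hypothetical names.

```python
# Minimal sketch of an exact-match cache keyed on a canonicalized request
# body. Assumption: canonical form = JSON with sorted keys, no extra spaces.
import hashlib
import json
import time
from typing import Any, Callable, Dict, Optional, Tuple

TTL_SECONDS = 7 * 24 * 3600  # 7-day TTL from the mechanics table


def cache_key(body: Dict[str, Any]) -> str:
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ExactCache:
    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock  # injectable so expiry can be tested
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, body: Dict[str, Any]) -> Optional[Any]:
        key = cache_key(body)
        hit = self._store.get(key)
        if hit is None:
            return None
        stored_at, value = hit
        if self._clock() - stored_at > TTL_SECONDS:
            del self._store[key]  # expired entry: evict and report a miss
            return None
        return value

    def put(self, body: Dict[str, Any], value: Any) -> None:
        self._store[cache_key(body)] = (self._clock(), value)
```

Because keys are sorted before hashing, two bodies that differ only in key order hit the same entry, while any change to the model, messages, or parameters produces a different key.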

Pricing

  • Free Dev: 60M tokens/month, 30 requests/minute, all mechanics on, no card. Forever.
  • Production: for usage over 60M tokens/month or a higher rate limit. You pay 20% of measured savings only; zero savings means zero fee. Billing draws on a prepaid Stripe balance with a $100 minimum top-up.
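A worked example of the Production fee, using hypothetical dollar figures and assuming measured savings are the baseline cost minus the actual cost, floored at zero. This README does not define how the baseline is measured, so production_fee is an illustrative helper only.

```python
# Worked example of "20% of measured savings" with hypothetical figures.
# Assumption: savings = baseline cost - actual cost, never negative.
FEE_RATE = 0.20


def production_fee(baseline_cost: float, actual_cost: float) -> float:
    savings = max(baseline_cost - actual_cost, 0.0)  # zero savings, zero fee
    return round(FEE_RATE * savings, 2)


print(production_fee(1000.00, 700.00))   # $300 saved → 60.0
print(production_fee(1000.00, 1000.00))  # nothing saved → 0.0
```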

Existing customers of the other Tessera packages keep their rate_locked_pct (if any) on this package — same tsr_… key, same billing record.


FAQ

Q: How is this different from tessera-sdk, tessera-langchain, tessera-vercel-ai?

Same proxy, same mechanics, same billing. The four packages target different code surfaces:

  • tessera-sdk — patches provider SDK constructors directly (OpenAI, Anthropic, etc.) via one-line tessera.activate(key). Use when calling provider SDKs without a framework.
  • tessera-langchain — wires into LangChain ChatModel constructors.
  • tessera-vercel-ai — wires into Vercel AI SDK createX provider factories.
  • tessera-llamaindex (this package) — wires into LlamaIndex llama_index.llms.* LLM constructors.

Pick whichever fits your codebase. Side-by-side install is supported.

Q: Does this break my RAG pipeline / query engine / agent?

No. The LlamaIndex LLM object behaves identically — complete, chat, stream_complete, stream_chat all work unchanged. Index queries, retrievers, sub-question engines, OpenAI Agents, multi-step reasoning chains all use the LLM's standard complete/chat interface and route through Tessera transparently.

Q: What happens if Tessera's proxy is down?

Your application gets HTTP errors instead of LLM responses. On the proxy side, a per-provider circuit breaker tracks rolling 5xx rates and skips degraded providers in auto-route decisions. Cross-provider failover (re-routing to a different provider entirely when an upstream is down) is on the roadmap, not shipped yet.
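The per-provider circuit breaker can be sketched as a rolling window of response statuses per upstream. All parameters here (60 s window, 50% 5xx threshold, 10-request minimum, 30 s cooldown) are hypothetical, and after the cooldown this sketch simply closes again with an empty window rather than modeling a full half-open state. It illustrates the idea; it is not the proxy's implementation.

```python
# Minimal sketch of a rolling 5xx-rate circuit breaker for one upstream.
# Window, threshold, minimum sample size, and cooldown are hypothetical.
import time
from collections import deque
from typing import Callable, Deque, Optional, Tuple


class CircuitBreaker:
    def __init__(self, window_s: float = 60.0, max_5xx_rate: float = 0.5,
                 min_requests: int = 10, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window_s = window_s
        self.max_5xx_rate = max_5xx_rate
        self.min_requests = min_requests
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._events: Deque[Tuple[float, bool]] = deque()  # (timestamp, was_5xx)
        self._opened_at: Optional[float] = None  # None means closed

    def record(self, status: int) -> None:
        now = self._clock()
        self._events.append((now, 500 <= status <= 599))
        self._trim(now)
        failures = sum(1 for _, bad in self._events if bad)
        if (len(self._events) >= self.min_requests
                and failures / len(self._events) > self.max_5xx_rate):
            self._opened_at = now  # open: routing should skip this upstream

    def allow(self) -> bool:
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.cooldown_s:
            # Simplification: close again with a fresh window after cooldown.
            self._opened_at = None
            self._events.clear()
            return True
        return False

    def _trim(self, now: float) -> None:
        # Drop events that have aged out of the rolling window.
        while self._events and now - self._events[0][0] > self.window_s:
            self._events.popleft()
```

A router would keep one CircuitBreaker per provider, call record() with each upstream status code, and leave out any provider whose allow() returns False when making auto-route decisions.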

Q: What happens to my OpenAI / Anthropic rate limits?

They pass through. Tessera does not aggregate quotas across customers. Your provider rate limits apply normally; the proxy enforces only the Tessera tier limits (30 rpm Free Dev, 60 rpm Production by default — higher on request).
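Because the proxy enforces the Tessera tier limit on its side, you may want to pace calls client-side. A minimal sliding-window limiter for the Free Dev tier's 30 requests/minute could look like this. SlidingWindowLimiter is a hypothetical helper, not part of tessera-llamaindex, and how the proxy signals an over-limit request is not specified here.

```python
# Minimal client-side sliding-window limiter (hypothetical helper), sized for
# the Free Dev tier's 30 requests/minute.
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    def __init__(self, max_requests: int = 30, window_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_requests = max_requests
        self.window_s = window_s
        self._clock = clock
        self._sent: Deque[float] = deque()  # timestamps of admitted requests

    def try_acquire(self) -> bool:
        """Record a request and return True if it fits in the current window."""
        now = self._clock()
        while self._sent and now - self._sent[0] >= self.window_s:
            self._sent.popleft()  # forget requests older than the window
        if len(self._sent) < self.max_requests:
            self._sent.append(now)
            return True
        return False
```

Call try_acquire() before each llm.complete(...) or llm.chat(...) and back off (for example with time.sleep) when it returns False.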

Q: Are you storing my prompts and completions?

No. We log only token counts, cost deltas, mechanics_stack, and provider response status. Prompts and completions are never persisted. Full data-handling details are at tesseraai.io/security.

Q: Why does Mistral use additional_kwargs.http_headers instead of default_headers?

LlamaIndex's MistralAI wrapper doesn't expose a top-level default_headers argument — it forwards additional_kwargs.http_headers to the underlying mistralai SDK on each request. The Tessera Mistral config function returns the correct shape for this. You don't need to know this; the config function abstracts it. Same story for Cohere (additional_kwargs.headers).

Q: Can I use this with LlamaIndex's Settings.llm = ... global pattern?

Yes — just construct the LLM the same way and assign it:

from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from tessera_llamaindex import tessera_openai_config

Settings.llm = OpenAI(model="gpt-4o", api_key="sk-...", **tessera_openai_config(api_key="tsr_..."))

Architecture

Open-source SDK ↔ closed-source proxy. This package is a thin client that adds one HTTP hop. The actual mechanic decisions run inside the Tessera Cloudflare Worker proxy at api.tesseraai.io. The wire format is open; the mechanic implementations are closed.

License

Apache-2.0. See LICENSE.

Versioning

Semantic versioning. Wire-format compatibility is committed across minor releases; breaking changes ship only in major versions.

Security

Coordinated disclosure address: security@tesseraai.io.


Built by Tessera — Fintechagency OÜ, Tallinn, Estonia (registry 16638667).



Download files


Source Distribution

tessera_llamaindex-0.1.0.tar.gz (19.1 kB view details)

Uploaded Source

Built Distribution


tessera_llamaindex-0.1.0-py3-none-any.whl (12.1 kB view details)

Uploaded Python 3

File details

Details for the file tessera_llamaindex-0.1.0.tar.gz.

File metadata

  • Download URL: tessera_llamaindex-0.1.0.tar.gz
  • Upload date:
  • Size: 19.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for tessera_llamaindex-0.1.0.tar.gz

| Algorithm | Hash digest |
|---|---|
| SHA256 | b9d577632fcfb65a30e217fbf22ddefe562939b4f8e8879d5f98acc3078faa7b |
| MD5 | 8350a6b25e210ad0e63eec3ea404d37e |
| BLAKE2b-256 | c35114932ecc0b3f6d0de59f8b857629f7e50416a6c5e817bddc9e33a4aa3998 |
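To check a downloaded sdist against the SHA256 digest above, you can hash it in chunks with the standard library. The constant and helper names are illustrative. For installs, pip can enforce the same check natively in hash-checking mode (pip install --require-hashes -r requirements.txt, with a --hash=sha256:... option on the requirement line for each published file).

```python
# Verify a downloaded distribution file against a published SHA256 digest.
import hashlib
from pathlib import Path

EXPECTED_SDIST_SHA256 = "b9d577632fcfb65a30e217fbf22ddefe562939b4f8e8879d5f98acc3078faa7b"


def sha256_of(path: Path, chunk_size: int = 1 << 16) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Read in chunks so large files don't need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(path: Path, expected: str) -> bool:
    return sha256_of(path) == expected.lower()
```

For example, verify(Path("tessera_llamaindex-0.1.0.tar.gz"), EXPECTED_SDIST_SHA256) should return True for an untampered download.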


File details

Details for the file tessera_llamaindex-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for tessera_llamaindex-0.1.0-py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | dbcd2cef5211eaad1630107b113d8df1b7a03f3e5a66643cb0c5288870f5deab |
| MD5 | 5d9a7c0ebb721015116fe9cb410d4ebd |
| BLAKE2b-256 | cb51ccadb9c39fb272f304bbb7b65fb13265519eab61604fc30a6721162266c3 |

