Project description

edgequake-litellm

Drop-in LiteLLM replacement backed by Rust — same API, lower overhead.


edgequake-litellm wraps the edgequake-llm Rust core via PyO3, providing a high-performance drop-in for LiteLLM. Swap the import — the rest of your code stays unchanged.

# Before
import litellm

# After — same API, Rust-backed
import edgequake_litellm as litellm

Features

  • LiteLLM-compatible API — completion(), acompletion(), stream(), embedding(); same call signatures, same response shape (resp.choices[0].message.content).
  • Multi-provider routing — OpenAI, Anthropic, Gemini, Mistral, OpenRouter, xAI, Azure, AWS Bedrock, Ollama, LM Studio, HuggingFace, and more, via provider/model strings.
  • AWS Bedrock — native support for 12+ model families via the Converse API, including Amazon Nova, Anthropic Claude, Meta Llama, Mistral, and native embedding with Titan / Cohere.
  • Async-native — built on Tokio; sync and async Python both supported.
  • Single wheel per platform — uses PyO3's abi3-py39 stable ABI; one .whl covers Python 3.9–3.13+.
  • Zero Python runtime dependencies — the Rust extension is self-contained.
  • Full type annotations — ships with py.typed and .pyi stubs.
  • max_completion_tokens support — works for all OpenAI model families, including o1, o3-mini, o4-mini, gpt-4.1, and gpt-4.1-nano, which require this field.
  • Cache hit tokens — resp.cache_hit_tokens exposes OpenAI prompt cache hits and Anthropic cache reads.
  • Reasoning tokens — resp.thinking_tokens surfaces o-series reasoning and Claude extended thinking token counts.

What's New in 0.2.0

  • AWS Bedrock provider — bedrock/<model-id> routing for 12+ model families via the Converse API: Amazon Nova, Anthropic Claude, Meta Llama, Mistral, Google Gemma, NVIDIA Nemotron, Qwen, MiniMax, DeepSeek, Z.AI, OpenAI OSS, Cohere, Writer (see the sketch below this list).
  • Bedrock native embedding — Amazon Titan Embed Text v2/v1 and Cohere Embed v3/v4.
  • Inference profile auto-resolution — bare model IDs automatically resolve to cross-region inference profile IDs.
  • Backed by edgequake-llm v0.3.0.
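
A minimal sketch of the Bedrock routing above, assuming AWS credentials are already available to the default credential chain (this README does not list Bedrock-specific environment variables) and using the standard Bedrock identifier for Titan Text Embeddings v2, which is not spelled out here:

import edgequake_litellm as litellm

# Chat via the Converse API; a bare model ID is resolved to a
# cross-region inference profile automatically.
resp = litellm.completion(
    "bedrock/amazon.nova-lite-v1:0",
    [{"role": "user", "content": "One sentence about Rust."}],
    max_tokens=128,
)
print(resp.content)

# Native embedding with a Titan model.
result = litellm.embedding(
    "bedrock/amazon.titan-embed-text-v2:0",
    ["hello from Bedrock"],
)
print(len(result[0]))  # embedding dimensionality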

See CHANGELOG.md for the full history.

Installation

pip install edgequake-litellm

Quick Start

import edgequake_litellm as litellm   # drop-in import alias

# ── Synchronous chat ────────────────────────────────────────────────────────
resp = litellm.completion(
    "openai/gpt-4o-mini",
    [{"role": "user", "content": "Hello, world!"}],
)
# litellm-compatible access
print(resp.choices[0].message.content)
# convenience shortcut
print(resp.content)

# ── Asynchronous chat ───────────────────────────────────────────────────────
import asyncio

async def main():
    resp = await litellm.acompletion(
        "anthropic/claude-3-5-haiku-20241022",
        [{"role": "user", "content": "Tell me a joke."}],
        max_tokens=128,
        temperature=0.8,
    )
    print(resp.choices[0].message.content)

asyncio.run(main())

# ── Streaming (async generator) ─────────────────────────────────────────────
async def stream_example():
    messages = [{"role": "user", "content": "Count to five."}]
    async for chunk in litellm.acompletion("openai/gpt-4o", messages, stream=True):
        print(chunk.choices[0].delta.content or "", end="", flush=True)

# ── Embeddings ──────────────────────────────────────────────────────────────
result = litellm.embedding(
    "openai/text-embedding-3-small",
    ["Hello world", "Rust is fast"],
)
# litellm-compatible access
print(result.data[0].embedding[:3])
# legacy list access still works
print(len(result), len(result[0]))  # 2 1536

Provider Routing

Pass provider/model as the first argument — the prefix selects the provider:

Provider Example model string
OpenAI openai/gpt-4o
Anthropic anthropic/claude-3-5-sonnet-20241022
Google Gemini gemini/gemini-2.0-flash
Mistral mistral/mistral-large-latest
OpenRouter openrouter/meta-llama/llama-3.1-70b-instruct
xAI xai/grok-3-beta
Azure OpenAI azure/gpt-4o
AWS Bedrock bedrock/amazon.nova-lite-v1:0
Ollama ollama/llama3.2
LM Studio lmstudio/local-model
HuggingFace huggingface/mistralai/Mixtral-8x7B-Instruct-v0.1
Mock (tests) mock/any-name
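
For a quick end-to-end check of the routing, the mock provider needs no credentials; the reply text is whatever the mock returns, so the printed output here is illustrative:

import edgequake_litellm as litellm

# Same call shape as any real provider; "mock/" requires no API key,
# which is how the unit tests run without credentials.
resp = litellm.completion(
    "mock/any-name",
    [{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)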

API Reference

completion(model, messages, **kwargs) → ModelResponseCompat

Synchronous chat completion. Blocks but releases the GIL during Rust I/O so other Python threads keep running.

resp = litellm.completion(
    "openai/gpt-4o",
    messages,
    max_tokens=256,
    temperature=0.7,
    system="You are a helpful assistant.",
    max_completion_tokens=256,  # alias for max_tokens; required for o1/o3/gpt-4.1 models
    seed=42,
    response_format={"type": "json_object"},  # "text" or "json_object"
)

# All of these access the same content:
resp.choices[0].message.content   # litellm path
resp.content                       # shortcut
resp["choices"][0]["message"]["content"]  # dict-style

resp.usage.total_tokens
resp.model
resp.response_ms                  # latency in milliseconds
resp.to_dict()                    # plain dict

# New in 0.1.1 — cache and reasoning token metadata
resp.cache_hit_tokens             # int | None — tokens served from provider cache
resp.thinking_tokens              # int | None — reasoning tokens (o-series, Claude)
resp.thinking_content             # str | None — visible thinking text (Claude)

# The same data via usage object:
resp.usage.cache_read_input_tokens  # same as resp.cache_hit_tokens
resp.usage.reasoning_tokens         # same as resp.thinking_tokens

acompletion(model, messages, stream=False, **kwargs)

Async chat completion. Returns ModelResponseCompat or (if stream=True) AsyncGenerator[StreamChunkCompat, None].

# Non-streaming
resp = await litellm.acompletion("openai/gpt-4o", messages)

# Streaming
async for chunk in await litellm.acompletion("openai/gpt-4o", messages, stream=True):
    print(chunk.choices[0].delta.content or "", end="")

stream(model, messages, **kwargs) → AsyncGenerator[StreamChunk, None]

Low-level streaming. Raw StreamChunk objects:

async for chunk in litellm.stream("openai/gpt-4o", messages):
    if chunk.content:
        print(chunk.content, end="")
    elif chunk.is_finished:
        print(f"\n[stop: {chunk.finish_reason}]")

embedding(model, input, **kwargs) → EmbeddingResponseCompat

Synchronous embeddings. Returns an EmbeddingResponseCompat that supports both litellm-style and legacy list-style access:

result = litellm.embedding("openai/text-embedding-3-small", ["foo", "bar"])

# litellm path
result.data[0].embedding

# backwards-compatible list access
for vec in result:          # iterates List[float]
    print(len(vec))
result[0]                   # List[float]
len(result)                 # number of vectors

aembedding(model, input, **kwargs) → EmbeddingResponseCompat

Async embeddings — same return type as embedding().
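
A minimal async sketch mirroring the synchronous embedding example from the Quick Start:

import asyncio
import edgequake_litellm as litellm

async def main():
    result = await litellm.aembedding(
        "openai/text-embedding-3-small",
        ["Hello world", "Rust is fast"],
    )
    # Same access patterns as embedding(): litellm-style or list-style.
    print(result.data[0].embedding[:3])
    print(len(result), len(result[0]))

asyncio.run(main())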

stream_chunk_builder(chunks, messages=None) → ModelResponseCompat

Reconstruct a full ModelResponseCompat from a collected list of streaming chunks:

from edgequake_litellm import stream_chunk_builder

chunks = []
async for chunk in litellm.stream("openai/gpt-4o", messages):
    chunks.append(chunk)

full = stream_chunk_builder(chunks, messages=messages)
print(full.content)

Configuration

Module-level globals mirror litellm:

import edgequake_litellm as litellm

litellm.set_verbose = True      # enable debug logging
litellm.drop_params = True      # drop unknown params (always True)

# Set default provider / model
litellm.set_default_provider("anthropic")
litellm.set_default_model("claude-3-5-haiku-20241022")

# Now the provider prefix can be omitted:
resp = litellm.completion("claude-3-5-haiku-20241022", messages)

Exception Hierarchy

Exceptions mirror LiteLLM for painless migration:

import time

import edgequake_litellm as litellm

try:
    resp = litellm.completion("openai/gpt-4o", messages)
except litellm.AuthenticationError as e:
    print(f"Check your API key: {e}")
except litellm.RateLimitError:
    time.sleep(5)
except litellm.ContextWindowExceededError:
    # trim messages and retry
    pass
except litellm.NotFoundError:      # alias for ModelNotFoundError
    pass
except litellm.APIConnectionError:
    pass

All exceptions (AuthenticationError, RateLimitError, ContextWindowExceededError, ModelNotFoundError, Timeout, APIConnectionError, APIError) are also available from edgequake_litellm.exceptions.
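
As one possible pattern on top of these exceptions, a small hypothetical complete_with_retry() helper; the 3-attempt limit and fixed 5-second back-off are illustrative choices, not library behaviour:

import time

import edgequake_litellm as litellm
from edgequake_litellm.exceptions import RateLimitError

def complete_with_retry(model, messages, attempts=3):
    # Retry only on rate limits; any other error propagates unchanged.
    for attempt in range(attempts):
        try:
            return litellm.completion(model, messages)
        except RateLimitError:
            if attempt == attempts - 1:
                raise
            time.sleep(5)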

Environment Variables

Provider credentials follow the standard naming convention:

Provider Environment variable
OpenAI OPENAI_API_KEY
Anthropic ANTHROPIC_API_KEY
Gemini GEMINI_API_KEY
Mistral MISTRAL_API_KEY
OpenRouter OPENROUTER_API_KEY
xAI XAI_API_KEY
HuggingFace HF_TOKEN
Ollama OLLAMA_HOST (default: http://localhost:11434)
LM Studio LMSTUDIO_HOST (default: http://localhost:1234)

Defaults can also be set via LITELLM_EDGE_PROVIDER / LITELLM_EDGE_MODEL.
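
Credentials and defaults can also be set from Python before the first call; a sketch assuming the variables are read when the request is made and that the env-based defaults behave like set_default_provider() / set_default_model():

import os
import edgequake_litellm as litellm

os.environ["OPENAI_API_KEY"] = "sk-..."             # provider credential
os.environ["LITELLM_EDGE_PROVIDER"] = "openai"      # default provider
os.environ["LITELLM_EDGE_MODEL"] = "gpt-4o-mini"    # default model

# With defaults set, the provider prefix can be omitted.
resp = litellm.completion("gpt-4o-mini", [{"role": "user", "content": "ping"}])
print(resp.content)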

Development

Prerequisites

  • Rust ≥ 1.83 (rustup toolchain install stable)
  • Python ≥ 3.9
  • pip install maturin

Build from source

git clone https://github.com/raphaelmansuy/edgequake-llm.git
cd edgequake-llm/edgequake-litellm

# Create a virtual environment
python -m venv .venv
source .venv/bin/activate      # Windows: .venv\Scripts\activate

pip install maturin pytest pytest-asyncio ruff mypy

# Build & install in dev mode (incremental Rust + Python)
maturin develop --release

# Run unit tests (mock provider — no API keys needed)
pytest tests/ -k "not e2e" -v

Running E2E tests

export OPENAI_API_KEY=sk-...
pytest tests/test_e2e_openai.py -v

Publishing

# Bump version in pyproject.toml AND Cargo.toml (must match), then:
git tag py-v0.2.0
git push --tags
# GitHub Actions builds and publishes to PyPI automatically.

License

Apache-2.0 — see LICENSE-APACHE.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

edgequake_litellm-0.3.0.tar.gz (813.1 kB)

Uploaded: Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

edgequake_litellm-0.3.0-cp39-abi3-win_amd64.whl (7.3 MB)

Uploaded: CPython 3.9+, Windows x86-64

edgequake_litellm-0.3.0-cp39-abi3-musllinux_1_2_x86_64.whl (9.7 MB)

Uploaded: CPython 3.9+, musllinux (musl 1.2+), x86-64

edgequake_litellm-0.3.0-cp39-abi3-musllinux_1_2_aarch64.whl (9.4 MB)

Uploaded: CPython 3.9+, musllinux (musl 1.2+), ARM64

edgequake_litellm-0.3.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.3 MB)

Uploaded: CPython 3.9+, manylinux (glibc 2.17+), x86-64

edgequake_litellm-0.3.0-cp39-abi3-macosx_11_0_arm64.whl (8.4 MB)

Uploaded: CPython 3.9+, macOS 11.0+, ARM64

edgequake_litellm-0.3.0-cp39-abi3-macosx_10_12_x86_64.whl (8.7 MB)

Uploaded: CPython 3.9+, macOS 10.12+, x86-64

File details

Details for the file edgequake_litellm-0.3.0.tar.gz.

File metadata

  • Download URL: edgequake_litellm-0.3.0.tar.gz
  • Upload date:
  • Size: 813.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for edgequake_litellm-0.3.0.tar.gz
Algorithm Hash digest
SHA256 b1e875daab2de99d7c0dd7e7f2984b45e7291c109762bac71f65a3ce8292c634
MD5 1f6c39289c5448f8e4a2d90329797416
BLAKE2b-256 8573f8f3e3c2ec38009ca6953c958c88e1de79027e340aa64b4c9a473c8d42f0


File details

Details for the file edgequake_litellm-0.3.0-cp39-abi3-win_amd64.whl.

File hashes

Hashes for edgequake_litellm-0.3.0-cp39-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 dbca6ba1e0a38e152689cb5d6fed0acc410391c09316c4067ffac0b001940c4b
MD5 ca78e9487369fd9fc094a2e939cb13f3
BLAKE2b-256 ad6e364ccb93f0afc6618d2b7f2b478dfdbbea776f665bf43071bfe5bb33c783


File details

Details for the file edgequake_litellm-0.3.0-cp39-abi3-musllinux_1_2_x86_64.whl.

File hashes

Hashes for edgequake_litellm-0.3.0-cp39-abi3-musllinux_1_2_x86_64.whl
Algorithm Hash digest
SHA256 cea44b06772233bf21de8210c88fba356fb5507db790261a5fdf9006fc960912
MD5 48102d15d71cd089c995877e59ef80cb
BLAKE2b-256 12242d46ca1e0c1f9d7d726c663897575121108fe06224ae640db55883e03f6a


File details

Details for the file edgequake_litellm-0.3.0-cp39-abi3-musllinux_1_2_aarch64.whl.

File hashes

Hashes for edgequake_litellm-0.3.0-cp39-abi3-musllinux_1_2_aarch64.whl
Algorithm Hash digest
SHA256 186eca29b663b7238de212ebe2f55ba0f5c48558f340b73d4112b80fab1e6370
MD5 1146181e6c769296fac9748802c99c58
BLAKE2b-256 5594c56043a39a796601c44013b1cbd9a043ccd1a0fc6dc1f43efe79a9717a3e


File details

Details for the file edgequake_litellm-0.3.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for edgequake_litellm-0.3.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 f23c3e3b6b037fbf2c7d208278cb0432da4a76741c55379549ae217221b99dd6
MD5 33b1b5fd20254ae51fef6fe86c042f6d
BLAKE2b-256 899a686d9422231785158bba732e4d7967373866d6266feb16daacfa033afdbe


File details

Details for the file edgequake_litellm-0.3.0-cp39-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for edgequake_litellm-0.3.0-cp39-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 a5ca64c15e74b8e645277aa0fae1dc2a02fe65e296023fb2c96e37c20933f056
MD5 bf677a5e7cdce4cb6e5637ca62c13a7c
BLAKE2b-256 caec7220bb4674c7b026548059f38743bb39a383291adf28274719a4c288fa1d


File details

Details for the file edgequake_litellm-0.3.0-cp39-abi3-macosx_10_12_x86_64.whl.

File hashes

Hashes for edgequake_litellm-0.3.0-cp39-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 e0c482759cdd155b5dd6f471efe73b73091430fe1b5ce874e84d7f71b16fa130
MD5 a731763701c3185bcf3dccc62c6ff3ed
BLAKE2b-256 2fd859d263dcc4c2dd65581f01a82bd25cf549c21dbcb5e00c1b051e8feaffbc

