
Project description

edgequake-litellm

Drop-in LiteLLM replacement backed by Rust — same API, lower overhead.


edgequake-litellm wraps the edgequake-llm Rust core via PyO3, providing a high-performance drop-in for LiteLLM. Swap the import — the rest of your code stays unchanged.

# Before
import litellm

# After — same API, Rust-backed
import edgequake_litellm as litellm

Features

  • LiteLLM-compatible API — completion(), acompletion(), stream(), and embedding() with the same call signatures and the same response shape (resp.choices[0].message.content).
  • Multi-provider routing — OpenAI, Anthropic, Gemini, Mistral, OpenRouter, xAI, Azure, AWS Bedrock, Ollama, LM Studio, HuggingFace, and more, via provider/model strings.
  • AWS Bedrock — native support for 12+ model families via the Converse API, including Amazon Nova, Anthropic Claude, Meta Llama, Mistral, and native embedding with Titan / Cohere.
  • Async-native — built on Tokio; sync and async Python both supported.
  • Single wheel per platform — uses PyO3's abi3-py39 stable ABI, so one .whl covers Python 3.9–3.13+.
  • Zero Python runtime dependencies — the Rust extension is self-contained.
  • Full type annotations — ships with py.typed and .pyi stubs.
  • max_completion_tokens support — works with all OpenAI model families, including o1, o3-mini, o4-mini, gpt-4.1, and gpt-4.1-nano, which require this field.
  • Cache hit tokens — resp.cache_hit_tokens exposes OpenAI prompt cache hits and Anthropic cache reads.
  • Reasoning tokens — resp.thinking_tokens surfaces o-series reasoning and Claude extended-thinking token counts.

What's New in 0.2.0

  • AWS Bedrock provider — bedrock/<model-id> routing for 12+ model families via the Converse API: Amazon Nova, Anthropic Claude, Meta Llama, Mistral, Google Gemma, NVIDIA Nemotron, Qwen, MiniMax, DeepSeek, Z.AI, OpenAI OSS, Cohere, Writer (see the sketch after this list).
  • Bedrock native embedding — Amazon Titan Embed Text v2/v1 and Cohere Embed v3/v4.
  • Inference profile auto-resolution — bare model IDs automatically resolve to cross-region inference profile IDs.
  • Backed by edgequake-llm v0.3.0.
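
A minimal Bedrock sketch. It assumes AWS credentials are already configured in the environment (standard AWS_* variables or an attached role); the Nova model string comes from the Provider Routing table below, while the Titan embedding ID (amazon.titan-embed-text-v2:0) is an assumed example of a Bedrock-native embedding model:

import edgequake_litellm as litellm

# Chat via the Converse API using a bare model ID; per the auto-resolution
# feature it may resolve to a cross-region inference profile behind the scenes.
resp = litellm.completion(
    "bedrock/amazon.nova-lite-v1:0",
    [{"role": "user", "content": "One sentence on Rust."}],
    max_tokens=64,
)
print(resp.content)

# Bedrock-native embeddings (model ID is illustrative).
vecs = litellm.embedding("bedrock/amazon.titan-embed-text-v2:0", ["hello bedrock"])
print(len(vecs.data[0].embedding))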

See CHANGELOG.md for the full history.

Installation

pip install edgequake-litellm

Quick Start

import edgequake_litellm as litellm   # drop-in import alias

# ── Synchronous chat ────────────────────────────────────────────────────────
resp = litellm.completion(
    "openai/gpt-4o-mini",
    [{"role": "user", "content": "Hello, world!"}],
)
# litellm-compatible access
print(resp.choices[0].message.content)
# convenience shortcut
print(resp.content)

# ── Asynchronous chat ───────────────────────────────────────────────────────
import asyncio

async def main():
    resp = await litellm.acompletion(
        "anthropic/claude-3-5-haiku-20241022",
        [{"role": "user", "content": "Tell me a joke."}],
        max_tokens=128,
        temperature=0.8,
    )
    print(resp.choices[0].message.content)

asyncio.run(main())

# ── Streaming (async generator) ─────────────────────────────────────────────
async def stream_example():
    messages = [{"role": "user", "content": "Count to five."}]
    async for chunk in litellm.acompletion("openai/gpt-4o", messages, stream=True):
        print(chunk.choices[0].delta.content or "", end="", flush=True)

# ── Embeddings ──────────────────────────────────────────────────────────────
result = litellm.embedding(
    "openai/text-embedding-3-small",
    ["Hello world", "Rust is fast"],
)
# litellm-compatible access
print(result.data[0].embedding[:3])
# legacy list access still works
print(len(result), len(result[0]))  # 2 1536

Provider Routing

Pass provider/model as the first argument — the prefix selects the provider (a quick provider-swap sketch follows the table):

Provider          Example model string
OpenAI            openai/gpt-4o
Anthropic         anthropic/claude-3-5-sonnet-20241022
Google Gemini     gemini/gemini-2.0-flash
Mistral           mistral/mistral-large-latest
OpenRouter        openrouter/meta-llama/llama-3.1-70b-instruct
xAI               xai/grok-3-beta
Azure OpenAI      azure/gpt-4o
AWS Bedrock       bedrock/amazon.nova-lite-v1:0
Ollama            ollama/llama3.2
LM Studio         lmstudio/local-model
HuggingFace       huggingface/mistralai/Mixtral-8x7B-Instruct-v0.1
Mock (tests)      mock/any-name
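
Switching providers is only a change of model string; a small sketch using strings from the table above (it assumes the matching API keys from the Environment Variables section are set and that a local Ollama server is running):

import edgequake_litellm as litellm

messages = [{"role": "user", "content": "Name one prime number."}]

# Same call, different backends; only the provider/model string changes.
for model in ("openai/gpt-4o-mini", "anthropic/claude-3-5-haiku-20241022", "ollama/llama3.2"):
    resp = litellm.completion(model, messages, max_tokens=32)
    print(model, "->", resp.content)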

API Reference

completion(model, messages, **kwargs) → ModelResponseCompat

Synchronous chat completion. Blocks but releases the GIL during Rust I/O so other Python threads keep running.

resp = litellm.completion(
    "openai/gpt-4o",
    messages,
    max_tokens=256,
    temperature=0.7,
    system="You are a helpful assistant.",
    max_completion_tokens=256,  # alias for max_tokens; required for o1/o3/gpt-4.1 models
    seed=42,
    response_format={"type": "json_object"},  # type: "text" or "json_object"
)

# All of these access the same content:
resp.choices[0].message.content   # litellm path
resp.content                       # shortcut
resp["choices"][0]["message"]["content"]  # dict-style

resp.usage.total_tokens
resp.model
resp.response_ms                  # latency in milliseconds
resp.to_dict()                    # plain dict

# New in 0.1.1 — cache and reasoning token metadata
resp.cache_hit_tokens             # int | None — tokens served from provider cache
resp.thinking_tokens              # int | None — reasoning tokens (o-series, Claude)
resp.thinking_content             # str | None — visible thinking text (Claude)

# The same data via usage object:
resp.usage.cache_read_input_tokens  # same as resp.cache_hit_tokens
resp.usage.reasoning_tokens         # same as resp.thinking_tokens
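
Because the GIL is released during the Rust I/O, blocking completion() calls can overlap when issued from a thread pool; a minimal sketch using the mock provider so no API keys are needed:

from concurrent.futures import ThreadPoolExecutor

import edgequake_litellm as litellm

messages = [{"role": "user", "content": "ping"}]

# Four blocking calls run concurrently because each releases the GIL during I/O.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(litellm.completion, "mock/any-name", messages) for _ in range(4)]
    for fut in futures:
        print(fut.result().content)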

acompletion(model, messages, stream=False, **kwargs)

Async chat completion. Returns ModelResponseCompat or (if stream=True) AsyncGenerator[StreamChunkCompat, None].

# Non-streaming
resp = await litellm.acompletion("openai/gpt-4o", messages)

# Streaming
async for chunk in await litellm.acompletion("openai/gpt-4o", messages, stream=True):
    print(chunk.choices[0].delta.content or "", end="")

stream(model, messages, **kwargs) → AsyncGenerator[StreamChunk, None]

Low-level streaming. Raw StreamChunk objects:

async for chunk in litellm.stream("openai/gpt-4o", messages):
    if chunk.content:
        print(chunk.content, end="")
    elif chunk.is_finished:
        print(f"\n[stop: {chunk.finish_reason}]")

embedding(model, input, **kwargs) → EmbeddingResponseCompat

Synchronous embeddings. Returns an EmbeddingResponseCompat that supports both litellm-style and legacy list-style access:

result = litellm.embedding("openai/text-embedding-3-small", ["foo", "bar"])

# litellm path
result.data[0].embedding

# backwards-compatible list access
for vec in result:          # iterates List[float]
    print(len(vec))
result[0]                   # List[float]
len(result)                 # number of vectors

aembedding(model, input, **kwargs) → EmbeddingResponseCompat

Async embeddings — same return type as embedding().
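
A minimal async sketch, using the same access patterns as the sync embedding() example above:

import asyncio

import edgequake_litellm as litellm

async def main():
    result = await litellm.aembedding("openai/text-embedding-3-small", ["foo", "bar"])
    print(len(result), len(result.data[0].embedding))  # 2 1536

asyncio.run(main())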

stream_chunk_builder(chunks, messages=None) → ModelResponseCompat

Reconstruct a full ModelResponseCompat from a collected list of streaming chunks:

from edgequake_litellm import stream_chunk_builder

chunks = []
async for chunk in litellm.stream("openai/gpt-4o", messages):
    chunks.append(chunk)

full = stream_chunk_builder(chunks, messages=messages)
print(full.content)

Configuration

Module-level globals mirror litellm:

import edgequake_litellm as litellm

litellm.set_verbose = True      # enable debug logging
litellm.drop_params = True      # drop unknown params (always True)

# Set default provider / model
litellm.set_default_provider("anthropic")
litellm.set_default_model("claude-3-5-haiku-20241022")

# Now the provider prefix can be omitted:
resp = litellm.completion("claude-3-5-haiku-20241022", messages)

Exception Hierarchy

Exceptions mirror LiteLLM for painless migration:

import time

import edgequake_litellm as litellm

try:
    resp = litellm.completion("openai/gpt-4o", messages)
except litellm.AuthenticationError as e:
    print(f"Check your API key: {e}")
except litellm.RateLimitError:
    time.sleep(5)
except litellm.ContextWindowExceededError:
    # trim messages and retry
    pass
except litellm.NotFoundError:      # alias for ModelNotFoundError
    pass
except litellm.APIConnectionError:
    pass

All exceptions (AuthenticationError, RateLimitError, ContextWindowExceededError, ModelNotFoundError, Timeout, APIConnectionError, APIError) are also available from edgequake_litellm.exceptions.
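
The same classes can also be imported directly from that submodule; a small retry sketch, assuming a fixed back-off is acceptable for your workload:

import time

import edgequake_litellm as litellm
from edgequake_litellm.exceptions import RateLimitError

def complete_with_retry(model, messages, attempts=3):
    # Retry only on rate limits; every other exception propagates to the caller.
    for i in range(attempts):
        try:
            return litellm.completion(model, messages)
        except RateLimitError:
            if i == attempts - 1:
                raise
            time.sleep(5)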

Environment Variables

Provider credentials follow the standard naming convention:

Provider      Environment variable
OpenAI        OPENAI_API_KEY
Anthropic     ANTHROPIC_API_KEY
Gemini        GEMINI_API_KEY
Mistral       MISTRAL_API_KEY
OpenRouter    OPENROUTER_API_KEY
xAI           XAI_API_KEY
HuggingFace   HF_TOKEN
Ollama        OLLAMA_HOST (default: http://localhost:11434)
LM Studio     LMSTUDIO_HOST (default: http://localhost:1234)

Defaults can also be set via LITELLM_EDGE_PROVIDER / LITELLM_EDGE_MODEL.
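
A hedged sketch of driving those defaults from the environment. Whether the variables are read at import time or per call isn't stated here, so setting them before the import is the safe order; the behaviour is assumed to match set_default_provider / set_default_model from the Configuration section, and the values are illustrative placeholders:

import os

os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."  # placeholder; use a real key
os.environ["LITELLM_EDGE_PROVIDER"] = "anthropic"
os.environ["LITELLM_EDGE_MODEL"] = "claude-3-5-haiku-20241022"

import edgequake_litellm as litellm

# With the defaults above, the provider prefix can be omitted.
resp = litellm.completion("claude-3-5-haiku-20241022", [{"role": "user", "content": "hi"}])
print(resp.content)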

Development

Prerequisites

  • Rust ≥ 1.83 (rustup toolchain install stable)
  • Python ≥ 3.9
  • pip install maturin

Build from source

git clone https://github.com/raphaelmansuy/edgequake-llm.git
cd edgequake-llm/edgequake-litellm

# Create a virtual environment
python -m venv .venv
source .venv/bin/activate      # Windows: .venv\Scripts\activate

pip install maturin pytest pytest-asyncio ruff mypy

# Build & install in dev mode (incremental Rust + Python)
maturin develop --release

# Run unit tests (mock provider — no API keys needed)
pytest tests/ -k "not e2e" -v

Running E2E tests

export OPENAI_API_KEY=sk-...
pytest tests/test_e2e_openai.py -v

Publishing

# Bump version in pyproject.toml AND Cargo.toml (must match), then:
git tag py-v0.2.0
git push --tags
# GitHub Actions builds and publishes to PyPI automatically.

License

Apache-2.0 — see LICENSE-APACHE.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

edgequake_litellm-0.2.0.tar.gz (789.1 kB)

Uploaded: Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

edgequake_litellm-0.2.0-cp39-abi3-win_amd64.whl (7.4 MB)

Uploaded: CPython 3.9+, Windows x86-64

edgequake_litellm-0.2.0-cp39-abi3-musllinux_1_2_x86_64.whl (9.7 MB)

Uploaded: CPython 3.9+, musllinux: musl 1.2+, x86-64

edgequake_litellm-0.2.0-cp39-abi3-musllinux_1_2_aarch64.whl (9.5 MB)

Uploaded: CPython 3.9+, musllinux: musl 1.2+, ARM64

edgequake_litellm-0.2.0-cp39-abi3-macosx_11_0_arm64.whl (8.5 MB)

Uploaded: CPython 3.9+, macOS 11.0+, ARM64

edgequake_litellm-0.2.0-cp39-abi3-macosx_10_12_x86_64.whl (8.7 MB)

Uploaded: CPython 3.9+, macOS 10.12+, x86-64

edgequake_litellm-0.2.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.3 MB)

Uploaded: CPython 3.8, manylinux: glibc 2.17+, x86-64

File details

Details for the file edgequake_litellm-0.2.0.tar.gz.

File metadata

  • Download URL: edgequake_litellm-0.2.0.tar.gz
  • Upload date:
  • Size: 789.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for edgequake_litellm-0.2.0.tar.gz
Algorithm Hash digest
SHA256 3fd2d8eb52e8b72e0f80e698e5de00b9387b90a09f573cf30bb6d04d99e24886
MD5 0df8b7dd9f25dcd8e5e1512405f54810
BLAKE2b-256 4a54d24d94cc898d6513901c7134d6d90cd4af127c13f8010d40ed0baa537823

See more details on using hashes here.

File details

Details for the file edgequake_litellm-0.2.0-cp39-abi3-win_amd64.whl.

File metadata

File hashes

Hashes for edgequake_litellm-0.2.0-cp39-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 c3cbe978930473e9f5dd1d2080df9314220a99a5b854acf9dbd9e3b039749350
MD5 e7aec6f7fd6929570d94bc441bb10d6d
BLAKE2b-256 2d0ae7db7af7c4baf7a2f67437d60ee38ce8b8ea671ce3a1a558391b6a7a3d5b

See more details on using hashes here.

File details

Details for the file edgequake_litellm-0.2.0-cp39-abi3-musllinux_1_2_x86_64.whl.

File metadata

File hashes

Hashes for edgequake_litellm-0.2.0-cp39-abi3-musllinux_1_2_x86_64.whl
Algorithm Hash digest
SHA256 40e131c09d42056ab62d8bdb7b83c17b202267e96de013f118076e7059dffc78
MD5 946544272cdb1e5b5fda6ae8a2ab6307
BLAKE2b-256 06e87423eba75890af8a84ff6efb8eed3291c36303fde1a61229f80391f6c6ae

See more details on using hashes here.

File details

Details for the file edgequake_litellm-0.2.0-cp39-abi3-musllinux_1_2_aarch64.whl.

File metadata

File hashes

Hashes for edgequake_litellm-0.2.0-cp39-abi3-musllinux_1_2_aarch64.whl
Algorithm Hash digest
SHA256 450a4d2e1edb65a7e9d59e11d3a9bc588465ee496768705cd6dd1acb41ad927d
MD5 db1fe3386bd27edca77aea4ddcec321f
BLAKE2b-256 2c68ec8dd179d89bc72dc2b0074613895d1041322d4a2ccd174427257a30e837

See more details on using hashes here.

File details

Details for the file edgequake_litellm-0.2.0-cp39-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for edgequake_litellm-0.2.0-cp39-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 36dd7d6ce268cf76aba2817378f7edab732596fab71c80089f70cc154e558b5f
MD5 dd6ee7ada65d090d2ce42e012fdc5c45
BLAKE2b-256 ee69095e84709a04a229389eeaf94f551a08cdf6278421df2d0cd63fa0d20017

See more details on using hashes here.

File details

Details for the file edgequake_litellm-0.2.0-cp39-abi3-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for edgequake_litellm-0.2.0-cp39-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 614dc21ba71ab5155e4174b8e6f29886a4ef0e199cf525dfac2f31a9ab369f75
MD5 caa94c5adad1ea34ed06b0683e2f447b
BLAKE2b-256 755cf9c59479ed148fb0d57e4178d6589e32e3f01144e8fa7b923e2d22de16e7

See more details on using hashes here.

File details

Details for the file edgequake_litellm-0.2.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for edgequake_litellm-0.2.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 81bc19f05cbe3be47cb94a691e60b8904f8eb9d1509f4f3c885f1c7fe0752516
MD5 7d32882a2f86f659c93427347b01fe4d
BLAKE2b-256 1c16c0694fbfe5c381f0316a127e0252b02a983534d457714071d79995945f10

See more details on using hashes here.
