
edgequake-litellm

Drop-in LiteLLM replacement backed by Rust — same API, lower overhead.


edgequake-litellm wraps the edgequake-llm Rust core via PyO3, providing a high-performance drop-in replacement for LiteLLM. Swap the import — the rest of your code stays unchanged.

# Before
import litellm

# After — same API, Rust-backed
import edgequake_litellm as litellm

Features

  • LiteLLM-compatible API: completion(), acompletion(), stream(), embedding(), with the same call signatures and the same response shape (resp.choices[0].message.content).
  • Multi-provider routing — OpenAI, Anthropic, Gemini, Mistral, OpenRouter, xAI, Ollama, LM Studio, HuggingFace, and more, via provider/model strings.
  • Async-native — built on Tokio; sync and async Python both supported.
  • Single wheel per platform — uses PyO3's abi3-py39 stable ABI, so one .whl covers Python 3.9–3.13+.
  • Zero Python runtime dependencies — the Rust extension is self-contained.
  • Full type annotations — ships with py.typed and .pyi stubs.
  • max_completion_tokens support — works for all OpenAI model families, including o1, o3-mini, o4-mini, gpt-4.1, and gpt-4.1-nano, which require this field.
  • Cache hit tokens: resp.cache_hit_tokens exposes OpenAI prompt cache hits and Anthropic cache reads.
  • Reasoning tokens: resp.thinking_tokens surfaces o-series reasoning and Claude extended-thinking token counts.

What's New in 0.1.1

  • max_completion_tokens fixed for OpenAI o-series and gpt-4.1 model families (previously returned 400 Bad Request).
  • resp.cache_hit_tokens — new property returning tokens served from provider cache (None if not applicable).
  • resp.thinking_tokens — new property returning reasoning/thinking token count for o-series and Claude models.
  • Both new properties are included in resp.to_dict().

See CHANGELOG.md for the full history.

Installation

pip install edgequake-litellm

Quick Start

import edgequake_litellm as litellm   # drop-in import alias

# ── Synchronous chat ────────────────────────────────────────────────────────
resp = litellm.completion(
    "openai/gpt-4o-mini",
    [{"role": "user", "content": "Hello, world!"}],
)
# litellm-compatible access
print(resp.choices[0].message.content)
# convenience shortcut
print(resp.content)

# ── Asynchronous chat ───────────────────────────────────────────────────────
import asyncio

async def main():
    resp = await litellm.acompletion(
        "anthropic/claude-3-5-haiku-20241022",
        [{"role": "user", "content": "Tell me a joke."}],
        max_tokens=128,
        temperature=0.8,
    )
    print(resp.choices[0].message.content)

asyncio.run(main())

# ── Streaming (async generator) ─────────────────────────────────────────────
async def stream_example():
    messages = [{"role": "user", "content": "Count to five."}]
    async for chunk in litellm.acompletion("openai/gpt-4o", messages, stream=True):
        print(chunk.choices[0].delta.content or "", end="", flush=True)

# ── Embeddings ──────────────────────────────────────────────────────────────
result = litellm.embedding(
    "openai/text-embedding-3-small",
    ["Hello world", "Rust is fast"],
)
# litellm-compatible access
print(result.data[0].embedding[:3])
# legacy list access still works
print(len(result), len(result[0]))  # 2 1536

Provider Routing

Pass provider/model as the first argument — the prefix selects the provider:

Provider        Example model string
OpenAI          openai/gpt-4o
Anthropic       anthropic/claude-3-5-sonnet-20241022
Google Gemini   gemini/gemini-2.0-flash
Mistral         mistral/mistral-large-latest
OpenRouter      openrouter/meta-llama/llama-3.1-70b-instruct
xAI             xai/grok-3-beta
Ollama          ollama/llama3.2
LM Studio       lmstudio/local-model
HuggingFace     huggingface/mistralai/Mixtral-8x7B-Instruct-v0.1
Mock (tests)    mock/any-name
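
For example, switching providers only changes the model-string prefix; the call itself stays the same. A short sketch using two entries from the table above (the Ollama call assumes a local server is running):

import edgequake_litellm as litellm

messages = [{"role": "user", "content": "Say hello in one word."}]

# Hosted provider, selected by the "openai/" prefix.
resp_openai = litellm.completion("openai/gpt-4o-mini", messages)

# Local provider: only the prefix changes, the call shape does not.
resp_local = litellm.completion("ollama/llama3.2", messages)

print(resp_openai.content)
print(resp_local.content)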

API Reference

completion(model, messages, **kwargs) → ModelResponseCompat

Synchronous chat completion. Blocks but releases the GIL during Rust I/O so other Python threads keep running.

resp = litellm.completion(
    "openai/gpt-4o",
    messages,
    max_tokens=256,
    temperature=0.7,
    system="You are a helpful assistant.",
    max_completion_tokens=256,  # alias for max_tokens; required for o1/o3/gpt-4.1 models
    seed=42,
    response_format={"type": "json_object"},  # "text" or "json_object"
)

# All of these access the same content:
resp.choices[0].message.content   # litellm path
resp.content                       # shortcut
resp["choices"][0]["message"]["content"]  # dict-style

resp.usage.total_tokens
resp.model
resp.response_ms                  # latency in milliseconds
resp.to_dict()                    # plain dict

# New in 0.1.1 — cache and reasoning token metadata
resp.cache_hit_tokens             # int | None — tokens served from provider cache
resp.thinking_tokens              # int | None — reasoning tokens (o-series, Claude)
resp.thinking_content             # str | None — visible thinking text (Claude)

# The same data via usage object:
resp.usage.cache_read_input_tokens  # same as resp.cache_hit_tokens
resp.usage.reasoning_tokens         # same as resp.thinking_tokens
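
Because completion() releases the GIL during the Rust I/O, independent requests can be fanned out with a plain thread pool. A minimal sketch (worker count and model are illustrative):

from concurrent.futures import ThreadPoolExecutor

import edgequake_litellm as litellm

prompts = ["Name one Rust crate.", "Name one Python package."]

def ask(prompt):
    resp = litellm.completion(
        "openai/gpt-4o-mini",
        [{"role": "user", "content": prompt}],
    )
    return resp.content

# The calls overlap because completion() releases the GIL while waiting on I/O.
with ThreadPoolExecutor(max_workers=2) as pool:
    for answer in pool.map(ask, prompts):
        print(answer)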

acompletion(model, messages, stream=False, **kwargs)

Async chat completion. Returns ModelResponseCompat or (if stream=True) AsyncGenerator[StreamChunkCompat, None].

# Non-streaming
resp = await litellm.acompletion("openai/gpt-4o", messages)

# Streaming
async for chunk in await litellm.acompletion("openai/gpt-4o", messages, stream=True):
    print(chunk.choices[0].delta.content or "", end="")

stream(model, messages, **kwargs) → AsyncGenerator[StreamChunk, None]

Low-level streaming. Raw StreamChunk objects:

async for chunk in litellm.stream("openai/gpt-4o", messages):
    if chunk.content:
        print(chunk.content, end="")
    elif chunk.is_finished:
        print(f"\n[stop: {chunk.finish_reason}]")

embedding(model, input, **kwargs) → EmbeddingResponseCompat

Synchronous embeddings. Returns an EmbeddingResponseCompat that supports both litellm-style and legacy list-style access:

result = litellm.embedding("openai/text-embedding-3-small", ["foo", "bar"])

# litellm path
result.data[0].embedding

# backwards-compatible list access
for vec in result:          # iterates List[float]
    print(len(vec))
result[0]                   # List[float]
len(result)                 # number of vectors

aembedding(model, input, **kwargs) → EmbeddingResponseCompat

Async embeddings — same return type as embedding().
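
A minimal async sketch mirroring the synchronous embedding() example above (model and inputs are the same as in Quick Start):

import asyncio

import edgequake_litellm as litellm

async def embed_batch():
    result = await litellm.aembedding(
        "openai/text-embedding-3-small",
        ["Hello world", "Rust is fast"],
    )
    # Same dual access as embedding(): litellm-style .data or legacy list-style.
    return [item.embedding for item in result.data]

vectors = asyncio.run(embed_batch())
print(len(vectors), len(vectors[0]))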

stream_chunk_builder(chunks, messages=None) → ModelResponseCompat

Reconstruct a full ModelResponseCompat from a collected list of streaming chunks:

from edgequake_litellm import stream_chunk_builder

chunks = []
async for chunk in litellm.stream("openai/gpt-4o", messages):
    chunks.append(chunk)

full = stream_chunk_builder(chunks, messages=messages)
print(full.content)

Configuration

Module-level globals mirror litellm:

import edgequake_litellm as litellm

litellm.set_verbose = True      # enable debug logging
litellm.drop_params = True      # drop unknown params (always True)

# Set default provider / model
litellm.set_default_provider("anthropic")
litellm.set_default_model("claude-3-5-haiku-20241022")

# Now the provider prefix can be omitted:
resp = litellm.completion("claude-3-5-haiku-20241022", messages)

Exception Hierarchy

Exceptions mirror LiteLLM for painless migration:

import time

import edgequake_litellm as litellm

try:
    resp = litellm.completion("openai/gpt-4o", messages)
except litellm.AuthenticationError as e:
    print(f"Check your API key: {e}")
except litellm.RateLimitError:
    time.sleep(5)
except litellm.ContextWindowExceededError:
    # trim messages and retry
    pass
except litellm.NotFoundError:      # alias for ModelNotFoundError
    pass
except litellm.APIConnectionError:
    pass

All exceptions (AuthenticationError, RateLimitError, ContextWindowExceededError, ModelNotFoundError, Timeout, APIConnectionError, APIError) are also available from edgequake_litellm.exceptions.
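
Since the exception types match LiteLLM's, existing retry helpers carry over unchanged. A minimal sketch of a retry-on-rate-limit wrapper (retry count and backoff are illustrative, not part of the library):

import time

import edgequake_litellm as litellm
from edgequake_litellm.exceptions import RateLimitError

def completion_with_retry(model, messages, retries=3):
    """Retry on rate limiting with simple exponential backoff."""
    for attempt in range(retries):
        try:
            return litellm.completion(model, messages)
        except RateLimitError:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)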

Environment Variables

Provider credentials follow the standard naming convention:

Provider      Environment variable
OpenAI        OPENAI_API_KEY
Anthropic     ANTHROPIC_API_KEY
Gemini        GEMINI_API_KEY
Mistral       MISTRAL_API_KEY
OpenRouter    OPENROUTER_API_KEY
xAI           XAI_API_KEY
HuggingFace   HF_TOKEN
Ollama        OLLAMA_HOST (default: http://localhost:11434)
LM Studio     LMSTUDIO_HOST (default: http://localhost:1234)

Defaults can also be set via LITELLM_EDGE_PROVIDER / LITELLM_EDGE_MODEL.
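
A hedged sketch of setting those defaults from Python (the model name is illustrative; the variables are set before the import so they are visible regardless of when the module reads them):

import os

# Provide the default provider/model through the environment.
os.environ["LITELLM_EDGE_PROVIDER"] = "anthropic"
os.environ["LITELLM_EDGE_MODEL"] = "claude-3-5-haiku-20241022"

import edgequake_litellm as litellm

# With defaults in place, the provider prefix can be omitted (see Configuration above).
resp = litellm.completion("claude-3-5-haiku-20241022", [{"role": "user", "content": "Hi"}])
print(resp.content)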

Development

Prerequisites

  • Rust ≥ 1.83 (rustup toolchain install stable)
  • Python ≥ 3.9
  • pip install maturin

Build from source

git clone https://github.com/raphaelmansuy/edgequake-llm.git
cd edgequake-llm/edgequake-litellm

# Create a virtual environment
python -m venv .venv
source .venv/bin/activate      # Windows: .venv\Scripts\activate

pip install maturin pytest pytest-asyncio ruff mypy

# Build & install in dev mode (incremental Rust + Python)
maturin develop --release

# Run unit tests (mock provider — no API keys needed)
pytest tests/ -k "not e2e" -v

Running E2E tests

export OPENAI_API_KEY=sk-...
pytest tests/test_e2e_openai.py -v

Publishing

# Bump version in pyproject.toml AND Cargo.toml (must match), then:
git tag py-v0.2.0
git push --tags
# GitHub Actions builds and publishes to PyPI automatically.

License

Apache-2.0 — see LICENSE-APACHE.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source distribution

edgequake_litellm-0.1.1.tar.gz (709.1 kB), uploaded with twine/6.1.0 on CPython 3.13.7 (Trusted Publishing: No)
  SHA256       16f55e2cc7a06ba0a42ffddeb6a958f3ba9362b6d1dc5189a551a32ec4d7656f
  MD5          6b59e518d1be7886cf7e997570b2a9e4
  BLAKE2b-256  c8aa206f970eb9e37979f7023832fbbd762587e53786f19e765a5f2d25bd53b5

Built distributions (CPython 3.9+, abi3)

edgequake_litellm-0.1.1-cp39-abi3-win_amd64.whl (3.0 MB), Windows x86-64
  SHA256       6604edb11dbc802af570baf4d2a9f1e295d3e6c85781529002860452432dabf6
  MD5          9096179de5f1e7fbcfcef25f93dd5cf0
  BLAKE2b-256  5a50c30a4f466f017cf76bda983d4bf25268c4d4f2bfc5683fbfd75e559f25ad

edgequake_litellm-0.1.1-cp39-abi3-musllinux_1_2_x86_64.whl (3.8 MB), musllinux: musl 1.2+ x86-64
  SHA256       838fbddb3801a97ea81d583c3346919b2f76dac91b90e8ce168057b0eb2995e9
  MD5          c342c67579725e99828749653a333bc1
  BLAKE2b-256  9d35d6b365f4f54ec01bc8dd8688c0559e4f0664d1b81f761d6ac588651d26d6

edgequake_litellm-0.1.1-cp39-abi3-musllinux_1_2_aarch64.whl (3.7 MB), musllinux: musl 1.2+ ARM64
  SHA256       9b7787d7c6fca46fada71016970de9360a67bfbaa6b5079462f2c86193b37177
  MD5          c38e547771c879bf781c760f29deafee
  BLAKE2b-256  ed0f0788b512c9e9536b7b5d311aab4fc58560c0742df88540d353f7f0d66923

edgequake_litellm-0.1.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB), manylinux: glibc 2.17+ x86-64
  SHA256       3f0493ce0317b345ed5f1fa78517ee90fc8cfc2745b6ba48adfcabb6714d2974
  MD5          b1c865a1bb8c09d3887df6bdddd34ef5
  BLAKE2b-256  6bd7ec04b2c3fb0bc8e2bff908bae1cc079d0c28630e5fcf6d3a6d5939f36445

edgequake_litellm-0.1.1-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (3.5 MB), manylinux: glibc 2.17+ ARM64
  SHA256       fd0c6f09cfe0b7d3be56a932d37ca0e7882ddc258b62ea5cb6c456b50148cd05
  MD5          c2050730b01e8f17f9d0013bb3f5fd4e
  BLAKE2b-256  30388f6c6950b728b76fcfff79494fce1f1f7910613da7b9d9059b58ad7e350a

edgequake_litellm-0.1.1-cp39-abi3-macosx_11_0_arm64.whl (3.2 MB), macOS 11.0+ ARM64
  SHA256       ea76d421069a97493be9def610a868bc6c6a71823681deae8328c1f4d82fa91c
  MD5          1182dfe328ec28b5ffcaa6fec7dbaa5f
  BLAKE2b-256  bb831723564b26bc7cd7602c4fad2c2191615caa9873bea744374198468d9d15

edgequake_litellm-0.1.1-cp39-abi3-macosx_10_12_x86_64.whl (3.3 MB), macOS 10.12+ x86-64
  SHA256       10283d31b14f6ecce36985bc90f298e52b55080c461b2070a810486a28fa3841
  MD5          92fb7988d4156bd372da4c6c7641fd1b
  BLAKE2b-256  dde32d2be2b227115d70cb1f1a8d363f43862c2db946ff898494fd77d4903382
