
Project description

edgequake-litellm

Drop-in LiteLLM replacement backed by Rust — same API, lower overhead.


edgequake-litellm wraps the edgequake-llm Rust core via PyO3, providing a high-performance drop-in for LiteLLM. Swap the import — the rest of your code stays unchanged.

# Before
import litellm

# After — same API, Rust-backed
import edgequake_litellm as litellm

Features

  • LiteLLM-compatible API — completion(), acompletion(), stream(), embedding(); same call signatures, same response shape (resp.choices[0].message.content).
  • Multi-provider routing — OpenAI, Anthropic, Gemini, Mistral, OpenRouter, xAI, Ollama, LM Studio, HuggingFace, and more, via provider/model strings.
  • Async-native — built on Tokio; sync and async Python both supported.
  • Single wheel per platform — uses PyO3's abi3-py39 stable ABI, one .whl covers Python 3.9–3.13+.
  • Zero Python runtime dependencies — the Rust extension is self-contained.
  • Full type annotations — ships with py.typed and .pyi stubs.
  • max_completion_tokens support — works across OpenAI model families, including o1, o3-mini, o4-mini, gpt-4.1, and gpt-4.1-nano, which require this field.
  • Cache hit tokens — resp.cache_hit_tokens exposes OpenAI prompt cache hits and Anthropic cache reads.
  • Reasoning tokens — resp.thinking_tokens surfaces o-series reasoning and Claude extended-thinking token counts.

What's New in 0.1.1

  • max_completion_tokens fixed for OpenAI o-series and gpt-4.1 model families (previously returned 400 Bad Request).
  • resp.cache_hit_tokens — new property returning tokens served from provider cache (None if not applicable).
  • resp.thinking_tokens — new property returning reasoning/thinking token count for o-series and Claude models.
  • Both new properties are included in resp.to_dict().
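
A quick way to inspect the new metadata, sketched with the built-in mock provider (no API keys needed); whether the values are populated depends on the provider, and the to_dict() key names are assumed to match the property names:

import edgequake_litellm as litellm

resp = litellm.completion("mock/any-name", [{"role": "user", "content": "hi"}])

# None whenever the provider reports no cache hits / reasoning tokens
print(resp.cache_hit_tokens, resp.thinking_tokens)

# Both properties should also appear in the plain-dict form
d = resp.to_dict()
print(d.get("cache_hit_tokens"), d.get("thinking_tokens"))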

See CHANGELOG.md for the full history.

Installation

pip install edgequake-litellm

Quick Start

import edgequake_litellm as litellm   # drop-in import alias

# ── Synchronous chat ────────────────────────────────────────────────────────
resp = litellm.completion(
    "openai/gpt-4o-mini",
    [{"role": "user", "content": "Hello, world!"}],
)
# litellm-compatible access
print(resp.choices[0].message.content)
# convenience shortcut
print(resp.content)

# ── Asynchronous chat ───────────────────────────────────────────────────────
import asyncio

async def main():
    resp = await litellm.acompletion(
        "anthropic/claude-3-5-haiku-20241022",
        [{"role": "user", "content": "Tell me a joke."}],
        max_tokens=128,
        temperature=0.8,
    )
    print(resp.choices[0].message.content)

asyncio.run(main())

# ── Streaming (async generator) ─────────────────────────────────────────────
async def stream_example():
    messages = [{"role": "user", "content": "Count to five."}]
    async for chunk in await litellm.acompletion("openai/gpt-4o", messages, stream=True):
        print(chunk.choices[0].delta.content or "", end="", flush=True)

# ── Embeddings ──────────────────────────────────────────────────────────────
result = litellm.embedding(
    "openai/text-embedding-3-small",
    ["Hello world", "Rust is fast"],
)
# litellm-compatible access
print(result.data[0].embedding[:3])
# legacy list access still works
print(len(result), len(result[0]))  # 2 1536

Provider Routing

Pass provider/model as the first argument — the prefix selects the provider:

Provider        Example model string
OpenAI          openai/gpt-4o
Anthropic       anthropic/claude-3-5-sonnet-20241022
Google Gemini   gemini/gemini-2.0-flash
Mistral         mistral/mistral-large-latest
OpenRouter      openrouter/meta-llama/llama-3.1-70b-instruct
xAI             xai/grok-3-beta
Ollama          ollama/llama3.2
LM Studio       lmstudio/local-model
HuggingFace     huggingface/mistralai/Mixtral-8x7B-Instruct-v0.1
Mock (tests)    mock/any-name
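
Switching providers is just a string change. A minimal sketch, assuming the relevant API keys are set in the environment:

import edgequake_litellm as litellm

messages = [{"role": "user", "content": "One-sentence summary of Rust ownership."}]

for model in ("openai/gpt-4o-mini", "anthropic/claude-3-5-haiku-20241022", "ollama/llama3.2"):
    resp = litellm.completion(model, messages)
    print(f"{model}: {resp.content}")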

API Reference

completion(model, messages, **kwargs) → ModelResponseCompat

Synchronous chat completion. Blocks but releases the GIL during Rust I/O so other Python threads keep running.

resp = litellm.completion(
    "openai/gpt-4o",
    messages,
    max_tokens=256,
    temperature=0.7,
    system="You are a helpful assistant.",
    max_completion_tokens=256,  # alias for max_tokens (pass one or the other); required for o1/o3/gpt-4.1
    seed=42,
    response_format={"type": "json_object"},  # "text" or "json_object"
)

# All of these access the same content:
resp.choices[0].message.content   # litellm path
resp.content                       # shortcut
resp["choices"][0]["message"]["content"]  # dict-style

resp.usage.total_tokens
resp.model
resp.response_ms                  # latency in milliseconds
resp.to_dict()                    # plain dict

# New in 0.1.1 — cache and reasoning token metadata
resp.cache_hit_tokens             # int | None — tokens served from provider cache
resp.thinking_tokens              # int | None — reasoning tokens (o-series, Claude)
resp.thinking_content             # str | None — visible thinking text (Claude)

# The same data via usage object:
resp.usage.cache_read_input_tokens  # same as resp.cache_hit_tokens
resp.usage.reasoning_tokens         # same as resp.thinking_tokens
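
Because the GIL is released during Rust I/O, plain Python threads can run completions in parallel. A minimal sketch using the mock provider (no keys needed):

import edgequake_litellm as litellm
from concurrent.futures import ThreadPoolExecutor

messages = [{"role": "user", "content": "ping"}]

def call(_):
    # Blocks this thread only; the GIL is free while Rust does the I/O
    return litellm.completion("mock/any-name", messages).content

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(call, range(4)))
print(len(results))  # 4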

acompletion(model, messages, stream=False, **kwargs)

Async chat completion. Returns ModelResponseCompat or (if stream=True) AsyncGenerator[StreamChunkCompat, None].

# Non-streaming
resp = await litellm.acompletion("openai/gpt-4o", messages)

# Streaming
async for chunk in await litellm.acompletion("openai/gpt-4o", messages, stream=True):
    print(chunk.choices[0].delta.content or "", end="")
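
Since the call is async-native, fanning out concurrent requests is a single asyncio.gather. A sketch, again on the mock provider:

import asyncio

import edgequake_litellm as litellm

async def fan_out():
    batches = [[{"role": "user", "content": f"question {i}"}] for i in range(3)]
    # All three requests are in flight concurrently on the Tokio runtime
    resps = await asyncio.gather(
        *(litellm.acompletion("mock/any-name", m) for m in batches)
    )
    return [r.content for r in resps]

print(asyncio.run(fan_out()))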

stream(model, messages, **kwargs) → AsyncGenerator[StreamChunk, None]

Low-level streaming. Raw StreamChunk objects:

async for chunk in litellm.stream("openai/gpt-4o", messages):
    if chunk.content:
        print(chunk.content, end="")
    elif chunk.is_finished:
        print(f"\n[stop: {chunk.finish_reason}]")

embedding(model, input, **kwargs) → EmbeddingResponseCompat

Synchronous embeddings. Returns an EmbeddingResponseCompat that supports both litellm-style and legacy list-style access:

result = litellm.embedding("openai/text-embedding-3-small", ["foo", "bar"])

# litellm path
result.data[0].embedding

# backwards-compatible list access
for vec in result:          # iterates List[float]
    print(len(vec))
result[0]                   # List[float]
len(result)                 # number of vectors
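
A common use of the list-style access is cosine similarity between two embedded strings; a sketch (any embedding model works):

import math

import edgequake_litellm as litellm

result = litellm.embedding("openai/text-embedding-3-small", ["Hello world", "Rust is fast"])
a, b = result[0], result[1]  # legacy List[float] access

dot = sum(x * y for x, y in zip(a, b))
norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
print(dot / norm)  # cosine similarity in [-1, 1]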

aembedding(model, input, **kwargs) → EmbeddingResponseCompat

Async embeddings — same return type as embedding().
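
A minimal async sketch, mirroring the sync example above:

import asyncio

import edgequake_litellm as litellm

async def embed():
    result = await litellm.aembedding("openai/text-embedding-3-small", ["foo", "bar"])
    print(len(result), len(result[0]))  # number of vectors, dimensions

asyncio.run(embed())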

stream_chunk_builder(chunks, messages=None) → ModelResponseCompat

Reconstruct a full ModelResponseCompat from a collected list of streaming chunks:

from edgequake_litellm import stream_chunk_builder

chunks = []
async for chunk in litellm.stream("openai/gpt-4o", messages):
    chunks.append(chunk)

full = stream_chunk_builder(chunks, messages=messages)
print(full.content)

Configuration

Module-level globals mirror litellm:

import edgequake_litellm as litellm

litellm.set_verbose = True      # enable debug logging
litellm.drop_params = True      # drop unknown params (always enabled in this wrapper)

# Set default provider / model
litellm.set_default_provider("anthropic")
litellm.set_default_model("claude-3-5-haiku-20241022")

# Now the provider prefix can be omitted:
resp = litellm.completion("claude-3-5-haiku-20241022", messages)

Exception Hierarchy

Exceptions mirror LiteLLM for painless migration:

import time

import edgequake_litellm as litellm

try:
    resp = litellm.completion("openai/gpt-4o", messages)
except litellm.AuthenticationError as e:
    print(f"Check your API key: {e}")
except litellm.RateLimitError:
    time.sleep(5)
except litellm.ContextWindowExceededError:
    # trim messages and retry
    pass
except litellm.NotFoundError:      # alias for ModelNotFoundError
    pass
except litellm.APIConnectionError:
    pass

All exceptions (AuthenticationError, RateLimitError, ContextWindowExceededError, ModelNotFoundError, Timeout, APIConnectionError, APIError) are also available from edgequake_litellm.exceptions.
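
One pattern the hierarchy makes easy: retry on RateLimitError with exponential backoff. A sketch, not part of the library:

import time

import edgequake_litellm as litellm

def completion_with_retry(model, messages, retries=3):
    for attempt in range(retries):
        try:
            return litellm.completion(model, messages)
        except litellm.RateLimitError:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # 1 s, 2 s, 4 s, ...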

Environment Variables

Provider credentials follow the standard naming convention:

Provider      Environment variable
OpenAI        OPENAI_API_KEY
Anthropic     ANTHROPIC_API_KEY
Gemini        GEMINI_API_KEY
Mistral       MISTRAL_API_KEY
OpenRouter    OPENROUTER_API_KEY
xAI           XAI_API_KEY
HuggingFace   HF_TOKEN
Ollama        OLLAMA_HOST (default: http://localhost:11434)
LM Studio     LMSTUDIO_HOST (default: http://localhost:1234)

Defaults can also be set via LITELLM_EDGE_PROVIDER / LITELLM_EDGE_MODEL.
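
For example, a script can be pointed at a local Ollama instance without code changes; a sketch, assuming the variables are read when the client is first used:

import os

os.environ["LITELLM_EDGE_PROVIDER"] = "ollama"
os.environ["LITELLM_EDGE_MODEL"] = "llama3.2"
os.environ["OLLAMA_HOST"] = "http://localhost:11434"  # the default

import edgequake_litellm as litellm

# Provider prefix can now be omitted
resp = litellm.completion("llama3.2", [{"role": "user", "content": "hi"}])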

Development

Prerequisites

  • Rust ≥ 1.83 (rustup toolchain install stable)
  • Python ≥ 3.9
  • pip install maturin

Build from source

git clone https://github.com/raphaelmansuy/edgequake-llm.git
cd edgequake-llm/edgequake-litellm

# Create a virtual environment
python -m venv .venv
source .venv/bin/activate      # Windows: .venv\Scripts\activate

pip install maturin pytest pytest-asyncio ruff mypy

# Build & install in dev mode (incremental Rust + Python)
maturin develop --release

# Run unit tests (mock provider — no API keys needed)
pytest tests/ -k "not e2e" -v

Running E2E tests

export OPENAI_API_KEY=sk-...
pytest tests/test_e2e_openai.py -v

Publishing

# Bump version in pyproject.toml AND Cargo.toml (must match), then:
git tag py-v0.2.0
git push --tags
# GitHub Actions builds and publishes to PyPI automatically.

License

Apache-2.0 — see LICENSE-APACHE.

Download files

Download the file for your platform.

Source Distribution

edgequake_litellm-0.1.2.tar.gz (716.6 kB)
  Source

Built Distributions

edgequake_litellm-0.1.2-cp39-abi3-win_amd64.whl (3.0 MB)
  CPython 3.9+, Windows x86-64

edgequake_litellm-0.1.2-cp39-abi3-musllinux_1_2_x86_64.whl (3.8 MB)
  CPython 3.9+, musllinux (musl 1.2+), x86-64

edgequake_litellm-0.1.2-cp39-abi3-musllinux_1_2_aarch64.whl (3.7 MB)
  CPython 3.9+, musllinux (musl 1.2+), ARM64

edgequake_litellm-0.1.2-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
  CPython 3.9+, manylinux (glibc 2.17+), x86-64

edgequake_litellm-0.1.2-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (3.6 MB)
  CPython 3.9+, manylinux (glibc 2.17+), ARM64

edgequake_litellm-0.1.2-cp39-abi3-macosx_11_0_arm64.whl (3.2 MB)
  CPython 3.9+, macOS 11.0+, ARM64

edgequake_litellm-0.1.2-cp39-abi3-macosx_10_12_x86_64.whl (3.3 MB)
  CPython 3.9+, macOS 10.12+, x86-64

File details

Hashes for each uploaded file. The source distribution (716.6 kB) was uploaded via twine/6.1.0 CPython/3.13.7; Trusted Publishing: No.

edgequake_litellm-0.1.2.tar.gz
  SHA256       af70bd608fe450a98c285b883c5cb7c17c8626450c56d7944b57638e09013e91
  MD5          272c0aa6701b861d3a95766a7276304f
  BLAKE2b-256  627ad253b8a3ae197a102a868b565a4b481fbd5c467f3adce51a04588578fcbe

edgequake_litellm-0.1.2-cp39-abi3-win_amd64.whl
  SHA256       8745bc48861256378b3c0a59e75cec5a9287a3446081e78a5a8b9feebb250e1d
  MD5          6543cc35cf0fef2dc986726f0d1c2674
  BLAKE2b-256  a51d0c1ad77eb0c3887cb1e5466c6d55d9065b0a4c9d90f6d8fbd366480a88d9

edgequake_litellm-0.1.2-cp39-abi3-musllinux_1_2_x86_64.whl
  SHA256       3cb61574fad76c993a7c7274fd84a2b16a97a1f6b34518f383d134f774ad1c0d
  MD5          367d09d15927b4320d60c6e8da1b059c
  BLAKE2b-256  03b4267c03e6dcf23fa7907e0eb8f8f003dafabf449a3027e66e51ce20cf95b6

edgequake_litellm-0.1.2-cp39-abi3-musllinux_1_2_aarch64.whl
  SHA256       9606da689adbac966097565fc8ff85ac0bc2ad00f46414c13985bbf696f14d0d
  MD5          1fd4fa546b2c9852d8a573fa527e89a5
  BLAKE2b-256  44ee009550814831c48eba4d32d3c5e967d3fa2d35e6da775600e02ca25a3340

edgequake_litellm-0.1.2-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256       9d37d734ff85fc8738d36ebefa8c445a93d00552ea675c99914296a38fb39520
  MD5          c55be43623224558da1eac312e3ed958
  BLAKE2b-256  5ce2faa2a5d37bb65a4afd2774ac194fa94d5eead4510f08ab0dbd9ff7bc7a6f

edgequake_litellm-0.1.2-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
  SHA256       0ba712e2c8ea3810659b6cc4d011ad6e76f5a57683894d4c2babb27ad9d23568
  MD5          97ee186f244117a5c8858dacd864bfbd
  BLAKE2b-256  641539ed9b653c0e27ebb025d19c08b1c8c89287a9c6ab905c171046a3d60135

edgequake_litellm-0.1.2-cp39-abi3-macosx_11_0_arm64.whl
  SHA256       bc063fe9e67aaa5604ba242ba32885b917dc6b2f7dfd9729c82707aa7ad969af
  MD5          ac08542867ef35db4eab0799f7197395
  BLAKE2b-256  160fd5b09ac2b7b29ef2759086aeb16835e045f5b340dbc220f636b50290234b

edgequake_litellm-0.1.2-cp39-abi3-macosx_10_12_x86_64.whl
  SHA256       f3326b3effe8b30e807881803ba08135ec8779d8bcd169cfc2db31e877f5d568
  MD5          b8f583911a52b4c220d63dd4cf76202a
  BLAKE2b-256  eeaf7e994b6a25fd2c3e8354afc289461b5e07ad2ca6fb56aa6db9fd4b20c65a
