Free LLM client + gateway: one always-up, OpenAI-compatible API (with streaming) over free-tier providers (OpenRouter, Google Gemini / AI Studio, NVIDIA NIM, Groq, Cerebras, Mistral) with automatic key rotation, cross-provider failover, circuit breaking, rate-limit/quota-aware routing, and live free-model discovery.
freelm — free, always-up LLM client for Python
freelm is a free, always-up LLM client and gateway for Python that pools multiple free-tier LLM providers — OpenRouter, Google Gemini (AI Studio), NVIDIA NIM, Groq, Cerebras, and Mistral — behind one OpenAI-compatible call (with streaming), with automatic API-key rotation, cross-provider failover, circuit breaking, rate-limit/quota-aware routing, and live free-model discovery. Drop in whichever free keys you have and your app keeps talking to an LLM even when one source rate-limits or goes down.
📦 PyPI: https://pypi.org/project/freelm/ — pip install freelm
Python first. JS/TS and Go ports planned (the core is spec-driven for portability).
Why
LLMs show up in nearly every project, and they cost money — but there's a lot of free capacity scattered across providers:
- OpenRouter: free models (`:free`), ~50 req/day under $10 credit, ~1000/day at ≥$10.
- Google AI Studio (Gemini): generous free tier; Tier 1 (billing on) lifts limits hard.
- NVIDIA NIM (`build.nvidia.com`): many models free against build credits.
- Groq: 30 RPM / 14,400 req/day free, very fast inference, no card.
- Cerebras: ~30 RPM, 1M tokens/day free (8K context cap), no card.
- Mistral: free "Experiment" tier with 2 RPM, 500K TPM, 1B tokens/month.
freelm pools them behind one fault-tolerant client.
Free-tier numbers above were verified 2026-06 and change often; they're defaults you can override with `tier`/`rpm`/`rpd`.
Install
pip install freelm
Quick start
import freelm
llm = freelm.FreeLLM.from_env() # reads keys from environment
print(llm.text("Explain black holes in one sentence."))
Explicit config:
from freelm import FreeLLM, OpenRouter, GoogleAIStudio, NIM
llm = FreeLLM(
    providers=[
        OpenRouter("sk-or-...", tier="free"),     # or tier="credit" if ≥ $10
        GoogleAIStudio("AIza...", tier="free"),   # or tier="tier1"
        NIM("nvapi-..."),
    ],
    strategy="quota_aware",  # priority | round_robin | quota_aware | latency
)
resp = llm.chat(
    [{"role": "user", "content": "Write a haiku about failover."}],
    model="chat:fast",  # virtual model, see below
)
print(resp.text, "via", resp.provider)
Async is symmetric:
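import asyncio
from freelm import AsyncFreeLLM

async def main():
    # `async with` only works inside a coroutine, so wrap it and run it.
    async with AsyncFreeLLM.from_env() as llm:
        print(await llm.text("hi"))

asyncio.run(main())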
Streaming
Token streaming works across every provider and through the same failover. It fails over between providers before the first token; once tokens start flowing it stays on that provider (no mid-stream switching).
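import asyncio
import freelm

# Sync streaming
llm = freelm.FreeLLM.from_env()
for chunk in llm.stream("Write a haiku about failover."):
    print(chunk, end="", flush=True)

# Async streaming (must run inside a coroutine)
async def main():
    async with freelm.AsyncFreeLLM.from_env() as llm:
        async for chunk in llm.astream("Stream me some tokens"):
            print(chunk, end="", flush=True)

asyncio.run(main())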
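The "fail over before the first token, then stay put" rule can be pictured with a small generator. This is only a sketch of the idea, not freelm's internals: `stream_with_failover`, `flaky`, and `healthy` are hypothetical names, and treating every pre-token error as retryable is an assumption made to keep the example short.

```python
from typing import Callable, Iterable, Iterator, Optional


def stream_with_failover(sources: Iterable[Callable[[], Iterator[str]]]) -> Iterator[str]:
    """Yield chunks from the first source that manages to produce a first chunk."""
    last_error: Optional[Exception] = None
    for open_stream in sources:
        try:
            stream = open_stream()
            first = next(stream)       # a failure here is still safe to retry elsewhere
        except StopIteration:
            return                     # source opened fine but produced nothing
        except Exception as exc:       # assumption: any pre-token error is retryable
            last_error = exc
            continue
        yield first
        # Committed to this source: later errors propagate, no mid-stream switch.
        yield from stream
        return
    raise RuntimeError("every source failed before its first token") from last_error


def flaky() -> Iterator[str]:
    raise ConnectionError("rate limited")
    yield ""  # unreachable; only here to make this a generator function


def healthy() -> Iterator[str]:
    yield from ["fail", "over", " works"]


print("".join(stream_with_failover([flaky, healthy])))  # failover works
```

Once the first chunk has been yielded the caller may already have shown it to a user, which is why switching providers after that point would risk duplicated or inconsistent output.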
Drop-in OpenAI shim
# from openai import OpenAI
from freelm.compat import OpenAI
client = OpenAI() # backed by FreeLLM.from_env()
r = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "hi"}],
)
print(r.choices[0].message.content)
Environment variables
| Provider | Key vars (first match wins) | Tier var |
|---|---|---|
| OpenRouter | `OPENROUTER_API_KEY` / `FREELM_OPENROUTER_KEYS` | `FREELM_OPENROUTER_TIER` (`free` or `credit`) |
| Google AI Studio | `GEMINI_API_KEY` / `GOOGLE_API_KEY` / `GOOGLE_AI_STUDIO_KEY` / `FREELM_GOOGLE_KEYS` | `FREELM_GOOGLE_TIER` (`free` or `tier1`) |
| NVIDIA NIM | `NVIDIA_API_KEY` / `NIM_API_KEY` / `FREELM_NIM_KEYS` | `FREELM_NIM_TIER` (`free`) |
| Groq | `GROQ_API_KEY` / `FREELM_GROQ_KEYS` | `FREELM_GROQ_TIER` (`free`) |
| Cerebras | `CEREBRAS_API_KEY` / `FREELM_CEREBRAS_KEYS` | `FREELM_CEREBRAS_TIER` (`free`) |
| Mistral | `MISTRAL_API_KEY` / `FREELM_MISTRAL_KEYS` | `FREELM_MISTRAL_TIER` (`free`) |
Multiple keys per provider: comma-separate them. See .env.example.
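To make "first match wins" and comma-separated keys concrete, here is a small sketch of how such a lookup could work. The helper `keys_from_env` is hypothetical, and the assumption that only the first non-empty variable is used (rather than merging all of them) is mine; check the library if the exact behaviour matters.

```python
import os
from typing import List, Mapping, Sequence


def keys_from_env(names: Sequence[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the keys from the first variable in `names` that is set and non-empty."""
    for name in names:
        raw = env.get(name, "")
        # Split on commas and drop blanks so "a, b," yields ["a", "b"].
        keys = [part.strip() for part in raw.split(",") if part.strip()]
        if keys:
            return keys
    return []


# A fake environment keeps the example independent of your real keys.
fake_env = {"FREELM_GROQ_KEYS": "gsk_aaa, gsk_bbb,gsk_ccc"}
print(keys_from_env(["GROQ_API_KEY", "FREELM_GROQ_KEYS"], fake_env))
# ['gsk_aaa', 'gsk_bbb', 'gsk_ccc']
```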
Groq vs xAI Grok: different companies. Groq (`gsk_…`) has a free tier and is supported. xAI Grok (`xai-…`) is paid, so it's intentionally not included; freelm is free-only.
Virtual models
Names differ per provider, so ask by intent and freelm maps to a concrete model:
| Alias | Meaning |
|---|---|
| `auto` / `chat` | any available chat model (registry order) |
| `chat:large` / `large` | a larger/stronger model |
| `chat:fast` / `fast` | a fast/cheap model |
| `chat:small` / `small` | smallest model |
| `vendor/model-id` | passthrough: use exactly this model |
Override the table per provider with models=[ModelSpec(...)].
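As a rough picture of how an alias could be mapped to a concrete model, the sketch below walks a registry in order and returns the first model carrying the requested tag. `Spec` and `resolve` are hypothetical stand-ins (not freelm's `ModelSpec` or its resolver), and `example/tiny-model:free` is a made-up id used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Spec:  # hypothetical stand-in for a registry entry: model id plus intent tags
    id: str
    tags: Tuple[str, ...]


def resolve(alias: str, registry: Sequence[Spec]) -> Optional[str]:
    """Map an alias such as "chat:fast" to the first matching model id."""
    if "/" in alias:                       # vendor/model-id: use exactly this model
        return alias
    if alias in ("auto", "chat"):
        wanted = "chat"
    else:
        wanted = alias.split(":", 1)[-1]   # "chat:fast" and "fast" both become "fast"
    for spec in registry:                  # registry order is the preference order
        if wanted in spec.tags:
            return spec.id
    return None


registry = [
    Spec("openai/gpt-oss-120b:free", ("chat", "large")),
    Spec("example/tiny-model:free", ("chat", "fast", "small")),  # made-up id
]
print(resolve("auto", registry))        # openai/gpt-oss-120b:free
print(resolve("chat:fast", registry))   # example/tiny-model:free
print(resolve("vendor/model-id", registry))  # vendor/model-id
```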
Dynamic model discovery
Free model IDs churn constantly, so freelm doesn't trust its hardcoded list. For OpenRouter (on by default), it queries GET /models on first use, derives tags (large/fast/small, plus tools/vision/reasoning from supported_parameters), and caches the list to disk.
Resolution order: live API → disk cache → hardcoded fallback (so it still works offline / key-less).
from freelm import list_free_models
for m in list_free_models()[:5]:  # live OpenRouter free models, cached
    print(m.id, m.tags, m.ctx)
Control it:
OpenRouter("sk-or-...", discover=True, discover_free_only=True, cache_ttl=3600)
GoogleAIStudio("AIza...", discover=True) # opt-in for other providers' /models
llm.refresh_models() # force re-fetch on next call
| Env var | Default | Meaning |
|---|---|---|
| `FREELM_CACHE_DIR` | `~/.cache/freelm` | where the model cache lives (file is 0600) |
| `FREELM_CACHE_TTL` | `3600` | cache lifetime in seconds |
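The resolution order described in this section (live API, then disk cache, then hardcoded fallback) together with a cache TTL might look roughly like the sketch below. Everything here is illustrative: `load_models`, `FALLBACK`, and the cache file layout are assumptions, and serving a fresh cache without touching the network is my reading of what the TTL is for.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Callable, List

FALLBACK = ["openai/gpt-oss-120b:free"]  # stands in for a built-in model list


def load_models(fetch: Callable[[], List[str]], cache_file: Path, ttl: float = 3600.0) -> List[str]:
    """Fresh cache if present, else live fetch, else stale cache, else fallback."""
    cached = None
    if cache_file.exists():
        cached = json.loads(cache_file.read_text())
        if time.time() - cache_file.stat().st_mtime < ttl:
            return cached                       # fresh cache: skip the network
    try:
        models = fetch()                        # live API
    except Exception:
        return cached if cached else FALLBACK   # stale cache, then built-in list
    cache_file.write_text(json.dumps(models))   # refresh the cache for next time
    return models


def offline() -> List[str]:
    raise OSError("no network")


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "models.json"
    print(load_models(offline, cache))                     # nothing cached yet: fallback
    print(load_models(lambda: ["vendor/a:free"], cache))   # live fetch, written to disk
    print(load_models(offline, cache))                     # served from the cache
```

The last call shows why the client can keep working offline: once a list has been cached, a failed fetch no longer matters until there is neither a cache nor a network.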
Configuration & tuning
Client knobs — FreeLLM(...) / AsyncFreeLLM(...):
| Param | Default | What it does |
|---|---|---|
| `strategy` | `"priority"` | how providers are ranked (see below) |
| `max_attempts` | `12` | hard cap on total tries across all providers/keys/models per call |
| `timeout` | `60.0` | per-request timeout (s); also the overall deadline for one `chat()` |
| `wait` | `False` | if every key is cooling, sleep until one frees instead of failing |
| `max_wait` | `20.0` | longest single sleep (s) when `wait=True` |
| `http_client` | `None` | bring your own `httpx.Client` / `AsyncClient` |
Provider knobs — OpenRouter(...), GoogleAIStudio(...), NIM(...):
| Param | Default | What it does |
|---|---|---|
| `keys` | - | one key (str) or many (list, or comma-string via env) |
| `tier` | `"free"` | selects built-in rpm/rpd limits |
| `priority` | `0` | lower = tried first (with `strategy="priority"`) |
| `rpm` / `rpd` | tier default | override requests-per-minute / per-day |
| `models` | discovered / built-in | override model list (order = preference) |
| `discover` | OpenRouter `True`, else `False` | live-fetch `/models` |
| `cache_ttl` | env / 1h | discovery cache lifetime |
Strategies
| Strategy | Behaviour |
|---|---|
| `priority` | providers in ascending `priority`, then list order. Deterministic. |
| `round_robin` | rotate which provider goes first each call. Spreads load evenly. |
| `quota_aware` | rank by current headroom (rpm tokens bounded by daily quota); cooling/disabled keys score 0. Unlimited-quota providers rank high but deplete as used, so traffic still spreads. |
| `latency` | prefer the provider with the lowest observed average latency. |
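One way to read the `quota_aware` row is as a score equal to the remaining per-minute tokens, capped by whatever is left of the daily quota. The sketch below follows that reading; `KeyState`, `headroom`, and `rank` are hypothetical names, the numbers are illustrative, and the real scoring function may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class KeyState:  # hypothetical per-key snapshot
    provider: str
    rpm_tokens: float            # requests left in the current minute
    rpd_limit: Optional[int]     # None means no daily cap
    rpd_used: int = 0
    ready: bool = True           # False while cooling down or disabled


def headroom(k: KeyState) -> float:
    """Per-minute tokens bounded by the remaining daily quota; 0 if not usable."""
    if not k.ready:
        return 0.0
    if k.rpd_limit is None:
        return k.rpm_tokens      # unlimited daily quota: only the minute bucket counts
    return max(0.0, min(k.rpm_tokens, k.rpd_limit - k.rpd_used))


def rank(keys: List[KeyState]) -> List[str]:
    return [k.provider for k in sorted(keys, key=headroom, reverse=True)]


keys = [
    KeyState("openrouter", rpm_tokens=20, rpd_limit=50, rpd_used=45),  # 5 left today
    KeyState("groq", rpm_tokens=30, rpd_limit=14400, rpd_used=100),
    KeyState("mistral", rpm_tokens=2, rpd_limit=None),
    KeyState("cerebras", rpm_tokens=30, rpd_limit=None, ready=False),  # cooling
]
print(rank(keys))  # ['groq', 'openrouter', 'mistral', 'cerebras']
```

Because the score falls as tokens are spent, the top-ranked provider does not stay on top for long, which is how traffic ends up spread across providers.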
Whatever the ranking, candidates are interleaved across providers — the best model of every provider is tried before any provider's 2nd model — so failover always reaches every provider, even when your first provider has dozens of throttled free models.
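The interleaving can be sketched with `itertools.zip_longest`: take every provider's best model first, then every provider's second choice, and so on. The function name `interleave` and the short model names are placeholders for illustration only.

```python
from itertools import zip_longest
from typing import Dict, List, Tuple


def interleave(models_by_provider: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Round 1 takes each provider's best model, round 2 each second-best, and so on."""
    order: List[Tuple[str, str]] = []
    providers = list(models_by_provider)   # assumed to be already ranked by the strategy
    for round_ in zip_longest(*models_by_provider.values()):
        for provider, model in zip(providers, round_):
            if model is not None:          # this provider has run out of models
                order.append((provider, model))
    return order


candidates = interleave({
    "openrouter": ["or-a", "or-b", "or-c"],
    "google": ["g-a"],
    "nim": ["n-a", "n-b"],
})
for provider, model in candidates:
    print(provider, model)
# openrouter or-a
# google g-a
# nim n-a
# openrouter or-b
# nim n-b
# openrouter or-c
```

With a plain concatenation, `google` would only be tried after all three OpenRouter models had failed; interleaved, it is the second candidate.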
Defining your own priority order
from freelm import FreeLLM, OpenRouter, GoogleAIStudio, NIM
llm = FreeLLM(
    [
        OpenRouter("sk-or-...", priority=0),       # try first
        GoogleAIStudio("AIza...", priority=1),     # then this
        NIM("nvapi-...", priority=2),              # last resort
    ],
    strategy="priority",
)
Within a provider, model preference is the order of its models list:
from freelm import OpenRouter, ModelSpec
OpenRouter("sk-or-...", discover=False, models=[
    ModelSpec("openai/gpt-oss-120b:free", ("chat", "large")),
    ModelSpec("meta-llama/llama-3.3-70b-instruct:free", ("chat", "large")),
])
Errors
from freelm import NoProvidersAvailable, ProviderError
try:
    resp = llm.chat("hi")
except NoProvidersAvailable as e:
    print("all providers exhausted:", e.attempts)  # [(candidate, exception), ...]
except ProviderError as e:
    print(e.provider, e.status, e.retryable)  # e.g. a malformed 400
Hierarchy: FreeLLMError → ConfigError · NoProvidersAvailable · ProviderError → AuthError / RateLimited / Transient / ModelNotFound. Retryable errors (RateLimited, Transient) are handled internally and only surface, bundled, inside NoProvidersAvailable.
Response & introspection
r = llm.chat("hi")
r.text # assistant text (also: str(r))
r.provider # which provider served it, e.g. "openrouter"
r.model # concrete model id used
r.usage # .prompt_tokens / .completion_tokens / .total_tokens
r.latency_ms # round-trip latency
r.raw # original provider JSON
llm.health() → one dict per key: provider, key (masked), ready, breaker, rpd_used, last_error, ewma_latency_ms.
Concurrency: `AsyncFreeLLM` is safe across many concurrent tasks on one event loop. A sync `FreeLLM` mutates per-key state without locks, so for multi-threaded workloads use one client per thread or use the async client.
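A common way to get "one client per thread" is `threading.local`. The sketch below uses a stand-in `FakeClient` so it runs without any API keys; with freelm you would presumably pass `freelm.FreeLLM.from_env` as the factory. The helper `per_thread` is a hypothetical name, not part of the library.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, TypeVar

T = TypeVar("T")


def per_thread(factory: Callable[[], T]) -> Callable[[], T]:
    """Return a getter that builds one object per thread and reuses it afterwards."""
    local = threading.local()

    def get() -> T:
        if not hasattr(local, "client"):
            local.client = factory()   # first call on this thread: create the client
        return local.client

    return get


class FakeClient:  # stand-in for a sync client that is not thread-safe
    pass


get_client = per_thread(FakeClient)


def worker(_: int) -> int:
    return id(get_client())            # identity of the client this thread sees


with ThreadPoolExecutor(max_workers=4) as pool:
    ids = set(pool.map(worker, range(20)))

print(len(ids) <= 4)  # True: at most one client per worker thread
```

Note that each thread then has its own per-key quota and breaker state, so rate-limit tracking is not shared between threads.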
How "always-up" works
- Key pool per provider, round-robined to spread load.
- Failover chain: interleaved across providers (best model of each, then next-best) so every provider is reached fast and none is starved by another provider's many models.
- Circuit breaker per key: opens after repeated failures and half-opens after a cooldown, so a dead key is not hammered.
- Retry classification: `429` → cool the key & rotate; `5xx`/timeout → breaker + backoff; `401`/`403` → disable the key; `4xx` model errors → try another model/provider; other `4xx` → surfaced as a caller bug.
- Quota guard: per-key requests/minute (token bucket) + requests/day counter, so a key predicted to be exhausted is skipped before you waste a call.
- `wait=True` (optional): briefly sleep until a key frees up instead of failing, bounded by `max_wait`.
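The per-key circuit breaker from the list above can be sketched as a three-state machine. This is a generic illustration of the pattern, not freelm's code: the class name, the `threshold` and `cooldown` defaults, and the injectable `clock` are all assumptions.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """closed -> open after `threshold` failures in a row -> half-open after `cooldown` s."""

    def __init__(self, threshold: int = 3, cooldown: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold, self.cooldown, self.clock = threshold, cooldown, clock
        self.failures = 0
        self.opened_at: Optional[float] = None   # None while closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown:
            return "half-open"
        return "open"

    def allow(self) -> bool:
        return self.state != "open"        # half-open lets one trial request through

    def record_success(self) -> None:
        self.failures, self.opened_at = 0, None

    def record_failure(self) -> None:
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.threshold:
            self.opened_at = self.clock()  # (re)open and restart the cooldown


now = [0.0]                                # fake clock so the demo needs no sleeping
breaker = CircuitBreaker(threshold=2, cooldown=10.0, clock=lambda: now[0])
breaker.record_failure()
breaker.record_failure()
print(breaker.state)   # open
now[0] = 10.0
print(breaker.state)   # half-open
breaker.record_success()
print(breaker.state)   # closed
```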
Inspect live state any time:
for row in llm.health():
    print(row)  # provider, key (masked), ready, breaker, rpd_used, last_error, latency
Roadmap
- v1.1 — streaming (SSE normalization across providers)
- v1.2 — persistent quota tracking (sqlite/json) + tighter tier pacing
- v1.3 — tool / function-calling normalization
- v2 — embeddings, vision; JS/TS and Go ports
FAQ
How do I use free LLMs in Python?
Install freelm, set one or more free API keys (OpenRouter, Google AI Studio, NVIDIA NIM, Groq, Cerebras, or Mistral) as environment variables, and call freelm.FreeLLM.from_env().text("..."). freelm picks an available free model and handles rate limits and failover automatically.
How do I fall back between OpenRouter, Gemini, and NVIDIA NIM?
Pass several providers to FreeLLM([...]). On a rate limit (429), dead key (401), or server error, freelm rotates keys and fails over to the next provider — interleaved so every provider is reached quickly instead of stalling on one.
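The status-code handling described in "How always-up works" can be written as a small classification function. The sketch below mirrors that list; the `Action` enum, the `classify` helper, and the `model_error` flag (how a 4xx is recognised as a model problem is not specified here) are hypothetical.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    COOL_AND_ROTATE = "cool the key and rotate"
    BREAKER_AND_BACKOFF = "trip the breaker and back off"
    DISABLE_KEY = "disable the key"
    NEXT_MODEL = "try another model or provider"
    RAISE = "surface to the caller"


def classify(status: Optional[int], model_error: bool = False) -> Action:
    """`status=None` stands for a timeout; `model_error` marks a 4xx about the model itself."""
    if status is None or 500 <= status < 600:
        return Action.BREAKER_AND_BACKOFF
    if status == 429:
        return Action.COOL_AND_ROTATE
    if status in (401, 403):
        return Action.DISABLE_KEY
    if 400 <= status < 500 and model_error:
        return Action.NEXT_MODEL
    return Action.RAISE                    # e.g. a malformed request: a caller bug


for status in (429, 503, None, 401, 400):
    print(status, "->", classify(status).value)
```

Only the last branch reaches the caller directly; the others are the retryable or key-level cases that the client is described as handling internally.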
Is there an OpenAI-compatible free LLM client?
Yes — from freelm.compat import OpenAI is a drop-in for the OpenAI SDK (client.chat.completions.create(...)), backed by free providers.
How do I avoid free-tier rate limits?
freelm paces each key with a requests-per-minute token bucket plus a daily counter and skips keys predicted to be exhausted. Add more keys or providers to raise total throughput.
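A requests-per-minute token bucket combined with a daily counter can be sketched as follows. `QuotaGuard` is a hypothetical name and the limits are illustrative; the daily reset is left out for brevity, so this is not a complete pacing implementation.

```python
import time
from typing import Callable


class QuotaGuard:
    """Requests-per-minute token bucket plus a requests-per-day counter for one key."""

    def __init__(self, rpm: int, rpd: int, clock: Callable[[], float] = time.monotonic) -> None:
        self.rpm, self.rpd, self.clock = rpm, rpd, clock
        self.tokens = float(rpm)           # start with a full bucket
        self.used_today = 0                # note: never reset in this sketch
        self.last = clock()

    def _refill(self) -> None:
        now = self.clock()
        # Tokens trickle back at rpm/60 per second, capped at the bucket size.
        self.tokens = min(float(self.rpm), self.tokens + (now - self.last) * self.rpm / 60.0)
        self.last = now

    def try_acquire(self) -> bool:
        """Return True and spend one request if both limits allow it."""
        self._refill()
        if self.tokens < 1.0 or self.used_today >= self.rpd:
            return False                   # skip this key; no call is wasted
        self.tokens -= 1.0
        self.used_today += 1
        return True


now = [0.0]                                # fake clock so the demo needs no sleeping
guard = QuotaGuard(rpm=2, rpd=3, clock=lambda: now[0])
print(guard.try_acquire(), guard.try_acquire(), guard.try_acquire())  # True True False
now[0] = 30.0                              # half a minute later: one token refilled
print(guard.try_acquire())                 # True
now[0] = 60.0
print(guard.try_acquire())                 # False: daily counter reached rpd=3
```

The check happens locally before any HTTP request is made, which is what lets an exhausted key be skipped instead of burning a call on a guaranteed 429.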
Which free LLM models are available right now?
Free model IDs change constantly, so freelm discovers them live from the provider API and caches them. Run from freelm import list_free_models; list_free_models() for the current list.
Is freelm really free?
freelm itself is MIT-licensed and free. It runs on providers' free tiers; the actual request limits depend on each provider's free quota.
License
MIT © Shahriar Labs
Free-tier model lists change often; `freelm` discovers OpenRouter models live and caches them, so you rarely touch the hardcoded list. Tier rate-limit numbers are still heuristic defaults; override `rpm`/`rpd`/`tier` as providers evolve.