Official Python SDK for FlexInference - a deadline-aware, OpenAI-compatible inference router.

FlexInference (Python)

The official Python SDK for FlexInference - a deadline-aware inference router across OpenAI, Google Gemini, and Anthropic. Send the OpenAI requests you already send, bring your own provider key, and set one required field - start_within - to trade latency for cost. Four caller formats are supported: responses, chat.completions, interactions (Gemini shape), and messages (Anthropic shape) - any of them reaches any provider.

pip install flexinference

Quickstart

from flexinference import FlexInference, output_text

client = FlexInference(api_key="flex_live_...")

res = client.responses.create({
    "model": "gpt-5.5",
    "input": "Write a haiku about cheap GPUs.",
    "start_within": "00h-00m-30s",
})

print(output_text(res))

Responses come back as the raw OpenAI JSON (we never reshape the body), so there is no output_text field on the wire - that is computed by OpenAI's own SDKs. output_text(res) pulls the assistant's text out of either a response or a chat completion for you.
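As a rough illustration of what such a helper has to do, the sketch below walks the two raw JSON shapes. It is not the SDK's implementation: the name extract_text is hypothetical, and the field names are taken from OpenAI's published Responses and Chat Completions shapes.

```python
from typing import Any


def extract_text(res: dict[str, Any]) -> str:
    """Hypothetical stand-in for output_text(); not the SDK's actual code."""
    # Chat Completions shape: choices[0].message.content
    if "choices" in res:
        choices = res["choices"]
        if not choices:
            return ""
        return choices[0].get("message", {}).get("content") or ""

    # Responses shape: output[] -> "message" items -> content[] -> "output_text" parts
    parts: list[str] = []
    for item in res.get("output", []):
        if item.get("type") != "message":
            continue  # skip reasoning items, tool calls, and so on
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                parts.append(part.get("text", ""))
    return "".join(parts)
```

In real code, prefer the SDK's own output_text(res); the sketch only shows why the text is not a single top-level field.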

start_within is required on every request. It takes "default", "priority", "auto", or a duration in the form "HHh-MMm-SSs" between 5 seconds and 10 minutes. A duration races OpenAI's flex tier on a flex-capable model and falls back to the standard tier if the request can't start in time; "default", "priority", and "auto" map to the OpenAI service tiers of the same name and proxy any model. See the docs.
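If your deadlines live in code as plain seconds, a small formatter can build the duration string. This is a minimal sketch: start_within_from_seconds is a hypothetical helper, not part of the SDK, and it assumes the 5s and 10m bounds are inclusive.

```python
def start_within_from_seconds(seconds: int) -> str:
    """Format whole seconds as an "HHh-MMm-SSs" duration string."""
    # Assumption: the documented 5s-10m range includes both endpoints.
    if not 5 <= seconds <= 600:
        raise ValueError("duration must be between 5 seconds and 10 minutes")
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}h-{minutes:02d}m-{secs:02d}s"
```

For example, start_within_from_seconds(30) gives "00h-00m-30s", the value used in the Quickstart.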

Providers (OpenAI, Gemini, and Anthropic)

FlexInference routes to OpenAI, Google Gemini, and Anthropic. Send the same OpenAI-shaped request and pass whichever model id you want - gpt-5.5, o4-mini, gemini-3.5-flash, claude-opus-4-8, and so on. We translate Gemini and Anthropic to and from the OpenAI shape, so your code is identical for all three.

OpenAI: default (standard tier), priority, auto, and the flex race (a duration) on flex-capable models.
Gemini: default (Gemini's standard tier), priority, and the flex race on the Gemini flex models (gemini-3.5-flash, gemini-3.1-flash-lite, gemini-3.1-pro-preview, gemini-3-flash-preview, gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite). Gemini has no auto tier, so start_within="auto" on a Gemini model returns 400.
Anthropic (Claude): proxy-only. default, priority, and auto work; there is no flex race, so a duration start_within on a claude-* model returns 400 flex_unsupported_for_anthropic. Anthropic requires a token cap, so set max_output_tokens (max_completion_tokens on Chat, max_tokens on Messages) or you get 400 missing_max_tokens. You keep the unified API and tier control, and draw down your own Anthropic credits.
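The per-provider rules above can be sketched as a local pre-flight check. This is an illustration only, not SDK code: expected_rejection is a hypothetical name, and detecting Gemini models by a "gemini-" prefix is an assumption (the claude-* pattern comes from the text above).

```python
import re
from typing import Optional

DURATION = re.compile(r"^\d{2}h-\d{2}m-\d{2}s$")
TOKEN_CAP_FIELDS = ("max_output_tokens", "max_completion_tokens", "max_tokens")


def expected_rejection(body: dict) -> Optional[str]:
    """Return why the router would answer 400, or None if no rule above applies."""
    model = body.get("model", "")
    start_within = body.get("start_within", "")
    is_duration = bool(DURATION.match(start_within))

    # Assumption: Gemini model ids start with "gemini-".
    if model.startswith("gemini-") and start_within == "auto":
        return "400: Gemini has no auto tier"

    if model.startswith("claude-"):
        if is_duration:
            return "400 flex_unsupported_for_anthropic"
        if not any(field in body for field in TOKEN_CAP_FIELDS):
            return "400 missing_max_tokens"

    return None
```

A check like this is optional; the router remains the source of truth and returns the same errors if you skip it.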

Add the provider key you'll use (OpenAI, Gemini, and/or Anthropic) in the dashboard. Text, streaming, structured outputs, function calling, image input, and web search work across providers (send a Responses web_search tool; we map it to Gemini's google_search).

Don't send service_tier - the router controls the tier from start_within and rejects a caller-supplied service_tier with 400 service_tier_not_allowed.
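When porting request bodies that already set service_tier for OpenAI, one option is to drop the field before sending. A minimal sketch, with without_service_tier as a hypothetical helper name:

```python
def without_service_tier(body: dict) -> dict:
    """Return a copy of the request body with any service_tier removed."""
    return {key: value for key, value in body.items() if key != "service_tier"}
```

The original dict is left untouched, so the same body can still be sent to OpenAI directly elsewhere in your code.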

Streaming

stream = client.responses.create(
    {"model": "gpt-5-nano", "input": "Count to ten.", "start_within": "00h-00m-20s"},
    stream=True,
)
for event in stream:
    if event.get("type") == "response.output_text.delta":
        print(event["delta"], end="")
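To keep the full text instead of printing deltas as they arrive, the same loop can accumulate them. A minimal sketch, assuming events are dicts shaped like the ones in the loop above; collect_text is a hypothetical helper, not an SDK function.

```python
from typing import Any, Iterable


def collect_text(events: Iterable[dict[str, Any]]) -> str:
    """Join every response.output_text.delta event into one string."""
    chunks: list[str] = []
    for event in events:
        if event.get("type") == "response.output_text.delta":
            chunks.append(event["delta"])
    return "".join(chunks)
```

Any iterable of events works, so collect_text(stream) can be called on the stream returned with stream=True.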

Chat Completions

res = client.chat.completions.create({
    "model": "gpt-5.5",
    "messages": [{"role": "user", "content": "Hello!"}],
    "start_within": "default",
})
print(res["choices"][0]["message"]["content"])

Interactions (Gemini shape)

Speak Google's Interactions shape and reach any model. interaction_output_text(res) pulls the assistant text out of the interaction's steps.

from flexinference import interaction_output_text

res = client.interactions.create({
    "model": "gemini-3.5-flash",
    "input": "Summarize this contract.",
    "start_within": "00h-01m-00s",
})
print(interaction_output_text(res))

Messages (Anthropic shape)

Speak Anthropic's Messages shape and reach any model. max_tokens is required (Anthropic requires it). message_output_text(res) pulls the assistant text out of the message content.

from flexinference import message_output_text

res = client.messages.create({
    "model": "claude-opus-4-8",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Summarize this contract."}],
    "start_within": "default",
})
print(message_output_text(res))
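For orientation, a Messages-shaped response carries its text in a list of content blocks. The sketch below shows one way to read them; message_text is a hypothetical stand-in, not the SDK's message_output_text, and the block layout is taken from Anthropic's published Messages shape.

```python
from typing import Any


def message_text(res: dict[str, Any]) -> str:
    """Concatenate the text blocks of a Messages-shaped response."""
    return "".join(
        block.get("text", "")
        for block in res.get("content", [])
        if block.get("type") == "text"  # skip tool_use and other block types
    )
```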

Closing the client

The client holds a pooled httpx.Client, so close it when you're done to release connections. Use it as a context manager:

with FlexInference(api_key="flex_live_...") as client:
    res = client.responses.create({"model": "gpt-5.5", "input": "Hi.", "start_within": "default"})
    print(output_text(res))
# connections are released on exit

Or close it yourself:

client = FlexInference(api_key="flex_live_...")
try:
    ...
finally:
    client.close()

Request validation

Before a request leaves your machine, the SDK validates the parts it owns. start_within is required and must be "default", "priority", "auto", or a duration "HHh-MMm-SSs" between 5s and 10m; model and input/messages must be present. A missing or invalid value raises a ValueError locally, before any network round trip that would otherwise end in a 400:

client.responses.create({"model": "gpt-5.5", "input": "hi"})
# ValueError: Invalid request body:
#   Missing required parameter: `start_within`. Set it to "default", "priority", "auto", or a duration "HHh-MMm-SSs".

Validation is request-only. Unknown fields pass straight through to the provider (so new OpenAI parameters keep working), and responses are never validated or reshaped.
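The rules above can be sketched as a small checker. This is an illustration of the described behavior, not the SDK's validator: check_body and its messages are hypothetical, and the 5s and 10m bounds are assumed to be inclusive.

```python
import re

_DURATION = re.compile(r"^(\d{2})h-(\d{2})m-(\d{2})s$")
_TIERS = {"default", "priority", "auto"}


def check_body(body: dict, content_field: str = "input") -> None:
    """Raise ValueError if a required field is missing or start_within is invalid."""
    problems = []
    if not body.get("model"):
        problems.append("Missing required parameter: `model`.")
    if content_field not in body:
        problems.append(f"Missing required parameter: `{content_field}`.")

    start_within = body.get("start_within")
    if start_within is None:
        problems.append("Missing required parameter: `start_within`.")
    elif not isinstance(start_within, str):
        problems.append("`start_within` must be a string.")
    elif start_within not in _TIERS:
        match = _DURATION.match(start_within)
        if match is None:
            problems.append("Invalid `start_within` value.")
        else:
            hours, minutes, seconds = map(int, match.groups())
            total = hours * 3600 + minutes * 60 + seconds
            if not 5 <= total <= 600:
                problems.append("`start_within` duration must be between 5s and 10m.")

    # Unknown fields are deliberately not inspected, so they pass through.
    if problems:
        raise ValueError("Invalid request body:\n  " + "\n  ".join(problems))
```

Pass content_field="messages" for Chat Completions or Messages bodies.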

Errors

Non-2xx responses raise FlexInferenceError, carrying the OpenAI-shaped status, type, code, and param:

from flexinference import FlexInferenceError

try:
    client.responses.create({"model": "gpt-5.5", "input": "hi", "start_within": "priority"})
except FlexInferenceError as err:
    if err.code == "no_byok_key":
        print("Add your OpenAI key in the dashboard.")
    else:
        raise
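Several router error codes appear in this README, and a lookup table is one way to turn them into user-facing hints. A minimal sketch: hint_for and the hint wording are hypothetical, and the only thing assumed about the error object is the code attribute described above.

```python
from typing import Optional

# Codes are the ones named in this README; the hint text is illustrative.
HINTS = {
    "no_byok_key": "Add your provider key in the dashboard.",
    "service_tier_not_allowed": "Remove service_tier; the router sets the tier from start_within.",
    "flex_unsupported_for_anthropic": "Use default, priority, or auto on claude-* models.",
    "missing_max_tokens": "Set a token cap such as max_output_tokens or max_tokens.",
}


def hint_for(err: object) -> Optional[str]:
    """Return a hint for a known error code, or None so the caller can re-raise."""
    return HINTS.get(getattr(err, "code", None))
```

Inside an except FlexInferenceError as err block, print the hint when hint_for(err) returns one and re-raise otherwise.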

Billing / 402

If your account's billing is past due, the router pauses billable flex and returns 402 Payment Required on those requests; free routing keeps working. The SDK raises a typed PaymentRequiredError (a subclass of FlexInferenceError) for HTTP 402, so you can catch it on its own and prompt the user to update payment while letting other errors propagate:

from flexinference import FlexInferenceError, PaymentRequiredError

try:
    client.responses.create({"model": "gpt-5.5", "input": "hi", "start_within": "00h-00m-30s"})
except PaymentRequiredError:
    print("Billing is past due - update payment in the dashboard to resume flex.")
except FlexInferenceError:
    raise

Because PaymentRequiredError subclasses FlexInferenceError, existing except FlexInferenceError handlers keep catching 402s too.
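The subclass relationship means handler order matters: the more specific except clause has to come first. The sketch below demonstrates this with stand-in classes (RouterError and BillingError are hypothetical placeholders for FlexInferenceError and PaymentRequiredError) so it runs without the SDK or a network call.

```python
class RouterError(Exception):
    """Stand-in for FlexInferenceError."""


class BillingError(RouterError):
    """Stand-in for PaymentRequiredError (a subclass of the base error)."""


def classify(exc: Exception) -> str:
    """Show which handler catches a given error."""
    try:
        raise exc
    except BillingError:
        return "billing"  # specific handler first
    except RouterError:
        return "router"   # the base handler catches everything else
```

If the two except clauses were swapped, the base handler would catch the billing error and the specific branch would never run.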

Configuration

Argument	Default	Description
`api_key`	(required)	Your `flex_live_` key.
`base_url`	`https://api.flexinference.com/v1`	Override the router endpoint.
`client`	`httpx.Client` (600s read, 10s connect)	Provide your own `httpx.Client`.
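One common pattern is to read these arguments from the environment rather than hard-coding them. A minimal sketch: the variable names FLEXINFERENCE_API_KEY and FLEXINFERENCE_BASE_URL and the helper client_kwargs are assumptions for illustration, not names the SDK reads on its own.

```python
import os
from typing import Mapping


def client_kwargs(env: Mapping[str, str] = os.environ) -> dict:
    """Build constructor arguments from environment variables (hypothetical names)."""
    api_key = env.get("FLEXINFERENCE_API_KEY")
    if not api_key:
        raise RuntimeError("FLEXINFERENCE_API_KEY is not set")
    kwargs = {"api_key": api_key}
    base_url = env.get("FLEXINFERENCE_BASE_URL")
    if base_url:
        # Only override the router endpoint when one is explicitly provided.
        kwargs["base_url"] = base_url
    return kwargs
```

The result can then be passed to the constructor as FlexInference(**client_kwargs()).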

License

MIT

Release history

1.4.1 - Jul 3, 2026
1.3.0 - Jul 2, 2026 (this version)
1.2.0 - Jul 1, 2026
1.1.0 - Jun 30, 2026
1.0.1 - Jun 28, 2026
1.0.0 - Jun 28, 2026
0.1.0 - Jun 27, 2026

Download files

Source Distribution

flexinference-1.3.0.tar.gz (51.2 kB)

Uploaded Jul 2, 2026 Source

Built Distribution

flexinference-1.3.0-py3-none-any.whl (22.5 kB)

Uploaded Jul 2, 2026 Python 3

File details

Details for the file flexinference-1.3.0.tar.gz.

File metadata

Download URL: flexinference-1.3.0.tar.gz
Upload date: Jul 2, 2026
Size: 51.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.15

File hashes

Hashes for flexinference-1.3.0.tar.gz
Algorithm	Hash digest
SHA256	`f078632689b914c8e777bf4fd402a805b71f5417d9ffa31ed348e82444941090`
MD5	`adfddd530d96ede912de775d4c4f655f`
BLAKE2b-256	`5253a05805db5cf1ad9c98e2b6343b006cddad963225570d5ce94ca092672c1e`

File details

Details for the file flexinference-1.3.0-py3-none-any.whl.

File metadata

Download URL: flexinference-1.3.0-py3-none-any.whl
Upload date: Jul 2, 2026
Size: 22.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.15

File hashes

Hashes for flexinference-1.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e4cf979c9a6428aa3da4329a9817730d40d5519e16034872b5534442813c630a`
MD5	`a4bae3556211d45d14c516fff7bb4b09`
BLAKE2b-256	`3e6f89ba6fee232df36781b60bc0f592ab81c746f17a6d4037de3b0840b124f3`
