
forge-cloud

OpenAI-compatible reasoning-aware inference proxy for Qwen3.6.

Point your OpenAI client at forge-cloud instead of directly at vLLM/SGLang/Ollama. The proxy selects a thinking mode based on query complexity, swaps sampling parameters to match the mode, normalizes backend-specific thinking flags, and tags responses with routing metadata.

What it does

  1. Receives a standard /v1/chat/completions request
  2. Classifies query complexity (simple/moderate/complex)
  3. Decides thinking mode (think vs no_think) with correct sampling params
  4. Normalizes the enable_thinking flag for the target backend (vLLM nested, DashScope top-level, llama.cpp server-side)
  5. Forwards to the user's configured backend
  6. Tags the response with routing metadata and estimated token split (thinking vs response)

The proxy does not run inference itself; it configures and monitors the backend that does.
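The routing in steps 2-4 can be sketched as follows. The complexity heuristic and sampling values below are illustrative placeholders (the real classifier lives in qwen-think, and the "direct" profile name is invented here); only the nested vs. top-level flag handling mirrors the backend differences described above.

```python
# Illustrative routing sketch -- not forge-cloud's actual implementation.
SAMPLING = {
    "thinking": {"temperature": 0.6, "top_p": 0.95},  # placeholder values
    "direct": {"temperature": 0.7, "top_p": 0.8},     # placeholder values
}

def classify_complexity(prompt: str) -> str:
    """Toy word-count stand-in for the qwen-think classifier."""
    words = len(prompt.split())
    if words < 8:
        return "simple"
    return "moderate" if words < 40 else "complex"

def route(prompt: str, backend: str) -> tuple[str, str, dict]:
    complexity = classify_complexity(prompt)
    mode = "no_think" if complexity == "simple" else "think"
    profile = "thinking" if mode == "think" else "direct"
    params = dict(SAMPLING[profile])
    enable = mode == "think"
    if backend == "vllm":
        # vLLM: flag nested under chat_template_kwargs
        params["chat_template_kwargs"] = {"enable_thinking": enable}
    elif backend == "dashscope":
        # DashScope: flag as a top-level request field
        params["enable_thinking"] = enable
    # llama.cpp: thinking is toggled server-side, so nothing is added
    return mode, complexity, params
```

The routed parameters are then merged into the request before it is forwarded to the configured backend.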

Install

pip install forge-cloud

Quick start

# Set admin key and backend URL
export FORGE_ADMIN_KEY=my-secret
export FORGE_BACKEND_URL=http://localhost:8000
export FORGE_BACKEND_TYPE=vllm

# Start the proxy
forge-cloud

Create an API key:

curl -X POST http://localhost:8741/v1/keys \
  -H "Authorization: Bearer my-secret" \
  -H "Content-Type: application/json" \
  -d '{"name": "my-app"}'
# Returns: {"key": "fk-...", "name": "my-app", "tier": "free", ...}

Use it like any OpenAI endpoint:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8741/v1",
    api_key="fk-..."  # key from above
)

response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=[{"role": "user", "content": "refactor this module"}],
)
print(response.choices[0].message.content)

Response metadata

Every response includes a forge field with routing metadata and estimated token counts:

{
  "id": "chatcmpl-test",
  "choices": ["..."],
  "usage": {"...": 0},
  "forge": {
    "thinking_mode": "think",
    "complexity": "complex",
    "backend": "vllm",
    "sampling_profile": "thinking",
    "thinking_tokens": 450,
    "response_tokens": 120
  }
}
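Since forge is an extra field on an otherwise standard response, you can pull it off the parsed JSON directly. A minimal sketch (the thinking_share helper is illustrative, not part of forge-cloud):

```python
# Hypothetical helper: what fraction of generated tokens went to thinking.
def thinking_share(response: dict) -> float:
    forge = response.get("forge", {})
    thinking = forge.get("thinking_tokens", 0)
    total = thinking + forge.get("response_tokens", 0)
    return thinking / total if total else 0.0

resp = {
    "id": "chatcmpl-test",
    "forge": {"thinking_mode": "think", "thinking_tokens": 450, "response_tokens": 120},
}
print(round(thinking_share(resp), 3))  # 0.789
```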

Endpoints

Method  Path                  Description
POST    /v1/chat/completions  Proxied chat completion with forge routing
POST    /v1/keys              Create API key (admin auth required)
GET     /health               Proxy health check

Configuration

All settings are environment variables with FORGE_ prefix:

Variable                Default                Description
FORGE_HOST              0.0.0.0                Bind address
FORGE_PORT              8741                   Port
FORGE_BACKEND_URL       http://localhost:8000  Default backend URL
FORGE_BACKEND_TYPE      vllm                   Backend type: vllm, sglang, dashscope, llamacpp
FORGE_FREE_DAILY_LIMIT  1000                   Free tier requests per day
FORGE_ADMIN_KEY         (empty)                Admin key for creating API keys
FORGE_DB_PATH           forge.db               SQLite database path
FORGE_REQUEST_TIMEOUT   120.0                  Backend request timeout (seconds)
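The resolution rule is the usual one: an environment value wins, otherwise the documented default applies. A minimal sketch (the setting helper and DEFAULTS table are illustrative; forge-cloud's own settings code may differ):

```python
import os

# Defaults copied from the configuration table above.
DEFAULTS = {
    "FORGE_HOST": "0.0.0.0",
    "FORGE_PORT": "8741",
    "FORGE_BACKEND_URL": "http://localhost:8000",
    "FORGE_BACKEND_TYPE": "vllm",
    "FORGE_FREE_DAILY_LIMIT": "1000",
    "FORGE_ADMIN_KEY": "",
    "FORGE_DB_PATH": "forge.db",
    "FORGE_REQUEST_TIMEOUT": "120.0",
}

def setting(name: str) -> str:
    """Environment value wins; otherwise the documented default."""
    return os.environ.get(name, DEFAULTS[name])
```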

Per-key backend override

Each API key can have its own backend URL and type:

curl -X POST http://localhost:8741/v1/keys \
  -H "Authorization: Bearer my-secret" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "sglang-user",
    "tier": "paid",
    "backend_url": "http://sglang-server:30000",
    "backend_type": "sglang"
  }'

Tiers

  • Free: 1,000 requests/day, single backend target
  • Paid: no rate limit, per-key backend routing

Streaming

Streaming is supported. Set stream: true in the request and the proxy forwards the SSE stream from the backend.
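The relayed stream is standard OpenAI-style server-sent events: one data: line per JSON chunk, terminated by data: [DONE]. A minimal sketch of consuming such a stream (this parser is illustrative, not part of forge-cloud; the sample chunks are made up):

```python
import json

def iter_sse_chunks(lines):
    """Yield parsed JSON chunks from an OpenAI-style SSE stream."""
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip comments and keep-alives
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)

sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
text = "".join(c["choices"][0]["delta"]["content"] for c in iter_sse_chunks(sample))
print(text)  # Hello
```

In practice the OpenAI SDK handles this for you: pass stream=True to client.chat.completions.create and iterate the returned chunks.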

Dependencies

  • qwen-think -- thinking session manager (routing, budget, sampling)
  • FastAPI + uvicorn
  • httpx -- async HTTP client for backend forwarding
  • aiosqlite -- async SQLite for API keys and usage tracking

License

Apache-2.0

