
forge-cloud

OpenAI-compatible reasoning-aware inference proxy for Qwen3.6.

Point your OpenAI client at forge-cloud instead of directly at vLLM/SGLang/Ollama. The proxy routes thinking mode based on query complexity, swaps sampling parameters to match the mode, normalizes backend flags, and tags responses with routing metadata.

What it does

  1. Receives a standard /v1/chat/completions request
  2. Classifies query complexity (simple/moderate/complex)
  3. Decides thinking mode (think vs no_think) with correct sampling params
  4. Normalizes the enable_thinking flag for the target backend (vLLM nested, DashScope top-level, llama.cpp server-side); a sketch of this step follows below
  5. Forwards to the user's configured backend
  6. Tags the response with routing metadata and estimated token split (thinking vs response)

The proxy does not run inference. It configures and monitors it.
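
Step 4 is where backends diverge. Here is a minimal sketch of that normalization, assuming a hypothetical normalize_thinking_flag helper; the payload shapes are illustrative, not forge-cloud's actual internals:

def normalize_thinking_flag(payload: dict, backend_type: str, enable: bool) -> dict:
    """Place the thinking switch where each backend expects it."""
    if backend_type == "vllm":
        # vLLM-style: nested under chat_template_kwargs
        payload.setdefault("chat_template_kwargs", {})["enable_thinking"] = enable
    elif backend_type == "dashscope":
        # DashScope-style: a top-level request field
        payload["enable_thinking"] = enable
    elif backend_type == "llamacpp":
        # llama.cpp decides thinking server-side, so nothing to inject
        pass
    return payload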

Install

pip install forge-cloud

Quick start

# Set admin key and backend URL
export FORGE_ADMIN_KEY=my-secret
export FORGE_BACKEND_URL=http://localhost:8000
export FORGE_BACKEND_TYPE=vllm

# Start the proxy
forge-cloud

Create an API key:

curl -X POST http://localhost:8741/v1/keys \
  -H "Authorization: Bearer my-secret" \
  -H "Content-Type: application/json" \
  -d '{"name": "my-app"}'
# Returns: {"key": "fk-...", "name": "my-app", "tier": "free", ...}

Use it like any OpenAI endpoint:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8741/v1",
    api_key="fk-..."  # key from above
)

response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=[{"role": "user", "content": "refactor this module"}],
)
print(response.choices[0].message.content)

Response metadata

Every response includes a forge field with routing metadata and estimated token counts:

{
  "id": "chatcmpl-test",
  "choices": ["..."],
  "usage": {"...": 0},
  "forge": {
    "thinking_mode": "think",
    "complexity": "complex",
    "backend": "vllm",
    "sampling_profile": "thinking",
    "thinking_tokens": 450,
    "response_tokens": 120
  }
}
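
The typed OpenAI client may drop fields it does not model, so the simplest way to inspect the forge block is a raw HTTP call. A sketch using httpx (the key and model values are placeholders):

import httpx

resp = httpx.post(
    "http://localhost:8741/v1/chat/completions",
    headers={"Authorization": "Bearer fk-..."},
    json={
        "model": "Qwen/Qwen3.6-35B-A3B",
        "messages": [{"role": "user", "content": "refactor this module"}],
    },
    timeout=120.0,
)
forge = resp.json()["forge"]
print(forge["thinking_mode"], forge["complexity"], forge["thinking_tokens"])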

Endpoints

Method  Path                  Description
POST    /v1/chat/completions  Proxied chat completion with forge routing
POST    /v1/keys              Create API key (admin auth required)
GET     /health               Proxy health check
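
/health makes a convenient liveness probe. A minimal check (the response body is not documented here, so only the status code is asserted):

import httpx

assert httpx.get("http://localhost:8741/health").status_code == 200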

Configuration

All settings are environment variables with the FORGE_ prefix:

Variable                Default                Description
FORGE_HOST              0.0.0.0                Bind address
FORGE_PORT              8741                   Port
FORGE_BACKEND_URL       http://localhost:8000  Default backend URL
FORGE_BACKEND_TYPE      vllm                   Backend type: vllm, sglang, dashscope, llamacpp
FORGE_FREE_DAILY_LIMIT  1000                   Free tier requests per day
FORGE_ADMIN_KEY         (empty)                Admin key for creating API keys
FORGE_DB_PATH           forge.db               SQLite database path
FORGE_REQUEST_TIMEOUT   120.0                  Backend request timeout (seconds)

Per-key backend override

Each API key can have its own backend URL and type:

curl -X POST http://localhost:8741/v1/keys \
  -H "Authorization: Bearer my-secret" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "sglang-user",
    "tier": "paid",
    "backend_url": "http://sglang-server:30000",
    "backend_type": "sglang"
  }'

Tiers

  • Free: 1,000 requests/day, single backend target
  • Paid: no rate limit, per-key backend routing

Streaming

Streaming is supported. Set stream: true in the request and the proxy forwards the SSE stream from the backend.
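
With the client from the quick start, the standard OpenAI streaming pattern applies unchanged:

stream = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=[{"role": "user", "content": "summarize this module"}],
    stream=True,
)
for chunk in stream:
    # Some chunks carry no content delta (e.g. the final usage chunk)
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)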

Dependencies

  • qwen-think -- thinking session manager (routing, budget, sampling)
  • FastAPI + uvicorn
  • httpx -- async HTTP client for backend forwarding
  • aiosqlite -- async SQLite for API keys and usage tracking

License

Apache-2.0
