# forge-cloud

OpenAI-compatible reasoning-aware inference proxy for Qwen3.6.
Point your OpenAI client at forge-cloud instead of directly at vLLM/SGLang/DashScope/llama.cpp. The proxy routes thinking mode based on query complexity, swaps sampling parameters to match the mode, normalizes backend flags, and tags responses with routing metadata.
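As a rough sketch of that routing step (the classifier and keyword thresholds below are hypothetical, not forge-cloud's actual logic; the sampling values follow Qwen's published thinking/non-thinking recommendations):

```python
# Hypothetical sketch of complexity-based thinking-mode routing.
# The keyword/length heuristic is an assumption, not forge-cloud's
# actual classifier.

def classify_complexity(prompt: str) -> str:
    """Bucket a query as simple / moderate / complex (illustrative only)."""
    hard_markers = ("prove", "refactor", "debug", "optimize", "derive")
    if any(m in prompt.lower() for m in hard_markers):
        return "complex"
    return "moderate" if len(prompt.split()) > 30 else "simple"

def route(prompt: str) -> dict:
    """Map complexity to a thinking mode and a sampling profile."""
    complexity = classify_complexity(prompt)
    if complexity == "complex":
        return {"complexity": complexity, "thinking_mode": "think",
                "sampling": {"temperature": 0.6, "top_p": 0.95}}
    return {"complexity": complexity, "thinking_mode": "no_think",
            "sampling": {"temperature": 0.7, "top_p": 0.8}}

print(route("refactor this module")["thinking_mode"])  # think
```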
## What it does

- Receives a standard `/v1/chat/completions` request
- Classifies query complexity (simple / moderate / complex)
- Decides thinking mode (`think` vs `no_think`) with correct sampling params
- Normalizes the `enable_thinking` flag for the target backend (vLLM nested, DashScope top-level, llama.cpp server-side)
- Forwards to the user's configured backend
- Tags the response with routing metadata and estimated token split (thinking vs response)

The proxy does not run inference. It configures and monitors it.
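The flag normalization can be sketched like this. The field placements per backend follow the list above; the exact request shapes are assumptions, so check each backend's docs before relying on them:

```python
# Sketch of enable_thinking normalization per backend type.
# Field placements are illustrative: vLLM nests the flag under
# chat_template_kwargs, DashScope takes it top-level, and llama.cpp
# decides server-side so the flag is stripped.

def normalize_thinking(body: dict, backend: str, think: bool) -> dict:
    out = dict(body)
    if backend == "vllm":
        out.setdefault("chat_template_kwargs", {})["enable_thinking"] = think
    elif backend == "dashscope":
        out["enable_thinking"] = think
    elif backend == "llamacpp":
        out.pop("enable_thinking", None)  # handled by server flags
    return out

req = normalize_thinking({"model": "qwen"}, "vllm", True)
print(req["chat_template_kwargs"])  # {'enable_thinking': True}
```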
## Install

```bash
pip install forge-cloud
```
## Quick start

```bash
# Set admin key and backend URL
export FORGE_ADMIN_KEY=my-secret
export FORGE_BACKEND_URL=http://localhost:8000
export FORGE_BACKEND_TYPE=vllm

# Start the proxy
forge-cloud
```
Create an API key:

```bash
curl -X POST http://localhost:8741/v1/keys \
  -H "Authorization: Bearer my-secret" \
  -H "Content-Type: application/json" \
  -d '{"name": "my-app"}'
# Returns: {"key": "fk-...", "name": "my-app", "tier": "free", ...}
```
Use it like any OpenAI endpoint:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8741/v1",
    api_key="fk-...",  # key from above
)

response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=[{"role": "user", "content": "refactor this module"}],
)
print(response.choices[0].message.content)
```
## Response metadata

Every response includes a `forge` field with routing metadata and estimated token counts:

```json
{
  "id": "chatcmpl-test",
  "choices": ["..."],
  "usage": {"...": 0},
  "forge": {
    "thinking_mode": "think",
    "complexity": "complex",
    "backend": "vllm",
    "sampling_profile": "thinking",
    "thinking_tokens": 450,
    "response_tokens": 120
  }
}
```
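If you fetch the raw JSON yourself (e.g. with httpx rather than the OpenAI SDK, which may drop fields it doesn't know), the extra field reads directly; this snippet just parses the example above:

```python
import json

# Parse the example response above and pull out the forge metadata.
raw = """
{
  "id": "chatcmpl-test",
  "choices": ["..."],
  "usage": {"...": 0},
  "forge": {
    "thinking_mode": "think",
    "complexity": "complex",
    "backend": "vllm",
    "sampling_profile": "thinking",
    "thinking_tokens": 450,
    "response_tokens": 120
  }
}
"""
forge = json.loads(raw)["forge"]
total = forge["thinking_tokens"] + forge["response_tokens"]
print(f"{forge['thinking_mode']} ({forge['complexity']}): "
      f"{forge['thinking_tokens']}/{total} tokens spent thinking")
# → think (complex): 450/570 tokens spent thinking
```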
## Endpoints

| Method | Path | Description |
|---|---|---|
| POST | `/v1/chat/completions` | Proxied chat completion with forge routing |
| POST | `/v1/keys` | Create API key (admin auth required) |
| GET | `/health` | Proxy health check |
## Configuration

All settings are environment variables with the `FORGE_` prefix:

| Variable | Default | Description |
|---|---|---|
| `FORGE_HOST` | `0.0.0.0` | Bind address |
| `FORGE_PORT` | `8741` | Port |
| `FORGE_BACKEND_URL` | `http://localhost:8000` | Default backend URL |
| `FORGE_BACKEND_TYPE` | `vllm` | Backend type: `vllm`, `sglang`, `dashscope`, `llamacpp` |
| `FORGE_FREE_DAILY_LIMIT` | `1000` | Free tier requests per day |
| `FORGE_ADMIN_KEY` | (empty) | Admin key for creating API keys |
| `FORGE_DB_PATH` | `forge.db` | SQLite database path |
| `FORGE_REQUEST_TIMEOUT` | `120.0` | Backend request timeout (seconds) |
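Environment-over-default resolution for these settings could look like the following sketch (forge-cloud's actual settings loader may differ):

```python
import os

# Defaults mirror the table above; environment values win.
DEFAULTS = {
    "FORGE_HOST": "0.0.0.0",
    "FORGE_PORT": "8741",
    "FORGE_BACKEND_URL": "http://localhost:8000",
    "FORGE_BACKEND_TYPE": "vllm",
    "FORGE_FREE_DAILY_LIMIT": "1000",
    "FORGE_ADMIN_KEY": "",
    "FORGE_DB_PATH": "forge.db",
    "FORGE_REQUEST_TIMEOUT": "120.0",
}

def load_settings() -> dict:
    """Read each FORGE_ variable, falling back to its default,
    and cast the numeric fields."""
    cfg = {k: os.environ.get(k, v) for k, v in DEFAULTS.items()}
    cfg["FORGE_PORT"] = int(cfg["FORGE_PORT"])
    cfg["FORGE_FREE_DAILY_LIMIT"] = int(cfg["FORGE_FREE_DAILY_LIMIT"])
    cfg["FORGE_REQUEST_TIMEOUT"] = float(cfg["FORGE_REQUEST_TIMEOUT"])
    return cfg

print(load_settings()["FORGE_PORT"])  # 8741 unless overridden
```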
## Per-key backend override

Each API key can have its own backend URL and type:

```bash
curl -X POST http://localhost:8741/v1/keys \
  -H "Authorization: Bearer my-secret" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "sglang-user",
    "tier": "paid",
    "backend_url": "http://sglang-server:30000",
    "backend_type": "sglang"
  }'
```
## Tiers

- Free: 1,000 requests/day, single backend target
- Paid: no rate limit, per-key backend routing
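The free-tier daily cap amounts to per-key, per-day bookkeeping. A minimal synchronous sketch (forge-cloud uses aiosqlite; the schema and names here are hypothetical):

```python
import sqlite3
from datetime import date

# Illustrative free-tier daily counter backed by SQLite.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE usage (key TEXT, day TEXT, count INTEGER, "
           "PRIMARY KEY (key, day))")

def allow_request(key: str, tier: str, daily_limit: int = 1000) -> bool:
    if tier == "paid":          # paid tier: no rate limit
        return True
    today = date.today().isoformat()
    row = db.execute("SELECT count FROM usage WHERE key=? AND day=?",
                     (key, today)).fetchone()
    used = row[0] if row else 0
    if used >= daily_limit:
        return False
    # UPSERT: first request inserts, later ones increment the counter
    db.execute("INSERT INTO usage VALUES (?, ?, 1) "
               "ON CONFLICT(key, day) DO UPDATE SET count = count + 1",
               (key, today))
    return True

print(allow_request("fk-demo", "free", daily_limit=2))  # True
```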
## Streaming

Streaming is supported. Set `stream: true` in the request and the proxy forwards the SSE stream from the backend.
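Consuming the forwarded stream works like any OpenAI-compatible SSE stream. As a sketch, the chunks look like this (the payload below is a made-up example, not real proxy output):

```python
import json

# Parse made-up SSE lines shaped like an OpenAI-compatible stream.
stream = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]

text = ""
for line in stream:
    payload = line.removeprefix("data: ")
    if payload == "[DONE]":        # end-of-stream sentinel
        break
    delta = json.loads(payload)["choices"][0]["delta"]
    text += delta.get("content", "")

print(text)  # Hello
```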
## Dependencies

- qwen-think -- thinking session manager (routing, budget, sampling)
- FastAPI + uvicorn
- httpx -- async HTTP client for backend forwarding
- aiosqlite -- async SQLite for API keys and usage tracking
## License

Apache-2.0