Thinking session manager for Qwen3.6: backend normalization, sampling parameter swap, and 128K context budget guard.
Qwen3-Think
The Thinking Session Manager for Qwen3.6 (or any Qwen3+ model that uses enable_thinking) -- SDK + Router + Bug Fixes
Python library for managing Qwen3.6's thinking state across sessions, backends, and frameworks.
What It Solves
- Backend Normalization -- Qwen3.6's enable_thinking flag has three different invocation patterns across backends. This library normalizes them into a single API.
- Sampling Parameter Swap -- Qwen3.6 requires different sampling parameters for thinking vs. non-thinking mode. A router that flips enable_thinking without also swapping params produces silently degraded output.
- Context Budget Guard -- Qwen3.6 advises maintaining at least 128K tokens of context to preserve thinking capabilities. This library tracks and guards against silent degradation.
Installation
pip install qwen-think
# With OpenAI client support:
pip install qwen-think[openai]
Quick Start
from openai import OpenAI
from qwen_think import ThinkingSession
client = OpenAI(base_url="http://localhost:8000/v1", api_key="...")
session = ThinkingSession(client, backend="vllm", budget=200_000)
# Auto-routes: detects complexity, sets mode, swaps sampling
response = session.chat("refactor this module", preserve=True)
The Three Backend Patterns
| Backend | Flag Format | Notes |
|---|---|---|
| vLLM / SGLang | extra_body={"chat_template_kwargs": {"enable_thinking": False}} | Nested |
| DashScope | extra_body={"enable_thinking": False} | Top-level |
| llama.cpp | --chat-template-kwargs '{"enable_thinking": false}' | Server-side only |
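The three patterns in the table can be folded into one helper. The following is a minimal sketch of that idea, not the library's actual implementation: build_extra_body is a hypothetical name, and the backend strings are assumed to match the ones ThinkingSession accepts.

```python
from typing import Any, Dict

# Backends that expect the flag nested under chat_template_kwargs.
NESTED_BACKENDS = {"vllm", "sglang"}


def build_extra_body(backend: str, enable_thinking: bool) -> Dict[str, Any]:
    """Return the extra_body dict for one request (hypothetical helper)."""
    name = backend.lower()
    if name in NESTED_BACKENDS:
        return {"chat_template_kwargs": {"enable_thinking": enable_thinking}}
    if name == "dashscope":
        return {"enable_thinking": enable_thinking}
    if name == "llamacpp":
        # Per the table, llama.cpp is configured server-side via
        # --chat-template-kwargs, so nothing is added per request.
        return {}
    raise ValueError(f"unknown backend: {backend!r}")
```

The result would be passed as the extra_body argument of an OpenAI-compatible chat completion call. Raising on an unknown backend avoids silently sending a request with no flag at all.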
Bug Fixes Included
vLLM Semantic Router (#858)
When use_reasoning: false is configured, the router removes the field instead of explicitly setting enable_thinking: false. Since Qwen3.6 thinks by default, removing the field leaves thinking enabled.
Fix: This library always explicitly sets the boolean -- never omits it.
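The difference between omitting the field and setting it can be shown in a few lines. This is a sketch under assumptions: set_thinking_flag is a hypothetical helper, not a function exported by the library or by the router.

```python
from typing import Any, Dict, Optional


def set_thinking_flag(
    template_kwargs: Optional[Dict[str, Any]], use_reasoning: bool
) -> Dict[str, Any]:
    """Return a copy of template_kwargs with enable_thinking always present."""
    merged = dict(template_kwargs or {})
    # Write the boolean even when it is False: dropping the key would let
    # the model fall back to its default, which is thinking enabled.
    merged["enable_thinking"] = bool(use_reasoning)
    return merged
```

Working on a copy keeps the caller's dict untouched, so the same base kwargs can be reused across requests with different modes.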
Ray Serve (#52979)
enable_thinking: false in the HTTP body does not propagate to the model, so thinking output keeps appearing.
Fix: The normalized payload always includes the explicit flag.
Sampling Parameters
Qwen3.6 requires different sampling params depending on thinking mode:
Thinking mode:
temperature=0.6, top_p=0.95, top_k=20, min_p=0.0,
presence_penalty=1.5, repetition_penalty=1.0
Instruct / non-thinking mode:
temperature=0.7, top_p=0.80, top_k=20, min_p=0.0,
presence_penalty=1.5
A router that flips enable_thinking without atomically swapping these params produces silently degraded output -- not incorrect, just suboptimal in ways that don't surface as errors.
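One way to keep the flag and the sampling parameters from drifting apart is to return them together from a single function. The sketch below uses the two presets listed above; mode_settings and the constant names are hypothetical and are not taken from the library.

```python
from typing import Any, Dict

# Values copied from the two presets listed above.
THINKING_PARAMS = {
    "temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0,
    "presence_penalty": 1.5, "repetition_penalty": 1.0,
}
NON_THINKING_PARAMS = {
    "temperature": 0.7, "top_p": 0.80, "top_k": 20, "min_p": 0.0,
    "presence_penalty": 1.5,
}


def mode_settings(thinking: bool) -> Dict[str, Any]:
    """Return the flag and its sampling preset as a single unit."""
    preset = THINKING_PARAMS if thinking else NON_THINKING_PARAMS
    # Copy the preset so callers cannot mutate the shared defaults.
    return {"enable_thinking": thinking, "sampling": dict(preset)}
```

Because both values come out of one call, a caller cannot change the mode without also picking up the matching preset.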
Context Budget
Qwen3.6 advises maintaining a context length of at least 128K tokens to preserve thinking capabilities. The BudgetManager tracks usage and:
- Warns when approaching the threshold
- Auto-compresses older messages when running low
- Refuses to continue when below the minimum
session = ThinkingSession(client, budget=200_000, min_context=128_000)
status = session.budget_status
# BudgetStatus(total_tokens=200000, used_tokens=10000, available_tokens=190000,
# min_context=128000, action=BudgetAction.OK,
# message="Available: 190,000 of 200,000 tokens.")
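The three behaviours listed above amount to comparing the available tokens against a few thresholds. The following is a rough sketch of that comparison, not the BudgetManager's real logic: the Action names other than OK and both margin values are assumptions made for illustration.

```python
from enum import Enum


class Action(Enum):  # hypothetical names; the source only shows BudgetAction.OK
    OK = "ok"
    WARN = "warn"
    COMPRESS = "compress"
    REFUSE = "refuse"


def check_budget(
    total: int,
    used: int,
    min_context: int = 128_000,
    warn_margin: int = 32_000,      # assumed margin
    compress_margin: int = 8_000,   # assumed margin
) -> Action:
    """Classify the remaining context against the minimum."""
    available = total - used
    if available < min_context:
        return Action.REFUSE
    if available < min_context + compress_margin:
        return Action.COMPRESS
    if available < min_context + warn_margin:
        return Action.WARN
    return Action.OK
```

With a 200,000-token budget and 10,000 tokens used, 190,000 tokens remain, which is above every threshold here, so the sketch returns Action.OK, matching the status shown above.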
Thinking Preservation
Qwen3.6 introduces preserve_thinking -- a feature that retains thinking context across conversation history, improving reasoning quality for iterative development.
session = ThinkingSession(client, preserve_thinking=True)
# Thinking content is cached and included in subsequent turns
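Conceptually, preservation is a choice about what goes back into the message history. The sketch below illustrates that choice under assumptions: build_history is a hypothetical helper, and the "reasoning_content" key is an assumed place where a backend might return thinking text, not something the source confirms.

```python
from typing import Dict, List

Message = Dict[str, str]


def build_history(turns: List[Message], preserve_thinking: bool) -> List[Message]:
    """Copy the history, keeping or dropping cached thinking text.

    Assumes an assistant turn may carry its thinking under a
    "reasoning_content" key; that key name is an assumption.
    """
    history = []
    for turn in turns:
        msg = dict(turn)  # shallow copy so the cached turns stay intact
        if not preserve_thinking:
            msg.pop("reasoning_content", None)
        history.append(msg)
    return history
```

Keeping the cached turns unmodified means the same conversation can be replayed with or without thinking, depending on the mode chosen for the next request.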
Complexity Router
The router classifies query complexity and selects the appropriate mode:
| Complexity | Mode | Preserved | Use Case |
|---|---|---|---|
| SIMPLE | NO_THINK | No | "What is X?" |
| MODERATE | THINK | No | Multi-sentence reasoning |
| COMPLEX | THINK + preserve | Yes | Coding, debugging, refactoring |
from qwen_think import ComplexityRouter
router = ComplexityRouter()
decision = router.route("implement a REST API with authentication")
# RouterDecision(complexity=COMPLEX, mode=THINK, preserve_thinking=True)
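The source does not describe how ComplexityRouter classifies a query. As an illustration only, the table above could be driven by a simple keyword and sentence-count heuristic like the one below; the keyword list, the classify and route functions, and the Complexity enum are all assumptions.

```python
import re
from enum import Enum
from typing import Tuple


class Complexity(Enum):
    SIMPLE = "simple"
    MODERATE = "moderate"
    COMPLEX = "complex"


# Assumed keyword list; a real classifier would likely be richer.
CODE_HINTS = {"implement", "refactor", "debug", "fix", "api", "function", "class"}


def classify(query: str) -> Complexity:
    """Keyword hit -> COMPLEX, several sentences -> MODERATE, else SIMPLE."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    if words & CODE_HINTS:
        return Complexity.COMPLEX
    sentences = [s for s in re.split(r"[.!?]+", query) if s.strip()]
    if len(sentences) > 1:
        return Complexity.MODERATE
    return Complexity.SIMPLE


def route(query: str) -> Tuple[Complexity, bool, bool]:
    """Return (complexity, enable_thinking, preserve_thinking) per the table."""
    level = classify(query)
    thinking = level is not Complexity.SIMPLE
    preserve = level is Complexity.COMPLEX
    return level, thinking, preserve
```

A heuristic this small will misclassify many queries; it is meant only to show how the three rows of the table map to a mode and a preservation setting.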
API Reference
ThinkingSession
session = ThinkingSession(
client, # OpenAI-compatible client
backend="vllm", # or "sglang", "dashscope", "llamacpp"
model="Qwen/Qwen3.6-35B-A3B",
budget=200_000, # Total context budget
min_context=128_000, # 128K minimum for thinking
preserve_thinking=True,
auto_route=True, # Auto-classify complexity
)
Manual Mode Control
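# Force a specific mode
# (ThinkingMode is assumed to be exported at the package top level, like ThinkingSession)
from qwen_think import ThinkingMode
session.chat("quick answer", mode=ThinkingMode.NO_THINK)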
# Let the router decide
session.chat("refactor this module")
# Check current state
session.thinking_mode # -> ThinkingMode.THINK
session.budget_status # -> BudgetStatus(...)
License
Apache-2.0