# dspy-qwen-adapter

A drop-in DSPy adapter for reliable tool calling across the Qwen 3+ family.
A DSPy adapter that makes `dspy.ReAct` and `dspy.Predict` reliable across
the Qwen 3+ family (Qwen 3, 3.5, the Coder variants, and forward-compatible
with 3.6) on any OpenAI-compatible local inference server: LM Studio,
vLLM, llama.cpp, Ollama, SGLang.
## Why this exists
Qwen's tool-calling wire format changes across generations:

- Qwen 3 (base): Hermes-style, `<tool_call>{"name": "...", "arguments": {...}}</tool_call>`. Matched by vLLM's `--tool-call-parser hermes`.
- Qwen 3.5 / 3-Coder lineage: XML, `<tool_call><function=NAME><parameter=K>\nVALUE\n</parameter>...</function></tool_call>`. Matched by vLLM's `--tool-call-parser qwen3_coder`.
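For concreteness, here is one hypothetical call to the `get_weather` tool from the quickstart below, rendered in each generation's wire format (the values are illustrative, not captured model output):

```text
# Qwen 3 (Hermes-style):
<tool_call>{"name": "get_weather", "arguments": {"city": "Tokyo"}}</tool_call>

# Qwen 3.5 / 3-Coder (XML):
<tool_call>
<function=get_weather>
<parameter=city>
Tokyo
</parameter>
</function>
</tool_call>
```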
DSPy's stock adapters (`ChatAdapter`, `JSONAdapter`, `XMLAdapter`) don't
know about either format. They ask the model for their own delimiter,
JSON, or tagged-field schemes, which Qwen will follow via in-context
compliance. But that drifts the model off its trained multi-turn
distribution, silently losing quality on longer chains, or failing
outright when the model's output mixes formats.
This adapter:

- Prompts the model in Qwen 3.5's canonical format. In-context compliance pulls Qwen 3 to the same format via a prompt exemplar, so a single adapter covers both generations. The benchmarks below show zero parse failures across every one of this adapter's runs on three models.
- Replays multi-turn trajectories using the Qwen chat-template shape (`<tool_call>` on assistant turns, `<tool_response name="...">` on tool turns) so long agent runs stay in-distribution.
- Bypasses the inference server's tool-call parser entirely (never passes `tools=[]`), so it works even on servers whose native Qwen parsers have known bugs.
- Strips leaked `<think>` tags from completions before parsing (sketched below).
- Rescues empty `text` turns by falling back to `reasoning_content` (also sketched below). This matters for thinking-mode models on LM Studio, where the server can route the entire completion into a side channel and leave `text` empty.
- Inherits `XMLAdapter` for plain `dspy.Predict` / `ChainOfThought`: non-tool-calling paths get `<field>content</field>` tags, which is still in Qwen's XML-heavy training distribution, plus demos and `dspy.History` support for free.
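A minimal sketch of the last two rescue behaviors, assuming an OpenAI-style message object (the attribute names are assumptions; this is not the adapter's actual implementation):

```python
import re

def rescue_completion_text(message) -> str:
    """Illustrative only: recover parseable text from a completion message.

    If the visible `content` is empty (thinking-mode servers such as LM
    Studio can route the whole completion into a side channel), fall back
    to `reasoning_content`, then strip any leaked <think> blocks.
    """
    text = message.content or ""
    if not text.strip():
        text = getattr(message, "reasoning_content", None) or ""
    # Drop leaked <think>...</think> spans before field parsing.
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
```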
## Benchmark results
480 total runs (4 adapters × 8 scenarios × 5 runs × 3 models), each scored by an LLM judge. See docs/benchmarks.md for methodology, per-cell reasoning, and limitations.
Legend:

- Task success = fraction of the 5 runs where the LLM judge scored the final answer correct. (A cheap substring metric is also recorded; it agrees with the judge on every cell except one infrastructure anomaly, so only the judge value is reported here.)
- Tool-fail / run = average number of tool-call executions per run that raised an exception. ReAct recovered from each, but every failure is a wasted turn.
- ⚠ marks cells where every run had a parse failure (the adapter couldn't extract output fields from the LM response).
### qwen3.5-35b-a3b
Task success
| scenario | chat | json | xml | qwen |
|---|---|---|---|---|
| s1 | 100% | 100% | 100% | 100% |
| s3 | 100% | 100% | 0% ⚠ | 100% |
| s10 | 100% | 100% | 100% | 100% |
| s_sql | 100% | 100% | 100% | 100% |
| s_code | 100% | 100% | 100% | 100% |
| s_echo | 100% | 100% | 100% | 100% |
| s_deep | 100% | 100% | 80% ⚠ | 100% |
| s_i18n | 0% | 40% | 0% | 80% |
Tool-fail / run
| scenario | chat | json | xml | qwen |
|---|---|---|---|---|
| s1 | 0.00 | 0.00 | 0.00 | 0.00 |
| s3 | 0.00 | 0.00 | 0.00 | 0.00 |
| s10 | 0.00 | 0.80 | 0.40 | 0.00 |
| s_sql | 0.00 | 0.00 | 0.00 | 0.00 |
| s_code | 0.00 | 0.00 | 0.00 | 0.00 |
| s_echo | 0.00 | 0.40 | 0.20 | 0.00 |
| s_deep | 0.60 | 0.20 | 2.20 | 0.00 |
| s_i18n | 0.00 | 0.00 | 0.00 | 0.00 |
### qwen3.5-4b
Task success
| scenario | chat | json | xml | qwen |
|---|---|---|---|---|
| s1 | 100% | 100% | 100% | 100% |
| s3 | 100% | 100% | 100% | 100% |
| s10 | 100% | 100% | 100% | 100% |
| s_sql | 100% | 100% | 100% | 100% |
| s_code | 100% | 100% | 0% ⚠ | 100% |
| s_echo | 100% | 80% | 100% | 100% |
| s_deep | 100% | 100% | 100% | 100% |
| s_i18n | 100% | 0% | 0% | 0% |
Tool-fail / run
| scenario | chat | json | xml | qwen |
|---|---|---|---|---|
| s1 | 1.00 | 0.00 | 0.00 | 0.00 |
| s3 | 0.00 | 0.00 | 0.00 | 0.00 |
| s10 | 0.00 | 0.00 | 0.00 | 0.00 |
| s_sql | 0.00 | 0.00 | 0.00 | 0.00 |
| s_code | 0.00 | 0.00 | 0.00 | 0.00 |
| s_echo | 0.00 | 0.00 | 0.00 | 0.00 |
| s_deep | 1.00 | 0.00 | 0.00 | 0.00 |
| s_i18n | 0.00 | 0.00 | 0.00 | 0.00 |
### qwen3-4b (out-of-distribution: Hermes-format model)
Qwen 3 was trained on Hermes-style tool calls, not the XML format this adapter prompts for. The benchmark tests whether in-context compliance bridges the distribution gap.
Task success
| scenario | chat | json | xml | qwen |
|---|---|---|---|---|
| s1 | 100% | 100% | 100% | 100% |
| s3 | 100% | 100% | 100% | 100% |
| s10 | 100% | 100% | 100% | 100% |
| s_sql | 100% | 100% | 100% | 100% |
| s_code | 100% | 100% | 100% | 100% |
| s_echo | 0% | 0% | 0% | 0% |
| s_deep | 100% | 100% | 100% | 100% |
| s_i18n | 0% | 0% | 0% | 0% |
Tool-fail / run
| scenario | chat | json | xml | qwen |
|---|---|---|---|---|
| s1 | 0.00 | 0.00 | 0.00 | 0.00 |
| s3 | 0.00 | 2.00 | 0.00 | 0.00 |
| s10 | 0.00 | 0.00 | 0.00 | 0.00 |
| s_sql | 0.00 | 1.00 | 0.00 | 0.00 |
| s_code | 1.00 | 1.00 | 2.00 | 0.00 |
| s_echo | 0.00 | 0.00 | 0.00 | 0.00 |
| s_deep | 0.00 | 0.00 | 0.00 | 0.00 |
| s_i18n | 0.00 | 0.00 | 0.00 | 0.00 |
(s_echo and s_i18n failures across all adapters on 4B-class models are weak-model + mock-tool artifacts — the model hallucinates lengths or paraphrases narrative prefixes regardless of adapter. Not a production tool-calling regression. See docs/benchmarks.md.)
## Headline findings
- 0 parse failures across every qwen-adapter run on all three models. `XMLAdapter`, by comparison, failed every `s3` run on 35B and every `s_code` run on 4B.
- 0.00 tool-fail / run on every scenario on every model. The closest alternatives spike to 0.20–2.20 on multi-step and structured-arg scenarios. Same or better task success, fewer wasted turns.
- Only adapter that reliably handles multilingual / delimiter-leaking tool output on 35B. `s_i18n`: qwen 80% vs chat 0%, json 40%, xml 0%.
- Rescues `reasoning_content` turns that silently break stock adapters on thinking-mode models. `json` lost a run on `s_echo` (4B) this way; `qwen` caught it via the fallback.
- Works on Qwen 3 despite the training-distribution mismatch. The XML exemplar in the prompt is strong enough that Qwen 3 (trained on Hermes) follows it anyway, and qwen still posts the best tool-fail numbers across all scenarios.
## Install
From PyPI (once published):

```bash
pip install dspy-qwen-adapter
```

From source (editable):

```bash
git clone https://github.com/<user>/dspy-qwen-adapter
cd dspy-qwen-adapter
pip install -e .
```
## Quickstart
```python
import dspy

from dspy_qwen_adapter import QwenAdapter

dspy.configure(
    lm=dspy.LM(
        "openai/qwen/qwen3.5-35b-a3b",
        api_base="http://127.0.0.1:1234/v1",
        api_key="lm-studio",
        temperature=1.0,
        max_tokens=8192,
    ),
    adapter=QwenAdapter(),
)

def get_weather(city: str) -> str:
    """Get the current weather in a city."""
    return f"sunny, 72F in {city}"

react = dspy.ReAct("question -> answer", tools=[get_weather])
print(react(question="What's the weather in Tokyo?").answer)
```
That's the whole user-facing surface: instantiate `QwenAdapter()`, pass
it to `dspy.configure`, and use `dspy.ReAct` or `dspy.Predict` as normal.
No prompt templates, no parser configuration, no server-specific flags.
The same code works unchanged on Qwen 3: swap in `openai/qwen/qwen3-4b`
as the model and run.
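The non-tool path needs nothing extra either. A hedged example (the signature and question are illustrative) showing plain `ChainOfThought` flowing through the inherited `XMLAdapter` path:

```python
# Plain Predict / ChainOfThought go through the XMLAdapter inheritance;
# no tools, no extra configuration.
qa = dspy.ChainOfThought("question -> answer")
print(qa(question="Why does the sky look blue?").answer)
```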
## Configuration
```python
QwenAdapter(
    callbacks=None,              # list[BaseCallback]: standard DSPy callbacks
    native_response_types=None,  # list[type]: forwarded to the base Adapter
    strict_parse=False,          # True: raise AdapterParseError when no tool
                                 # call is present. False (default): treat as
                                 # a graceful finish; the model's text becomes
                                 # the thought, and ReAct moves to extract.
)
```
`use_native_function_calling` is hardcoded off: the adapter never passes
`tools=[]` to the server, which is what makes it robust across servers
with different Qwen tool-parser quirks.
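For example, to fail loudly instead of treating a tool-call-free turn as a graceful finish, flip the documented `strict_parse` flag:

```python
# Raise AdapterParseError on turns with no tool call, rather than
# treating the model's text as a final thought.
dspy.configure(adapter=QwenAdapter(strict_parse=True))
```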
## Compatibility
- Model: Qwen 3+ family. Optimized for Qwen 3.5 (XML-format lineage); works on Qwen 3 (Hermes-format) and Qwen 3-Coder via in-context compliance. Smaller variants (4B and below) can show weak-model artifacts on narrative-mock benchmarks but still post the best tool-fail rates.
- Server: any OpenAI-compatible `chat/completions` endpoint. Tested against LM Studio 0.4.x; should work against vLLM, SGLang, llama.cpp, and Ollama without any server-specific flags, since the adapter doesn't rely on native function calling (see the sketch after this list).
- Python: 3.12+.
- DSPy: 3.1+.
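As a sketch of that server-portability claim, pointing the quickstart at a local vLLM endpoint should only require changing the LM line. The served-model id and port below are assumptions, not tested values:

```python
# Hypothetical vLLM endpoint; no --tool-call-parser server flag is needed
# because the adapter never sends tools=[].
lm = dspy.LM(
    "openai/Qwen/Qwen3.5-35B-A3B",        # assumed served-model id
    api_base="http://127.0.0.1:8000/v1",  # vLLM's default port
    api_key="EMPTY",
)
dspy.configure(lm=lm, adapter=QwenAdapter())
```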
## How it's different
| | ChatAdapter | JSONAdapter | XMLAdapter | QwenAdapter |
|---|---|---|---|---|
| Tool call format | `[[ ## field ## ]]` delimiters | JSON text (+ `response_format`) | `<field>content</field>` per output | canonical Qwen `<tool_call>` XML |
| Trajectory replay | flat `name: value` lines | flat JSON per turn | `<field>` lines per turn | `<tool_call>` + `<tool_response name="...">` XML per turn |
| `<think>` tag handling | — | — | — | stripped before parsing |
| Empty-text (thinking mode) | drops the turn (all fields None) | drops the turn | drops the turn | falls back to `reasoning_content` |
| Server native tool parser | not used | used when `response_format` is supported | not used | not used (by design) |
| Plain `dspy.Predict` | works | works | works | works (via `XMLAdapter` inheritance) |
See docs/benchmarks.md for the measured effect of each.
## Limitations
- Text-native mode only. This adapter does not use the server's native
  tool-call parser, by design. If you're on a server whose tool parser for
  Qwen works perfectly, stock `JSONAdapter` with native function calling may
  be faster. The benchmarks show this adapter is at worst equivalent and at
  best dramatically better, at the cost of parsing tool calls in Python
  instead of at the server.
- No demo / few-shot support on the ReAct path. DSPy optimizers that rely
  on demo interleaving (BootstrapFewShot, MIPRO) will silently get zero-shot
  behavior on ReAct calls. Plain `Predict` inherits demo support from
  `XMLAdapter`. Tracked as a future enhancement.
- Non-streaming only. Streaming parsers for Qwen are buggy in most current
  inference stacks; this adapter targets non-streaming responses.
- Small-model quirks. The 4B-class models occasionally paraphrase narrative
  tool output or hallucinate numeric details on contrived benchmark
  scenarios (`s_echo`, `s_i18n`). Not a production tool-calling regression:
  real tools return real data. Bigger models (35B+) pass these cleanly.
## Development
Run the tests:

```bash
pip install -e '.[dev]'
pytest tests/ -v
```

Run the benchmark harness against a local model:

```bash
./harness/run_matrix.sh --runs 5 --use-judge
```

See docs/benchmarks.md for the harness docs.
## License
MIT.