dspy-qwen-adapter

A drop-in DSPy adapter for reliable tool calling across the Qwen 3+ family.
A DSPy adapter that makes dspy.ReAct and dspy.Predict reliable across the Qwen 3+ family (Qwen 3, 3.5, Coder variants, and forward-compatible with 3.6) on any OpenAI-compatible local inference server — LM Studio, vLLM, llama.cpp, Ollama, SGLang.

Why this exists

Qwen's tool-calling wire format changes across generations:

  • Qwen 3 (base): Hermes-style JSON, <tool_call>{"name": "...", "arguments": {...}}</tool_call>. Matches vLLM's --tool-call-parser hermes.
  • Qwen 3.5 / 3-Coder lineage: XML, <tool_call><function=NAME><parameter=K>\nVALUE\n</parameter>...</function></tool_call>. Matches vLLM's --tool-call-parser qwen3_coder.
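The gap between the two wire formats is easy to see in code. Below is a minimal sketch of extracting a single call from each format (illustrative only — this is not the adapter's actual parser, and it assumes well-formed output):

```python
import json
import re

def parse_hermes(text: str):
    """Qwen 3 (base): Hermes-style JSON inside <tool_call> tags."""
    m = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL)
    if not m:
        return None
    payload = json.loads(m.group(1))
    return payload["name"], payload["arguments"]

def parse_qwen35(text: str):
    """Qwen 3.5 / 3-Coder: XML with <function=NAME> and <parameter=K> blocks."""
    fn = re.search(r"<function=([^>]+)>(.*?)</function>", text, re.DOTALL)
    if not fn:
        return None
    args = {
        k: v.strip("\n")
        for k, v in re.findall(
            r"<parameter=([^>]+)>\n?(.*?)\n?</parameter>", fn.group(2), re.DOTALL
        )
    }
    return fn.group(1), args

hermes = '<tool_call>{"name": "get_weather", "arguments": {"city": "Tokyo"}}</tool_call>'
qwen35 = "<tool_call><function=get_weather><parameter=city>\nTokyo\n</parameter></function></tool_call>"
print(parse_hermes(hermes))   # ('get_weather', {'city': 'Tokyo'})
print(parse_qwen35(qwen35))   # ('get_weather', {'city': 'Tokyo'})
```

Same call, two incompatible encodings — which is why a parser built for one generation breaks on the other.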

DSPy's stock adapters (ChatAdapter, JSONAdapter, XMLAdapter) don't know about either format. They ask the model for their own delimiter / JSON / tagged-field schemes, which Qwen will follow via in-context compliance. But that drifts the model off its trained multi-turn distribution: quality silently degrades on longer chains, and parsing fails outright when the model's output mixes formats.

This adapter:

  • Prompts the model in Qwen 3.5's canonical format. A prompt exemplar pulls Qwen 3 to the same format via in-context compliance, so a single adapter covers both generations. The benchmarks below show zero parse failures across all 120 qwen-adapter runs spanning three models.
  • Replays multi-turn trajectories using the Qwen chat-template shape (<tool_call> on assistant turns, <tool_response name="..."> on tool turns) so long agent runs stay in-distribution.
  • Bypasses the inference server's tool-call parser entirely (never passes tools=[]), so it works even on servers whose native Qwen parsers have known bugs.
  • Strips leaked <think> tags from completions before parsing.
  • Rescues empty text turns by falling back to reasoning_content — important for thinking-mode models on LM Studio, where the server can route the entire completion into a side channel and leave text empty.
  • Inherits XMLAdapter for plain dspy.Predict / ChainOfThought — non-tool-calling paths get <field>content</field> tags, which is still in Qwen's XML-heavy training distribution, plus demos and dspy.History support for free.
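The <think>-stripping and reasoning_content rescue amount to a small pre-parse cleaning step. A minimal sketch (hypothetical helper, not the adapter's internals) of what that looks like:

```python
import re

def clean_completion(message: dict) -> str:
    """Pick the text channel, fall back to reasoning_content, strip <think> leaks.

    `message` mimics an OpenAI-style chat completion message; reasoning_content
    is the side channel that thinking-mode servers like LM Studio can populate
    while leaving `content` empty.
    """
    text = message.get("content") or ""
    if not text.strip():
        # Thinking-mode servers may route the whole completion here.
        text = message.get("reasoning_content") or ""
    # Drop any leaked <think>...</think> blocks before field parsing.
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

msg = {"content": "", "reasoning_content": "<think>plan the steps...</think>The answer is 4."}
print(clean_completion(msg))  # The answer is 4.
```

Without the fallback, an empty `content` turn parses to all-None fields and the turn is silently lost.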

Benchmark results

480 total runs: 4 adapters × 8 scenarios × 5 runs × 3 models, each scored by an LLM judge. See docs/benchmarks.md for methodology, per-cell reasoning, and limitations.

Legend:

  • Task success = fraction of 5 runs where the LLM judge scored the final answer correct. (A cheap substring metric is also recorded; it agrees with the judge on every cell except one infrastructure anomaly, so we report only the judge value here.)
  • Tool-fail / run = average number of tool-call executions per run that raised an exception. ReAct recovered each time, but every failed call is a wasted turn. Shown in its own table per model.
  • ⚠ marks cells where every run had a parse failure (adapter couldn't extract output fields from the LM response).
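For concreteness, the two headline metrics reduce to a few lines over per-run records (the record shape here is hypothetical, not the harness's actual schema):

```python
def task_success(runs):
    """Fraction of runs the LLM judge scored correct."""
    return sum(r["judge_correct"] for r in runs) / len(runs)

def tool_fail_per_run(runs):
    """Average number of tool executions per run that raised an exception."""
    return sum(r["tool_exceptions"] for r in runs) / len(runs)

# Five runs of one (adapter, scenario, model) cell:
runs = [
    {"judge_correct": True,  "tool_exceptions": 0},
    {"judge_correct": True,  "tool_exceptions": 2},
    {"judge_correct": False, "tool_exceptions": 1},
    {"judge_correct": True,  "tool_exceptions": 0},
    {"judge_correct": True,  "tool_exceptions": 0},
]
print(f"{task_success(runs):.0%}")       # 80%
print(f"{tool_fail_per_run(runs):.2f}")  # 0.60
```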

qwen3.5-35b-a3b

Task success

scenario   chat   json   xml    qwen
s1         100%   100%   100%   100%
s3         100%   100%   0% ⚠   100%
s10        100%   100%   100%   100%
s_sql      100%   100%   100%   100%
s_code     100%   100%   100%   100%
s_echo     100%   100%   100%   100%
s_deep     100%   100%   80%    100%
s_i18n     0%     40%    0%     80%

Tool-fail / run

scenario   chat   json   xml    qwen
s1         0.00   0.00   0.00   0.00
s3         0.00   0.00   0.00   0.00
s10        0.00   0.80   0.40   0.00
s_sql      0.00   0.00   0.00   0.00
s_code     0.00   0.00   0.00   0.00
s_echo     0.00   0.40   0.20   0.00
s_deep     0.60   0.20   2.20   0.00
s_i18n     0.00   0.00   0.00   0.00

qwen3.5-4b

Task success

scenario   chat   json   xml    qwen
s1         100%   100%   100%   100%
s3         100%   100%   100%   100%
s10        100%   100%   100%   100%
s_sql      100%   100%   100%   100%
s_code     100%   100%   0% ⚠   100%
s_echo     100%   80%    100%   100%
s_deep     100%   100%   100%   100%
s_i18n     100%   0%     0%     0%

Tool-fail / run

scenario   chat   json   xml    qwen
s1         1.00   0.00   0.00   0.00
s3         0.00   0.00   0.00   0.00
s10        0.00   0.00   0.00   0.00
s_sql      0.00   0.00   0.00   0.00
s_code     0.00   0.00   0.00   0.00
s_echo     0.00   0.00   0.00   0.00
s_deep     1.00   0.00   0.00   0.00
s_i18n     0.00   0.00   0.00   0.00

qwen3-4b (out-of-distribution — Hermes-format model)

Qwen 3 was trained on Hermes-style tool calls, not the XML format this adapter prompts for. The benchmark tests whether in-context compliance bridges the distribution gap.

Task success

scenario   chat   json   xml    qwen
s1         100%   100%   100%   100%
s3         100%   100%   100%   100%
s10        100%   100%   100%   100%
s_sql      100%   100%   100%   100%
s_code     100%   100%   100%   100%
s_echo     0%     0%     0%     0%
s_deep     100%   100%   100%   100%
s_i18n     0%     0%     0%     0%

Tool-fail / run

scenario   chat   json   xml    qwen
s1         0.00   0.00   0.00   0.00
s3         0.00   2.00   0.00   0.00
s10        0.00   0.00   0.00   0.00
s_sql      0.00   1.00   0.00   0.00
s_code     1.00   1.00   2.00   0.00
s_echo     0.00   0.00   0.00   0.00
s_deep     0.00   0.00   0.00   0.00
s_i18n     0.00   0.00   0.00   0.00

(s_echo and s_i18n failures across all adapters on 4B-class models are weak-model + mock-tool artifacts — the model hallucinates lengths or paraphrases narrative prefixes regardless of adapter. Not a production tool-calling regression. See docs/benchmarks.md.)

Headline findings

  • 0 parse failures across all 120 qwen-adapter runs on all three models. XMLAdapter, by comparison, failed to parse every s3 run on 35B and every s_code run on 4B.
  • 0.00 tool-fail / run on every scenario on every model. The closest alternatives spike to 0.20–2.20 on multi-step and structured-arg scenarios. Same or better task success, fewer wasted turns.
  • Only adapter that reliably handles multilingual / delimiter-leaking tool output on 35B. s_i18n: qwen 80% vs chat 0%, json 40%, xml 0%.
  • Rescues reasoning_content turns that silently break stock adapters on thinking-mode models. json lost a run on s_echo 4B this way; qwen caught it via the fallback.
  • Works on Qwen 3 despite the training-distribution mismatch. The XML exemplar in our prompt is strong enough that Qwen 3 (trained on Hermes) follows it anyway, and qwen still posts the best tool-fail numbers across all scenarios.

Install

From PyPI (once published):

pip install dspy-qwen-adapter

From source (editable):

git clone https://github.com/<user>/dspy-qwen-adapter
cd dspy-qwen-adapter
pip install -e .

Quickstart

import dspy
from dspy_qwen_adapter import QwenAdapter

dspy.configure(
    lm=dspy.LM(
        "openai/qwen/qwen3.5-35b-a3b",
        api_base="http://127.0.0.1:1234/v1",
        api_key="lm-studio",
        temperature=1.0,
        max_tokens=8192,
    ),
    adapter=QwenAdapter(),
)

def get_weather(city: str) -> str:
    """Get the current weather in a city."""
    return f"sunny, 72F in {city}"

react = dspy.ReAct("question -> answer", tools=[get_weather])
print(react(question="What's the weather in Tokyo?").answer)

That's the whole user-facing surface — instantiate QwenAdapter(), pass it to dspy.configure, use dspy.ReAct or dspy.Predict as normal. No prompt templates, no parser configuration, no server-specific flags.

The same code works unchanged on Qwen 3: swap in openai/qwen/qwen3-4b as the model name and run.

Configuration

QwenAdapter(
    callbacks=None,                 # list[BaseCallback] — standard DSPy callbacks
    native_response_types=None,     # list[type] — forwarded to base Adapter
    strict_parse=False,             # True: raise AdapterParseError when no tool call
                                    # is present. False (default): treat as a
                                    # graceful finish — the model's text becomes
                                    # the thought, and ReAct moves to extract.
)

use_native_function_calling is hardcoded off — we never pass tools=[] to the server, which is what makes this adapter robust across servers with different Qwen tool-parser quirks.

Compatibility

  • Model: Qwen 3+ family. Optimized for Qwen 3.5 (XML-format lineage); works on Qwen 3 (Hermes-format) and Qwen 3-Coder via in-context compliance. Smaller variants (4B and below) can show weak-model artifacts on narrative-mock benchmarks but still post the best tool-fail rates.
  • Server: any OpenAI-compatible chat/completions endpoint. Tested against LM Studio 0.4.x; should work against vLLM, SGLang, llama.cpp, and Ollama without any server-specific flags, since this adapter doesn't rely on native function calling.
  • Python: 3.12+.
  • DSPy: 3.1+.

How it's different

  • Tool call format: ChatAdapter uses [[ ## field ## ]] delimiters; JSONAdapter uses JSON text (+ response_format); XMLAdapter uses <field>content</field> per output; QwenAdapter emits canonical Qwen <tool_call> XML.
  • Trajectory replay: ChatAdapter replays flat name: value lines; JSONAdapter flat JSON per turn; XMLAdapter <field> lines per turn; QwenAdapter <tool_call> + <tool_response name="..."> XML per turn.
  • <think> tag handling: only QwenAdapter strips them before parsing.
  • Empty-text (thinking mode): ChatAdapter, JSONAdapter, and XMLAdapter drop the turn (all fields None); QwenAdapter falls back to reasoning_content.
  • Server native tool parser: not used by ChatAdapter or XMLAdapter; JSONAdapter uses it when response_format is supported; QwenAdapter never uses it (by design).
  • Plain dspy.Predict: works with all four; QwenAdapter inherits it from XMLAdapter.

See docs/benchmarks.md for the measured effect of each.

Limitations

  • Only text-native mode. This adapter does not use the server's native tool-call parser — by design. If you're on a server whose tool parser for Qwen works perfectly, stock JSONAdapter with native function calling may be faster. The benchmarks show this adapter is at worst equivalent and at best dramatically better, at the cost of parsing tool calls in Python instead of at the server.
  • No demo / few-shot support on the ReAct path. DSPy optimizers that rely on demo interleaving (BootstrapFewShot, MIPRO) will silently get zero-shot behavior on ReAct calls. Plain Predict inherits demo support from XMLAdapter. Tracking as a future enhancement.
  • Non-streaming only. Streaming parsers for Qwen are buggy in most current inference stacks; this adapter targets non-streaming responses.
  • Small-model quirks. The 4B-class models occasionally paraphrase narrative tool output or hallucinate numeric details on contrived benchmark scenarios (s_echo, s_i18n). Not a production tool-calling regression — real tools return real data. Bigger models (35B+) pass these cleanly.

Development

Run the tests:

pip install -e '.[dev]'
pytest tests/ -v

Run the benchmark harness against a local model:

./harness/run_matrix.sh --runs 5 --use-judge

See docs/benchmarks.md for the harness docs.

License

MIT.
