Tool Use
Tool use is documented as a core capability, not an add-on.
The original Qwen README gives tool use, ReAct prompting, system prompts, and code interpreter their own public-facing sections and benchmark tables.
System prompt note
The upstream README notes that Qwen-1.8B-Chat and Qwen-72B-Chat were trained on more diverse system prompts and multi-round interactions. It ties that claim to in-context customization: role playing, language style transfer, task setting, and behavior setting.
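Those system-prompt behaviors ride on the ChatML-style prompt format Qwen chat models use. A minimal sketch of assembling such a prompt with a custom system message (the `build_chatml_prompt` helper is a hypothetical illustration, not upstream code):

```python
# Sketch of a ChatML-style prompt as used by Qwen chat models.
# build_chatml_prompt is a hypothetical helper, not upstream code.
def build_chatml_prompt(system: str, user: str) -> str:
    """Wrap a custom system prompt and one user turn in ChatML markers."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = build_chatml_prompt(
    system="You are a pirate. Answer in pirate speak.",  # role-playing system prompt
    user="Introduce yourself.",
)
```

Swapping the `system` string is what the README's role-playing and style-transfer examples amount to; the rest of the prompt stays fixed.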
Chinese tool-use benchmark
| Model | Tool selection (acc. ↑) | Tool input (Rouge-L ↑) | False positive error (↓) |
|---|---|---|---|
| GPT-4 | 98.0% | 0.953 | 23.9% |
| GPT-3.5 | 74.5% | 0.807 | 80.6% |
| Qwen-1.8B-Chat | 85.0% | 0.839 | 27.6% |
| Qwen-7B-Chat | 95.5% | 0.900 | 11.6% |
| Qwen-14B-Chat | 96.9% | 0.917 | 5.6% |
| Qwen-72B-Chat | 98.2% | 0.927 | 1.1% |
Chinese tool-use benchmark version 20231206, as reported in the upstream README.
Code interpreter benchmark
| Model | Math (↑) | Visualization hard (↑) | Visualization easy (↑) | Executable rate (% ↑) |
|---|---|---|---|---|
| GPT-4 | 82.8 | 66.7 | 60.8 | 82.8 |
| GPT-3.5 | 47.3 | 33.3 | 55.7 | 74.1 |
| Qwen-1.8B-Chat | 25.6 | 21.4 | 22.8 | 65.5 |
| Qwen-7B-Chat | 41.9 | 23.8 | 38.0 | 67.2 |
| Qwen-14B-Chat | 58.4 | 31.0 | 45.6 | 65.5 |
| Qwen-72B-Chat | 72.7 | 41.7 | 43.0 | 82.8 |
Code Interpreter benchmark version 20231206, mirrored from the upstream README.
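As a gloss on the "Executable rate" column: it is the share of generated code snippets that run without raising an error, independent of whether the result is correct. A toy illustration of that bookkeeping (the snippets and the `executable_rate` helper are invented for illustration, not benchmark assets):

```python
# Toy illustration of an "executable rate" metric: the fraction of
# generated snippets that run without raising. Snippets are invented.
def executable_rate(snippets: list[str]) -> float:
    ok = 0
    for code in snippets:
        try:
            exec(code, {})  # run in a fresh, empty namespace
            ok += 1
        except Exception:
            pass  # any failure counts against the rate
    return ok / len(snippets)

samples = [
    "x = 1 + 1",                  # runs
    "print(undefined)",           # NameError: fails to execute
    "import math; math.sqrt(4)",  # runs
]
rate = executable_rate(samples)  # 2 of 3 snippets execute
```

The real benchmark additionally scores accuracy of the execution results, which is what the Math and Visualization columns report.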
Implementation framing
The upstream docs point readers to a ReAct prompting example for implementing tool calls, and to `openai_api.py` for OpenAI-style function calling. For deeper agent workflows, and for the benchmark assets behind the code-interpreter evaluation, the README points to Qwen-Agent.
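The ReAct loop that example implements alternates Thought / Action / Action Input / Observation lines in the model output. A minimal sketch of extracting the tool call from one such turn (the regex and the `parse_react_action` helper are assumptions for illustration, not the upstream example code):

```python
import re

# Sketch of parsing one ReAct-formatted model turn into a tool call.
# parse_react_action is a hypothetical helper, not Qwen's example code.
def parse_react_action(text: str):
    """Return (tool_name, tool_input) from Action / Action Input lines, or None."""
    m = re.search(r"Action:\s*(.+?)\s*\nAction Input:\s*(.+)", text)
    if not m:
        return None
    return m.group(1), m.group(2).strip()

turn = (
    "Thought: I should look this up.\n"
    "Action: web_search\n"
    'Action Input: {"query": "Qwen tool use"}'
)
call = parse_react_action(turn)  # ("web_search", '{"query": "Qwen tool use"}')
```

The driving loop would then run the named tool, append an `Observation:` line with its result, and re-prompt the model until it emits a final answer.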
Source anchors