# Benchmarks

Benchmark claims remain visible, but clearly historical.
Because the only source input is the upstream README, this site ties every benchmark claim to the original table rather than presenting it as a live leaderboard.
## How to read the numbers
The upstream README states that, for each compared model and benchmark, the table reports the better of the model's official result and its OpenCompass score; a short sketch of that selection rule follows the list below.
That makes these tables useful as product-surface evidence, but not as a substitute for up-to-date benchmark research. The reported benchmarks cover four areas:
- Natural language understanding
- Math and reasoning
- Code generation
- Chinese evaluation coverage
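
A minimal Python sketch of the "best of official and OpenCompass" rule, assuming per-benchmark score dictionaries for a single compared model. The helper name and all example values are illustrative, not figures from the README.

```python
# Sketch of the scoring convention described above: for each benchmark,
# keep the better of the official result and the OpenCompass result.

def best_of_sources(official: dict[str, float], opencompass: dict[str, float]) -> dict[str, float]:
    """Keep the higher score per benchmark across the two sources."""
    merged = {}
    for bench in sorted(official.keys() | opencompass.keys()):
        candidates = [s for s in (official.get(bench), opencompass.get(bench)) if s is not None]
        merged[bench] = max(candidates)
    return merged

# Hypothetical inputs, only to show that the rule takes the max per benchmark
# and keeps benchmarks that appear in just one source.
official_scores = {"MMLU": 46.0, "GSM8K": 15.9}
opencompass_scores = {"MMLU": 46.8, "GSM8K": 15.2, "BBH": 38.2}
print(best_of_sources(official_scores, opencompass_scores))
# -> {'BBH': 38.2, 'GSM8K': 15.9, 'MMLU': 46.8}
```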
## Representative performance table
| Model | MMLU | C-Eval | GSM8K | MATH | HumanEval | MBPP | BBH | CMMLU |
|---|---|---|---|---|---|---|---|---|
| LLaMA2-7B | 46.8 | 32.5 | 16.7 | 3.3 | 12.8 | 20.8 | 38.2 | 31.8 |
| InternLM-20B | 62.1 | 58.8 | 52.6 | 7.9 | 25.6 | 35.6 | 52.5 | 59.0 |
| Yi-34B | 76.3 | 81.8 | 67.9 | 15.9 | 26.2 | 38.2 | 66.4 | 82.6 |
| Qwen-1.8B | 45.3 | 56.1 | 32.3 | 2.3 | 15.2 | 14.2 | 22.3 | 52.1 |
| Qwen-7B | 58.2 | 63.5 | 51.7 | 11.6 | 29.9 | 31.6 | 45.0 | 62.2 |
| Qwen-14B | 66.3 | 72.1 | 61.3 | 24.8 | 32.3 | 40.8 | 53.4 | 71.0 |
| Qwen-72B | 77.4 | 83.3 | 78.9 | 35.2 | 35.4 | 52.2 | 67.7 | 83.6 |
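
For readers who want to work with these numbers programmatically, the following sketch copies the table into a Python dictionary and ranks models per benchmark. The scores are taken verbatim from the table above; the variable names and the ranking step are illustrative only.

```python
# Reported scores, copied from the representative performance table.
SCORES = {
    "LLaMA2-7B":    {"MMLU": 46.8, "C-Eval": 32.5, "GSM8K": 16.7, "MATH": 3.3,  "HumanEval": 12.8, "MBPP": 20.8, "BBH": 38.2, "CMMLU": 31.8},
    "InternLM-20B": {"MMLU": 62.1, "C-Eval": 58.8, "GSM8K": 52.6, "MATH": 7.9,  "HumanEval": 25.6, "MBPP": 35.6, "BBH": 52.5, "CMMLU": 59.0},
    "Yi-34B":       {"MMLU": 76.3, "C-Eval": 81.8, "GSM8K": 67.9, "MATH": 15.9, "HumanEval": 26.2, "MBPP": 38.2, "BBH": 66.4, "CMMLU": 82.6},
    "Qwen-1.8B":    {"MMLU": 45.3, "C-Eval": 56.1, "GSM8K": 32.3, "MATH": 2.3,  "HumanEval": 15.2, "MBPP": 14.2, "BBH": 22.3, "CMMLU": 52.1},
    "Qwen-7B":      {"MMLU": 58.2, "C-Eval": 63.5, "GSM8K": 51.7, "MATH": 11.6, "HumanEval": 29.9, "MBPP": 31.6, "BBH": 45.0, "CMMLU": 62.2},
    "Qwen-14B":     {"MMLU": 66.3, "C-Eval": 72.1, "GSM8K": 61.3, "MATH": 24.8, "HumanEval": 32.3, "MBPP": 40.8, "BBH": 53.4, "CMMLU": 71.0},
    "Qwen-72B":     {"MMLU": 77.4, "C-Eval": 83.3, "GSM8K": 78.9, "MATH": 35.2, "HumanEval": 35.4, "MBPP": 52.2, "BBH": 67.7, "CMMLU": 83.6},
}

# Print the top-scoring model per benchmark, as reported in the table.
for benchmark in next(iter(SCORES.values())):
    leader = max(SCORES, key=lambda model: SCORES[model][benchmark])
    print(f"{benchmark}: {leader} ({SCORES[leader][benchmark]})")
```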
## Freshness note
These scores come from the original Qwen README and technical memo, not from a live benchmark feed.
The site keeps them because they define the documented public surface for this historical model line.
## Source anchors