Long Context

Long-context capability is documented with both method notes and evaluation tables mirrored from the upstream README.

The upstream README ties longer context windows to NTK-aware interpolation, window attention, LogN scaling, and larger rotary bases for the 72B model.
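Of these techniques, LogN scaling admits a compact sketch: once the inference context exceeds the training length, attention queries (equivalently, logits) are scaled by the ratio of the log of the current position to the log of the training length, keeping attention entropy roughly stable. The formulation below is the commonly cited one, not Qwen's exact code, and the 8K training length is an assumption for illustration.

```python
import math

def logn_scale(pos, train_len):
    """LogN attention scaling: scale the query at context position `pos`
    when it exceeds the training length `train_len`, otherwise leave it
    unchanged. Sketch of the commonly cited formulation, not Qwen's code."""
    if pos <= train_len:
        return 1.0
    return math.log(pos) / math.log(train_len)

# With an assumed 8K training length, a query at position 32768 is scaled
# by log(32768) / log(8192) = 15/13, roughly 1.154.
scale = logn_scale(32768, 8192)
```

Because both logs share the same base constant, the ratio reduces to log2(32768)/log2(8192) = 15/13 for power-of-two lengths.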


Technique summary

For Qwen-14B, the README describes extending the context length from 2K to over 8K with NTK-aware interpolation, window attention, and LogN scaling.
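The window-attention part can be illustrated with a causal sliding-window mask, in which each token attends only to itself and the most recent keys inside a fixed window. This is a minimal sketch of the general technique; the upstream implementation (including any per-layer window sizes) may differ.

```python
import numpy as np

def sliding_window_causal_mask(seq_len, window):
    """Boolean attention mask where True means 'may attend': token i sees
    tokens j in the range [i - window + 1, i], i.e. causal attention
    restricted to the last `window` keys. Illustrative sketch only."""
    i = np.arange(seq_len)[:, None]  # query positions (rows)
    j = np.arange(seq_len)[None, :]  # key positions (columns)
    return (j <= i) & (j > i - window)

# With seq_len=6 and window=3, row 5 attends to positions 3, 4, 5 only.
mask = sliding_window_causal_mask(6, 3)
```

Restricting each row to a fixed number of keys keeps per-token attention cost constant as the sequence grows, which is why window attention pairs well with the interpolation techniques above.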

For Qwen-1.8B and Qwen-7B, the same techniques extend the native 8K context to 32K. For Qwen-72B, the README says the model adapts RoPE to longer contexts with a larger rotary base.
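The "larger rotary base" idea can be sketched as NTK-aware base scaling: the RoPE base is enlarged so that low-frequency rotations stretch with the context while the highest frequencies stay essentially unchanged. The snippet below uses a common formulation of this scaling, with an assumed per-head dimension of 128 and a 4x length scale (e.g. 8K to 32K); it is not Qwen's exact code.

```python
def rope_inv_freq(dim, base=10000.0):
    """Standard RoPE inverse frequencies for per-head dimension `dim`."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

def ntk_scaled_base(base, dim, scale):
    """NTK-aware interpolation: enlarge the rotary base so the lowest
    frequency stretches by exactly `scale` while the highest frequency
    is unchanged. Common formulation, not Qwen's exact code."""
    return base * scale ** (dim / (dim - 2))

dim = 128                 # assumed per-head dimension for illustration
orig = rope_inv_freq(dim)
stretched = rope_inv_freq(dim, base=ntk_scaled_base(10000.0, dim, scale=4.0))
# orig[0] == stretched[0] == 1.0 (high frequency preserved), while the
# lowest frequency is divided by exactly 4 (wavelength stretched 4x).
```

The exponent dim/(dim-2) is chosen so that the last frequency index picks up the full factor of `scale`: the ratio of the bases raised to the power (dim-2)/dim collapses back to `scale` exactly.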

Long-context perplexity snapshot

| Model | 1024 | 2048 | 4096 | 8192 | 16384 | 32768 |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen-7B (original) | 4.23 | 3.78 | 39.35 | 469.81 | 2645.09 | - |
| + dynamic_ntk | 4.23 | 3.78 | 3.59 | 3.66 | 5.71 | - |
| Qwen-1.8B | 5.00 | 4.48 | 4.13 | 3.89 | 17.42 | 433.85 |
| Qwen-1.8B + dynamic_ntk + logn + window_attn | 5.00 | 4.48 | 4.14 | 3.93 | 3.82 | 3.83 |
| Qwen-7B | 4.23 | 3.81 | 3.52 | 3.31 | 7.27 | 181.49 |
| Qwen-7B + dynamic_ntk + logn + window_attn | 4.23 | 3.81 | 3.52 | 3.33 | 3.22 | 3.17 |
| Qwen-14B + dynamic_ntk + logn + window_attn | - | 3.46 | 3.29 | 3.18 | 3.42 | - |
| Qwen-72B | - | - | - | 2.83 | 2.73 | 2.72 |

Perplexity results on the arXiv long-context evaluation, mirrored from the upstream README; columns are input sequence lengths in tokens.
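For reference, perplexity is the exponentiated mean per-token negative log-likelihood, so lower is better and a value near the vocabulary-agnostic floor of 1.0 indicates near-certain predictions. The sketch below shows the generic definition; the upstream evaluation's exact tokenization and windowing are not specified here.

```python
import math

def perplexity(nlls):
    """Perplexity from per-token negative log-likelihoods in nats:
    exp of the mean NLL over the evaluated span. Generic definition;
    the upstream evaluation details may differ."""
    return math.exp(sum(nlls) / len(nlls))

# A model averaging log(4.23) nats per token scores perplexity 4.23,
# matching the scale of the numbers in the table above.
ppl = perplexity([math.log(4.23)] * 8)
```

The blow-ups in the table (e.g. 2645.09 at 16384 tokens for the original Qwen-7B) correspond to the mean NLL rising sharply once inputs exceed the supported context.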

L-Eval comparison

| Model | Input length | Average | Coursera | GSM | QuALITY | TOEFL | CodeU | SFiction |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ChatGPT-3.5-16k | 16K | 60.73 | 63.51 | 84.00 | 61.38 | 78.43 | 12.22 | 64.84 |
| Qwen-72B-Chat | 32K | 62.30 | 58.13 | 76.00 | 77.22 | 86.24 | 6.66 | 69.53 |

Closed-ended L-Eval result block mirrored from the upstream README.

Needle-in-a-haystack note

The upstream README also includes a qualitative needle-in-a-haystack result for Qwen-72B-Chat, stating that it can retrieve information placed at varying positions within 32K-token inputs.

Source anchors

Long Context | Qwen Code