Runtime baseline
The upstream README calls out Python 3.8+, PyTorch 1.12+, Transformers 4.32+, and CUDA 11.4+ as the baseline environment; a quick check against these floors is sketched after the list below.
Flash Attention is optional, but the README recommends it for supported fp16 or bf16 devices to improve efficiency and reduce memory usage.
- Python 3.8 or newer
- PyTorch 1.12 or newer, with 2.0+ recommended
- Transformers 4.32 or newer
- CUDA 11.4+ for GPU-oriented paths
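Before adding extras, it can help to confirm that the interpreter and libraries actually meet those floors. A minimal sketch, assuming only the baseline above plus `packaging` (which ships as a Transformers dependency):

```python
# Sanity-check the environment against the README's stated floors.
import sys

import torch
import transformers
from packaging import version

assert sys.version_info >= (3, 8), "Python 3.8+ required"
assert version.parse(torch.__version__) >= version.parse("1.12"), "PyTorch 1.12+ required"
assert version.parse(transformers.__version__) >= version.parse("4.32"), "Transformers 4.32+ required"

# CUDA toolkit version this PyTorch build targets (None on CPU-only builds).
print("CUDA build:", torch.version.cuda)
```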
Quickstart flow
Install baseline dependencies
Start with `pip install -r requirements.txt` if you want the simplest source-aligned local environment.
Add Flash Attention only when the hardware supports it
Treat flash-attention as an optimization layer, not a prerequisite, because the upstream README explicitly says the project still runs without it.
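One way to honor that is to probe the environment before switching the optimization on. The helper below is a sketch, not upstream code; treating compute capability 8.0+ (Ampere or newer) as a proxy for strong bf16 flash-attention support is an assumption, since fp16 kernels also run on some older parts.

```python
# Hypothetical helper: decide whether flash-attention is worth enabling.
import importlib.util

import torch

def flash_attn_usable() -> bool:
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    has_package = importlib.util.find_spec("flash_attn") is not None
    # Assumption: compute capability 8.0+ as a proxy for bf16-capable hardware.
    return has_package and major >= 8

print("enable flash-attention:", flash_attn_usable())
```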
Load the chat checkpoint with `trust_remote_code=True`
The official quickstart shows `AutoTokenizer` and `AutoModelForCausalLM` loading the chat model directly from the public model hub.
Minimal Transformers example
The upstream quickstart centers the local experience on a direct `model.chat()` flow.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True is required: the chat interface lives in the
# custom modeling code shipped with the checkpoint.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",  # spread weights across available devices
    trust_remote_code=True,
).eval()

# First turn: history=None starts a fresh conversation.
response, history = model.chat(tokenizer, "你好", history=None)
print(response)
```
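The returned `history` feeds straight back into the next call, which is how the quickstart carries multi-turn state (the follow-up prompt here is illustrative):

```python
# Second turn: pass the accumulated history back in to keep context.
response, history = model.chat(
    tokenizer,
    "Can you introduce yourself in English?",
    history=history,
)
print(response)
```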
Where builders usually branch next
Hugging Face
Use the public Qwen organization when you want the standard open-source model-card and checkpoint flow.
ModelScope
Use ModelScope, the China-friendly distribution hub referenced throughout the original docs, when you want a mirror of the same model line.
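A sketch of that path, assuming `pip install modelscope` and that the checkpoint is namespaced as `qwen/Qwen-7B-Chat` on ModelScope; `snapshot_download` pulls the files locally, after which the Transformers flow above applies unchanged:

```python
# Sketch: fetch the same checkpoint from ModelScope instead of the HF hub.
from modelscope import snapshot_download

# Assumption: the ModelScope model ID mirrors the HF one with a lowercase namespace.
local_dir = snapshot_download("qwen/Qwen-7B-Chat")

# Point the Transformers loaders at the downloaded directory, e.g.:
# AutoTokenizer.from_pretrained(local_dir, trust_remote_code=True)
print("checkpoint downloaded to:", local_dir)
```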
Docker images
The README also points to prebuilt Docker images for faster environment setup when you do not want to build from scratch.