Demos

Document the runtime surfaces without implying that the app itself hosts inference

The public web experience should point users toward Web UI, CLI, vLLM, and FastChat flows while keeping the marketing/docs server separate from model serving.


Recommended deployment path from upstream

  1. Start with vLLM for fast inference

    The upstream deployment section recommends vLLM first when you need serving-oriented throughput.

  2. Layer FastChat for Web UI or an OpenAI-style API

    FastChat becomes the orchestration layer for controller, worker, Gradio web server, and compatible API server.

  3. Use the simple demos if the stack above is too heavy

    The README also keeps lighter Web UI, CLI, and API entry points for direct local demos.

vLLM + FastChat shell flow

# Install FastChat with the model worker and Web UI extras
pip install "fschat[model_worker,webui]"
# Run each of the following long-lived processes in its own terminal:
python -m fastchat.serve.controller
python -m fastchat.serve.vllm_worker --model-path $MODEL_PATH --trust-remote-code --dtype bfloat16
python -m fastchat.serve.gradio_web_server
python -m fastchat.serve.openai_api_server --host localhost --port 8000
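Once the stack above is running, the last command exposes an OpenAI-compatible endpoint on localhost:8000. A minimal sketch of building a chat-completion request for it follows; the model id `qwen-chat` is an assumption, so query `GET /v1/models` on a running server for the id actually registered by the vLLM worker.

```python
import json

# Endpoint served by fastchat.serve.openai_api_server, assuming the
# shell flow above is running on this host.
URL = "http://localhost:8000/v1/chat/completions"

# "qwen-chat" is an assumed model id; check GET /v1/models for the
# name the vLLM worker actually registered.
payload = {
    "model": "qwen-chat",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": False,
}
body = json.dumps(payload).encode("utf-8")

# To send this against a live stack:
#   import urllib.request
#   req = urllib.request.Request(
#       URL, data=body, headers={"Content-Type": "application/json"})
#   print(urllib.request.urlopen(req).read().decode("utf-8"))
```

The network call is left commented out so the payload construction can be inspected without a running server.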

Documented demo surfaces

Gradio-oriented

Web UI

The original repo exposes a `web_demo.py` path for a quick browser-based local demo.

Terminal-first

CLI demo

The CLI path focuses on streaming token output for local prompt-response testing.
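The actual CLI script lives upstream; as a rough illustration of the streaming pattern it describes, here is a self-contained sketch that flushes tokens to the terminal as they arrive. The hard-coded token list is a stand-in for a real model's incremental decode stream.

```python
import sys

def stream_response(tokens):
    """Print tokens as they arrive, flushing after each one.

    `tokens` stands in for a real incremental decode stream; a
    streaming CLI demo consumes model output the same way.
    """
    pieces = []
    for tok in tokens:
        sys.stdout.write(tok)
        sys.stdout.flush()  # show partial output immediately
        pieces.append(tok)
    sys.stdout.write("\n")
    return "".join(pieces)

# Stand-in token stream; a real run would iterate over model output.
stream_response(["Hello", ",", " how", " can", " I", " help", "?"])
```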

Public studio

Hosted demo

The upstream README links to a public ModelScope studio demo for the 72B chat variant.


Source anchors

Demos and Deployment Surfaces | Qwen Code