# Deploying Qwen 3.5 122B on the Mahidol Cluster
Author: Snit Sanhlao, AI Assistant Gemini
## High-Performance Coding & Chat via 4-bit AWQ Quantization
This guide provides the configuration and setup for Qwen 3.5, optimized for resource-constrained GPU clusters. By utilizing AWQ 4-bit quantization and a Mixture of Experts (MoE) architecture, we achieve high-tier reasoning capabilities while significantly reducing VRAM footprint and token costs.
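To see why 4-bit quantization matters here, a back-of-envelope, weights-only VRAM estimate (this sketch ignores the KV cache, activations, and quantization scales, so real usage is higher; the 122B figure comes from the model name):

```python
# Rough weights-only VRAM estimate. Ignores KV cache, activations,
# and AWQ scale/zero-point overhead, so actual usage will be higher.
PARAMS = 122e9  # total parameters (from the 122B model name)

def weight_gib(bits_per_param: float) -> float:
    """GiB needed to store all weights at the given precision."""
    return PARAMS * bits_per_param / 8 / 2**30

fp16 = weight_gib(16)  # ~227 GiB: far beyond a single GPU
awq4 = weight_gib(4)   # ~57 GiB: feasible on a small multi-GPU node
print(f"FP16: {fp16:.0f} GiB, AWQ 4-bit: {awq4:.0f} GiB")
```

The roughly 4x reduction is what makes a 122B-parameter model deployable on a resource-constrained cluster.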
## Model Specifications

| Feature | Detail |
|---|---|
| Model ID | `qwen3.5` |
| Architecture | Mixture of Experts (MoE) |
| Active Params | ~10B (efficiency of a 10B model, knowledge of a 122B) |
| Quantization | 4-bit AWQ (Activation-aware Weight Quantization) |
| Context Window | 32,768 tokens (optimized to 16,384 for stability) |
| vLLM Engine | V1 (experimental asynchronous engine) |
| API Endpoint | `https://aicenter.mahidol.ac.th/vllm/v1` |
## IDE Integration (VS Code)

### Continue.dev Configuration
To use Qwen 3.5 as your coding assistant, update your `~/.continue/config.yaml`:

```yaml
name: Local Config
version: 1.0.0
schema: v1
models:
  - name: Qwen3.5
    provider: openai
    model: qwen3.5
    apiBase: https://aicenter.mahidol.ac.th/vllm/v1
    systemMessage: "You are a helpful assistant."
    apiKey: "sk-xxxx"
    contextLength: 16384 # Explicitly set to the 16k window
    maxTokens: 4096 # Leave room for the model to respond
    requestOptions:
      extraBodyProperties:
        chat_template_kwargs:
          enable_thinking: false
context:
  - provider: web
    params:
      engine: "searxng"
      query: ""
      searxngBaseUrl: https://aicenter.mahidol.ac.th/metasearch/
      n: 5
  - provider: code
  - provider: docs
  - provider: diff
  - provider: terminal
  - provider: problems
  - provider: folder
  - provider: codebase
```
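The `extraBodyProperties` block is merged into the JSON body of every chat request, which is how `enable_thinking: false` reaches vLLM. A minimal Python sketch of that merge behavior (the function name `merge_extra_body` is our own illustration, not part of Continue):

```python
import copy

def merge_extra_body(payload: dict, extra: dict) -> dict:
    """Recursively merge extra properties into a request payload,
    sketching what Continue's extraBodyProperties does."""
    merged = copy.deepcopy(payload)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_extra_body(merged[key], value)
        else:
            merged[key] = value
    return merged

base = {
    "model": "qwen3.5",
    "messages": [{"role": "user", "content": "Hello"}],
}
extra = {"chat_template_kwargs": {"enable_thinking": False}}
request_body = merge_extra_body(base, extra)
print(request_body["chat_template_kwargs"])  # {'enable_thinking': False}
```

Disabling thinking keeps completions fast and avoids reasoning traces leaking into inline code suggestions.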
## Open WebUI Deployment
The easiest way to interact with the model is via Open WebUI. Run the following Docker command to connect to the cluster:
```bash
docker run -d -p 3000:8080 \
  --name open-webui \
  --restart always \
  -e WEBUI_AUTH=False \
  -e OPENAI_API_BASE_URL=https://aicenter.mahidol.ac.th/vllm/v1 \
  -e OPENAI_API_KEY=sk-xxxx \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
```
> [!TIP]
> Use the `-d` (detached) flag instead of `-it` to keep the UI running in the background after you close your terminal.
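To keep the API key out of shell history and scripts, the same command can be assembled from environment variables. A small POSIX-shell sketch (the `VLLM_BASE`/`VLLM_KEY` variable names are our own; Open WebUI does not read them directly):

```bash
#!/bin/sh
# Build the docker run command from environment variables so the API
# key is not hard-coded. VLLM_BASE/VLLM_KEY are assumed names used
# only by this script, not recognized by Open WebUI itself.
VLLM_BASE="${VLLM_BASE:-https://aicenter.mahidol.ac.th/vllm/v1}"
VLLM_KEY="${VLLM_KEY:-sk-xxxx}"
CMD="docker run -d -p 3000:8080 \
  --name open-webui --restart always \
  -e WEBUI_AUTH=False \
  -e OPENAI_API_BASE_URL=$VLLM_BASE \
  -e OPENAI_API_KEY=$VLLM_KEY \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main"
echo "$CMD"  # review the command, then run it with: eval "$CMD"
```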
## Verification & Testing

### Connectivity Test (cURL)
Run this in your terminal to verify the endpoint is reachable and the model is loaded:
```bash
curl https://aicenter.mahidol.ac.th/vllm/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-xxxx" \
  -d '{
    "model": "qwen3.5",
    "messages": [{"role": "user", "content": "Explain the benefit of MoE architecture."}],
    "temperature": 0.7
  }'
```
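The same request can be built with Python's standard library. This sketch constructs the request object without sending it (`sk-xxxx` is the same placeholder key as above; uncomment the last lines to actually hit the endpoint):

```python
import json
import urllib.request

API_BASE = "https://aicenter.mahidol.ac.th/vllm/v1"  # cluster endpoint from this guide
API_KEY = "sk-xxxx"  # placeholder; substitute your real key

payload = {
    "model": "qwen3.5",
    "messages": [{"role": "user", "content": "Explain the benefit of MoE architecture."}],
    "temperature": 0.7,
}
req = urllib.request.Request(
    f"{API_BASE}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    },
    method="POST",
)
print(req.full_url)
# To send the request:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```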
## Troubleshooting & Maintenance

| Symptom | Action |
|---|---|
| Tokenizer Error | Ensure vLLM is upgraded to a version that supports the `Qwen3_5MoeForConditionalGeneration` architecture. |
| CUDA Out of Memory | Lower the context window (e.g., `--max-model-len 16384`) or reduce GPU memory utilization. |
| 504 Gateway Timeout | The model is large; increase your client-side timeout (e.g., NGINX `proxy_read_timeout`). |
| 401 Unauthorized | Verify your `apiKey` matches the key issued for the cluster. |
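For the 504 case, a sketch of the relevant NGINX reverse-proxy directives (the upstream name and timeout values are illustrative assumptions, not cluster-mandated settings):

```nginx
# Illustrative timeouts for proxying a slow, large model.
# Upstream name and values are examples only.
location /vllm/ {
    proxy_pass http://vllm-backend;  # hypothetical upstream
    proxy_read_timeout 300s;         # wait up to 5 min for generation
    proxy_send_timeout 300s;
    proxy_connect_timeout 60s;
}
```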
### Container Management

```bash
# View real-time logs
docker logs -f open-webui

# Stop and clean up
docker stop open-webui && docker rm open-webui
```
## Usage Notes

- **Efficiency:** The `A10B` suffix indicates that only ~10B parameters are activated per token, making this much faster than a standard 122B dense model.
- **Architecture:** Uses the `Qwen3_5MoeForConditionalGeneration` architecture, optimized via FlashInfer.
- **Privacy:** All data remains within the Mahidol University infrastructure.
- **Security:** Never commit your actual `apiKey` to a public GitHub repository.
Last Updated: 2026-03-05