# 🚀 Deploying Qwen 3.5 122B on the Mahidol Cluster

Author: Snit Sanhlao, AI Assistant Gemini

## High-Performance Coding & Chat via 4-bit AWQ Quantization

This guide provides the configuration and setup for Qwen 3.5, optimized for resource-constrained GPU clusters. By utilizing AWQ 4-bit quantization and a Mixture of Experts (MoE) architecture, we achieve high-tier reasoning capabilities while significantly reducing VRAM footprint and token costs.


## 📊 Model Specifications

| Feature | Detail |
| --- | --- |
| Model ID | `cyankiwi/Qwen3.5-122B-A10B-AWQ-4bit` |
| Architecture | Mixture of Experts (MoE) |
| Active Params | ~10B (efficiency of a 10B model, knowledge of a 122B) |
| Quantization | 4-bit AWQ (Activation-aware Weight Quantization) |
| Context Window | 32,768 tokens (capped at 16,384 for stability) |
| vLLM Engine | V1 (experimental asynchronous engine) |
| API Endpoint | `https://aicenter.mahidol.ac.th/vllm/v1` |
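As a rough sanity check on why 4-bit quantization matters here, the weight footprint can be estimated with back-of-envelope arithmetic; this is a sketch, and the ~10% overhead factor for quantization scales and unquantized layers is an assumption, not a measured value:

```python
def weight_gib(total_params_b: float, bits_per_weight: float, overhead: float = 1.0) -> float:
    """Rough VRAM estimate for model weights only (excludes KV cache and activations).

    `overhead` is an assumed multiplier for quantization scales/zero-points
    and layers kept in higher precision.
    """
    bytes_total = total_params_b * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 2**30

fp16 = weight_gib(122, 16)                 # dense FP16 baseline: ~227 GiB
awq4 = weight_gib(122, 4, overhead=1.1)    # 4-bit AWQ with assumed ~10% overhead
print(f"FP16: ~{fp16:.0f} GiB, AWQ 4-bit: ~{awq4:.0f} GiB")
```

The roughly 4x reduction is what makes a 122B-parameter model fit on a resource-constrained cluster at all.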


## 💻 IDE Integration (VS Code)

### Continue.dev Configuration

To use Qwen 3.5 as your coding assistant, update your `~/.continue/config.yaml`:

```yaml
name: Local Config
version: 1.0.0
schema: v1
models:
  - name: Qwen3.5
    provider: openai
    model: qwen3.5
    apiBase: https://aicenter.mahidol.ac.th/vllm/v1
    systemMessage: "You are a helpful assistant."
    apiKey: "sk-xxxx"
    contextLength: 16384  # Match the stability-capped 16k context window
    maxTokens: 4096       # Leave room for the model to respond
    requestOptions:
      extraBodyProperties:
        chat_template_kwargs:
          enable_thinking: false
context:
  - provider: web
    params:
      engine: "searxng"
      query: ""
      searxngBaseUrl: https://aicenter.mahidol.ac.th/metasearch/
      n: 5
  - provider: code
  - provider: docs
  - provider: diff
  - provider: terminal
  - provider: problems
  - provider: folder
  - provider: codebase
```
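Outside the IDE, the same OpenAI-compatible endpoint can be called from any script. A minimal standard-library sketch that builds (but does not send) the request, mirroring the settings above; `sk-xxxx` is the placeholder key from this guide:

```python
import json
import urllib.request

API_BASE = "https://aicenter.mahidol.ac.th/vllm/v1"
API_KEY = "sk-xxxx"  # placeholder; substitute your real key

def build_chat_request(prompt: str, model: str = "qwen3.5",
                       max_tokens: int = 4096) -> urllib.request.Request:
    """Build an OpenAI-compatible /chat/completions request (not sent here)."""
    body = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": max_tokens,
        # Mirrors the Continue.dev extraBodyProperties above:
        "chat_template_kwargs": {"enable_thinking": False},
    }
    return urllib.request.Request(
        f"{API_BASE}/chat/completions",
        data=json.dumps(body).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
    )

req = build_chat_request("Write a Python hello world.")
# To actually send it: resp = urllib.request.urlopen(req, timeout=120)
```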

๐ŸŒ Open WebUI Deployment#

The easiest way to interact with the model is via Open WebUI. Run the following Docker command to connect to the cluster:

```bash
docker run -d -p 3000:8080 \
  --name open-webui \
  --restart always \
  -e WEBUI_AUTH=False \
  -e OPENAI_API_BASE_URL=https://aicenter.mahidol.ac.th/vllm/v1 \
  -e OPENAI_API_KEY=sk-xxxx \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
```

> [!TIP]
> Use the `-d` (detached) flag instead of `-it` to keep the UI running in the background after you close your terminal.
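If you prefer declarative deployments, the same flags translate to a `docker-compose.yml`; this is a sketch of an equivalent setup, with the service and volume names chosen here as assumptions:

```yaml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: always
    ports:
      - "3000:8080"
    environment:
      WEBUI_AUTH: "False"
      OPENAI_API_BASE_URL: https://aicenter.mahidol.ac.th/vllm/v1
      OPENAI_API_KEY: sk-xxxx
    volumes:
      - open-webui:/app/backend/data

volumes:
  open-webui:
```

Start it with `docker compose up -d`; the `restart: always` policy replaces the `--restart always` flag.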


## 🧪 Verification & Testing

### Connectivity Test (cURL)

Run this in your terminal to verify the endpoint is reachable and the model is loaded:

```bash
curl https://aicenter.mahidol.ac.th/vllm/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-xxxx" \
  -d '{
    "model": "qwen3.5",
    "messages": [{"role": "user", "content": "Explain the benefit of MoE architecture."}],
    "temperature": 0.7
  }'
```
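The server replies in the OpenAI chat-completion schema, so the assistant text sits under `choices[0].message.content`. A small helper to extract it; the sample payload below is illustrative, not real server output:

```python
import json

def extract_reply(response_json: str) -> str:
    """Return the assistant message from an OpenAI-style chat completion."""
    data = json.loads(response_json)
    return data["choices"][0]["message"]["content"]

# Illustrative response body (field values are made up for the example):
sample = json.dumps({
    "id": "chatcmpl-123",
    "model": "qwen3.5",
    "choices": [{
        "index": 0,
        "message": {"role": "assistant",
                    "content": "MoE routes each token to a few experts."},
        "finish_reason": "stop",
    }],
})
print(extract_reply(sample))
```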

## 🛠 Troubleshooting & Maintenance

| Symptom | Action |
| --- | --- |
| Tokenizer Error | Ensure vLLM is upgraded to support `TokenizersBackend` (transformers v5.0+). |
| CUDA Out of Memory | Lower `gpu-memory-utilization` to 0.80 or reduce `max-model-len`. |
| 504 Gateway Timeout | The model is large; increase your client-side timeout (e.g., NGINX `proxy_read_timeout`). |
| 401 Unauthorized | Verify your `sk-xxxx` API key is passed in the `Authorization` header. |
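For the 504 case, raising the timeout is often not enough on long generations; client-side retries with exponential backoff also help. A minimal sketch, where the delay schedule and the `TimeoutError` stand-in for your HTTP client's timeout exception are assumptions:

```python
import time

def backoff_delays(retries: int = 4, base: float = 2.0, cap: float = 60.0) -> list:
    """Exponential backoff schedule: base**attempt seconds, capped at `cap`."""
    return [min(base ** attempt, cap) for attempt in range(1, retries + 1)]

def call_with_retries(send, retries: int = 4):
    """Call `send()` and retry on timeout errors, sleeping between attempts."""
    last_exc = None
    for delay in backoff_delays(retries):
        try:
            return send()
        except TimeoutError as exc:  # substitute your HTTP client's timeout error
            last_exc = exc
            time.sleep(delay)
    raise last_exc
```

With the defaults this retries four times at 2, 4, 8, and 16 seconds before giving up.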

### Container Management

```bash
# View real-time logs
docker logs -f open-webui

# Stop and clean up
docker stop open-webui && docker rm open-webui
```

๐Ÿ“ Usage Notes#

  • Efficiency: The A10B suffix indicates that only ~10B parameters are activated per token, making this much faster than a standard 122B dense model.

  • Architecture: Uses the Qwen3_5MoeForConditionalGeneration architecture, optimized via FlashInfer.

  • Privacy: All data remains within the Mahidol University infrastructure.

  • Security: Never commit your actual apiKey to a public GitHub repository.
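In line with the security note, one common pattern is to read the key from an environment variable rather than hard-coding it; the variable name `VLLM_API_KEY` below is an assumption, not part of this deployment:

```python
import os

def get_api_key(var: str = "VLLM_API_KEY") -> str:
    """Read the API key from the environment instead of committing it to git."""
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"Set {var} first (e.g. export {var}=sk-...)")
    return key
```

Scripts can then call `get_api_key()` wherever `sk-xxxx` appears in the examples above.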


Last Updated: 2026-03-05