# 🚀 Deploying Qwen 3.5 122B on the Mahidol Cluster

Author: Snit Sanhlao, AI Assistant Gemini

## High-Performance Coding & Chat via 4-bit AWQ Quantization

This guide provides the configuration and setup for Qwen 3.5, optimized for resource-constrained GPU clusters. By utilizing AWQ 4-bit quantization and a Mixture of Experts (MoE) architecture, we achieve high-tier reasoning capabilities while significantly reducing VRAM footprint and token costs.


## 📊 Model Specifications

| Feature | Detail |
| --- | --- |
| Model ID | `cyankiwi/Qwen3.5-122B-A10B-AWQ-4bit` |
| Architecture | Mixture of Experts (MoE) |
| Active Params | ~10B (efficiency of a 10B model, knowledge of a 122B) |
| Quantization | 4-bit AWQ (Activation-aware Weight Quantization) |
| Context Window | 32,768 tokens (capped at 16,384 for stability) |
| vLLM Engine | V1 (experimental asynchronous engine) |
| API Endpoint | `https://aicenter.mahidol.ac.th/vllm/v1` |
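As a rough sanity check on why 4-bit quantization matters here, the weight footprint can be estimated with back-of-envelope arithmetic; this is a sketch, and the ~10% overhead factor for quantization scales and unquantized layers is an assumption, not a measured value:

```python
def weight_gib(total_params_b: float, bits_per_weight: float, overhead: float = 1.0) -> float:
    """Rough VRAM estimate for model weights only (excludes KV cache and activations).

    `overhead` is an assumed multiplier for quantization scales/zero-points
    and layers kept in higher precision.
    """
    bytes_total = total_params_b * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 2**30

fp16 = weight_gib(122, 16)                 # dense FP16 baseline: ~227 GiB
awq4 = weight_gib(122, 4, overhead=1.1)    # 4-bit AWQ with assumed ~10% overhead
print(f"FP16: ~{fp16:.0f} GiB, AWQ 4-bit: ~{awq4:.0f} GiB")
```

The roughly 4x reduction is what makes a 122B-parameter model fit on a resource-constrained cluster at all.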


## 💻 IDE Integration (VS Code)

### Continue.dev Configuration

To use Qwen 3.5 as your coding assistant, update your `~/.continue/config.yaml`:

```yaml
name: Local Config
version: 1.0.0
schema: v1
models:
  - name: Qwen3.5
    provider: openai
    model: qwen3.5
    apiBase: https://aicenter.mahidol.ac.th/vllm/v1
    systemMessage: "You are a helpful assistant."
    apiKey: "sk-xxxx"
    contextLength: 16384  # Match the stability-capped 16k context window
    maxTokens: 4096       # Leave room for the model to respond
    requestOptions:
      extraBodyProperties:
        chat_template_kwargs:
          enable_thinking: false
context:
  - provider: web
    params:
      engine: "searxng"
      query: ""
      searxngBaseUrl: https://aicenter.mahidol.ac.th/metasearch/
      n: 5
  - provider: code
  - provider: docs
  - provider: diff
  - provider: terminal
  - provider: problems
  - provider: folder
  - provider: codebase
```
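Outside the IDE, the same OpenAI-compatible endpoint can be called from any script. A minimal standard-library sketch that builds (but does not send) the request, mirroring the settings above; `sk-xxxx` is the placeholder key from this guide:

```python
import json
import urllib.request

API_BASE = "https://aicenter.mahidol.ac.th/vllm/v1"
API_KEY = "sk-xxxx"  # placeholder; substitute your real key

def build_chat_request(prompt: str, model: str = "qwen3.5",
                       max_tokens: int = 4096) -> urllib.request.Request:
    """Build an OpenAI-compatible /chat/completions request (not sent here)."""
    body = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": max_tokens,
        # Mirrors the Continue.dev extraBodyProperties above:
        "chat_template_kwargs": {"enable_thinking": False},
    }
    return urllib.request.Request(
        f"{API_BASE}/chat/completions",
        data=json.dumps(body).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
    )

req = build_chat_request("Write a Python hello world.")
# To actually send it: resp = urllib.request.urlopen(req, timeout=120)
```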

๐ŸŒ Open WebUI Deployment#

The easiest way to interact with the model is via Open WebUI. Run the following Docker command to connect to the cluster:

```bash
docker run -d -p 3000:8080 \
  --name open-webui \
  --restart always \
  -e WEBUI_AUTH=False \
  -e OPENAI_API_BASE_URL=https://aicenter.mahidol.ac.th/vllm/v1 \
  -e OPENAI_API_KEY=sk-xxxx \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
```

> [!TIP]
> Use the `-d` (detached) flag instead of `-it` to keep the UI running in the background after you close your terminal.
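If you prefer declarative deployments, the same flags translate to a `docker-compose.yml`; this is a sketch of an equivalent setup, with the service and volume names chosen here as assumptions:

```yaml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: always
    ports:
      - "3000:8080"
    environment:
      WEBUI_AUTH: "False"
      OPENAI_API_BASE_URL: https://aicenter.mahidol.ac.th/vllm/v1
      OPENAI_API_KEY: sk-xxxx
    volumes:
      - open-webui:/app/backend/data

volumes:
  open-webui:
```

Start it with `docker compose up -d`; the `restart: always` policy replaces the `--restart always` flag.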


## 🧪 Verification & Testing

### Connectivity Test (cURL)

Run this in your terminal to verify the endpoint is reachable and the model is loaded:

```bash
curl https://aicenter.mahidol.ac.th/vllm/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-xxxx" \
  -d '{
    "model": "qwen3.5",
    "messages": [{"role": "user", "content": "Explain the benefit of MoE architecture."}],
    "temperature": 0.7
  }'
```
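The server replies in the OpenAI chat-completion schema, so the assistant text sits under `choices[0].message.content`. A small helper to extract it; the sample payload below is illustrative, not real server output:

```python
import json

def extract_reply(response_json: str) -> str:
    """Return the assistant message from an OpenAI-style chat completion."""
    data = json.loads(response_json)
    return data["choices"][0]["message"]["content"]

# Illustrative response body (field values are made up for the example):
sample = json.dumps({
    "id": "chatcmpl-123",
    "model": "qwen3.5",
    "choices": [{
        "index": 0,
        "message": {"role": "assistant",
                    "content": "MoE routes each token to a few experts."},
        "finish_reason": "stop",
    }],
})
print(extract_reply(sample))
```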

## 🛠 Troubleshooting & Maintenance

| Symptom | Action |
| --- | --- |
| Tokenizer Error | Ensure vLLM is upgraded to support `TokenizersBackend` (transformers v5.0+). |
| CUDA Out of Memory | Lower `gpu-memory-utilization` to 0.80 or reduce `max-model-len`. |
| 504 Gateway Timeout | The model is large; increase your client-side timeout (e.g., NGINX `proxy_read_timeout`). |
| 401 Unauthorized | Verify your `sk-xxxx` API key is passed in the `Authorization` header. |
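For the 504 case, raising the timeout is often not enough on long generations; client-side retries with exponential backoff also help. A minimal sketch, where the delay schedule and the `TimeoutError` stand-in for your HTTP client's timeout exception are assumptions:

```python
import time

def backoff_delays(retries: int = 4, base: float = 2.0, cap: float = 60.0) -> list:
    """Exponential backoff schedule: base**attempt seconds, capped at `cap`."""
    return [min(base ** attempt, cap) for attempt in range(1, retries + 1)]

def call_with_retries(send, retries: int = 4):
    """Call `send()` and retry on timeout errors, sleeping between attempts."""
    last_exc = None
    for delay in backoff_delays(retries):
        try:
            return send()
        except TimeoutError as exc:  # substitute your HTTP client's timeout error
            last_exc = exc
            time.sleep(delay)
    raise last_exc
```

With the defaults this retries four times at 2, 4, 8, and 16 seconds before giving up.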

### Container Management

```bash
# View real-time logs
docker logs -f open-webui

# Stop and clean up
docker stop open-webui && docker rm open-webui
```

๐Ÿ“ Usage Notes#

  • Efficiency: The A10B suffix indicates that only ~10B parameters are activated per token, making this much faster than a standard 122B dense model.

  • Architecture: Uses the Qwen3_5MoeForConditionalGeneration architecture, optimized via FlashInfer.

  • Privacy: All data remains within the Mahidol University infrastructure.

  • Security: Never commit your actual apiKey to a public GitHub repository.
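In line with the security note, one common pattern is to read the key from an environment variable rather than hard-coding it; the variable name `VLLM_API_KEY` below is an assumption, not part of this deployment:

```python
import os

def get_api_key(var: str = "VLLM_API_KEY") -> str:
    """Read the API key from the environment instead of committing it to git."""
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"Set {var} first (e.g. export {var}=sk-...)")
    return key
```

Scripts can then call `get_api_key()` wherever `sk-xxxx` appears in the examples above.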


Last Updated: 2026-03-05