# Hermes Remote HPC + Slack: Research Acceleration Guide

**Authors:** Snit Sanghlao, Qwen, Hermes

**Published by:** Mahidol University AI Center

**Reference:** `aicenter.mahidol.ac.th/hermes-remote-hpc-slack`

> **Inspired by:** [Andrej Karpathy's AutoResearch](https://github.com/karpathy/autoresearch), an autonomous research agent that iterates on experiments, analysis, and reporting without manual intervention. We adapt this vision for cluster-based ML research using Hermes as the local orchestrator.
>
> **Hermes documentation:** [https://hermes-agent.nousresearch.com/docs](https://hermes-agent.nousresearch.com/docs)

---

## Table of Contents

1. [Executive Summary](#1-executive-summary)
2. [Prerequisites](#2-prerequisites)
3. [SSH Setup](#3-step-1-ssh-setup)
4. [Install Hermes Locally](#4-step-2-install-hermes-locally)
5. [Configure Hermes SSH Backend](#5-step-3-configure-hermes-ssh-backend)
6. [Slack Gateway Setup](#6-step-4-slack-gateway-setup)
7. [SLURM Job Templates + Singularity](#7-step-5-slurm-job-templates--singularity)
8. [Simple CNN Demo](#8-step-6-simple-cnn-demo)
9. [Automated Research Loop via Hermes Cron](#9-step-7-automated-research-loop-via-hermes-cron)
10. [Security Hardening](#10-step-8-security-hardening)
11. [Troubleshooting](#11-troubleshooting)
12. [Citations](#12-citations)

---

## 1. Executive Summary

Machine learning research on HPC clusters should not require a terminal session glued to your screen.
This guide shows how to drive experiments **entirely through Slack** using:

| Component | Role |
|-----------|------|
| **Hermes** (local machine) | AI-powered orchestrator: parses Slack commands, manages workflows, coordinates cluster jobs |
| **SSH** | Secure, key-based tunnel from your workstation to the HPC login node |
| **SLURM + Singularity** (remote cluster) | Batch scheduling and containerized GPU training environments |
| **Slack** | Natural-language interface for submitting runs, checking logs, and reviewing results |

The goal is a closed-loop research workflow where you describe an idea in Slack and receive plots, metrics, and the next research direction, all without opening a terminal.

### The Research Benefit Loop

```
IDEA (Slack) ──► HERMES (Local) ──► SSH ──► SLURM (Cluster) ──► GPU TRAINING
                                                                      │
RESULTS (Logs/Plots) ◄── SLURM ◄── HERMES ◄────────  ┌─────────────┐
                                                     │             │
                                                     │   ITERATE   │◄─┘
SLACK REPORT + NEXT IDEA ◄───────────────────────────└─────────────┘
```

You type an idea in Slack → Hermes translates it into a workflow → the cluster trains → results flow back → Hermes summarizes and suggests the next step → repeat.

### The Convergence Principle

The loop above is an instance of a general research iteration model:

```
Ideal-Starter(Human | LLM) − Validator(any abstraction level) = Residual
                                                                    │
                                         Auto Research Loop ◄───────┘
                                                 │
                                        Residual Converges?
                                         Yes     │     No
                                          │            └──► iterate
                                         STOP
```

| Term | Meaning |
|------|---------|
| **Ideal-Starter** | The initial hypothesis or experiment design, provided by a human, generated by an LLM, or both |
| **Validator** | Any evaluation signal (loss curves, accuracy metrics, statistical tests, human review, or a critic model), operating at any abstraction level |
| **Residual** | The gap between the current result and the ideal: what remains to be explained or improved |
| **Converge** | Residual falls below an acceptable threshold (e.g., validation loss plateaus, accuracy target met, human approves result) |

The auto research loop driven by Hermes operationalizes this model: each iteration aims to reduce the residual, and once the residual converges the loop stops and the final result is reported.

---

## 2. Prerequisites

Both the local workstation (where Hermes runs) and the remote HPC cluster must meet the requirements listed below.

| Requirement | Home Machine (Local) | Remote Cluster (HPC) |
|-------------|----------------------|----------------------|
| **OS** | Ubuntu 20.04+ / macOS 12+ / WSL2 | Linux (CentOS 7+, Rocky 8+, Ubuntu 20+) |
| **Python** | 3.10+ (for Hermes + helpers) | 3.9+ (inside Singularity containers) |
| **Node.js** | 18+ (for Slack Bolt SDK) | Not required |
| **SSH Key** | Yes (public key installed on cluster) | Yes (`authorized_keys` on login node) |
| **SLURM** | No | **Yes** (`sbatch`, `squeue`, `scancel`) |
| **Singularity/Apptainer** | No | **Yes** (containerized ML environments) |
| **Hermes** | **Yes** (installed locally) | No (runs locally only) |

> **Note:** If your cluster uses a different scheduler (PBS/Torque, LSF), adapt the SLURM commands accordingly. The architecture remains the same.

---

## 3. Step 1: SSH Setup

Hermes communicates with the HPC cluster over SSH. A password-less, key-based configuration is essential for fully automated workflows.

### 3.1. Generate an SSH Key

```bash
ssh-keygen -t ed25519 -C "your_email@example.com"
# Private key: ~/.ssh/id_ed25519      (keep secret)
# Public key:  ~/.ssh/id_ed25519.pub  (copy to cluster)
```

### 3.2. Install the Public Key on the Cluster

```bash
# Copy the public key to the cluster
ssh-copy-id YOUR_USERNAME@YOUR_CLUSTER_ADDRESS

# Alternatively, paste the contents of id_ed25519.pub into
# ~/.ssh/authorized_keys on the cluster login node.
```

### 3.3. Configure `~/.ssh/config`

Add a named host entry. Hermes will reference this alias throughout the workflow.

```
Host hpc
    HostName YOUR_CLUSTER_ADDRESS
    User YOUR_USERNAME
    IdentityFile ~/.ssh/id_ed25519
    Port 22
    # Agent forwarding lets the cluster use keys held by your local SSH agent.
    # Uncomment only if you need that; Section 10.3 recommends keeping it off.
    # ForwardAgent yes
```

> Replace `YOUR_CLUSTER_ADDRESS` and `YOUR_USERNAME` with your actual values.

### 3.4. Test the Connection

```bash
ssh hpc
whoami
# Expected: YOUR_USERNAME
exit
```

If you reach the cluster without a password prompt, the SSH tunnel is ready.

---

## 4. Step 2: Install Hermes Locally

**Hermes** is the local AI agent that bridges Slack and your HPC cluster. It receives Slack commands, translates them into terminal / SLURM / file operations, and returns results. Think of it as an always-on research assistant running on your machine.

### 4.1. Install Hermes

```bash
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
```

This installs the Hermes CLI and supporting tools into `~/.hermes`.

### 4.2. Verify the Installation

```bash
hermes doctor
```

All health-check items should pass:

```
✅ Python .......... 3.11.x (OK)
✅ Node.js ......... 18.x (OK)
✅ Config folder ... ~/.hermes (OK)
✅ Disk space ...... 45 GB free (OK)
```

Resolve any flagged issues before proceeding. See the [Hermes documentation](https://hermes-agent.nousresearch.com/docs) for troubleshooting guidance.

### 4.3. Hermes Capabilities at a Glance

| Capability | Description |
|------------|-------------|
| **Command execution** | Runs shell commands locally and over SSH |
| **File management** | Reads, writes, and diffs files on local and remote systems |
| **Process orchestration** | Manages long-running background jobs |
| **Slack integration** | Bolt SDK app that maps Slack messages to Hermes actions |
| **Self-reasoning** | Breaks complex tasks into steps, executes, and verifies outcomes |

---

## 5. Step 3: Configure Hermes SSH Backend

Hermes must know how to reach the cluster. Rather than hardcoding hostnames in scripts, you set a single backend directive that points to the SSH alias defined in `~/.ssh/config`.

### 5.1. Set the Backend and Host

```bash
# Use SSH as the terminal backend
hermes config set terminal.backend ssh

# Point to the SSH alias defined in ~/.ssh/config
hermes config set terminal.host hpc
```

The `hpc` alias must match a `Host hpc` stanza in your SSH config. Hermes resolves the hostname, user, and identity file automatically.

### 5.2. Set the Remote Working Directory

```bash
hermes config set terminal.remote_dir /scratch/YOUR_USERNAME
```

> Use `/scratch` (not `/home`) for large datasets and model checkpoints to avoid quota issues.

### 5.3. Verify the Config File

```yaml
# ~/.hermes/config.yaml
terminal:
  backend: ssh
  host: hpc
  remote_dir: /scratch/YOUR_USERNAME
```

### 5.4. Run a Connectivity Test

```bash
hermes chat -q "whoami; hostname; sinfo"
```

Expected output:

```
┌─ output ────────────────────────────┐
│ YOUR_USERNAME                       │
│ login-node-01.cluster.edu           │
│ PARTITION   AVAIL  TIMELIMIT        │
│ gpu_normal  up     7-00:00          │
│ gpu_dev     up     1-00:00          │
└─────────────────────────────────────┘
```

> **Troubleshooting:** If the test hangs, verify password-less login with plain `ssh hpc` first. Hermes reuses your SSH key; it does not manage authentication itself.

---

## 6. Step 4: Slack Gateway Setup

The Slack gateway turns your workspace into a natural-language control panel for Hermes. This section covers creating a Slack app, wiring it to Hermes, and running the gateway as a system service.

### 6.1. Create a Slack App

1. Go to [https://api.slack.com/apps](https://api.slack.com/apps) and click **Create New App → From scratch**.
2. Name the app (e.g., `hermes-research-bot`) and select your workspace.
3. Under **Features → OAuth & Permissions → Scopes → Bot Token Scopes**, add the following OAuth scopes:
   - `channels:history`, `channels:read`, `channels:manage`
   - `chat:write`, `chat:write.customize`
   - `groups:history`, `im:history`, `im:write`
   - `files:read`, `files:write`
   - `mpim:history`, `users:read`
4. On the same **OAuth & Permissions** page, install the app to your workspace and copy the **Bot User OAuth Token** (begins with `xoxb-`).

> These scopes allow the bot to read channel history, send formatted messages and file attachments, and inspect user information. Request only the scopes your workflow requires.

### 6.2. Generate an App-Level Token

1. Go to **Settings → Basic Information → App-Level Tokens**.
2. Create a token with the `connections:write` scope (required for Socket Mode).
3. Copy the token (begins with `xapp-`).

### 6.3. Configure Hermes Credentials

Run the interactive setup wizard:

```bash
hermes gateway setup
# [?] Select gateway type:  → Slack
# [?] Enable Socket Mode?:  → Yes
```

Store credentials in `~/.hermes/.env`, never in `config.yaml`:

```bash
# ~/.hermes/.env (never commit to version control)
SLACK_BOT_TOKEN=xoxb-YOUR_SLACK_BOT_TOKEN
SLACK_APP_TOKEN=xapp-YOUR_SLACK_APP_TOKEN
```

Reference the environment variables in the config. The single quotes keep your shell from expanding the variables, so the literal `${...}` reference is what gets stored:

```bash
hermes config set gateway.slack.token '${SLACK_BOT_TOKEN}'
hermes config set gateway.slack.app_token '${SLACK_APP_TOKEN}'
```

Your `~/.hermes/config.yaml` should show variable references, not raw secrets:

```yaml
gateway:
  type: slack
  socket_mode: true
  slack:
    token: ${SLACK_BOT_TOKEN}
    app_token: ${SLACK_APP_TOKEN}
```

> **Security note:** Hermes loads `~/.hermes/.env` automatically at startup. If you paste a raw `xoxb-` token directly into `config.yaml`, `hermes doctor` will warn you.

### 6.4. Install and Start the Gateway

```bash
hermes gateway install   # Register as a systemd service (runs as your user)
hermes gateway start     # Start the service
hermes gateway status    # Verify it is running
```

Expected status output:

```
✅ Service active: hermes-gateway.service
✅ Socket Mode: connected
✅ Listening on workspace: your-team.slack.com
```

### 6.5. Send Your First Command

Open Slack and message the bot directly:

```
you → @hermes-research-bot: whoami; hostname; sinfo
bot ← [cluster output formatted as a table]
```

Or mention it in a channel:

```
you → #research-gpu: @hermes-research-bot show me active jobs on gpu_normal
bot ← [squeue output formatted as a table]
```

Hermes parses natural language, routes the request through SSH, executes it on the cluster, and returns formatted results, entirely within Slack.

---

## 7. Step 5: SLURM Job Templates + Singularity

Hermes submits training jobs to the cluster via SLURM `sbatch`. The standard workflow is:

1. Hermes writes a SLURM script on your local machine.
2. It copies the script to the cluster with `scp`.
3. It runs `sbatch` via SSH.
4. It monitors job status and tails logs in real time.

### 7.1. SLURM Script Template

Save the following as `cnn_train.slurm`. The bind mount maps your own scratch directory to `/data` inside the container, so the training script, dataset, and results all live under `/scratch/YOUR_USERNAME` on the cluster:

```bash
#!/bin/bash
#SBATCH --job-name=cnn_cifar10
#SBATCH --partition=gpu_normal
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --output=/scratch/%u/cnn_%j.out
#SBATCH --error=/scratch/%u/cnn_%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=YOUR_EMAIL

# Load required modules
module load singularity

# Path to the Singularity container
CONTAINER=/opt/containers/pytorch_2.0.sif

# Run training inside the container.
# /scratch/$USER on the host is visible as /data in the container.
singularity exec \
    --bind "/scratch/$USER:/data,/home:/home" \
    --nv \
    "$CONTAINER" \
    python /data/train_cnn.py \
        --epochs 50 \
        --batch_size 128 \
        --lr 0.01 \
        --data_dir /data/cifar10 \
        --output_dir /data/cnn_results
```

### 7.2. Submit via Hermes

```bash
# Hermes handles this automatically; the equivalent manual steps are:
scp cnn_train.slurm train_cnn.py hpc:/scratch/YOUR_USERNAME/
ssh hpc "sbatch /scratch/YOUR_USERNAME/cnn_train.slurm"
```

### 7.3. Monitor via Hermes

```bash
# Check job status (job ID, partition, name, state, reason/nodes)
ssh hpc "squeue -u YOUR_USERNAME --format='%i %P %j %T %R'"

# Stream live output
ssh hpc "tail -f /scratch/YOUR_USERNAME/cnn_JOB_ID.out"
```

---

## 8. Step 6: Simple CNN Demo

A ready-to-use PyTorch CNN script (`train_cnn.py`) is included in this repository. It:

- Trains a 4-layer CNN on CIFAR-10
- Accepts `--epochs`, `--batch_size`, `--lr`, `--data_dir`, `--output_dir` arguments
- Saves `metrics.csv` (per-epoch train/val loss and accuracy)
- Saves `best_model.pth` checkpoint
- Outputs `summary.json` with the best validation accuracy

```bash
# Quick local test (CPU, few epochs)
python train_cnn.py \
    --epochs 3 --batch_size 64 --lr 0.001 \
    --data_dir ./data --output_dir ./results

# Submit to cluster via SLURM + Singularity
ssh hpc "sbatch /scratch/YOUR_USERNAME/cnn_train.slurm"
```

See [`train_cnn.py`](train_cnn.py) in this directory for the full source code.

---

## 9. Step 7: Automated Research Loop via Hermes Cron

Hermes can proactively monitor your experiments and Slack-DM you with updates, with no manual polling required.

### 9.1. Create a Monitoring Cron Job

```bash
hermes cron create "every 5m" \
    --prompt "SSH to hpc, run squeue -u YOUR_USERNAME, tail the latest CNN training log, parse the latest epoch metrics, and report train/val accuracy and loss. If training is complete, summarize results and suggest next experiments."
```

### 9.2. Manage Cron Jobs

```bash
hermes cron list         # View all scheduled jobs
hermes cron pause ID     # Pause a specific job
hermes cron resume ID    # Resume a paused job
hermes cron remove ID    # Delete a job permanently
```

**The benefit in practice:** start a 24-hour training run and step away from the terminal. Because Hermes, its gateway, and its cron scheduler run on your local machine, that machine must stay powered on and online; a closed, sleeping laptop will not poll the cluster. Every 5 minutes, Hermes checks the log, parses metrics, and messages you in Slack. When training finishes, you receive a summary and a suggested next step, for example:

> *"Val accuracy plateaued at 82% after epoch 12. Suggest: reduce lr to 0.001 or add data augmentation."*

---

## 10. Step 8: Security Hardening

### 10.1. SLURM Best Practices

| Practice | Directive | Purpose |
|----------|-----------|---------|
| Resource limits | `--mem=32G --cpus-per-task=8 --time=24:00:00` | Prevent runaway jobs |
| Account chargeback | `--account=research_group` | Track compute costs per group |
| Node constraints | `--constraint=v100` | Pin jobs to a specific GPU type |
| Job arrays | `--array=0-9%3` | Run 10 sweeps with max 3 concurrent |
| Node exclusion | `--exclude=node05` | Skip known problematic nodes |

### 10.2. Singularity / Apptainer Best Practices

| Practice | Flag | Purpose |
|----------|------|---------|
| Clean environment | `--cleanenv` | Prevent credential leakage from the host |
| Minimal bind mounts | `--bind /data:/data` | Mount only required directories |
| GPU passthrough | `--nv` | NVIDIA GPU access without full host privileges |
| Unprivileged containers | `--userns` (Apptainer) | User-namespace isolation |
| Image verification | `singularity verify container.sif` | Validate signatures before execution |

### 10.3. SSH Hardening

```
# ~/.ssh/config
Host hpc
    ServerAliveInterval 60
    ServerAliveCountMax 3
    # Do not expose your SSH agent to the cluster
    ForwardAgent no
    # Always verify host keys
    StrictHostKeyChecking yes
```

---

## 11. Troubleshooting

| Symptom | Diagnostic Command | Resolution |
|---------|--------------------|------------|
| Slack bot not responding | `hermes gateway status` | `hermes gateway restart` |
| SSH connection refused | `ssh hpc "echo OK"` | Check `~/.ssh/config` and cluster DNS |
| SLURM job stuck in pending (`PD`) state | `squeue -u YOUR_USERNAME` | Verify partition and GPU availability |
| `singularity exec` fails | Inspect job log | Ensure `--nv` flag is present for NVIDIA GPUs |
| Hermes cannot write to `/scratch` | `ls -ld /scratch/YOUR_USERNAME` | `chmod 755 /scratch/YOUR_USERNAME` |
| Cron job not triggering | `hermes cron list` | Verify schedule format and enabled status |

---

## 12. Citations

```bibtex
@misc{hermes_remote_hpc_slack,
  author       = {Sanghlao, Snit},
  title        = {Hermes Remote HPC + Slack: Research Acceleration Guide},
  year         = {2026},
  howpublished = {Mahidol University AI Center},
  note         = {aicenter.mahidol.ac.th/hermes-remote-hpc-slack},
}

@software{hermes_agent,
  author  = {{Nous Research}},
  title   = {Hermes Agent},
  url     = {https://hermes-agent.nousresearch.com/docs},
  version = {latest},
  year    = {2026},
}

@software{karpathy_autoresearch,
  author = {Karpathy, Andrej},
  title  = {AutoResearch: Autonomous ML Research Agent},
  url    = {https://github.com/karpathy/autoresearch},
  year   = {2025},
}
```

---

*Built for researchers who refuse to compromise on productivity, privacy, and control.*

**Mahidol University AI Center** · **Hermes Agent** · **SLURM + Singularity**
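
Supplementary sketch for the Convergence Principle in Section 1: the "Residual Converges?" decision can be illustrated with a plateau test on validation loss, one of the validators the guide lists. This is a minimal sketch under stated assumptions, not the check Hermes or `train_cnn.py` actually performs. The function names, the `patience` and `min_delta` parameters, and the `val_loss` column name for `metrics.csv` are all hypothetical.

```python
import csv
from typing import List, Sequence


def load_val_losses(path: str, column: str = "val_loss") -> List[float]:
    """Read one per-epoch column from a metrics CSV.

    The column name "val_loss" is an assumption; adjust it to match the
    header that train_cnn.py actually writes.
    """
    with open(path, newline="") as fh:
        return [float(row[column]) for row in csv.DictReader(fh)]


def residual_converged(val_losses: Sequence[float],
                       patience: int = 3,
                       min_delta: float = 1e-3) -> bool:
    """Return True when validation loss has plateaued.

    The loss counts as plateaued when the best value seen in the last
    `patience` epochs improves on the best earlier value by less than
    `min_delta`.
    """
    if patience < 1:
        raise ValueError("patience must be >= 1")
    if len(val_losses) <= patience:
        return False  # not enough history to judge yet
    best_before = min(val_losses[:-patience])
    best_recent = min(val_losses[-patience:])
    return best_before - best_recent < min_delta
```

A monitoring step could call `residual_converged(load_val_losses("metrics.csv"))` after each poll and stop iterating once it returns `True`. Accuracy targets or human approval, the other convergence criteria named in Section 1, would need their own checks.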