# Hermes Remote HPC + Slack — Research Acceleration Guide

**Authors:** Snit Sanghlao, Qwen, Hermes  
**Published by:** Mahidol University AI Center  
**Reference:** `aicenter.mahidol.ac.th/hermes-remote-hpc-slack`

> **Inspired by:** [Andrej Karpathy's AutoResearch](https://github.com/karpathy/autoresearch) — an autonomous research agent that iterates on experiments, analysis, and reporting without manual intervention. We adapt this vision for cluster-based ML research using Hermes as the local orchestrator.
>
> **Hermes documentation:** [https://hermes-agent.nousresearch.com/docs](https://hermes-agent.nousresearch.com/docs)

---

## Table of Contents

1. [Executive Summary](#1-executive-summary)
2. [Prerequisites](#2-prerequisites)
3. [SSH Setup](#3-step-1-ssh-setup)
4. [Install Hermes Locally](#4-step-2-install-hermes-locally)
5. [Configure Hermes SSH Backend](#5-step-3-configure-hermes-ssh-backend)
6. [Slack Gateway Setup](#6-step-4-slack-gateway-setup)
7. [SLURM Job Templates + Singularity](#7-step-5-slurm-job-templates--singularity)
8. [Simple CNN Demo](#8-step-6-simple-cnn-demo)
9. [Automated Research Loop via Hermes Cron](#9-step-7-automated-research-loop-via-hermes-cron)
10. [Security Hardening](#10-step-8-security-hardening)
11. [Troubleshooting](#11-troubleshooting)
12. [Citations](#12-citations)

---

## 1. Executive Summary

Machine learning research on HPC clusters should not require a terminal session glued to your screen. This guide shows how to drive experiments **entirely through Slack** using:

| Component | Role |
|-----------|------|
| **Hermes** (local machine) | AI-powered orchestrator — parses Slack commands, manages workflows, coordinates cluster jobs |
| **SSH** | Secure, key-based connection from your workstation to the HPC login node |
| **SLURM + Singularity** (remote cluster) | Batch scheduling and containerized GPU training environments |
| **Slack** | Natural-language interface for submitting runs, checking logs, and reviewing results |

The goal is a closed-loop research workflow where you describe an idea in Slack and receive plots, metrics, and the next research direction — all without opening a terminal.

### The Research Benefit Loop

```
IDEA (Slack) ──► HERMES (Local) ──► SSH ──► SLURM (Cluster) ──► GPU TRAINING
     ▲                                                               │
     │ ITERATE                                                       ▼
SLACK REPORT + NEXT IDEA ◄── HERMES (Local) ◄── SSH ◄── RESULTS (Logs/Plots)
```

You type an idea in Slack → Hermes translates it into a workflow → the cluster trains → results flow back → Hermes summarizes and suggests the next step → repeat.

### The Convergence Principle

The loop above is an instance of a general research iteration model:

```
Ideal-Starter(Human | LLM)  −  Validator(any abstraction level)  =  Residual
                                                                         │
                                              Auto Research Loop ◄───────┘
                                                      │
                                              Residual Converges?
                                                 Yes  │  No
                                                      │   └──► iterate
                                                   STOP
```

| Term | Meaning |
|------|---------|
| **Ideal-Starter** | The initial hypothesis or experiment design — provided by a human, generated by an LLM, or both |
| **Validator** | Any evaluation signal: loss curves, accuracy metrics, statistical tests, human review, or a critic model — operating at any abstraction level |
| **Residual** | The gap between the current result and the ideal — what remains to be explained or improved |
| **Converge** | Residual falls below an acceptable threshold (e.g., validation loss plateaus, accuracy target met, human approves result) |

The auto research loop driven by Hermes operationalizes this model: each iteration reduces the residual until convergence, at which point the loop stops and the final result is reported.
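
One way to read the model above in code is a loop that keeps calling a validator until the residual drops below a threshold. The sketch below is illustrative only: `run_research_loop`, `propose`, and `validate` are hypothetical names, not part of Hermes, and the toy validator simply measures the distance to a known target.

```python
from typing import Callable, List, Tuple


def run_research_loop(
    propose: Callable[[float], float],
    validate: Callable[[float], float],
    start: float,
    threshold: float = 0.05,
    max_iters: int = 20,
) -> Tuple[float, List[float]]:
    """Iterate until the validator's residual drops below `threshold`.

    `validate` maps a candidate to its residual; `propose` maps the
    current candidate to the next one. Both are supplied by the caller.
    """
    candidate = start
    residuals: List[float] = []
    for _ in range(max_iters):
        residual = validate(candidate)
        residuals.append(residual)
        if residual < threshold:  # converged: stop and report
            break
        candidate = propose(candidate)  # not converged: iterate
    return candidate, residuals


# Toy example (hypothetical): the ideal is 1.0 and every iteration
# closes half of the remaining gap.
ideal = 1.0
final, history = run_research_loop(
    propose=lambda c: c + 0.5 * (ideal - c),
    validate=lambda c: abs(ideal - c),
    start=0.0,
)
print(f"stopped after {len(history)} iterations, residual {history[-1]}")
```

The `max_iters` cap matters in practice: a real validator may never cross the threshold, and the loop should still terminate.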

---

## 2. Prerequisites

Both the local workstation (where Hermes runs) and the remote HPC cluster must meet the requirements listed below.

| Requirement | Local Workstation | Remote Cluster (HPC) |
|-------------|----------------------|----------------------|
| **OS** | Ubuntu 20.04+ / macOS 12+ / WSL2 | Linux (CentOS 7+, Rocky 8+, Ubuntu 20+) |
| **Python** | 3.10+ (for Hermes + helpers) | 3.9+ (inside Singularity containers) |
| **Node.js** | 18+ (for Slack Bolt SDK) | Not required |
| **SSH Key** | Yes — public key installed on cluster | Yes — `authorized_keys` on login node |
| **SLURM** | No | **Yes** — `sbatch`, `squeue`, `scancel` |
| **Singularity/Apptainer** | No | **Yes** — containerized ML environments |
| **Hermes** | **Yes** — installed locally | No (runs locally only) |

> **Note:** If your cluster uses a different scheduler (PBS/Torque, LSF), adapt the SLURM commands accordingly. The architecture remains the same.

---

## 3. Step 1: SSH Setup

Hermes communicates with the HPC cluster over SSH. A password-less, key-based configuration is essential for fully automated workflows.

### 3.1. Generate an SSH Key

```bash
ssh-keygen -t ed25519 -C "your_email@example.com"
# Private key: ~/.ssh/id_ed25519      (keep secret)
# Public key:  ~/.ssh/id_ed25519.pub  (copy to cluster)
```

### 3.2. Install the Public Key on the Cluster

```bash
# Copy the public key to the cluster
ssh-copy-id YOUR_USERNAME@YOUR_CLUSTER_ADDRESS

# Alternatively, paste the contents of id_ed25519.pub into
# ~/.ssh/authorized_keys on the cluster login node.
```

### 3.3. Configure `~/.ssh/config`

Add a named host entry. Hermes will reference this alias throughout the workflow.

```
Host hpc
    HostName YOUR_CLUSTER_ADDRESS
    User     YOUR_USERNAME
    IdentityFile ~/.ssh/id_ed25519
    Port     22
    # Agent forwarding is not needed for this workflow; keep it off (see Section 10.3)
    ForwardAgent no
```

> Replace `YOUR_CLUSTER_ADDRESS` and `YOUR_USERNAME` with your actual values.

### 3.4. Test the Connection

```bash
ssh hpc
whoami        # Expected: YOUR_USERNAME
exit
```

If you reach the cluster without a password prompt, key-based SSH access is ready.
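
For unattended use it can help to run this check non-interactively. The sketch below assumes the `hpc` alias from Section 3.3 and relies on OpenSSH's `BatchMode=yes` option, which makes `ssh` fail instead of prompting for a password. The helper names are hypothetical.

```python
import subprocess
from typing import List


def build_ssh_check(alias: str, timeout_s: int = 10) -> List[str]:
    """Build an ssh command that fails rather than prompting for input."""
    return [
        "ssh",
        "-o", "BatchMode=yes",               # never fall back to a password prompt
        "-o", f"ConnectTimeout={timeout_s}",  # give up quickly if the host is unreachable
        alias,
        "whoami",
    ]


def passwordless_login_works(alias: str = "hpc") -> bool:
    """Return True if `ssh <alias> whoami` succeeds without any prompt."""
    result = subprocess.run(build_ssh_check(alias), capture_output=True, text=True)
    return result.returncode == 0
```

Calling `passwordless_login_works()` returns `False` both when the key is rejected and when the host cannot be reached, so inspect `ssh -v hpc` manually if it fails.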

---

## 4. Step 2: Install Hermes Locally

**Hermes** is the local AI agent that bridges Slack and your HPC cluster. It receives Slack commands, translates them into terminal / SLURM / file operations, and returns results. Think of it as an always-on research assistant running on your machine.

### 4.1. Install Hermes

```bash
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
```

This installs the Hermes CLI and supporting tools into `~/.hermes`.

### 4.2. Verify the Installation

```bash
hermes doctor
```

All health-check items should pass:

```
✅ Python .......... 3.11.x  (OK)
✅ Node.js ......... 18.x    (OK)
✅ Config folder ... ~/.hermes (OK)
✅ Disk space ...... 45 GB free (OK)
```

Resolve any flagged issues before proceeding. See the [Hermes documentation](https://hermes-agent.nousresearch.com/docs) for troubleshooting guidance.

### 4.3. Hermes Capabilities at a Glance

| Capability | Description |
|------------|-------------|
| **Command execution** | Runs shell commands locally and over SSH |
| **File management** | Reads, writes, and diffs files on local and remote systems |
| **Process orchestration** | Manages long-running background jobs |
| **Slack integration** | Bolt SDK app that maps Slack messages to Hermes actions |
| **Self-reasoning** | Breaks complex tasks into steps, executes, and verifies outcomes |

---

## 5. Step 3: Configure Hermes SSH Backend

Hermes must know how to reach the cluster. Rather than hardcoding hostnames in scripts, you set a single backend directive that points to the SSH alias defined in `~/.ssh/config`.

### 5.1. Set the Backend and Host

```bash
# Use SSH as the terminal backend
hermes config set terminal.backend ssh

# Point to the SSH alias defined in ~/.ssh/config
hermes config set terminal.host hpc
```

The `hpc` alias must match a `Host hpc` stanza in your SSH config. Hermes resolves the hostname, user, and identity file automatically.

### 5.2. Set the Remote Working Directory

```bash
hermes config set terminal.remote_dir /scratch/YOUR_USERNAME
```

> Use `/scratch` (not `/home`) for large datasets and model checkpoints to avoid quota issues.

### 5.3. Verify the Config File

```yaml
# ~/.hermes/config.yaml
terminal:
  backend: ssh
  host: hpc
  remote_dir: /scratch/YOUR_USERNAME
```

### 5.4. Run a Connectivity Test

```bash
hermes chat -q "whoami; hostname; sinfo"
```

Expected output:

```
┌─ output ────────────────────────────┐
│ YOUR_USERNAME                       │
│ login-node-01.cluster.edu           │
│ PARTITION      AVAIL  TIMELIMIT     │
│ gpu_normal      up    7-00:00       │
│ gpu_dev         up    1-00:00       │
└─────────────────────────────────────┘
```

> **Troubleshooting:** If the test hangs, verify password-less login with plain `ssh hpc` first. Hermes reuses your SSH key — it does not manage authentication itself.

---

## 6. Step 4: Slack Gateway Setup

The Slack gateway turns your workspace into a natural-language control panel for Hermes. This section covers creating a Slack app, wiring it to Hermes, and running the gateway as a system service.

### 6.1. Create a Slack App

1. Go to [https://api.slack.com/apps](https://api.slack.com/apps) and click **Create New App → From scratch**.
2. Name the app (e.g., `hermes-research-bot`) and select your workspace.
3. Under **OAuth & Permissions → Scopes → Bot Token Scopes**, add the following OAuth scopes:
   - `channels:history`, `channels:read`, `channels:manage`
   - `chat:write`, `chat:write.customize`
   - `groups:history`, `im:history`, `im:write`
   - `files:read`, `files:write`
   - `mpim:history`, `users:read`
4. Under **OAuth & Permissions**, install the app to your workspace and copy the **Bot User OAuth Token** (begins with `xoxb-`).

> These scopes allow the bot to read channel history, send formatted messages and file attachments, and inspect user information. Request only the scopes your workflow requires.

### 6.2. Generate an App-Level Token

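1. Go to **Settings → Basic Information → App-Level Tokens**.
2. Create a token with the `connections:write` scope (required for Socket Mode).
3. Copy the token (begins with `xapp-`).
4. Under **Settings → Socket Mode**, turn on **Enable Socket Mode** so the app connects over a WebSocket instead of a public HTTP endpoint.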

### 6.3. Configure Hermes Credentials

Run the interactive setup wizard:

```bash
hermes gateway setup
# [?] Select gateway type:  → Slack
# [?] Enable Socket Mode?:  → Yes
```

Store credentials in `~/.hermes/.env` — never in `config.yaml`:

```bash
# ~/.hermes/.env — never commit to version control
SLACK_BOT_TOKEN=xoxb-YOUR_SLACK_BOT_TOKEN
SLACK_APP_TOKEN=xapp-YOUR_SLACK_APP_TOKEN
```

Reference the environment variables in config:

```bash
hermes config set gateway.slack.token '${SLACK_BOT_TOKEN}'
hermes config set gateway.slack.app_token '${SLACK_APP_TOKEN}'
```

Your `~/.hermes/config.yaml` should show variable references, not raw secrets:

```yaml
gateway:
  type: slack
  socket_mode: true
  slack:
    token: ${SLACK_BOT_TOKEN}
    app_token: ${SLACK_APP_TOKEN}
```

> **Security note:** Hermes loads `~/.hermes/.env` automatically at startup. If you paste a raw `xoxb-` token directly into `config.yaml`, `hermes doctor` will warn you.
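
As an extra guard, a small script can scan `config.yaml` for token-like strings before you commit or share it. This is a minimal sketch that is independent of `hermes doctor`; the function name and the regular expression are assumptions based only on the `xoxb-` and `xapp-` prefixes mentioned above.

```python
import re
from typing import List

# Slack bot tokens begin with "xoxb-" and app-level tokens with "xapp-".
RAW_TOKEN = re.compile(r"\b(?:xoxb|xapp)-[A-Za-z0-9-]+")


def find_raw_slack_tokens(config_text: str) -> List[str]:
    """Return every token-like string found in the given config text."""
    return RAW_TOKEN.findall(config_text)


safe_config = "gateway:\n  slack:\n    token: ${SLACK_BOT_TOKEN}\n"
leaky_config = "gateway:\n  slack:\n    token: xoxb-123-abc\n"

print(find_raw_slack_tokens(safe_config))   # no raw tokens expected
print(find_raw_slack_tokens(leaky_config))  # the pasted token is reported
```

To check a real file, pass `pathlib.Path("~/.hermes/config.yaml").expanduser().read_text()` to the same function.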

### 6.4. Install and Start the Gateway

```bash
hermes gateway install   # Register as a systemd service (runs as your user)
hermes gateway start     # Start the service
hermes gateway status    # Verify it is running
```

Expected status output:

```
✅ Service active: hermes-gateway.service
✅ Socket Mode: connected
✅ Listening on workspace: your-team.slack.com
```

### 6.5. Send Your First Command

Open Slack and message the bot directly:

```
you  →  @hermes-research-bot: whoami; hostname; sinfo
bot  ←  [cluster output formatted as a table]
```

Or mention it in a channel:

```
you  →  #research-gpu: @hermes-research-bot show me active jobs on gpu_normal
bot  ←  [squeue output formatted as a table]
```

Hermes parses natural language, routes the request through SSH, executes it on the cluster, and returns formatted results — entirely within Slack.

---

## 7. Step 5: SLURM Job Templates + Singularity

Hermes submits training jobs to the cluster via SLURM `sbatch`. The standard workflow is:

1. Hermes writes a SLURM script on your local machine
2. It SCPs the script to the cluster
3. It runs `sbatch` via SSH
4. It monitors job status and tails logs in real time
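
The manual equivalent of steps 2 and 3 can be sketched in Python as follows. This is not how Hermes implements submission internally. The helper names are hypothetical, and the only assumption about SLURM is that `sbatch` prints `Submitted batch job <id>` on success.

```python
import os
import re
import subprocess
from typing import List, Optional


def build_submit_commands(script: str, alias: str, remote_dir: str) -> List[List[str]]:
    """Return the scp command (step 2) and the ssh/sbatch command (step 3)."""
    remote_path = f"{remote_dir.rstrip('/')}/{os.path.basename(script)}"
    return [
        ["scp", script, f"{alias}:{remote_path}"],
        ["ssh", alias, f"sbatch {remote_path}"],
    ]


def parse_job_id(sbatch_stdout: str) -> Optional[int]:
    """Extract the job ID from sbatch's 'Submitted batch job <id>' line."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    return int(match.group(1)) if match else None


def submit(script: str, alias: str, remote_dir: str) -> Optional[int]:
    """Copy the script to the cluster, submit it, and return the job ID."""
    scp_cmd, sbatch_cmd = build_submit_commands(script, alias, remote_dir)
    subprocess.run(scp_cmd, check=True)
    done = subprocess.run(sbatch_cmd, check=True, capture_output=True, text=True)
    return parse_job_id(done.stdout)
```

Keeping the job ID is what makes step 4 possible: the log file name and any later `scancel` call both depend on it.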

### 7.1. SLURM Script Template

Save the following as `cnn_train.slurm`:

```bash
#!/bin/bash
#SBATCH --job-name=cnn_cifar10
#SBATCH --partition=gpu_normal
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --output=/scratch/%u/cnn_%j.out
#SBATCH --error=/scratch/%u/cnn_%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=YOUR_EMAIL

# Load required modules
module load singularity

# Path to the Singularity container
CONTAINER=/opt/containers/pytorch_2.0.sif

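# Run training inside the container. /scratch/$USER is mounted at /data,
# so train_cnn.py and the cifar10 dataset must already be in /scratch/$USER.
singularity exec \
  --bind "/scratch/$USER:/data" \
  --nv \
  "$CONTAINER" \
  python /data/train_cnn.py \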
    --epochs      50  \
    --batch_size  128 \
    --lr          0.01 \
    --data_dir    /data/cifar10 \
    --output_dir  /data/cnn_results
```
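
When many variants of this template are needed, the header can be generated rather than edited by hand. The sketch below mirrors the directives above; `JobSpec` and `render_header` are hypothetical helper names, not Hermes features.

```python
from dataclasses import dataclass


@dataclass
class JobSpec:
    """Resource request for one training job (defaults match the template)."""
    job_name: str = "cnn_cifar10"
    partition: str = "gpu_normal"
    gpus: int = 1
    time_limit: str = "24:00:00"
    cpus: int = 8
    mem: str = "32G"


def render_header(spec: JobSpec) -> str:
    """Render the shebang and #SBATCH lines for a job specification."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={spec.job_name}",
        f"#SBATCH --partition={spec.partition}",
        f"#SBATCH --gres=gpu:{spec.gpus}",
        f"#SBATCH --time={spec.time_limit}",
        f"#SBATCH --cpus-per-task={spec.cpus}",
        f"#SBATCH --mem={spec.mem}",
        "#SBATCH --output=/scratch/%u/cnn_%j.out",  # %u = user, %j = job ID
        "#SBATCH --error=/scratch/%u/cnn_%j.err",
    ]
    return "\n".join(lines) + "\n"


# Example: a short smoke test on a development partition.
header = render_header(JobSpec(partition="gpu_dev", time_limit="01:00:00"))
print(header)
```

The body of the script (module loading and the `singularity exec` call) can then be appended unchanged.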

### 7.2. Submit via Hermes

```bash
# Hermes handles this automatically; the equivalent manual steps are:
scp cnn_train.slurm train_cnn.py hpc:/scratch/YOUR_USERNAME/
ssh hpc "sbatch /scratch/YOUR_USERNAME/cnn_train.slurm"
```

### 7.3. Monitor via Hermes

```bash
# Check job status (job ID, partition, name, state, reason or node list)
ssh hpc "squeue -u YOUR_USERNAME --format='%i %P %j %T %R'"

# Stream live output
ssh hpc "tail -f /scratch/YOUR_USERNAME/cnn_JOB_ID.out"
```
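
To post-process the queue listing, a delimiter-separated format is easier to parse than the default table. The sketch assumes the output of `squeue -u YOUR_USERNAME --noheader --format='%i|%P|%j|%T|%R'` (job ID, partition, name, state, reason or node list). The parser is a hypothetical helper, and the sample text is made up for illustration.

```python
from typing import Dict, List

FIELDS = ["job_id", "partition", "name", "state", "reason"]


def parse_squeue(output: str) -> List[Dict[str, str]]:
    """Parse pipe-delimited squeue output into one dict per job."""
    jobs = []
    for line in output.splitlines():
        parts = line.strip().split("|")
        if len(parts) == len(FIELDS):  # skip blank or malformed lines
            jobs.append(dict(zip(FIELDS, parts)))
    return jobs


# Invented sample in the assumed format.
sample = (
    "123456|gpu_normal|cnn_cifar10|RUNNING|node01\n"
    "123457|gpu_normal|cnn_cifar10|PENDING|(Resources)\n"
)
jobs = parse_squeue(sample)
pending = [job["job_id"] for job in jobs if job["state"] == "PENDING"]
print(pending)
```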

---

## 8. Step 6: Simple CNN Demo

A ready-to-use PyTorch CNN script (`train_cnn.py`) is included in this repository. It:

- Trains a 4-layer CNN on CIFAR-10
- Accepts `--epochs`, `--batch_size`, `--lr`, `--data_dir`, `--output_dir` arguments
- Saves `metrics.csv` (per-epoch train/val loss and accuracy)
- Saves `best_model.pth` checkpoint
- Outputs `summary.json` with the best validation accuracy

```bash
# Quick local test (CPU, 3 epochs)
python train_cnn.py \
  --epochs 3 --batch_size 64 --lr 0.001 \
  --data_dir ./data --output_dir ./results

# Submit to cluster via SLURM + Singularity
ssh hpc "sbatch /scratch/YOUR_USERNAME/cnn_train.slurm"
```

See [`train_cnn.py`](train_cnn.py) in this directory for the full source code.
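
The per-epoch metrics can be summarized with a few lines of standard-library Python. The column names `epoch` and `val_acc` are assumptions about the layout of `metrics.csv`, so check the header that `train_cnn.py` actually writes. The sample rows are invented for illustration.

```python
import csv
import io
from typing import Dict, TextIO, Union


def best_epoch(metrics_file: TextIO, metric: str = "val_acc") -> Dict[str, Union[int, float]]:
    """Return the epoch with the highest value of `metric`."""
    rows = list(csv.DictReader(metrics_file))
    if not rows:
        raise ValueError("metrics file has no data rows")
    best = max(rows, key=lambda row: float(row[metric]))
    return {"epoch": int(best["epoch"]), metric: float(best[metric])}


# Invented rows using the assumed column layout.
sample = io.StringIO(
    "epoch,train_loss,val_loss,train_acc,val_acc\n"
    "1,1.60,1.45,0.41,0.48\n"
    "2,1.21,1.18,0.57,0.59\n"
    "3,1.02,1.20,0.64,0.58\n"
)
print(best_epoch(sample))
```

For a real run, open the file instead: `with open("results/metrics.csv", newline="") as f: best_epoch(f)`.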

---

## 9. Step 7: Automated Research Loop via Hermes Cron

Hermes can proactively monitor your experiments and Slack-DM you with updates — no manual polling required.

### 9.1. Create a Monitoring Cron Job

```bash
hermes cron create "every 5m" \
  --prompt "SSH to hpc, run squeue -u YOUR_USERNAME, tail the latest CNN training log,
            parse the latest epoch metrics, and report train/val accuracy and loss.
            If training is complete, summarize results and suggest next experiments."
```

### 9.2. Manage Cron Jobs

```bash
hermes cron list        # View all scheduled jobs
hermes cron pause ID    # Pause a specific job
hermes cron resume ID   # Resume a paused job
hermes cron remove ID   # Delete a job permanently
```

**The benefit in practice:** start a 24-hour training run, close your laptop, and go home. Every 5 minutes, Hermes checks the log, parses metrics, and messages you in Slack. When training finishes, you receive a summary and a suggested next step, for example:

> *"Val accuracy plateaued at 82% after epoch 12. Suggest: lower lr to 0.001 or add data augmentation."*
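
A suggestion like this rests on detecting that validation accuracy has stopped improving. One simple heuristic, shown here as an assumption rather than what Hermes does internally, compares the best of the last few epochs against the best seen before them. The names `has_plateaued`, `patience`, and `min_delta` are hypothetical.

```python
from typing import Sequence


def has_plateaued(
    val_acc: Sequence[float], patience: int = 3, min_delta: float = 0.005
) -> bool:
    """True if the last `patience` epochs did not beat the earlier best by `min_delta`."""
    if len(val_acc) <= patience:
        return False  # not enough history to judge
    best_before = max(val_acc[:-patience])
    recent_best = max(val_acc[-patience:])
    return recent_best < best_before + min_delta


# Invented accuracy curves for illustration.
flat_curve = [0.60, 0.70, 0.80, 0.82, 0.82, 0.821, 0.819]
rising_curve = [0.50, 0.60, 0.70, 0.80]
print(has_plateaued(flat_curve), has_plateaued(rising_curve))
```

A larger `patience` makes the check more tolerant of noisy epochs at the cost of reacting later.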

---

## 10. Step 8: Security Hardening

### 10.1. SLURM Best Practices

| Practice | Directive | Purpose |
|----------|-----------|---------|
| Resource limits | `--mem=32G --cpus-per-task=8 --time=24:00:00` | Prevent runaway jobs |
| Account chargeback | `--account=research_group` | Track compute costs per group |
| Node constraints | `--constraint=v100` | Pin jobs to a specific GPU type |
| Job arrays | `--array=0-9%3` | Run 10 array tasks, at most 3 concurrently |
| Node exclusion | `--exclude=node05` | Skip known problematic nodes |

### 10.2. Singularity / Apptainer Best Practices

| Practice | Flag | Purpose |
|----------|------|---------|
| Clean environment | `--cleanenv` | Prevent credential leakage from the host |
| Minimal bind mounts | `--bind /data:/data` | Mount only required directories |
| GPU passthrough | `--nv` | NVIDIA GPU access without full host privileges |
| Unprivileged containers | `--userns` (Apptainer) | User-namespace isolation |
| Image verification | `singularity verify container.sif` | Validate signatures before execution |

### 10.3. SSH Hardening

```
# ~/.ssh/config
Host hpc
    ServerAliveInterval  60
    ServerAliveCountMax  3
    # Do not expose your SSH agent to the cluster
    ForwardAgent         no
    # Always verify host keys
    StrictHostKeyChecking yes
```

---

## 11. Troubleshooting

| Symptom | Diagnostic Command | Resolution |
|---------|--------------------|------------|
| Slack bot not responding | `hermes gateway status` | `hermes gateway restart` |
| SSH connection refused | `ssh hpc "echo OK"` | Check `~/.ssh/config` and cluster DNS |
| SLURM job stuck in `PD` (pending) | `squeue -u YOUR_USERNAME --start` | Read the reason in the `NODELIST(REASON)` column; verify partition and GPU availability |
| `singularity exec` fails | Inspect job log | Ensure `--nv` flag is present for NVIDIA GPUs |
| Hermes cannot write to `/scratch` | `ssh hpc "ls -ld /scratch/YOUR_USERNAME"` | `chmod u+rwx /scratch/YOUR_USERNAME` (or `chmod 700` to keep the directory private) |
| Cron job not triggering | `hermes cron list` | Verify schedule format and enabled status |

---

## 12. Citations

```bibtex
@misc{hermes_remote_hpc_slack,
  author       = {Sanghlao, Snit},
  title        = {Hermes Remote HPC + Slack: Research Acceleration Guide},
  year         = {2026},
  howpublished = {Mahidol University AI Center},
  note         = {aicenter.mahidol.ac.th/hermes-remote-hpc-slack},
}

@software{hermes_agent,
  author       = {{Nous Research}},
  title        = {Hermes Agent},
  url          = {https://hermes-agent.nousresearch.com/docs},
  version      = {latest},
  year         = {2026},
}

@software{karpathy_autoresearch,
  author       = {Karpathy, Andrej},
  title        = {AutoResearch: Autonomous ML Research Agent},
  url          = {https://github.com/karpathy/autoresearch},
  year         = {2025},
}
```

---

*Built for researchers who refuse to compromise on productivity, privacy, and control.*

**Mahidol University AI Center** · **Hermes Agent** · **SLURM + Singularity**