Hermes Remote HPC + Slack: Research Acceleration Guide

Authors: Snit Sanghlao, Qwen, Hermes
Published by: Mahidol University AI Center
Reference: aicenter.mahidol.ac.th/hermes-remote-hpc-slack

Inspired by: Andrej Karpathy’s AutoResearch — an autonomous research agent that iterates on experiments, analysis, and reporting without manual intervention. We adapt this vision for cluster-based ML research using Hermes as the local orchestrator.

Hermes documentation: https://hermes-agent.nousresearch.com/docs

Table of Contents

Executive Summary
Prerequisites
SSH Setup
Install Hermes Locally
Configure Hermes SSH Backend
Slack Gateway Setup
SLURM Job Templates + Singularity
Simple CNN Demo
Automated Research Loop via Hermes Cron
Security Hardening
Troubleshooting
Citations

1. Executive Summary

Machine learning research on HPC clusters should not require a terminal session glued to your screen. This guide shows how to drive experiments entirely through Slack using:

Component	Role
Hermes (local machine)	AI-powered orchestrator — parses Slack commands, manages workflows, coordinates cluster jobs
SSH	Secure, key-based tunnel from your workstation to the HPC login node
SLURM + Singularity (remote cluster)	Batch scheduling and containerized GPU training environments
Slack	Natural-language interface for submitting runs, checking logs, and reviewing results

The goal is a closed-loop research workflow where you describe an idea in Slack and receive plots, metrics, and the next research direction — all without opening a terminal.

The Research Benefit Loop

IDEA (Slack) ──► HERMES (Local) ──► SSH ──► SLURM (Cluster) ──► GPU TRAINING
                                                                       │
RESULTS (Logs/Plots) ◄── SLURM ◄── HERMES ◄──────── ┌─────────────┐  │
        │                                             │   ITERATE   │◄─┘
SLACK REPORT + NEXT IDEA ◄───────────────────────────└─────────────┘

You type an idea in Slack → Hermes translates it into a workflow → the cluster trains → results flow back → Hermes summarizes and suggests the next step → repeat.

The Convergence Principle

The loop above is an instance of a general research iteration model:

Ideal-Starter(Human | LLM)  −  Validator(any abstraction level)  =  Residual
                                                                         │
                                              Auto Research Loop ◄───────┘
                                                      │
                                              Residual Converges?
                                                 Yes  │  No
                                                      │   └──► iterate
                                                   STOP

Term	Meaning
Ideal-Starter	The initial hypothesis or experiment design — provided by a human, generated by an LLM, or both
Validator	Any evaluation signal: loss curves, accuracy metrics, statistical tests, human review, or a critic model — operating at any abstraction level
Residual	The gap between the current result and the ideal — what remains to be explained or improved
Converge	Residual falls below an acceptable threshold (e.g., validation loss plateaus, accuracy target met, human approves result)

The auto research loop driven by Hermes operationalizes this model: each iteration reduces the residual until convergence, at which point the loop stops and the final result is reported.
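The iteration model above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Hermes implements its loop: `run_until_converged`, `toy_propose`, and `toy_validate` are hypothetical names, and the toy validator simply treats the distance from an arbitrary learning rate as the residual.

```python
from typing import Callable, List, Tuple


def run_until_converged(
    propose: Callable[[int, List[float]], dict],
    validate: Callable[[dict], float],
    threshold: float,
    max_iters: int = 10,
) -> Tuple[dict, List[float]]:
    """Iterate propose -> validate until the residual drops to the threshold.

    propose(iteration, residual_history) returns the next experiment config.
    validate(config) returns the residual (gap to the ideal; lower is better).
    """
    residuals: List[float] = []
    best_config: dict = {}
    best_residual = float("inf")
    for iteration in range(max_iters):
        config = propose(iteration, residuals)
        residual = validate(config)
        residuals.append(residual)
        if residual < best_residual:
            best_config, best_residual = config, residual
        if residual <= threshold:
            break  # converged: stop and report
    return best_config, residuals


# Toy stand-ins (assumptions for illustration only): each iteration halves
# the learning rate, and the residual is the distance from lr = 0.01.
def toy_propose(iteration: int, history: List[float]) -> dict:
    return {"lr": 0.08 / (2 ** iteration)}


def toy_validate(config: dict) -> float:
    return abs(config["lr"] - 0.01)


best, history = run_until_converged(toy_propose, toy_validate, threshold=1e-6)
print(best, len(history))
```

The `max_iters` cap matters in practice: a residual that never converges should still end the loop, so that an unattended run cannot consume cluster time indefinitely.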

2. Prerequisites

Both the local workstation (where Hermes runs) and the remote HPC cluster must meet the requirements listed below.

Requirement	Local Workstation	Remote Cluster (HPC)
OS	Ubuntu 20.04+ / macOS 12+ / WSL2	Linux (CentOS 7+, Rocky 8+, Ubuntu 20.04+)
Python	3.10+ (for Hermes + helpers)	3.9+ (inside Singularity containers)
Node.js	18+ (for Slack Bolt SDK)	Not required
SSH Key	Yes — public key installed on cluster	Yes — `authorized_keys` on login node
SLURM	No	Yes — `sbatch`, `squeue`, `scancel`
Singularity/Apptainer	No	Yes — containerized ML environments
Hermes	Yes — installed locally	No (runs locally only)

Note: If your cluster uses a different scheduler (PBS/Torque, LSF), adapt the SLURM commands accordingly. The architecture remains the same.

3. Step 1: SSH Setup

Hermes communicates with the HPC cluster over SSH. A password-less, key-based configuration is essential for fully automated workflows.

3.1. Generate an SSH Key

ssh-keygen -t ed25519 -C "your_email@example.com"
# Private key: ~/.ssh/id_ed25519      (keep secret)
# Public key:  ~/.ssh/id_ed25519.pub  (copy to cluster)

3.2. Install the Public Key on the Cluster

# Copy the public key to the cluster
ssh-copy-id YOUR_USERNAME@YOUR_CLUSTER_ADDRESS

# Alternatively, paste the contents of id_ed25519.pub into
# ~/.ssh/authorized_keys on the cluster login node.

3.3. Configure `~/.ssh/config`

Add a named host entry. Hermes will reference this alias throughout the workflow.

Host hpc
    HostName YOUR_CLUSTER_ADDRESS
    User     YOUR_USERNAME
    IdentityFile ~/.ssh/id_ed25519
    Port     22
    # Agent forwarding is not needed for this workflow. Leave it disabled
    # unless a later hop (e.g., a jump host) must use your local agent.
    # ForwardAgent yes

Replace YOUR_CLUSTER_ADDRESS and YOUR_USERNAME with your actual values.

3.4. Test the Connection

ssh hpc
whoami        # Expected: YOUR_USERNAME
exit

If you reach the cluster without a password prompt, the SSH tunnel is ready.
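For automation it helps to run the same check non-interactively. The sketch below is one way to do that from Python, assuming the `hpc` alias from the previous step. `BatchMode=yes` and `ConnectTimeout` are standard OpenSSH client options; the function names are hypothetical helpers, not part of Hermes.

```python
import subprocess
from typing import List


def build_ssh_check_cmd(alias: str, timeout_s: int = 10) -> List[str]:
    """Build a non-interactive SSH probe for the given host alias."""
    return [
        "ssh",
        "-o", "BatchMode=yes",               # fail instead of prompting for a password
        "-o", f"ConnectTimeout={timeout_s}",  # give up quickly if the host is unreachable
        alias,
        "whoami",
    ]


def probe_succeeded(returncode: int, stdout: str) -> bool:
    """A probe passes when ssh exits 0 and whoami printed a user name."""
    return returncode == 0 and stdout.strip() != ""


def ssh_is_passwordless(alias: str) -> bool:
    """Run the probe; True means key-based login works without any prompt."""
    result = subprocess.run(
        build_ssh_check_cmd(alias), capture_output=True, text=True
    )
    return probe_succeeded(result.returncode, result.stdout)
```

Calling `ssh_is_passwordless("hpc")` returns `False` instead of hanging on a password prompt, which is the behavior an unattended agent needs.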

4. Step 2: Install Hermes Locally

Hermes is the local AI agent that bridges Slack and your HPC cluster. It receives Slack commands, translates them into terminal / SLURM / file operations, and returns results. Think of it as an always-on research assistant running on your machine.

4.1. Install Hermes

curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash

This installs the Hermes CLI and supporting tools into ~/.hermes.

4.2. Verify the Installation

hermes doctor

All health-check items should pass:

✅ Python .......... 3.11.x  (OK)
✅ Node.js ......... 18.x    (OK)
✅ Config folder ... ~/.hermes (OK)
✅ Disk space ...... 45 GB free (OK)

Resolve any flagged issues before proceeding. See the Hermes documentation for troubleshooting guidance.

4.3. Hermes Capabilities at a Glance

Capability	Description
Command execution	Runs shell commands locally and over SSH
File management	Reads, writes, and diffs files on local and remote systems
Process orchestration	Manages long-running background jobs
Slack integration	Bolt SDK app that maps Slack messages to Hermes actions
Self-reasoning	Breaks complex tasks into steps, executes, and verifies outcomes

5. Step 3: Configure Hermes SSH Backend

Hermes must know how to reach the cluster. Rather than hardcoding hostnames in scripts, you set a single backend directive that points to the SSH alias defined in ~/.ssh/config.

5.1. Set the Backend and Host

# Use SSH as the terminal backend
hermes config set terminal.backend ssh

# Point to the SSH alias defined in ~/.ssh/config
hermes config set terminal.host hpc

The hpc alias must match a Host hpc stanza in your SSH config. Hermes resolves the hostname, user, and identity file automatically.

5.2. Set the Remote Working Directory

hermes config set terminal.remote_dir /scratch/YOUR_USERNAME

Use /scratch (not /home) for large datasets and model checkpoints to avoid quota issues.

5.3. Verify the Config File

# ~/.hermes/config.yaml
terminal:
  backend: ssh
  host: hpc
  remote_dir: /scratch/YOUR_USERNAME

5.4. Run a Connectivity Test

hermes chat -q "whoami; hostname; sinfo"

Expected output:

┌─ output ────────────────────────────┐
│ YOUR_USERNAME                       │
│ login-node-01.cluster.edu           │
│ PARTITION      AVAIL  TIMELIMIT     │
│ gpu_normal      up    7-00:00       │
│ gpu_dev         up    1-00:00       │
└─────────────────────────────────────┘

Troubleshooting: If the test hangs, verify password-less login with plain ssh hpc first. Hermes reuses your SSH key — it does not manage authentication itself.

6. Step 4: Slack Gateway Setup

The Slack gateway turns your workspace into a natural-language control panel for Hermes. This section covers creating a Slack app, wiring it to Hermes, and running the gateway as a system service.

6.1. Create a Slack App

1. Go to https://api.slack.com/apps and click Create New App → From scratch.
2. Name the app (e.g., hermes-research-bot) and select your workspace.
3. Under Features → OAuth & Permissions → Scopes → Bot Token Scopes, add the following OAuth scopes:
   - channels:history, channels:read, channels:write
   - chat:write, chat:write.customize
   - groups:history, im:history, im:write
   - files:read, files:write
   - mpim:history, users:read
4. Still under OAuth & Permissions, install the app to your workspace and copy the Bot User OAuth Token (begins with xoxb-).

These scopes allow the bot to read channel history, send formatted messages and file attachments, and inspect user information. Request only the scopes your workflow requires.

6.2. Generate an App-Level Token

1. Go to Settings → Basic Information → App-Level Tokens.
2. Create a token with the connections:write scope (required for Socket Mode).
3. Copy the token (begins with xapp-).

6.3. Configure Hermes Credentials

Run the interactive setup wizard:

hermes gateway setup
# [?] Select gateway type:  → Slack
# [?] Enable Socket Mode?:  → Yes

Store credentials in ~/.hermes/.env — never in config.yaml:

# ~/.hermes/.env — never commit to version control
SLACK_BOT_TOKEN=xoxb-YOUR_SLACK_BOT_TOKEN
SLACK_APP_TOKEN=xapp-YOUR_SLACK_APP_TOKEN

Reference the environment variables in config:

hermes config set gateway.slack.token '${SLACK_BOT_TOKEN}'
hermes config set gateway.slack.app_token '${SLACK_APP_TOKEN}'

Your ~/.hermes/config.yaml should show variable references, not raw secrets:

gateway:
  type: slack
  socket_mode: true
  slack:
    token: ${SLACK_BOT_TOKEN}
    app_token: ${SLACK_APP_TOKEN}

Security note: Hermes loads ~/.hermes/.env automatically at startup. If you paste a raw xoxb- token directly into config.yaml, hermes doctor will warn you.
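If you want an independent check before committing or sharing a config file, a small scan for raw tokens is easy to write. The sketch below is a hypothetical helper, not a Hermes feature; it assumes only the `xoxb-` and `xapp-` prefixes mentioned above, and the sample strings are made up.

```python
import re
from typing import List

# Assumed token shapes: bot tokens start with "xoxb-" and app-level tokens
# with "xapp-", followed by a run of letters, digits, and hyphens.
RAW_TOKEN = re.compile(r"\b(?:xoxb|xapp)-[A-Za-z0-9-]{10,}")


def find_raw_tokens(config_text: str) -> List[int]:
    """Return 1-based line numbers that appear to contain a raw Slack token."""
    return [
        lineno
        for lineno, line in enumerate(config_text.splitlines(), start=1)
        if RAW_TOKEN.search(line)
    ]


# Made-up examples: a variable reference is fine, a pasted token is flagged.
safe = "gateway:\n  slack:\n    token: ${SLACK_BOT_TOKEN}\n"
leaky = "gateway:\n  slack:\n    token: xoxb-EXAMPLE-not-a-real-token\n"
print(find_raw_tokens(safe))   # []
print(find_raw_tokens(leaky))  # [3]
```

Run it against the contents of `~/.hermes/config.yaml`; an empty list means no line matched the assumed token patterns.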

6.4. Install and Start the Gateway

hermes gateway install   # Register as a systemd service (runs as your user)
hermes gateway start     # Start the service
hermes gateway status    # Verify it is running

Expected status output:

✅ Service active: hermes-gateway.service
✅ Socket Mode: connected
✅ Listening on workspace: your-team.slack.com

6.5. Send Your First Command

Open Slack and message the bot directly:

you  →  @hermes-research-bot: whoami; hostname; sinfo
bot  ←  [cluster output formatted as a table]

Or mention it in a channel:

you  →  #research-gpu: @hermes-research-bot show me active jobs on gpu_normal
bot  ←  [squeue output formatted as a table]

Hermes parses natural language, routes the request through SSH, executes it on the cluster, and returns formatted results — entirely within Slack.

7. Step 5: SLURM Job Templates + Singularity

Hermes submits training jobs to the cluster via SLURM sbatch. The standard workflow is:

1. Hermes writes a SLURM script on your local machine.
2. Hermes copies the script to the cluster with scp.
3. Hermes runs sbatch over SSH.
4. Hermes monitors the job status and tails the logs in real time.

7.1. SLURM Script Template

Save the following as cnn_train.slurm:

#!/bin/bash
#SBATCH --job-name=cnn_cifar10
#SBATCH --partition=gpu_normal
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --output=/scratch/%u/cnn_%j.out
#SBATCH --error=/scratch/%u/cnn_%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=YOUR_EMAIL

# Load required modules
module load singularity

# Path to the Singularity container
CONTAINER=/opt/containers/pytorch_2.0.sif

# Run training inside the container.
# /scratch/$USER on the host is mounted at /data, so the training script
# is expected at /scratch/$USER/train_cnn.py.
singularity exec \
  --nv \
  --bind "/scratch/$USER:/data" \
  "$CONTAINER" \
  python /data/train_cnn.py \
    --epochs      50  \
    --batch_size  128 \
    --lr          0.01 \
    --data_dir    /data/cifar10 \
    --output_dir  /data/cnn_results

7.2. Submit via Hermes

# Hermes handles this automatically; the equivalent manual steps are:
scp cnn_train.slurm hpc:/scratch/YOUR_USERNAME/
ssh hpc "sbatch /scratch/YOUR_USERNAME/cnn_train.slurm"
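When you run many variants, generating the script from parameters avoids hand-editing. The following is a minimal sketch under stated assumptions: the template is a trimmed version of the one in 7.1, `render_slurm_script` and `parse_job_id` are hypothetical helpers, and the per-user bind path is an assumption about your scratch layout. The only external fact relied on is that `sbatch` prints `Submitted batch job <id>` on success.

```python
import re
from typing import Optional

# Trimmed-down template; add the remaining #SBATCH lines from 7.1 as needed.
SLURM_TEMPLATE = """#!/bin/bash
#SBATCH --job-name={job_name}
#SBATCH --partition={partition}
#SBATCH --gres=gpu:{gpus}
#SBATCH --time={time_limit}
#SBATCH --output=/scratch/%u/{job_name}_%j.out

module load singularity
singularity exec --nv --bind /scratch/$USER:/data {container} \\
  python {script} --epochs {epochs} --lr {lr}
"""


def render_slurm_script(
    job_name: str,
    partition: str = "gpu_normal",
    gpus: int = 1,
    time_limit: str = "24:00:00",
    container: str = "/opt/containers/pytorch_2.0.sif",
    script: str = "/data/train_cnn.py",
    epochs: int = 50,
    lr: float = 0.01,
) -> str:
    """Fill the template; every argument maps to exactly one placeholder."""
    return SLURM_TEMPLATE.format(
        job_name=job_name, partition=partition, gpus=gpus,
        time_limit=time_limit, container=container, script=script,
        epochs=epochs, lr=lr,
    )


def parse_job_id(sbatch_stdout: str) -> Optional[int]:
    """Extract the job ID from sbatch output, or None if submission failed."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    return int(match.group(1)) if match else None


print(render_slurm_script("cnn_lr01", lr=0.1))
```

Keeping the job ID returned by `parse_job_id` lets later steps target exactly that job's log file (`<job_name>_<id>.out`) rather than guessing the most recent one.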

7.3. Monitor via Hermes

# Check job status
# (%i job ID, %P partition, %j name, %T state, %R reason or node list)
ssh hpc "squeue -u YOUR_USERNAME --format='%i %P %j %T %R'"

# Stream live output
ssh hpc "tail -f /scratch/YOUR_USERNAME/cnn_JOB_ID.out"
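To turn `squeue` output into something a script can act on, one option is to request a delimiter-separated format and parse it. The sketch below assumes you call `squeue -h -o "%i|%P|%j|%T|%R"` (`-h` suppresses the header); the `parse_squeue` helper and the sample rows are hypothetical.

```python
from typing import Dict, List

# Format string to pass to `squeue -h -o ...`:
# job ID | partition | job name | state | reason or node list
SQUEUE_FORMAT = "%i|%P|%j|%T|%R"
FIELDS = ("job_id", "partition", "name", "state", "reason")


def parse_squeue(output: str) -> List[Dict[str, str]]:
    """Parse header-less, pipe-delimited squeue output into dictionaries."""
    jobs = []
    for line in output.splitlines():
        parts = line.strip().split("|", len(FIELDS) - 1)
        if len(parts) == len(FIELDS):  # skip blank or malformed lines
            jobs.append(dict(zip(FIELDS, parts)))
    return jobs


# Made-up sample output for illustration.
sample = (
    "4211|gpu_normal|cnn_cifar10|RUNNING|gpu-node-03\n"
    "4212|gpu_normal|cnn_cifar10|PENDING|(Resources)\n"
)
for job in parse_squeue(sample):
    print(job["job_id"], job["state"], job["reason"])
```

A pipe delimiter is used because the default space-separated layout becomes ambiguous when a reason such as `(launch failed requeued held)` contains spaces.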

8. Step 6: Simple CNN Demo

A ready-to-use PyTorch CNN script (train_cnn.py) is included in this repository. It:

- Trains a 4-layer CNN on CIFAR-10
- Accepts --epochs, --batch_size, --lr, --data_dir, --output_dir arguments
- Saves metrics.csv (per-epoch train/val loss and accuracy)
- Saves a best_model.pth checkpoint
- Writes summary.json with the best validation accuracy

# Quick local test (CPU, small dataset)
python train_cnn.py \
  --epochs 3 --batch_size 64 --lr 0.001 \
  --data_dir ./data --output_dir ./results

# Submit to cluster via SLURM + Singularity
# (assumes cnn_train.slurm was copied to /scratch/YOUR_USERNAME/ as in 7.2)
ssh hpc "sbatch /scratch/YOUR_USERNAME/cnn_train.slurm"

See train_cnn.py in this directory for the full source code.
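Once a run has produced metrics.csv, picking out the best epoch takes only the standard library. The sketch below is illustrative: the column names (`epoch`, `train_loss`, `val_loss`, `train_acc`, `val_acc`) are assumptions about the file layout, so check them against the header that train_cnn.py actually writes. The sample rows are invented.

```python
import csv
import io
from typing import Dict, TextIO


def best_epoch(metrics_file: TextIO, metric: str = "val_acc") -> Dict[str, float]:
    """Return the row with the highest value of `metric`, as floats."""
    rows = list(csv.DictReader(metrics_file))
    if not rows:
        raise ValueError("metrics file has no rows")
    best = max(rows, key=lambda row: float(row[metric]))
    return {key: float(value) for key, value in best.items()}


# Invented sample data; a real file would come from the training run.
sample = io.StringIO(
    "epoch,train_loss,val_loss,train_acc,val_acc\n"
    "1,1.62,1.41,0.41,0.49\n"
    "2,1.20,1.15,0.57,0.60\n"
    "3,0.98,1.17,0.65,0.59\n"
)
print(best_epoch(sample))
```

With a real file, pass an open handle instead: `with open("results/metrics.csv", newline="") as f: best_epoch(f)`.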

9. Step 7: Automated Research Loop via Hermes Cron

Hermes can proactively monitor your experiments and Slack-DM you with updates — no manual polling required.

9.1. Create a Monitoring Cron Job

hermes cron create "every 5m" \
  --prompt "SSH to hpc, run squeue -u YOUR_USERNAME, tail the latest CNN training log,
            parse the latest epoch metrics, and report train/val accuracy and loss.
            If training is complete, summarize results and suggest next experiments."

9.2. Manage Cron Jobs

hermes cron list        # View all scheduled jobs
hermes cron pause ID    # Pause a specific job
hermes cron resume ID   # Resume a paused job
hermes cron remove ID   # Delete a job permanently

The benefit in practice: start a 24-hour training run, close your laptop, and go home. Every 5 minutes, Hermes checks the log, parses metrics, and messages you in Slack. When training finishes, you receive a summary and a suggested next step, for example:

“Val accuracy plateaued at 82% after epoch 12. Suggest: increase lr to 0.01 or add data augmentation.”
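A plateau like the one in that message can also be detected with a simple rule rather than left entirely to the model's judgment. The sketch below shows one common heuristic (patience-based early stopping). It is an assumption about how you might define a plateau, not the rule Hermes applies, and `has_plateaued` is a hypothetical helper.

```python
from typing import Sequence


def has_plateaued(
    val_acc: Sequence[float], patience: int = 3, min_delta: float = 0.001
) -> bool:
    """True if the last `patience` epochs failed to beat the earlier best.

    An improvement only counts if it exceeds the earlier best by min_delta.
    """
    if len(val_acc) <= patience:
        return False  # not enough history to judge
    best_before = max(val_acc[:-patience])
    recent_best = max(val_acc[-patience:])
    return recent_best < best_before + min_delta


# Invented accuracy curves for illustration.
print(has_plateaued([0.60, 0.75, 0.82, 0.82, 0.819, 0.82]))  # True
print(has_plateaued([0.50, 0.60, 0.70, 0.80, 0.90]))          # False
```

Feeding the per-epoch validation accuracy from metrics.csv into a check like this gives the monitoring prompt a concrete, reproducible signal to report.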

10. Step 8: Security Hardening

10.1. SLURM Best Practices

Practice	Directive	Purpose
Resource limits	`--mem=32G --cpus-per-task=8 --time=24:00:00`	Prevent runaway jobs
Account chargeback	`--account=research_group`	Track compute costs per group
Node constraints	`--constraint=v100`	Pin jobs to a specific GPU type
Job arrays	`--array=0-9%3`	Run 10 sweeps with max 3 concurrent
Node exclusion	`--exclude=node05`	Skip known problematic nodes
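For the job-array row, each array task needs to know which configuration it owns. SLURM exposes the task index through the `SLURM_ARRAY_TASK_ID` environment variable; the sketch below maps that index onto a hyperparameter grid. The grid values and the `config_for_task` helper are hypothetical, chosen so that the grid has exactly 10 entries to match `--array=0-9`.

```python
import itertools
import os
from typing import Dict, List

# Hypothetical sweep: 5 learning rates x 2 batch sizes = 10 configurations.
GRID: Dict[str, List] = {
    "lr": [0.1, 0.03, 0.01, 0.003, 0.001],
    "batch_size": [64, 128],
}


def config_for_task(task_id: int, grid: Dict[str, List] = GRID) -> Dict[str, object]:
    """Map an array task index to one point of the Cartesian grid."""
    keys = list(grid)
    combos = list(itertools.product(*(grid[key] for key in keys)))
    if not 0 <= task_id < len(combos):
        raise IndexError(f"task_id {task_id} is outside 0..{len(combos) - 1}")
    return dict(zip(keys, combos[task_id]))


# SLURM sets SLURM_ARRAY_TASK_ID inside each array task; default to 0 locally.
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
print(config_for_task(task_id))
```

Because the mapping is deterministic, any task can be rerun on its own later by submitting with `--array=<index>`.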

10.2. Singularity / Apptainer Best Practices

Practice	Flag	Purpose
Clean environment	`--cleanenv`	Prevent credential leakage from the host
Minimal bind mounts	`--bind /data:/data`	Mount only required directories
GPU passthrough	`--nv`	NVIDIA GPU access without full host privileges
Unprivileged containers	`--userns` (Apptainer)	User-namespace isolation
Image verification	`singularity verify container.sif`	Validate signatures before execution
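One way to make these flags the default rather than something to remember is to build the command in code. The sketch below assembles a `singularity exec` argument list with `--cleanenv`, an optional `--nv`, and explicit bind mounts. `build_singularity_cmd` is a hypothetical helper and the paths are placeholders.

```python
import shlex
from typing import Dict, List


def build_singularity_cmd(
    image: str,
    command: List[str],
    binds: Dict[str, str],
    gpu: bool = True,
) -> List[str]:
    """Assemble `singularity exec` with a clean environment and explicit binds."""
    cmd = ["singularity", "exec", "--cleanenv"]
    if gpu:
        cmd.append("--nv")  # NVIDIA GPU passthrough
    for host_path, container_path in binds.items():
        cmd += ["--bind", f"{host_path}:{container_path}"]
    return cmd + [image] + command


cmd = build_singularity_cmd(
    image="/opt/containers/pytorch_2.0.sif",
    command=["python", "/data/train_cnn.py", "--epochs", "50"],
    binds={"/scratch/YOUR_USERNAME": "/data"},  # mount only what the job needs
)
print(shlex.join(cmd))
```

Returning a list rather than a single string keeps each argument separate, so paths containing spaces do not need manual quoting until the final `shlex.join`.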

10.3. SSH Hardening

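# ~/.ssh/config
# Comments are kept on their own lines because not every OpenSSH
# version accepts trailing comments after an option value.
Host hpc
    ServerAliveInterval  60
    ServerAliveCountMax  3
    # Do not expose your SSH agent to the cluster
    ForwardAgent         no
    # Always verify host keys
    StrictHostKeyChecking yes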
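To confirm that these settings are what the client will actually use, you can inspect the effective configuration that `ssh -G hpc` prints (one lowercase `option value` pair per line). The sketch below parses that output and flags deviations. The helper names are hypothetical, and accepting both `yes` and `true` is a deliberate hedge, since the exact spelling printed for boolean options may vary between OpenSSH versions.

```python
from typing import Dict, List


def parse_ssh_g(output: str) -> Dict[str, str]:
    """Parse `ssh -G <host>` output into {option: value} (first value wins)."""
    options: Dict[str, str] = {}
    for line in output.splitlines():
        key, _, value = line.strip().partition(" ")
        if key:
            options.setdefault(key.lower(), value.strip())
    return options


def hardening_issues(options: Dict[str, str]) -> List[str]:
    """List settings that deviate from the hardened values shown above."""
    issues = []
    if options.get("forwardagent", "no") not in ("no", "false"):
        issues.append("ForwardAgent should be no")
    if options.get("stricthostkeychecking") not in ("yes", "true"):
        issues.append("StrictHostKeyChecking should be yes")
    return issues


# Made-up excerpt of `ssh -G hpc` output for illustration.
sample = "hostname login.example.edu\nforwardagent no\nstricthostkeychecking ask\n"
print(hardening_issues(parse_ssh_g(sample)))
```

In practice, capture the real output with `subprocess.run(["ssh", "-G", "hpc"], capture_output=True, text=True).stdout` and pass it to `parse_ssh_g`.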

11. Troubleshooting

Symptom	Diagnostic Command	Resolution
Slack bot not responding	`hermes gateway status`	`hermes gateway restart`
SSH connection refused	`ssh hpc "echo OK"`	Check `~/.ssh/config` and cluster DNS
SLURM job stuck in PD	`squeue -u YOUR_USERNAME --start`	Read the reason column (e.g., Resources, Priority), then verify partition and GPU availability
`singularity exec` fails	Inspect job log	Ensure `--nv` flag is present for NVIDIA GPUs
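Hermes cannot write to `/scratch`	`ssh hpc "ls -ld /scratch/YOUR_USERNAME"`	Confirm you own the directory, then run `chmod u+rwx /scratch/YOUR_USERNAME`; ask the cluster admins to create it if it is missing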
Cron job not triggering	`hermes cron list`	Verify schedule format and enabled status

12. Citations

@misc{hermes_remote_hpc_slack,
  author       = {Sanghlao, Snit},
  title        = {Hermes Remote HPC + Slack: Research Acceleration Guide},
  year         = {2026},
  howpublished = {Mahidol University AI Center},
  note         = {aicenter.mahidol.ac.th/hermes-remote-hpc-slack},
}

@software{hermes_agent,
  author       = {{Nous Research}},
  title        = {Hermes Agent},
  url          = {https://hermes-agent.nousresearch.com/docs},
  version      = {latest},
  year         = {2026},
}

@software{karpathy_autoresearch,
  author       = {Karpathy, Andrej},
  title        = {AutoResearch: Autonomous ML Research Agent},
  url          = {https://github.com/karpathy/autoresearch},
  year         = {2025},
}

Built for researchers who refuse to compromise on productivity, privacy, and control.
