Hermes Remote HPC + Slack: Research Acceleration Guide

Authors: Snit Sanghlao, Qwen, Hermes
Published by: Mahidol University AI Center
Reference: aicenter.mahidol.ac.th/hermes-remote-hpc-slack

Inspired by: Andrej Karpathy’s AutoResearch, an autonomous research agent that iterates on experiments, analysis, and reporting without manual intervention. We adapt this vision for cluster-based ML research using Hermes as the local orchestrator.

Hermes documentation: https://hermes-agent.nousresearch.com/docs


Table of Contents

  1. Executive Summary

  2. Prerequisites

  3. SSH Setup

  4. Install Hermes Locally

  5. Configure Hermes SSH Backend

  6. Slack Gateway Setup

  7. SLURM Job Templates + Singularity

  8. Simple CNN Demo

  9. Automated Research Loop via Hermes Cron

  10. Security Hardening

  11. Troubleshooting

  12. Citations


1. Executive Summary

Machine learning research on HPC clusters should not require a terminal session glued to your screen. This guide shows how to drive experiments entirely through Slack using:

| Component | Role |
| --- | --- |
| Hermes (local machine) | AI-powered orchestrator: parses Slack commands, manages workflows, coordinates cluster jobs |
| SSH | Secure, key-based tunnel from your workstation to the HPC login node |
| SLURM + Singularity (remote cluster) | Batch scheduling and containerized GPU training environments |
| Slack | Natural-language interface for submitting runs, checking logs, and reviewing results |

The goal is a closed-loop research workflow where you describe an idea in Slack and receive plots, metrics, and the next research direction, all without opening a terminal.

The Research Benefit Loop

IDEA (Slack) ──► HERMES (Local) ──► SSH ──► SLURM (Cluster) ──► GPU TRAINING
                                                                       │
RESULTS (Logs/Plots) ◄── SLURM ◄── HERMES ◄──────── ┌─────────────┐    │
        │                                           │   ITERATE   │◄───┘
SLACK REPORT + NEXT IDEA ◄──────────────────────────└─────────────┘

You type an idea in Slack → Hermes translates it into a workflow → the cluster trains → results flow back → Hermes summarizes and suggests the next step → repeat.

The Convergence Principle

The loop above is an instance of a general research iteration model:

Ideal-Starter(Human | LLM)  −  Validator(any abstraction level)  =  Residual
                                                                         │
                                              Auto Research Loop ◄───────┘
                                                      │
                                              Residual Converges?
                                                 Yes  │  No
                                                      │   └──► iterate
                                                   STOP

| Term | Meaning |
| --- | --- |
| Ideal-Starter | The initial hypothesis or experiment design, provided by a human, generated by an LLM, or both |
| Validator | Any evaluation signal: loss curves, accuracy metrics, statistical tests, human review, or a critic model, operating at any abstraction level |
| Residual | The gap between the current result and the ideal: what remains to be explained or improved |
| Converge | Residual falls below an acceptable threshold (e.g., validation loss plateaus, accuracy target met, human approves result) |

The auto research loop driven by Hermes operationalizes this model: each iteration reduces the residual until convergence, at which point the loop stops and the final result is reported.
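The iteration model above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how Hermes implements its loop: `research_loop`, `propose`, and `validate` are hypothetical names, and the residual is assumed to be a single non-negative number (for example, validation loss or distance from a target metric).

```python
from typing import Callable, List, Tuple


def research_loop(
    propose: Callable[[List[float]], float],
    validate: Callable[[float], float],
    threshold: float,
    max_iters: int = 20,
) -> Tuple[float, List[float]]:
    """Iterate until the residual drops below `threshold` (hypothetical sketch).

    propose(history)    -> next candidate (the Ideal-Starter step)
    validate(candidate) -> residual, the gap to the ideal (the Validator step)
    """
    history: List[float] = []
    candidate = propose(history)
    for _ in range(max_iters):
        residual = validate(candidate)
        history.append(residual)
        if residual < threshold:  # converged: stop and report
            break
        candidate = propose(history)  # not converged: iterate
    return candidate, history


# Toy example: the "experiment" is a learning rate and the residual is its
# distance from an ideal value of 0.01 that the proposer approaches step by step.
def halve_towards_target(history: List[float]) -> float:
    return 0.01 + 0.08 / (2 ** len(history))


best, residuals = research_loop(
    propose=halve_towards_target,
    validate=lambda lr: abs(lr - 0.01),
    threshold=0.003,
)
print(f"stopped after {len(residuals)} iterations, residual={residuals[-1]:.4f}")
```

The `max_iters` cap matters in practice: a residual that never converges should still end the loop rather than consume cluster time indefinitely.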


2. Prerequisites

Both the local workstation (where Hermes runs) and the remote HPC cluster must meet the requirements listed below.

| Requirement | Home Machine (Local) | Remote Cluster (HPC) |
| --- | --- | --- |
| OS | Ubuntu 20.04+ / macOS 12+ / WSL2 | Linux (CentOS 7+, Rocky 8+, Ubuntu 20+) |
| Python | 3.10+ (for Hermes + helpers) | 3.9+ (inside Singularity containers) |
| Node.js | 18+ (for Slack Bolt SDK) | Not required |
| SSH Key | Yes (public key installed on cluster) | Yes (authorized_keys on login node) |
| SLURM | No | Yes (sbatch, squeue, scancel) |
| Singularity/Apptainer | No | Yes (containerized ML environments) |
| Hermes | Yes (installed locally) | No (runs locally only) |

Note: If your cluster uses a different scheduler (PBS/Torque, LSF), adapt the SLURM commands accordingly. The architecture remains the same.


3. Step 1: SSH Setup

Hermes communicates with the HPC cluster over SSH. A password-less, key-based configuration is essential for fully automated workflows.

3.1. Generate an SSH Key

ssh-keygen -t ed25519 -C "your_email@example.com"
# Private key: ~/.ssh/id_ed25519      (keep secret)
# Public key:  ~/.ssh/id_ed25519.pub  (copy to cluster)

3.2. Install the Public Key on the Cluster

# Copy the public key to the cluster
ssh-copy-id YOUR_USERNAME@YOUR_CLUSTER_ADDRESS

# Alternatively, paste the contents of id_ed25519.pub into
# ~/.ssh/authorized_keys on the cluster login node.

3.3. Configure ~/.ssh/config

Add a named host entry. Hermes will reference this alias throughout the workflow.

Host hpc
    HostName YOUR_CLUSTER_ADDRESS
    User     YOUR_USERNAME
    IdentityFile ~/.ssh/id_ed25519
    Port     22
    # Agent forwarding is not needed to reach the cluster; keep it off unless
    # you must hop from the login node to further hosts.
    ForwardAgent no

Replace YOUR_CLUSTER_ADDRESS and YOUR_USERNAME with your actual values.

3.4. Test the Connection

ssh hpc
whoami        # Expected: YOUR_USERNAME
exit

If you reach the cluster without a password prompt, the SSH tunnel is ready.
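For automation, the same check can be scripted so that it fails fast instead of waiting at a password prompt. The sketch below is one possible approach, not part of Hermes: the helper names are hypothetical, while `BatchMode` and `ConnectTimeout` are standard OpenSSH client options.

```python
import subprocess
from typing import List


def ssh_check_command(alias: str = "hpc", timeout_s: int = 10) -> List[str]:
    """Build an ssh command that never prompts for a password (hypothetical helper)."""
    return [
        "ssh",
        "-o", "BatchMode=yes",               # fail instead of prompting
        "-o", f"ConnectTimeout={timeout_s}",  # give up on unreachable hosts
        alias,
        "whoami",
    ]


def ssh_is_passwordless(alias: str = "hpc") -> bool:
    """Return True if `ssh <alias> whoami` succeeds with no interaction."""
    try:
        result = subprocess.run(
            ssh_check_command(alias), capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

Calling `ssh_is_passwordless("hpc")` before starting the gateway gives a clear yes/no answer on whether key-based login works for the alias.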


4. Step 2: Install Hermes Locally

Hermes is the local AI agent that bridges Slack and your HPC cluster. It receives Slack commands, translates them into terminal / SLURM / file operations, and returns results. Think of it as an always-on research assistant running on your machine.

4.1. Install Hermes

curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash

This installs the Hermes CLI and supporting tools into ~/.hermes.

4.2. Verify the Installation

hermes doctor

All health-check items should pass:

✅ Python .......... 3.11.x  (OK)
✅ Node.js ......... 18.x    (OK)
✅ Config folder ... ~/.hermes (OK)
✅ Disk space ...... 45 GB free (OK)

Resolve any flagged issues before proceeding. See the Hermes documentation for troubleshooting guidance.

4.3. Hermes Capabilities at a Glance

| Capability | Description |
| --- | --- |
| Command execution | Runs shell commands locally and over SSH |
| File management | Reads, writes, and diffs files on local and remote systems |
| Process orchestration | Manages long-running background jobs |
| Slack integration | Bolt SDK app that maps Slack messages to Hermes actions |
| Self-reasoning | Breaks complex tasks into steps, executes, and verifies outcomes |


5. Step 3: Configure Hermes SSH Backend

Hermes must know how to reach the cluster. Rather than hardcoding hostnames in scripts, you set a single backend directive that points to the SSH alias defined in ~/.ssh/config.

5.1. Set the Backend and Host

# Use SSH as the terminal backend
hermes config set terminal.backend ssh

# Point to the SSH alias defined in ~/.ssh/config
hermes config set terminal.host hpc

The hpc alias must match a Host hpc stanza in your SSH config. Hermes resolves the hostname, user, and identity file automatically.

5.2. Set the Remote Working Directory

hermes config set terminal.remote_dir /scratch/YOUR_USERNAME

Use /scratch (not /home) for large datasets and model checkpoints to avoid quota issues.

5.3. Verify the Config File

# ~/.hermes/config.yaml
terminal:
  backend: ssh
  host: hpc
  remote_dir: /scratch/YOUR_USERNAME

5.4. Run a Connectivity Test

hermes chat -q "whoami; hostname; sinfo"

Expected output:

┌─ output ────────────────────────────┐
│ YOUR_USERNAME                       │
│ login-node-01.cluster.edu           │
│ PARTITION      AVAIL  TIMELIMIT     │
│ gpu_normal      up    7-00:00       │
│ gpu_dev         up    1-00:00       │
└─────────────────────────────────────┘

Troubleshooting: If the test hangs, verify password-less login with plain ssh hpc first. Hermes reuses your SSH key; it does not manage authentication itself.


6. Step 4: Slack Gateway Setup

The Slack gateway turns your workspace into a natural-language control panel for Hermes. This section covers creating a Slack app, wiring it to Hermes, and running the gateway as a system service.

6.1. Create a Slack App

  1. Go to https://api.slack.com/apps and click Create New App → From scratch.

  2. Name the app (e.g., hermes-research-bot) and select your workspace.

  3. Under OAuth & Permissions → Scopes → Bot Token Scopes, add the following OAuth scopes:

    • channels:history, channels:read, channels:write

    • chat:write, chat:write.customize

    • groups:history, im:history, im:write

    • files:read, files:write

    • mpim:history, users:read

  4. Under OAuth & Permissions, install the app to your workspace and copy the Bot User OAuth Token (begins with xoxb-).

These scopes allow the bot to read channel history, send formatted messages and file attachments, and inspect user information. Request only the scopes your workflow requires.

6.2. Generate an App-Level Token

  1. Go to Settings → Basic Information → App-Level Tokens.

  2. Create a token with the connections:write scope (required for Socket Mode).

  3. Copy the token (begins with xapp-).

6.3. Configure Hermes Credentials

Run the interactive setup wizard:

hermes gateway setup
# [?] Select gateway type:  → Slack
# [?] Enable Socket Mode?:  → Yes

Store credentials in ~/.hermes/.env, never in config.yaml:

# ~/.hermes/.env (never commit to version control)
SLACK_BOT_TOKEN=xoxb-YOUR_SLACK_BOT_TOKEN
SLACK_APP_TOKEN=xapp-YOUR_SLACK_APP_TOKEN

Reference the environment variables in config:

hermes config set gateway.slack.token '${SLACK_BOT_TOKEN}'
hermes config set gateway.slack.app_token '${SLACK_APP_TOKEN}'

Your ~/.hermes/config.yaml should show variable references, not raw secrets:

gateway:
  type: slack
  socket_mode: true
  slack:
    token: ${SLACK_BOT_TOKEN}
    app_token: ${SLACK_APP_TOKEN}

Security note: Hermes loads ~/.hermes/.env automatically at startup. If you paste a raw xoxb- token directly into config.yaml, hermes doctor will warn you.
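A similar check can be run independently, for example in a pre-commit hook. The sketch below is illustrative only and does not reproduce what hermes doctor does internally: `parse_env` and `find_raw_tokens` are hypothetical helpers, and the pattern assumes tokens begin with the `xoxb-`, `xoxp-`, or `xapp-` prefixes.

```python
import re
from typing import Dict, List

# Assumed token shapes: a known Slack prefix followed by letters, digits, and dashes.
TOKEN_PATTERN = re.compile(r"\b(?:xox[bp]|xapp)-[A-Za-z0-9-]+")


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and comments."""
    env: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip()
    return env


def find_raw_tokens(config_text: str) -> List[str]:
    """Return any literal Slack tokens found in the text of a config file."""
    return TOKEN_PATTERN.findall(config_text)


good = "gateway:\n  slack:\n    token: ${SLACK_BOT_TOKEN}\n"
bad = "gateway:\n  slack:\n    token: xoxb-123-abc\n"
print("good config leaks:", find_raw_tokens(good))
print("bad config leaks: ", find_raw_tokens(bad))
```

Variable references such as `${SLACK_BOT_TOKEN}` pass the check, while a pasted token is reported.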

6.4. Install and Start the Gateway

hermes gateway install   # Register as a systemd service (runs as your user)
hermes gateway start     # Start the service
hermes gateway status    # Verify it is running

Expected status output:

✅ Service active: hermes-gateway.service
✅ Socket Mode: connected
✅ Listening on workspace: your-team.slack.com

6.5. Send Your First Command

Open Slack and message the bot directly:

you  →  @hermes-research-bot: whoami; hostname; sinfo
bot  ←  [cluster output formatted as a table]

Or mention it in a channel:

you  →  #research-gpu: @hermes-research-bot show me active jobs on gpu_normal
bot  ←  [squeue output formatted as a table]

Hermes parses natural language, routes the request through SSH, executes it on the cluster, and returns formatted results, entirely within Slack.


7. Step 5: SLURM Job Templates + Singularity

Hermes submits training jobs to the cluster via SLURM sbatch. The standard workflow is:

  1. Hermes writes a SLURM script on your local machine

  2. It SCPs the script to the cluster

  3. It runs sbatch via SSH

  4. It monitors job status and tails logs in real time
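Steps 2 and 3 can be sketched in Python as follows. This is a rough illustration of the workflow, not Hermes's internal code: `submit_commands`, `parse_job_id`, and `submit` are hypothetical helpers. The `--parsable` flag is a standard sbatch option that prints only the job ID (optionally followed by `;cluster`).

```python
import re
import subprocess
from pathlib import Path
from typing import List, Optional


def submit_commands(script: Path, remote_dir: str, alias: str = "hpc") -> List[List[str]]:
    """Return the scp and sbatch commands for steps 2 and 3."""
    remote_path = f"{remote_dir.rstrip('/')}/{script.name}"
    return [
        ["scp", str(script), f"{alias}:{remote_path}"],
        ["ssh", alias, "sbatch", "--parsable", remote_path],
    ]


def parse_job_id(sbatch_stdout: str) -> Optional[int]:
    """Extract the job ID from sbatch output, or None if it is not recognized.

    Accepts the default "Submitted batch job 123" message and the
    --parsable forms "123" and "123;cluster".
    """
    match = re.fullmatch(
        r"(?:Submitted batch job\s+)?(\d+)(?:;\S+)?", sbatch_stdout.strip()
    )
    return int(match.group(1)) if match else None


def submit(script: Path, remote_dir: str, alias: str = "hpc") -> Optional[int]:
    """Copy the script to the cluster, submit it, and return the job ID."""
    scp_cmd, sbatch_cmd = submit_commands(script, remote_dir, alias)
    subprocess.run(scp_cmd, check=True)
    out = subprocess.run(sbatch_cmd, check=True, capture_output=True, text=True)
    return parse_job_id(out.stdout)
```

Keeping command construction separate from execution makes the logic easy to test without a cluster; only `submit` touches the network.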

7.1. SLURM Script Template

Save the following as cnn_train.slurm:

#!/bin/bash
#SBATCH --job-name=cnn_cifar10
#SBATCH --partition=gpu_normal
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --output=/scratch/%u/cnn_%j.out
#SBATCH --error=/scratch/%u/cnn_%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=YOUR_EMAIL

# Load required modules
module load singularity

# Path to the Singularity container
CONTAINER=/opt/containers/pytorch_2.0.sif

# Run training inside the container.
# /scratch is mounted at /data, so every /data/... path below lives under /scratch on the host.
singularity exec \
  --bind /scratch:/data,/home:/home \
  --nv \
  "$CONTAINER" \
  python /data/train_cnn.py \
    --epochs      50  \
    --batch_size  128 \
    --lr          0.01 \
    --data_dir    /data/cifar10 \
    --output_dir  /data/cnn_results
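When many experiments share this template, the #SBATCH header can be generated rather than edited by hand. The sketch below is one way to do that and is not a Hermes feature: `render_sbatch_header` is a hypothetical helper, and it names log files after the job name rather than the fixed `cnn_` prefix used in the template.

```python
from typing import List


def render_sbatch_header(
    job_name: str,
    partition: str,
    gpus: int = 1,
    hours: int = 24,
    cpus: int = 8,
    mem_gb: int = 32,
    log_dir: str = "/scratch/%u",
) -> str:
    """Render an #SBATCH header like the one in the template (hypothetical helper)."""
    if min(gpus, hours, cpus, mem_gb) < 1:
        raise ValueError("gpus, hours, cpus, and mem_gb must all be at least 1")
    lines: List[str] = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --gres=gpu:{gpus}",
        f"#SBATCH --time={hours:02d}:00:00",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={mem_gb}G",
        # %u and %j are expanded by SLURM to the user name and the job ID.
        f"#SBATCH --output={log_dir}/{job_name}_%j.out",
        f"#SBATCH --error={log_dir}/{job_name}_%j.err",
    ]
    return "\n".join(lines) + "\n"


# Example: a short smoke test on the development partition.
print(render_sbatch_header("cnn_cifar10", "gpu_dev", hours=1))
```

Validating the resource values before rendering catches mistakes locally instead of at submission time.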

7.2. Submit via Hermes

# Hermes handles this automatically; the equivalent manual steps are:
scp cnn_train.slurm hpc:/scratch/YOUR_USERNAME/
ssh hpc "sbatch /scratch/YOUR_USERNAME/cnn_train.slurm"

7.3. Monitor via Hermes

# Check job status
ssh hpc "squeue -u YOUR_USERNAME --format='%i %P %j %S %R'"

# Stream live output
ssh hpc "tail -f /scratch/YOUR_USERNAME/cnn_JOB_ID.out"
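For programmatic monitoring, squeue output is easier to handle with an explicit delimiter and no header row. The parser below is a sketch that assumes the command `squeue -h -u YOUR_USERNAME -o "%i|%T|%j|%R"` (job ID, state, name, reason or node list), which differs from the format string shown above; the function names are hypothetical.

```python
from typing import Dict, List

# Field order matches the assumed format string "%i|%T|%j|%R".
SQUEUE_FIELDS = ("job_id", "state", "name", "reason_or_nodes")


def parse_squeue(output: str) -> List[Dict[str, str]]:
    """Turn pipe-delimited squeue output into a list of job records."""
    jobs: List[Dict[str, str]] = []
    for line in output.splitlines():
        parts = [part.strip() for part in line.split("|")]
        if len(parts) == len(SQUEUE_FIELDS):  # skip blank or malformed lines
            jobs.append(dict(zip(SQUEUE_FIELDS, parts)))
    return jobs


def pending_jobs(jobs: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return only the jobs that are still waiting in the queue."""
    return [job for job in jobs if job["state"] == "PENDING"]


# Made-up sample output for illustration.
sample = "101|RUNNING|cnn_cifar10|node03\n102|PENDING|cnn_cifar10|(Resources)\n"
for job in parse_squeue(sample):
    print(job["job_id"], job["state"], job["reason_or_nodes"])
```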

8. Step 6: Simple CNN Demo

A ready-to-use PyTorch CNN script (train_cnn.py) is included in this repository. It:

  • Trains a 4-layer CNN on CIFAR-10

  • Accepts --epochs, --batch_size, --lr, --data_dir, --output_dir arguments

  • Saves metrics.csv (per-epoch train/val loss and accuracy)

  • Saves best_model.pth checkpoint

  • Outputs summary.json with the best validation accuracy

# Quick local test (CPU, small dataset)
python train_cnn.py \
  --epochs 3 --batch_size 64 --lr 0.001 \
  --data_dir ./data --output_dir ./results

# Submit to cluster via SLURM + Singularity
ssh hpc "sbatch /scratch/YOUR_USERNAME/cnn_train.slurm"

See train_cnn.py in this directory for the full source code.
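Once a run finishes, metrics.csv can be summarized with the standard library alone. The sketch below assumes the column names `epoch`, `train_loss`, `val_loss`, `train_acc`, and `val_acc`; check the header written by train_cnn.py, since the actual names may differ. `best_epoch` is a hypothetical helper and the sample rows are made up for illustration.

```python
import csv
import io
from typing import Dict, Union


def best_epoch(csv_text: str, metric: str = "val_acc") -> Dict[str, Union[int, float]]:
    """Return the epoch with the highest value of `metric` (assumed column names)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("metrics file has no data rows")
    best = max(rows, key=lambda row: float(row[metric]))
    return {"epoch": int(best["epoch"]), metric: float(best[metric])}


# Made-up sample in the assumed format.
sample_metrics = """epoch,train_loss,val_loss,train_acc,val_acc
1,1.62,1.41,0.41,0.49
2,1.21,1.10,0.57,0.61
3,0.98,1.02,0.66,0.64
"""
print(best_epoch(sample_metrics))
```

To use it on a real run, pass the file contents, for example `best_epoch(open("results/metrics.csv").read())`.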


9. Step 7: Automated Research Loop via Hermes Cron

Hermes can proactively monitor your experiments and Slack-DM you with updates, with no manual polling required.

9.1. Create a Monitoring Cron Job

hermes cron create "every 5m" \
  --prompt "SSH to hpc, run squeue -u YOUR_USERNAME, tail the latest CNN training log,
            parse the latest epoch metrics, and report train/val accuracy and loss.
            If training is complete, summarize results and suggest next experiments."

9.2. Manage Cron Jobs

hermes cron list        # View all scheduled jobs
hermes cron pause ID    # Pause a specific job
hermes cron resume ID   # Resume a paused job
hermes cron remove ID   # Delete a job permanently

The benefit in practice: start a 24-hour training run, close your laptop, and go home. Every 5 minutes, Hermes checks the log, parses metrics, and messages you in Slack. When training finishes, you receive a summary and a suggested next step, for example:

“Val accuracy plateaued at 82% after epoch 12. Suggest: increase lr to 0.01 or add data augmentation.”
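A plateau like the one in that message can also be detected with a simple rule, which is useful as a sanity check on the agent's summary. The sketch below is an assumption about how one might define a plateau, not the logic Hermes uses: `has_plateaued`, `patience`, and `min_delta` are hypothetical names, and the accuracy values are made up.

```python
from typing import Sequence


def has_plateaued(
    val_acc: Sequence[float], patience: int = 3, min_delta: float = 0.002
) -> bool:
    """Return True if the last `patience` epochs improved the best
    validation accuracy by less than `min_delta`."""
    if len(val_acc) <= patience:
        return False  # not enough history to judge
    earlier_best = max(val_acc[:-patience])
    recent_best = max(val_acc[-patience:])
    return recent_best - earlier_best < min_delta


# Made-up accuracy curve that flattens out around 0.82.
curve = [0.60, 0.70, 0.78, 0.82, 0.82, 0.819, 0.821]
print("plateaued:", has_plateaued(curve))
```

Tightening `min_delta` or raising `patience` makes the rule more conservative about declaring a plateau.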


10. Step 8: Security Hardening

10.1. SLURM Best Practices

| Practice | Directive | Purpose |
| --- | --- | --- |
| Resource limits | --mem=32G --cpus-per-task=8 --time=24:00:00 | Prevent runaway jobs |
| Account chargeback | --account=research_group | Track compute costs per group |
| Node constraints | --constraint=v100 | Pin jobs to a specific GPU type |
| Job arrays | --array=0-9%3 | Run 10 sweeps with max 3 concurrent |
| Node exclusion | --exclude=node05 | Skip known problematic nodes |
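With a job array such as --array=0-9%3, each task needs to map its index to one hyperparameter setting. SLURM exposes the index in the `SLURM_ARRAY_TASK_ID` environment variable; the grid values and the `config_for_task` helper below are illustrative assumptions, not part of the guide's demo script.

```python
import itertools
import os
from typing import Dict, Union

# Hypothetical sweep: 5 learning rates x 2 batch sizes = 10 tasks (--array=0-9).
LEARNING_RATES = [0.1, 0.03, 0.01, 0.003, 0.001]
BATCH_SIZES = [64, 128]
GRID = list(itertools.product(LEARNING_RATES, BATCH_SIZES))


def config_for_task(task_id: int) -> Dict[str, Union[int, float]]:
    """Map an array task index to one point of the sweep grid."""
    if not 0 <= task_id < len(GRID):
        raise IndexError(f"task_id {task_id} is outside 0..{len(GRID) - 1}")
    lr, batch_size = GRID[task_id]
    return {"lr": lr, "batch_size": batch_size}


# Inside a job, SLURM sets this variable; default to 0 for local testing.
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
print(config_for_task(task_id))
```

Because the mapping is deterministic, any task can be rerun alone by resubmitting with a single index, for example --array=7.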

10.2. Singularity / Apptainer Best Practices

| Practice | Flag | Purpose |
| --- | --- | --- |
| Clean environment | --cleanenv | Prevent credential leakage from the host |
| Minimal bind mounts | --bind /data:/data | Mount only required directories |
| GPU passthrough | --nv | NVIDIA GPU access without full host privileges |
| Unprivileged containers | --userns (Apptainer) | User-namespace isolation |
| Image verification | singularity verify container.sif | Validate signatures before execution |

10.3. SSH Hardening

# ~/.ssh/config
Host hpc
    ServerAliveInterval  60
    ServerAliveCountMax  3
    # Do not expose your SSH agent to the cluster
    ForwardAgent         no
    # Always verify host keys
    StrictHostKeyChecking yes

11. Troubleshooting

| Symptom | Diagnostic Command | Resolution |
| --- | --- | --- |
| Slack bot not responding | hermes gateway status | hermes gateway restart |
| SSH connection refused | ssh hpc "echo OK" | Check ~/.ssh/config and cluster DNS |
| SLURM job stuck in PD | squeue -u USER | Verify partition and GPU availability |
| singularity exec fails | Inspect job log | Ensure --nv flag is present for NVIDIA GPUs |
| Hermes cannot write to /scratch | Check permissions | chmod 755 /scratch/YOUR_USERNAME |
| Cron job not triggering | hermes cron list | Verify schedule format and enabled status |


12. Citations

@misc{hermes_remote_hpc_slack,
  author       = {Sanghlao, Snit},
  title        = {Hermes Remote HPC + Slack: Research Acceleration Guide},
  year         = {2026},
  howpublished = {Mahidol University AI Center},
  note         = {aicenter.mahidol.ac.th/hermes-remote-hpc-slack},
}

@software{hermes_agent,
  author       = {{Nous Research}},
  title        = {Hermes Agent},
  url          = {https://hermes-agent.nousresearch.com/docs},
  version      = {latest},
  year         = {2026},
}

@software{karpathy_autoresearch,
  author       = {Karpathy, Andrej},
  title        = {AutoResearch: Autonomous ML Research Agent},
  url          = {https://github.com/karpathy/autoresearch},
  year         = {2025},
}

Built for researchers who refuse to compromise on productivity, privacy, and control.

Mahidol University AI Center · Hermes Agent · SLURM + Singularity