Hermes Remote HPC + Slack: Research Acceleration Guide
Authors: Snit Sanghlao, Qwen, Hermes
Published by: Mahidol University AI Center
Reference: aicenter.mahidol.ac.th/hermes-remote-hpc-slack
Inspired by: Andrej Karpathy's AutoResearch, an autonomous research agent that iterates on experiments, analysis, and reporting without manual intervention. We adapt this vision to cluster-based ML research, with Hermes as the local orchestrator.
Hermes documentation: https://hermes-agent.nousresearch.com/docs
Table of Contents
Executive Summary
Prerequisites
SSH Setup
Install Hermes Locally
Configure Hermes SSH Backend
Slack Gateway Setup
SLURM Job Templates + Singularity
Simple CNN Demo
Automated Research Loop via Hermes Cron
Security Hardening
Troubleshooting
Citations
1. Executive Summary
Machine learning research on HPC clusters should not require a terminal session glued to your screen. This guide shows how to drive experiments entirely through Slack using:
| Component | Role |
|---|---|
| Hermes (local machine) | AI-powered orchestrator: parses Slack commands, manages workflows, coordinates cluster jobs |
| SSH | Secure, key-based tunnel from your workstation to the HPC login node |
| SLURM + Singularity (remote cluster) | Batch scheduling and containerized GPU training environments |
| Slack | Natural-language interface for submitting runs, checking logs, and reviewing results |

The goal is a closed-loop research workflow: you describe an idea in Slack and receive plots, metrics, and a suggested next research direction, all without opening a terminal.
The Research Benefit Loop
IDEA (Slack) --> HERMES (Local) --> SSH --> SLURM (Cluster) --> GPU TRAINING
                                                                     |
RESULTS (Logs/Plots) <-- SLURM <-- HERMES <--------------------------+
         |
         v
SLACK REPORT + NEXT IDEA --> ITERATE (back to IDEA)
You type an idea in Slack → Hermes translates it into a workflow → the cluster trains → results flow back → Hermes summarizes and suggests the next step → repeat.
The Convergence Principle
The loop above is an instance of a general research iteration model:
Ideal-Starter(Human | LLM) - Validator(any abstraction level) = Residual
                                                                   |
                 Auto Research Loop <------------------------------+
                         |
                 Residual Converges?
                   Yes   |   No
                    |    +----> iterate
                   STOP
| Term | Meaning |
|---|---|
| Ideal-Starter | The initial hypothesis or experiment design, provided by a human, generated by an LLM, or both |
| Validator | Any evaluation signal (loss curves, accuracy metrics, statistical tests, human review, or a critic model), operating at any abstraction level |
| Residual | The gap between the current result and the ideal: what remains to be explained or improved |
| Converge | The residual falls below an acceptable threshold (e.g., validation loss plateaus, accuracy target met, human approves result) |

The auto research loop driven by Hermes operationalizes this model: each iteration reduces the residual until convergence, at which point the loop stops and the final result is reported.
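To make the iteration model concrete, here is a minimal Python sketch of a loop that stops once the residual falls within a tolerance. The names `research_loop` and `toy_experiment`, the tolerance value, and the reading of the residual as "target minus result" are illustrative assumptions and not part of Hermes.

```python
from typing import Callable, List, Tuple


def research_loop(
    run_experiment: Callable[[int], float],
    target: float,
    tolerance: float = 0.01,
    max_iters: int = 20,
) -> Tuple[int, List[float]]:
    """Iterate until the residual (target - result) is within tolerance.

    run_experiment stands in for one full train-and-evaluate cycle.
    Returns the number of iterations used and the residual history.
    """
    residuals: List[float] = []
    for iteration in range(1, max_iters + 1):
        result = run_experiment(iteration)  # validator signal, e.g. val accuracy
        residual = target - result          # gap between the ideal and the result
        residuals.append(residual)
        if residual <= tolerance:           # converged: stop and report
            return iteration, residuals
    return max_iters, residuals             # budget exhausted without converging


def toy_experiment(iteration: int) -> float:
    """Hypothetical experiment whose accuracy approaches 0.9 over iterations."""
    return 0.9 - 0.4 * (0.5 ** iteration)


iters, history = research_loop(toy_experiment, target=0.9, tolerance=0.02)
print(iters, [round(r, 4) for r in history])
```

The `max_iters` cap matters in practice: a residual that never converges should end the loop with a report rather than consume cluster hours indefinitely.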
2. Prerequisites
Both the local workstation (where Hermes runs) and the remote HPC cluster must meet the requirements listed below.
| Requirement | Home Machine (Local) | Remote Cluster (HPC) |
|---|---|---|
| OS | Ubuntu 20.04+ / macOS 12+ / WSL2 | Linux (CentOS 7+, Rocky 8+, Ubuntu 20+) |
| Python | 3.10+ (for Hermes + helpers) | 3.9+ (inside Singularity containers) |
| Node.js | 18+ (for Slack Bolt SDK) | Not required |
| SSH Key | Yes (public key installed on cluster) | Yes |
| SLURM | No | Yes |
| Singularity/Apptainer | No | Yes (containerized ML environments) |
| Hermes | Yes (installed locally) | No (runs locally only) |

Note: If your cluster uses a different scheduler (PBS/Torque, LSF), adapt the SLURM commands accordingly. The architecture remains the same.
3. Step 1: SSH Setup
Hermes communicates with the HPC cluster over SSH. A password-less, key-based configuration is essential for fully automated workflows.
3.1. Generate an SSH Key
ssh-keygen -t ed25519 -C "your_email@example.com"
# Private key: ~/.ssh/id_ed25519      (keep secret)
# Public key:  ~/.ssh/id_ed25519.pub  (copy to cluster)
3.2. Install the Public Key on the Cluster
# Copy the public key to the cluster
ssh-copy-id YOUR_USERNAME@YOUR_CLUSTER_ADDRESS
# Alternatively, paste the contents of id_ed25519.pub into
# ~/.ssh/authorized_keys on the cluster login node.
3.3. Configure ~/.ssh/config
Add a named host entry. Hermes will reference this alias throughout the workflow.
Host hpc
    HostName YOUR_CLUSTER_ADDRESS
    User YOUR_USERNAME
    IdentityFile ~/.ssh/id_ed25519
    Port 22
    # Uncomment only if you need agent forwarding to hop beyond the login node
    # (e.g., with a key held on a hardware token such as a YubiKey)
    # ForwardAgent yes
Replace `YOUR_CLUSTER_ADDRESS` and `YOUR_USERNAME` with your actual values.
3.4. Test the Connection
ssh hpc
whoami # Expected: YOUR_USERNAME
exit
If you reach the cluster without a password prompt, the SSH tunnel is ready.
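For automation it helps to make this check fail fast instead of waiting at a password prompt. The sketch below is one way to do that from Python, using the standard OpenSSH options `BatchMode` and `ConnectTimeout`. The helper names `build_ssh_check` and `key_login_works` are hypothetical, and this is not how Hermes itself tests the connection.

```python
import subprocess
from typing import List


def build_ssh_check(alias: str, timeout_s: int = 10) -> List[str]:
    """Build an ssh command that errors out rather than prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                # never fall back to interactive prompts
        "-o", f"ConnectTimeout={timeout_s}",  # give up if the host is unreachable
        alias,
        "whoami",
    ]


def key_login_works(alias: str = "hpc") -> bool:
    """Return True if key-based login succeeds without any prompt."""
    try:
        proc = subprocess.run(
            build_ssh_check(alias), capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return proc.returncode == 0 and proc.stdout.strip() != ""
```

Calling `key_login_works("hpc")` before starting a long workflow gives a clear yes/no answer instead of a hung session.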
4. Step 2: Install Hermes Locally
Hermes is the local AI agent that bridges Slack and your HPC cluster. It receives Slack commands, translates them into terminal / SLURM / file operations, and returns results. Think of it as an always-on research assistant running on your machine.
4.1. Install Hermes
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
This installs the Hermes CLI and supporting tools into ~/.hermes.
4.2. Verify the Installation
hermes doctor
All health-check items should pass:
✅ Python .......... 3.11.x (OK)
✅ Node.js ......... 18.x (OK)
✅ Config folder ... ~/.hermes (OK)
✅ Disk space ...... 45 GB free (OK)
Resolve any flagged issues before proceeding. See the Hermes documentation for troubleshooting guidance.
4.3. Hermes Capabilities at a Glance
| Capability | Description |
|---|---|
| Command execution | Runs shell commands locally and over SSH |
| File management | Reads, writes, and diffs files on local and remote systems |
| Process orchestration | Manages long-running background jobs |
| Slack integration | Bolt SDK app that maps Slack messages to Hermes actions |
| Self-reasoning | Breaks complex tasks into steps, executes, and verifies outcomes |

5. Step 3: Configure Hermes SSH Backend
Hermes must know how to reach the cluster. Rather than hardcoding hostnames in scripts, you set a single backend directive that points to the SSH alias defined in ~/.ssh/config.
5.1. Set the Backend and Host
# Use SSH as the terminal backend
hermes config set terminal.backend ssh
# Point to the SSH alias defined in ~/.ssh/config
hermes config set terminal.host hpc
The hpc alias must match a Host hpc stanza in your SSH config. Hermes resolves the hostname, user, and identity file automatically.
5.2. Set the Remote Working Directory
hermes config set terminal.remote_dir /scratch/YOUR_USERNAME
Use `/scratch` (not `/home`) for large datasets and model checkpoints to avoid quota issues.
5.3. Verify the Config File
# ~/.hermes/config.yaml
terminal:
  backend: ssh
  host: hpc
  remote_dir: /scratch/YOUR_USERNAME
5.4. Run a Connectivity Test
hermes chat -q "whoami; hostname; sinfo"
Expected output:
┌─ output ──────────────────────────────┐
│ YOUR_USERNAME                         │
│ login-node-01.cluster.edu             │
│ PARTITION    AVAIL   TIMELIMIT        │
│ gpu_normal   up      7-00:00          │
│ gpu_dev      up      1-00:00          │
└───────────────────────────────────────┘
Troubleshooting: If the test hangs, verify password-less login with plain `ssh hpc` first. Hermes reuses your SSH key; it does not manage authentication itself.
6. Step 4: Slack Gateway Setup
The Slack gateway turns your workspace into a natural-language control panel for Hermes. This section covers creating a Slack app, wiring it to Hermes, and running the gateway as a system service.
6.1. Create a Slack App
1. Go to https://api.slack.com/apps and click Create New App → From scratch.
2. Name the app (e.g., `hermes-research-bot`) and select your workspace.
3. Under Features → Bot Tokens, add the following OAuth scopes:
   - `channels:history`, `channels:read`, `channels:write`
   - `chat:write`, `chat:write.customize`
   - `groups:history`, `im:history`, `im:write`
   - `files:read`, `files:write`
   - `mpim:history`, `users:read`
4. Under OAuth & Permissions, install the app to your workspace and copy the Bot User OAuth Token (begins with `xoxb-`).
These scopes allow the bot to read channel history, send formatted messages and file attachments, and inspect user information. Request only the scopes your workflow requires.
6.2. Generate an App-Level Token
1. Go to Settings → Basic Information → App-Level Tokens.
2. Create a token with the `connections:write` scope (required for Socket Mode).
3. Copy the token (begins with `xapp-`).
6.3. Configure Hermes Credentials
Run the interactive setup wizard:
hermes gateway setup
# [?] Select gateway type: → Slack
# [?] Enable Socket Mode?: → Yes
Store credentials in ~/.hermes/.env, never in config.yaml:
# ~/.hermes/.env (never commit to version control)
SLACK_BOT_TOKEN=xoxb-YOUR_SLACK_BOT_TOKEN
SLACK_APP_TOKEN=xapp-YOUR_SLACK_APP_TOKEN
Reference the environment variables in config:
hermes config set gateway.slack.token '${SLACK_BOT_TOKEN}'
hermes config set gateway.slack.app_token '${SLACK_APP_TOKEN}'
Your ~/.hermes/config.yaml should show variable references, not raw secrets:
gateway:
  type: slack
  socket_mode: true
  slack:
    token: ${SLACK_BOT_TOKEN}
    app_token: ${SLACK_APP_TOKEN}
Security note: Hermes loads `~/.hermes/.env` automatically at startup. If you paste a raw `xoxb-` token directly into `config.yaml`, `hermes doctor` will warn you.
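The split between a secrets file and variable references can be illustrated with a small Python sketch. It parses `KEY=VALUE` lines, expands `${NAME}` references, and flags values that look like raw Slack tokens. This is only an illustration of the idea under simplified assumptions; the function names are hypothetical and it does not reproduce how Hermes actually loads `.env` or performs its checks.

```python
import re
from typing import Dict

_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def expand(value: str, env: Dict[str, str]) -> str:
    """Replace each ${NAME} with env[NAME]; raise if a name is undefined."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in env:
            raise KeyError(f"undefined variable: {name}")
        return env[name]
    return _REF.sub(substitute, value)


def looks_like_raw_token(value: str) -> bool:
    """Flag config values that embed a Slack token instead of a reference."""
    return value.startswith(("xoxb-", "xapp-"))
```

With this separation, `config.yaml` can be shared or committed while the `.env` file stays private to the machine running the gateway.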
6.4. Install and Start the Gateway
hermes gateway install # Register as a systemd service (runs as your user)
hermes gateway start # Start the service
hermes gateway status # Verify it is running
Expected status output:
✅ Service active: hermes-gateway.service
✅ Socket Mode: connected
✅ Listening on workspace: your-team.slack.com
6.5. Send Your First Command
Open Slack and message the bot directly:
you → @hermes-research-bot: whoami; hostname; sinfo
bot → [cluster output formatted as a table]
Or mention it in a channel:
you → #research-gpu: @hermes-research-bot show me active jobs on gpu_normal
bot → [squeue output formatted as a table]
Hermes parses natural language, routes the request through SSH, executes it on the cluster, and returns formatted results, entirely within Slack.
7. Step 5: SLURM Job Templates + Singularity
Hermes submits training jobs to the cluster via SLURM sbatch. The standard workflow is:
1. Hermes writes a SLURM script on your local machine.
2. It copies the script to the cluster with scp.
3. It runs `sbatch` via SSH.
4. It monitors job status and tails logs in real time.
7.1. SLURM Script Template
Save the following as cnn_train.slurm:
#!/bin/bash
#SBATCH --job-name=cnn_cifar10
#SBATCH --partition=gpu_normal
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --output=/scratch/%u/cnn_%j.out
#SBATCH --error=/scratch/%u/cnn_%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=YOUR_EMAIL
# Load required modules
module load singularity
# Path to the Singularity container
CONTAINER=/opt/containers/pytorch_2.0.sif
# Run training inside the container
singularity exec \
    --bind /scratch:/data,/home:/home \
    --nv \
    "$CONTAINER" \
    python /data/train_cnn.py \
    --epochs 50 \
    --batch_size 128 \
    --lr 0.01 \
    --data_dir /data/cifar10 \
    --output_dir /data/cnn_results
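Because Hermes writes the SLURM script locally before copying it to the cluster, the template can be parameterized. The following Python sketch shows one way to render a reduced version of the script from a few fields. The `TrainJob` dataclass and `render_slurm_script` function are hypothetical names introduced for illustration, and the rendered script omits several directives from the full template.

```python
from dataclasses import dataclass


@dataclass
class TrainJob:
    job_name: str
    partition: str
    hours: int
    container: str
    script: str
    epochs: int
    batch_size: int
    lr: float


def render_slurm_script(job: TrainJob) -> str:
    """Render a minimal sbatch script for one training run."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job.job_name}",
        f"#SBATCH --partition={job.partition}",
        "#SBATCH --gres=gpu:1",
        f"#SBATCH --time={job.hours:02d}:00:00",
        "",
        "module load singularity",
        # Trailing backslashes continue the shell command onto the next line.
        f'singularity exec --nv --bind /scratch:/data "{job.container}" \\',
        f"    python {job.script} \\",
        f"    --epochs {job.epochs} --batch_size {job.batch_size} --lr {job.lr}",
    ]
    return "\n".join(lines) + "\n"


job = TrainJob("cnn_cifar10", "gpu_normal", 24,
               "/opt/containers/pytorch_2.0.sif", "/data/train_cnn.py",
               epochs=50, batch_size=128, lr=0.01)
script_text = render_slurm_script(job)
```

Generating the script from structured fields keeps hyperparameter sweeps reproducible: each run's settings live in one object rather than in hand-edited text.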
7.2. Submit via Hermes
# Hermes handles this automatically; the equivalent manual steps are:
scp cnn_train.slurm hpc:/scratch/YOUR_USERNAME/
ssh hpc "sbatch /scratch/YOUR_USERNAME/cnn_train.slurm"
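The same two manual steps can be scripted. The sketch below copies a script with `scp`, runs `sbatch` over SSH, and extracts the job ID from the standard `Submitted batch job <id>` line that sbatch prints. The function names are hypothetical, and this is an illustration of the manual steps rather than Hermes's internal implementation.

```python
import os
import posixpath
import re
import shlex
import subprocess
from typing import Optional

_SUBMITTED = re.compile(r"Submitted batch job (\d+)")


def parse_job_id(sbatch_output: str) -> Optional[int]:
    """Extract the numeric job ID from sbatch output, or None if absent."""
    match = _SUBMITTED.search(sbatch_output)
    return int(match.group(1)) if match else None


def submit(local_script: str, remote_dir: str, alias: str = "hpc") -> int:
    """Copy the script to the cluster, submit it, and return the SLURM job ID."""
    remote_path = posixpath.join(remote_dir, os.path.basename(local_script))
    subprocess.run(["scp", local_script, f"{alias}:{remote_path}"], check=True)
    result = subprocess.run(
        ["ssh", alias, f"sbatch {shlex.quote(remote_path)}"],
        check=True, capture_output=True, text=True,
    )
    job_id = parse_job_id(result.stdout)
    if job_id is None:
        raise RuntimeError(f"could not find a job ID in: {result.stdout!r}")
    return job_id
```

Capturing the job ID at submission time is what makes later monitoring possible, since the log file names in the template include `%j`.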
7.3. Monitor via Hermes
# Check job status
ssh hpc "squeue -u YOUR_USERNAME --format='%i %P %j %S %R'"
# Stream live output
ssh hpc "tail -f /scratch/YOUR_USERNAME/cnn_JOB_ID.out"
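To turn the `squeue` text into something a bot can summarize, the output can be parsed into records. This sketch assumes the exact `%i %P %j %S %R` format used above, a header line in the output, and job names without spaces. The field labels and the function name are hypothetical choices.

```python
from typing import Dict, List

# Labels for the five columns requested by --format='%i %P %j %S %R'.
FIELDS = ["job_id", "partition", "name", "start_time", "reason"]


def parse_squeue(output: str) -> List[Dict[str, str]]:
    """Parse squeue output (with header) into one dict per job."""
    rows: List[Dict[str, str]] = []
    lines = output.strip().splitlines()
    for line in lines[1:]:  # skip the header line
        # Split on whitespace at most 4 times so the last column keeps any spaces.
        parts = line.split(None, len(FIELDS) - 1)
        if len(parts) == len(FIELDS):
            rows.append(dict(zip(FIELDS, parts)))
    return rows
```

Passing `-h` to `squeue` suppresses the header; in that case the `lines[1:]` slice would need to become `lines`.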
8. Step 6: Simple CNN Demo
A ready-to-use PyTorch CNN script (train_cnn.py) is included in this repository. It:
- Trains a 4-layer CNN on CIFAR-10
- Accepts `--epochs`, `--batch_size`, `--lr`, `--data_dir`, `--output_dir` arguments
- Saves `metrics.csv` (per-epoch train/val loss and accuracy)
- Saves `best_model.pth` checkpoint
- Outputs `summary.json` with the best validation accuracy
# Quick local test (CPU, small dataset)
python train_cnn.py \
--epochs 3 --batch_size 64 --lr 0.001 \
--data_dir ./data --output_dir ./results
# Submit to cluster via SLURM + Singularity
ssh hpc "sbatch /scratch/YOUR_USERNAME/cnn_train.slurm"
See train_cnn.py in this directory for the full source code.
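Once a run finishes, `metrics.csv` can be reduced to the best epoch with a few lines of Python. The column names (`epoch`, `train_loss`, `val_loss`, `train_acc`, `val_acc`) are assumptions about the file layout, and the sample rows are made-up values for illustration only; check the header that train_cnn.py actually writes.

```python
import csv
import io
from typing import Dict, TextIO


def best_epoch(metrics_file: TextIO, column: str = "val_acc") -> Dict[str, float]:
    """Return the row with the highest value in the given column."""
    reader = csv.DictReader(metrics_file)
    rows = [{key: float(value) for key, value in row.items()} for row in reader]
    if not rows:
        raise ValueError("metrics file has no data rows")
    return max(rows, key=lambda row: row[column])


# Made-up sample data standing in for a real metrics.csv.
sample = io.StringIO(
    "epoch,train_loss,val_loss,train_acc,val_acc\n"
    "1,1.60,1.45,0.41,0.47\n"
    "2,1.20,1.18,0.57,0.58\n"
    "3,0.98,1.21,0.65,0.57\n"
)
best = best_epoch(sample)
print(int(best["epoch"]), best["val_acc"])
```

For a real run, replace the `io.StringIO` sample with `open("results/metrics.csv", newline="")`.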
9. Step 7: Automated Research Loop via Hermes Cron
Hermes can proactively monitor your experiments and send you Slack DMs with updates, so no manual polling is required.
9.1. Create a Monitoring Cron Job
hermes cron create "every 5m" \
--prompt "SSH to hpc, run squeue -u YOUR_USERNAME, tail the latest CNN training log,
parse the latest epoch metrics, and report train/val accuracy and loss.
If training is complete, summarize results and suggest next experiments."
9.2. Manage Cron Jobs
hermes cron list # View all scheduled jobs
hermes cron pause ID # Pause a specific job
hermes cron resume ID # Resume a paused job
hermes cron remove ID # Delete a job permanently
The benefit in practice: start a 24-hour training run, close your laptop, and go home. Every 5 minutes, Hermes checks the log, parses metrics, and messages you in Slack. When training finishes, you receive a summary and a suggested next step, for example:
"Val accuracy plateaued at 82% after epoch 12. Suggest: increase lr to 0.01 or add data augmentation."
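A message like this depends on some rule for deciding that a metric has plateaued. The sketch below shows one simple rule: the metric has plateaued if the most recent `patience` epochs did not beat the earlier best by at least `min_delta`. The function name and both thresholds are assumptions for illustration; Hermes may reason about the log quite differently.

```python
from typing import Sequence


def has_plateaued(
    values: Sequence[float], patience: int = 3, min_delta: float = 0.002
) -> bool:
    """True if the last `patience` values failed to beat the earlier best."""
    if len(values) <= patience:
        return False  # not enough history to judge
    best_before = max(values[:-patience])
    recent_best = max(values[-patience:])
    return recent_best < best_before + min_delta
```

For example, `has_plateaued([0.70, 0.78, 0.82, 0.82, 0.821, 0.82])` returns `True`, because the last three epochs improve on the earlier best of 0.82 by less than 0.002.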
10. Step 8: Security Hardening
10.1. SLURM Best Practices
| Practice | Directive | Purpose |
|---|---|---|
| Resource limits |  | Prevent runaway jobs |
| Account chargeback |  | Track compute costs per group |
| Node constraints |  | Pin jobs to a specific GPU type |
| Job arrays |  | Run 10 sweeps with max 3 concurrent |
| Node exclusion |  | Skip known problematic nodes |

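The Directive column is empty in this copy of the guide. As a hedged illustration, the Python sketch below assembles standard `sbatch` options that match each stated purpose: `--time` and `--mem` for resource limits, `--account` for chargeback, `--constraint` for node features, `--array` for job arrays, and `--exclude` for node exclusion. These are real sbatch options, but whether they are the directives the authors listed is an assumption, and the helper name and example values are hypothetical.

```python
from typing import List, Optional


def sbatch_directives(
    time_limit: str,
    memory: str,
    account: Optional[str] = None,
    constraint: Optional[str] = None,
    array: Optional[str] = None,
    exclude: Optional[str] = None,
) -> List[str]:
    """Build #SBATCH lines; resource limits are always set, the rest are optional."""
    directives = [f"#SBATCH --time={time_limit}", f"#SBATCH --mem={memory}"]
    optional = [
        ("account", account),        # charge usage to a group account
        ("constraint", constraint),  # require a node feature, e.g. a GPU type
        ("array", array),            # "1-10%3" = 10 tasks, at most 3 running at once
        ("exclude", exclude),        # skip the named nodes
    ]
    directives += [f"#SBATCH --{flag}={value}" for flag, value in optional if value]
    return directives
```

Making the time and memory limits mandatory arguments means no generated job can be submitted without them.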
10.2. Singularity / Apptainer Best Practices
| Practice | Flag | Purpose |
|---|---|---|
| Clean environment |  | Prevent credential leakage from the host |
| Minimal bind mounts |  | Mount only required directories |
| GPU passthrough |  | NVIDIA GPU access without full host privileges |
| Unprivileged containers |  | User-namespace isolation |
| Image verification |  | Validate signatures before execution |

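The Flag column is also empty in this copy. The sketch below shows how the first three practices might be combined into a single command line, using the Singularity/Apptainer options `--cleanenv`, `--bind`, and `--nv`. The `--userns` option and the `singularity verify` subcommand also exist and correspond to the last two rows, but they are left out of the sketch. Which flags the authors listed is an assumption, and the helper name and paths are hypothetical.

```python
from typing import Dict, List


def singularity_exec(
    image: str, command: List[str], binds: Dict[str, str], gpu: bool = True
) -> List[str]:
    """Build a singularity exec argv with a clean environment and explicit binds."""
    argv = ["singularity", "exec", "--cleanenv"]  # do not inherit host env vars
    if gpu:
        argv.append("--nv")                       # expose NVIDIA GPUs to the container
    for host_path, container_path in binds.items():
        argv += ["--bind", f"{host_path}:{container_path}"]  # mount only what is listed
    return argv + [image] + command
```

Binding a single per-user directory such as `/scratch/YOUR_USERNAME` exposes less of the host than binding all of `/scratch` and `/home`.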
10.3. SSH Hardening
# ~/.ssh/config
Host hpc
    ServerAliveInterval 60
    ServerAliveCountMax 3
    # Do not expose your SSH agent to the cluster
    ForwardAgent no
    # Always verify host keys
    StrictHostKeyChecking yes
11. Troubleshooting
| Symptom | Diagnostic Command | Resolution |
|---|---|---|
| Slack bot not responding |  |  |
| SSH connection refused |  | Check |
| SLURM job stuck in PD |  | Verify partition and GPU availability |
|  | Inspect job log | Ensure |
| Hermes cannot write to | Check permissions |  |
| Cron job not triggering |  | Verify schedule format and enabled status |

12. Citations
@misc{hermes_remote_hpc_slack,
author = {Sanghlao, Snit},
title = {Hermes Remote HPC + Slack: Research Acceleration Guide},
year = {2026},
howpublished = {Mahidol University AI Center},
note = {aicenter.mahidol.ac.th/hermes-remote-hpc-slack},
}
@software{hermes_agent,
author = {{Nous Research}},
title = {Hermes Agent},
url = {https://hermes-agent.nousresearch.com/docs},
version = {latest},
year = {2026},
}
@software{karpathy_autoresearch,
author = {Karpathy, Andrej},
title = {AutoResearch: Autonomous ML Research Agent},
url = {https://github.com/karpathy/autoresearch},
year = {2025},
}
Built for researchers who refuse to compromise on productivity, privacy, and control.
Mahidol University AI Center · Hermes Agent · SLURM + Singularity