
Hermes Agent Joins Your Discord VC and Talks Back - Here's How


A tweet from @hermesagenttips this week pointed out something that does not get much attention in the Hermes Agent documentation: the bot can join a Discord voice channel and hold a live conversation.

nobody told me Hermes Agent could just... join your Discord VC and talk back

It can. The feature has been live for months. Here is how it works and how to set it up.

What Happens When Hermes Joins a VC

The bot joins your voice channel and listens to each user's audio independently. It treats 1.5 seconds of silence after at least 0.5 seconds of speech as the end of an utterance, transcribes that audio through Whisper, runs the full agent pipeline (tools, memory, reasoning), and speaks the reply back into the channel via TTS.
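To make the speech-boundary rule concrete, here is a minimal sketch of a per-user utterance detector. Only the two thresholds come from the description above. The class name, the 20 ms frame size, and the `is_speech` flag (which would come from some voice-activity check) are assumptions for illustration, not Hermes internals.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SILENCE_TO_END_MS = 1500  # silence that closes an utterance (from the post)
MIN_SPEECH_MS = 500       # speech required before an utterance counts (from the post)


@dataclass
class UtteranceDetector:
    """Hypothetical helper: tracks one user's frames and reports finished utterances."""
    frame_ms: int = 20    # assumed frame duration
    speech_ms: int = 0    # voiced time in the current utterance
    silence_ms: int = 0   # trailing silence since the last voiced frame
    frames: List[bytes] = field(default_factory=list)

    def push(self, frame: bytes, is_speech: bool) -> Optional[bytes]:
        """Feed one frame; return the buffered audio once an utterance ends, else None."""
        if is_speech:
            self.speech_ms += self.frame_ms
            self.silence_ms = 0
            self.frames.append(frame)
            return None
        if not self.frames:
            return None  # silence before any speech: nothing to buffer
        self.silence_ms += self.frame_ms
        self.frames.append(frame)
        if self.silence_ms < SILENCE_TO_END_MS:
            return None
        # Enough silence: emit the audio only if the speech was long enough.
        long_enough = self.speech_ms >= MIN_SPEECH_MS
        audio = b"".join(self.frames) if long_enough else None
        self.speech_ms, self.silence_ms, self.frames = 0, 0, []
        return audio
```

One detector per user keeps overlapping speakers from cutting each other's utterances short. Bursts shorter than half a second are discarded instead of being sent to transcription.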

Transcripts also appear in the associated text channel as [Voice] @user: what you said, and the agent's text response posts alongside the spoken audio.

Only users listed in DISCORD_ALLOWED_USERS can interact via voice. Other users' audio is silently ignored.
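The allowlist check and the transcript line can be sketched in a few lines. This assumes DISCORD_ALLOWED_USERS holds a comma-separated list of user IDs. The function names are hypothetical and only mirror the behavior described above.

```python
from typing import Optional, Set


def parse_allowed_users(raw: str) -> Set[str]:
    """Split a comma-separated ID list, ignoring blanks and surrounding spaces."""
    return {part.strip() for part in raw.split(",") if part.strip()}


def voice_transcript_line(user_id: int, name: str, text: str,
                          allowed: Set[str]) -> Optional[str]:
    """Return the text-channel line for a transcript, or None for ignored users."""
    if str(user_id) not in allowed:
        return None  # audio from users outside the allowlist is dropped silently
    return f"[Voice] @{name}: {text}"
```

In practice the raw value would come from `os.environ.get("DISCORD_ALLOWED_USERS", "")`.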

Three Commands

All voice channel control happens through slash commands in any text channel where the bot is present:

/voice join      Bot joins your current voice channel
/voice leave     Bot disconnects from voice channel
/voice status    Show voice mode and connected channel

You must already be in a voice channel before running /voice join. The bot joins the same VC you are in.

Setup

Prerequisites

Install with voice support:

pip install "hermes-agent[messaging]"

The messaging extra pulls in discord.py[voice], which includes PyNaCl (voice encryption) and opus bindings. These are required for voice channel support.

Discord Bot Permissions

Your bot needs these permissions in the server:

Permission            Purpose
Connect               Join voice channels
Speak                 Play TTS audio in voice channels
Use Voice Activity    Detect when users are speaking

The combined permission integer for text and voice is 274881432640. Message Content Intent must be enabled in the Discord Developer Portal under your application's Bot settings.
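A permission integer is a bitfield, so it can be decoded to see what an invite link will grant. The sketch below uses bit positions from Discord's published permission table. The table is a partial copy covering only the flags relevant here, and the helper names are illustrative.

```python
# Partial copy of Discord's permission bit positions (bit index -> flag name).
PERMISSION_BITS = {
    6: "ADD_REACTIONS",
    10: "VIEW_CHANNEL",
    11: "SEND_MESSAGES",
    14: "EMBED_LINKS",
    15: "ATTACH_FILES",
    16: "READ_MESSAGE_HISTORY",
    18: "USE_EXTERNAL_EMOJIS",
    20: "CONNECT",
    21: "SPEAK",
    25: "USE_VAD",
    38: "SEND_MESSAGES_IN_THREADS",
}


def decode_permissions(value: int) -> list:
    """List the known flags set in a permission integer, lowest bit first."""
    return [name for bit, name in sorted(PERMISSION_BITS.items()) if value & (1 << bit)]


def unknown_bits(value: int) -> int:
    """Return whatever bits remain after removing every flag in the table."""
    known = sum(1 << bit for bit in PERMISSION_BITS)
    return value & ~known


if __name__ == "__main__":
    for flag in decode_permissions(274881432640):
        print(flag)
```

Decoded this way, 274881432640 sets Connect and Speak alongside the text permissions, but not the Use Voice Activity bit. If voice detection misbehaves, it may be worth confirming that permission separately, for example on the server's default role.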

Environment Variables

# Discord (already configured for text)
DISCORD_BOT_TOKEN=your-bot-token
DISCORD_ALLOWED_USERS=your-user-id

# STT - pick one
# Option 1: Local (free, no API key)
# pip install faster-whisper

# Option 2: Groq Whisper (fast, free tier)
GROQ_API_KEY=...

# Option 3: OpenAI Whisper (paid)
VOICE_TOOLS_OPENAI_KEY=...

# TTS - pick one
# Free: Edge TTS (built-in, no key needed)
# Free: NeuTTS (pip install neutts[all])
# Premium: ElevenLabs
ELEVENLABS_API_KEY=...

Local STT via faster-whisper requires no API key and runs entirely on your machine. Groq Whisper offers a free tier with faster turnaround. For TTS, Edge TTS is built in and free; ElevenLabs produces higher quality output.
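Before starting the bot, a quick pre-flight check can show which providers the environment supports. This is a hypothetical helper, not part of Hermes. It only looks for the variables listed above and checks whether the `faster_whisper` module is importable. It says nothing about which provider Hermes itself will prefer, and it leaves NeuTTS out.

```python
import importlib.util
import os
from typing import List, Mapping


def module_available(name: str) -> bool:
    """True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def stt_options(env: Mapping[str, str], local_installed: bool) -> List[str]:
    """List the speech-to-text providers that appear to be configured."""
    options = []
    if local_installed:
        options.append("faster-whisper (local)")
    if env.get("GROQ_API_KEY"):
        options.append("Groq Whisper API")
    if env.get("VOICE_TOOLS_OPENAI_KEY"):
        options.append("OpenAI Whisper API")
    return options


def tts_options(env: Mapping[str, str]) -> List[str]:
    """List the text-to-speech providers that appear to be configured."""
    options = ["Edge TTS"]  # described in the post as built in, no key needed
    if env.get("ELEVENLABS_API_KEY"):
        options.append("ElevenLabs")
    return options


if __name__ == "__main__":
    print("STT:", stt_options(os.environ, module_available("faster_whisper")) or "none found")
    print("TTS:", tts_options(os.environ))
```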

The docs recommend getting text and basic voice replies working before attempting VC mode. Start with /voice tts to test text-to-speech in a text channel. Then test /voice join in a dedicated testing channel.

The Stack

The pipeline uses standard components. On the STT side: faster-whisper (local CTranslate2 Whisper), Groq Whisper API, or OpenAI Whisper API. On the TTS side: Microsoft Edge TTS (edge-tts), NeuTTS (local neural TTS), or ElevenLabs.

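Audio travels over Discord's voice connection, which pairs a WebSocket gateway for signaling with a UDP stream for the encrypted Opus packets. discord.py[voice] handles encryption through PyNaCl (Python bindings for libsodium) and Opus encoding. To attribute audio to speakers, the bot uses Discord's SPEAKING opcode to map each stream's SSRC to a user ID, which works without the Server Members Intent.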
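The SSRC-to-user mapping is simple bookkeeping, sketched below. It assumes each SPEAKING event carries `user_id` and `ssrc` fields in its data object. The class and method names are hypothetical, and the real receive path in Hermes may differ.

```python
from collections import defaultdict
from typing import Dict, List, Optional


class SpeakerMap:
    """Hypothetical helper: routes audio packets to per-user buffers by SSRC."""

    def __init__(self) -> None:
        self._user_by_ssrc: Dict[int, str] = {}
        self.buffers: Dict[str, List[bytes]] = defaultdict(list)

    def on_speaking(self, data: dict) -> None:
        # `data` is assumed to be the "d" object of a voice gateway SPEAKING event.
        self._user_by_ssrc[int(data["ssrc"])] = str(data["user_id"])

    def on_audio(self, ssrc: int, packet: bytes) -> Optional[str]:
        """Buffer a packet under its speaker; return the user ID, or None if unmapped."""
        user_id = self._user_by_ssrc.get(ssrc)
        if user_id is None:
            return None  # no SPEAKING event seen yet for this stream
        self.buffers[user_id].append(packet)
        return user_id
```

Because the mapping comes from events on the voice connection itself, it needs no member list lookup, which fits the point above about the Server Members Intent.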

Why Use It

Voice mode covers situations where text falls short: hands-free use while coding or walking around, live back-and-forth debugging sessions where speaking is faster than typing, and quick idea capture without switching context.

The Hermes docs lay out a suggested first-week progression: get text Hermes working, install voice extras, test CLI voice mode with local STT and Edge TTS, enable voice replies in Discord, and only then try VC mode. The feature works without much configuration once the basics are in place.

The full voice documentation covers edge cases around multiple speakers, silence detection tuning, and provider selection.[^1]

[^1]: Hermes Agent. "Use Voice Mode with Hermes." hermes-agent.nousresearch.com.


Ryan Underdown

Autodidact. Rarely listens to advice.

Follow on X @catamarammed or GitHub @underdown