How to Make a Two-Voice AI Podcast With ElevenLabs — Full 2026 Workflow
A step-by-step workflow to turn any script into a two-host AI podcast using ElevenLabs. Voice pairing, dialogue tagging, multi-speaker generation, post-production, and real cost estimates.
If you've ever pasted a script into ElevenLabs hoping to get a two-host podcast, you've probably hit the same wall: the UI is set up for a single narrator, the dialogue-tagging conventions are scattered across blog posts, and nobody tells you the small post-production tricks that make synthetic voices stop sounding like synthetic voices.
This walkthrough is the workflow I'd hand a teammate on day one. By the end you'll have a 10-20 minute, two-host episode that's listenable — not just "technically TTS."
What you'll get out of this:
- A repeatable script → MP3 pipeline (UI and code paths shown)
- Five voice-pairing rules that actually matter
- A working Python script you can fork
- Honest cost numbers, so you can decide whether to subscribe
Time: 30-45 minutes for a 10-minute episode once you've done it twice. Skill floor: comfortable copy-pasting Python or willing to click through a web UI.
1. Prepare the script
A clean script saves you 80% of your post-production work. Two formats work; pick one and stay consistent.
Format A — simple speaker tags
```text
Host A: Welcome back to AI Decoded. Today we're tackling
        the thing I've been getting asked about non-stop.
Host B: Let me guess — pricing on the new Qwen models?
Host A: Exactly. So let's start with what changed last week.
```
This is the easiest format to parse programmatically and the easiest to assign voices to in the ElevenLabs UI.
Format B — inline markup for nuance
ElevenLabs supports a small set of inline tags for emphasis, pauses, and emotion. The currently reliable ones, which are especially well supported by the v3 model:
- `<break time="0.5s" />` — explicit pause
- Bracketed cues like `[laughs]`, `[sigh]`, `[clears throat]` (v3 interprets these much more naturally than v2)
Check the official prompting guide for the exhaustive list — they expand the supported cues with each model release.
Scripting rules that change the listening experience
| Rule | Why it matters |
|---|---|
| 130–160 words per minute | Natural conversational pace; anything faster sounds like a news read |
| Add one [laughs], [sigh], or filler ("uh", "right?") every 2–3 minutes | Single most effective humanizer |
| End each speaker's turn on a "handoff" word (right?, exactly, so…) | Gives the next voice a natural in |
| Break monologues longer than 90 seconds | Otherwise it audibly collapses into single-narrator TTS |
| Read it out loud yourself first | If you can't say it naturally, the model won't either |
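The pacing rules above are easy to sanity-check before you spend credits. Here's a rough sketch; `check_pacing` and the 145 wpm midpoint are my own illustrative choices, not an ElevenLabs tool, and it only counts lines that carry a `Speaker:` prefix:

```python
WPM = 145              # midpoint of the 130-160 wpm target
MAX_TURN_SECONDS = 90  # monologues past this collapse into narrator TTS

def check_pacing(script_text):
    """Estimate runtime and flag turns that run past 90 seconds."""
    warnings, total_words = [], 0
    for n, line in enumerate(script_text.splitlines(), 1):
        if ":" not in line:
            continue  # continuation line or stage direction
        _, text = line.split(":", 1)
        words = len(text.split())
        total_words += words
        seconds = words / WPM * 60
        if seconds > MAX_TURN_SECONDS:
            warnings.append(f"line {n}: turn runs ~{seconds:.0f}s, break it up")
    return total_words / WPM, warnings

sample = "Host A: Welcome back to AI Decoded.\nHost B: Let me guess, pricing?"
minutes, warnings = check_pacing(sample)
print(f"~{minutes:.2f} min, {len(warnings)} long turns")
```

Run it on every draft; a script that passes the 90-second check almost never needs its monologues re-cut after generation.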
2. Pick two contrasting voices
The single most common first-attempt mistake: picking two voices that sound vaguely similar. Listeners then can't tell who's talking, and the whole "two-host" illusion collapses.
Use this contrast checklist when auditioning:
- Perceived gender contrast (M/F is easiest, but M/M and F/F work if the other axes are strong)
- Tempo contrast — one slightly faster, one slightly slower
- Pitch register contrast — one in the lower register, one higher
- "Warmth" contrast — conversational/warm vs. analytical/clipped
Five voice pairs that work
Names below are from the long-standing default ElevenLabs voice set — they've been stable for years, but always preview before committing.
| Pair name | Host A | Host B | Best for |
|---|---|---|---|
| Analyst × curious cohost | Adam | Rachel | tech, AI, finance shows |
| Storyteller × interviewer | Antoni | Bella | narrative and feature |
| Calm explainer × quick reactor | Drew | Domi | tutorials, how-to |
| British × American | Daniel | Sarah | global / general audience |
| Custom clone × library voice | Your clone | Any library voice | branded shows |
Audition voices in the ElevenLabs Voice Library, previewing 30 seconds of each before committing. If you're on Eleven v3, also try the Auto-assign voices feature in Studio, which detects the speakers in your script and proposes matching voices automatically.
3. Generate the audio — three methods
Method A — ElevenLabs Studio (no code)
Studio is ElevenLabs' long-form, multi-speaker editor. You'll find it in the left nav of the dashboard (third item, below Home and Voices).
- Open Studio from the left nav.
- Create a new project and paste your tagged script.
- Either assign a voice per paragraph manually, or enable Auto-assign voices (Alpha) — it detects characters from the script and proposes matching voices for each.
- Click Generate. Studio renders and merges the clips.
- Export as MP3 or WAV.
Pros: zero code; regenerate single sentences with one click; ideal for non-engineers. Cons: large projects can take several minutes to render. Heads-up: each subscription tier caps the number of concurrent Studio Projects you can keep (3 on Free, 20 on Starter, 1,000 on Creator) — archive or delete old ones if you bump the limit. Studio and the API share the same monthly credit pool, and unused credits roll over up to two months on paid plans.
Method B — Python script via the API
This is the path I prefer once you're past episode 2 — it scales, it's diffable, and you can re-render single turns programmatically.
```python
# pip install elevenlabs pydub
import os

from elevenlabs.client import ElevenLabs
from pydub import AudioSegment

# Reads ELEVENLABS_API_KEY from your environment
client = ElevenLabs()

VOICES = {
    "Host A": "VOICE_ID_FROM_LIBRARY",
    "Host B": "VOICE_ID_FROM_LIBRARY",
}

# Parse "Speaker: line" script into [(speaker, text), ...]
turns = []
with open("script.txt") as f:
    for line in f:
        if ":" in line:
            speaker, text = line.split(":", 1)
            speaker, text = speaker.strip(), text.strip()
            if speaker in VOICES and text:
                turns.append((speaker, text))

# Generate each turn
os.makedirs("out", exist_ok=True)
clips = []
for i, (speaker, text) in enumerate(turns):
    audio = client.text_to_speech.convert(
        text=text,
        voice_id=VOICES[speaker],
        model_id="eleven_v3",  # flagship since Feb 2026
        output_format="mp3_44100_128",
    )
    path = f"out/{i:03d}_{speaker.replace(' ', '_')}.mp3"
    with open(path, "wb") as out:
        for chunk in audio:
            out.write(chunk)
    clips.append(AudioSegment.from_mp3(path))

# Stitch with a brief gap between turns
gap = AudioSegment.silent(duration=300)  # 300 ms feels conversational
final = clips[0]
for clip in clips[1:]:
    final = final + gap + clip

final.export("podcast.mp3", format="mp3", bitrate="128k")
print(f"✅ Done. {len(turns)} turns rendered into podcast.mp3")
```
eleven_v3 went GA in February 2026 and is the current flagship. For real-time / low-latency use cases, swap model_id to eleven_flash_v2_5 or eleven_turbo_v2_5.
Pros: full control, batchable, diffable in git, easy to re-render single turns. Cons: SDK breaking changes happen; need a Python environment.
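Re-rendering a single turn after a script tweak can be a small helper on top of the batch script. This is a sketch reusing the voice map and file-naming scheme above; the naming helper is split out so the re-render overwrites the right clip in place:

```python
def clip_path(i, speaker):
    """Same naming scheme as the batch script, so a re-render overwrites in place."""
    return f"out/{i:03d}_{speaker.replace(' ', '_')}.mp3"

def rerender_turn(i, speaker, text, voices):
    """Regenerate one clip; re-run the stitch step afterwards."""
    from elevenlabs.client import ElevenLabs  # deferred: only needed for the API call
    client = ElevenLabs()  # reads ELEVENLABS_API_KEY from the environment
    audio = client.text_to_speech.convert(
        text=text,
        voice_id=voices[speaker],
        model_id="eleven_v3",
        output_format="mp3_44100_128",
    )
    path = clip_path(i, speaker)
    with open(path, "wb") as out:
        for chunk in audio:
            out.write(chunk)
    return path

# e.g. rerender_turn(7, "Host B", "Actually, scratch that. Here's the fix.", VOICES)
```

Because the clips are numbered files on disk, fixing one flubbed line costs a few hundred credits instead of a full episode re-render.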
Method C — third-party orchestrators
If you want the "paste a URL, get a podcast" experience and don't mind less control:
- Google NotebookLM — wraps a different engine but produces remarkable two-host output from any document
- Wondercraft, Podcastle, Resemble AI — wrap ElevenLabs (or compete with it) with built-in dialogue UIs
Use these if you don't want to write code and don't want to click line-by-line. Quality is good; customization is limited.
4. Post-production (this is what hides the "AI" tell)
Even great voices need cleanup. Skip this and listeners can tell within 30 seconds.
- Normalize loudness → target -16 LUFS (podcast standard). Audacity: Effect → Normalize → Loudness Normalization.
- Trim dead air → cut any silence longer than ~1.2 seconds. Descript does this in one click; Audacity has Effect → Truncate Silence.
- Add a light music bed → 8–12 second intro, then duck to roughly -25 dB under the voices. Free sources: YouTube Audio Library, Pixabay Music.
- Soft compression → ratio 2:1, threshold around -18 dB. This glues two different voices into one perceived "show sound."
- Light EQ → roll off everything below 80 Hz to remove room rumble carried in by the model.
If you only do one of these five, make it #1 (loudness). Inconsistent loudness is the #1 reason listeners drop off in the first minute.
5. What this actually costs
ElevenLabs bills in credits, drawn from a single monthly pool shared between Studio and the API. The credit-to-character ratio depends on which model you pick — the high-quality eleven_v3 consumes more credits per character than the faster eleven_turbo_v2_5. Always check the pricing page for the current multiplier before quoting a client.
Useful baseline math (using a 1:1 credit-to-character ratio for v2-tier models; multiply by your model's actual rate):
- 1 minute of natural-pace speech ≈ 900–1,000 characters of source text
- A 20-minute episode ≈ ~19,000 characters of source text
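That arithmetic fits in a small helper. The constants are illustrative, and `CREDIT_MULTIPLIER` is a placeholder you'd replace with your model's actual rate from the pricing page:

```python
CHARS_PER_MINUTE = 950   # midpoint of the 900-1,000 estimate above
CREDIT_MULTIPLIER = 1.0  # v2-tier baseline; v3 consumes more per character

def credits_for_episode(minutes, multiplier=CREDIT_MULTIPLIER):
    """Estimated credits to render one episode of the given length."""
    return int(minutes * CHARS_PER_MINUTE * multiplier)

def episodes_per_month(monthly_credits, minutes, multiplier=CREDIT_MULTIPLIER):
    """How many full episodes a tier's monthly credit pool covers."""
    return monthly_credits // credits_for_episode(minutes, multiplier)

print(credits_for_episode(20))          # 19000 credits per 20-min episode
print(episodes_per_month(121_000, 20))  # Creator tier -> 6 episodes
```

At the v2-tier rate, Creator's 121,000 credits cover six 20-minute episodes a month, which is why it's the sweet spot for a weekly show.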
Current tiers (pulled from ElevenLabs' pricing page, May 2026):
| Tier | Monthly price | Monthly credits | Studio Projects | Best for |
|---|---|---|---|---|
| Free | $0 | 10,000 | 3 | Trying it out (≈ 10 min v2 audio) |
| Starter | $6 | 30,000 | 20 | Hobby creator; commercial license, instant voice cloning, Dubbing studio |
| Creator | $22 (first month $11) | 121,000 | 1,000 | Sweet spot for a weekly 20-min show; professional voice cloning, 192 kbps audio, pay-as-you-go overage |
| Pro | $99 | 600,000 | 3,000 | Daily show or multiple weekly shows; 44.1 kHz PCM via API |
| Scale | $299 | 1.8 M | 9,000 | Small team (3 workspace seats), team collaboration |
| Business | $990 | 6 M | — | Larger team (10 seats), low-latency TTS option |
| Enterprise | Custom | Custom | — | Custom terms, priority support |
Practical takeaway: for a weekly 20-minute show generated with v3, Creator covers it comfortably with headroom. Don't grind on Free — overage rates are punitive and Starter pays for itself with the commercial license alone.
Paid-tier credits roll over up to two months, so an occasional heavy week doesn't immediately bump you up a tier.
6. When ElevenLabs isn't the right answer
ElevenLabs is currently the strongest option for English and major European-language emotional TTS. It's not always the answer.
- Chinese-heavy show → Xunfei (iFlytek) or MiniMax handle Mandarin prosody better. See our deep-dive: /zh-CN/vs/elevenlabs-vs-xunfei-tts
- Self-host on a budget → Coqui XTTS gives you 80% of the quality at $0/character, if you can run a GPU
- Heavy voice cloning needs → Resemble AI has stronger few-shot cloning controls
- Compared to OpenAI's voice API → see /vs/elevenlabs-openai-tts for the side-by-side
Recap — the 6-step workflow
1. Write the script in Host A: ... / Host B: ... format
2. Pick two voices with strong contrast on at least two axes
3. Generate via Studio (UI) or the Python script above
4. Run post-production: loudness, silence trim, music bed, compression
5. Budget Creator tier ($22/mo, 121K credits) — comfortably covers a weekly 20-minute show on eleven_v3
6. Compare alternatives if you're in a non-English or self-host scenario
Two-voice AI podcasts have crossed the "good enough that listeners don't notice" line in the last 18 months. The remaining work is no longer in the model — it's in the script structure, the voice pairing, and the 15 minutes of post-production most people skip.