How to Make a Two-Voice AI Podcast With ElevenLabs (Full Workflow)

A step-by-step workflow to turn any script into a two-host AI podcast using ElevenLabs. Voice pairing, dialogue tagging, multi-speaker generation, post-production, and real cost estimates.

May 16, 2026 · 9 min read

If you've ever pasted a script into ElevenLabs hoping to get a two-host podcast, you've probably hit the same wall: the UI is set up for a single narrator, the dialogue-tagging conventions are scattered across blog posts, and nobody tells you the small post-production tricks that make synthetic voices stop sounding like synthetic voices.

This walkthrough is the workflow I'd hand a teammate on day one. By the end you'll have a 10-20 minute, two-host episode that's listenable — not just "technically TTS."

What you'll get out of this:

  • A repeatable script → MP3 pipeline (UI and code paths shown)
  • Five voice-pairing rules that actually matter
  • A working Python script you can fork
  • Honest cost numbers, so you can decide whether to subscribe

Time: 30-45 minutes for a 10-minute episode once you've done it twice. Skill floor: comfortable copy-pasting Python or willing to click through a web UI.


1. Prepare the script

A clean script saves you 80% of your post-production work. Two formats work; pick one and stay consistent.

Format A — simple speaker tags

Host A: Welcome back to AI Decoded. Today we're tackling
        the thing I've been getting asked about non-stop.
Host B: Let me guess — pricing on the new Qwen models?
Host A: Exactly. So let's start with what changed last week.

This is the easiest format to parse programmatically and the easiest to assign voices to in the ElevenLabs UI.

Format B — inline markup for nuance

ElevenLabs supports a small set of inline tags for emphasis, pauses, and emotion. The ones that are currently reliable, and especially well supported by the v3 model:

  • <break time="0.5s" /> — explicit pause
  • Bracketed cues like [laughs], [sigh], [clears throat] (v3 interprets these much more naturally than v2)
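For instance, a tagged exchange might look like this (an illustrative script only; preview any tag against your chosen model before committing):

Host A: So the pricing changed last week. <break time="0.5s" /> And honestly,
        [laughs] nobody saw it coming.
Host B: [sighs] Okay, walk me through it.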

Check the official prompting guide for the exhaustive list — they expand the supported cues with each model release.

Scripting rules that change the listening experience

| Rule | Why it matters |
|---|---|
| 130–160 words per minute | Natural conversational pace; anything faster sounds like a news read |
| Add one [laughs], [sigh], or filler ("uh", "right?") every 2–3 minutes | Single most effective humanizer |
| End each speaker's turn on a "handoff" word (right?, exactly, so…) | Gives the next voice a natural in |
| Break monologues longer than 90 seconds | Otherwise it audibly collapses into single-narrator TTS |
| Read it out loud yourself first | If you can't say it naturally, the model won't either |
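The first rule is easy to check mechanically before you spend any credits. A quick sketch (assumes the Host A: / Host B: script format above; 145 wpm is the midpoint of the range):

# Estimate runtime from the script's word count (145 wpm assumption)
with open("script.txt") as f:
    words = sum(
        len(line.split(":", 1)[1].split())
        for line in f if ":" in line
    )
print(f"≈ {words / 145:.1f} minutes at conversational pace")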

2. Pick two contrasting voices

The single most common mistake on a first attempt: picking two voices that sound vaguely similar. Listeners can't tell who's talking, and the whole "two-host" illusion collapses.

Use this contrast checklist when auditioning:

  • Perceived gender contrast (M/F is easiest, but M/M and F/F work if the other axes are strong)
  • Tempo contrast — one slightly faster, one slightly slower
  • Pitch register contrast — one in the lower register, one higher
  • "Warmth" contrast — conversational/warm vs. analytical/clipped

Five voice pairs that work

Names below are from the long-standing default ElevenLabs voice set — they've been stable for years, but always preview before committing.

| Pair name | Host A | Host B | Best for |
|---|---|---|---|
| Analyst × curious cohost | Adam | Rachel | Tech, AI, finance shows |
| Storyteller × interviewer | Antoni | Bella | Narrative and feature |
| Calm explainer × quick reactor | Drew | Domi | Tutorials, how-to |
| British × American | Daniel | Sarah | Global / general audience |
| Custom clone × library voice | Your clone | Any library voice | Branded shows |

You audition voices in the ElevenLabs Voice Library — preview 30 seconds of each before committing. If you're on Eleven v3, also try the Auto-assign voices feature in Studio, which detects speakers in your script and proposes matching voices automatically.
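If you're heading for the API path in the next section, you can also pull the voice IDs for your shortlist programmatically. A minimal sketch using the official Python SDK (assumes ELEVENLABS_API_KEY is set in your environment):

# pip install elevenlabs
from elevenlabs.client import ElevenLabs

client = ElevenLabs()  # reads ELEVENLABS_API_KEY from the environment

# Print name → voice_id for every voice visible to your account
for v in client.voices.get_all().voices:
    print(f"{v.name:20s} {v.voice_id}")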


3. Generate the audio — three methods

Method A — ElevenLabs Studio (no code)

Studio is ElevenLabs' long-form, multi-speaker editor. You'll find it in the left nav of the dashboard (third item, below Home and Voices).

  1. Open Studio from the left nav.
  2. Create a new project and paste your tagged script.
  3. Either assign a voice per paragraph manually, or enable Auto-assign voices (Alpha) — it detects characters from the script and proposes matching voices for each.
  4. Click Generate. Studio renders and merges the clips.
  5. Export as MP3 or WAV.

Pros: zero code; regenerate single sentences with one click; ideal for non-engineers. Cons: large projects can take several minutes to render.

Heads-up: each subscription tier caps the number of concurrent Studio Projects you can keep (3 on Free, 20 on Starter, 1,000 on Creator), so archive or delete old ones if you bump into the limit. Studio and the API share the same monthly credit pool, and unused credits roll over for up to two months on paid plans.

Method B — Python script via the API

This is the path I prefer once you're past episode 2 — it scales, it's diffable, and you can re-render single turns programmatically.

# pip install elevenlabs pydub
import os
from elevenlabs.client import ElevenLabs
from pydub import AudioSegment

# Reads ELEVENLABS_API_KEY from your environment
client = ElevenLabs()

VOICES = {
    "Host A": "VOICE_ID_FROM_LIBRARY",
    "Host B": "VOICE_ID_FROM_LIBRARY",
}

# Parse "Speaker: line" script into [(speaker, text), ...]
turns = []
with open("script.txt") as f:
    for line in f:
        if ":" in line:
            speaker, text = line.split(":", 1)
            speaker, text = speaker.strip(), text.strip()
            if speaker in VOICES and text:
                turns.append((speaker, text))

# Generate each turn
os.makedirs("out", exist_ok=True)
clips = []
for i, (speaker, text) in enumerate(turns):
    audio = client.text_to_speech.convert(
        text=text,
        voice_id=VOICES[speaker],
        model_id="eleven_v3",                  # flagship since Feb 2026
        output_format="mp3_44100_128",
    )
    path = f"out/{i:03d}_{speaker.replace(' ', '_')}.mp3"
    with open(path, "wb") as out:
        for chunk in audio:
            out.write(chunk)
    clips.append(AudioSegment.from_mp3(path))

# Stitch with a brief gap between turns
gap = AudioSegment.silent(duration=300)  # 300ms feels conversational
final = clips[0]
for clip in clips[1:]:
    final = final + gap + clip

final.export("podcast.mp3", format="mp3", bitrate="128k")
print(f"✅ Done. {len(turns)} turns rendered into podcast.mp3")

eleven_v3 went GA in February 2026 and is the current flagship. For real-time / low-latency use cases, swap model_id to eleven_flash_v2_5 or eleven_turbo_v2_5.

Pros: full control, batchable, diffable in git, easy to re-render single turns. Cons: SDK breaking changes happen; need a Python environment.
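That last pro is the big one in practice: if you fix a typo in one line of the script, you only pay to regenerate that turn. A sketch reusing the names from the script above (client, VOICES, and turns):

# Re-render just turn 7 after editing its line in script.txt
i = 7
speaker, text = turns[i]
audio = client.text_to_speech.convert(
    text=text,
    voice_id=VOICES[speaker],
    model_id="eleven_v3",
    output_format="mp3_44100_128",
)
with open(f"out/{i:03d}_{speaker.replace(' ', '_')}.mp3", "wb") as out:
    for chunk in audio:
        out.write(chunk)
# ...then re-run only the stitching step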

Method C — third-party orchestrators

If you want the "paste a URL, get a podcast" experience and don't mind less control:

  • Google NotebookLM — wraps a different engine but produces remarkable two-host output from any document
  • Wondercraft, Podcastle, Resemble AI — wrap ElevenLabs (or compete with it) with built-in dialogue UIs

Use these if you don't want to write code and don't want to click line-by-line. Quality is good; customization is limited.


4. Post-production (this is what hides the "AI" tell)

Even great voices need cleanup. Skip this and listeners can tell within 30 seconds.

  1. Normalize loudness → target -16 LUFS (podcast standard). Audacity: Effect → Normalize → Loudness Normalization.
  2. Trim dead air → cut any silence longer than ~1.2 seconds. Descript does this in one click; Audacity has Effect → Truncate Silence.
  3. Add a light music bed → 8-12 second intro, then duck to roughly -25 dB under the voices. Free sources: YouTube Audio Library, Pixabay Music.
  4. Soft compression → ratio 2:1, threshold around -18 dB. This glues two different voices into one perceived "show sound."
  5. Light EQ → roll off everything below 80 Hz to remove room rumble carried in by the model.

If you only do one of these five, make it #1 (loudness). Inconsistent loudness is the #1 reason listeners drop off in the first minute.
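Steps 1 and 2 can stay in the same Python pipeline if you prefer. A rough sketch with pydub (note: apply_gain targets dBFS, which only approximates LUFS; use pyloudnorm or Audacity for true -16 LUFS):

# pip install pydub
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_mp3("podcast.mp3")

# 1. Rough loudness normalization toward -16 dBFS (a proxy for -16 LUFS)
audio = audio.apply_gain(-16.0 - audio.dBFS)

# 2. Trim dead air: split on silences longer than ~1.2s, keep 300ms padding
chunks = split_on_silence(
    audio,
    min_silence_len=1200,            # ms of silence that counts as "dead air"
    silence_thresh=audio.dBFS - 30,  # threshold relative to average loudness
    keep_silence=300,                # leave a natural 300ms gap
)
audio = sum(chunks[1:], chunks[0])

audio.export("podcast_clean.mp3", format="mp3", bitrate="128k")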


5. What this actually costs

ElevenLabs bills in credits, drawn from a single monthly pool shared between Studio and the API. The credit-to-character ratio depends on which model you pick — the high-quality eleven_v3 consumes more credits per character than the faster eleven_turbo_v2_5. Always check the pricing page for the current multiplier before quoting a client.

Useful baseline math (using a 1:1 credit-to-character ratio for v2-tier models; multiply by your model's actual rate):

  • 1 minute of natural-pace speech ≈ 900–1,000 characters of source text
  • A 20-minute episode ≈ 19,000 characters of source text
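Put together, the napkin math for a tier decision looks like this (a sketch with illustrative assumptions; substitute the real v3 multiplier from the pricing page):

# Back-of-envelope credit budget (illustrative assumptions only)
CHARS_PER_MIN = 950       # midpoint of 900–1,000
MODEL_MULTIPLIER = 1.0    # v2-tier assumption; check the pricing page for v3

episode_minutes = 20
episodes_per_month = 4

credits = episode_minutes * CHARS_PER_MIN * MODEL_MULTIPLIER * episodes_per_month
print(f"≈ {credits:,.0f} credits/month")  # ≈ 76,000 → inside Creator's 121,000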

Current tiers (pulled from ElevenLabs' pricing page, May 2026):

| Tier | Monthly price | Monthly credits | Studio Projects | Best for |
|---|---|---|---|---|
| Free | $0 | 10,000 | 3 | Trying it out (≈ 10 min v2 audio) |
| Starter | $6 | 30,000 | 20 | Hobby creator; commercial license, instant voice cloning, Dubbing studio |
| Creator | $22 (first month $11) | 121,000 | 1,000 | Sweet spot for a weekly 20-min show; professional voice cloning, 192 kbps audio, pay-as-you-go overage |
| Pro | $99 | 600,000 | 3,000 | Daily show or multiple weekly shows; 44.1 kHz PCM via API |
| Scale | $299 | 1.8M | 9,000 | Small team (3 workspace seats), team collaboration |
| Business | $990 | 6M | — | Larger team (10 seats), low-latency TTS option |
| Enterprise | Custom | Custom | — | Custom terms, priority support |

Practical takeaway: for a weekly 20-minute show generated with v3, Creator covers it comfortably with headroom. Don't grind on Free — overage rates are punitive and Starter pays for itself with the commercial license alone.

Paid-tier credits roll over up to two months, so an occasional heavy week doesn't immediately bump you up a tier.


6. When ElevenLabs isn't the right answer

ElevenLabs is currently the strongest option for English and major European-language emotional TTS. It's not always the answer.

  • Chinese-heavy show → Xunfei (iFlytek) or MiniMax handle Mandarin prosody better. See our deep-dive: /zh-CN/vs/elevenlabs-vs-xunfei-tts
  • Self-host on a budget → Coqui XTTS gives you 80% of the quality at $0/character, if you can run a GPU
  • Heavy voice cloning needs → Resemble AI has stronger few-shot cloning controls
  • Compared to OpenAI's voice API → see /vs/elevenlabs-openai-tts for the side-by-side

Recap — the 6-step workflow

  1. Write the script in Host A: ... / Host B: ... format
  2. Pick two voices with strong contrast on at least two axes
  3. Generate via Studio (UI) or the Python script above
  4. Run post-production: loudness, silence trim, music bed, compression
  5. Budget Creator tier ($22/mo, 121K credits) — comfortably covers a weekly 20-minute show on eleven_v3
  6. Compare alternatives if you're in a non-English or self-host scenario

Two-voice AI podcasts have crossed the "good enough that listeners don't notice" line in the last 18 months. The remaining work is no longer in the model — it's in the script structure, the voice pairing, and the 15 minutes of post-production most people skip.

