---
title: Multi-speaker dialogue — Fish Audio S2.1 Pro | Runware Docs
url: https://runware.ai/docs/models/fish-audio-s2-1-pro/guides/multi-speaker-dialogue
description: How to generate two-speaker audio in a single request using S2-Pro's inline speaker tags. Covers speaker tag syntax, voice mapping, emotion control per speaker, and practical dialogue patterns.
---
### [Introduction](https://runware.ai/docs/models/fish-audio-s2-1-pro/guides/multi-speaker-dialogue#introduction)

Most text-to-speech models produce one voice per request, so a two-person conversation requires a separate API call for each speaker and stitching the audio together downstream. S2-Pro handles this differently: you write both sides of the dialogue in a single text input, tag each turn with a speaker index, and the model renders **both voices in one audio file** with natural turn-taking.

[Listen to audio](https://runware.ai/docs/assets/hero.Do_4Kexo.mp3)

> **Prompt**: <|speaker:0|>So we shipped the redesign last Thursday. <|speaker:1|>[excited] And the conversion rate went up twelve percent overnight. <|speaker:0|>Twelve percent. On a Thursday launch. <|speaker:1|>[laughs] Nobody launches on a Thursday.

That sample puts two voices and two emotion tags across four speaker turns, all generated in a single request. This guide covers the speaker tag syntax, voice mapping, per-speaker emotion control, and practical patterns for common dialogue scenarios.

### [Speaker tags](https://runware.ai/docs/models/fish-audio-s2-1-pro/guides/multi-speaker-dialogue#speaker-tags)

Multi-speaker dialogue uses **inline speaker tags** to mark where each voice begins. The tags follow the format `<|speaker:N|>`, where `N` is a zero-based index that maps to the `speech.voices` array:

```text
<|speaker:0|>First speaker's line.
<|speaker:1|>Second speaker's line.
<|speaker:0|>First speaker again.
```

Each tag switches the voice for all text that follows until the next speaker tag. S2-Pro supports **two speakers per request** (indices `0` and `1`). The `speech.voices` array assigns a voice model ID to each index.
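
If your dialogue lives in structured data rather than a hand-written string, a small helper can assemble the tagged text. The following is a minimal sketch: `build_dialogue` and `MAX_SPEAKERS` are hypothetical names, not part of any Runware SDK, and the index check only mirrors the two-speaker limit described above.

```python
from typing import Iterable, Tuple

MAX_SPEAKERS = 2  # assumption drawn from this guide: indices 0 and 1 only


def build_dialogue(turns: Iterable[Tuple[int, str]]) -> str:
    """Join (speaker_index, line) pairs into one tagged string."""
    parts = []
    for speaker, line in turns:
        if speaker not in range(MAX_SPEAKERS):
            raise ValueError(f"speaker index must be 0 or 1, got {speaker}")
        # Each tag applies to all text that follows, up to the next tag.
        parts.append(f"<|speaker:{speaker}|>{line.strip()}")
    return " ".join(parts)


text = build_dialogue([
    (0, "First speaker's line."),
    (1, "Second speaker's line."),
    (0, "First speaker again."),
])
print(text)
# <|speaker:0|>First speaker's line. <|speaker:1|>Second speaker's line. <|speaker:0|>First speaker again.
```

The resulting string is what goes into `speech.text`.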

> [!NOTE]
> Speaker tags and [emotion tags](https://runware.ai/docs/models/fish-audio-s2-1-pro/guides/emotion-and-expression) work independently. `<|speaker:0|>` sets who is talking. `[excited]` sets how they sound. You can use both in the same line.

### [Voice mapping](https://runware.ai/docs/models/fish-audio-s2-1-pro/guides/multi-speaker-dialogue#voice-mapping)

The `speech.voices` array **maps speaker indices to voice model IDs**. Index `0` in the array corresponds to `<|speaker:0|>` in the text, index `1` to `<|speaker:1|>`:

**Request**:

```json
[
  {
    "taskType": "audioInference",
    "taskUUID": "b2c3d4e5-f6a7-8901-bcde-f12345678901",
    "model": "fishaudio:s2.1@pro",
    "speech": {
      "text": "<|speaker:0|>The deployment went through without any issues. <|speaker:1|>[excited] Finally! That pipeline has been flaky for weeks.",
      "voices": [
        "536d3a5e000945adb7038665781a4aca",
        "933563129e564b19a115bedd57b7406a"
      ]
    }
  }
]
```

**Response**:

```json
{
  "data": [
    {
      "taskType": "audioInference",
      "taskUUID": "b2c3d4e5-f6a7-8901-bcde-f12345678901",
      "audioUUID": "a2b3c4d5-e6f7-8901-2345-678901abcdef",
      "audioURL": "https://am.runware.ai/audio/os/a14d18/ws/2/ai/a2b3c4d5-e6f7-8901-2345-678901abcdef.mp3"
    }
  ]
}
```

[Listen to audio](https://runware.ai/docs/assets/request-demo.BAVCIBOT.mp3)

> **Prompt**: <|speaker:0|>The deployment went through without any issues. <|speaker:1|>[excited] Finally! That pipeline has been flaky for weeks.

> [!WARNING]
> `speech.voices` and `speech.voice` are mutually exclusive. Use `speech.voice` (singular) for single-speaker requests and `speech.voices` (array) for multi-speaker dialogue. Sending both causes a validation error.
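
To catch the `voice`/`voices` conflict before a request leaves your code, you can validate the payload locally. This is a sketch under stated assumptions: `build_audio_task` is a hypothetical helper, its checks are client-side conveniences rather than the API's own validation rules, and it only builds the JSON body without sending it.

```python
import json
import re
import uuid
from typing import Optional, Sequence

SPEAKER_TAG = re.compile(r"<\|speaker:(\d+)\|>")


def build_audio_task(text: str,
                     voice: Optional[str] = None,
                     voices: Optional[Sequence[str]] = None) -> dict:
    """Build one audioInference task with either `voice` or `voices`."""
    # The two fields are mutually exclusive, so require exactly one.
    if (voice is None) == (voices is None):
        raise ValueError("pass exactly one of `voice` or `voices`")

    speech = {"text": text}
    if voice is not None:
        speech["voice"] = voice
    else:
        if len(voices) > 2:
            raise ValueError("this guide describes two speakers per request")
        # Every speaker index used in the text needs an entry in the array.
        used = {int(n) for n in SPEAKER_TAG.findall(text)}
        unmapped = sorted(n for n in used if n >= len(voices))
        if unmapped:
            raise ValueError(f"no voice mapped for speaker index {unmapped}")
        speech["voices"] = list(voices)

    return {
        "taskType": "audioInference",
        "taskUUID": str(uuid.uuid4()),
        "model": "fishaudio:s2.1@pro",
        "speech": speech,
    }


task = build_audio_task(
    "<|speaker:0|>The deployment went through without any issues. "
    "<|speaker:1|>[excited] Finally! That pipeline has been flaky for weeks.",
    voices=[
        "536d3a5e000945adb7038665781a4aca",
        "933563129e564b19a115bedd57b7406a",
    ],
)
print(json.dumps([task], indent=2))  # request body: an array of tasks
```

Passing both `voice` and `voices`, or neither, raises a `ValueError` locally instead of waiting for the validation error from the API.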

### [Single voice vs. two voices](https://runware.ai/docs/models/fish-audio-s2-1-pro/guides/multi-speaker-dialogue#single-voice-vs-two-voices)

Dialogue text without speaker tags gets read by one voice. The same text with speaker tags and a `voices` array produces distinct speakers:

[Listen to audio](https://runware.ai/docs/assets/compare-single.BkoHiGv_.mp3)

*Single voice: all lines read by one speaker*

> **Prompt**: Have you tested the new endpoint? Yes, it returns data in under fifty milliseconds. That's faster than what we had before.

[Listen to audio](https://runware.ai/docs/assets/compare-multi.CdTtiKYr.mp3)

*Two voices: each turn assigned to a different speaker*

> **Prompt**: <|speaker:0|>Have you tested the new endpoint? <|speaker:1|>Yes, it returns data in under fifty milliseconds. <|speaker:0|>That is faster than what we had before.

The multi-speaker version **distinguishes who is saying what**. The model handles **turn-taking pacing automatically**, with natural pauses between speakers.

### [Emotion tags per speaker](https://runware.ai/docs/models/fish-audio-s2-1-pro/guides/multi-speaker-dialogue#emotion-tags-per-speaker)

Bracket emotion tags apply to the **current speaker** only. Each speaker can carry different emotional delivery in the same passage. The full catalog of tags, free-form expressions, and paralanguage cues is covered in the [Emotion and expression](https://runware.ai/docs/models/fish-audio-s2-1-pro/guides/emotion-and-expression) guide:

```text
<|speaker:0|>[excited] Guess what? We got accepted into the accelerator program!
<|speaker:1|>[surprised] Wait, are you serious? That is incredible news.
```

[Listen to audio](https://runware.ai/docs/assets/emotion-per-speaker.Bvux60JL.mp3)

> **Prompt**: <|speaker:0|>[excited] Guess what? We got accepted into the accelerator program! <|speaker:1|>[surprised] Wait, are you serious? That is incredible news.

The first speaker carries excitement. The second carries surprise. The tags **don't bleed across speakers**. Each voice **performs its own direction independently**.
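
When reviewing a long script, it can help to list which emotion tags sit inside which turn. The sketch below splits the text on speaker tags and collects the bracket tags in each turn. `emotions_by_turn` is a hypothetical helper that reflects the scoping rule described here; it is not the model's actual parser.

```python
import re
from typing import List, Tuple

SPEAKER_TAG = re.compile(r"<\|speaker:(\d+)\|>")
BRACKET_TAG = re.compile(r"\[([^\]]+)\]")


def emotions_by_turn(text: str) -> List[Tuple[int, List[str]]]:
    """Return (speaker_index, [bracket tags]) for each turn, in order."""
    # Splitting on a pattern with one capture group yields
    # [prefix, index, body, index, body, ...].
    pieces = SPEAKER_TAG.split(text)
    turns = []
    for i in range(1, len(pieces), 2):
        speaker = int(pieces[i])
        body = pieces[i + 1]
        turns.append((speaker, BRACKET_TAG.findall(body)))
    return turns


script = (
    "<|speaker:0|>[excited] Guess what? We got accepted into the accelerator program!\n"
    "<|speaker:1|>[surprised] Wait, are you serious? That is incredible news."
)
print(emotions_by_turn(script))
# [(0, ['excited']), (1, ['surprised'])]
```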

### [Dialogue patterns](https://runware.ai/docs/models/fish-audio-s2-1-pro/guides/multi-speaker-dialogue#dialogue-patterns)

#### [Technical discussion](https://runware.ai/docs/models/fish-audio-s2-1-pro/guides/multi-speaker-dialogue#technical-discussion)

Clean back-and-forth between two colleagues. No emotion tags needed when the content speaks for itself:

[Listen to audio](https://runware.ai/docs/assets/basic-turns.CEsf4_vy.mp3)

> **Prompt**: <|speaker:0|>Did you get a chance to review the pull request? <|speaker:1|>I did. The caching layer looks solid, but I have questions about the invalidation logic. <|speaker:0|>Fair enough. Want to walk through it after standup?

#### [Podcast conversation](https://runware.ai/docs/models/fish-audio-s2-1-pro/guides/multi-speaker-dialogue#podcast-conversation)

Longer turns with natural topic development. Good for content where each speaker contributes multiple sentences:

[Listen to audio](https://runware.ai/docs/assets/scenario-podcast.BoMcOzAF.mp3)

> **Prompt**: <|speaker:0|>Welcome back to the show. Today we are talking about the state of open-source AI. <|speaker:1|>It has been a wild year. Three months ago, nobody expected the licensing landscape to shift this fast. <|speaker:0|>Right. And the tooling has caught up in ways that actually matter for production.

#### [Interview](https://runware.ai/docs/models/fish-audio-s2-1-pro/guides/multi-speaker-dialogue#interview)

One speaker asks, the other answers at length. The asymmetry in turn length works well because S2-Pro **adjusts pacing per speaker**:

[Listen to audio](https://runware.ai/docs/assets/scenario-interview.nl1JTScU.mp3)

> **Prompt**: <|speaker:0|>Can you tell me about a time you had to make a difficult technical decision under pressure? <|speaker:1|>Sure. Last quarter we had a production outage that lasted six hours. I had to decide between rolling back to a known-good state or pushing a hotfix forward. I chose the rollback. It cost us a deploy cycle, but it was the safer call.

#### [Narrator and character](https://runware.ai/docs/models/fish-audio-s2-1-pro/guides/multi-speaker-dialogue#narrator-and-character)

One voice narrates in third person. The other speaks in first person as the character. Emotion tags on the character voice **add texture without affecting the narrator**:

[Listen to audio](https://runware.ai/docs/assets/scenario-story.mC-CYCCW.mp3)

> **Prompt**: <|speaker:0|>The lab was dark except for a single monitor. Dr. Chen stared at the results. <|speaker:1|>[whispering] That cannot be right. Run it again. <|speaker:0|>She pressed enter. The second run confirmed what the first had shown.

### [Formatting multi-speaker text](https://runware.ai/docs/models/fish-audio-s2-1-pro/guides/multi-speaker-dialogue#formatting-multi-speaker-text)

The model **doesn't require line breaks** between speaker turns. The `<|speaker:N|>` tags are the only markers it needs. Both of the following formats produce the same output:

```text
<|speaker:0|>Line one. <|speaker:1|>Line two.
```

```text
<|speaker:0|>Line one.
<|speaker:1|>Line two.
```

Use whichever format makes your source text easier to read. In JSON payloads, **everything goes in a single `speech.text` string**, so each line break is written as a `\n` escape sequence.
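
As an illustration, here is how a multi-line script written in Python ends up in the JSON body. The script and variable names are only examples; `json.dumps` from the standard library handles the escaping.

```python
import json

# One turn per line in the source; the trailing backslash after the opening
# quotes keeps the string from starting with an empty line.
script = """\
<|speaker:0|>Line one.
<|speaker:1|>Line two."""

body = json.dumps({"speech": {"text": script}})
print(body)
# {"speech": {"text": "<|speaker:0|>Line one.\n<|speaker:1|>Line two."}}

# Decoding the JSON restores the original line break.
assert json.loads(body)["speech"]["text"] == script
```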

### [Tips](https://runware.ai/docs/models/fish-audio-s2-1-pro/guides/multi-speaker-dialogue#tips)

1. **Pick voices that contrast.** Two voices with similar pitch and cadence sound like one person talking to themselves. Pair a lower-register narrator with a higher-register speaker for the clearest separation.
    
2. **Keep turns long enough to establish the voice.** A two-word turn doesn't give the model enough context to differentiate the speaker. Aim for at least one full sentence per turn.
    
3. **Use emotion tags on the speaker who needs them.** You don't have to tag every turn. If one speaker is neutral and the other is emotional, tag only the emotional speaker. The contrast makes the emotion more noticeable.
    
4. **Place speaker tags at natural dialogue boundaries.** A speaker tag in the middle of a sentence produces an unnatural voice switch. Tag at the start of a new sentence or thought, not mid-clause.
    
5. **Start every text with a speaker tag.** The model defaults to speaker 0 if no tag is present at the start. Be explicit by opening with `<|speaker:0|>` to avoid ambiguity.
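
Some of these tips can be checked mechanically before you send a request. The sketch below covers tips 2, 4, and 5 with rough heuristics. `lint_dialogue` is a hypothetical helper, and the three-word minimum and the punctuation test are assumptions chosen for illustration, not thresholds documented for the model.

```python
import re
from typing import List

SPEAKER_TAG = re.compile(r"<\|speaker:(\d+)\|>")
BRACKET_TAG = re.compile(r"\[[^\]]*\]")


def lint_dialogue(text: str, min_words: int = 3) -> List[str]:
    """Return a list of warnings for a tagged dialogue string."""
    warnings = []
    stripped = text.strip()

    # Tip 5: open with an explicit speaker tag.
    if not SPEAKER_TAG.match(stripped):
        warnings.append("text does not start with a speaker tag")

    # split() yields [prefix, index, body, index, body, ...]; keep the bodies.
    bodies = SPEAKER_TAG.split(stripped)[2::2]
    for number, body in enumerate(bodies, start=1):
        spoken = BRACKET_TAG.sub("", body).strip()  # ignore emotion tags

        # Tip 2: very short turns give little context (assumed threshold).
        if len(spoken.split()) < min_words:
            warnings.append(f"turn {number} is shorter than {min_words} words")

        # Tip 4: a turn followed by another tag should end a sentence.
        is_last = number == len(bodies)
        if not is_last and spoken and spoken[-1] not in ".!?":
            warnings.append(f"turn {number} is cut off mid-sentence by the next tag")
    return warnings


print(lint_dialogue("<|speaker:0|>I think that we <|speaker:1|>should ship it today."))
# ['turn 1 is cut off mid-sentence by the next tag']
```

An empty list means none of the three heuristics fired. It does not guarantee a natural-sounding result, so listen to the output as well.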