Model ID: fishaudio:s2.1@pro
coming-soon

Fish Audio S2.1 Pro

by Fish Audio

Fish Audio S2.1 Pro is a flagship text-to-speech model built for highly expressive, low-latency speech generation. It supports natural-language bracket cues for emotion and delivery control, multi-speaker dialogue in a single generation, 80+ languages with automatic language detection, and realtime streaming with very fast time to first audio.

Multi-speaker dialogue

How to generate two-speaker audio in a single request using S2-Pro's inline speaker tags. Covers speaker tag syntax, voice mapping, emotion control per speaker, and practical dialogue patterns.

Introduction

Most text-to-speech models produce one voice per request, so a two-person conversation requires a separate API call for each speaker and stitching the audio together downstream. S2-Pro handles this differently: you write both sides of the dialogue in a single text input, tag each turn with a speaker index, and the model renders both voices in one audio file with natural turn-taking.

Audio sample:

<|speaker:0|>So we shipped the redesign last Thursday. <|speaker:1|>[excited] And the conversion rate went up twelve percent overnight. <|speaker:0|>Twelve percent. On a Thursday launch. <|speaker:1|>[laughs] Nobody launches on a Thursday.

That sample spans four speaker turns with two voices and two bracket cues ([excited] and [laughs]), all generated in a single request. This guide covers the speaker tag syntax, voice mapping, per-speaker emotion control, and practical patterns for common dialogue scenarios.

Speaker tags

Multi-speaker dialogue uses inline speaker tags to mark where each voice begins. The tags follow the format <|speaker:N|>, where N is a zero-based index that maps to the speech.voices array:

<|speaker:0|>First speaker's line.
<|speaker:1|>Second speaker's line.
<|speaker:0|>First speaker again.

Each tag switches the voice for all text that follows until the next speaker tag. S2-Pro supports two speakers per request (indices 0 and 1). The speech.voices array assigns a voice model ID to each index.

Speaker tags and emotion tags work independently. <|speaker:0|> sets who is talking. [excited] sets how they sound. You can use both in the same line.
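
The tag syntax above is simple enough to assemble programmatically. The following is a minimal sketch, assuming turns are kept as (speaker index, text) pairs; the helper name build_dialogue is hypothetical and not part of any SDK. It only produces the string format described in this section.

```python
from typing import Iterable, Tuple

MAX_SPEAKERS = 2  # per the text above: only indices 0 and 1 are supported


def build_dialogue(turns: Iterable[Tuple[int, str]]) -> str:
    """Join (speaker_index, text) pairs into one tagged string (hypothetical helper)."""
    parts = []
    for speaker, text in turns:
        if not 0 <= speaker < MAX_SPEAKERS:
            raise ValueError(f"speaker index must be 0 or 1, got {speaker}")
        # Each tag applies to all text that follows until the next tag.
        parts.append(f"<|speaker:{speaker}|>{text.strip()}")
    # Turns are separated by a single space.
    return " ".join(parts)


dialogue = build_dialogue([
    (0, "First speaker's line."),
    (1, "Second speaker's line."),
    (0, "First speaker again."),
])
print(dialogue)
```

Rejecting indices outside 0 and 1 on the client side is a design choice for this sketch; it surfaces mistakes before a request is sent.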

Voice mapping

The speech.voices array maps speaker indices to voice model IDs. Index 0 in the array corresponds to <|speaker:0|> in the text, index 1 to <|speaker:1|>:

[
  {
    "taskType": "audioInference",
    "taskUUID": "b2c3d4e5-f6a7-8901-bcde-f12345678901",
    "model": "fishaudio:s2.1@pro",
    "speech": {
      "text": "<|speaker:0|>The deployment went through without any issues. <|speaker:1|>[excited] Finally! That pipeline has been flaky for weeks.",
      "voices": [
        "536d3a5e000945adb7038665781a4aca",
        "933563129e564b19a115bedd57b7406a"
      ]
    }
  }
]

The response returns the task identifiers and a URL for the generated audio:

{
  "data": [
    {
      "taskType": "audioInference",
      "taskUUID": "b2c3d4e5-f6a7-8901-bcde-f12345678901",
      "audioUUID": "a2b3c4d5-e6f7-8901-2345-678901abcdef",
      "audioURL": "https://am.runware.ai/audio/os/a14d18/ws/2/ai/a2b3c4d5-e6f7-8901-2345-678901abcdef.mp3"
    }
  ]
}

Audio sample:

<|speaker:0|>The deployment went through without any issues. <|speaker:1|>[excited] Finally! That pipeline has been flaky for weeks.

speech.voices and speech.voice are mutually exclusive. Use speech.voice (singular) for single-speaker requests and speech.voices (array) for multi-speaker dialogue. Sending both causes a validation error.
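
As a rough illustration of the mapping and the exclusivity rule, the sketch below builds the request body shown above. The function name build_speech_task is hypothetical, the limit of two voice IDs is an assumption derived from the two-speaker limit, and no HTTP call is made because the endpoint and authentication are not covered in this guide.

```python
import json
import uuid
from typing import Optional, Sequence


def build_speech_task(text: str,
                      voice: Optional[str] = None,
                      voices: Optional[Sequence[str]] = None,
                      model: str = "fishaudio:s2.1@pro") -> dict:
    """Build one audioInference task dict (hypothetical helper)."""
    # speech.voice and speech.voices are mutually exclusive.
    if voice is not None and voices is not None:
        raise ValueError("use either voice or voices, not both")
    speech = {"text": text}
    if voices is not None:
        # Assumption: at most two IDs, since only <|speaker:0|> and <|speaker:1|> exist.
        if len(voices) > 2:
            raise ValueError("at most two voice IDs are expected")
        # Position 0 maps to <|speaker:0|>, position 1 to <|speaker:1|>.
        speech["voices"] = list(voices)
    elif voice is not None:
        speech["voice"] = voice
    return {
        "taskType": "audioInference",
        "taskUUID": str(uuid.uuid4()),
        "model": model,
        "speech": speech,
    }


task = build_speech_task(
    "<|speaker:0|>The deployment went through without any issues. "
    "<|speaker:1|>[excited] Finally! That pipeline has been flaky for weeks.",
    voices=[
        "536d3a5e000945adb7038665781a4aca",
        "933563129e564b19a115bedd57b7406a",
    ],
)
# The request body is a JSON array of tasks, as in the example above.
print(json.dumps([task], indent=2))
```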

Single voice vs. two voices

Dialogue text without speaker tags gets read by one voice. The same text with speaker tags and a voices array produces distinct speakers:

Audio sample:

Have you tested the new endpoint? Yes, it returns data in under fifty milliseconds. That's faster than what we had before.

Single voice: all lines read by one speaker

Audio sample:

<|speaker:0|>Have you tested the new endpoint? <|speaker:1|>Yes, it returns data in under fifty milliseconds. <|speaker:0|>That is faster than what we had before.

Two voices: each turn assigned to a different speaker

The multi-speaker version distinguishes who is saying what. The model handles turn-taking pacing automatically, with natural pauses between speakers.

Emotion tags per speaker

Bracket emotion tags apply to the current speaker only, so each speaker can carry a different emotional delivery in the same passage, as in the example below. The full catalog of tags, free-form expressions, and paralanguage cues is covered in the Emotion and expression guide.

<|speaker:0|>[excited] Guess what? We got accepted into the accelerator program!
<|speaker:1|>[surprised] Wait, are you serious? That is incredible news.

Audio sample:

<|speaker:0|>[excited] Guess what? We got accepted into the accelerator program! <|speaker:1|>[surprised] Wait, are you serious? That is incredible news.

The first speaker carries excitement. The second carries surprise. The tags don't bleed across speakers. Each voice performs its own direction independently.
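
One way to keep per-speaker direction organized is to store the emotion alongside each turn and render the tags at the end. This is a sketch only: the Turn type and render function are hypothetical names, and the emotion strings are simply the ones used in the example above.

```python
from typing import NamedTuple, Optional, Sequence


class Turn(NamedTuple):
    speaker: int                   # 0 or 1
    text: str
    emotion: Optional[str] = None  # e.g. "excited"; None leaves the turn untagged


def render(turns: Sequence[Turn]) -> str:
    """Render turns as tagged lines: speaker tag, optional bracket cue, then text."""
    lines = []
    for turn in turns:
        cue = f"[{turn.emotion}] " if turn.emotion else ""
        lines.append(f"<|speaker:{turn.speaker}|>{cue}{turn.text}")
    return "\n".join(lines)


script = render([
    Turn(0, "Guess what? We got accepted into the accelerator program!", "excited"),
    Turn(1, "Wait, are you serious? That is incredible news.", "surprised"),
])
print(script)
```

Because the emotion lives on the turn rather than on the speaker, a speaker can be tagged in one turn and left neutral in the next.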

Dialogue patterns

Technical discussion

Clean back-and-forth between two colleagues. No emotion tags needed when the content speaks for itself:

Audio sample:

<|speaker:0|>Did you get a chance to review the pull request? <|speaker:1|>I did. The caching layer looks solid, but I have questions about the invalidation logic. <|speaker:0|>Fair enough. Want to walk through it after standup?

Podcast conversation

Longer turns with natural topic development. Good for content where each speaker contributes multiple sentences:

Audio sample:

<|speaker:0|>Welcome back to the show. Today we are talking about the state of open-source AI. <|speaker:1|>It has been a wild year. Three months ago, nobody expected the licensing landscape to shift this fast. <|speaker:0|>Right. And the tooling has caught up in ways that actually matter for production.

Interview

One speaker asks, the other answers at length. The asymmetry in turn length works well because S2-Pro adjusts pacing per speaker:

Audio sample:

<|speaker:0|>Can you tell me about a time you had to make a difficult technical decision under pressure? <|speaker:1|>Sure. Last quarter we had a production outage that lasted six hours. I had to decide between rolling back to a known-good state or pushing a hotfix forward. I chose the rollback. It cost us a deploy cycle, but it was the safer call.

Narrator and character

One voice narrates in third person. The other speaks in first person as the character. Emotion tags on the character voice add texture without affecting the narrator:

Audio sample:

<|speaker:0|>The lab was dark except for a single monitor. Dr. Chen stared at the results. <|speaker:1|>[whispering] That cannot be right. Run it again. <|speaker:0|>She pressed enter. The second run confirmed what the first had shown.

Formatting multi-speaker text

The model doesn't require line breaks between speaker turns. The <|speaker:N|> tags are the only markers it needs. Both of the following formats produce the same output.

All turns on one line:

<|speaker:0|>Line one. <|speaker:1|>Line two.

One turn per line:

<|speaker:0|>Line one.
<|speaker:1|>Line two.

Use whichever format makes your source text easier to read. In JSON payloads, everything goes in a single speech.text string, so line breaks are written as \n escape sequences.
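
To check on the client side that two layouts carry the same turns, the text can be split on the speaker tags. The sketch below is an illustration under stated assumptions: split_turns is a hypothetical helper, it drops any text before the first tag, and it says nothing about how the model itself tokenizes the input.

```python
import json
import re
from typing import List, Tuple

SPEAKER_TAG = re.compile(r"<\|speaker:(\d+)\|>")


def split_turns(text: str) -> List[Tuple[int, str]]:
    """Split tagged text into (speaker_index, text) pairs, trimming whitespace."""
    # re.split with one capture group yields [prefix, index, body, index, body, ...].
    pieces = SPEAKER_TAG.split(text)
    return [(int(index), body.strip())
            for index, body in zip(pieces[1::2], pieces[2::2])]


one_line = "<|speaker:0|>Line one. <|speaker:1|>Line two."
multi_line = "<|speaker:0|>Line one.\n<|speaker:1|>Line two."

# Both layouts contain the same turns once whitespace is trimmed.
assert split_turns(one_line) == split_turns(multi_line)

# In a JSON payload the line break is serialized as the two-character escape \n.
encoded = json.dumps({"text": multi_line})
print(encoded)
```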

Tips

  1. Pick voices that contrast. Two voices with similar pitch and cadence sound like one person talking to themselves. Pair a lower-register narrator with a higher-register speaker for the clearest separation.

  2. Keep turns long enough to establish the voice. A two-word turn doesn't give the model enough context to differentiate the speaker. Aim for at least one full sentence per turn.

  3. Use emotion tags on the speaker who needs them. You don't have to tag every turn. If one speaker is neutral and the other is emotional, tag only the emotional speaker. The contrast makes the emotion more noticeable.

  4. Place speaker tags at natural dialogue boundaries. A speaker tag in the middle of a sentence produces an unnatural voice switch. Tag at the start of a new sentence or thought, not mid-clause.

  5. Start every text with a speaker tag. The model defaults to speaker 0 if no tag is present at the start. Be explicit by opening with <|speaker:0|> to avoid ambiguity.
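
Some of these tips can be checked mechanically before a request is sent. The sketch below is a rough heuristic, not an official validator: lint_dialogue and the MIN_WORDS threshold are hypothetical, and the mid-sentence check only looks at whether the preceding text ends in sentence punctuation.

```python
import re
from typing import List

SPEAKER_TAG = re.compile(r"<\|speaker:(\d+)\|>")
BRACKET_CUE = re.compile(r"\[[^\]]*\]")
MIN_WORDS = 3  # hypothetical threshold for a turn that is "long enough"


def lint_dialogue(text: str) -> List[str]:
    """Return warnings for dialogue text that ignores tips 2, 4, or 5."""
    warnings = []
    if not text.lstrip().startswith("<|speaker:"):
        warnings.append("text does not open with a speaker tag")
    pieces = SPEAKER_TAG.split(text)
    previous = pieces[0]  # text before the first tag, usually empty
    for index, body in zip(pieces[1::2], pieces[2::2]):
        if int(index) not in (0, 1):
            warnings.append(f"speaker index {index} is outside 0-1")
        # Heuristic: the text before a tag should end a sentence.
        if previous.strip() and previous.rstrip()[-1] not in ".!?":
            warnings.append(f"tag for speaker {index} appears mid-sentence")
        # Count words without bracket cues such as [excited].
        words = BRACKET_CUE.sub("", body).split()
        if len(words) < MIN_WORDS:
            warnings.append(f"turn for speaker {index} has only {len(words)} word(s)")
        previous = body
    return warnings


sample = ("<|speaker:0|>Did you get a chance to review the pull request? "
          "<|speaker:1|>I did. The caching layer looks solid.")
print(lint_dialogue(sample))
```

Voice contrast (tip 1) and selective emotion tagging (tip 3) are judgment calls that this kind of check does not cover.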