Fish Audio S2.1 Pro
Fish Audio S2.1 Pro is a flagship text-to-speech model built for highly expressive, low-latency speech generation. It supports natural-language bracket cues for emotion and delivery control, multi-speaker dialogue in a single generation, 80+ languages with automatic language detection, and realtime streaming with very fast time to first audio.
Multi-speaker dialogue
How to generate two-speaker audio in a single request using S2-Pro's inline speaker tags. Covers speaker tag syntax, voice mapping, emotion control per speaker, and practical dialogue patterns.
Introduction
Most text-to-speech models produce one voice per request. With such models, a two-person conversation requires a separate API call for each speaker, with the audio stitched together downstream. S2-Pro handles this differently: you write both sides of the dialogue in a single text input, tag each turn with a speaker index, and the model renders both voices in one audio file with natural turn-taking.
<|speaker:0|>So we shipped the redesign last Thursday. <|speaker:1|>[excited] And the conversion rate went up twelve percent overnight. <|speaker:0|>Twelve percent. On a Thursday launch. <|speaker:1|>[laughs] Nobody launches on a Thursday.
That sample uses two voices and two emotion tags across four speaker turns, all generated in a single request. This guide covers the speaker tag syntax, voice mapping, per-speaker emotion control, and practical patterns for common dialogue scenarios.
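When the dialogue lives in application data rather than a hand-written string, the tagged text can be assembled programmatically. The following is a minimal sketch; the `render_dialogue` helper and the `(speaker, cue, text)` tuple layout are illustrative names chosen here, not part of the API. It reproduces the sample above.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical turn layout for this sketch:
# (speaker index, optional bracket cue such as "excited", spoken text)
Turn = Tuple[int, Optional[str], str]


def render_dialogue(turns: Iterable[Turn]) -> str:
    """Join turns into one string using <|speaker:N|> tags and [cue] brackets."""
    parts = []
    for speaker, cue, text in turns:
        if speaker < 0:
            raise ValueError("speaker index must be non-negative")
        # The cue, when present, goes right after the speaker tag.
        cue_prefix = f"[{cue}] " if cue else ""
        parts.append(f"<|speaker:{speaker}|>{cue_prefix}{text.strip()}")
    return " ".join(parts)


turns = [
    (0, None, "So we shipped the redesign last Thursday."),
    (1, "excited", "And the conversion rate went up twelve percent overnight."),
    (0, None, "Twelve percent. On a Thursday launch."),
    (1, "laughs", "Nobody launches on a Thursday."),
]

dialogue_text = render_dialogue(turns)
print(dialogue_text)
```

Keeping turns as structured data makes it easy to add or remove cues per speaker without editing the tag syntax by hand.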
Voice mapping
The speech.voices array maps speaker indices to voice model IDs. Index 0 in the array corresponds to <|speaker:0|> in the text, index 1 to <|speaker:1|>:
Request:
[
  {
    "taskType": "audioInference",
    "taskUUID": "b2c3d4e5-f6a7-8901-bcde-f12345678901",
    "model": "fishaudio:s2.1@pro",
    "speech": {
      "text": "<|speaker:0|>The deployment went through without any issues. <|speaker:1|>[excited] Finally! That pipeline has been flaky for weeks.",
      "voices": [
        "536d3a5e000945adb7038665781a4aca",
        "933563129e564b19a115bedd57b7406a"
      ]
    }
  }
]

Response:
{
  "data": [
    {
      "taskType": "audioInference",
      "taskUUID": "b2c3d4e5-f6a7-8901-bcde-f12345678901",
      "audioUUID": "a2b3c4d5-e6f7-8901-2345-678901abcdef",
      "audioURL": "https://am.runware.ai/audio/os/a14d18/ws/2/ai/a2b3c4d5-e6f7-8901-2345-678901abcdef.mp3"
    }
  ]
}
speech.voices and speech.voice are mutually exclusive. Use speech.voice (singular) for single-speaker requests and speech.voices (array) for multi-speaker dialogue. Sending both causes a validation error.
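The mutual-exclusivity rule can be enforced on the client before a request is sent. The sketch below builds a task object with the field names shown in this guide; `build_speech_task` is a hypothetical helper, and rejecting a call that supplies neither `voice` nor `voices` is an assumption of this sketch rather than documented API behavior. It only constructs and prints the payload; sending it is left to your own client code.

```python
import json
import uuid
from typing import Optional, Sequence


def build_speech_task(
    text: str,
    voice: Optional[str] = None,
    voices: Optional[Sequence[str]] = None,
    model: str = "fishaudio:s2.1@pro",
) -> dict:
    """Build one audioInference task with either speech.voice or speech.voices."""
    # Exactly one of the two fields may be set. Sending both is documented
    # as a validation error; rejecting "neither" is this sketch's assumption.
    if (voice is None) == (voices is None):
        raise ValueError("provide exactly one of 'voice' or 'voices'")

    speech = {"text": text}
    if voices is not None:
        # Position i in this list is the voice used for <|speaker:i|>.
        speech["voices"] = list(voices)
    else:
        speech["voice"] = voice

    return {
        "taskType": "audioInference",
        "taskUUID": str(uuid.uuid4()),
        "model": model,
        "speech": speech,
    }


task = build_speech_task(
    "<|speaker:0|>The deployment went through without any issues. "
    "<|speaker:1|>[excited] Finally! That pipeline has been flaky for weeks.",
    voices=[
        "536d3a5e000945adb7038665781a4aca",
        "933563129e564b19a115bedd57b7406a",
    ],
)
print(json.dumps([task], indent=2))
```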
Single voice vs. two voices
Dialogue text without speaker tags gets read by one voice. The same text with speaker tags and a voices array produces distinct speakers:
Have you tested the new endpoint? Yes, it returns data in under fifty milliseconds. That's faster than what we had before.
<|speaker:0|>Have you tested the new endpoint? <|speaker:1|>Yes, it returns data in under fifty milliseconds. <|speaker:0|>That is faster than what we had before.
The multi-speaker version distinguishes who is saying what. The model handles turn-taking pacing automatically, with natural pauses between speakers.
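Because each speaker index in the text must line up with a position in the voices array, it can help to check the two against each other before sending a request. This is a minimal sketch; `speakers_in` and `missing_voices` are hypothetical helper names, and the tag pattern is inferred from the examples in this guide.

```python
import re
from typing import List, Sequence

# Matches tags of the form <|speaker:N|> as used in this guide.
SPEAKER_TAG = re.compile(r"<\|speaker:(\d+)\|>")


def speakers_in(text: str) -> List[int]:
    """Return the distinct speaker indices referenced in the text, sorted."""
    return sorted({int(index) for index in SPEAKER_TAG.findall(text)})


def missing_voices(text: str, voices: Sequence[str]) -> List[int]:
    """Return speaker indices that have no matching position in voices."""
    return [index for index in speakers_in(text) if index >= len(voices)]


text = (
    "<|speaker:0|>Have you tested the new endpoint? "
    "<|speaker:1|>Yes, it returns data in under fifty milliseconds. "
    "<|speaker:0|>That is faster than what we had before."
)

both = ["536d3a5e000945adb7038665781a4aca", "933563129e564b19a115bedd57b7406a"]
print(speakers_in(text))                # [0, 1]
print(missing_voices(text, both))       # []
print(missing_voices(text, both[:1]))   # [1]
```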
Dialogue patterns
Technical discussion
Clean back-and-forth between two colleagues. No emotion tags needed when the content speaks for itself:
<|speaker:0|>Did you get a chance to review the pull request? <|speaker:1|>I did. The caching layer looks solid, but I have questions about the invalidation logic. <|speaker:0|>Fair enough. Want to walk through it after standup?
Podcast conversation
Longer turns with natural topic development. Good for content where each speaker contributes multiple sentences:
<|speaker:0|>Welcome back to the show. Today we are talking about the state of open-source AI. <|speaker:1|>It has been a wild year. Three months ago, nobody expected the licensing landscape to shift this fast. <|speaker:0|>Right. And the tooling has caught up in ways that actually matter for production.
Interview
One speaker asks, the other answers at length. The asymmetry in turn length works well because S2-Pro adjusts pacing per speaker:
<|speaker:0|>Can you tell me about a time you had to make a difficult technical decision under pressure? <|speaker:1|>Sure. Last quarter we had a production outage that lasted six hours. I had to decide between rolling back to a known-good state or pushing a hotfix forward. I chose the rollback. It cost us a deploy cycle, but it was the safer call.
Narrator and character
One voice narrates in third person. The other speaks in first person as the character. Emotion tags on the character voice add texture without affecting the narrator:
<|speaker:0|>The lab was dark except for a single monitor. Dr. Chen stared at the results. <|speaker:1|>[whispering] That cannot be right. Run it again. <|speaker:0|>She pressed enter. The second run confirmed what the first had shown.
Formatting multi-speaker text
The model doesn't require line breaks between speaker turns; the <|speaker:N|> tags are the only markers it needs. Both of these formats produce the same output:

<|speaker:0|>Line one. <|speaker:1|>Line two.

<|speaker:0|>Line one.
<|speaker:1|>Line two.

Use whichever format makes your source text easier to read. In JSON payloads, everything goes in a single speech.text string, so line breaks are just \n characters.
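To illustrate the point about JSON payloads, the sketch below serializes a two-line script with Python's standard `json` module and shows the line break becoming an escaped `\n` inside the string. Flattening to a single line is optional, since both layouts are described as equivalent; the variable names are illustrative.

```python
import json

# A script written one turn per line for readability.
script = "<|speaker:0|>Line one.\n<|speaker:1|>Line two."

# Serializing escapes the line break as \n inside the JSON string.
payload = json.dumps({"text": script})
print(payload)
# {"text": "<|speaker:0|>Line one.\n<|speaker:1|>Line two."}

# Optional: collapse the turns onto one line before sending.
one_line = " ".join(line.strip() for line in script.splitlines())
print(one_line)
# <|speaker:0|>Line one. <|speaker:1|>Line two.
```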
Tips
- Pick voices that contrast. Two voices with similar pitch and cadence sound like one person talking to themselves. Pair a lower-register narrator with a higher-register speaker for the clearest separation.
- Keep turns long enough to establish the voice. A two-word turn doesn't give the model enough context to differentiate the speaker. Aim for at least one full sentence per turn.
- Use emotion tags on the speaker who needs them. You don't have to tag every turn. If one speaker is neutral and the other is emotional, tag only the emotional speaker. The contrast makes the emotion more noticeable.
- Place speaker tags at natural dialogue boundaries. A speaker tag in the middle of a sentence produces an unnatural voice switch. Tag at the start of a new sentence or thought, not mid-clause.
- Start every text with a speaker tag. The model defaults to speaker 0 if no tag is present at the start. Be explicit by opening with <|speaker:0|> to avoid ambiguity.
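Several of these tips can be checked mechanically before a request is sent. The following is a rough sketch of such a check; `lint_dialogue` is a hypothetical helper, the three-word minimum is an arbitrary stand-in for "at least one full sentence", and the mid-sentence test (the text before a tag should end in `.`, `!`, or `?`) is a simple heuristic rather than real sentence parsing.

```python
import re
from typing import List

TAG = re.compile(r"<\|speaker:(\d+)\|>")
CUE = re.compile(r"\[[^\]]*\]")  # bracket cues such as [excited]


def lint_dialogue(text: str, min_words: int = 3) -> List[str]:
    """Return human-readable warnings for a tagged dialogue string."""
    warnings = []

    # Tip: start every text with a speaker tag.
    if not TAG.match(text.lstrip()):
        warnings.append("text does not start with a speaker tag")

    matches = list(TAG.finditer(text))
    for i, match in enumerate(matches):
        # A turn runs from the end of this tag to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        spoken = CUE.sub("", text[match.end():end]).strip()

        # Tip: keep turns long enough to establish the voice.
        if len(spoken.split()) < min_words:
            warnings.append(
                f"turn {i} (speaker {match.group(1)}) has fewer than {min_words} words"
            )

        # Tip: place tags at sentence boundaries, not mid-clause.
        before = text[:match.start()].rstrip()
        if before and before[-1] not in ".!?":
            warnings.append(f"speaker tag before turn {i} appears mid-sentence")

    return warnings


clean = (
    "<|speaker:0|>Did you review the pull request? "
    "<|speaker:1|>I did, and it looks solid."
)
print(lint_dialogue(clean))  # []

for warning in lint_dialogue("<|speaker:0|>I think that <|speaker:1|>we should wait a little."):
    print(warning)
# speaker tag before turn 1 appears mid-sentence
```

Treat the warnings as prompts to review the script, not as hard errors: a short interjection can be intentional.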