MODEL ID fishaudio:s2.1@pro
coming-soon

Fish Audio S2.1 Pro

by Fish Audio

Fish Audio S2.1 Pro is a flagship text-to-speech model built for highly expressive, low-latency speech generation. It supports natural-language bracket cues for emotion and delivery control, multi-speaker dialogue in a single generation, 80+ languages with automatic language detection, and realtime streaming with very fast time to first audio.

Emotion and expression control

How to use S2-Pro's bracket tag system to control vocal delivery. Covers core tags, free-form natural language expressions, tag combining, paralanguage cues, and phoneme-level pronunciation overrides.

Introduction

S2-Pro controls vocal delivery through bracket tags: short instructions placed in square brackets alongside the text. You write [excited] before a line and the model reads it with energy. You write [whispering] and the voice drops. Any descriptive phrase inside brackets works as a direction, from single keywords like [laughs] to full descriptions like [laughing nervously while trying to keep composure].

You know what I love about this city? [excited] The food scene is unreal. [sigh] But the rent... the rent is something else entirely.

That sample uses two tags, [excited] and [sigh], in one passage. Each tag steers the voice for the text that follows it, and the model shifts naturally between them. This guide covers the full tag system, free-form expressions, paralanguage cues for pacing, and phoneme-level pronunciation control.

How bracket tags work

A bracket tag is a word or phrase enclosed in [square brackets], placed before the text it applies to. The model reads the tag, adjusts its delivery accordingly, then speaks everything that follows until it hits the next tag or the end of the input.

[instruction] Text to speak in that style.

Without any tags, S2-Pro reads text in a neutral tone. Adding a tag transforms the delivery. Compare the same line spoken flat versus with an [excited] tag:

I just found out I got the job. I can't believe it.

No tag

[excited] I just found out I got the job. I can't believe it!

With [excited]

The tagged version carries energy and pacing that match the content. The flat version reads the words correctly but misses the emotional context.

Here is a full API request that uses bracket tags in the speech text:

[
  {
    "taskType": "audioInference",
    "taskUUID": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "model": "fishaudio:s2.1@pro",
    "speech": {
      "text": "[excited] The quarterly numbers are in and they're outstanding. [laughs] Even the finance team was smiling.",
      "voice": "933563129e564b19a115bedd57b7406a"
    }
  }
]

The response returns a URL for the generated audio:

{
  "data": [
    {
      "taskType": "audioInference",
      "taskUUID": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "audioUUID": "f1e2d3c4-b5a6-7890-1234-567890abcdef",
      "audioURL": "https://am.runware.ai/audio/os/a14d18/ws/2/ai/f1e2d3c4-b5a6-7890-1234-567890abcdef.mp3"
    }
  ]
}

[excited] The quarterly numbers are in and they're outstanding. [laughs] Even the finance team was smiling.
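
The request and response shapes above can be handled with a few lines of Python. This is a minimal sketch, not an official client: the helper names build_speech_task and audio_urls are hypothetical, the field names are copied from the sample above, and how the JSON is actually sent (endpoint, authentication) is not covered by this guide and is left out.

```python
import json
import uuid


def build_speech_task(text: str, voice: str, model: str = "fishaudio:s2.1@pro") -> dict:
    """Build one audioInference task shaped like the sample request."""
    return {
        "taskType": "audioInference",
        "taskUUID": str(uuid.uuid4()),  # fresh identifier for each task
        "model": model,
        "speech": {"text": text, "voice": voice},
    }


def audio_urls(response: dict) -> list:
    """Collect audioURL values from a response shaped like the sample response."""
    return [item["audioURL"] for item in response.get("data", []) if "audioURL" in item]


task = build_speech_task(
    "[excited] The quarterly numbers are in and they're outstanding. "
    "[laughs] Even the finance team was smiling.",
    voice="933563129e564b19a115bedd57b7406a",
)

# The sample request is a JSON array of task objects, so wrap the task in a list.
request_body = json.dumps([task], indent=2)
print(request_body)
```

Generating the taskUUID per call keeps each task distinguishable when several are sent together; the response sample echoes the same taskUUID back next to the audioURL.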

Core tags

S2-Pro ships with a set of built-in tags that work reliably across all voices. These cover the most common delivery modes:

  • Paralanguage: [whisper], [laugh], [emphasis], [sigh], [gasp], [pause]
  • Emotions: [angry], [excited], [sad], [surprised]
  • Breath cues: [inhale], [exhale]

A tag at the start of a line shapes the entire line. A tag placed mid-sentence, like [laughs] after a punchline, creates a momentary shift at that point. Both placements work.
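
To see which tags in a script belong to the core list and which are free-form, a small check like the following may help. It is only a sketch: CORE_TAGS is copied from the list above, the function names are hypothetical, and the comparison is an exact match, so a variant spelling of a core tag would be reported as free-form.

```python
import re

# Core tags exactly as listed above; anything else is treated as free-form.
CORE_TAGS = {
    "whisper", "laugh", "emphasis", "sigh", "gasp", "pause",
    "angry", "excited", "sad", "surprised",
    "inhale", "exhale",
}

# Matches the content of one [bracket tag] with no nested brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def find_tags(text: str) -> list:
    """Return the contents of every bracket tag, in order of appearance."""
    return [tag.strip().lower() for tag in TAG_PATTERN.findall(text)]


def free_form_tags(text: str) -> list:
    """Return the tags that are not in the core list (exact match only)."""
    return [tag for tag in find_tags(text) if tag not in CORE_TAGS]


script = "[sad] I thought we had more time. [with gentle warmth] But I'm glad you came."
print(find_tags(script))
print(free_form_tags(script))
```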

Free-form natural language expressions

The bracket system goes beyond the core tags above. S2-Pro accepts any descriptive phrase inside brackets. If you can describe how you want the line delivered, the model attempts to perform it:

[whispers sweetly]
[laughing nervously]
[with gentle warmth]
[speaking through tears]
[as if confiding a secret]

Free-form tags give you more control than single-keyword emotions. [whispers] drops volume. [whispers sweetly] drops volume and adds warmth. The additional context in the tag shapes the nuance.

Free-form tags are more expressive but less predictable than the core set. If a specific tag produces inconsistent results across regenerations, fall back to a simpler phrasing or one of the built-in tags.

Combining tags

You can stack multiple tags at the same point or apply them sequentially across sentences. Compatible pairs compound naturally:

[sad][whispering] I don't think he's coming back this time.

[sad] + [whispering]

[excited] We actually did it! [laughs] I told you it would work!

[excited] leading into [laughs]

Compatible pairs like [sad] + [whispering] or [excited] + [laughs] work because they describe deliveries a person could plausibly perform at the same time. Conflicting pairs like [whispering] + [shouting] send the model in two directions at once and produce unstable output. If the delivery you want combines mood and manner, stack them. If the two tags fight each other physically, pick one.
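
Stacked tags can be located mechanically before a script is sent. The sketch below, with hypothetical names, finds runs of adjacent tags so that large stacks can be reviewed by hand. It does not judge whether a pair is compatible, since this guide gives no machine-readable list of conflicts.

```python
import re

# Two or more tags in a row, separated by nothing or by whitespace only.
STACK_PATTERN = re.compile(r"(?:\[[^\[\]]+\]\s*){2,}")
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def stacked_groups(text: str) -> list:
    """Return each group of adjacent tags that apply at the same position."""
    return [TAG_PATTERN.findall(match.group(0)) for match in STACK_PATTERN.finditer(text)]


stacked = stacked_groups("[sad][whispering] I don't think he's coming back this time.")
sequential = stacked_groups("[excited] We actually did it! [laughs] I told you it would work!")
print(stacked)     # one group containing both tags
print(sequential)  # no groups: the tags are separated by spoken text
```

A group with more than two entries is a reasonable candidate for rewriting as a single free-form tag.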

Emotion transitions

For longer passages, tags work as scene directions that shift the read across sentences. The model can move between contrasting emotions within a single generation:

[excited] I got the promotion! [pause] But it means relocating across the country. [sad] I'll miss everyone here. [with quiet resolve] I'm going to make it work.

Each tag resets the delivery for the text that follows. The model doesn't blend one tag into the next automatically. The transition happens at the boundary. For a smooth arc, space the emotional shifts across enough text that each one has room to land.
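
Because each tag applies to the text up to the next tag, a script can be split at tag boundaries to check how much text each emotional beat receives. This is an illustrative sketch with hypothetical names; the word count is a rough proxy for "room to land", not a threshold the model enforces.

```python
import re

# The capturing group makes re.split keep the tag contents in the result.
TAG_SPLIT = re.compile(r"\[([^\[\]]+)\]")


def segments(text: str) -> list:
    """Split text into (tag, spoken_text) pairs; tag is None for untagged lead-in text."""
    parts = TAG_SPLIT.split(text)
    result = []
    if parts[0].strip():
        result.append((None, parts[0].strip()))
    # After the lead-in, parts alternate: tag, text, tag, text, ...
    for tag, body in zip(parts[1::2], parts[2::2]):
        result.append((tag.strip(), body.strip()))
    return result


script = (
    "[excited] I got the promotion! [pause] But it means relocating across the country. "
    "[sad] I'll miss everyone here. [with quiet resolve] I'm going to make it work."
)
beats = segments(script)
for tag, body in beats:
    print(f"{tag}: {len(body.split())} words")
```

Stacked tags such as [sad][whispering] produce an empty segment for the first tag, which is expected: both tags apply to the text after the second one.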

Paralanguage

Beyond emotion tags, S2-Pro supports paralanguage cues that control timing and vocal texture. These use parentheses instead of brackets:

(break)         Short pause
(long-break)    Extended pause
(breath)        Audible inhale
(laugh)         Inline laugh sound
(cough)         Cough sound
(sigh)          Sigh sound
(lip-smacking)  Lip-smacking sound

Paralanguage cues require settings.normalize to be set to false. With normalization enabled (the default), the model may strip or misinterpret parenthesized tokens.

The results are in. (break) We passed every benchmark.

(break) between sentences

Let me think about this for a second. (breath) Okay. Here's what we do.

(breath) for a natural inhale

Use paralanguage when you need precise control over where a pause or sound lands in the output. Bracket tags steer the overall mood. Paralanguage cues insert specific audio events at specific positions.
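
Since paralanguage cues depend on normalization being off, the setting can be derived from the text itself. The following is a sketch under assumptions: the cue list is copied from the table above, the function names are hypothetical, and the returned dictionary is meant to stand in for the settings object that settings.normalize refers to. Where that object sits in the request is not shown in this guide.

```python
import re

# Cues from the table above, written without their parentheses.
PARALANGUAGE_CUES = {
    "break", "long-break", "breath", "laugh", "cough", "sigh", "lip-smacking",
}

# A single lowercase, possibly hyphenated word in parentheses, e.g. (long-break).
CUE_PATTERN = re.compile(r"\(([a-z-]+)\)")


def uses_paralanguage(text: str) -> bool:
    """True if the text contains at least one known paralanguage cue."""
    return any(cue in PARALANGUAGE_CUES for cue in CUE_PATTERN.findall(text))


def speech_settings(text: str) -> dict:
    """Return {'normalize': False} when cues are present, else leave the default."""
    # ASSUMPTION: the result is used as the "settings" object of the request.
    return {"normalize": False} if uses_paralanguage(text) else {}


print(speech_settings("The results are in. (break) We passed every benchmark."))
print(speech_settings("A plain sentence (with an aside)."))
```

Ordinary parenthetical asides are left alone because they do not match a known cue, so normalization stays at its default for plain text.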

Pronunciation overrides

S2-Pro supports phoneme-level pronunciation control for cases where the model mispronounces a word or you need a specific reading of an ambiguous term. This is useful for homographs (words spelled the same but pronounced differently), brand names, technical jargon, and foreign loan words.

The syntax wraps a CMU Arpabet phoneme sequence between <|phoneme_start|> and <|phoneme_end|> tags, replacing the word you want to control:

I am an <|phoneme_start|>EH1 N JH AH0 N IH1 R<|phoneme_end|>.

Phoneme control requires settings.normalize to be set to false. Each phoneme tag replaces exactly one word. Place punctuation after the closing tag, not inside it.

The most common use case is homograph disambiguation. The word "read" has two pronunciations depending on tense:

The <|phoneme_start|>R IY1 D<|phoneme_end|> endpoint returns the current state.

"read" as /riːd/ (present tense)

The book was <|phoneme_start|>R EH1 D<|phoneme_end|> yesterday.

"read" as /rɛd/ (past tense)

Technical terms and proper nouns also benefit from explicit pronunciation:

Deploy with <|phoneme_start|>K UW2 B ER0 N EH1 T IY0 Z<|phoneme_end|> for container orchestration.

Kubernetes with correct stress pattern

CMU Arpabet uses uppercase phoneme codes with stress numbers: 0 for unstressed, 1 for primary stress, 2 for secondary stress. The full phoneme inventory is published in cmudict.symbols. You can look up any English word in the CMU Pronouncing Dictionary.
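
A typo in a phoneme sequence is easy to miss, so it can be worth validating the symbols before wrapping them. The sketch below uses the 39-phoneme CMU inventory and a hypothetical helper name, phoneme_tag. It assumes every vowel carries a stress digit, as in the CMU Pronouncing Dictionary; whether the model also accepts vowels without stress is not stated here.

```python
# The 39 phonemes of the CMU Pronouncing Dictionary, split by whether they take stress.
VOWELS = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
    "IH", "IY", "OW", "OY", "UH", "UW",
}
CONSONANTS = {
    "B", "CH", "D", "DH", "F", "G", "HH", "JH", "K", "L", "M", "N",
    "NG", "P", "R", "S", "SH", "T", "TH", "V", "W", "Y", "Z", "ZH",
}


def phoneme_tag(arpabet: str) -> str:
    """Validate an Arpabet sequence and wrap it in the phoneme delimiters."""
    symbols = arpabet.split()
    if not symbols:
        raise ValueError("empty phoneme sequence")
    for symbol in symbols:
        if symbol in CONSONANTS:
            continue
        # Anything else must be a vowel followed by a stress digit 0, 1, or 2.
        base, stress = symbol[:-1], symbol[-1]
        if base not in VOWELS or stress not in ("0", "1", "2"):
            raise ValueError(f"not a valid Arpabet symbol: {symbol!r}")
    return "<|phoneme_start|>" + " ".join(symbols) + "<|phoneme_end|>"


# Punctuation goes after the closing tag, as described above.
sentence = "I am an " + phoneme_tag("EH1 N JH AH0 N IH1 R") + "."
print(sentence)
```

Raising on an unknown symbol catches mistakes such as a lowercase code or a missing stress digit before the text reaches the model.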

Tips

  1. Start with core tags before reaching for free-form. The built-in tags (such as [excited], [sad], [whisper], and [laugh]) produce the most consistent results across voices. Use free-form expressions when you need nuance that the core set doesn't cover.

  2. Stack at most two compatible tags. [sad][whispering] works. Piling on four tags at the same position dilutes each one. If you need a complex delivery, use a single free-form tag that describes it in natural language: [speaking softly with sadness].

  3. Give each emotion enough text to land. A tag applied to three words doesn't have room to develop. A tag applied to a full sentence lets the model build the delivery. For transitions, write at least one complete sentence per emotional beat.

  4. Use paralanguage for timing, tags for mood. A (break) controls where a pause happens. An [excited] controls how the voice sounds. They solve different problems and can be used together.

  5. Reserve phoneme control for genuine mispronunciations. Most words don't need overrides. Use <|phoneme_start|> only when the model consistently mispronounces a specific word, like a brand name or a homograph in the wrong tense.

  6. Set normalize to false for paralanguage and phoneme tags. Both features rely on literal token parsing. Text normalization can interfere with parenthesized cues and phoneme delimiters.