Fish Audio S2.1 Pro
Fish Audio S2.1 Pro is a flagship text-to-speech model built for highly expressive, low-latency speech generation. It supports natural-language bracket cues for emotion and delivery control, multi-speaker dialogue in a single generation, 80+ languages with automatic language detection, and realtime streaming with very fast time to first audio.
Emotion and expression control
How to use S2-Pro's bracket tag system to control vocal delivery. Covers core tags, free-form natural language expressions, tag combining, paralanguage cues, and phoneme-level pronunciation overrides.
Introduction
S2-Pro controls vocal delivery through bracket tags: short instructions placed in square brackets alongside the text. You write [excited] before a line and the model reads it with energy. You write [whispering] and the voice drops. Any descriptive phrase inside brackets works as a direction, from single keywords like [laughs] to full descriptions like [laughing nervously while trying to keep composure].
You know what I love about this city? [excited] The food scene is unreal. [sigh] But the rent... the rent is something else entirely.
That sample uses two tags in one passage: an emotion tag, [excited], and a sigh cue, [sigh]. Each tag steers the voice for the text that follows it, and the model shifts naturally between them. This guide covers the full tag system, free-form expressions, paralanguage cues for pacing, and phoneme-level pronunciation control.
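The scoping rule above (a tag applies to the text that follows it, until the next tag) can be previewed offline before sending a script to the model. The sketch below is only an illustration of that rule: `split_by_tags` is a hypothetical helper, and its regular expression is an assumption about what counts as a bracket tag, not the model's actual parser.

```python
import re

# Assumption: a tag is any non-empty run of characters between square brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_by_tags(script):
    """Return (tags, text) pairs, where tags is the tuple of bracket tags
    placed directly before that stretch of text (empty for untagged text)."""
    segments = []
    tags = []       # tags collected since the last stretch of text
    position = 0
    for match in TAG_PATTERN.finditer(script):
        text = script[position:match.start()].strip()
        if text:
            # Text found: close the current segment and start a new tag group.
            segments.append((tuple(tags), text))
            tags = []
        tags.append(match.group(1))
        position = match.end()
    tail = script[position:].strip()
    if tail:
        segments.append((tuple(tags), tail))
    return segments


if __name__ == "__main__":
    sample = (
        "You know what I love about this city? [excited] The food scene is "
        "unreal. [sigh] But the rent... the rent is something else entirely."
    )
    for tags, text in split_by_tags(sample):
        print(tags, "->", text)
```

Listing the segments this way makes it easy to see which sentences each tag governs and whether any tag is left with no text after it.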
Free-form natural language expressions
The bracket system goes beyond the core tags above. S2-Pro accepts any descriptive phrase inside brackets. If you can describe how you want the line delivered, the model attempts to perform it:
[whispers sweetly]
[laughing nervously]
[with gentle warmth]
[speaking through tears]
[as if confiding a secret]
[whispers sweetly] Close your eyes. I have a surprise for you.
[laughing nervously] Yeah, I totally knew that was going to happen. Definitely. For sure.
[with gentle warmth] Take your time. There's no rush at all.
Free-form tags give you more control than single-keyword emotions. [whispers] drops volume. [whispers sweetly] drops volume and adds warmth. The additional context in the tag shapes the nuance.
Free-form tags are more expressive but less predictable than the core set. If a specific tag produces inconsistent results across regenerations, fall back to a simpler phrasing or one of the built-in tags.
Emotion transitions
For longer passages, tags work as scene directions that shift the read across sentences. The model handles transitions between contrasting emotions:
[excited] I got the promotion! [pause] But it means relocating across the country. [sad] I'll miss everyone here. [with quiet resolve] I'm going to make it work.
Each tag resets the delivery for the text that follows. The model doesn't blend one tag into the next automatically. The transition happens at the boundary. For a smooth arc, space the emotional shifts across enough text that each one has room to land.
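One way to keep an emotional arc readable while drafting is to store each beat as a (tag, text) pair and join the pairs only at the end. The sketch below assumes nothing about the API; `compose_script` is a hypothetical helper that produces the same bracket syntax shown in the example above.

```python
def compose_script(beats):
    """Join (tag, text) beats into one tagged script.

    A beat whose tag is None or empty is emitted as plain, untagged text.
    """
    parts = []
    for tag, text in beats:
        text = text.strip()
        parts.append(f"[{tag}] {text}" if tag else text)
    return " ".join(parts)


if __name__ == "__main__":
    beats = [
        ("excited", "I got the promotion!"),
        ("pause", "But it means relocating across the country."),
        ("sad", "I'll miss everyone here."),
        ("with quiet resolve", "I'm going to make it work."),
    ]
    print(compose_script(beats))
```

Keeping beats as data makes it simple to lengthen the text of one beat, or to swap a free-form tag for a simpler one, without hunting through a long string.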
Paralanguage
Beyond emotion tags, S2-Pro supports paralanguage cues that control timing and vocal texture. These use parentheses instead of brackets:
(break) Short pause
(long-break) Extended pause
(breath) Audible inhale
(laugh) Inline laugh sound
(cough) Cough sound
(sigh) Sigh sound
(lip-smacking) Lip-smacking sound
Paralanguage cues require settings.normalize set to false. With normalization enabled (the default), the model may strip or misinterpret parenthesized tokens.
The results are in. (break) We passed every benchmark.
Let me think about this for a second. (breath) Okay. Here's what we do.
Use paralanguage when you need precise control over where a pause or sound lands in the output. Bracket tags steer the overall mood. Paralanguage cues insert specific audio events at specific positions.
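Because paralanguage cues depend on normalization being off, it can help to decide that flag from the text itself. The sketch below detects the cues listed above and builds a request body as a plain dictionary. Only the settings.normalize name comes from this guide; the "text" field, the overall body shape, and the `build_request_body` helper are assumptions for illustration, so check the API reference for the real schema before sending anything.

```python
import json
import re

# Cue names taken from the list in this section.
PARALANGUAGE_CUES = (
    "break", "long-break", "breath", "laugh", "cough", "sigh", "lip-smacking",
)
CUE_PATTERN = re.compile(
    r"\((?:" + "|".join(re.escape(cue) for cue in PARALANGUAGE_CUES) + r")\)"
)


def build_request_body(text):
    """Build a hypothetical request body, turning normalization off
    whenever the text contains a parenthesized paralanguage cue."""
    uses_cues = CUE_PATTERN.search(text) is not None
    return {
        "text": text,                               # assumed field name
        "settings": {"normalize": not uses_cues},   # name from this guide
    }


if __name__ == "__main__":
    body = build_request_body(
        "The results are in. (break) We passed every benchmark."
    )
    print(json.dumps(body, indent=2))
```

Ordinary parentheses in prose, such as "(see below)", do not match the pattern, so normalization stays on for text that has no cues.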
Pronunciation overrides
S2-Pro supports phoneme-level pronunciation control for cases where the model mispronounces a word or you need a specific reading of an ambiguous term. This is useful for homographs (words spelled the same but pronounced differently), brand names, technical jargon, and foreign loan words.
The syntax wraps a CMU Arpabet phoneme sequence between <|phoneme_start|> and <|phoneme_end|> tags, replacing the word you want to control:
I am an <|phoneme_start|>EH1 N JH AH0 N IH1 R<|phoneme_end|>.
Phoneme control requires settings.normalize set to false. Each phoneme tag replaces exactly one word. Place punctuation after the closing tag, not inside it.
The most common use case is homograph disambiguation. The word "read" has two pronunciations depending on tense:
The <|phoneme_start|>R IY1 D<|phoneme_end|> endpoint returns the current state.
The book was <|phoneme_start|>R EH1 D<|phoneme_end|> yesterday.
Technical terms and proper nouns also benefit from explicit pronunciation:
Deploy with <|phoneme_start|>K UW2 B ER0 N EH1 T IY0 Z<|phoneme_end|> for container orchestration.
CMU Arpabet uses uppercase phoneme codes with stress numbers: 0 for unstressed, 1 for primary stress, 2 for secondary stress. The full phoneme inventory is published at cmudict.symbols. You can look up any English word in the CMU Pronouncing Dictionary.
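Typing the delimiters by hand is error-prone, so a small wrapper can help. The sketch below wraps an Arpabet sequence in the start and end tags and rejects tokens that do not look like Arpabet codes. `phoneme_tag` is a hypothetical helper, and the check covers only the shape of each token (one or two uppercase letters plus an optional stress digit), not membership in the published phoneme inventory.

```python
import re

# Shape check only: "R", "EH1", "UW2" pass; "eh1", "EH3", "R," do not.
ARPABET_TOKEN = re.compile(r"[A-Z]{1,2}[012]?")


def phoneme_tag(phonemes):
    """Wrap a space-separated Arpabet sequence in phoneme delimiters."""
    tokens = phonemes.split()
    if not tokens:
        raise ValueError("phoneme sequence is empty")
    for token in tokens:
        if not ARPABET_TOKEN.fullmatch(token):
            raise ValueError(f"not an Arpabet-shaped token: {token!r}")
    return "<|phoneme_start|>" + " ".join(tokens) + "<|phoneme_end|>"


if __name__ == "__main__":
    # Punctuation stays outside the closing tag, as the guide requires.
    print(f"The book was {phoneme_tag('R EH1 D')} yesterday.")
```

Rejecting a token such as "D." catches punctuation that has slipped inside the tag before the text reaches the model.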
Tips
- Start with core tags before reaching for free-form. The built-in tags ([excited], [sad], [whisper], [laughs]) produce the most consistent results across voices. Use free-form expressions when you need nuance that the core set doesn't cover.
- Stack at most two compatible tags. [sad][whispering] works. Piling on four tags at the same position dilutes each one. If you need a complex delivery, use a single free-form tag that describes it in natural language: [speaking softly with sadness].
- Give each emotion enough text to land. A tag applied to three words doesn't have room to develop. A tag applied to a full sentence lets the model build the delivery. For transitions, write at least one complete sentence per emotional beat.
- Use paralanguage for timing, tags for mood. A (break) controls where a pause happens. An [excited] controls how the voice sounds. They solve different problems and can be used together.
- Reserve phoneme control for genuine mispronunciations. Most words don't need overrides. Use <|phoneme_start|> only when the model consistently mispronounces a specific word, like a brand name or a homograph in the wrong tense.
- Set normalize to false for paralanguage and phoneme tags. Both features rely on literal token parsing. Text normalization can interfere with parenthesized cues and phoneme delimiters.
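Two of these tips can be checked mechanically before a script is sent: the limit on stacked tags and the amount of text each tag governs. The sketch below is a rough linter under stated assumptions. `lint_script` is a hypothetical helper, and the `min_words` threshold of 4 is an arbitrary stand-in for "enough text to land", chosen only because the tip above says three words is too few.

```python
import re

# One or more bracket tags in a row, optionally separated by whitespace.
STACK_PATTERN = re.compile(r"(?:\[[^\[\]]+\]\s*)+")
TAG_PATTERN = re.compile(r"\[[^\[\]]+\]")


def lint_script(script, max_stack=2, min_words=4):
    """Return a list of warnings about tag stacking and short tagged spans."""
    warnings = []
    stacks = list(STACK_PATTERN.finditer(script))
    for index, stack in enumerate(stacks):
        tags = TAG_PATTERN.findall(stack.group(0))
        label = "".join(tags)
        if len(tags) > max_stack:
            warnings.append(f"{len(tags)} tags stacked at one position: {label}")
        # Text governed by this stack runs until the next stack or the end.
        if index + 1 < len(stacks):
            end = stacks[index + 1].start()
        else:
            end = len(script)
        word_count = len(script[stack.end():end].split())
        if word_count < min_words:
            warnings.append(f"only {word_count} word(s) after {label}")
    return warnings


if __name__ == "__main__":
    for warning in lint_script("[sad][whispering][tired] Fine."):
        print(warning)
```

Short cue-like tags such as [pause] will also trigger the word-count warning when little text follows them, so treat the output as a prompt to review the script rather than a hard rule.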