---
title: Emotion and expression control — Fish Audio S2.1 Pro | Runware Docs
url: https://runware.ai/docs/models/fish-audio-s2-1-pro/guides/emotion-and-expression
description: How to use S2-Pro's bracket tag system to control vocal delivery. Covers core tags, free-form natural language expressions, tag combining, paralanguage cues, and phoneme-level pronunciation overrides.
---
### [Introduction](https://runware.ai/docs/models/fish-audio-s2-1-pro/guides/emotion-and-expression#introduction)

S2-Pro controls vocal delivery through **bracket tags**: short instructions placed in square brackets alongside the text. You write `[excited]` before a line and the model reads it with energy. You write `[whispering]` and the voice drops. Any descriptive phrase inside brackets works as a direction, from single keywords like `[laughs]` to full descriptions like `[laughing nervously while trying to keep composure]`.

[Listen to audio](https://runware.ai/docs/assets/hero.DaguKqDy.mp3)

> **Prompt**: You know what I love about this city? [excited] The food scene is unreal. [sigh] But the rent... the rent is something else entirely.

That sample uses an `[excited]` tag and a `[sigh]` cue in one passage. Each tag steers the voice for the text that follows it, and the model shifts naturally between them. This guide covers the full tag system, free-form expressions, paralanguage cues for pacing, and phoneme-level pronunciation control.

### [How bracket tags work](https://runware.ai/docs/models/fish-audio-s2-1-pro/guides/emotion-and-expression#how-bracket-tags-work)

A bracket tag is a word or phrase enclosed in `[square brackets]`, placed before the text it applies to. The model reads the tag, adjusts its delivery accordingly, then speaks everything that follows **until it hits the next tag** or the end of the input.

```text
[instruction] Text to speak in that style.
```
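
The scoping rule above (a tag applies until the next tag or the end of the input) can be sketched as a small parser. This is an illustrative helper for inspecting your own prompts, not part of the S2-Pro API; the name `split_by_tag` is hypothetical, and for stacked tags such as `[sad][whispering]` this simplified version keeps only the last tag of the stack.

```python
import re
from typing import List, Optional, Tuple

# Matches one bracket tag such as [excited] or [laughing nervously].
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_by_tag(text: str) -> List[Tuple[Optional[str], str]]:
    """Return (tag, spoken_text) pairs; tag is None for text before the first tag."""
    segments = []
    current_tag = None
    position = 0
    for match in TAG_PATTERN.finditer(text):
        spoken = text[position:match.start()].strip()
        if spoken:
            segments.append((current_tag, spoken))
        # Everything after this tag belongs to it until the next tag appears.
        current_tag = match.group(1).strip()
        position = match.end()
    tail = text[position:].strip()
    if tail:
        segments.append((current_tag, tail))
    return segments


for tag, spoken in split_by_tag("Hello there. [excited] We won! [sad] But he left."):
    print(tag, "->", spoken)
```

Printing the segments before sending a request is a quick way to confirm that each tag covers the text you intended.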

Without any tags, S2-Pro reads text in a neutral tone. Adding a tag transforms the delivery. Compare the same line spoken flat versus with an `[excited]` tag:

[Listen to audio](https://runware.ai/docs/assets/compare-plain.di7zTKOK.mp3)

*No tag*

> **Prompt**: I just found out I got the job. I can't believe it.

[Listen to audio](https://runware.ai/docs/assets/compare-tagged.B7-LVzu3.mp3)

*With [excited]*

> **Prompt**: [excited] I just found out I got the job. I can't believe it!

The tagged version **carries energy and pacing that match the content**. The flat version reads the words correctly but **misses the emotional context**.

Here is a complete API request that uses two tags in one passage:

**Request**:

```json
[
  {
    "taskType": "audioInference",
    "taskUUID": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "model": "fishaudio:s2.1@pro",
    "speech": {
      "text": "[excited] The quarterly numbers are in and they're outstanding. [laughs] Even the finance team was smiling.",
      "voice": "933563129e564b19a115bedd57b7406a"
    }
  }
]
```

**Response**:

```json
{
  "data": [
    {
      "taskType": "audioInference",
      "taskUUID": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "audioUUID": "f1e2d3c4-b5a6-7890-1234-567890abcdef",
      "audioURL": "https://am.runware.ai/audio/os/a14d18/ws/2/ai/f1e2d3c4-b5a6-7890-1234-567890abcdef.mp3"
    }
  ]
}
```

[Listen to audio](https://runware.ai/docs/assets/request-demo.ChuV076-.mp3)

> **Prompt**: [excited] The quarterly numbers are in and they're outstanding. [laughs] Even the finance team was smiling.
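
A minimal Python sketch of assembling that request body is shown here. It only mirrors the fields that appear in the JSON sample; the helper names `build_speech_task` and `build_request_body` are hypothetical, and sending the request (endpoint, authentication, transport) is deliberately left out because this guide does not cover it.

```python
import json
import uuid


def build_speech_task(text: str, voice: str, model: str = "fishaudio:s2.1@pro") -> dict:
    """Build one audioInference task shaped like the request sample in this guide."""
    return {
        "taskType": "audioInference",
        "taskUUID": str(uuid.uuid4()),  # fresh identifier for each task
        "model": model,
        "speech": {"text": text, "voice": voice},
    }


def build_request_body(tasks: list) -> str:
    """Serialize the tasks as a JSON array, matching the sample request format."""
    return json.dumps(tasks, ensure_ascii=False, indent=2)


task = build_speech_task(
    "[excited] The quarterly numbers are in and they're outstanding.",
    voice="933563129e564b19a115bedd57b7406a",
)
print(build_request_body([task]))
```

Because bracket tags are plain text inside `speech.text`, no escaping beyond normal JSON string encoding is needed.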

### [Core tags](https://runware.ai/docs/models/fish-audio-s2-1-pro/guides/emotion-and-expression#core-tags)

S2-Pro ships with a set of built-in tags that work reliably across all voices. These cover the most common delivery modes:

- **Paralanguage:** `[whisper]`, `[laugh]`, `[emphasis]`, `[sigh]`, `[gasp]`, `[pause]`
- **Emotions:** `[angry]`, `[excited]`, `[sad]`, `[surprised]`
- **Breath cues:** `[inhale]`, `[exhale]`

[Listen to audio](https://runware.ai/docs/assets/tag-excited.DPvAu5ga.mp3)

*[excited]*

> **Prompt**: [excited] We just crossed a million active users. A million!

[Listen to audio](https://runware.ai/docs/assets/tag-whisper.xuM5qOPV.mp3)

*[whispering]*

> **Prompt**: [whispering] I wasn't supposed to tell you this, but the deal closed yesterday.

[Listen to audio](https://runware.ai/docs/assets/tag-angry.CBinvAs3.mp3)

*[angry]*

> **Prompt**: [angry] That is the third time this week. I specifically asked for this to be fixed by Monday.

[Listen to audio](https://runware.ai/docs/assets/tag-sad.bqOYGvak.mp3)

*[sad]*

> **Prompt**: [sad] I tried everything I could. It wasn't enough.

[Listen to audio](https://runware.ai/docs/assets/tag-laughs.BX084w1f.mp3)

*[laughs] mid-sentence*

> **Prompt**: And then she just walked out of the meeting. [laughs] Nobody said a word.

A tag at the start of a line **shapes the entire line**. A tag placed mid-sentence, like `[laughs]` after a punchline, **creates a momentary shift at that point**. Both placements work.

### [Free-form natural language expressions](https://runware.ai/docs/models/fish-audio-s2-1-pro/guides/emotion-and-expression#free-form-natural-language-expressions)

The bracket system goes beyond the core tags above. S2-Pro accepts **any descriptive phrase** inside brackets. If you can describe how you want the line delivered, the model attempts to perform it:

```text
[whispers sweetly]
[laughing nervously]
[with gentle warmth]
[speaking through tears]
[as if confiding a secret]
```

[Listen to audio](https://runware.ai/docs/assets/freeform-whispers-sweetly.CiW3dFNf.mp3)

*[whispers sweetly]*

> **Prompt**: [whispers sweetly] Close your eyes. I have a surprise for you.

[Listen to audio](https://runware.ai/docs/assets/freeform-laughing-nervously.pOFwDgpC.mp3)

*[laughing nervously]*

> **Prompt**: [laughing nervously] Yeah, I totally knew that was going to happen. Definitely. For sure.

[Listen to audio](https://runware.ai/docs/assets/freeform-with-gentle-warmth.WVCUJ5X7.mp3)

*[with gentle warmth]*

> **Prompt**: [with gentle warmth] Take your time. There's no rush at all.

Free-form tags give you **more control than single-keyword emotions**. `[whispers]` drops volume. `[whispers sweetly]` drops volume and adds warmth. The additional context in the tag **shapes the nuance**.

> [!NOTE]
> Free-form tags are more expressive but less predictable than the core set. If a specific tag produces inconsistent results across regenerations, fall back to a simpler phrasing or one of the built-in tags.
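
To see which tags in a prompt fall outside the core set, and are therefore the first candidates to simplify when output is inconsistent, a small check like the following may help. The core set is copied from the list in this guide; treating the sample spellings `[whispering]` and `[laughs]` as core is an assumption, and `free_form_tags` is a hypothetical helper name.

```python
import re

# Core tags as listed in this guide, plus the variant spellings its samples use
# (assumption: "[whispering]" and "[laughs]" behave like their listed forms).
CORE_TAGS = {
    "whisper", "whispering", "laugh", "laughs", "emphasis", "sigh", "gasp",
    "pause", "angry", "excited", "sad", "surprised", "inhale", "exhale",
}

TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def free_form_tags(text: str) -> list:
    """Return the tags in text that are not in CORE_TAGS, in order of appearance."""
    found = []
    for match in TAG_PATTERN.finditer(text):
        tag = match.group(1).strip().lower()
        if tag not in CORE_TAGS and tag not in found:
            found.append(tag)
    return found


print(free_form_tags("[excited] I got it! [with quiet resolve] I'll make it work."))
```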

### [Combining tags](https://runware.ai/docs/models/fish-audio-s2-1-pro/guides/emotion-and-expression#combining-tags)

You can stack multiple tags at the same point or apply them sequentially across sentences. Compatible pairs compound naturally:

[Listen to audio](https://runware.ai/docs/assets/combo-sad-whisper.0K5Ebv70.mp3)

*[sad] + [whispering]*

> **Prompt**: [sad][whispering] I don't think he's coming back this time.

[Listen to audio](https://runware.ai/docs/assets/combo-excited-laughs.BSnpDmhG.mp3)

*[excited] leading into [laughs]*

> **Prompt**: [excited] We actually did it! [laughs] I told you it would work!

**Compatible pairs** like `[sad]` + `[whispering]` or `[excited]` + `[laughs]` work because they describe deliveries a person could plausibly perform at the same time. **Conflicting pairs** like `[whispering]` + `[shouting]` send the model in two directions at once and produce unstable output. If the delivery you want combines mood and manner, stack them. If the two tags fight each other physically, pick one.
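
Whether two tags are compatible is a judgment call that code cannot make, but the size of each stack is easy to check. The sketch below groups adjacent tags into stacks and flags any stack larger than two, the limit suggested in the tips at the end of this guide. The helper names are hypothetical.

```python
import re

# One or more bracket tags with only whitespace between them count as a stack.
STACK_PATTERN = re.compile(r"(?:\[[^\[\]]+\]\s*)+")
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def tag_stacks(text: str) -> list:
    """Return each run of adjacent tags as a list of tag names."""
    return [TAG_PATTERN.findall(run.group(0)) for run in STACK_PATTERN.finditer(text)]


def oversized_stacks(text: str, limit: int = 2) -> list:
    """Return the stacks that hold more tags than limit."""
    return [stack for stack in tag_stacks(text) if len(stack) > limit]


print(tag_stacks("[sad][whispering] I don't think he's coming back this time."))
print(oversized_stacks("[sad][tired][whispering] I don't think he's coming back."))
```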

### [Emotion transitions](https://runware.ai/docs/models/fish-audio-s2-1-pro/guides/emotion-and-expression#emotion-transitions)

For longer passages, tags work as **scene directions** that shift the read across sentences. Each new tag marks the point where the delivery changes, which lets a single passage move between contrasting emotions:

[Listen to audio](https://runware.ai/docs/assets/transition-arc.ohCbnVdC.mp3)

> **Prompt**: [excited] I got the promotion! [pause] But it means relocating across the country. [sad] I'll miss everyone here. [with quiet resolve] I'm going to make it work.

Each tag **resets the delivery** for the text that follows. The model **doesn't blend one tag into the next automatically**. The transition happens at the boundary. For a smooth arc, space the emotional shifts across enough text that each one has room to land.
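
One way to check that each emotional beat has room to land is to count the words that follow every tag. The sketch below flags tags followed by very little text. The threshold of four words is an assumption chosen for illustration, not a documented limit, and `short_beats` is a hypothetical helper name.

```python
import re

TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def short_beats(text: str, min_words: int = 4) -> list:
    """Return (tag, word_count) for tags followed by fewer than min_words words."""
    matches = list(TAG_PATTERN.finditer(text))
    flagged = []
    for index, match in enumerate(matches):
        # The tag's text runs up to the next tag, or to the end of the input.
        end = matches[index + 1].start() if index + 1 < len(matches) else len(text)
        word_count = len(text[match.end():end].split())
        # Zero words means the tag is stacked or trailing, so it is skipped here.
        if 0 < word_count < min_words:
            flagged.append((match.group(1).strip(), word_count))
    return flagged


print(short_beats("[sad] Oh no. [excited] We actually did it today!"))
```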

### [Paralanguage](https://runware.ai/docs/models/fish-audio-s2-1-pro/guides/emotion-and-expression#paralanguage)

Beyond emotion tags, S2-Pro supports **paralanguage cues** that control timing and vocal texture. These use parentheses instead of brackets:

```text
(break)         Short pause
(long-break)    Extended pause
(breath)        Audible inhale
(laugh)         Inline laugh sound
(cough)         Cough sound
(sigh)          Sigh sound
(lip-smacking)  Lip-smacking sound
```

> [!WARNING]
> Paralanguage cues require `settings.normalize` set to `false`. With normalization enabled (the default), the model may strip or misinterpret parenthesized tokens.

[Listen to audio](https://runware.ai/docs/assets/paralanguage-break.DD7R_6xA.mp3)

*(break) between sentences*

> **Prompt**: The results are in. (break) We passed every benchmark.

[Listen to audio](https://runware.ai/docs/assets/paralanguage-breath.DWPDY4jY.mp3)

*(breath) for a natural inhale*

> **Prompt**: Let me think about this for a second. (breath) Okay. Here's what we do.

Use paralanguage when you need precise control over **where a pause or sound lands** in the output. Bracket tags steer the overall mood. Paralanguage cues insert specific audio events at specific positions.
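
Because these cues depend on `settings.normalize` being `false`, it can help to detect them before building a request. The sketch below finds the cues listed above and reports whether the flag is needed. Where `settings` sits in the request body is not shown in this guide, so the code only returns a boolean; the helper names are hypothetical.

```python
import re

# Cue names copied from the paralanguage list in this guide.
PARALANGUAGE_CUES = ("break", "long-break", "breath", "laugh", "cough", "sigh", "lip-smacking")

CUE_PATTERN = re.compile(
    r"\((" + "|".join(re.escape(cue) for cue in PARALANGUAGE_CUES) + r")\)"
)


def find_cues(text: str) -> list:
    """Return (cue, character_offset) for every paralanguage cue in text."""
    return [(match.group(1), match.start()) for match in CUE_PATTERN.finditer(text)]


def needs_normalize_off(text: str) -> bool:
    """True when text contains a cue that the guide says requires normalize=false."""
    return bool(find_cues(text))


prompt = "The results are in. (break) We passed every benchmark."
print(find_cues(prompt), needs_normalize_off(prompt))
```

Ordinary parenthetical asides are not matched, since only the listed cue names count.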

### [Pronunciation overrides](https://runware.ai/docs/models/fish-audio-s2-1-pro/guides/emotion-and-expression#pronunciation-overrides)

S2-Pro supports **phoneme-level pronunciation control** for cases where the model mispronounces a word or you need a specific reading of an ambiguous term. This is useful for homographs (words spelled the same but pronounced differently), brand names, technical jargon, and foreign loan words.

The syntax wraps a CMU Arpabet phoneme sequence between `<|phoneme_start|>` and `<|phoneme_end|>` tags, replacing the word you want to control:

```text
I am an <|phoneme_start|>EH1 N JH AH0 N IH1 R<|phoneme_end|>.
```

> [!WARNING]
> Phoneme control requires `settings.normalize` set to `false`. Each phoneme tag replaces exactly one word. Place punctuation after the closing tag, not inside it.

The most common use case is **homograph disambiguation**. The word "read" has two pronunciations depending on tense:

[Listen to audio](https://runware.ai/docs/assets/phoneme-read-verb.CHCvprUr.mp3)

*"read" as /riːd/ (present tense)*

> **Prompt**: The <|phoneme_start|>R IY1 D<|phoneme_end|> endpoint returns the current state.

[Listen to audio](https://runware.ai/docs/assets/phoneme-read-past.6g7QwyM_.mp3)

*"read" as /rɛd/ (past tense)*

> **Prompt**: The book was <|phoneme_start|>R EH1 D<|phoneme_end|> yesterday.

Technical terms and proper nouns also benefit from explicit pronunciation:

[Listen to audio](https://runware.ai/docs/assets/phoneme-kubernetes.BcKILhW4.mp3)

*Kubernetes with correct stress pattern*

> **Prompt**: Deploy with <|phoneme_start|>K UW2 B ER0 N EH1 T IY0 Z<|phoneme_end|> for container orchestration.

CMU Arpabet uses **uppercase phoneme codes with stress numbers**: `0` for unstressed, `1` for primary stress, `2` for secondary stress. The full phoneme inventory is published at [cmudict.symbols](https://github.com/cmusphinx/cmudict/blob/master/cmudict.symbols). You can look up any English word in the [CMU Pronouncing Dictionary](http://www.speech.cs.cmu.edu/cgi-bin/cmudict).

### [Tips](https://runware.ai/docs/models/fish-audio-s2-1-pro/guides/emotion-and-expression#tips)

1. **Start with core tags before reaching for free-form.** The built-in tags (`[excited]`, `[sad]`, `[whisper]`, `[laughs]`) produce the most consistent results across voices. Use free-form expressions when you need nuance that the core set doesn't cover.
    
2. **Stack at most two compatible tags.** `[sad][whispering]` works. Piling on four tags at the same position dilutes each one. If you need a complex delivery, use a single free-form tag that describes it in natural language: `[speaking softly with sadness]`.
    
3. **Give each emotion enough text to land.** A tag applied to three words doesn't have room to develop. A tag applied to a full sentence lets the model build the delivery. For transitions, write at least one complete sentence per emotional beat.
    
4. **Use paralanguage for timing, tags for mood.** A `(break)` controls where a pause happens. An `[excited]` controls how the voice sounds. They solve different problems and can be used together.
    
5. **Reserve phoneme control for genuine mispronunciations.** Most words don't need overrides. Use `<|phoneme_start|>` only when the model consistently mispronounces a specific word, like a brand name or a homograph in the wrong tense.
    
6. **Set `normalize` to `false` for paralanguage and phoneme tags.** Both features rely on literal token parsing. Text normalization can interfere with parenthesized cues and phoneme delimiters.