MODEL ID: fishaudio:s2.1@pro

live

Fish Audio S2.1 Pro

by Fish Audio, June 1, 2026

Fish Audio S2.1 Pro is a flagship text-to-speech model built for highly expressive, low-latency speech generation. It supports natural-language bracket cues for emotion and delivery control, multi-speaker dialogue in a single generation, 80+ languages with automatic language detection, and realtime streaming with very fast time to first audio.

Emotion and expression control

How to control vocal delivery in Fish Audio S2-Pro with bracket tags. The tag system steers emotion, expression, paralanguage, and phoneme-level pronunciation in one inline syntax.

Introduction

S2-Pro controls vocal delivery through bracket tags: short instructions placed in square brackets alongside the text. You write [excited] before a line and the model reads it with energy. You write [whispering] and the voice drops. Any descriptive phrase inside brackets works as a direction, from single keywords like [laughs] to full descriptions like [laughing nervously while trying to keep composure].


You know what I love about this city? [excited] The food scene is unreal. [sigh] But the rent... the rent is something else entirely.

That sample uses two tags in one passage: an [excited] emotion tag and a [sigh] cue. Each tag steers the voice for the text that follows it, and the model shifts naturally between them. This guide covers the full tag system, free-form expressions, paralanguage cues for pacing, and phoneme-level pronunciation control.

How bracket tags work

A bracket tag is a word or phrase enclosed in [square brackets], placed before the text it applies to. The model reads the tag, adjusts its delivery accordingly, then speaks everything that follows until it hits the next tag or the end of the input.

[instruction] Text to speak in that style.
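The scoping rule above (a tag applies to everything after it, up to the next tag or the end of the input) can be sketched as a small parser. This is only an illustration of the rule as described here, not how the model itself processes input; `split_tagged_text` and `TAG_PATTERN` are hypothetical names, and the sketch assumes tags never nest.

```python
import re
from typing import List, Optional, Tuple

# Assumption: a tag is any run of non-bracket characters inside [ ].
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_tagged_text(text: str) -> List[Tuple[Optional[str], str]]:
    """Split input into (tag, text) segments.

    Text before the first tag gets a tag of None (neutral delivery).
    A tag with no following text yields an empty string for its text.
    """
    segments: List[Tuple[Optional[str], str]] = []
    current_tag: Optional[str] = None
    position = 0
    for match in TAG_PATTERN.finditer(text):
        chunk = text[position:match.start()].strip()
        # Keep the segment if it has text or belongs to an explicit tag.
        if chunk or current_tag is not None:
            segments.append((current_tag, chunk))
        current_tag = match.group(1).strip()
        position = match.end()
    tail = text[position:].strip()
    if tail or current_tag is not None:
        segments.append((current_tag, tail))
    return segments


sample = (
    "You know what I love about this city? [excited] The food scene is "
    "unreal. [sigh] But the rent... the rent is something else entirely."
)
for tag, spoken in split_tagged_text(sample):
    print(f"{tag or 'neutral'}: {spoken}")
```

A splitter like this can be handy for previewing which direction covers which sentence before sending a long script for generation.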

Without any tags, S2-Pro reads text in a neutral tone. Adding a tag transforms the delivery. Compare the same line spoken flat versus with an [excited] tag:


I just found out I got the job. I can't believe it.

No tag


[excited] I just found out I got the job. I can't believe it!

With [excited]

The tagged version carries energy and pacing that match the content. The flat version reads the words correctly but misses the emotional context.
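To run the same flat-versus-tagged comparison on your own lines, a small helper can produce both variants from one source string. This is a minimal sketch; `with_tag` is a hypothetical helper and not part of any SDK.

```python
from typing import Optional


def with_tag(text: str, tag: Optional[str] = None) -> str:
    """Return text unchanged, or prefixed with a [tag] direction."""
    if tag is None:
        return text
    cleaned = tag.strip()
    # Reject empty tags and stray brackets, which would break the syntax.
    if not cleaned or "[" in cleaned or "]" in cleaned:
        raise ValueError(f"invalid tag: {tag!r}")
    return f"[{cleaned}] {text}"


line = "I just found out I got the job. I can't believe it!"
flat_variant = with_tag(line)
tagged_variant = with_tag(line, "excited")
print(flat_variant)
print(tagged_variant)
```

Generating both variants from the same string keeps the wording identical, so any difference you hear comes from the tag alone.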

Here is a full API request that combines an [excited] tag with a [laughs] cue:

import { createClient } from '@runware/sdk'

const client = await createClient({ apiKey: process.env.RUNWARE_API_KEY })
await client.connect()

const [result] = await client.run({
  model: 'fishaudio:s2.1@pro',
  speech: {
    text: '[excited] The quarterly numbers are in and they\'re outstanding. [laughs] Even the finance team was smiling.',
    voice: '933563129e564b19a115bedd57b7406a'
  }
})

import asyncio
import os

from runware import Runware


async def main():
    async with Runware(api_key=os.environ["RUNWARE_API_KEY"]) as client:
        results = await client.run({
            "model": "fishaudio:s2.1@pro",
            "speech": {
                "text": "[excited] The quarterly numbers are in and they're outstanding. [laughs] Even the finance team was smiling.",
                "voice": "933563129e564b19a115bedd57b7406a"
            }
        })


asyncio.run(main())

curl https://api.runware.ai/v1 \
  -H "Authorization: Bearer $RUNWARE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '[
    {
      "taskType": "audioInference",
      "taskUUID": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "model": "fishaudio:s2.1@pro",
      "speech": {
        "text": "[excited] The quarterly numbers are in and they'\''re outstanding. [laughs] Even the finance team was smiling.",
        "voice": "933563129e564b19a115bedd57b7406a"
      }
    }
  ]'

runware run fishaudio:s2.1@pro \
  speech.text="[excited] The quarterly numbers are in and they're outstanding. [laughs] Even the finance team was smiling." \
  speech.voice=933563129e564b19a115bedd57b7406a

{
  "taskType": "audioInference",
  "taskUUID": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "model": "fishaudio:s2.1@pro",
  "speech": {
    "text": "[excited] The quarterly numbers are in and they're outstanding. [laughs] Even the finance team was smiling.",
    "voice": "933563129e564b19a115bedd57b7406a"
  }
}
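When building this payload in code, it helps to generate a fresh task UUID per request and let a JSON serializer handle quoting, since apostrophes in the text are easy to mis-escape by hand. The sketch below mirrors only the fields shown in the payload above; `build_speech_task` is a hypothetical helper, and wrapping the task in a list follows the array shape of the cURL example.

```python
import json
import uuid


def build_speech_task(
    text: str,
    voice: str,
    model: str = "fishaudio:s2.1@pro",
) -> dict:
    """Build one audioInference task with a new random taskUUID."""
    return {
        "taskType": "audioInference",
        "taskUUID": str(uuid.uuid4()),
        "model": model,
        "speech": {"text": text, "voice": voice},
    }


task = build_speech_task(
    text=(
        "[excited] The quarterly numbers are in and they're outstanding. "
        "[laughs] Even the finance team was smiling."
    ),
    voice="933563129e564b19a115bedd57b7406a",
)
# Assumption: the HTTP endpoint takes a JSON array of tasks.
request_body = json.dumps([task], indent=2)
print(request_body)
```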

Response

{
  "data": [
    {
      "taskType": "audioInference",
      "taskUUID": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "audioUUID": "f1e2d3c4-b5a6-7890-1234-567890abcdef",
      "audioURL": "https://am.runware.ai/audio/os/a14d18/ws/2/ai/f1e2d3c4-b5a6-7890-1234-567890abcdef.mp3"
    }
  ]
}
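The generated audio is referenced by `audioURL` inside each entry of `data`. A minimal sketch of pulling those URLs out of a decoded response follows; it assumes only the response shape shown above, and `extract_audio_urls` is a hypothetical helper.

```python
from typing import Any, Dict, List


def extract_audio_urls(response: Dict[str, Any]) -> List[str]:
    """Collect audioURL values from audioInference entries in `data`."""
    return [
        item["audioURL"]
        for item in response.get("data", [])
        if item.get("taskType") == "audioInference" and "audioURL" in item
    ]


example_response = {
    "data": [
        {
            "taskType": "audioInference",
            "taskUUID": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
            "audioUUID": "f1e2d3c4-b5a6-7890-1234-567890abcdef",
            "audioURL": "https://am.runware.ai/audio/os/a14d18/ws/2/ai/f1e2d3c4-b5a6-7890-1234-567890abcdef.mp3",
        }
    ]
}
for url in extract_audio_urls(example_response):
    print(url)
```

Entries without an `audioURL` are skipped rather than raising, so the helper tolerates an empty or partial `data` list.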


[excited] The quarterly numbers are in and they're outstanding. [laughs] Even the finance team was smiling.

Core tags

S2-Pro ships with a set of built-in tags that work reliably across all voices. These cover the most common delivery modes: