Fish Audio S2.1 Pro
Fish Audio S2.1 Pro is a flagship text-to-speech model built for highly expressive, low-latency speech generation. It supports natural-language bracket cues for emotion and delivery control, multi-speaker dialogue in a single generation, more than 80 languages with automatic language detection, and real-time streaming with a short time to first audio.
- Complete technical specification for integration
- Ready-to-use code snippets for common workflows
- Step-by-step tutorials for advanced use cases
- Emotion and expression control: How to control vocal delivery in Fish Audio S2-Pro with bracket tags. The tag system steers emotion, expression, paralanguage, and phoneme-level pronunciation in one inline syntax.
- Multi-speaker dialogue: How to generate two-speaker dialogue audio in a single request to Fish Audio S2-Pro using inline speaker tags. One call, two voices, full per-speaker emotion control.
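
As a rough illustration of the inline syntax described in the two guides above, the following sketch assembles input text with bracket cues and per-turn speaker tags. It only builds a string and does not call any Fish Audio API. The helper names `with_cue` and `build_dialogue`, the cue words, and the `<speaker {id}>` tag format are all assumptions made for illustration; the actual cue vocabulary and speaker-tag syntax are defined in the linked guides and may differ.

```python
from typing import Iterable, Optional, Tuple

# One dialogue turn: (speaker id, optional bracket cue, spoken text).
Turn = Tuple[int, Optional[str], str]


def with_cue(cue: str, text: str) -> str:
    """Prefix text with a bracket cue, e.g. "[whispering] Hello"."""
    cue = cue.strip()
    if not cue or "[" in cue or "]" in cue:
        raise ValueError("cue must be non-empty and contain no brackets")
    return f"[{cue}] {text.strip()}"


def build_dialogue(turns: Iterable[Turn],
                   speaker_tag: str = "<speaker {id}>") -> str:
    """Join turns into one input string, one line per turn.

    The default speaker_tag is a hypothetical placeholder, not the
    documented tag format; pass the real template from the guide.
    """
    lines = []
    for speaker_id, cue, text in turns:
        body = with_cue(cue, text) if cue else text.strip()
        lines.append(f"{speaker_tag.format(id=speaker_id)} {body}")
    return "\n".join(lines)


if __name__ == "__main__":
    script = build_dialogue([
        (0, "excited", "We shipped it!"),
        (1, None, "Finally."),
    ])
    print(script)
```

Keeping the speaker-tag template as a parameter means the same helper can be reused once the real tag format is confirmed, without touching the cue logic.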