Fish Audio S2.1 Pro

Fish Audio S2.1 Pro is a flagship text-to-speech model built for highly expressive, low-latency speech generation. It supports natural-language bracket cues for emotion and delivery control, multi-speaker dialogue in a single generation, 80+ languages with automatic language detection, and real-time streaming with a very short time to first audio.

Complete technical specification for integration
Ready-to-use code snippets for common workflows
Step-by-step tutorials for advanced use cases
Emotion and expression control: How to control vocal delivery in Fish Audio S2-Pro with bracket tags. The tag system steers emotion, expression, paralanguage, and phoneme-level pronunciation in one inline syntax.
Multi-speaker dialogue: How to generate two-speaker dialogue audio in a single request to Fish Audio S2-Pro using inline speaker tags. One call, two voices, full per-speaker emotion control.
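
As a rough illustration of the inline bracket cues mentioned above, the sketch below assembles input text with cues placed in square brackets ahead of each segment. It is plain string handling and does not call the Fish Audio API. The helper names (`with_cue`, `build_script`), the cue words, and the placement of the cue before the text are all assumptions made for illustration; the emotion and expression guide is the authority on which cues are supported and where they go.

```python
from typing import Iterable, Optional, Tuple


def with_cue(text: str, cue: str) -> str:
    """Prefix a text segment with a bracket cue, e.g. "[whispering] Hello."

    Hypothetical helper: the cue word and its position before the text
    are illustrative assumptions, not confirmed model syntax.
    """
    cue = cue.strip()
    if not cue or "[" in cue or "]" in cue:
        raise ValueError("cue must be non-empty and contain no brackets")
    return f"[{cue}] {text.strip()}"


def build_script(segments: Iterable[Tuple[str, Optional[str]]]) -> str:
    """Join (text, cue) pairs into one input string.

    A cue of None leaves the segment untagged.
    """
    parts = []
    for text, cue in segments:
        parts.append(with_cue(text, cue) if cue else text.strip())
    return " ".join(parts)


script = build_script([
    ("Welcome back.", None),
    ("I did not expect to see you here.", "surprised"),
])
print(script)
```

Keeping cue insertion in one small function makes it easy to adjust if the documented tag syntax differs from what is assumed here.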