Fish Audio S2.1 Pro
Fish Audio S2.1 Pro is a flagship text-to-speech model built for highly expressive, low-latency speech generation. It supports natural-language bracket cues for emotion and delivery control, multi-speaker dialogue in a single generation, 80+ languages with automatic language detection, and real-time streaming with a short time to first audio.
Complete technical specification for integration
Step-by-step tutorials for advanced use cases
- Emotion and expression control: How to use S2-Pro's bracket tag system to control vocal delivery. Covers core tags, free-form natural language expressions, tag combining, paralanguage cues, and phoneme-level pronunciation overrides.
- Multi-speaker dialogue: How to generate two-speaker audio in a single request using S2-Pro's inline speaker tags. Covers speaker tag syntax, voice mapping, emotion control per speaker, and practical dialogue patterns.