Inworld Realtime TTS-2

Inworld Realtime TTS-2 is a conversational text-to-speech model built for realtime voice interaction rather than static narration. It supports free-form voice direction, carries tone and pacing forward from prior audio in a session, preserves one voice identity across 100+ languages, and is designed for expressive, low-latency speech in assistants, characters, support agents, and interactive products.

Complete technical specification for integration
Ready-to-use code snippets for common workflows
Step-by-step tutorials for advanced use cases
Formatting LLM output for speech: How to write LLM system prompts that produce text TTS-2 can synthesize naturally, with normalization, filler words, and emphasis cues handled before the audio call.
Controlling voice delivery with steering tags: How to use natural-language steering tags to control emotion, pacing, volume, and vocal style in TTS-2 speech output.
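
As a rough illustration of the "normalization handled before the audio call" idea in the speech-formatting guide listed above, the sketch below cleans LLM output before it would be handed to a synthesis request. It is not taken from the TTS-2 documentation: the function name `normalize_for_speech` and the `SYMBOL_WORDS` table are hypothetical, and the real guide may recommend different rules or handle this in the system prompt instead.

```python
import re

# Hypothetical symbol expansions chosen for illustration only;
# they are not part of any documented TTS-2 behavior.
SYMBOL_WORDS = {"%": " percent", "&": " and ", "@": " at "}


def normalize_for_speech(text: str) -> str:
    """Strip markdown residue and spell out a few symbols before synthesis."""
    # Replace markdown links of the form [label](url) with just the label.
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)
    # Drop emphasis markers, backticks, and heading hashes.
    # Note: this also removes underscores inside identifiers.
    text = re.sub(r"[*_`#]+", "", text)
    # Spell out symbols that are awkward to read aloud.
    for symbol, word in SYMBOL_WORDS.items():
        text = text.replace(symbol, word)
    # Collapse the extra whitespace introduced by the steps above.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    raw = "**Great news!** Your order is 50% done & ships soon."
    print(normalize_for_speech(raw))
```

A post-processing step like this is only a fallback. Instructing the LLM to emit speech-ready text in the first place, as the guide describes, avoids most of the cleanup.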