
Inworld Realtime TTS-2
Conversational text-to-speech with realtime voice direction and audio-aware delivery
Inworld Realtime TTS-2 Overview
Inworld Realtime TTS-2 is a conversational text-to-speech model built for realtime voice interaction rather than static narration. It supports free-form voice direction, carries tone and pacing forward from prior audio in a session, preserves one voice identity across 100+ languages, and is designed for expressive, low-latency speech in assistants, characters, support agents, and interactive products.
How to Use Inworld Realtime TTS-2
Overview
Inworld Realtime TTS-2 is a conversational text-to-speech model designed for live voice interaction rather than one-shot narration.
It is built for applications where the voice needs to react to context, preserve identity across languages, follow natural-language delivery direction, and sound like a person participating in the exchange instead of reading a script in isolation.
Strengths
Built for Realtime Conversation
Realtime TTS-2 is designed around conversational use cases where the model speaks in the flow of an exchange. It is a stronger fit for agents, companions, characters, and support experiences than models optimized mainly for narration or voiceover.
Free-Form Voice Direction
The model can take delivery cues in natural language, which makes it possible to steer pacing, emotion, emphasis, and style without being limited to a small preset list. This is useful when the same voice needs different reads across different contexts.
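As a sketch of how the same line can get different reads from natural-language direction: the snippet below assembles two request payloads that differ only in their delivery cue. The field names (`voice_id`, `text`, `direction`) and the helper are illustrative assumptions, not the official request schema.

```python
# Hypothetical payload builder for a Realtime TTS-2 request. Field names
# ("voice_id", "text", "direction") are assumptions for illustration only.
def build_tts_request(text, voice_id, direction=None):
    """Assemble a synthesis request with an optional free-form delivery cue."""
    payload = {"model": "inworld:tts@2", "voice_id": voice_id, "text": text}
    if direction:
        # Natural-language direction instead of a preset style enum.
        payload["direction"] = direction
    return payload

# Same text, same voice, two different reads:
calm = build_tts_request("Your order is on its way.", "ava",
                         direction="calm and reassuring, slower pacing")
urgent = build_tts_request("Your order is on its way.", "ava",
                           direction="excited, faster, emphasis on 'way'")
```

Because the cue is free text, the steering vocabulary is open-ended rather than limited to a fixed list of emotions or styles.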
Conversational Awareness
In realtime sessions, the model conditions on prior audio from the exchange, not only the text of the current line. That helps it carry forward tone, pacing, and emotional state in a more context-aware way.
Crosslingual Voice Identity
Realtime TTS-2 preserves a single voice identity across more than 100 languages, including language switches inside the same generation. This makes it useful for multilingual assistants, teachers, and characters that need to stay recognizably the same speaker.
Expressive Delivery Tools
The model supports inline non-verbal markers such as laughs, sighs, and breaths, along with disfluency patterns and delivery shaping that help speech feel less robotic and more conversational.
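One way to work with inline non-verbal markers programmatically is to splice them into the text before synthesis. The bracket syntax below (`[laughs]`, `[sighs]`, `[breathes]`) is an assumption for illustration; check the model's documentation for the exact marker format it accepts.

```python
# Assumed marker spellings; substitute the documented tokens if they differ.
NONVERBAL = {"laugh": "[laughs]", "sigh": "[sighs]", "breath": "[breathes]"}

def with_marker(text, marker, position=0):
    """Insert a non-verbal marker at a word-index position in the text."""
    words = text.split()
    words.insert(position, NONVERBAL[marker])
    return " ".join(words)

# A laugh leading into the line makes the delivery read as amused:
line = with_marker("okay, that actually went better than I expected.", "laugh")
```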
Voice Design and Cloning Compatibility
Realtime TTS-2 works with reusable voices created through prompt-based voice design and cloned voices, which makes it easier to pair the model with custom speaker identities in production systems.
Capabilities
Text-to-Audio
Realtime TTS-2 generates speech from text and is optimized for responsive, expressive output in conversational products.
Realtime Session Output
The model is designed to operate in realtime systems where audio context from prior turns helps shape the next response.
Multilingual Conversational Speech
The model is well suited to applications that need the same speaker identity across many languages without managing separate per-language voice inventories.
Input and Output
- AIR ID: inworld:tts@2
- Input: text with optional natural-language delivery cues, plus session-level audio context in realtime workflows
- Output: synthesized speech audio
- Language coverage: 100+ languages with preserved voice identity across switches
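Putting the pieces above together, a synthesis call might look like the sketch below. The endpoint URL, header names, and JSON fields are assumptions for illustration; substitute the values from your provider's API reference before use.

```python
import json
import urllib.request

API_URL = "https://api.example.com/v1/tts"  # hypothetical endpoint

def build_body(text, voice_id, direction=None):
    """Assemble the JSON body; keys here are illustrative, not official."""
    body = {"model": "inworld:tts@2", "voice_id": voice_id, "text": text}
    if direction:
        body["direction"] = direction
    return body

def synthesize(text, voice_id, api_key, direction=None):
    """POST a synthesis request and return the raw speech audio bytes."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_body(text, voice_id, direction)).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()  # synthesized speech audio
```

In a realtime session the provider additionally conditions on prior audio from the exchange, so per-request payloads like this would typically be sent over a persistent session rather than as isolated calls.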
Best Fit
- Realtime voice agents
- Conversational assistants
- Interactive characters and companions
- Multilingual spoken interfaces
- Support and productivity tools that need expressive, responsive speech
More models from Inworld AI
Inworld TTS-1.5 Mini is a lightweight text-to-speech model designed for real-time voice experiences with ultra-low latency and efficient performance. It delivers natural, expressive audio suitable for interactive agents, voice assistants, and conversational applications where responsiveness is critical. The Mini variant balances speed and quality, enabling responsive speech output even under constrained compute conditions.
Inworld TTS-1.5 Max is a high-fidelity text-to-speech model engineered for expressive voice synthesis with rich prosody, nuanced emotional range, and broadcast-ready audio quality. It supports a wide set of languages and delivers more natural pronunciation and expressive variation suitable for narration, content creation, and immersive character voices. The Max variant prioritizes audio quality and expressiveness while still supporting responsive generation.

