Inworld AI

Inworld Realtime TTS-2

Conversational text-to-speech with realtime voice direction and audio-aware delivery

Text-to-Audio

Inworld Realtime TTS-2 Overview

Inworld Realtime TTS-2 is a conversational text-to-speech model built for realtime voice interaction rather than static narration. It supports free-form voice direction, carries tone and pacing forward from prior audio in a session, and preserves one voice identity across 100+ languages. It is designed for expressive, low-latency speech in assistants, characters, support agents, and interactive products.

How to Use Inworld Realtime TTS-2

Overview

Inworld Realtime TTS-2 is a conversational text-to-speech model designed for live voice interaction rather than one-shot narration.

It is built for applications where the voice needs to react to context, preserve identity across languages, follow natural-language delivery direction, and sound like a person participating in the exchange instead of reading a script in isolation.

Strengths

Built for Realtime Conversation

Realtime TTS-2 is designed around conversational use cases where the model speaks in the flow of an exchange. It is a stronger fit for agents, companions, characters, and support experiences than models optimized mainly for narration or voiceover.

Free-Form Voice Direction

The model can take delivery cues in natural language, which makes it possible to steer pacing, emotion, emphasis, and style without being limited to a small preset list. This is useful when the same voice needs different reads across different contexts.
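As a minimal sketch of what free-form direction looks like in practice, the helper below pairs the same line of text with two different natural-language delivery cues. The endpoint-level field names (`voice_direction`, `voice_id`) and the use of the `inworld:tts@2` model ID in a payload are assumptions for illustration, not a documented request schema.

```python
# Sketch: steering delivery with a free-form, natural-language direction.
# Field names ("voice_direction", "voice_id") are hypothetical; consult
# the provider's API reference for the real request schema.

def build_tts_request(text: str, direction: str, voice_id: str = "default") -> dict:
    """Assemble a request payload pairing the line to speak with a
    free-form delivery cue (pacing, emotion, emphasis, style)."""
    return {
        "model": "inworld:tts@2",
        "input": text,
        "voice_id": voice_id,
        "voice_direction": direction,
    }

# The same line, two different reads:
calm = build_tts_request(
    "Your order is on its way.",
    "calm and reassuring, slightly slower than normal",
)
excited = build_tts_request(
    "Your order is on its way.",
    "bright and excited, quick pace, emphasize 'on its way'",
)
```

Because the direction is plain text rather than a preset enum, the same voice can be re-read per context without maintaining a separate voice variant for each mood.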

Conversational Awareness

In realtime sessions, the model conditions on prior audio from the exchange, not only the text of the current line. That helps it carry forward tone, pacing, and emotional state in a more context-aware way.

Crosslingual Voice Identity

Realtime TTS-2 preserves a single voice identity across more than 100 languages, including language switches inside the same generation. This makes it useful for multilingual assistants, teachers, and characters that need to stay recognizably the same speaker.

Expressive Delivery Tools

The model supports inline non-verbal markers such as laughs, sighs, and breaths, along with disfluency patterns and delivery shaping that help speech feel less robotic and more conversational.

Voice Design and Cloning Compatibility

Realtime TTS-2 works with reusable voices created through prompt-based voice design and cloned voices, which makes it easier to pair the model with custom speaker identities in production systems.

Capabilities

Text-to-Audio

Realtime TTS-2 generates speech from text and is optimized for responsive, expressive output in conversational products.

Realtime Session Output

The model is designed to operate in realtime systems where audio context from prior turns helps shape the next response.

Multilingual Conversational Speech

The model is well suited to applications that need the same speaker identity across many languages without managing separate per-language voice inventories.

Input and Output

  • AIR ID: inworld:tts@2
  • Input: text with optional natural-language delivery cues, plus session-level audio context in realtime workflows
  • Output: synthesized speech audio
  • Language coverage: 100+ languages with preserved voice identity across switches

Best Fit

  • Realtime voice agents
  • Conversational assistants
  • Interactive characters and companions
  • Multilingual spoken interfaces
  • Support and productivity tools that need expressive, responsive speech

More models from Inworld AI

Inworld TTS-1.5 Mini is a lightweight text-to-speech model designed for real-time voice experiences with ultra-low latency and efficient performance. It delivers natural, expressive audio suitable for interactive agents, voice assistants, and conversational applications where responsiveness is critical. The Mini variant balances speed and quality, enabling responsive speech output even under constrained compute conditions.

Inworld TTS-1.5 Max is a high-fidelity text-to-speech model engineered for expressive voice synthesis with rich prosody, nuanced emotional range, and broadcast-ready audio quality. It supports a wide set of languages and delivers more natural pronunciation and expressive variation suitable for narration, content creation, and immersive character voices. The Max variant prioritizes audio quality and expressiveness while still supporting responsive generation.