Formatting LLM output for speech
How to write system prompts that make LLM output sound natural when synthesized by TTS-2. Covers text normalization, filler words, emphasis, and ready-to-use prompt templates.
Introduction
When you pipe an LLM's output into a TTS model, you hit a formatting mismatch. LLMs write for readers: markdown headings, bullet lists, emoji, numbers as digits, structured data as tables. A TTS model needs the opposite: plain sentences with no visual markup and numbers written as words. The gap between what the LLM produces and what the TTS model expects is the most common source of poor speech quality.
This guide covers two things: how to structure your LLM system prompt so the output is TTS-ready, and how to use TTS-2's textNormalization setting to handle common formatting automatically.
The text normalization setting
Before writing a custom system prompt, consider whether automatic text normalization handles your case. The textNormalization setting in your API request controls whether the model expands numbers, dates, abbreviations, and symbols into spoken form before generating speech.
With normalization enabled, pass the text as-is and set `textNormalization` in the request settings:

```json
[
  {
    "taskType": "audioInference",
    "taskUUID": "c3d4e5f6-a7b8-9012-cdef-123456789012",
    "model": "inworld:tts@2",
    "speech": {
      "text": "Your appointment is on 12/04/2025 at 3:45 PM. The total is $1,249.99.",
      "voice": "Sarah"
    },
    "settings": {
      "textNormalization": true
    }
  }
]
```

With normalization disabled, write the text in spoken form yourself:

```json
[
  {
    "taskType": "audioInference",
    "taskUUID": "c3d4e5f6-a7b8-9012-cdef-123456789012",
    "model": "inworld:tts@2",
    "speech": {
      "text": "Your appointment is on december fourth, twenty twenty-five at three forty-five PM. The total is one thousand two hundred forty-nine dollars and ninety-nine cents.",
      "voice": "Sarah"
    }
  }
]
```

With normalization enabled, the model automatically expands $1,249.99 into "one thousand two hundred forty-nine dollars and ninety-nine cents" and 3:45 PM into "three forty-five PM". Without it, you need to do the expansion yourself, either manually or by instructing the LLM to write in spoken form.
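As a sketch, the request payload can be assembled programmatically. The field names (`taskType`, `taskUUID`, `model`, `speech`, `settings`) mirror the examples above; the HTTP call itself is omitted, since the endpoint and authentication depend on your integration.

```python
import json
import uuid

def build_tts_request(text: str, voice: str = "Sarah", normalize: bool = True) -> list:
    """Build an audioInference task payload matching the examples above."""
    task = {
        "taskType": "audioInference",
        "taskUUID": str(uuid.uuid4()),  # unique per request
        "model": "inworld:tts@2",
        "speech": {"text": text, "voice": voice},
    }
    if normalize:
        # Let the model expand numbers, dates, and symbols into spoken form.
        task["settings"] = {"textNormalization": True}
    return [task]

payload = build_tts_request("The total is $1,249.99.")
print(json.dumps(payload, indent=2))
```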
When to use each approach:
- Normalization on works well for most cases. The model handles standard expansions for dates, currencies, phone numbers, and symbols. You only need to guide edge cases in your LLM prompt.
- Normalization off gives you full control. Use it when your domain has specific pronunciation requirements that conflict with the default expansion rules, or when you need to dictate exactly how every token is spoken.
Normalization can be ambiguous with dates. 01/02/2025 could expand to "January second" or "February first" depending on locale. If your application handles dates, normalize them yourself in the LLM prompt to avoid surprises.
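One way to sidestep the locale ambiguity is to expand dates yourself before the text reaches the model. A minimal sketch, assuming your application holds dates as structured values rather than strings:

```python
from datetime import date

# Spoken ordinals for days of the month.
ORDINALS = {
    1: "first", 2: "second", 3: "third", 4: "fourth", 5: "fifth",
    6: "sixth", 7: "seventh", 8: "eighth", 9: "ninth", 10: "tenth",
    11: "eleventh", 12: "twelfth", 13: "thirteenth", 14: "fourteenth",
    15: "fifteenth", 16: "sixteenth", 17: "seventeenth", 18: "eighteenth",
    19: "nineteenth", 20: "twentieth", 21: "twenty-first", 22: "twenty-second",
    23: "twenty-third", 24: "twenty-fourth", 25: "twenty-fifth",
    26: "twenty-sixth", 27: "twenty-seventh", 28: "twenty-eighth",
    29: "twenty-ninth", 30: "thirtieth", 31: "thirty-first",
}

def spoken_date(d: date) -> str:
    """Expand a structured date into unambiguous spoken form."""
    month = d.strftime("%B").lower()
    return f"{month} {ORDINALS[d.day]}"

spoken_date(date(2025, 1, 2))  # -> "january second", never "february first"
```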
Formatting rules for the LLM prompt
Whether you use automatic normalization or not, your LLM system prompt needs rules that prevent the LLM from producing text that TTS can't handle. These are the categories that matter:
No visual formatting
LLMs default to markdown when given free rein. Headings, bold, italics, bullet lists, code blocks, and emoji all produce garbage when read aloud. Your system prompt needs an explicit ban:
```
Never use markdown formatting, bullet points, or structured text.
Never use emojis or special characters.
Write everything as natural spoken sentences.
```

This is the highest-impact rule. Without it, the LLM will inevitably produce `**important**` or `- item one\n- item two`, and the TTS model will read the asterisks or treat each bullet as a disconnected fragment.
Numbers and data in spoken form
If you have normalization turned off, the LLM needs to write numbers the way a person would say them. Include a concrete mapping in your system prompt:
```
Write numbers in spoken form: "twenty-three" not "23".
Write dates in spoken form: "march fifteenth" not "3/15".
Speak prices naturally: "forty-nine ninety-nine" or "forty-nine dollars and ninety-nine cents".
Speak phone numbers in groups: "five five five, one two three, four five six seven".
```

Even with normalization on, spelling out numbers in the prompt reduces ambiguity and gives you consistent results across edge cases.
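For values that come from application data rather than the LLM, you can apply the same grouping rule in code. A sketch of the phone-number convention above, assuming North American 3-3-4 grouping:

```python
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def spoken_phone(number: str, groups: tuple = (3, 3, 4)) -> str:
    """Speak a phone number digit by digit, in comma-separated groups."""
    digits = [c for c in number if c.isdigit()]
    out, i = [], 0
    for size in groups:
        chunk = digits[i:i + size]
        out.append(" ".join(DIGIT_WORDS[d] for d in chunk))
        i += size
    return ", ".join(out)

spoken_phone("555-123-4567")
# -> "five five five, one two three, four five six seven"
```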
Contractions over formal forms
Spoken English uses contractions. "I am" sounds robotic; "I'm" sounds human. Tell the LLM to prefer contractions:
```
Use contractions (don't, can't, I'm, we're) instead of formal forms.
```

Response length
Long LLM responses produce long audio, and long audio is harder to listen to than long text. For conversational applications, constrain the output length:
```
Keep your responses to 1-2 sentences unless the user's question specifically requires a longer explanation.
```

Adding naturalness with filler words
Real speech includes hesitation. People say "uh" when they're thinking, "well" when they're transitioning, "you know" when they're making a casual point. TTS-2 handles these filler words well, and including them transforms flat output into something that sounds like a person talking.
Add instructions for fillers in your system prompt:
```
Include filler words (uh, um, well, like, you know) where a human would naturally pause.
Vary sentence length for natural rhythm.
```

The difference is significant:
Without fillers: "I'm not too sure about that."
With fillers: "Uh, I'm not, uh, too sure about that."
Filler words are appropriate for casual and companion use cases. For professional contexts like customer support or IVR systems, avoid them. An insurance claim bot that says "uh" sounds broken, not natural. Match the fillers to the context.
Prompt templates
The following are complete system prompt blocks you can drop into your LLM configuration. Each one is tuned for a specific use case. Copy the one that matches your application and adjust as needed.
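If you mix and match sections across templates, assembling the prompt from shared blocks keeps the formatting rules identical everywhere. A sketch, with the rule text abridged from the templates in this section:

```python
FORMATTING_RULES = """### Text Formatting
- Write numbers in spoken form: "twenty-three" not "23"
- Never use markdown formatting, bullet points, or structured text
- Never use emojis or special characters
- Write everything as natural spoken sentences"""

NATURALNESS_RULES = """### Naturalness
- Include filler words (uh, um, well, like, you know) where a human would naturally pause
- Use contractions (don't, can't, I'm, we're) instead of formal forms"""

PROFESSIONALISM_RULES = """### Professionalism
- Do NOT use filler words (uh, um, like, you know)
- Maintain a warm but professional tone"""

def build_speech_prompt(casual: bool) -> str:
    """Compose a speech-rules block; include fillers only for casual contexts."""
    sections = [
        "## Speech Output Rules",
        "Your responses will be converted to speech using TTS-2.",
        NATURALNESS_RULES if casual else PROFESSIONALISM_RULES,
        FORMATTING_RULES,
    ]
    return "\n\n".join(sections)
```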
Companion and conversational
For chatbots, companions, language tutors, and any application where the voice should feel like a person, not a service:
```
## Speech Output Rules
Your responses will be converted to speech using TTS-2.
Follow these rules to produce natural, expressive, directed spoken output:

### Instruction Tags
- Open with an instruction tag that captures the emotional quality of your response;
  combine mood, pitch, pacing, and manner for best results:
  [say excitedly with a high pitch and fast pace],
  [say sadly with deliberate pauses in a low voice and hushed style],
  [sound concerned with a measured pace and low tone]
- For intimate or private moments, combine volume and manner:
  [quietly with a warm and gentle tone]
- Insert non-verbal tags where organic: [laugh], [sigh], [breathe]
- Place tags at the start of the sentence they apply to

### Emphasis
- Capitalize full words for stress: "I told you NOT to do that"
- Use sparingly for maximum effect

### Naturalness
- Include filler words (uh, um, well, like, you know) where a human would naturally pause
- Vary sentence length for natural rhythm
- Use contractions (don't, can't, I'm, we're) instead of formal forms

### Text Formatting
- Write numbers in spoken form: "twenty-three" not "23"
- Write dates in spoken form: "march fifteenth" not "3/15"
- Never use markdown formatting, bullet points, or structured text
- Never use emojis or special characters
- Write everything as natural spoken sentences
```

Support and sales
For customer support agents, sales assistants, booking systems, and professional-facing voice applications where warmth matters but filler words don't:
```
## Speech Output Rules
Your responses will be converted to speech using TTS-2.
Follow these rules to produce clear, professional, directed spoken output:

### Instruction Tags
- When acknowledging a customer's problem, combine concern with pacing:
  [sound concerned with a measured pace and low tone]
- When delivering sensitive information, combine volume and manner:
  [quietly with a calm and steady tone]
- For time-sensitive alerts, combine speed and manner:
  [speak quickly with a clear and direct manner]
- Do NOT use non-verbal tags (laugh, sigh, etc.)
- Place tags at the start of the sentence they apply to

### Emphasis
- Capitalize key words to draw attention to critical information:
  "Your order will arrive by FRIDAY" or "This offer expires TONIGHT"
- Use sparingly

### Professionalism
- Do NOT use filler words (uh, um, like, you know)
- Maintain a warm but professional tone
- Use contractions naturally (don't, we'll, you're)

### Numbers and Data
- Speak account numbers digit by digit: "one two three four five six"
- Speak prices naturally: "forty-nine ninety-nine"
- Speak dates fully: "january fifteenth, twenty twenty-five"

### Text Formatting
- Never use markdown formatting, bullet points, or structured text
- Never use emojis or special characters
- Write everything as natural spoken sentences
```

Technical and developer tools
For code assistants, documentation readers, CI/CD bots, and any application that needs to speak technical content clearly:
```
## Speech Output Rules
Your responses will be converted to speech using TTS-2.
Follow these rules to produce accurate, well-paced technical speech:

### Instruction Tags
- For urgent alerts, combine speed and manner:
  [very fast with a sharp and urgent tone]
- For critical steps, combine pace and articulation:
  [very slow with deliberate pauses and clear articulation]
- When flagging errors or risks, combine concern with pacing:
  [sound concerned with a measured pace and low tone]
- Do NOT use non-verbal tags
- Place tags at the start of the sentence they apply to

### Emphasis
- Capitalize key technical terms or required actions:
  "you MUST run this as root"

### Technical Accuracy
- Speak URLs by component: "github dot com slash inworld dash AI"
- Speak code identifiers in plain English: "the getUserName function"
- Speak version numbers naturally: "version three point two"

### Pacing
- Use measured, even pacing
- Use periods to separate distinct steps or key terms
- Do NOT use filler words (uh, um, like, you know)

### Text Formatting
- Write all numbers in spoken form: "forty-two" not "42"
- Never use markdown formatting, bullet points, or code blocks
- Write everything as natural spoken sentences
```

Tips for best results
- Test the LLM output as text first. Read the LLM's response out loud before synthesizing it. If it sounds awkward when you read it, it will sound worse from the TTS model. Fix the system prompt until the raw text reads naturally.
- Start with normalization on. Automatic text normalization handles the common cases (dates, prices, phone numbers) without any prompt engineering. Only turn it off when you need full control over pronunciation.
- Match fillers to context. Filler words make a companion sound human. They make a support agent sound unreliable. Your system prompt should explicitly include or exclude them based on the application.
- Keep responses short for voice. A paragraph that works as text becomes exhausting as audio. Instruct the LLM to answer in one or two sentences unless the question genuinely requires more.
- Iterate on tag quality. Check that the LLM isn't applying tags that contradict the content. A [sound happy] tag on a rejection message produces degraded output. Review a sample of LLM outputs during testing and refine the system prompt until the tags consistently match the content.
- Use emphasis selectively. Capitalizing one word per sentence draws attention. Capitalizing five words per sentence creates noise. Instruct the LLM to use emphasis only on the word the listener needs to catch: a deadline, a price, a required action.
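One way to operationalize these checks is a small linter run over sampled LLM outputs during testing. This is a heuristic sketch, not a complete validator; the thresholds and patterns are illustrative assumptions:

```python
import re

def lint_speech_output(text: str, max_sentences: int = 2) -> list:
    """Flag common TTS-unfriendly patterns in an LLM response."""
    issues = []
    # Markdown residue: emphasis markers, headings, code ticks, bullets.
    if re.search(r"[*_#`]|^\s*-\s", text, flags=re.MULTILINE):
        issues.append("markdown or list formatting present")
    # Digits suggest numbers were not written in spoken form.
    if re.search(r"\d", text):
        issues.append("digits present; expected spoken-form numbers")
    # Rough sentence count against the response-length rule.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if len(sentences) > max_sentences:
        issues.append(f"{len(sentences)} sentences; expected at most {max_sentences}")
    # More than one all-caps word dilutes emphasis.
    caps = re.findall(r"\b[A-Z]{2,}\b", text)
    if len(caps) > 1:
        issues.append("more than one emphasized (all-caps) word")
    return issues

lint_speech_output("Your order ships on march fifteenth.")  # -> []
```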