MODEL ID inworld:tts@2

Inworld Realtime TTS-2

by Inworld AI

Inworld Realtime TTS-2 is a conversational text-to-speech model built for realtime voice interaction rather than static narration. It supports free-form voice direction, carries tone and pacing forward from prior audio in a session, and preserves a single voice identity across 100+ languages. It's designed for expressive, low-latency speech in assistants, characters, support agents, and interactive products.


Controlling voice delivery with steering tags

How to use natural-language steering tags to control emotion, pacing, volume, and vocal style in TTS-2 speech output.

Introduction

Most text-to-speech models read words without considering how those words should sound. A farewell and a punchline come out with the same pitch, the same pacing, the same flat affect. The model knows what to say but has no concept of how to say it.

TTS-2 fixes this with voice direction: natural-language instructions, written in square brackets, that tell the model how to deliver a line. You write them the way you'd write a stage direction for an actor. There are no preset emotion slots, no dropdown menus, no sliders. If you can describe the delivery in plain English, the model can perform it.


[speak warmly, as if greeting an old friend] Hey, it's been a while. How have you been?

How steering tags work

A steering tag is a natural-language instruction enclosed in square brackets, placed before the text it applies to. The model reads the tag, adjusts its delivery, and speaks the text that follows.

The syntax is straightforward:

[instruction] Text to speak.

The tag applies forward until the model reaches the next tag or the end of the input. You don't need to close it or reset it. If the mood shifts mid-text, drop a new tag at the transition point:

[speak cheerfully] Great news, the build passed! [lower voice, more serious] But we need to talk about the memory leak.
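If you're assembling tagged input programmatically, the forward-scoping rule makes composition simple: prefix each segment with its direction and concatenate. A minimal sketch (the `steer` helper is hypothetical, not part of any SDK):

```python
# Sketch: compose TTS-2 input from (direction, text) segments.
# Each direction becomes a [bracketed] steering tag that applies forward
# until the next tag, so no closing or reset markers are needed.

def steer(segments):
    """segments: iterable of (direction, text) pairs; direction may be None."""
    parts = []
    for direction, text in segments:
        if direction:
            parts.append(f"[{direction}] {text.strip()}")
        else:
            parts.append(text.strip())
    return " ".join(parts)

line = steer([
    ("speak cheerfully", "Great news, the build passed!"),
    ("lower voice, more serious", "But we need to talk about the memory leak."),
])
# line is exactly the two-tag example shown above
```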

Here's the full API request for that example:

[
  {
    "taskType": "audioInference",
    "taskUUID": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "model": "inworld:tts@2",
    "speech": {
      "text": "[speak cheerfully] Great news, the build passed! [lower voice, more serious] But we need to talk about the memory leak.",
      "voice": "Sarah"
    }
  }
]
The response returns a URL to the generated audio:

{
  "data": [
    {
      "taskType": "audioInference",
      "audioUUID": "f1e2d3c4-b5a6-7890-1234-567890abcdef",
      "taskUUID": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "audioURL": "https://aud.runware.ai/audio/f1e2d3c4-b5a6-7890-1234-567890abcdef.mp3"
    }
  ]
}
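In code, the task object is just a small dictionary posted as a one-element array. The sketch below builds that payload and sends it over HTTP; the endpoint URL and Bearer-auth header are assumptions based on the request format above, so verify them against your account's API documentation before relying on this.

```python
# Sketch: build and send an audioInference task for inworld:tts@2.
# The endpoint and auth scheme are ASSUMPTIONS, not confirmed here.
import json
import uuid
import urllib.request

def build_task(text, voice="Sarah", model="inworld:tts@2"):
    # One task object per generation; taskUUID lets you match the
    # response entry in data[] back to the request.
    return {
        "taskType": "audioInference",
        "taskUUID": str(uuid.uuid4()),
        "model": model,
        "speech": {"text": text, "voice": voice},
    }

def speak(text, api_key, voice="Sarah"):
    payload = [build_task(text, voice=voice)]
    req = urllib.request.Request(
        "https://api.runware.ai/v1",  # assumed REST endpoint
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth header
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # The response carries an audioURL per task, as in the example above.
    return body["data"][0]["audioURL"]
```

`build_task` is pure, so you can generate and inspect payloads without touching the network; `speak` wires in the transport.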

Writing effective directions

Without any steering tags, the model reads text in a neutral, default tone. Adding a direction tag transforms the delivery. Compare the same line read flat versus with a detailed direction:


It's over. There's nothing left to say.

No direction

[speak with deep sadness, voice low and hollow, as if holding back tears, long pauses between sentences] It's over. There's nothing left to say.

With direction

The directed version layers mood, rhythm, pitch, and vocal mode into a single instruction. Think of it as the difference between handing an actor a script versus giving them a scene brief. The more dimensions you describe, the more specific the performance.

You're writing stage directions, not keywords. The model responds to prose. Write the instruction the way you'd describe a read to a voice actor: "say this like you're confiding in a close friend late at night" works better than "intimate".

Direction categories

Steering tags aren't limited to emotions. You can control multiple dimensions of delivery:

Emotion

Sets the overall mood. Can be a single word or a description:

[say excitedly] We just hit a million users!
[sound concerned, speaking carefully] I'm not sure that's the right approach.
[sound terrified, barely holding it together] Did you hear that?

Articulation and pacing

Controls how clearly and at what speed words are delivered:

[articulate clearly, with deliberate pauses] The API key goes in the Authorization header.
[very fast, urgent] We need to push the hotfix now, production is down.
[very slow, measured] Let me explain this one more time.

Volume and pitch

Shapes the intensity and register of the voice:

[very quiet, almost a whisper] Don't tell anyone, but the feature ships tomorrow.
[say in a low tone, gravelly] This is your last warning.
[say in a high pitch, bright and cheerful] Good morning! Ready to get started?

Vocal style

Changes the mode of delivery itself:

[whisper in a hushed style] I think someone's listening.
[give a nasal quality, slightly annoyed] Ugh, another meeting.

Combining dimensions

The best results come from combining qualities across multiple categories in a single tag. Layer mood, pitch, pacing, and manner together:

[say sadly with deliberate pauses in a low voice and hushed style] I don't think he's coming back.

[speak quickly with excitement and a high pitch] Oh my god, you won't believe what just happened!

[calm and steady, with a warm tone and measured pace] Take a deep breath. We'll figure this out together.

Inline non-verbals

You can also insert non-verbal sounds at specific points in the text. These are placed inline, exactly where the sound should occur:

Wait, you actually did that? [laugh] That's wild.
[sigh] I don't know. It's been one of those weeks.
Okay, let me think about this for a second. [breathe] Right. Here's what we do.

The available non-verbals are [laugh], [sigh], [breathe], [clear throat], [cough], and [yawn]. They work as audio events, not pronounced words. The model inserts the sound at that position and continues speaking.
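Because the non-verbal set is closed while steering directions are free-form, it can be useful to tell the two apart when processing input text. A small sketch (the `find_non_verbals` helper is hypothetical):

```python
# Sketch: separate non-verbal sound events from free-form steering tags.
# NON_VERBALS mirrors the supported set listed above; any other bracketed
# token is a steering direction, not a sound event.
import re

NON_VERBALS = {"laugh", "sigh", "breathe", "clear throat", "cough", "yawn"}

def find_non_verbals(text):
    """Return the bracketed tokens in `text` that are non-verbal sounds."""
    tags = re.findall(r"\[([^\]]+)\]", text)
    return [t for t in tags if t.strip().lower() in NON_VERBALS]

find_non_verbals("[speak tiredly] [sigh] It's been one of those weeks.")
# only "sigh" is a sound event; "speak tiredly" is a steering direction
```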


You can combine non-verbals with steering tags. The tag sets the delivery, the non-verbal inserts a specific moment:


[speak tiredly, end-of-day energy] [sigh] I don't know. It's, uh, it's been one of those weeks where you just kind of... lose the thread.

Emphasis and pacing

Two text-level techniques complement steering tags; both work in any TTS-2 request, with or without bracketed directions.

Capitalization for stress

Capitalize a word to make the model stress it. This works the way bold or italics would in written text, drawing the listener's ear to the important word:

I told you NOT to do that.
Your order will arrive by FRIDAY.
That is AbsoLUTEly correct.

Full-word capitalization stresses the entire word. Partial capitalization (like "AbsoLUTEly") stresses a specific syllable. Use both sparingly. If everything is emphasized, nothing is.
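If you apply stress programmatically, uppercasing whole-word matches keeps the emphasis selective. A minimal sketch (the `emphasize` helper is hypothetical):

```python
# Sketch: apply full-word capitalization for stress to selected words,
# matching whole words case-insensitively so "Not" and "not" both qualify.
import re

def emphasize(text, words):
    """Uppercase whole-word matches of `words` so TTS-2 stresses them."""
    for w in words:
        text = re.sub(rf"\b{re.escape(w)}\b", w.upper(), text, flags=re.IGNORECASE)
    return text

emphasize("I told you not to do that.", ["not"])
# stresses only the one word, keeping the contrast intact
```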


Punctuation for pacing

The model responds to punctuation the same way a reader does. Periods create full pauses. Commas create shorter breaks. Ellipses create a trailing, lingering pause, and work well for hesitation or uncertainty:

I thought it would work, but... I'm not so sure anymore.
Stop. Think. Then act.
Well, when you put it that way, I suppose you might have a point.

Common mistakes

Contradicting the content

A steering tag should match the text it applies to. Tagging [sound happy and excited] on a line about a funeral degrades the output because the model is pulled in two directions. If the text is somber, the tag should reflect that.

Combining opposing directions

Keep each tag internally consistent. Pairing [whisper in a hushed style] with [very loud] in the same instruction sends conflicting signals. Pick one direction per tag. If you need the mood to shift, use a new tag at the transition point.

Over-tagging

You don't need a tag on every sentence. A single tag carries forward until the next one. If five consecutive sentences should all feel warm and conversational, one tag at the start is enough. Changing direction every sentence produces a jittery, unnatural read.

Tips for best results

  1. Write stage directions, not keywords. "Say this like you're apologizing to someone you care about" outperforms "apologetic". The model responds to prose, so give it a scene.

  2. Combine at least two dimensions. Mood alone is generic. Mood plus pacing, or mood plus volume, produces a more specific delivery. "Sad" is vague. "Sad, quiet, with long pauses" is a read.

  3. Place non-verbals where a human would do them. A [sigh] at the start of a resigned sentence, a [laugh] after a punchline. Don't scatter them randomly.

  4. Use capitalization for one or two words per sentence at most. Emphasis works because it's selective. Capitalizing half the sentence removes the contrast that makes it effective.

  5. Test with the same voice. Different voices respond to the same direction with different intensity. A direction that sounds perfect on Sarah might feel understated on Ethan. Pick your voice first, then tune the directions to it.

  6. Let the model breathe. Longer input with natural punctuation sounds better than short fragments stitched together. Give the model a full thought to work with, not a word at a time.