---
title: Directing voice with audio tags — Eleven v3 | Runware Docs
url: https://runware.ai/docs/models/elevenlabs-v3/guides/directing-with-audio-tags
description: How to use Eleven v3's audio tag system. Covers emotion, sound-effect, and experimental tags, voice-character constraints, punctuation and capitalization, and multi-speaker dialogue conventions.
---
### [Introduction](https://runware.ai/docs/models/elevenlabs-v3/guides/directing-with-audio-tags#introduction)

Eleven v3 takes direction through **audio tags**: short bracketed keywords placed alongside the text, like `[whispers]`, `[laughs]`, `[applause]`. The tag tells the model what kind of delivery or sound to produce, and the text after it gets read in that mode. Instead of describing the read in prose, as a stage direction would, you pick a keyword.

[Listen to audio](https://runware.ai/docs/assets/hero.YmA5BHhH.mp3)

> **Prompt**: [curious] Hey, you ever wonder what makes a good story? [excited] It's when somebody says something true and somebody else feels it. [laughs] Simple.

That sample uses three emotion and delivery tags in a single passage. Each one steers the voice for the chunk of text that follows it. This guide walks through every tag category, the constraints that decide whether a tag actually fires, and the punctuation and capitalization tools that work alongside the tag system.

### [Tag categories](https://runware.ai/docs/models/elevenlabs-v3/guides/directing-with-audio-tags#tag-categories)

The tag library splits into three groups by what they do.

#### [Emotion and delivery](https://runware.ai/docs/models/elevenlabs-v3/guides/directing-with-audio-tags#emotion-and-delivery)

These are the most common tags. They shape the **mood and physicality** of the voice. The text after the tag gets read with that affect:

```text
[laughs] [whispers] [sighs] [sarcastic] [curious] [excited] [crying] [snorts] [mischievously]
```

[Listen to audio](https://runware.ai/docs/assets/tag-laughs.CYr-t3VS.mp3)

*[laughs] mid-sentence*

> **Prompt**: Wait, you actually told her that? [laughs] That's so embarrassing.

[Listen to audio](https://runware.ai/docs/assets/tag-whispers.DRAEa1yP.mp3)

*[whispers] from the start*

> **Prompt**: [whispers] Listen, I'm not supposed to tell you this, but the meeting got cancelled.

[Listen to audio](https://runware.ai/docs/assets/tag-sigh.MjGjtuSp.mp3)

*[sighs] as a lead-in*

> **Prompt**: [sighs] Alright, let's go over the slides one more time.

[Listen to audio](https://runware.ai/docs/assets/tag-sarcastic.fp9JSA9S.mp3)

*[sarcastic] for tone, not volume*

> **Prompt**: [sarcastic] Oh, sure, that's exactly what I wanted to hear right now.

[Listen to audio](https://runware.ai/docs/assets/tag-excited.CIOB-VoL.mp3)

*[excited] with rising energy*

> **Prompt**: [excited] You won't believe this. We just hit a million users!

The tag applies forward until the next tag or the end of the input. A tag at the start of the line shapes the whole line. Mid-sentence tags work for momentary shifts like a `[laughs]` after a punchline.
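The forward-scope rule can be made concrete with a small parser. This is a minimal sketch for reasoning about your own prompts, not part of any Runware or ElevenLabs API: `split_by_tag` is a hypothetical helper, and it assumes tags are plain bracketed keywords with no nested brackets.

```python
import re

# Assumption: a tag is any run of text inside one pair of square brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_by_tag(prompt: str) -> list[tuple[list[str], str]]:
    """Split a prompt into (tags, text) pairs.

    Each stretch of text is paired with the tag or tags placed directly
    before it. Text before the first tag gets an empty tag list.
    """
    segments: list[tuple[list[str], str]] = []
    tags: list[str] = []  # tags collected since the last stretch of text
    position = 0
    for match in TAG_PATTERN.finditer(prompt):
        text = prompt[position:match.start()].strip()
        if text:
            # Text ends the scope of the previous tags.
            segments.append((tags, text))
            tags = []
        tags = tags + [match.group(1)]
        position = match.end()
    tail = prompt[position:].strip()
    if tail:
        segments.append((tags, tail))
    return segments


for tags, text in split_by_tag(
    "Wait, you actually told her that? [laughs] That's so embarrassing."
):
    print(tags, "->", text)
```

A tag with no text after it is dropped by this sketch, so it models only the scoping behavior described above.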

#### [Sound effects](https://runware.ai/docs/models/elevenlabs-v3/guides/directing-with-audio-tags#sound-effects)

These tags don't shape the voice. They **insert an audio event** at the position of the tag while the surrounding text continues to be spoken normally:

```text
[applause] [clapping] [gunshot] [explosion] [swallows] [gulps]
```

[Listen to audio](https://runware.ai/docs/assets/sfx-applause.Ctq4rKnG.mp3)

*[applause] following the line*

> **Prompt**: And the winner is... our team! [applause]

[Listen to audio](https://runware.ai/docs/assets/sfx-explosion.DGlDR8va.mp3)

*[explosion] between sentences*

> **Prompt**: The reactor went critical. [explosion] We barely made it out.

Place the SFX tag at the moment the sound should happen, not at the start of the line. The model renders the effect at that point in the timeline.

#### [Experimental](https://runware.ai/docs/models/elevenlabs-v3/guides/directing-with-audio-tags#experimental)

These tags are documented as less consistent and **more voice-dependent**. They work some of the time, depending on the voice you've picked and the text. The accent tag is parameterized: swap in any nationality.

```text
[strong French accent]
[strong Russian accent]
[sings]
[woo]
```

[Listen to audio](https://runware.ai/docs/assets/experimental-accent.BKcH8iN7.mp3)

*[strong French accent] over the entire line*

> **Prompt**: [strong French accent] Welcome to Paris. The croissants are over here, the existential crisis is over there.

[Listen to audio](https://runware.ai/docs/assets/experimental-sings.DCkWCYs6.mp3)

*[sings] over a melodic line*

> **Prompt**: [sings] Happy birthday to you, happy birthday to you...

Accent tags can shift pronunciation more reliably than full singing tags. Both benefit from running two or three generations and picking the best take, because consistency across regenerations is lower than with the standard emotion tags.
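If you generate prompts programmatically, it can help to know which group a tag falls into before sending it. The sketch below is a hypothetical lookup built only from the example tags listed in this guide. It is not an exhaustive tag library, and a result of `"unknown"` means the tag is not listed here, not that the model will reject it.

```python
import re

# Example tags copied from this guide; the real tag library is larger.
EMOTION_TAGS = {
    "laughs", "whispers", "sighs", "sarcastic", "curious",
    "excited", "crying", "snorts", "mischievously",
}
SFX_TAGS = {"applause", "clapping", "gunshot", "explosion", "swallows", "gulps"}
EXPERIMENTAL_TAGS = {"sings", "woo"}
# Accent tags are parameterized, e.g. "strong French accent".
ACCENT_PATTERN = re.compile(r"^strong .+ accent$")


def categorize(tag: str) -> str:
    """Return 'emotion', 'sfx', 'experimental', or 'unknown' for a tag."""
    name = tag.strip("[] ").lower()
    if name in EMOTION_TAGS:
        return "emotion"
    if name in SFX_TAGS:
        return "sfx"
    if name in EXPERIMENTAL_TAGS or ACCENT_PATTERN.match(name):
        return "experimental"
    return "unknown"


print(categorize("[strong French accent]"))
```

One possible use is flagging experimental tags so your pipeline generates two or three takes for those lines instead of one.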

### [Controlling tag effectiveness](https://runware.ai/docs/models/elevenlabs-v3/guides/directing-with-audio-tags#controlling-tag-effectiveness)

Tags don't always do what you ask. Three things affect whether a tag actually fires.

**The voice character.** v3 tags are constrained by the voice's training data. A voice that was trained on a calm, measured speaker won't suddenly shout if you tag a line `[excited]`. A voice trained on a boisterous speaker won't truly whisper. The tag bends the delivery in the requested direction, but only as far as the voice's range allows. If you need a wide emotional range, pick a voice with a broad baseline.

**The stability setting.** ElevenLabs voices expose a `stability` parameter on the request (`providerSettings.elevenlabs.textToSpeech.voiceSettings.stability`, 0 to 1). Lower stability gives the model more room to vary delivery, which means **tags are more responsive**. Higher stability produces consistent output but flattens tag-driven variation. If your tags feel like they're being ignored, lower stability.
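As a rough illustration of where the setting sits in a request, here is a sketch that builds a task as a plain dictionary. The nesting `providerSettings.elevenlabs.textToSpeech.voiceSettings.stability` and the `audioInference` task type come from this guide. Everything else (`taskUUID`, `model`, `speech.text`, `speech.voice`, and the helper name `build_tts_task`) is an assumption made for illustration, so check the API reference for the actual field names.

```python
import uuid
from typing import Optional


def build_tts_task(text: str, voice: str, stability: Optional[float] = None) -> dict:
    """Build a hypothetical audioInference task with an optional stability value."""
    if stability is not None and not 0.0 <= stability <= 1.0:
        raise ValueError("stability must be between 0 and 1")
    task = {
        "taskType": "audioInference",
        "taskUUID": str(uuid.uuid4()),  # assumed field
        "model": "<model-id>",          # placeholder, not a real identifier
        "speech": {"text": text, "voice": voice},  # assumed field names
    }
    if stability is not None:
        # Path taken from this guide.
        task["providerSettings"] = {
            "elevenlabs": {
                "textToSpeech": {"voiceSettings": {"stability": stability}}
            }
        }
    return task


task = build_tts_task("[excited] We just hit a million users!", "Mario", stability=0.4)
print(task["providerSettings"])
```

Leaving `stability` unset omits the provider settings entirely, so the request falls back to whatever default the service applies.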

**Combining tags.** Multiple tags can stack at the same point or apply sequentially. The model handles a few tags well, but piling five tags onto the same spot starts to confuse the output. The sample below layers two compatible tags:

[Listen to audio](https://runware.ai/docs/assets/tag-combo.DRUxY02o.mp3)

*Two compatible tags layered ([whispers] + [curious])*

> **Prompt**: [whispers] [curious] Is it just me, or is that door slightly open?

**Compatible pairs** (whispers + curious, excited + laughs, sarcastic + sighs) compound predictably. **Conflicting pairs** like `[whispers] [excited]` produce unstable output because the model is being asked to do two opposite things at once. Stick to combinations that an actor could plausibly perform in a single moment.
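A pre-flight check along these lines can catch risky stacks before you spend a generation on them. This is a hypothetical sketch: the pair tables contain only the combinations named in this guide, and the two-tag limit is this guide's rule of thumb, not a limit enforced by the model.

```python
# Pairs named in this guide; extend these tables from your own testing.
COMPATIBLE_PAIRS = {
    frozenset({"whispers", "curious"}),
    frozenset({"excited", "laughs"}),
    frozenset({"sarcastic", "sighs"}),
}
CONFLICTING_PAIRS = {frozenset({"whispers", "excited"})}


def check_stack(tags: list[str]) -> str:
    """Classify a stack of tags placed at the same point in the text."""
    if len(tags) > 2:
        return "too-many"
    if len(tags) < 2:
        return "single"
    pair = frozenset(tag.lower() for tag in tags)
    if pair in CONFLICTING_PAIRS:
        return "conflicting"
    if pair in COMPATIBLE_PAIRS:
        return "compatible"
    return "untested"  # not covered by this guide, so audition it first


print(check_stack(["whispers", "curious"]))
print(check_stack(["whispers", "excited"]))
```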

> [!WARNING]
> v3 does **not** support SSML markup. `<break>`, `<phoneme>`, and the rest are ignored. Use audio tags, punctuation, and text structure (covered next) for pacing and prosody. Existing SSML pipelines won't carry over.
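When migrating text from an SSML pipeline, a quick scan can surface leftover markup before it is silently ignored. The sketch below is a hypothetical helper that matches a handful of common SSML element names. The list is an assumption and is not complete.

```python
import re

# A few common SSML elements; extend the alternation as needed.
SSML_PATTERN = re.compile(
    r"</?\s*(speak|break|phoneme|prosody|emphasis|say-as|sub)\b[^>]*>",
    re.IGNORECASE,
)


def find_ssml(text: str) -> list[str]:
    """Return every SSML-looking element found in the text."""
    return [match.group(0) for match in SSML_PATTERN.finditer(text)]


leftovers = find_ssml('Hold on. <break time="1s"/> Okay, go ahead.')
if leftovers:
    print("SSML markup found:", leftovers)
```

Square-bracket audio tags are not matched, so a prompt that has already been converted passes cleanly.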

### [Punctuation and capitalization](https://runware.ai/docs/models/elevenlabs-v3/guides/directing-with-audio-tags#punctuation-and-capitalization)

Two text-level tools work alongside the tag system and are often the right first reach before adding a tag.

**Ellipses add weight and pauses.** Three dots create a lingering pause that feels different from a comma or period. Useful for hesitation, dawning realization, or emotional weight:

[Listen to audio](https://runware.ai/docs/assets/punctuation-ellipsis._vYwJ_fX.mp3)

*Ellipses for a weighted pause*

> **Prompt**: I thought... I thought we had more time.

**Capitalization stresses individual words.** Writing a word in all caps tells the model to emphasize it. This is the v3 equivalent of bold or italic markup, and it works on partial words too (`AbsoLUTEly` stresses just the middle syllable):

[Listen to audio](https://runware.ai/docs/assets/caps-emphasis.CxLPTbrM.mp3)

*Full-word capitalization for stress*

> **Prompt**: I told you NOT to press the button.

Use both sparingly. Ellipses on every other sentence become noise. Everything-capitalized lines stop stressing anything because there's no contrast to stress against. One or two per paragraph is the rough ceiling.
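The ceiling above is easy to check mechanically. This sketch counts all-caps words and ellipses in one paragraph and flags it when either count exceeds two. Applying the ceiling to each device separately is one interpretation of the guideline, and `emphasis_report` is a hypothetical helper. Acronyms such as `API` are counted as emphasis, and partial-word caps like `AbsoLUTEly` are not detected.

```python
import re

TAG = re.compile(r"\[[^\[\]]*\]")          # audio tags, removed before counting
CAPS_WORD = re.compile(r"\b[A-Z]{2,}\b")   # whole words of two or more capitals
ELLIPSIS = re.compile(r"\.{3}|\u2026")     # three dots or the ellipsis character


def emphasis_report(paragraph: str, ceiling: int = 2) -> dict:
    """Count capitalized words and ellipses in a single paragraph."""
    plain = TAG.sub("", paragraph)
    caps = CAPS_WORD.findall(plain)
    ellipses = len(ELLIPSIS.findall(plain))
    return {
        "caps": caps,
        "ellipses": ellipses,
        "over_ceiling": len(caps) > ceiling or ellipses > ceiling,
    }


print(emphasis_report("I told you NOT to press the button."))
```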

### [Multi-speaker dialogue](https://runware.ai/docs/models/elevenlabs-v3/guides/directing-with-audio-tags#multi-speaker-dialogue)

v3 generates speech from one voice per request. A two-person conversation requires **one request per speaker**, each with its own voice ID, then concatenated downstream into a single audio track. The convention for splitting the script is label-prefixed lines:

```text
Speaker 1 (voice: Mario):  [curious] So... how did the demo go?
Speaker 2 (voice: Ashley): [excited] Better than we expected! They loved it.
```

You'd send each line as its own `audioInference` request with the matching voice, then stitch the resulting MP3s in your application code. The two audio files below were generated separately and represent the dialogue side by side:

[Listen to audio](https://runware.ai/docs/assets/multi-speaker-1.BKuPyio7.mp3)

*Speaker 1: Mario*

> **Prompt**: [curious] So... how did the demo go?

[Listen to audio](https://runware.ai/docs/assets/multi-speaker-2.D5t-fNYP.mp3)

*Speaker 2: Ashley*

> **Prompt**: [excited] Better than we expected! They loved it.
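Splitting a label-prefixed script into per-speaker turns can be sketched as follows. The line format matches the example script in this section. The helper name `parse_script` and the returned dictionary keys are assumptions for illustration, and the sketch stops at parsing: sending each turn and stitching the audio is left to your application.

```python
import re

# Matches lines like: "Speaker 1 (voice: Mario):  [curious] So... how did it go?"
LINE_PATTERN = re.compile(
    r"^(?P<label>[^:(]+?)\s*\(voice:\s*(?P<voice>[^)]+)\)\s*:\s*(?P<text>.+)$"
)


def parse_script(script: str) -> list[dict]:
    """Turn a label-prefixed script into an ordered list of speaker turns."""
    turns = []
    for raw_line in script.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        match = LINE_PATTERN.match(line)
        if match is None:
            raise ValueError(f"Unrecognized script line: {line!r}")
        turns.append({
            "label": match["label"].strip(),
            "voice": match["voice"].strip(),
            "text": match["text"].strip(),
        })
    return turns


script = """
Speaker 1 (voice: Mario):  [curious] So... how did the demo go?
Speaker 2 (voice: Ashley): [excited] Better than we expected! They loved it.
"""
for turn in parse_script(script):
    print(turn["voice"], "->", turn["text"])
```

Keeping the turns in script order matters, since that order is what you use when concatenating the generated files.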

For overlapping speech or interruptions, the documented convention is to use an em-dash at the cut-off point in the first speaker's line and pick up mid-thought in the second:

```text
Speaker 1: [starting to speak] So I was thinking we could —
Speaker 2: [jumping in] — test our new timing features?
```

The em-dash signals an unfinished phrase, and the model trails off at it. The second speaker's line picks up cleanly when stitched into the final track.

### [Tips](https://runware.ai/docs/models/elevenlabs-v3/guides/directing-with-audio-tags#tips)

1. **Reach for punctuation before tags.** A well-placed ellipsis or period often does the job you'd otherwise give to `[sighs]` or `[pauses]`, without the variance a tag introduces.
    
2. **Pick the voice first, then write to its range.** Tags only deliver what the voice can plausibly do. Audition two or three voices on a short test passage before committing. A voice mismatched to your script will fight every tag you add.
    
3. **Lower the `stability` setting if tags feel muted.** Default stability prioritizes consistency. If you've added `[excited]` and the line still sounds neutral, drop stability to 0.3–0.5 and regenerate.
    
4. **Stack at most two compatible tags.** `[whispers] [curious]` works. `[whispers] [excited]` doesn't. Tags that ask for opposite physical states confuse the model. Tags that share a baseline mood compound cleanly.
    
5. **Capitalize one word per sentence at most.** Emphasis is contrast. If half the sentence is in caps, none of it reads as emphasized.
    
6. **Don't strip narrative context from quoted speech.** When generating dialogue from a script that includes phrases like *"she said angrily"*, leave the narrative phrase in. v3 reads the tone from the surrounding context, and removing it makes the read flatter.
    
7. **For non-English content, set `speech.language`.** The same voice can read text in any of the ~75 supported language codes, and the language parameter steers pronunciation. Set it explicitly when the script is in a single language. For mixed-language passages, let the model auto-detect.
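The language tip can be sketched as a small helper that sets or clears the field on a request dictionary. The `speech.language` path comes from this guide. The helper name `set_language`, the other fields in the sample task, and the code `"es"` are assumptions for illustration, so confirm the accepted code format in the API reference.

```python
import copy
from typing import Optional


def set_language(task: dict, language: Optional[str]) -> dict:
    """Return a copy of the task with speech.language set, or removed if None.

    Pass None for mixed-language text so the model can auto-detect.
    """
    updated = copy.deepcopy(task)
    speech = updated.setdefault("speech", {})
    if language is None:
        speech.pop("language", None)
    else:
        speech["language"] = language
    return updated


# Hypothetical task shape; only "speech.language" is named in this guide.
base = {"taskType": "audioInference", "speech": {"text": "Hola, ¿qué tal?"}}
print(set_language(base, "es")["speech"])
```

Returning a copy keeps the original task untouched, which is convenient when the same script is rendered in several languages.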