MODEL ID: heygen:avatar@5

HeyGen Avatar V

by HeyGen, May 4, 2026

HeyGen Avatar V is an avatar video generation model for talking digital twins and other eligible registered avatar looks. It improves identity preservation, lip sync accuracy, facial expressiveness, and motion coherence across angle changes, scene changes, and long-form videos, making it well suited to presenter, training, and localization workflows where avatar stability matters.

Driving the avatar: text to speech or your own audio

How to choose between Avatar V's two input modes: generate the voice from a script, or drive the avatar with your own recorded audio.

Introduction

Avatar V can be driven two ways: you write a script and the model speaks it (text-to-speech), or you provide your own audio file and the model lip-syncs the avatar to it. You pick one or the other, not both. The choice determines which parameters are available and how fast you can iterate.

This is Avatar V. One brief becomes a video presenter that speaks your script, in any voice, on any background, in any language.

{
  "taskType": "videoInference",
  "taskUUID": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "model": "heygen:avatar@5",
  "inputs": {
    "avatar": "man_casual_young_adult"
  },
  "speech": {
    "text": "This is Avatar V. One brief becomes a video presenter that speaks your script, in any voice, on any background, in any language.",
    "voice": "chill_brian_male_english"
  },
  "width": 1280,
  "height": 720
}

This guide covers avatar selection, which every request starts with, the two input modes for driving the speech, and the parameters that only exist on the TTS path (speed, pitch, volume, and language).

Two input modes

Every Avatar V request needs an avatar. After that, you supply either a speech block (with text and voice) or an inputs.audio reference. Sending both produces a validation error.

// TTS path
{
  "inputs": { "avatar": "..." },
  "speech": { "text": "...", "voice": "..." }
}

// Audio path
{
  "inputs": { "avatar": "...", "audio": "..." }
}

Use the TTS path when you want fast iteration on copy and easy localization. Use the audio path when you already have the exact voice you want. Voice tuning parameters (speed, pitch, volume) only apply on the TTS path.
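The either/or rule can be checked on the client before a request is sent. The sketch below is a minimal illustration, not part of any SDK: input_mode is a hypothetical helper name, and the checks only mirror the rules stated in this guide (an avatar is always required, speech needs text and voice, and speech and inputs.audio are mutually exclusive). The server's own validation may differ in detail.

```python
def input_mode(request: dict) -> str:
    """Return "tts" or "audio" for a request dict, or raise ValueError.

    Hypothetical client-side check based only on the rules in this guide.
    """
    inputs = request.get("inputs", {})
    if "avatar" not in inputs:
        raise ValueError("inputs.avatar is required on both paths")

    has_speech = "speech" in request
    has_audio = "audio" in inputs
    if has_speech and has_audio:
        raise ValueError("send either a speech block or inputs.audio, not both")
    if not has_speech and not has_audio:
        raise ValueError("request needs a speech block or inputs.audio")

    if has_speech:
        # text and voice are required together on the TTS path
        missing = {"text", "voice"} - set(request["speech"])
        if missing:
            raise ValueError(f"speech block is missing: {sorted(missing)}")
        return "tts"
    return "audio"


tts_request = {
    "inputs": {"avatar": "man_casual_young_adult"},
    "speech": {"text": "Welcome to our team.", "voice": "chill_brian_male_english"},
}
print(input_mode(tts_request))
```

Running a check like this locally catches a mixed request before it costs a round trip.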

Picking an avatar

The inputs.avatar parameter is a string ID from a fixed catalog of registered looks. Each avatar looks and moves differently. Same script, same voice, four different presenters:

man_casual_young_adult

Welcome to our team. Here's a quick overview of what you'll cover this week.

{
  "taskType": "videoInference",
  "taskUUID": "b2c3d4e5-f6a7-8901-bcde-f23456789012",
  "model": "heygen:avatar@5",
  "inputs": {
    "avatar": "man_casual_young_adult"
  },
  "speech": {
    "text": "Welcome to our team. Here's a quick overview of what you'll cover this week.",
    "voice": "jenny_female_english"
  },
  "width": 1280,
  "height": 720
}

woman_business_office

Welcome to our team. Here's a quick overview of what you'll cover this week.

{
  "taskType": "videoInference",
  "taskUUID": "c3d4e5f6-a7b8-9012-cdef-345678901234",
  "model": "heygen:avatar@5",
  "inputs": {
    "avatar": "woman_business_office"
  },
  "speech": {
    "text": "Welcome to our team. Here's a quick overview of what you'll cover this week.",
    "voice": "jenny_female_english"
  },
  "width": 1280,
  "height": 720
}

woman_middle_aged_sitting

Welcome to our team. Here's a quick overview of what you'll cover this week.

{
  "taskType": "videoInference",
  "taskUUID": "d4e5f6a7-b8c9-0123-def0-456789012345",
  "model": "heygen:avatar@5",
  "inputs": {
    "avatar": "woman_middle_aged_sitting"
  },
  "speech": {
    "text": "Welcome to our team. Here's a quick overview of what you'll cover this week.",
    "voice": "jenny_female_english"
  },
  "width": 1280,
  "height": 720
}

casual_sitting_young_adult

Welcome to our team. Here's a quick overview of what you'll cover this week.

{
  "taskType": "videoInference",
  "taskUUID": "e5f6a7b8-c9d0-1234-ef01-567890123456",
  "model": "heygen:avatar@5",
  "inputs": {
    "avatar": "casual_sitting_young_adult"
  },
  "speech": {
    "text": "Welcome to our team. Here's a quick overview of what you'll cover this week.",
    "voice": "jenny_female_english"
  },
  "width": 1280,
  "height": 720
}

All four videos use the same script and the same voice (jenny_female_english). On the male avatar this produces an intentional voice/face mismatch, which is exactly the point: lip sync adapts to whatever face you pair the audio with. Voice and avatar are independent parameters, picked separately.

Avatar IDs are validated against an enum at request time. Check the API reference for the full list of available avatars. A registered avatar look is required for every request, including audio-path calls.
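The four-presenter comparison above amounts to one request per avatar with everything else held constant. As a rough sketch, the payloads could be generated like this. build_tts_request is a hypothetical helper, not an SDK function, and the code only builds the request bodies; it does not send them.

```python
import uuid

AVATARS = [
    "man_casual_young_adult",
    "woman_business_office",
    "woman_middle_aged_sitting",
    "casual_sitting_young_adult",
]
SCRIPT = "Welcome to our team. Here's a quick overview of what you'll cover this week."


def build_tts_request(avatar: str, text: str, voice: str,
                      width: int = 1280, height: int = 720) -> dict:
    """Build one TTS-path task with a fresh taskUUID (hypothetical helper)."""
    return {
        "taskType": "videoInference",
        "taskUUID": str(uuid.uuid4()),
        "model": "heygen:avatar@5",
        "inputs": {"avatar": avatar},
        "speech": {"text": text, "voice": voice},
        "width": width,
        "height": height,
    }


# Same script and voice, one task per avatar.
tasks = [build_tts_request(avatar, SCRIPT, "jenny_female_english") for avatar in AVATARS]
for task in tasks:
    print(task["inputs"]["avatar"], task["taskUUID"])
```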

Driving with text

The TTS path takes a speech.text (the literal script) and a speech.voice (which voice reads it). Both are required together. The optional speech.language parameter adapts the voice's pronunciation to the target locale.

import { createClient } from '@runware/sdk'

const client = await createClient({ apiKey: process.env.RUNWARE_API_KEY })
await client.connect()

const [result] = await client.run({
  model: 'heygen:avatar@5',
  inputs: {
    avatar: 'man_casual_young_adult'
  },
  speech: {
    text: 'Welcome to our team. Here\'s a quick overview of what you\'ll cover this week.',
    voice: 'chill_brian_male_english'
  },
  width: 1280,
  height: 720
})

import asyncio
import os

from runware import Runware


async def main():
    async with Runware(api_key=os.environ["RUNWARE_API_KEY"]) as client:
        results = await client.run({
            "model": "heygen:avatar@5",
            "inputs": {
                "avatar": "man_casual_young_adult"
            },
            "speech": {
                "text": "Welcome to our team. Here's a quick overview of what you'll cover this week.",
                "voice": "chill_brian_male_english"
            },
            "width": 1280,
            "height": 720
        })


asyncio.run(main())

# Apostrophes in the script are written as '\'' so they don't end the single-quoted -d payload.
curl https://api.runware.ai/v1 \
  -H "Authorization: Bearer $RUNWARE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '[
    {
      "taskType": "videoInference",
      "taskUUID": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "model": "heygen:avatar@5",
      "inputs": {
        "avatar": "man_casual_young_adult"
      },
      "speech": {
        "text": "Welcome to our team. Here'\''s a quick overview of what you'\''ll cover this week.",
        "voice": "chill_brian_male_english"
      },
      "width": 1280,
      "height": 720
    }
  ]'

runware run heygen:avatar@5 \
  inputs.avatar=man_casual_young_adult \
  speech.text="Welcome to our team. Here's a quick overview of what you'll cover this week." \
  speech.voice=chill_brian_male_english \
  width=1280 \
  height=720

{
  "taskType": "videoInference",
  "taskUUID": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "model": "heygen:avatar@5",
  "inputs": {
    "avatar": "man_casual_young_adult"
  },
  "speech": {
    "text": "Welcome to our team. Here's a quick overview of what you'll cover this week.",
    "voice": "chill_brian_male_english"
  },
  "width": 1280,
  "height": 720
}

Swapping voices

The voice parameter has the largest impact on perceived performance. Same script, same avatar, four different voices:

chill_brian_male_english

Welcome to our team. Here's a quick overview of what you'll cover this week.

{
  "taskType": "videoInference",
  "taskUUID": "f6a7b8c9-d0e1-2345-f012-678901234567",
  "model": "heygen:avatar@5",
  "inputs": {
    "avatar": "man_casual_young_adult"
  },
  "speech": {
    "text": "Welcome to our team. Here's a quick overview of what you'll cover this week.",
    "voice": "chill_brian_male_english"
  },
  "width": 1280,
  "height": 720
}

baritone_ben_male_english

Welcome to our team. Here's a quick overview of what you'll cover this week.

{
  "taskType": "videoInference",
  "taskUUID": "a7b8c9d0-e1f2-3456-0123-789012345678",
  "model": "heygen:avatar@5",
  "inputs": {
    "avatar": "man_casual_young_adult"
  },
  "speech": {
    "text": "Welcome to our team. Here's a quick overview of what you'll cover this week.",
    "voice": "baritone_ben_male_english"
  },
  "width": 1280,
  "height": 720
}

expressive_evan_male_english_6638ff

Welcome to our team. Here's a quick overview of what you'll cover this week.

{
  "taskType": "videoInference",
  "taskUUID": "b8c9d0e1-f2a3-4567-1234-890123456789",
  "model": "heygen:avatar@5",
  "inputs": {
    "avatar": "man_casual_young_adult"
  },
  "speech": {
    "text": "Welcome to our team. Here's a quick overview of what you'll cover this week.",
    "voice": "expressive_evan_male_english_6638ff"
  },
  "width": 1280,
  "height": 720
}

professor_dean_male_english

Welcome to our team. Here's a quick overview of what you'll cover this week.

{
  "taskType": "videoInference",
  "taskUUID": "c9d0e1f2-a3b4-5678-2345-901234567890",
  "model": "heygen:avatar@5",
  "inputs": {
    "avatar": "man_casual_young_adult"
  },
  "speech": {
    "text": "Welcome to our team. Here's a quick overview of what you'll cover this week.",
    "voice": "professor_dean_male_english"
  },
  "width": 1280,
  "height": 720
}

The voice catalog covers a wide range of registers and styles. Pick the one that fits the persona, then keep it fixed while you iterate on copy.

Voice IDs aren't tied to a specific avatar. A masculine-named voice on a feminine-presenting avatar will sync correctly, but the audio/visual mismatch is usually jarring for viewers. Pair voices and avatars deliberately.

Driving with audio

When you already have the voice you want, send the audio directly. The inputs.audio parameter accepts a public URL or a UUID from any previously uploaded asset. The model extracts phonemes from the audio and animates the avatar to match.

The clip below was generated separately by Inworld TTS-2, then passed straight into Avatar V via its returned URL:

Source audio, generated separately

Avatar V driven by the audio above

Welcome to our team. Here's a quick overview of what you'll cover this week.

{
  "taskType": "videoInference",
  "taskUUID": "d0e1f2a3-b4c5-6789-3456-012345678901",
  "model": "heygen:avatar@5",
  "inputs": {
    "avatar": "man_casual_young_adult",
    "audio": "https://example.com/audio.mp3"
  },
  "width": 1280,
  "height": 720
}

The audio-path request omits the speech block entirely:

import { createClient } from '@runware/sdk'

const client = await createClient({ apiKey: process.env.RUNWARE_API_KEY })
await client.connect()

const [result] = await client.run({
  model: 'heygen:avatar@5',
  inputs: {
    avatar: 'man_casual_young_adult',
    audio: 'https://example.com/audio.mp3'
  },
  width: 1280,
  height: 720
})

import asyncio
import os

from runware import Runware


async def main():
    async with Runware(api_key=os.environ["RUNWARE_API_KEY"]) as client:
        results = await client.run({
            "model": "heygen:avatar@5",
            "inputs": {
                "avatar": "man_casual_young_adult",
                "audio": "https://example.com/audio.mp3"
            },
            "width": 1280,
            "height": 720
        })


asyncio.run(main())

curl https://api.runware.ai/v1 \
  -H "Authorization: Bearer $RUNWARE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '[
    {
      "taskType": "videoInference",
      "taskUUID": "b2c3d4e5-f6a7-8901-bcde-f23456789012",
      "model": "heygen:avatar@5",
      "inputs": {
        "avatar": "man_casual_young_adult",
        "audio": "https://example.com/audio.mp3"
      },
      "width": 1280,
      "height": 720
    }
  ]'

runware run heygen:avatar@5 \
  inputs.avatar=man_casual_young_adult \
  inputs.audio=https://example.com/audio.mp3 \
  width=1280 \
  height=720

{
  "taskType": "videoInference",
  "taskUUID": "b2c3d4e5-f6a7-8901-bcde-f23456789012",
  "model": "heygen:avatar@5",
  "inputs": {
    "avatar": "man_casual_young_adult",
    "audio": "https://example.com/audio.mp3"
  },
  "width": 1280,
  "height": 720
}

The audio path is the right call when:

You have a recording of a real human voice or output from a voice clone and want that delivery preserved exactly.
The audio is the source of truth and the visual is the wrapper (podcasts, interviews, narration, dubbed content).

You lose access to speech.speed, speech.pitch, speech.volume, and speech.language on this path. Adjust those upstream, in whatever produced the audio.
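Because inputs.audio accepts either a public URL or the UUID of an uploaded asset, it can help to classify the reference before building the request. The sketch below is illustrative only: classify_audio_ref and build_audio_request are hypothetical helpers, and treating both http and https as acceptable schemes is an assumption, not something this guide confirms.

```python
import uuid
from urllib.parse import urlparse


def classify_audio_ref(ref: str) -> str:
    """Return "uuid" or "url" for an inputs.audio value, or raise ValueError."""
    try:
        uuid.UUID(ref)
        return "uuid"
    except ValueError:
        pass  # not a UUID, fall through to the URL check
    parsed = urlparse(ref)
    # Assumption: any absolute http(s) URL counts as a public URL.
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    raise ValueError(f"not a URL or UUID: {ref!r}")


def build_audio_request(avatar: str, audio_ref: str,
                        width: int = 1280, height: int = 720) -> dict:
    """Build an audio-path task. There is deliberately no speech block."""
    classify_audio_ref(audio_ref)  # raises early on a malformed reference
    return {
        "taskType": "videoInference",
        "taskUUID": str(uuid.uuid4()),
        "model": "heygen:avatar@5",
        "inputs": {"avatar": avatar, "audio": audio_ref},
        "width": width,
        "height": height,
    }


task = build_audio_request("man_casual_young_adult", "https://example.com/audio.mp3")
print(sorted(task))
```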

Voice tuning

When you're on the TTS path, three numeric parameters reshape delivery without changing the voice or script.

speech.speed ranges from 0.5 to 1.5, default 1.0. Below 0.85 the read feels deliberate. Above 1.15 it starts to feel rushed.
speech.pitch ranges from -50 to +50, default 0. Small adjustments (±5 to ±15) shift the voice's age and tone without making it sound processed. Larger values quickly cross into chipmunk or robot territory.
speech.volume ranges from 0.0 to 1.0, default 1.0. Most useful when you're mixing the avatar's voice into a track with background music or effects, where lowering the avatar's volume creates room for the mix.
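The three ranges above are easy to check before sending a request. This is a minimal sketch: check_tuning is a hypothetical helper, and treating the bounds as inclusive is an assumption that the guide does not state either way.

```python
# Ranges as listed in this guide; inclusive bounds are an assumption.
TUNING_RANGES = {
    "speed": (0.5, 1.5),
    "pitch": (-50, 50),
    "volume": (0.0, 1.0),
}


def check_tuning(speech: dict) -> list:
    """Return a list of human-readable problems; an empty list means all values are in range."""
    problems = []
    for name, (low, high) in TUNING_RANGES.items():
        if name in speech and not (low <= speech[name] <= high):
            problems.append(f"speech.{name}={speech[name]} is outside [{low}, {high}]")
    return problems


speech = {
    "text": "Welcome to our team.",
    "voice": "chill_brian_male_english",
    "speed": 0.7,
    "pitch": 80,
}
for problem in check_tuning(speech):
    print(problem)
```

Parameters that are absent are skipped, so the defaults (1.0, 0, and 1.0) never need to be spelled out.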

Speed is the lever you'll reach for most often:

Welcome to our team. Here's a quick overview of what you'll cover this week.

{
  "taskType": "videoInference",
  "taskUUID": "e1f2a3b4-c5d6-7890-4567-123456789012",
  "model": "heygen:avatar@5",
  "inputs": {
    "avatar": "man_casual_young_adult"
  },
  "speech": {
    "text": "Welcome to our team. Here's a quick overview of what you'll cover this week.",
    "voice": "chill_brian_male_english",
    "speed": 0.7
  },
  "width": 1280,
  "height": 720
}

Welcome to our team. Here's a quick overview of what you'll cover this week.

{
  "taskType": "videoInference",
  "taskUUID": "f2a3b4c5-d6e7-8901-5678-234567890123",
  "model": "heygen:avatar@5",
  "inputs": {
    "avatar": "man_casual_young_adult"
  },
  "speech": {
    "text": "Welcome to our team. Here's a quick overview of what you'll cover this week.",
    "voice": "chill_brian_male_english",
    "speed": 1.0
  },
  "width": 1280,
  "height": 720
}

Welcome to our team. Here's a quick overview of what you'll cover this week.

{
  "taskType": "videoInference",
  "taskUUID": "a3b4c5d6-e7f8-9012-6789-345678901234",
  "model": "heygen:avatar@5",
  "inputs": {
    "avatar": "man_casual_young_adult"
  },
  "speech": {
    "text": "Welcome to our team. Here's a quick overview of what you'll cover this week.",
    "voice": "chill_brian_male_english",
    "speed": 1.3
  },
  "width": 1280,
  "height": 720
}

The slower read suits training content where every word matters. The faster read fits social cuts where attention drops off after a few seconds. The default is calibrated for general-purpose narration.

Multilingual delivery

The speech.language parameter accepts a BCP 47 locale code (en-US, es-ES, fr-FR, ja-JP, and roughly 180 others). When set, the same voice adapts its pronunciation to the target language. Translate the script once, set the language, send the request:

en-US

Welcome to our team. Here's a quick overview of what you'll cover this week.

{
  "taskType": "videoInference",
  "taskUUID": "b4c5d6e7-f8a9-0123-7890-456789012345",
  "model": "heygen:avatar@5",
  "inputs": {
    "avatar": "man_casual_young_adult"
  },
  "speech": {
    "text": "Welcome to our team. Here's a quick overview of what you'll cover this week.",
    "voice": "jenny_female_english",
    "language": "en-US"
  },
  "width": 1280,
  "height": 720
}

es-ES

Bienvenido a nuestro equipo. Aquí tienes un breve resumen de lo que verás esta semana.

{
  "taskType": "videoInference",
  "taskUUID": "c5d6e7f8-a9b0-1234-8901-567890123456",
  "model": "heygen:avatar@5",
  "inputs": {
    "avatar": "man_casual_young_adult"
  },
  "speech": {
    "text": "Bienvenido a nuestro equipo. Aquí tienes un breve resumen de lo que verás esta semana.",
    "voice": "jenny_female_english",
    "language": "es-ES"
  },
  "width": 1280,
  "height": 720
}

fr-FR

Bienvenue dans notre équipe. Voici un bref aperçu de ce que vous découvrirez cette semaine.

{
  "taskType": "videoInference",
  "taskUUID": "d6e7f8a9-b0c1-2345-9012-678901234567",
  "model": "heygen:avatar@5",
  "inputs": {
    "avatar": "man_casual_young_adult"
  },
  "speech": {
    "text": "Bienvenue dans notre équipe. Voici un bref aperçu de ce que vous découvrirez cette semaine.",
    "voice": "jenny_female_english",
    "language": "fr-FR"
  },
  "width": 1280,
  "height": 720
}

ja-JP

私たちのチームへようこそ。今週学ぶ内容の概要を簡単にご紹介します。

{
  "taskType": "videoInference",
  "taskUUID": "e7f8a9b0-c1d2-3456-0123-789012345678",
  "model": "heygen:avatar@5",
  "inputs": {
    "avatar": "man_casual_young_adult"
  },
  "speech": {
    "text": "私たちのチームへようこそ。今週学ぶ内容の概要を簡単にご紹介します。",
    "voice": "jenny_female_english",
    "language": "ja-JP"
  },
  "width": 1280,
  "height": 720
}

All four videos use the same avatar and the same voice (jenny_female_english). The only differences are the translated script and the locale code:

// English
"speech": { "text": "Welcome to our team...", "voice": "jenny_female_english", "language": "en-US" }

// Spanish
"speech": { "text": "Bienvenido a nuestro equipo...", "voice": "jenny_female_english", "language": "es-ES" }

This is the cheapest way to localize a video at scale: one avatar and one voice, looped over a list of locale codes and their translated scripts.
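A localization loop along these lines might look like the sketch below. localized_tasks is a hypothetical helper, and the code only assembles request bodies from translations you supply. It neither translates text nor sends anything.

```python
import uuid

# Locale code -> already-translated script (taken from the examples above).
TRANSLATIONS = {
    "en-US": "Welcome to our team. Here's a quick overview of what you'll cover this week.",
    "es-ES": "Bienvenido a nuestro equipo. Aquí tienes un breve resumen de lo que verás esta semana.",
    "fr-FR": "Bienvenue dans notre équipe. Voici un bref aperçu de ce que vous découvrirez cette semaine.",
    "ja-JP": "私たちのチームへようこそ。今週学ぶ内容の概要を簡単にご紹介します。",
}


def localized_tasks(avatar: str, voice: str, translations: dict) -> list:
    """Build one TTS-path task per locale; only speech.text and speech.language vary."""
    return [
        {
            "taskType": "videoInference",
            "taskUUID": str(uuid.uuid4()),
            "model": "heygen:avatar@5",
            "inputs": {"avatar": avatar},
            "speech": {"text": text, "voice": voice, "language": locale},
            "width": 1280,
            "height": 720,
        }
        for locale, text in translations.items()
    ]


tasks = localized_tasks("man_casual_young_adult", "jenny_female_english", TRANSLATIONS)
for task in tasks:
    print(task["speech"]["language"], len(task["speech"]["text"]))
```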

The voice catalog is largely English-named, but each voice can speak any supported language. If you need a voice built for a specific language, check the catalog for entries whose name reflects that language. Otherwise, let the voice adapt via the language parameter.

Tips

Lock the avatar and voice before iterating on copy. Both choices are immediately visible (or audible) in the output. Changing them mid-iteration resets your sense of what the read sounds like and slows you down.
Use the TTS path for A/B testing copy. Two requests with different speech.text produce two videos in minutes. The same iteration on the audio path requires re-recording or re-generating audio first.
Use the audio path for brand voices. If your brand has a specific human voice associated with it, produce that voice upstream (clone, recording, separate TTS provider) and feed Avatar V the audio. The lip sync handles the rest.
Test the avatar/voice pairing on a short script first. A 10-second take renders faster and surfaces any avatar/voice mismatch just as clearly as a full minute. Once the pairing feels right, send the full script.
Translate, don't transliterate. When localizing, get a translation that reads naturally in the target language, then send the translated string as speech.text. The language code alone won't fix awkward source copy.