HeyGen Avatar V
HeyGen Avatar V is an avatar video generation model for talking digital twins and other eligible registered avatar looks. It improves identity preservation, lip sync accuracy, facial expressiveness, and motion coherence across angle changes, scene changes, and long-form videos, making it well suited to presenter, training, and localization workflows where avatar stability matters.
Driving the avatar: text to speech or your own audio
How to choose between the TTS path and the audio-input path when generating Avatar V videos. Covers avatar selection, voice swapping, speed tuning, and multilingual delivery from a single script.
Introduction
Avatar V can be driven two ways: you write a script and the model speaks it (text-to-speech), or you provide your own audio file and the model lip-syncs the avatar to it. You pick one or the other, not both. The choice changes everything downstream, from which parameters are available to how fast you can iterate.
This is Avatar V. One brief becomes a video presenter that speaks your script, in any voice, on any background, in any language.
{
"taskType": "videoInference",
"taskUUID": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"model": "heygen:avatar@5",
"inputs": {
"avatar": "man_casual_young_adult"
},
"speech": {
"text": "This is Avatar V. One brief becomes a video presenter that speaks your script, in any voice, on any background, in any language.",
"voice": "chill_brian_male_english"
},
"width": 1280,
"height": 720
}
This guide covers the avatar selection that opens every request, both input modes for driving the speech, and the tuning parameters (speed, pitch, volume, language) that come with the TTS path.
Two input modes
Every Avatar V request needs an avatar. After that, you supply either a speech block (with text and voice) or an inputs.audio reference. Sending both produces a validation error.
// TTS path
{
"inputs": { "avatar": "..." },
"speech": { "text": "...", "voice": "..." }
}
// Audio path
{
"inputs": { "avatar": "...", "audio": "..." }
}
Use the TTS path when you want fast iteration on copy and easy localization. Use the audio path when you already have the exact voice you want. Voice tuning parameters (speed, pitch, volume) only apply on the TTS path.
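The either/or rule above can be checked client-side before a request is sent. The following is a minimal sketch, not part of any official SDK: `build_request` is a hypothetical helper name, and the field names are simply copied from the JSON examples in this guide.

```python
import uuid


def build_request(avatar, speech=None, audio=None, width=1280, height=720):
    """Build one Avatar V task dict, enforcing exactly one input mode.

    Hypothetical helper: field names mirror the JSON examples in this guide.
    """
    # Exactly one of `speech` and `audio` must be supplied.
    if (speech is None) == (audio is None):
        raise ValueError("Provide either a speech block or inputs.audio, not both or neither")

    inputs = {"avatar": avatar}
    task = {
        "taskType": "videoInference",
        "taskUUID": str(uuid.uuid4()),
        "model": "heygen:avatar@5",
        "inputs": inputs,
        "width": width,
        "height": height,
    }
    if audio is not None:
        # Audio path: no speech block at all.
        inputs["audio"] = audio
    else:
        # TTS path: text and voice are required together.
        if "text" not in speech or "voice" not in speech:
            raise ValueError("speech needs both text and voice")
        task["speech"] = dict(speech)
    return task


tts_task = build_request(
    "man_casual_young_adult",
    speech={"text": "Welcome to our team.", "voice": "chill_brian_male_english"},
)
audio_task = build_request(
    "man_casual_young_adult",
    audio="https://aud.runware.ai/audio/abc12345.mp3",
)
```

Catching the conflict locally only saves a round trip; the API's own validation remains the authority on what is accepted.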
Picking an avatar
The inputs.avatar parameter is a string ID from a fixed catalog of registered looks. Each avatar looks and moves differently. Same script, same voice, four different presenters:
Welcome to our team. Here's a quick overview of what you'll cover this week.
{
"taskType": "videoInference",
"taskUUID": "b2c3d4e5-f6a7-8901-bcde-f23456789012",
"model": "heygen:avatar@5",
"inputs": {
"avatar": "man_casual_young_adult"
},
"speech": {
"text": "Welcome to our team. Here's a quick overview of what you'll cover this week.",
"voice": "jenny_female_english"
},
"width": 1280,
"height": 720
}
Welcome to our team. Here's a quick overview of what you'll cover this week.
{
"taskType": "videoInference",
"taskUUID": "c3d4e5f6-a7b8-9012-cdef-345678901234",
"model": "heygen:avatar@5",
"inputs": {
"avatar": "woman_business_office"
},
"speech": {
"text": "Welcome to our team. Here's a quick overview of what you'll cover this week.",
"voice": "jenny_female_english"
},
"width": 1280,
"height": 720
}
Welcome to our team. Here's a quick overview of what you'll cover this week.
{
"taskType": "videoInference",
"taskUUID": "d4e5f6a7-b8c9-0123-def0-456789012345",
"model": "heygen:avatar@5",
"inputs": {
"avatar": "woman_middle_aged_sitting"
},
"speech": {
"text": "Welcome to our team. Here's a quick overview of what you'll cover this week.",
"voice": "jenny_female_english"
},
"width": 1280,
"height": 720
}
Welcome to our team. Here's a quick overview of what you'll cover this week.
{
"taskType": "videoInference",
"taskUUID": "e5f6a7b8-c9d0-1234-ef01-567890123456",
"model": "heygen:avatar@5",
"inputs": {
"avatar": "casual_sitting_young_adult"
},
"speech": {
"text": "Welcome to our team. Here's a quick overview of what you'll cover this week.",
"voice": "jenny_female_english"
},
"width": 1280,
"height": 720
}
All four videos use the same script and the same voice (jenny_female_english). On the male avatar this produces an intentional voice/face mismatch, which is exactly the point: lip sync adapts to whatever face you pair the audio with. Voice and avatar are independent parameters, picked separately.
Avatar IDs are validated against an enum at request time. Check the API reference for the full list of available avatars. A registered avatar look is required for every request, including audio-path calls.
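A comparison like the one above can be produced by copying one base task and changing only inputs.avatar. This is a rough sketch under stated assumptions: `sweep_avatars` is a hypothetical helper, the four avatar IDs are the ones used in this guide, and each task gets a freshly generated taskUUID.

```python
import copy
import uuid

# Base task copied from the examples in this guide.
BASE_TASK = {
    "taskType": "videoInference",
    "taskUUID": "",
    "model": "heygen:avatar@5",
    "inputs": {"avatar": "man_casual_young_adult"},
    "speech": {
        "text": "Welcome to our team. Here's a quick overview of what you'll cover this week.",
        "voice": "jenny_female_english",
    },
    "width": 1280,
    "height": 720,
}

# Avatar IDs taken from the four examples above.
AVATARS = [
    "man_casual_young_adult",
    "woman_business_office",
    "woman_middle_aged_sitting",
    "casual_sitting_young_adult",
]


def sweep_avatars(base_task, avatars):
    """Return one task per avatar; script and voice stay identical."""
    tasks = []
    for avatar in avatars:
        task = copy.deepcopy(base_task)  # never mutate the shared template
        task["taskUUID"] = str(uuid.uuid4())
        task["inputs"]["avatar"] = avatar
        tasks.append(task)
    return tasks


avatar_tasks = sweep_avatars(BASE_TASK, AVATARS)
```

The deep copy matters because inputs is a nested dict; a shallow copy would leave every task pointing at the same avatar.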
Driving with text
The TTS path takes a speech.text (the literal script) and a speech.voice (which voice reads it). Both are required together. The optional speech.language parameter sets the voice's pronunciation to the target locale.
[
{
"taskType": "videoInference",
"taskUUID": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"model": "heygen:avatar@5",
"inputs": {
"avatar": "man_casual_young_adult"
},
"speech": {
"text": "Welcome to our team. Here's a quick overview of what you'll cover this week.",
"voice": "chill_brian_male_english"
},
"width": 1280,
"height": 720
}
]
Swapping voices
The voice parameter has the largest impact on perceived performance. Same script, same avatar, four different voices:
Welcome to our team. Here's a quick overview of what you'll cover this week.
{
"taskType": "videoInference",
"taskUUID": "f6a7b8c9-d0e1-2345-f012-678901234567",
"model": "heygen:avatar@5",
"inputs": {
"avatar": "man_casual_young_adult"
},
"speech": {
"text": "Welcome to our team. Here's a quick overview of what you'll cover this week.",
"voice": "chill_brian_male_english"
},
"width": 1280,
"height": 720
}
Welcome to our team. Here's a quick overview of what you'll cover this week.
{
"taskType": "videoInference",
"taskUUID": "a7b8c9d0-e1f2-3456-0123-789012345678",
"model": "heygen:avatar@5",
"inputs": {
"avatar": "man_casual_young_adult"
},
"speech": {
"text": "Welcome to our team. Here's a quick overview of what you'll cover this week.",
"voice": "baritone_ben_male_english"
},
"width": 1280,
"height": 720
}
Welcome to our team. Here's a quick overview of what you'll cover this week.
{
"taskType": "videoInference",
"taskUUID": "b8c9d0e1-f2a3-4567-1234-890123456789",
"model": "heygen:avatar@5",
"inputs": {
"avatar": "man_casual_young_adult"
},
"speech": {
"text": "Welcome to our team. Here's a quick overview of what you'll cover this week.",
"voice": "expressive_evan_male_english_6638ff"
},
"width": 1280,
"height": 720
}
Welcome to our team. Here's a quick overview of what you'll cover this week.
{
"taskType": "videoInference",
"taskUUID": "c9d0e1f2-a3b4-5678-2345-901234567890",
"model": "heygen:avatar@5",
"inputs": {
"avatar": "man_casual_young_adult"
},
"speech": {
"text": "Welcome to our team. Here's a quick overview of what you'll cover this week.",
"voice": "professor_dean_male_english"
},
"width": 1280,
"height": 720
}
The voice catalog covers a wide range of registers and styles. Pick the one that fits the persona, then keep it fixed while you iterate on copy.
Voice IDs aren't tied to a specific avatar. A masculine-named voice on a feminine-presenting avatar will sync correctly, but the audio/visual mismatch is usually jarring for viewers. Pair voices and avatars deliberately.
Driving with audio
When you already have the voice you want, send the audio directly. The inputs.audio parameter accepts a public URL or a UUID from any previously uploaded asset. The model extracts phonemes from the audio and animates the avatar to match.
The clip below was generated separately by Inworld TTS-2, then passed straight into Avatar V via its returned URL:
Welcome to our team. Here's a quick overview of what you'll cover this week.
{
"taskType": "videoInference",
"taskUUID": "d0e1f2a3-b4c5-6789-3456-012345678901",
"model": "heygen:avatar@5",
"inputs": {
"avatar": "man_casual_young_adult",
"audio": "https://aud.runware.ai/audio/abc12345.mp3"
},
"width": 1280,
"height": 720
}
The audio-path request omits the speech block entirely:
[
{
"taskType": "videoInference",
"taskUUID": "b2c3d4e5-f6a7-8901-bcde-f23456789012",
"model": "heygen:avatar@5",
"inputs": {
"avatar": "man_casual_young_adult",
"audio": "https://aud.runware.ai/audio/abc12345.mp3"
},
"width": 1280,
"height": 720
}
]
The audio path is the right call when:
- You have a recording of a real human voice or output from a voice clone and want that delivery preserved exactly.
- The audio is the source of truth and the visual is the wrapper (podcasts, interviews, narration, dubbed content).
You lose access to speech.speed, speech.pitch, speech.volume, and speech.language on this path. Adjust those upstream, in whatever produced the audio.
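When an existing TTS request is switched to the audio path, it helps to see which tuning values no longer apply so they can be reproduced upstream. The sketch below assumes a hypothetical helper, `to_audio_request`, and is not an API feature. It removes the speech block, attaches the audio reference, and reports the dropped settings.

```python
import copy
import uuid

# Speech keys that have no effect once the request uses inputs.audio.
TUNING_KEYS = ("speed", "pitch", "volume", "language")


def to_audio_request(tts_task, audio_ref):
    """Convert a TTS-path task into an audio-path task.

    Returns (audio_task, dropped), where `dropped` holds the tuning values
    that must now be applied in whatever tool produced the audio.
    """
    task = copy.deepcopy(tts_task)
    speech = task.pop("speech", {})           # the audio path carries no speech block
    task["inputs"]["audio"] = audio_ref       # public URL or uploaded-asset UUID
    task["taskUUID"] = str(uuid.uuid4())      # treat it as a new task
    dropped = {key: speech[key] for key in TUNING_KEYS if key in speech}
    return task, dropped


tts_example = {
    "taskType": "videoInference",
    "taskUUID": "e1f2a3b4-c5d6-7890-4567-123456789012",
    "model": "heygen:avatar@5",
    "inputs": {"avatar": "man_casual_young_adult"},
    "speech": {
        "text": "Welcome to our team.",
        "voice": "chill_brian_male_english",
        "speed": 0.7,
    },
    "width": 1280,
    "height": 720,
}

audio_example, dropped_settings = to_audio_request(
    tts_example, "https://aud.runware.ai/audio/abc12345.mp3"
)
print(dropped_settings)
```

Here the dropped settings contain the 0.7 speed, which would need to be baked into the audio file itself.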
Voice tuning
When you're on the TTS path, three numeric parameters reshape delivery without changing the voice or script.
- speech.speed ranges from 0.5 to 1.5, default 1.0. Below 0.85 the read feels deliberate. Above 1.15 it starts to feel rushed.
- speech.pitch ranges from -50 to +50, default 0. Small adjustments (±5 to ±15) shift the voice's age and tone without making it sound processed. Larger values quickly cross into chipmunk or robot territory.
- speech.volume ranges from 0.0 to 1.0, default 1.0. Most useful when you're mixing the avatar's voice into a track with background music or effects, where lowering the avatar's volume creates room for the mix.
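The three ranges above are easy to check before sending a request. This is a minimal sketch, assuming the bounds are inclusive (the guide does not say either way). `check_speech_tuning` is a hypothetical helper name.

```python
# Ranges as listed in this guide; inclusive bounds are an assumption.
SPEECH_RANGES = {
    "speed": (0.5, 1.5),
    "pitch": (-50, 50),
    "volume": (0.0, 1.0),
}


def check_speech_tuning(speech):
    """Return a list of human-readable problems; empty means all values are in range."""
    problems = []
    for key, (low, high) in SPEECH_RANGES.items():
        if key in speech and not (low <= speech[key] <= high):
            problems.append(f"speech.{key}={speech[key]} is outside [{low}, {high}]")
    return problems


ok = check_speech_tuning({"text": "Hi", "voice": "chill_brian_male_english", "speed": 0.7})
bad = check_speech_tuning({"speed": 2.0, "pitch": -60, "volume": 0.5})
```

Omitted keys are skipped, so the API defaults (1.0, 0, 1.0) apply untouched.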
Speed is the lever you'll reach for most often:
Welcome to our team. Here's a quick overview of what you'll cover this week.
{
"taskType": "videoInference",
"taskUUID": "e1f2a3b4-c5d6-7890-4567-123456789012",
"model": "heygen:avatar@5",
"inputs": {
"avatar": "man_casual_young_adult"
},
"speech": {
"text": "Welcome to our team. Here's a quick overview of what you'll cover this week.",
"voice": "chill_brian_male_english",
"speed": 0.7
},
"width": 1280,
"height": 720
}
Welcome to our team. Here's a quick overview of what you'll cover this week.
{
"taskType": "videoInference",
"taskUUID": "f2a3b4c5-d6e7-8901-5678-234567890123",
"model": "heygen:avatar@5",
"inputs": {
"avatar": "man_casual_young_adult"
},
"speech": {
"text": "Welcome to our team. Here's a quick overview of what you'll cover this week.",
"voice": "chill_brian_male_english",
"speed": 1.0
},
"width": 1280,
"height": 720
}
Welcome to our team. Here's a quick overview of what you'll cover this week.
{
"taskType": "videoInference",
"taskUUID": "a3b4c5d6-e7f8-9012-6789-345678901234",
"model": "heygen:avatar@5",
"inputs": {
"avatar": "man_casual_young_adult"
},
"speech": {
"text": "Welcome to our team. Here's a quick overview of what you'll cover this week.",
"voice": "chill_brian_male_english",
"speed": 1.3
},
"width": 1280,
"height": 720
}
The slower read suits training content where every word matters. The faster read fits social cuts where attention drops off after a few seconds. The default is calibrated for general-purpose narration.
Multilingual delivery
The speech.language parameter accepts a BCP 47 locale code (en-US, es-ES, fr-FR, ja-JP, and roughly 180 others). When set, the same voice adapts its pronunciation to the target language. Translate the script once, set the language, send the request:
Welcome to our team. Here's a quick overview of what you'll cover this week.
{
"taskType": "videoInference",
"taskUUID": "b4c5d6e7-f8a9-0123-7890-456789012345",
"model": "heygen:avatar@5",
"inputs": {
"avatar": "man_casual_young_adult"
},
"speech": {
"text": "Welcome to our team. Here's a quick overview of what you'll cover this week.",
"voice": "jenny_female_english",
"language": "en-US"
},
"width": 1280,
"height": 720
}
Bienvenido a nuestro equipo. Aquí tienes un breve resumen de lo que verás esta semana.
{
"taskType": "videoInference",
"taskUUID": "c5d6e7f8-a9b0-1234-8901-567890123456",
"model": "heygen:avatar@5",
"inputs": {
"avatar": "man_casual_young_adult"
},
"speech": {
"text": "Bienvenido a nuestro equipo. Aquí tienes un breve resumen de lo que verás esta semana.",
"voice": "jenny_female_english",
"language": "es-ES"
},
"width": 1280,
"height": 720
}
Bienvenue dans notre équipe. Voici un bref aperçu de ce que vous découvrirez cette semaine.
{
"taskType": "videoInference",
"taskUUID": "d6e7f8a9-b0c1-2345-9012-678901234567",
"model": "heygen:avatar@5",
"inputs": {
"avatar": "man_casual_young_adult"
},
"speech": {
"text": "Bienvenue dans notre équipe. Voici un bref aperçu de ce que vous découvrirez cette semaine.",
"voice": "jenny_female_english",
"language": "fr-FR"
},
"width": 1280,
"height": 720
}
私たちのチームへようこそ。今週学ぶ内容の概要を簡単にご紹介します。
{
"taskType": "videoInference",
"taskUUID": "e7f8a9b0-c1d2-3456-0123-789012345678",
"model": "heygen:avatar@5",
"inputs": {
"avatar": "man_casual_young_adult"
},
"speech": {
"text": "私たちのチームへようこそ。今週学ぶ内容の概要を簡単にご紹介します。",
"voice": "jenny_female_english",
"language": "ja-JP"
},
"width": 1280,
"height": 720
}
All four videos use the same avatar and the same voice (jenny_female_english). The only differences are the translated script and the locale code:
// English
"speech": { "text": "Welcome to our team...", "voice": "jenny_female_english", "language": "en-US" }
// Spanish
"speech": { "text": "Bienvenido a nuestro equipo...", "voice": "jenny_female_english", "language": "es-ES" }
This is the cheapest way to localize a video at scale. One script, one voice, one avatar, looped over a list of locale codes and translations.
The voice catalog is largely English-named, but each voice can speak any supported language. If you need a voice built for a specific language, check the catalog for entries whose name reflects that language. Otherwise, let the voice adapt via the language parameter.
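The localization loop described above could look roughly like this. It is a sketch under stated assumptions: `localized_tasks` is a hypothetical helper, the translations are the ones shown in this guide, and actually submitting the tasks to the API is left out.

```python
import json
import uuid

# Locale code -> translated script, taken from the examples in this guide.
TRANSLATIONS = {
    "en-US": "Welcome to our team. Here's a quick overview of what you'll cover this week.",
    "es-ES": "Bienvenido a nuestro equipo. Aquí tienes un breve resumen de lo que verás esta semana.",
    "fr-FR": "Bienvenue dans notre équipe. Voici un bref aperçu de ce que vous découvrirez cette semaine.",
    "ja-JP": "私たちのチームへようこそ。今週学ぶ内容の概要を簡単にご紹介します。",
}


def localized_tasks(avatar, voice, translations):
    """Build one TTS-path task per locale, keeping avatar and voice fixed."""
    return [
        {
            "taskType": "videoInference",
            "taskUUID": str(uuid.uuid4()),
            "model": "heygen:avatar@5",
            "inputs": {"avatar": avatar},
            "speech": {"text": text, "voice": voice, "language": locale},
            "width": 1280,
            "height": 720,
        }
        for locale, text in translations.items()
    ]


locale_tasks = localized_tasks("man_casual_young_adult", "jenny_female_english", TRANSLATIONS)

# ensure_ascii=False keeps the Japanese and accented text readable in the payload.
payload = json.dumps(locale_tasks, ensure_ascii=False, indent=2)
```

Adding a locale then means adding one entry to the translations mapping; nothing else in the request changes.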
Tips
- Lock the avatar and voice before iterating on copy. Both choices are visible in the output. Changing them mid-iteration resets your sense of what the read sounds like and slows you down.
- Use the TTS path for A/B testing copy. Two requests with different speech.text produce two videos in minutes. The same iteration on the audio path requires re-recording or re-generating audio first.
- Use the audio path for brand voices. If your brand has a specific human voice associated with it, produce that voice upstream (clone, recording, separate TTS provider) and feed Avatar V the audio. The lip sync handles the rest.
- Test the avatar/voice pairing on a short script first. A 10-second take renders faster and surfaces any avatar/voice mismatch just as clearly as a full minute. Once the pairing feels right, send the full script.
- Translate, don't transliterate. When localizing, get a translation that reads naturally in the target language, then send the translated string as speech.text. The language code alone won't fix awkward source copy.