Model ID: prunaai:p-video@replace

live

P-Video-Replace

by Pruna AI, June 4, 2026

P-Video-Replace is a video transformation model that swaps the on-camera character in an existing video with the character from a reference image. It is built to preserve the original motion, timing, camera behavior, lighting, and background while changing who appears in the clip, making it useful for UGC ad variations, content localization, avatar or mascot insertion, and other scalable character-replacement workflows.

Replacing the character in a video

How to use Pruna P-Video-Replace to swap the on-camera character in an existing video with one from a reference image while preserving the original motion, timing, camera, lighting, and audio.

Introduction

Re-using the same video clip with different on-camera characters is awkward in most pipelines. Re-shooting or re-prompting the scene for each character changes the motion, the timing, the gestures, and the camera every time. General-purpose video editors can't re-cast a clip without distorting everything around the character.

P-Video-Replace skips that loop. You send a source video and one to three reference images, and the model returns a new video where the on-camera character has been swapped for the reference. The motion, timing, camera movement, lighting, audio, and background all carry through unchanged.

Source video

After replace: same scene, same gestures, same voice, different character

A sister model, P-Video-Animate, pairs an image and a video too, but the direction is reversed. Animate takes the image as the scene and animates it with the source video's motion, preserving the image's atmosphere. Replace takes the video as the scene and swaps the character from the image, preserving the video's atmosphere. Decide which side of the pair you want to keep before reaching for either model.

This guide covers the request shape, what makes a good reference image, how to send multiple references for tighter identity or multi-character scenes, what happens when the reference style isn't photoreal, the two audio knobs that control voice and lip sync, and the limits where the model starts to improvise.

Request shape

Each call takes a source video and 1 to 3 reference images, plus a small set of optional settings. Delivery is always async: the immediate response is an acknowledgment, and the finished video arrives via polling or webhook.

import { createClient } from '@runware/sdk'

const client = await createClient({ apiKey: process.env.RUNWARE_API_KEY })
await client.connect()

const [result] = await client.run({
  model: 'prunaai:p-video@replace',
  deliveryMethod: 'async',
  inputs: {
    video: 'https://example.com/source-podcast.mp4',
    referenceImages: [
      'https://example.com/ref-jordan.jpg'
    ]
  },
  resolution: '720p'
})

import asyncio
import os

from runware import Runware


async def main():
    async with Runware(api_key=os.environ["RUNWARE_API_KEY"]) as client:
        results = await client.run({
            "model": "prunaai:p-video@replace",
            "deliveryMethod": "async",
            "inputs": {
                "video": "https://example.com/source-podcast.mp4",
                "referenceImages": [
                    "https://example.com/ref-jordan.jpg"
                ]
            },
            "resolution": "720p"
        })


asyncio.run(main())

curl https://api.runware.ai/v1 \
  -H "Authorization: Bearer $RUNWARE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '[
    {
      "taskType": "videoInference",
      "taskUUID": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "model": "prunaai:p-video@replace",
      "deliveryMethod": "async",
      "inputs": {
        "video": "https://example.com/source-podcast.mp4",
        "referenceImages": [
          "https://example.com/ref-jordan.jpg"
        ]
      },
      "resolution": "720p"
    }
  ]'

runware run prunaai:p-video@replace \
  deliveryMethod=async \
  inputs.video=https://example.com/source-podcast.mp4 \
  inputs.referenceImages.0=https://example.com/ref-jordan.jpg \
  resolution=720p

{
  "taskType": "videoInference",
  "taskUUID": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "model": "prunaai:p-video@replace",
  "deliveryMethod": "async",
  "inputs": {
    "video": "https://example.com/source-podcast.mp4",
    "referenceImages": [
      "https://example.com/ref-jordan.jpg"
    ]
  },
  "resolution": "720p"
}

Response

[
  {
    "taskType": "videoInference",
    "taskUUID": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "videoUUID": "f1e2d3c4-b5a6-7890-1234-567890abcdef",
    "videoURL": "https://vm.runware.ai/video/os/a14d18/ws/2/vi/f1e2d3c4-b5a6-7890-1234-567890abcdef.mp4",
    "seed": 837412938
  }
]
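Because delivery is async, the caller has to wait for an object like the one above to become available. The loop below is a minimal polling sketch, not the SDK's own mechanism. `fetch_status` is a hypothetical callable that you supply: it looks up the task by its taskUUID and returns the result object once it is ready, or None while the job is still running. The interval and timeout values are illustrative assumptions.

```python
import time
from typing import Callable, Optional


def wait_for_video(
    fetch_status: Callable[[str], Optional[dict]],
    task_uuid: str,
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until a result carrying a videoURL appears, or give up at the timeout.

    fetch_status is hypothetical and caller-supplied: it returns the result
    object for the task once the video is ready, and None before that.
    """
    waited = 0.0
    while waited <= timeout_s:
        result = fetch_status(task_uuid)
        if result is not None and result.get("videoURL"):
            return result
        # Not ready yet: pause before asking again.
        sleep(interval_s)
        waited += interval_s
    raise TimeoutError(f"task {task_uuid} not finished after {timeout_s} seconds")
```

Passing `sleep` as a parameter keeps the loop testable without real delays. A webhook avoids the loop entirely when your service can receive callbacks.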

A few quirks worth knowing about the request shape:

fps accepts 24 or 48 only. Omit it to inherit the source video's frame rate, which is what most workflows want. Any other value is rejected.
positivePrompt is optional. For straightforward character swaps the model works without one. Prompts earn their place in ambiguous cases like multi-character scenes (covered below).
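The constraints described so far (a source video, 1 to 3 reference images, fps of 24 or 48 or omitted) can be checked on the client before a request is sent. The helper below is a sketch under assumptions: `check_replace_request` is a hypothetical name, and placing fps at the top level of the task next to resolution is a guess, not something confirmed by the schema shown here.

```python
ALLOWED_FPS = {24, 48}
MAX_REFERENCE_IMAGES = 3


def check_replace_request(task: dict) -> list:
    """Return a list of problems found in a replace task; empty means it looks valid."""
    problems = []
    inputs = task.get("inputs", {})

    if not inputs.get("video"):
        problems.append("inputs.video is required")

    refs = inputs.get("referenceImages") or []
    if not 1 <= len(refs) <= MAX_REFERENCE_IMAGES:
        problems.append("inputs.referenceImages must hold 1 to 3 images")

    # Assumption: fps sits at the top level of the task, next to resolution.
    fps = task.get("fps")
    if fps is not None and fps not in ALLOWED_FPS:
        problems.append("fps must be 24 or 48, or omitted to inherit the source rate")

    return problems
```

Returning a list of problems instead of raising on the first one lets a batch job report everything wrong with a request in one pass.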

The reference image

On-camera identity is the only thing the reference image controls. The model lifts the face, hair, build, and clothing from it and renders that character into the source video's scene. Everything else (motion, lighting, background, audio) comes from the source.

A photoreal chest-up studio portrait of a man in his early thirties with sandy blond hair, hazel-green eyes, and a navy henley shirt, facing the camera directly against a plain pale grey background — The hero's reference image: chest-up, plain background, face clearly visible

What matters about a reference image:

The face is clearly visible. Side profiles, masks, sunglasses, or heavily occluded faces leave the model with less identity information to work with. A clear front-facing or three-quarter view is the safest bet.
Framing is flexible. Chest-up portraits are ideal, but the model handles candid full-body shots and varied lighting without dropping quality. You don't need a studio portrait to get a working swap.
Style carries through. A photoreal reference produces a photoreal output, a 3D rendered reference produces a 3D rendered output, an illustration produces an illustration. The reference's visual language is what the output inherits.

There's also an audio detail worth knowing up front: the source video's audio is preserved verbatim in the output by default. If the source has a male speaker and you send a reference image of a woman, the output shows a woman's face speaking with a man's voice. Match the reference's gender to the voice in the source audio when voice/face alignment matters to your audience, or change the audio defaults (covered in Two audio knobs below).

Multiple references for the same character

The referenceImages array accepts up to 3 images, and the first pattern that buys you something is sending multiple angles of the same character. A front portrait plus a three-quarter left and a three-quarter right give the model more identity information to draw on, which tightens face fidelity when the source character turns through the frame.

Front-facing chest-up portrait of a woman in her late twenties with shoulder-length copper-chestnut hair and freckles, in a cream cable-knit sweater — Front

The same woman shown from a three-quarter left angle, same hair, same freckles, same sweater, eyes still meeting the camera — 3/4 left

The same woman from a three-quarter right angle, same hair, same freckles, same sweater, eyes still meeting the camera — 3/4 right

Sent through the model against the same source video, the trio holds identity more consistently than the single front portrait alone:

Source video

One front reference

Three angle references

The single-reference version still produces a perfectly usable result, but the face drifts slightly when the source's head turns away from camera and the model has to extrapolate. The three-reference version stays anchored through the same turns. The dance motion, the framing, and the lighting all carry through identically in both outputs.

Generating the side angles cleanly is the workflow's main friction. The reliable trick: generate the front portrait first, then use that portrait as a reference image when generating the side views, with prompts that explicitly call out the same identity ("the same exact woman from the reference image, identical face, identical hair, identical clothing"). Most modern image models support this image-to-image pattern. P-Image-Edit fits the job directly: edit the front portrait in place to produce side variants while keeping the identity intact, then pass the trio into the replace request.
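Once the trio exists, the request differs from the single-reference call only in the referenceImages array. The sketch below assembles the raw task object in the same shape as the curl example. `build_replace_task` is a hypothetical helper name, and the URLs are placeholders.

```python
import uuid


def build_replace_task(video_url: str, reference_urls: list, resolution: str = "720p") -> dict:
    """Assemble a replace task in the shape used by the raw API example."""
    if not 1 <= len(reference_urls) <= 3:
        raise ValueError("referenceImages accepts 1 to 3 images")
    return {
        "taskType": "videoInference",
        "taskUUID": str(uuid.uuid4()),  # fresh identifier for each request
        "model": "prunaai:p-video@replace",
        "deliveryMethod": "async",
        "inputs": {
            "video": video_url,
            "referenceImages": list(reference_urls),
        },
        "resolution": resolution,
    }


# Placeholder URLs: front portrait first, then the two three-quarter views.
task = build_replace_task(
    "https://example.com/source-dance.mp4",
    [
        "https://example.com/ref-front.jpg",
        "https://example.com/ref-three-quarter-left.jpg",
        "https://example.com/ref-three-quarter-right.jpg",
    ],
)
```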

Multiple references for multiple characters

The second pattern is sending one reference per on-camera character. When the source video has two people on screen, you can pass two distinct reference images and the model swaps each one.

Without a prompt, the model auto-assigns references to the on-camera characters based on visual cues like position, gender, and apparent age. When you need guaranteed placement, send a position-mapping prompt that names each reference by its index in the array and the corresponding position in the source frame.

Front-facing chest-up portrait of Riley, a woman in her early thirties with long jet-black hair in a low ponytail, wearing a burnt-orange knit turtleneck against a pale grey background — Reference image 1: Riley

Front-facing chest-up portrait of Sam, a man in his early thirties with short tightly curled black hair, a short beard, wearing a forest-green corduroy button-down against a pale grey background — Reference image 2: Sam

Source video: woman on the left, man on the right, holding a casual back-and-forth

No prompt: the model auto-assigned the two references to the two on-camera characters

With a position-mapping prompt, useful when you want explicit control over which reference lands where

Replace the woman on the left in the source video with the woman from reference image 1. Replace the man on the right in the source video with the man from reference image 2. Preserve the source video motion, audio, camera, and background.

Auto-assignment works without a prompt for most setups: the model uses visual cues to figure out which reference goes where, and the result usually lands correctly the first time. Reach for the position-mapping prompt when you want explicit control over which reference goes where, like when the references look similar to each other, when you're producing a batch of outputs that need consistent placement, or when a particular auto-assignment came back the wrong way around and you need to override it.

The prompt is doing two specific things: naming each reference by its order in referenceImages ("reference image 1", "reference image 2") and tying each one to a position in the source frame ("on the left", "on the right"). That's the documented multi-character pattern, and it works even when the references are visually similar.
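Since the pattern is mechanical, the prompt can be generated from the same ordered list that fills referenceImages, which keeps the indices and the prompt from drifting apart. This is a sketch only: `position_mapping_prompt` is a hypothetical helper, and its wording copies the example prompt above.

```python
def position_mapping_prompt(assignments: list) -> str:
    """Build a position-mapping prompt.

    assignments[i] is a (subject, position) pair describing the on-camera
    character that referenceImages[i] should replace, for example
    ("woman", "on the left").
    """
    sentences = []
    # Reference images are numbered from 1 in the prompt, in array order.
    for index, (subject, position) in enumerate(assignments, start=1):
        sentences.append(
            f"Replace the {subject} {position} in the source video "
            f"with the {subject} from reference image {index}."
        )
    sentences.append("Preserve the source video motion, audio, camera, and background.")
    return " ".join(sentences)


prompt = position_mapping_prompt([("woman", "on the left"), ("man", "on the right")])
```

For the Riley and Sam example, this reproduces the prompt shown above word for word. The result would be sent as positivePrompt.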

Reference styles

Visual style is inherited from the reference, end to end. Photoreal in, photoreal out. 3D animation in, 3D animation out. The model adapts the source character's face and form to whatever style the reference uses, while keeping the source's motion, scene, and camera in place.

The four outputs below all use the same source video. Only the reference style changes.

Source video

3D animated humanoid reference (Pixar-style)

3D rendered robot mascot reference

2D anime illustration reference

Claymation stop-motion reference

The motion, the framing, the lighting, and the audio are identical across all four. Only the character's visual language changed, controlled entirely by the reference image's style. The same workflow drops a Pixar character, a robot mascot, an anime illustration, or a claymation character into the same source scene with no other input change.

Lip sync quality follows the reference's mouth geometry. Photoreal, 3D rendered, and claymation references give the model detailed mouth structure to drive lip motion, and the output reads as correctly synced to the audio. Highly stylized 2D references with simplified mouth shapes (anime, flat illustration) don't give the model enough mouth detail to drive, and the result may show lip motion that's visibly out of sync. If your reference style has minimal mouth detail and the output will be heard out loud, flip sourceAudioSync to false and re-sync the lip motion in post.

This makes Replace useful for scenarios where you need character variations of the same recording: A/B testing UGC ad variants, localising one video across markets with region-appropriate mascots, swapping creator avatars across a series, or producing personalised outputs from a single shoot.

Two audio knobs

The output's audio is governed by two settings, both of which default to true. They look similar at a glance, but they do different things:

preserveAudio controls whether the source audio track makes it into the output file at all. Default true: the output plays the source's audio. Set to false: the output is silent but lip motion still plays.
sourceAudioSync controls whether the source audio drives the character's lip motion. Default true: the character lip-syncs to the audio. Set to false: audio still plays, but the character's face is no longer driven by it.

The two are independent, which gives you four possible modes. The three useful ones look like this:

Default: both knobs on. Audio plays, lip motion driven by audio.

preserveAudio: false. Output is silent. Lip motion still plays.

sourceAudioSync: false. Audio plays, but the character's lip motion isn't driven by it.

The default (both on) is the right call for direct character swaps where the new character should appear to speak the source's lines. Flip preserveAudio to false when you intend to re-dub or re-score the output downstream and don't want the source audio bleeding through. Flip sourceAudioSync to false when the source audio is incidental (room tone, music, ambient noise) and shouldn't pull the character's face around to "match" it.
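The three useful modes can be written down as named presets so that calling code states its intent instead of juggling two booleans. This is a sketch: the preset names and the `audio_settings` helper are hypothetical, and where the two fields sit in the request payload is not shown in this guide, so merge them at whatever level the schema expects.

```python
def audio_settings(keep_source_audio: bool = True, lips_follow_audio: bool = True) -> dict:
    """Map intent onto the two audio fields; both default to true, as the model does."""
    return {
        "preserveAudio": keep_source_audio,    # is the source track in the output file?
        "sourceAudioSync": lips_follow_audio,  # does the source audio drive lip motion?
    }


AUDIO_MODES = {
    # New character appears to speak the source's lines.
    "direct_swap": audio_settings(),
    # Silent output for re-dubbing or re-scoring downstream.
    "redub_later": audio_settings(keep_source_audio=False),
    # Source audio is room tone or music and should not move the face.
    "incidental_audio": audio_settings(lips_follow_audio=False),
}
```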

Limits

Replace stops being predictable in two situations.

Extreme camera motion with no face anchor in the source. Whip pans, handheld chase shots, and footage where the on-camera character is mostly seen from behind give the model very little to track. The result is usually a scene rewrite: the model improvises a new action that fits the reference, often producing a clip that's plausible on its own but unrelated to the source.

Source video: handheld chase shot through an alley, the runner is seen from behind, no face visible to the camera

Output: the model improvised. Aiden runs toward the camera instead, in a different scene that fits the reference.

The fix is to use a source video with at least a brief moment of the character's face visible to the camera. If the entire source has no face anchor, expect improvisation, not preservation.

Multi-character ambiguity. When the source has multiple on-camera characters and the references look similar to each other (similar hair, age, gender, clothing), auto-assignment becomes unreliable. The fix is the position-mapping prompt described in Multiple references for multiple characters. Always reach for an explicit prompt when the references aren't visually distinct.

Tips

Match the reference's gender to the source's voice. The source audio is preserved by default, so a male reference into a female-voiced source produces a male face with a female voice. Gender-match when voice and face alignment matters, or set preserveAudio to false to strip the source audio entirely.
Send a clear chest-up reference when you can. The model tolerates varied framing, but chest-up portraits with the face clearly visible give the most consistent identity retention with the least friction.
Use three angles when identity has to hold through head turns. Front plus 3/4 left plus 3/4 right is the canonical trio. One front reference is usually enough for static or near-static shots.
Use a position-mapping prompt for multi-character sources. Auto-assignment works when the references are visually distinct, but a prompt that names each reference by its index and ties it to a position in the frame ("replace the woman on the left with the woman from reference image 1") removes the ambiguity.
Pick the resolution to match the destination. Output inherits the source's aspect ratio, so the only call is 720p versus 1080p. 1080p doubles the per-second cost. Reach for it when the output is going to a high-resolution destination.
Omit fps to inherit the source video's frame rate. The schema only accepts 24 or 48, and most workflows want the source's native rate carried through. Setting fps explicitly is only useful when re-targeting the output to a specific delivery format.
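The resolution tip comes down to simple arithmetic. The sketch below compares clips in relative cost units and is not a price calculator: it assumes cost scales linearly with clip duration, takes one second at 720p as 1 unit, and applies the doubling for 1080p stated above. `relative_cost` is a hypothetical helper, and no actual prices are implied.

```python
# Relative per-second cost, with 720p as the baseline unit.
COST_MULTIPLIER = {"720p": 1.0, "1080p": 2.0}


def relative_cost(duration_s: float, resolution: str) -> float:
    """Return the cost of a clip in units of one second at 720p."""
    if resolution not in COST_MULTIPLIER:
        raise ValueError("resolution must be '720p' or '1080p'")
    # Assumption: cost grows linearly with the clip's duration.
    return duration_s * COST_MULTIPLIER[resolution]
```

Under these assumptions, a 10-second clip at 1080p costs the same as a 20-second clip at 720p, which is a useful check when budgeting a batch of variants.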