P-Video-Replace
P-Video-Replace is a video transformation model that swaps the on-camera character in an existing video with the character from a reference image. It is built to preserve the original motion, timing, camera behavior, lighting, and background while changing who appears in the clip, making it useful for UGC ad variations, content localization, avatar or mascot insertion, and other scalable character-replacement workflows.
Replacing the character in a video
How to use Pruna P-Video-Replace to swap the on-camera character in an existing video with one from a reference image while preserving the original motion, timing, camera, lighting, and audio.
Introduction
Re-using the same video clip with different on-camera characters is awkward in most pipelines. Re-shooting or re-prompting the scene for each character changes the motion, the timing, the gestures, and the camera every time. General-purpose video editors can't re-cast a clip without distorting everything around the character.
P-Video-Replace skips that loop. You send a source video and one to three reference images, and the model returns a new video where the on-camera character has been swapped for the reference. The motion, timing, camera movement, lighting, audio, and background all carry through unchanged.
A sister model, P-Video-Animate, pairs an image and a video too, but the direction is reversed. Animate takes the image as the scene and animates it with the source video's motion, preserving the image's atmosphere. Replace takes the video as the scene and swaps the character from the image, preserving the video's atmosphere. Decide which side of the pair you want to keep before reaching for either model.
This guide covers the request shape, what makes a good reference image, how to send multiple references for tighter identity or multi-character scenes, what happens when the reference style isn't photoreal, the two audio knobs that control voice and lip sync, and the limits where the model starts to improvise.
Request shape
Each call takes a source video and 1 to 3 reference images, plus a small set of optional settings. Delivery is always async: the immediate response is an acknowledgment, and the finished video arrives via polling or webhook.
Request:
[
  {
    "taskType": "videoInference",
    "taskUUID": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "model": "prunaai:p-video@replace",
    "deliveryMethod": "async",
    "inputs": {
      "video": "https://example.com/source-podcast.mp4",
      "referenceImages": ["https://example.com/ref-jordan.jpg"]
    },
    "resolution": "720p"
  }
]
Response:
[
  {
    "taskType": "videoInference",
    "taskUUID": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "videoUUID": "f1e2d3c4-b5a6-7890-1234-567890abcdef",
    "videoURL": "https://vm.runware.ai/video/os/a14d18/ws/2/vi/f1e2d3c4-b5a6-7890-1234-567890abcdef.mp4",
    "seed": 837412938
  }
]
A few quirks worth knowing about the request shape:
- fps accepts 24 or 48 only. Omit it to inherit the source video's frame rate, which is what most workflows want. Any other value is rejected.
- positivePrompt is optional. For straightforward character swaps the model works without one. Prompts earn their place in ambiguous cases like multi-character scenes (covered below).
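As a minimal sketch of the request shape above, the helper below assembles one task as a plain dictionary and enforces the two constraints the guide states: 1 to 3 reference images, and fps of 24 or 48 when it is set at all. The function name build_replace_task is hypothetical, and placing fps and positivePrompt at the top level of the task is an assumption, since the sample request does not show either field.

```python
import json
import uuid

ALLOWED_FPS = {24, 48}  # the only explicit values the guide says are accepted


def build_replace_task(video_url, reference_images, resolution="720p",
                       fps=None, positive_prompt=None):
    """Build one videoInference task for P-Video-Replace as a plain dict."""
    if not 1 <= len(reference_images) <= 3:
        raise ValueError("referenceImages must contain 1 to 3 images")
    if fps is not None and fps not in ALLOWED_FPS:
        raise ValueError("fps must be 24 or 48, or omitted to inherit the source rate")

    task = {
        "taskType": "videoInference",
        "taskUUID": str(uuid.uuid4()),  # fresh identifier per task
        "model": "prunaai:p-video@replace",
        "deliveryMethod": "async",
        "inputs": {
            "video": video_url,
            "referenceImages": list(reference_images),
        },
        "resolution": resolution,
    }
    # Optional fields are left out entirely when unset.
    # ASSUMPTION: both sit at the top level of the task object.
    if fps is not None:
        task["fps"] = fps
    if positive_prompt:
        task["positivePrompt"] = positive_prompt
    return task


# The API expects an array of tasks, so the single task is wrapped in a list.
payload = [build_replace_task(
    "https://example.com/source-podcast.mp4",
    ["https://example.com/ref-jordan.jpg"],
)]
print(json.dumps(payload, indent=2))
```

Leaving fps out of the dictionary, rather than sending a null, mirrors the advice to omit it so the source frame rate is inherited.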
The reference image
On-camera identity is the only thing the reference image controls. The model lifts the face, hair, build, and clothing from it and renders that character into the source video's scene. Everything else (motion, lighting, background, audio) comes from the source.
A photoreal editorial portrait of a MAN in his early thirties named Jordan. Sandy blond hair styled in a short side-parted cut, warm hazel-green eyes, a small mole near his right cheekbone, light five-o-clock shadow, a friendly engaged expression with a slight closed-mouth smile. He is wearing a soft navy henley shirt with the top buttons undone. Plain pale grey studio background, soft even three-point lighting. He faces the camera directly, chin slightly tucked. Centered head-and-shoulders framing, photorealistic studio quality.
What matters about a reference image:
- The face is clearly visible. Side profiles, masks, sunglasses, or heavily occluded faces leave the model with less identity information to work with. A clear front-facing or three-quarter view is the safest bet.
- Framing is flexible. Chest-up portraits are ideal, but the model handles candid full-body shots and varied lighting without dropping quality. You don't need a studio portrait to get a working swap.
- Style carries through. A photoreal reference produces a photoreal output, a 3D rendered reference produces a 3D rendered output, an illustration produces an illustration. The reference's visual language is what the output inherits.
There's also an audio detail worth knowing up front: the source video's audio is preserved verbatim in the output by default. If the source has a male speaker and you send a female reference, the output has the female's face with the male's voice. Match the reference's gender to the source's audio voice when voice/face alignment matters to your audience, or flip the audio defaults (covered in Two audio knobs below).
Multiple references for the same character
The referenceImages array accepts up to 3 images, and the first pattern that buys you something is sending multiple angles of the same character. A front portrait plus a three-quarter left and a three-quarter right give the model more identity information to draw on, which tightens face fidelity when the source character turns through the frame.
A photoreal editorial portrait of a WOMAN in her late twenties named Maya. Shoulder-length copper-chestnut hair with a slight wave, warm hazel eyes, a small constellation of freckles across the bridge of her nose, full natural eyebrows, a friendly expression with a slight closed-mouth smile. She is wearing a soft cream cable-knit sweater with a wide ribbed collar. Plain pale grey studio background, soft even three-point lighting. She faces the camera directly, chin slightly tucked. Centered head-and-shoulders framing, photorealistic studio quality.
The same exact woman from the reference image, identical face, identical copper-chestnut hair, identical freckles, identical cream cable-knit sweater. She is turned about 30 degrees to her left so the camera sees her face from a three-quarter left angle. Her eyes still meet the camera. Plain pale grey studio background, soft even three-point lighting. Centered head-and-shoulders framing, photorealistic studio quality.
The same exact woman from the reference image, identical face, identical copper-chestnut hair, identical freckles, identical cream cable-knit sweater. She is turned about 30 degrees to her right so the camera sees her face from a three-quarter right angle. Her eyes still meet the camera. Plain pale grey studio background, soft even three-point lighting. Centered head-and-shoulders framing, photorealistic studio quality.
Sent through the model against the same source video, the trio holds identity more consistently than the single front portrait alone:
The single-reference version still produces a perfectly usable result, but the face drifts slightly when the source's head turns away from camera and the model has to extrapolate. The three-reference version stays anchored through the same turns. The dance motion, the framing, and the lighting all carry through identically in both outputs.
Generating the side angles cleanly is the workflow's main friction. The reliable trick: generate the front portrait first, then use that portrait as a reference image when generating the side views, with prompts that explicitly call out the same identity ("the same exact woman from the reference image, identical face, identical hair, identical clothing"). Most modern image models support this image-to-image pattern. P-Image-Edit fits the job directly: edit the front portrait in place to produce side variants while keeping the identity intact, then pass the trio into the replace request.
Multiple references for multiple characters
The second pattern is sending one reference per on-camera character. When the source video has two people on screen, you can pass two distinct reference images and the model swaps each one.
Without a prompt, the model auto-assigns references to the on-camera characters based on visual cues like position, gender, and apparent age. When you need guaranteed placement, send a position-mapping prompt that names each reference by its index in the array and the corresponding position in the source frame.
A photoreal editorial portrait of a WOMAN in her early thirties named Riley. Long jet-black hair pulled back into a sleek low ponytail, deep brown almond-shaped eyes, fine arched eyebrows, a small silver hoop in her left earlobe, a composed confident expression with a faint closed-mouth smile. She is wearing a warm burnt-orange knit turtleneck. Plain pale grey studio background, soft even three-point lighting. She faces the camera directly, chin slightly tucked. Centered head-and-shoulders framing, photorealistic studio quality.
A photoreal editorial portrait of a MAN in his early thirties named Sam. Short tightly curled black hair faded close on the sides, warm brown eyes, a neat well-groomed short beard, a small scar near his left eyebrow, a friendly thoughtful expression with a slight closed-mouth smile. He is wearing a forest-green corduroy button-down shirt with the collar open. Plain pale grey studio background, soft even three-point lighting. He faces the camera directly, chin slightly tucked. Centered head-and-shoulders framing, photorealistic studio quality.
Replace the woman on the left in the source video with the woman from reference image 1. Replace the man on the right in the source video with the man from reference image 2. Preserve the source video motion, audio, camera, and background.
Auto-assignment works without a prompt for most setups: the model uses visual cues to figure out which reference goes where, and the result usually lands correctly the first time. Reach for the position-mapping prompt when you want explicit control over which reference goes where, like when the references look similar to each other, when you're producing a batch of outputs that need consistent placement, or when a particular auto-assignment came back the wrong way around and you need to override it.
The prompt is doing two specific things: naming each reference by its order in referenceImages ("reference image 1", "reference image 2") and tying each one to a position in the source frame ("on the left", "on the right"). That's the documented multi-character pattern, and it works even when the references are visually similar.
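The two jobs of the prompt can be sketched as a small builder that numbers each reference by its order in referenceImages and pairs it with a position in the source frame. The function name position_mapping_prompt and its argument format are hypothetical; the sentence template simply follows the example prompt in this section.

```python
def position_mapping_prompt(assignments):
    """Build a position-mapping prompt.

    assignments: list of (source_description, reference_description) pairs,
    given in the same order as the referenceImages array.
    """
    if not 1 <= len(assignments) <= 3:
        raise ValueError("expected 1 to 3 assignments, one per reference image")
    sentences = [
        f"Replace {source} in the source video with {reference} "
        f"from reference image {index}."
        for index, (source, reference) in enumerate(assignments, start=1)
    ]
    sentences.append(
        "Preserve the source video motion, audio, camera, and background."
    )
    return " ".join(sentences)


prompt = position_mapping_prompt([
    ("the woman on the left", "the woman"),
    ("the man on the right", "the man"),
])
print(prompt)
```

Because the index comes from list order, reordering referenceImages without reordering the assignments would silently swap the characters, so it is worth building both from the same list.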
Reference styles
Visual style is inherited from the reference, end to end. Photoreal in, photoreal out. 3D animation in, 3D animation out. The model adapts the source character's face and form to whatever style the reference uses, while keeping the source's motion, scene, and camera in place.
The four outputs below all use the same source video. Only the reference style changes.
The motion, the framing, the lighting, and the audio are identical across all four. Only the character's visual language changed, controlled entirely by the reference image's style. The same workflow drops a Pixar character, a robot mascot, an anime illustration, or a claymation character into the same source scene with no other input change.
Lip sync quality follows the reference's mouth geometry. Photoreal, 3D rendered, and claymation references give the model detailed mouth structure to drive lip motion, and the output reads as correctly synced to the audio. Highly stylized 2D references with simplified mouth shapes (anime, flat illustration) don't give the model enough mouth detail to drive, and the result may show lip motion that's visibly out of sync. If your reference style has minimal mouth detail and the output will be heard out loud, flip sourceAudioSync to false and re-sync the lip motion in post.
This makes Replace useful for scenarios where you need character variations of the same recording: A/B testing UGC ad variants, localising one video across markets with region-appropriate mascots, swapping creator avatars across a series, or producing personalised outputs from a single shoot.
Two audio knobs
The output's audio is governed by two settings, both of which default to true. They look similar at a glance but they do different things:
- preserveAudio controls whether the source audio track makes it into the output file at all. Default true: the output plays the source's audio. Set to false: the output is silent but lip motion still plays.
- sourceAudioSync controls whether the source audio drives the character's lip and motion sync. Default true: the character lip-syncs to the audio. Set to false: audio still plays, but the character's face is no longer driven by it.
The two are independent, which gives you four possible modes. The three useful ones look like this:
The default (both on) is the right call for direct character swaps where the new character should appear to speak the source's lines. Flip preserveAudio to false when you intend to re-dub or re-score the output downstream and don't want the source audio bleeding through. Flip sourceAudioSync to false when the source audio is incidental (room tone, music, ambient noise) and shouldn't pull the character's face around to "match" it.
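Since the two flags are independent, the combinations can be laid out explicitly. The sketch below names the three modes described above and prints all four combinations; the preset names (swap, redub, ambient) are hypothetical labels, and where these two keys sit inside the request is not shown in this guide, so the dictionaries only carry the flag values.

```python
from itertools import product

# Hypothetical preset names for the three modes the guide calls useful.
AUDIO_MODES = {
    "swap": {"preserveAudio": True, "sourceAudioSync": True},      # default
    "redub": {"preserveAudio": False, "sourceAudioSync": True},    # re-dub or re-score later
    "ambient": {"preserveAudio": True, "sourceAudioSync": False},  # incidental source audio
}


def describe_audio(preserve_audio, source_audio_sync):
    """Summarise what one flag combination does, per the two bullet definitions."""
    track = "source audio plays" if preserve_audio else "output is silent"
    face = ("lips sync to the source audio" if source_audio_sync
            else "face is not driven by the source audio")
    return f"{track}; {face}"


# Enumerate every combination of the two independent flags.
for preserve_audio, source_audio_sync in product([True, False], repeat=2):
    print(f"preserveAudio={preserve_audio}, sourceAudioSync={source_audio_sync}: "
          f"{describe_audio(preserve_audio, source_audio_sync)}")
```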
Limits
Two situations where Replace stops being predictable.
Extreme camera motion with no face anchor in the source. Whip pans, handheld chase shots, and footage where the on-camera character is mostly seen from behind give the model very little to track. The result is usually a scene rewrite: the model improvises a new action that fits the reference, often producing a clip that's plausible on its own but unrelated to the source.
The fix is to use a source video with at least a brief moment of the character's face visible to the camera. If the entire source has no face anchor, expect improvisation, not preservation.
Multi-character ambiguity. When the source has multiple on-camera characters and the references look similar to each other (similar hair, age, gender, clothing), auto-assignment becomes unreliable. The fix is the position-mapping prompt described in Multiple references for multiple characters. Always reach for an explicit prompt when the references aren't visually distinct.
Tips
- Match the reference's gender to the source's voice. The source audio is preserved by default, so a male reference into a female-voiced source produces a male face with a female voice. Gender-match when voice and face alignment matters, or flip preserveAudio to strip the source audio entirely.
- Send a clear chest-up reference when you can. The model tolerates varied framing, but chest-up portraits with the face clearly visible give the most consistent identity retention with the least friction.
- Use three angles when identity has to hold through head turns. Front plus 3/4 left plus 3/4 right is the canonical trio. One front reference is usually enough for static or near-static shots.
- Use a position-mapping prompt for multi-character sources. Auto-assignment works when the references are visually distinct, but a prompt that ties each reference's index to a position in the frame ("replace the woman on the left with the woman from reference image 1") removes the ambiguity.
- Pick the resolution to match the destination. Output inherits the source's aspect ratio, so the only call is 720p versus 1080p. 1080p doubles the per-second cost. Reach for it when the output is going to a high-resolution destination.
- Omit fps to inherit the source video's frame rate. The schema only accepts 24 or 48, and most workflows want the source's native rate carried through. Setting fps explicitly is only useful when re-targeting the output to a specific delivery format.
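The resolution tip reduces to simple arithmetic: if 1080p doubles the per-second cost, a clip costs twice as much at 1080p as the same clip at 720p. The sketch below expresses cost in relative units (one unit is one second at 720p) because this guide gives no actual prices; the function name relative_cost is hypothetical.

```python
# Relative per-second cost, taken from the statement that 1080p doubles it.
COST_MULTIPLIER = {"720p": 1, "1080p": 2}


def relative_cost(duration_seconds, resolution="720p"):
    """Return cost in units of 'one second of 720p output'."""
    if resolution not in COST_MULTIPLIER:
        raise ValueError("resolution must be '720p' or '1080p'")
    return duration_seconds * COST_MULTIPLIER[resolution]


clip_seconds = 30
for resolution in ("720p", "1080p"):
    print(resolution, relative_cost(clip_seconds, resolution))
```

Multiplying the result by your actual 720p per-second rate gives an absolute estimate for a batch before submitting it.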