
PixVerse Modify
Mask-aware video editing for swaps, removals, restyling, and prompt-driven scene changes
PixVerse Modify Overview
PixVerse Modify is a video-to-video editing model for changing existing footage with text instructions, optional reference images, and masks. It supports subject swapping, object addition and removal, free-form scene edits such as weather or lighting changes, in-video text replacement, and full-video style transfer while preserving the source clip structure.
Commercial use
How to Use PixVerse Modify
Overview
PixVerse Modify is a video editing model for changing existing footage with text instructions, optional masks, and optional reference images.
It is best suited to workflows where the source video already exists and the goal is to replace subjects, add or remove elements, change scene attributes, replace embedded text, or restyle the clip while preserving the original timing and overall structure.
Strengths
Mask-Aware Subject Swaps
PixVerse Modify supports targeted subject replacement using masks and reference images. This makes it useful for swapping one or several people or objects in a clip while keeping the underlying motion and scene timing intact.
Add and Remove Edits
The model can insert new elements into a video or remove existing elements and fill the background after removal. This is useful for product placements, cleanup, prop changes, and selective scene edits.
Free-Form Scene Modification
PixVerse Modify supports broader prompt-driven edits that change scene attributes such as lighting, weather, season, and other visual conditions. It works well when the goal is to transform the mood or setting without rebuilding the clip from scratch.
In-Video Text Replacement
The model can detect and replace embedded text inside the video. This makes it useful for versioning marketing clips, changing signage, or localizing short-form content.
Full-Clip Restyling
PixVerse Modify also supports style transfer across the whole video, restyling the clip into looks such as comic, ink painting, 2D, or 3D-inspired styles.
Capabilities
Video-to-Video Editing
PixVerse Modify accepts an existing video as the source and generates a modified version of that clip. You can provide a previously generated PixVerse video ID or upload an external video file.
Reference-Guided Editing
Reference images can be used for swap and add workflows, letting the model insert or replace subjects based on uploaded visual targets.
Mask-Based Control
The model supports mask-driven editing for more targeted modifications. Masks can come from PixVerse's own mask workflow or from a user-provided mask image.
Keyframe-Aware Editing
Editing can be anchored to a specific keyframe so that masks and edit targets line up with the frame selected for modification.
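As a rough illustration of how the inputs described above fit together, the sketch below assembles a request body for one edit. Only the AIR ID `pixverse:modify@1` comes from this page. The helper `build_modify_request` and every field name (`source`, `videoId`, `videoUrl`, `maskImage`, `keyframe`, `referenceImages`) are hypothetical placeholders rather than a documented schema, and treating the keyframe as a zero-based integer index is an assumption.

```python
from __future__ import annotations

from typing import Any, Optional, Sequence

MODEL_ID = "pixverse:modify@1"  # AIR ID as listed on this page


def build_modify_request(
    prompt: str,
    *,
    video_id: Optional[str] = None,
    video_url: Optional[str] = None,
    mask_image: Optional[str] = None,
    keyframe: Optional[int] = None,
    reference_images: Sequence[str] = (),
) -> dict[str, Any]:
    """Assemble a HYPOTHETICAL request body; field names are illustrative only."""
    # The source is either a previously generated PixVerse video ID
    # or an uploaded external video, never both.
    if (video_id is None) == (video_url is None):
        raise ValueError("provide exactly one of video_id or video_url")
    if not prompt.strip():
        raise ValueError("prompt must not be empty")

    request: dict[str, Any] = {"model": MODEL_ID, "prompt": prompt}
    if video_id is not None:
        request["source"] = {"videoId": video_id}
    else:
        request["source"] = {"videoUrl": video_url}

    # Masks, keyframe selection, and reference images are all optional.
    if mask_image is not None:
        request["maskImage"] = mask_image
    if keyframe is not None:
        if keyframe < 0:
            raise ValueError("keyframe must be non-negative")
        request["keyframe"] = keyframe  # assumed: zero-based frame index
    if reference_images:
        request["referenceImages"] = list(reference_images)
    return request
```

A swap edit would typically pass a mask, a keyframe, and one reference image together, while a full-clip restyle would pass only the source video and the prompt.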
Input and Output
- AIR ID: pixverse:modify@1
- Input: source video and text prompt, plus optional masks, keyframe selection, and reference images
- Output: edited video clip
- Input limits: up to 1920p, 100 MB, and 30 seconds for uploaded source videos
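The upload limits above can be checked locally before sending a file. The following is a minimal sketch, not an official validator: it assumes that "1920p" caps the longer edge at 1920 pixels and that "100 MB" means 100 MiB, neither of which this page specifies. `VideoInfo` and `check_upload_limits` are hypothetical names, and the metadata is expected to come from whatever probing tool you already use.

```python
from __future__ import annotations

from dataclasses import dataclass

MAX_EDGE_PX = 1920               # assumption: "1920p" bounds the longer edge
MAX_BYTES = 100 * 1024 * 1024    # assumption: "100 MB" read as 100 MiB
MAX_SECONDS = 30.0               # 30-second limit from the list above


@dataclass
class VideoInfo:
    """Basic metadata for an uploaded source video (hypothetical container)."""
    width: int
    height: int
    size_bytes: int
    duration_s: float


def check_upload_limits(info: VideoInfo) -> list[str]:
    """Return human-readable limit violations; an empty list means none found."""
    problems: list[str] = []
    if max(info.width, info.height) > MAX_EDGE_PX:
        problems.append(
            f"resolution {info.width}x{info.height} exceeds {MAX_EDGE_PX}px on the longer edge"
        )
    if info.size_bytes > MAX_BYTES:
        problems.append(f"file size {info.size_bytes} bytes exceeds {MAX_BYTES} bytes")
    if info.duration_s > MAX_SECONDS:
        problems.append(f"duration {info.duration_s:.1f}s exceeds {MAX_SECONDS:.0f}s")
    return problems
```

Returning a list of violations rather than raising on the first one lets a caller report every problem with a clip in a single pass.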
Best Fit
- Subject replacement and character swaps
- Object cleanup and selective removals
- Adding props or branded elements into existing footage
- Restyling an entire clip into a new visual look
- Changing text, weather, lighting, or scene mood in an existing video
More models from PixVerse
PixVerse V6 is a video generation model focused on multi-shot storytelling with native synchronized audio. It provides over 20 cinematic camera controls including focal length, aperture, depth of field, lens distortion, and vignetting. It features improved character consistency across shots using multi-image references, supports 1080p output at up to 15 seconds, and includes multilingual text rendering in frames.
PixVerse V5.6 is an upgraded video generation model that improves visual stability, motion clarity, and audio-visual alignment over previous versions. It supports text-to-video and image-to-video generation with optional native audio, delivering more accurate multi-character lip-sync, cleaner motion in complex scenes, and more natural speech and environmental sound for single-shot cinematic outputs.
PixVerse V5.5 is a director-focused video model for story-driven clips. It supports multi-image fusion for character continuity, multi-shot sequences, and native audio. It delivers smooth motion, refined cinematic control, and precise text-guided video generation for complex scenes.
PixVerse V5 Fast is an optimized variant of PixVerse V5 designed for faster video generation and lower latency. It supports text-to-video and image-to-video workflows while prioritizing speed and responsiveness, making it suitable for rapid iteration and preview-focused pipelines where audio, templates, and advanced controls are not required.
PixVerse V5 generates high-fidelity video from text prompts or single images. It delivers smooth motion and sharp cinematic frames with strong prompt alignment. Ideal for creators who need fast iteration, keyframe control, and consistent style across shots.
PixVerse LipSync generates accurate mouth motion from audio for characters and videos. It aligns lip movement with speech timing while preserving facial expression context. Ideal for dubbing, character animation, and content localization workflows.
PixVerse V4.5 generates stylized cinematic video from text prompts or reference images. It adds refined camera motion control, multi-image fusion, and faster modes for iteration. Ideal for creators who need dynamic shots, complex motion, and consistent stylized outputs.
PixVerse V3.5 provides basic text-to-video generation with support for visual effects and limited subject motion. It targets short clips for experiments or prototypes. Camera movement is not available, which simplifies control and integration in pipelines.
PixVerse V4 is a generative video model for text prompts or source images. It improves motion quality and complex camera movement. It adds motion modes, sound effect sync, and style transfer. Ideal for short cinematic clips and rapid creative iteration in production pipelines.








