PixVerse V5.6
PixVerse V5.6 is an upgraded video generation model that improves visual stability, motion clarity, and audio-visual alignment over previous versions. It supports text-to-video and image-to-video generation with optional native audio, delivering more accurate multi-character lip-sync, cleaner motion in complex scenes, and more natural speech and environmental sound for single-shot cinematic outputs.
API Options
Platform-level options for task execution and delivery.
-
taskType
string required value: videoInference -
Identifier for the type of task being performed
-
taskUUID
string required UUID v4 -
UUID v4 identifier for tracking tasks and matching async responses. Must be unique per task.
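A minimal sketch of producing a compliant `taskUUID` in Python with the standard-library `uuid` module. The helper name `new_task` is hypothetical; only the `taskType` and `taskUUID` fields come from the reference above.

```python
import uuid


def new_task() -> dict:
    """Return the start of a task payload with a fresh UUID v4."""
    return {
        "taskType": "videoInference",
        # uuid4() yields a random UUID v4, so each call gets its own identifier.
        "taskUUID": str(uuid.uuid4()),
    }


task = new_task()
```

Generating the identifier on the client side lets you store it before the request is sent and match the asynchronous response against it later.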
-
outputType
string default: URL -
Video output type.
Allowed values 1 value
-
outputFormat
string default: MP4 -
Specifies the file format of the generated output. The available values depend on the task type and the specific model's capabilities.
- `MP4`: Widely supported video container (H.264), recommended for general use.
- `WEBM`: Optimized for web delivery.
- `MOV`: QuickTime format, common in professional workflows (Apple ecosystem).
Allowed values 3 values
-
outputQuality
integer min: 20 max: 99 default: 95 -
Compression quality of the output. Higher values preserve quality but increase file size.
-
webhookURL
string URI -
Specifies a webhook URL where JSON responses will be sent via HTTP POST when generation tasks complete. For batch requests with multiple results, each completed item triggers a separate webhook call as it becomes available.
Learn more 1 resource
- Webhooks PLATFORM
- Webhooks
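As a rough illustration of the receiving side, the sketch below parses one webhook POST body and groups results by `taskUUID`, since a batch can trigger several calls. It assumes the body is a single flat JSON object with the same fields as the response examples on this page (`taskUUID`, `videoURL`); that shape and the names `record_webhook` and `results_by_task` are assumptions, not confirmed by the reference.

```python
import json
from collections import defaultdict

# Hypothetical in-memory store: one list of video URLs per task.
results_by_task = defaultdict(list)


def record_webhook(body: bytes) -> str:
    """Parse one webhook body and file its video URL under its task UUID."""
    payload = json.loads(body.decode("utf-8"))
    # Assumption: a flat result object. Adapt the lookups if results are wrapped.
    task_uuid = payload["taskUUID"]
    results_by_task[task_uuid].append(payload.get("videoURL"))
    return task_uuid


sample = json.dumps({
    "taskType": "videoInference",
    "taskUUID": "e037301c-ed0e-4c85-bc3e-c6a97e4f1b1d",
    "videoURL": "https://example.com/video.mp4",
}).encode("utf-8")
handled = record_webhook(sample)
```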
-
deliveryMethod
string default: async -
Determines how the API delivers task results.
Allowed values 1 value
- Returns an immediate acknowledgment with the task UUID. Poll for results using getResponse. Required for long-running tasks like video generation.
Learn more 1 resource
- Task Polling PLATFORM
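The polling flow described above might be sketched as follows. The `send` callable is a hypothetical transport (HTTP or WebSocket) that you supply; the `getResponse` request shape and the idea that a finished result carries a `videoURL` field are assumptions based on the examples on this page, so check them against the Task Polling guide.

```python
import time
from typing import Callable, Optional


def poll_for_result(
    send: Callable[[dict], dict],
    task_uuid: str,
    interval_s: float = 2.0,
    max_attempts: int = 60,
) -> Optional[dict]:
    """Send getResponse requests until a video URL appears or attempts run out."""
    request = {"taskType": "getResponse", "taskUUID": task_uuid}
    for _ in range(max_attempts):
        response = send(request)
        if response.get("videoURL"):
            return response  # the task has finished
        time.sleep(interval_s)  # wait before asking again
    return None  # caller decides how to handle a timeout
```

Injecting `send` keeps the loop independent of any particular client library and makes it easy to exercise with a stub.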
-
uploadEndpoint
string URI -
Specifies a URL where the generated content will be automatically uploaded using the HTTP PUT method. The raw binary data of the media file is sent directly as the request body. For secure uploads to cloud storage, use presigned URLs that include temporary authentication credentials.
Common use cases:
- Cloud storage: Upload directly to S3 buckets, Google Cloud Storage, or Azure Blob Storage using presigned URLs.
- CDN integration: Upload to content delivery networks for immediate distribution.
// S3 presigned URL for secure upload
https://your-bucket.s3.amazonaws.com/generated/content.mp4?X-Amz-Signature=abc123&X-Amz-Expires=3600
// Google Cloud Storage presigned URL
https://storage.googleapis.com/your-bucket/content.jpg?X-Goog-Signature=xyz789
// Custom storage endpoint
https://storage.example.com/uploads/generated-image.jpg
The content data will be sent as the request body to the specified URL when generation is complete.
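To check that an endpoint accepts this kind of upload before using it as `uploadEndpoint`, you could build an equivalent request yourself. The sketch below only constructs the request with the standard-library `urllib.request` and does not send it; the helper name `build_put_request` and the example URL are hypothetical, and any headers the platform adds are not specified here.

```python
import urllib.request


def build_put_request(upload_url: str, media: bytes) -> urllib.request.Request:
    """Build (but do not send) a PUT request with raw media bytes as the body."""
    return urllib.request.Request(upload_url, data=media, method="PUT")


req = build_put_request(
    "https://storage.example.com/uploads/generated-video.mp4",
    b"\x00\x01",  # stand-in for the binary video data
)
```

Passing `req` to `urllib.request.urlopen` would perform the actual upload against a real presigned URL.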
-
safety
object -
Content safety checking configuration for video generation.
Properties 2 properties
-
safety»checkContent
boolean default: false -
Enable or disable content safety checking. When enabled, defaults to `fast` mode.
-
safety»mode
string default: none -
Safety checking mode for video generation.
Allowed values 3 values
- Disables checking.
- Checks key frames.
- Checks all frames.
-
-
ttl
integer min: 60 -
Time-to-live (TTL) in seconds for generated content. Only applies when `outputType` is `URL`.
-
includeCost
boolean default: false -
Include task cost in the response.
-
numberResults
integer min: 1 max: 4 default: 1 -
Number of results to generate. Each result uses a different seed, producing variations of the same parameters.
Inputs
Input resources for the task (images, audio, etc). These must be nested inside the inputs object.
-
inputs»frameImages
array of strings or objects min items: 1 max items: 2 -
An array of frame-specific image inputs to guide video generation. Each item can be either a plain image input (UUID, URL, Data URI, or Base64) or an object that pairs an image with a target frame position.
The `frameImages` parameter allows you to constrain specific frames within the video sequence, ensuring that particular visual content appears at designated points. This is different from `referenceImages`, which provide overall visual guidance without constraining specific timeline positions.
When the `frame` parameter is omitted, automatic distribution rules apply:
- 1 image: Used as the first frame.
- 2 images: First and last frames.
Examples 3 examples
Shorthand format: When you don't need to specify a frame position, you can pass a plain image input directly.
"frameImages": [ "aac49721-1964-481a-ae78-8a4e29b91402" ]
Object format: When you need to specify a frame position, use an object with `image` and `frame`.
"frameImages": [ { "image": "aac49721-1964-481a-ae78-8a4e29b91402", "frame": "first" } ]
First and last frames: With two images, they automatically become the first and last frames of the video sequence. You can mix shorthand and object formats.
"frameImages": [ "aac49721-1964-481a-ae78-8a4e29b91402", { "image": "3ad204c3-a9de-4963-8a1a-c3911e3afafe", "frame": "last" } ]
Format 1: string[]
-
Image input (UUID, URL, Data URI, or Base64).
Format 2: object[] 2 properties
-
inputs»frameImages»image
string required -
Image input (UUID, URL, Data URI, or Base64).
-
inputs»frameImages»frame
object -
Target frame position for the image. Supports first and last frame.
Allowed values 4 values
- First frame of the video.
- Last frame of the video.
- Frame index 0 (first frame).
- Frame index -1 (last frame).
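The automatic distribution rules can be mirrored on the client side to make the effective frame positions explicit before sending. This is only an illustrative sketch: the function name `resolve_frame_positions` is hypothetical, it uses the `image` and `frame` keys documented above, and the service applies its own rules regardless.

```python
def resolve_frame_positions(frame_images: list) -> list:
    """Expand shorthand items into objects with an explicit frame position."""
    if not 1 <= len(frame_images) <= 2:
        raise ValueError("frameImages accepts 1 or 2 items")
    # With `frame` omitted: one image is the first frame; two are first and last.
    defaults = ["first", "last"]
    resolved = []
    for index, item in enumerate(frame_images):
        if isinstance(item, str):
            # Shorthand format: a plain image input (UUID, URL, Data URI, Base64).
            resolved.append({"image": item, "frame": defaults[index]})
        else:
            # Object format: keep an explicit frame, otherwise fall back to the default.
            resolved.append(
                {"image": item["image"], "frame": item.get("frame", defaults[index])}
            )
    return resolved


frames = resolve_frame_positions(
    [
        "aac49721-1964-481a-ae78-8a4e29b91402",
        {"image": "3ad204c3-a9de-4963-8a1a-c3911e3afafe", "frame": "last"},
    ]
)
```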
Generation Parameters
Core parameters for controlling the generated content.
-
model
string required value: pixverse:1@7 -
Identifier of the model to use for generation.
Learn more 3 resources
-
positivePrompt
string required min: 1 max: 2048 -
Text prompt describing elements to include in the generated output.
Learn more 2 resources
-
negativePrompt
string min: 1 max: 2048 -
Prompt to guide what to exclude from generation. Ignored when guidance is disabled (CFGScale ≤ 1).
Learn more 1 resource
-
width
Width of the generated media in pixels.
Learn more 2 resources
-
height
Height of the generated media in pixels.
Learn more 2 resources
-
resolution
string default: 720p -
Resolution preset for the output. When used with input media, automatically matches the aspect ratio from the input.
Allowed values 4 values
-
duration
float -
Length of the generated video in seconds. The total number of frames produced is determined by duration multiplied by the model's frame rate (fps).
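As a worked example of that relationship, the sketch below computes a frame count from a duration. The frame rate of 24 fps is purely an assumed value for illustration; this page does not state the model's actual frame rate.

```python
def frame_count(duration_s: float, fps: float) -> int:
    """Total frames = duration in seconds multiplied by frames per second."""
    return round(duration_s * fps)


ASSUMED_FPS = 24  # assumption for illustration only, not taken from this reference
frames_total = frame_count(8, ASSUMED_FPS)  # 8 * 24 = 192
```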
-
seed
integer min: 0 max: 2147483647 -
Random seed for reproducible generation. When not provided, a random seed is generated in the unsigned 32-bit range.
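If you want a result you can reproduce later, one option is to choose the seed yourself and store it alongside the request. A minimal sketch using the standard-library `random` module and the documented bounds; the helper name `pick_seed` is hypothetical.

```python
import random
from typing import Optional

SEED_MIN = 0
SEED_MAX = 2147483647  # documented maximum for the seed parameter


def pick_seed(rng: Optional[random.Random] = None) -> int:
    """Return a seed inside the documented range (both ends inclusive)."""
    rng = rng if rng is not None else random.Random()
    return rng.randint(SEED_MIN, SEED_MAX)


seed = pick_seed()
```

Sending the same `seed` with otherwise identical parameters is what makes a generation repeatable.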
Provider Settings
Parameters specific to this model provider. These must be nested inside the providerSettings.pixverse object.
-
providerSettings»pixverse»audio
boolean default: false -
Enable audio generation.
-
providerSettings»pixverse»style
string -
Artistic style aesthetic for video generation.
Allowed values 5 values
- Japanese animation aesthetic.
- Three-dimensional animated style with depth.
- Stop-motion clay animation appearance.
- Comic book or graphic novel visual style.
- Futuristic, neon-lit dystopian aesthetic.
-
providerSettings»pixverse»thinking
string default: auto -
Enhanced reasoning mode.
Allowed values 3 values
- Max understanding.
- Faster generation.
- Automatic.
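To show how the provider settings nest inside a request, here is a small payload builder. It is a sketch rather than an official client: the function name `build_request` is hypothetical, and the `style` and `thinking` values used (`cyberpunk`, `auto`) are taken from the request examples on this page.

```python
import uuid
from typing import Optional


def build_request(
    prompt: str,
    audio: bool = False,
    thinking: str = "auto",
    style: Optional[str] = None,
) -> dict:
    """Assemble a videoInference payload with nested PixVerse settings."""
    if not 1 <= len(prompt) <= 2048:
        raise ValueError("positivePrompt must be 1 to 2048 characters")
    pixverse = {"audio": audio, "thinking": thinking}
    if style is not None:
        pixverse["style"] = style  # optional, so only sent when chosen
    return {
        "taskType": "videoInference",
        "taskUUID": str(uuid.uuid4()),
        "model": "pixverse:1@7",
        "positivePrompt": prompt,
        # Provider-specific options live under providerSettings.pixverse.
        "providerSettings": {"pixverse": pixverse},
    }


request = build_request("A rain-soaked neon alley at night", audio=True, style="cyberpunk")
```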
Neon Alley Dialogue
{
"taskType": "videoInference",
"taskUUID": "e037301c-ed0e-4c85-bc3e-c6a97e4f1b1d",
"model": "pixverse:1@7",
"positivePrompt": "A cinematic single-shot scene in a rain-soaked cyberpunk alley at night, 2 characters facing each other under flickering neon signs and drifting steam. A street detective in a wet trench coat speaks urgently to a young hacker with a glowing translucent umbrella. Camera begins with a medium-wide shot, slowly dollying forward as reflections shimmer across puddles, holographic ads pulse on brick walls, and distant traffic glows in the fog. Their mouth movements match a tense whispered conversation, with subtle head turns, blinking, breathing, natural hand gestures, and realistic emotional timing. Rich environmental audio: soft rain, distant hover traffic, humming neon, footsteps in shallow water, quiet city ambience, and synchronized dialogue. High visual stability, crisp facial detail, clean motion, dramatic rim lighting, atmospheric depth, cinematic color grading.",
"negativePrompt": "blurry faces, distorted hands, extra limbs, jittery motion, frame flicker, warped mouths, bad lip sync, duplicate characters, low detail, overexposed highlights, unreadable composition, camera shake",
"width": 1280,
"height": 720,
"duration": 8,
"providerSettings": {
"pixverse": {
"style": "cyberpunk",
"audio": true,
"thinking": "enabled"
}
}
}
{
"taskType": "videoInference",
"taskUUID": "e037301c-ed0e-4c85-bc3e-c6a97e4f1b1d",
"videoUUID": "944829c5-2929-4f7b-9fe4-d07172d63097",
"videoURL": "https://vm.runware.ai/video/os/a05d22/ws/5/vi/944829c5-2929-4f7b-9fe4-d07172d63097.mp4",
"seed": 61861179,
"cost": 0.3978
}
Cinematic Time-Lapse Transformation
{
"taskType": "videoInference",
"taskUUID": "80e699da-e6c7-4a38-bedb-b1953af46d40",
"model": "pixverse:1@7",
"positivePrompt": "A cinematic single-shot transformation sequence beginning at an ancient jungle temple at sunrise and ending in the same composition as a neon cyberpunk ruin at night. The camera slowly pushes forward through drifting mist and floating dust motes as vines sway gently, birds scatter, light rays shift, stone surfaces gradually gain glowing circuitry, holographic signs emerge from the ruins, and the environment transitions seamlessly from warm golden dawn to moody blue-magenta neon night. Highly stable geometry, clean motion, rich atmosphere, detailed textures, dramatic lighting, cinematic realism.",
"negativePrompt": "flicker, warped architecture, camera shake, duplicated objects, deformed plants, broken perspective, chaotic motion, blurry details, low resolution, text artifacts, logos, watermarks",
"width": 1280,
"height": 720,
"duration": 8,
"providerSettings": {
"pixverse": {
"style": "cyberpunk",
"audio": true,
"thinking": "auto"
}
},
"inputs": {
"frameImages": [
{
"inputImage": "https://assets.runware.ai/assets/inputs/bcc43e08-b2e7-4a44-9c85-d7999db4a0c7.jpg",
"frame": "first"
},
{
"inputImage": "https://assets.runware.ai/assets/inputs/d9c4fa52-5b9f-4a4f-a888-968271a39ff9.jpg",
"frame": "last"
}
]
}
}
{
"taskType": "videoInference",
"taskUUID": "80e699da-e6c7-4a38-bedb-b1953af46d40",
"videoUUID": "6f24df34-b924-4132-91dd-cd3c0c9c5470",
"videoURL": "https://vm.runware.ai/video/os/a03d21/ws/5/vi/6f24df34-b924-4132-91dd-cd3c0c9c5470.mp4",
"seed": 1715010496,
"cost": 0.3978
}
Cyberpunk Street Dialogue
{
"taskType": "videoInference",
"taskUUID": "fcc71725-dad7-4a0f-a025-f410e0175811",
"model": "pixverse:1@7",
"positivePrompt": "A single-shot cinematic scene in a rain-soaked neon cyberpunk market at night. Two characters stand under a flickering holographic awning: a street-smart female smuggler in a reflective coat and a calm male android detective with subtle illuminated facial lines. The camera begins with a medium-wide shot, slowly dollies forward and arcs slightly around them as crowds and umbrellas pass in the blurred background. Wet pavement reflects pink, teal, and amber signage. Steam rises from food stalls, distant hover traffic glides overhead, and animated billboards cast shifting light across their faces. The woman speaks first with expressive natural lip movement and says, \"You really crossed half the city for one missing memory chip?\" The android replies with precise synchronized lip movement, \"Not for the chip. For the name encoded inside it.\" Add natural pauses, eye contact, small head turns, coat movement in the wind, realistic rain droplets, subtle background pedestrians, and rich ambient city audio with rain, electric hum, footsteps, crowd murmur, and the two clearly audible voices. High visual stability, clean motion, realistic facial animation, dramatic contrast, premium sci-fi cinematography.",
"negativePrompt": "blurry faces, distorted hands, broken anatomy, extra limbs, duplicate people, jittery motion, camera shake, garbled text, subtitles, watermark, logo, low detail, out of sync lips, muffled speech, chaotic framing",
"width": 1280,
"height": 720,
"duration": 8,
"providerSettings": {
"pixverse": {
"style": "cyberpunk",
"audio": true,
"thinking": "enabled"
}
}
}
{
"taskType": "videoInference",
"taskUUID": "fcc71725-dad7-4a0f-a025-f410e0175811",
"videoUUID": "d5e6596c-b6d1-48e3-9b13-0aa2f61966b8",
"videoURL": "https://vm.runware.ai/video/os/a23d05/ws/5/vi/d5e6596c-b6d1-48e3-9b13-0aa2f61966b8.mp4",
"seed": 1824711847,
"cost": 0.3978
}
Neon Rooftop Duet Nightscape
{
"taskType": "videoInference",
"taskUUID": "00fa8622-053e-4705-a7f0-ba847d70bbf2",
"model": "pixverse:1@7",
"positivePrompt": "A single-shot cinematic cyberpunk rooftop concert at midnight above a vast rain-slick megacity, two young street musicians performing under a flickering holographic billboard, one singing into a chrome microphone while the other plays a transparent neon violin, drifting steam vents, distant flying traffic, pulsing magenta and cyan reflections on wet concrete, subtle wind moving coats and hair, emotional eye contact, expressive mouth movement for lyrics, realistic hand motion, smooth dolly-in camera, dramatic atmosphere, richly layered city depth, highly coherent motion, natural environmental sound with soft rain, humming city ambience, violin performance, and clear vocal singing",
"negativePrompt": "blurry faces, extra limbs, broken fingers, frozen motion, jittery camera, duplicate characters, warped instruments, distorted mouths, subtitle text, watermark, logo, low detail, noisy frame, abrupt cuts",
"width": 1280,
"height": 720,
"duration": 8,
"seed": 85288,
"providerSettings": {
"pixverse": {
"style": "cyberpunk",
"audio": true,
"thinking": "enabled"
}
}
}
{
"taskType": "videoInference",
"taskUUID": "00fa8622-053e-4705-a7f0-ba847d70bbf2",
"videoUUID": "808ab180-30ce-4273-a4bf-c41250ff38a5",
"videoURL": "https://vm.runware.ai/video/os/a07d11/ws/5/vi/808ab180-30ce-4273-a4bf-c41250ff38a5.mp4",
"seed": 85288,
"cost": 0.3978
}
Cyberpunk Street Monologue
{
"taskType": "videoInference",
"taskUUID": "4affe9f1-cb16-4487-9712-244416dbf172",
"model": "pixverse:1@7",
"positivePrompt": "Animate the provided first-frame image into a single-shot cinematic scene. The detective looks into camera and delivers a quiet, intense monologue while subtle rain falls around her. Her lips sync naturally as she speaks, her breath visible in the cool air. The camera slowly pushes in, neon reflections shimmer across the wet street, distant vehicles glide through the background, a holographic billboard flickers, and her coat shifts gently in the wind. Maintain strong facial consistency, clean motion, realistic lighting, atmospheric depth, and immersive urban ambience with natural speech and city sound design.",
"negativePrompt": "blurry face, distorted hands, warped mouth, bad lip sync, jitter, flicker, low detail, extra people appearing, sudden camera shake, broken anatomy, text artifacts, overexposure",
"width": 1280,
"height": 720,
"duration": 8,
"providerSettings": {
"pixverse": {
"style": "cyberpunk",
"audio": true,
"thinking": "auto"
}
},
"inputs": {
"frameImages": [
{
"inputImage": "https://assets.runware.ai/assets/inputs/4d25cccd-4fa4-4cf3-a4dc-165c5981fb45.jpg",
"frame": "first"
}
]
}
}
{
"taskType": "videoInference",
"taskUUID": "4affe9f1-cb16-4487-9712-244416dbf172",
"videoUUID": "b365d0dc-1e51-48ba-ade1-3b9ff9dd59c2",
"videoURL": "https://vm.runware.ai/video/os/a02d21/ws/5/vi/b365d0dc-1e51-48ba-ade1-3b9ff9dd59c2.mp4",
"seed": 1563026224,
"cost": 0.3978
}