Skywork

SkyReels V4

Unified multimodal video model for generation, inpainting, and editing with synchronized audio

Text to Video, Image to Video, Video to Video, Audio to Video

SkyReels V4 Overview

SkyReels V4 is a unified multimodal video foundation model for joint video-audio generation, inpainting, and editing. It accepts text, images, video clips, masks, and audio references, and supports cinematic outputs up to 1080p, 32 FPS, and 15 seconds with synchronized audio, making it suitable for prompt-driven generation as well as guided editing workflows.

From $0.1200 / video
1s, no audio: $0.120
1s, with audio: $0.140
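
As a rough budgeting aid, the listed prices can be turned into a small estimator. This is only a sketch: the listing gives prices for a 1-second clip, and the code ASSUMES that cost scales linearly with duration, which the listing does not confirm. The function name `estimate_cost` is hypothetical.

```python
from decimal import Decimal

# Listed prices for a 1-second clip, copied from the pricing lines above.
PRICE_1S_USD = {
    False: Decimal("0.120"),  # no audio
    True: Decimal("0.140"),   # with audio
}
MAX_DURATION_S = 15  # maximum clip length stated for the model


def estimate_cost(duration_s: int, with_audio: bool) -> Decimal:
    """Estimate the cost of one clip in USD.

    ASSUMPTION: price grows linearly with duration. Only the 1-second
    price is listed, so longer clips may be billed differently.
    """
    if not 1 <= duration_s <= MAX_DURATION_S:
        raise ValueError(f"duration must be 1..{MAX_DURATION_S} seconds")
    return PRICE_1S_USD[with_audio] * duration_s


if __name__ == "__main__":
    for seconds in (1, 5, 15):
        print(seconds, estimate_cost(seconds, with_audio=True))
```

`Decimal` is used instead of `float` so that the listed prices are represented exactly and sums do not pick up binary rounding noise.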

Commercial use

How to Use SkyReels V4

Overview

SkyReels V4 is a unified multimodal video model for generating and editing video with synchronized audio. It is designed to handle generation, inpainting, and editing inside one model rather than splitting those workflows across separate systems.

This makes it a strong fit for teams that want one video model for prompt-driven creation, guided edits, and multimodal control.

Capabilities

Joint Video and Audio Generation

SkyReels V4 generates video and temporally aligned audio together. This is useful for workflows where synchronized audiovisual output matters and separate post-processing pipelines would add complexity.

Text-to-Video Generation

The model supports prompt-driven video generation from text instructions. It is suited to short cinematic sequences, multi-shot generation, and scene creation guided by natural language.

Image- and Video-Guided Generation

SkyReels V4 accepts images and video clips as conditioning inputs, which makes it useful for image-to-video workflows, continuation, guided generation, and reference-based scene control.

Inpainting and Editing

The model supports mask-based inpainting and video editing through the same interface. It can replace regions, extend sequences, and edit existing footage without switching to a separate editing model.

Audio-Referenced Control

SkyReels V4 can take audio references as part of the conditioning input, which helps guide synchronized sound generation and broader multimodal editing workflows.

High-Fidelity Short-Form Output

The model supports outputs up to 1080p resolution, 32 FPS, and 15 seconds. This is suited to high-quality short-form clips where motion, timing, and audiovisual coherence all matter.

Input and Output

  • Runware model ID: skywork-skyreels-v4
  • AIR ID: skywork:skyreels@v4
  • Input: text, images, video clips, masks, and audio references depending on workflow
  • Output: generated or edited video with synchronized audio
  • Maximum resolution: 1080p
  • Maximum frame rate: 32 FPS
  • Maximum duration: 15 seconds
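
The limits above can be checked on the client side before a job is submitted. The following is a minimal sketch, not part of any official SDK: `ClipSettings` and `check_limits` are hypothetical names, and "1080p" is read here as a cap on the shorter side of the frame, which is an assumption.

```python
from dataclasses import dataclass

MODEL_AIR_ID = "skywork:skyreels@v4"  # AIR ID from the list above
MAX_SHORT_SIDE = 1080  # "1080p", ASSUMED to limit the shorter side in pixels
MAX_FPS = 32           # maximum frame rate from the list above
MAX_DURATION_S = 15    # maximum duration in seconds from the list above


@dataclass(frozen=True)
class ClipSettings:
    """Hypothetical container for the output settings of one clip."""
    width: int
    height: int
    fps: int
    duration_s: float

    @property
    def frame_count(self) -> int:
        # Total frames the clip would contain at these settings.
        return round(self.fps * self.duration_s)


def check_limits(s: ClipSettings) -> list[str]:
    """Return a list of human-readable limit violations (empty if none)."""
    problems = []
    if min(s.width, s.height) > MAX_SHORT_SIDE:
        problems.append(f"resolution {s.width}x{s.height} exceeds 1080p")
    if not 0 < s.fps <= MAX_FPS:
        problems.append(f"fps {s.fps} outside 1..{MAX_FPS}")
    if not 0 < s.duration_s <= MAX_DURATION_S:
        problems.append(f"duration {s.duration_s}s outside 0..{MAX_DURATION_S}s")
    return problems


if __name__ == "__main__":
    settings = ClipSettings(width=1920, height=1080, fps=32, duration_s=15)
    print(MODEL_AIR_ID, settings.frame_count, check_limits(settings))
```

At the stated maximums, a clip holds 32 × 15 = 480 frames. Returning a list of problems rather than raising on the first one lets a caller report every violation at once.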

Typical Use Cases

  • Text-to-video generation with native audio
  • Image-guided or video-guided generation
  • Video inpainting and mask-based editing
  • Short-form cinematic clip generation
  • Multimodal audiovisual editing workflows