Vidu

Vidu Q3

Multimodal video generation with native audio and intelligent shot planning

Text to Video · Image to Video · Audio to Video
Vidu Q3 Overview

Vidu Q3 is a multimodal video generation model that creates video with synchronized audio directly from text or images, supports intelligent multi-shot sequencing, and produces complete outputs with stable visuals and embedded subtitles without post-processing.

From $0.0455 / video

Save on average 35% vs the market

360p · 1s · $0.0455
540p · 1s · $0.0455
720p · 1s · $0.0975
1080p · 1s · $0.1040
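
As a rough illustration of how the listed prices translate into the cost of a clip, the sketch below multiplies the per-second rate by the clip length. It assumes that cost scales linearly with duration, which the price list suggests but does not state; the function name `estimate_cost` is hypothetical and not part of any Vidu tooling.

```python
from decimal import Decimal

# Prices for 1 s of video, copied from the list above (USD).
PRICE_PER_SECOND = {
    "360p": Decimal("0.0455"),
    "540p": Decimal("0.0455"),
    "720p": Decimal("0.0975"),
    "1080p": Decimal("0.1040"),
}


def estimate_cost(resolution: str, seconds: int) -> Decimal:
    """Estimate the cost of one clip.

    ASSUMPTION: price grows linearly with duration. Check actual billing
    before relying on this figure.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    if resolution not in PRICE_PER_SECOND:
        raise ValueError(f"unknown resolution: {resolution!r}")
    return PRICE_PER_SECOND[resolution] * seconds


if __name__ == "__main__":
    for resolution in PRICE_PER_SECOND:
        print(f"{resolution}: 8 s clip ~ ${estimate_cost(resolution, 8)}")
```

`Decimal` is used instead of `float` so that small per-second prices add up without binary rounding noise.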

Commercial use

How to Use Vidu Q3

Overview

Vidu Q3 is a next-generation multimodal video generation model developed by Vidu. It supports both text-to-video and image-to-video workflows through a single unified interface, allowing users to generate cinematic video clips either from text alone or by animating a reference image.

The model is designed for expressive storytelling, with support for longer short-form clips, cinematic camera motion, multi-shot sequencing, and optional synchronized audio. Vidu Q3 is well suited to narrative content, concept videos, social media clips, and creative prototyping where visual continuity and motion quality are important.

How it Works

Vidu Q3 uses a multimodal generative pipeline that combines language understanding, optional image conditioning, and temporal modelling to produce cohesive video sequences with stable motion and consistent subjects.

Prompt Interpretation

The model analyses text prompts to identify subjects, actions, environments, mood, camera behaviour, and audio cues. These signals guide both visual composition and timing throughout the generated clip.

Image-Guided Video

When a reference image is provided, it is used as the starting frame for the video. This enables image-to-video generation using the same endpoint, allowing creators to animate a still image while preserving composition, characters, or visual style.
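
The idea that one endpoint serves both modes, with the reference image acting as the switch, can be sketched as a small request builder. This is only an illustration: the field names `prompt`, `image`, and `audio`, and the base64 encoding of the image, are assumptions and are not taken from Vidu's API documentation.

```python
import base64
from typing import Optional


def build_request(prompt: str,
                  image_bytes: Optional[bytes] = None,
                  audio: bool = False) -> dict:
    """Build a JSON-serializable request body (hypothetical field names)."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    body = {"prompt": prompt, "audio": audio}
    if image_bytes is not None:
        # Including an image is what turns the request into image-to-video;
        # there is no separate "mode" field in this sketch.
        body["image"] = base64.b64encode(image_bytes).decode("ascii")
    return body


def mode_of(body: dict) -> str:
    """Report which workflow a request body would trigger."""
    return "image-to-video" if "image" in body else "text-to-video"


if __name__ == "__main__":
    print(mode_of(build_request("A fox running through snow")))
    print(mode_of(build_request("The fox turns its head", b"\x89PNG...")))
```

Deriving the mode from the payload, rather than passing it as a flag, mirrors the description above: the caller uses the same workflow and only adds or omits the image.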

Video & Audio Generation

Vidu Q3 generates temporally consistent video frames with smooth motion, intelligent camera movement, and natural transitions between shots. When audio is enabled, sound effects, ambience, or simple dialogue are generated alongside the visuals to align with pacing and on-screen action.

Key Features

  • Text-to-Video and Image-to-Video
    Generate videos from text prompts alone or animate a reference image using the same workflow.

  • Single Unified Endpoint
    One endpoint supports both modes. Providing an image automatically enables image-guided video generation.

  • Extended Short-Form Duration
    Supports longer short-form clips than many video generation models.

  • Cinematic Camera Motion
    Built-in support for pans, tracking shots, zooms, and dynamic framing.

  • Multi-Shot Sequencing
    Handles automatic shot changes and transitions within a single generation.

  • Native Audio Output
    Optional synchronized audio generation alongside video.

  • Consistent Subjects & Motion
    Maintains character appearance and motion coherence across frames and shots.

Technical Specifications

  • Model Name: Vidu Q3
  • Model Type: Multimodal video generation
  • Inputs: Text prompt, optional reference image
  • Outputs: MP4 video
  • Duration: Extended short-form clips (configurable)
  • Resolutions: 360p, 540p, 720p (HD), 1080p (Full HD)
  • Aspect Ratios: Common video aspect ratios supported
  • Audio: Optional native audio generation

How to Use

  1. Write a prompt describing the scene, actions, camera movement, and audio cues.
  2. (Optional) Upload a reference image to enable image-to-video generation.
  3. Choose the desired duration, resolution, and aspect ratio.
  4. Submit the request to Vidu Q3.
  5. Retrieve the generated video once processing completes.

Example prompt:
A bustling medieval market at golden hour, opening with a wide establishing shot, cutting to close-ups of merchants and crowds, with ambient chatter, footsteps, and soft music.
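
The five steps above follow a submit-then-retrieve pattern, which might look like the sketch below. Nothing here is Vidu's actual API: the request fields, the `state` and `video_url` keys, and the `submit` and `get_status` callables are all hypothetical stand-ins that you would replace with calls to whichever platform hosts the model. An in-memory fake backend is included so the sketch runs without a network.

```python
import time
from typing import Callable


def generate_video(submit: Callable[[dict], str],
                   get_status: Callable[[str], dict],
                   request: dict,
                   poll_interval: float = 2.0,
                   timeout: float = 600.0) -> str:
    """Submit a request (step 4), then poll until the video is ready (step 5)."""
    job_id = submit(request)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status(job_id)
        if status["state"] == "succeeded":
            return status["video_url"]
        if status["state"] == "failed":
            raise RuntimeError(status.get("error", "generation failed"))
        time.sleep(poll_interval)
    raise TimeoutError(f"job {job_id} did not finish within {timeout} s")


def make_fake_backend():
    """In-memory stand-in for a real service, for demonstration only."""
    states = iter([
        {"state": "processing"},
        {"state": "succeeded", "video_url": "https://example.invalid/clip.mp4"},
    ])
    return (lambda request: "job-1"), (lambda job_id: next(states))


# Steps 1-3: prompt, optional image, and output settings (hypothetical fields
# and illustrative values).
EXAMPLE_REQUEST = {
    "prompt": ("A bustling medieval market at golden hour, opening with a wide "
               "establishing shot, cutting to close-ups of merchants and crowds, "
               "with ambient chatter, footsteps, and soft music."),
    "duration_seconds": 8,
    "resolution": "1080p",
    "aspect_ratio": "16:9",
    "audio": True,
}

if __name__ == "__main__":
    submit, get_status = make_fake_backend()
    print(generate_video(submit, get_status, EXAMPLE_REQUEST, poll_interval=0))
```

Passing `submit` and `get_status` in as functions keeps the polling logic independent of any particular HTTP client or SDK.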

Tips for Better Results

  • Describe camera movement explicitly to guide cinematic motion.
  • Use reference images to lock character appearance or scene composition.
  • Include atmosphere and audio cues to improve mood and pacing.
  • Start with shorter clips when iterating, then increase duration once satisfied.
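
One way to apply these tips consistently is to assemble prompts from separate scene, camera, and audio parts, so that none of them is forgotten while iterating. The helper below is a hypothetical convenience, not something the model requires; Vidu Q3 accepts free-form text.

```python
from typing import Sequence


def _join_with_and(items: Sequence[str]) -> str:
    """Join items as 'a', 'a and b', or 'a, b, and c'."""
    cleaned = [item.strip() for item in items if item.strip()]
    if len(cleaned) <= 2:
        return " and ".join(cleaned)
    return ", ".join(cleaned[:-1]) + ", and " + cleaned[-1]


def compose_prompt(scene: str,
                   camera: Sequence[str] = (),
                   audio: Sequence[str] = ()) -> str:
    """Combine a scene description with explicit camera and audio cues."""
    parts = [scene.strip().rstrip(".")]
    parts.extend(cue.strip() for cue in camera if cue.strip())
    audio_text = _join_with_and(audio)
    if audio_text:
        parts.append("with " + audio_text)
    return ", ".join(parts) + "."


if __name__ == "__main__":
    print(compose_prompt(
        "A bustling medieval market at golden hour",
        camera=["opening with a wide establishing shot",
                "cutting to close-ups of merchants and crowds"],
        audio=["ambient chatter", "footsteps", "soft music"],
    ))
```

Called with the arguments shown, the helper reproduces the example prompt from the How to Use section.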

Notes & Limitations

  • Vidu Q3 is optimised for short-form narrative video generation.
  • Longer or highly complex stories may benefit from multiple generations.
  • Output quality depends heavily on prompt clarity and reference image quality.
  • Audio fidelity varies depending on scene complexity.
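
For stories that do not fit in one clip, a simple planning step can group story beats into separate generations. The sketch below packs beats greedily, in order, under a per-clip limit. The limit is passed in as `max_seconds` because this page does not state Vidu Q3's maximum duration; the function and type names are hypothetical.

```python
from typing import List, Tuple

Beat = Tuple[str, int]  # (description, desired length in seconds)


def plan_generations(beats: List[Beat], max_seconds: int) -> List[List[Beat]]:
    """Group consecutive beats into clips no longer than max_seconds each."""
    if max_seconds <= 0:
        raise ValueError("max_seconds must be positive")
    clips: List[List[Beat]] = []
    current: List[Beat] = []
    used = 0
    for description, seconds in beats:
        if seconds > max_seconds:
            raise ValueError(f"beat {description!r} exceeds max_seconds")
        if current and used + seconds > max_seconds:
            # Close the current clip and start a new generation.
            clips.append(current)
            current, used = [], 0
        current.append((description, seconds))
        used += seconds
    if current:
        clips.append(current)
    return clips


if __name__ == "__main__":
    story = [("wide shot of the market", 4), ("merchant close-up", 4),
             ("chase through the stalls", 6), ("quiet aftermath", 3)]
    for index, clip in enumerate(plan_generations(story, max_seconds=8), start=1):
        print(index, [description for description, _ in clip])
```

Each resulting group can then be turned into its own prompt, optionally with a shared reference image to help keep characters consistent between generations.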

More models from Vidu

Vidu Q3 Turbo is a speed-optimized multimodal video generation model that produces short video clips with synchronized audio directly from text or images. It prioritizes fast inference and responsive iteration while preserving stable motion, coherent composition, and reliable audio alignment, making it suitable for rapid prototyping and production workflows where latency is critical.

Vidu Q2 Turbo is the fast tier of the Q2 video model, aimed at rapid iteration in creative pipelines. It keeps the cinematic look of Vidu Q2 Pro while adding lower latency, stronger large-motion control, and smoother camera movement for prompt-driven video shots.

Vidu Q2 Pro is a high-fidelity video generation model for cinematic storytelling. It supports text prompts, image inputs, and multi-reference control for long-form scenes. It targets developers who need controllable motion, stable characters, and smooth camera work for complex shots.

Vidu Q1 Classic generates 1080p clips up to 16 seconds from text prompts, source images, or reference shots. It targets controllable motion and stable scenes for fast prototyping. Ideal for teams that need cinematic tests without complex video pipelines.

Vidu Q1 is a generative video model that preserves visual fidelity from multiple reference images. It supports character, scene and prop control with smooth transitions and 1080p clips. Ideal for ads, story sequences and animation workflows that need tight visual continuity.

Vidu Q1 (image) is a reference-to-image model designed for high visual fidelity. It blends multiple input images with consistent identity and style. Prompts can guide composition and layout without losing coherence. The model supports flexible aspect ratios for ads, social content, storyboards or animation assets. It produces clean visuals with minimal effort and is useful for rapid creative workflows.

Vidu 2.0 is a generative video model for rapid 1080p clip creation. It targets 4-second and 8-second shots with strong subject consistency and support for batch workflows. Developers can drive cinematic clips from text prompts and templates with improved speed and lower cost.

Vidu 1.5 is a multimodal text-to-video model that focuses on multi-entity consistency across complex scenes. It keeps multiple characters and objects visually stable across frames and shots. Developers can build long-form video workflows that need coherent motion and style control.