Vidu Q3

Multimodal video generation with native audio and intelligent shot planning

Vidu Q3 is a multimodal video generation model that creates video with synchronized audio directly from text or images, supports intelligent multi-shot sequencing, and produces complete outputs with stable visuals and embedded subtitles without post-processing.

Vidu
Commercial use
Text to Video · Image to Video · Audio to Video
From $0.0455 per sec

Average savings vs typical market rates

Resolution   Duration   Savings     Price
360p         1s         Save ~35%   $0.0455
540p         1s         Save ~35%   $0.0455
720p         1s         Save ~35%   $0.0975
1080p        1s         Save ~35%   $0.1040
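
As a rough budgeting aid, the per-second prices listed above can be turned into a clip cost estimate. The sketch below assumes billing scales linearly with clip length, which the page does not state explicitly; the function name `estimate_cost` is illustrative and not part of any Vidu API.

```python
from decimal import Decimal

# Per-second prices in USD, copied from the pricing list above.
PRICE_PER_SECOND = {
    "360p": Decimal("0.0455"),
    "540p": Decimal("0.0455"),
    "720p": Decimal("0.0975"),
    "1080p": Decimal("0.1040"),
}


def estimate_cost(resolution: str, seconds: int) -> Decimal:
    """Estimate the cost of one clip.

    Assumption: price = per-second rate * duration (linear billing).
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    try:
        rate = PRICE_PER_SECOND[resolution]
    except KeyError:
        raise ValueError(f"unknown resolution: {resolution!r}") from None
    return rate * seconds


if __name__ == "__main__":
    # Cost of an 8-second clip at 720p under the linear-billing assumption.
    print(estimate_cost("720p", 8))
```

Decimal is used instead of float so that sub-cent prices add up without rounding drift.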

README

Overview

Vidu Q3 is a next-generation multimodal video generation model developed by Vidu. It supports both text-to-video and image-to-video workflows through a single unified interface, allowing users to generate cinematic video clips either from text alone or by animating a reference image.

The model is designed for expressive storytelling, with support for longer short-form clips, cinematic camera motion, multi-shot sequencing, and optional synchronized audio. Vidu Q3 is well suited to narrative content, concept videos, social media clips, and creative prototyping where visual continuity and motion quality are important.

How it Works

Vidu Q3 uses a multimodal generative pipeline that combines language understanding, optional image conditioning, and temporal modelling to produce cohesive video sequences with stable motion and consistent subjects.

Prompt Interpretation

The model analyses text prompts to identify subjects, actions, environments, mood, camera behaviour, and audio cues. These signals guide both visual composition and timing throughout the generated clip.

Image-Guided Video

When a reference image is provided, it is used as the starting frame for the video. This enables image-to-video generation using the same endpoint, allowing creators to animate a still image while preserving composition, characters, or visual style.

Video & Audio Generation

Vidu Q3 generates temporally consistent video frames with smooth motion, intelligent camera movement, and natural transitions between shots. When audio is enabled, sound effects, ambience, or simple dialogue are generated alongside the visuals to align with pacing and on-screen action.

Key Features

  • Text-to-Video and Image-to-Video
    Generate videos from text prompts alone or animate a reference image using the same workflow.

  • Single Unified Endpoint
    One endpoint supports both modes. Providing an image automatically enables image-guided video generation.

  • Extended Short-Form Duration
    Supports longer short-form clips than many video generation models.

  • Cinematic Camera Motion
    Built-in support for pans, tracking shots, zooms, and dynamic framing.

  • Multi-Shot Sequencing
    Handles automatic shot changes and transitions within a single generation.

  • Native Audio Output
    Optional synchronized audio generation alongside video.

  • Consistent Subjects & Motion
    Maintains character appearance and motion coherence across frames and shots.

Technical Specifications

  • Model Name: Vidu Q3
  • Model Type: Multimodal video generation
  • Inputs: Text prompt, optional reference image
  • Outputs: MP4 video
  • Duration: Extended short-form clips (configurable)
  • Resolutions: 360p, 540p, 720p (HD), and 1080p (Full HD), matching the pricing tiers
  • Aspect Ratios: Common video aspect ratios supported
  • Audio: Optional native audio generation

How to Use

  1. Write a prompt describing the scene, actions, camera movement, and audio cues.
  2. (Optional) Upload a reference image to enable image-to-video generation.
  3. Choose the desired duration, resolution, and aspect ratio.
  4. Submit the request using Vidu Q3.
  5. Retrieve the generated video once processing completes.
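
The steps above can be sketched as a request payload plus a polling loop. This is a minimal illustration only: the real request schema is not given on this page, so every field name (`prompt`, `image_url`, `duration_s`, `resolution`, `aspect_ratio`, `audio`) and every status value (`processing`, `succeeded`, `failed`) is a hypothetical placeholder, and the asynchronous submit-then-poll flow is an assumption based on step 5. The HTTP call itself is left out; `fetch_status` stands in for whatever client function returns the job status.

```python
import json
import time
from typing import Callable, Optional


def build_request(
    prompt: str,
    *,
    duration_s: int,
    resolution: str,
    aspect_ratio: str,
    image_url: Optional[str] = None,
    audio: bool = True,
) -> dict:
    """Assemble a request body (hypothetical field names, not the real schema)."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    payload = {
        "prompt": prompt,
        "duration_s": duration_s,
        "resolution": resolution,
        "aspect_ratio": aspect_ratio,
        "audio": audio,
    }
    # Same builder for both modes: including an image switches the request
    # from text-to-video to image-to-video, mirroring the unified endpoint.
    if image_url is not None:
        payload["image_url"] = image_url
    return payload


def wait_for_result(
    fetch_status: Callable[[], dict],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
) -> dict:
    """Poll until the job reports a terminal state (hypothetical state names)."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status.get("state") in ("succeeded", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("video generation did not finish in time")
        time.sleep(interval_s)


if __name__ == "__main__":
    body = build_request(
        "A bustling medieval market at golden hour",
        duration_s=4,
        resolution="720p",
        aspect_ratio="16:9",
    )
    print(json.dumps(body, indent=2))
```

Keeping the payload builder separate from the transport makes it easy to swap in the actual client and field names once they are confirmed against the official documentation.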

Example prompt:
A bustling medieval market at golden hour, opening with a wide establishing shot, cutting to close-ups of merchants and crowds, with ambient chatter, footsteps, and soft music.

Tips for Better Results

  • Describe camera movement explicitly to guide cinematic motion.
  • Use reference images to lock character appearance or scene composition.
  • Include atmosphere and audio cues to improve mood and pacing.
  • Start with shorter clips when iterating, then increase duration once satisfied.
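
One way to apply these tips consistently is to assemble prompts from separate scene, camera, and audio parts. The helper below is a hypothetical convenience, not something the model requires; `compose_prompt` and its parameters are names invented for this sketch. It only joins strings, so the model still receives an ordinary text prompt.

```python
from typing import Sequence


def _join_with_and(items: Sequence[str]) -> str:
    """Join items as 'a', 'a and b', or 'a, b, and c'."""
    items = list(items)
    if len(items) <= 2:
        return " and ".join(items)
    return ", ".join(items[:-1]) + ", and " + items[-1]


def compose_prompt(
    scene: str,
    shots: Sequence[str] = (),
    audio_cues: Sequence[str] = (),
) -> str:
    """Build one prompt sentence: scene, then camera/shot cues, then audio."""
    parts = [scene.strip().rstrip(".")]
    parts.extend(shot.strip() for shot in shots)
    if audio_cues:
        parts.append("with " + _join_with_and(audio_cues))
    return ", ".join(parts) + "."


if __name__ == "__main__":
    print(
        compose_prompt(
            "A bustling medieval market at golden hour",
            shots=[
                "opening with a wide establishing shot",
                "cutting to close-ups of merchants and crowds",
            ],
            audio_cues=["ambient chatter", "footsteps", "soft music"],
        )
    )
```

With these inputs the helper reproduces the example prompt from the How to Use section, which makes it easy to vary one element (say, the audio cues) while holding the rest fixed during iteration.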

Notes & Limitations

  • Vidu Q3 is optimised for short-form narrative video generation.
  • Longer or highly complex stories may need to be split across multiple generations.
  • Output quality depends heavily on prompt clarity and reference image quality.
  • Audio fidelity varies depending on scene complexity.
