Google

Veo 3.1

Veo 3.1 cinematic AI video with native audio

Text to Video · Image to Video · Audio to Video

Veo 3.1 Overview

Veo 3.1 is a cinematic video generation model for developers. It turns text prompts or reference images into high-fidelity scenes with richer native audio, better prompt adherence, and granular shot control. Use it for story-driven clips with smoother motion and consistent style.

From $0.80 / video

  • 720p · 4s (without audio): $0.80
  • 720p · 4s (with audio): $1.60
  • 720p · 8s (without audio): $1.60
  • 720p · 8s (with audio): $3.20
  • 4K · 8s (without audio): $3.20
  • 4K · 8s (with audio): $4.80

Commercial use

How to Use Veo 3.1

Overview

Google Veo 3.1 is an advanced AI video generation model that turns natural language descriptions and optional image references into cinematic, story-driven video clips with rich, native audio. It’s built for creators, developers, and storytellers who need high-quality video output without manual animation or rendering.

Veo 3.1 enhances realism, motion coherence, and audiovisual coordination compared to earlier versions, enabling content that feels more immersive and expressive. It supports a range of creative workflows and is designed for rapid prototyping, concept visualisation, and creative storytelling.

How it Works

Veo 3.1 combines several generative techniques to produce cohesive video output from text and images:

Prompt Interpretation

The model parses your natural language prompt to understand subjects, actions, environments, camera movements, and audio cues.

Video Synthesis

A specialised temporal generation pipeline produces sequences of frames that maintain continuity and fluid motion. This ensures smooth transitions and consistent visual composition.

Audio Generation

Native audio tracks — including ambience, music, and sound effects — are generated to align with the visual content, enhancing feel, pacing, and immersion.

Key Features

  • Text-to-Video and Image-to-Video
    Create videos directly from descriptive prompts or use reference images to guide the visual style and composition.
  • Reference Image Support
    Use up to three asset images or a single style image to influence video content, with specific aspect ratio constraints.
  • Frame Anchoring
    Provide first and last frame images to guide motion and narrative direction.
  • Audio Synchronisation
    Generate audio that matches the rhythm and mood of the visuals without separate audio tools.
  • Consistent Motion
    Designed to handle smooth motion and transitions across all frames within the clip.

Technical Specifications

  • Model ID: google:3@2
  • Workflows Supported: Text-to-video, Image-to-video
  • Supported Resolutions: 1280×720, 1920×1080 (standard and vertical where applicable)
  • Frame Rate: 24 FPS
  • Default Duration: 8 seconds
  • Prompt Length: Typically up to 3000 characters
  • Reference Image Constraints: Aspect ratios must match the supported video output; up to three asset images or one style image; reference images cannot be combined with first/last frame guidance
  • Enhanced Prompting: Always enabled to enrich user prompts for quality results

How to Use

  1. Write a clear prompt describing the scene, motion, camera style, and any audio cues.
  2. Choose your input style: text only, reference images, or start and end frame guidance.
  3. Send the request to the API or platform where Veo 3.1 is hosted.
  4. Retrieve the generated video output.

Example prompt:
A lively urban plaza at sunset, slow tracking camera circling around dancers, ambient street sounds with distant music.
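The request flow above can be sketched in code. This is a hypothetical illustration only: the endpoint, task name, and field names (`taskType`, `positivePrompt`, and so on) are assumptions, not confirmed API details — see the Documentation section for the actual schema. The model ID, resolution, duration, and prompt-length limit come from the Technical Specifications above.

```python
import json
import uuid

# Assumed endpoint; check the provider documentation for the real value.
API_URL = "https://api.runware.ai/v1"

def build_video_request(prompt: str,
                        model: str = "google:3@2",  # model ID from the specs above
                        width: int = 1280,
                        height: int = 720,
                        duration: int = 8) -> dict:
    """Assemble a single video-generation task payload (field names assumed)."""
    if len(prompt) > 3000:  # prompt length limit from the specs above
        raise ValueError("Prompt exceeds the ~3000 character limit")
    return {
        "taskType": "videoInference",   # assumed task name
        "taskUUID": str(uuid.uuid4()),  # client-generated request ID
        "model": model,
        "positivePrompt": prompt,
        "width": width,
        "height": height,
        "duration": duration,
    }

payload = build_video_request(
    "A lively urban plaza at sunset, slow tracking camera circling around "
    "dancers, ambient street sounds with distant music."
)
print(json.dumps(payload, indent=2))

# Sending would then be an authenticated POST, e.g. with `requests`:
#   requests.post(API_URL, json=[payload],
#                 headers={"Authorization": f"Bearer {API_KEY}"})
```

Step 4 (retrieving the output) typically means polling or reading the response for the generated video URL; again, consult the documentation for the exact response shape.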

Tips for Better Results

  • Start with the main subject, then layer in environment and motion.
  • Add mood, lighting, and audio cues towards the end of the prompt.
  • Use reference or frame images for tighter control over composition and motion.

Notes & Limitations

  • Veo 3.1 is optimised for short, high-quality video clips.
  • Longer narratives may require multiple generations or stitched outputs.
  • Image and aspect ratio constraints apply when using reference inputs.
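For the stitching case mentioned above, generated clips can be joined without re-encoding using ffmpeg's concat demuxer, provided all segments share codec, resolution, and frame rate (which clips generated at the same settings should). A minimal sketch — the file names are hypothetical:

```python
def build_concat_command(clips: list[str], output: str) -> tuple[list[str], str]:
    """Build an ffmpeg concat-demuxer command plus its list-file contents.

    The concat demuxer reads a text file with one `file '<path>'` line
    per clip and copies streams without re-encoding (`-c copy`).
    """
    listing = "\n".join(f"file '{c}'" for c in clips) + "\n"
    cmd = ["ffmpeg", "-f", "concat", "-safe", "0",
           "-i", "clips.txt", "-c", "copy", output]
    return cmd, listing

cmd, listing = build_concat_command(["shot1.mp4", "shot2.mp4"], "story.mp4")
print(listing)

# To actually run it, write `listing` to clips.txt and invoke the command,
# e.g. subprocess.run(cmd, check=True)  (requires ffmpeg installed).
```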

Documentation

You can find full usage details, parameters, and examples here: https://runware.ai/docs/en/providers/google#veo-31

More models from Google

Gemini 3.1 Flash TTS is a text-to-speech model for expressive spoken audio generation from text. It supports granular control over delivery through audio tags, native multi-speaker dialogue, and speech generation across 70+ languages, making it suitable for narration, conversational voice apps, podcasts, audiobooks, and other production-oriented voice workflows.

Veo 3.1 Lite

API Only

Veo 3.1 Lite is the most cost-effective model in the Veo 3.1 family, designed for high-volume applications requiring rapid iteration. It supports text-to-video and image-to-video generation at 720p or 1080p in landscape and portrait formats, with customizable duration of 4, 6, or 8 seconds. It maintains the same generation speed as Veo 3.1 Fast at less than 50% of the cost, and includes native synchronized audio generation.

Gemini 3.1 Flash Lite is Google’s flagship multimodal language model that processes text alongside images, audio, video, code, and documents. It offers high-performance reasoning, complex instruction following, and deep contextual understanding for a wide range of tasks across language, analysis, and problem solving.

Nano Banana 2 (officially known as Gemini 3.1 Flash Image) is Google’s upgraded AI image generation and editing model that brings advanced visual creation capabilities to a broad audience. It generates detailed, expressive images from text and image prompts with sharp details, richer lighting, and improved adherence to complex instructions. Nano Banana 2 also supports multi-object and multi-character consistency, accurate text rendering within images, and flexible resolution control up to 4K. It is now integrated across Google’s AI platforms including the Gemini app, Search AI Mode, and other Gemini-powered services.

Gemini 3.1 Pro is Google’s flagship multimodal language model that processes text alongside images, audio, video, code, and documents. It offers high-performance reasoning, complex instruction following, and deep contextual understanding for a wide range of tasks across language, analysis, and problem solving.

Gemini 3 Flash is Google’s flagship multimodal language model that processes text alongside images, audio, video, code, and documents. It offers high-performance reasoning, complex instruction following, and deep contextual understanding for a wide range of tasks across language, analysis, and problem solving.

Nano Banana Pro (also known as Nano Banana 2) is a Gemini 3 Pro Image Preview model for controlled visual creation. It improves reasoning over lighting and camera angle. It supports high resolution output and multi image blending for production ready design workflows and creative tools.

Veo 3.1 Fast is a high speed variant of Veo 3.1 for rapid creative iteration. It supports text prompts, image prompts, and reference images. It targets low latency workflows while keeping cinematic quality for short form and multi shot video generation with native audio.

Gemini Flash Image 2.5, commonly known as Nano Banana, generates and edits images from rich prompts and multi image inputs. It maintains character identity across frames. It supports targeted edits and completions that use strong world knowledge. Ideal for visual apps that need speed and control.

Veo 3 Fast is an optimized video generation model for rapid iteration and lower cost. It creates short clips from text or images with native audio that includes dialogue, sound effects and music. It keeps realistic motion, strong physics and reliable prompt control.

Imagen 4 Ultra is Google's highest quality text to image model. It focuses on photorealism, sharp details, and accurate text rendering. It targets production workloads that need strict prompt adherence, optional higher resolution output, and fast generation through the Gemini API.

Imagen 4 Fast is a latency optimized text to image model in the Imagen 4 family. It targets interactive apps and high volume pipelines. It keeps strong Imagen 4 visual quality while cutting generation time, so teams can iterate faster and reduce serving costs in production.

Imagen 4 Preview is Google's next generation text to image model for developers. It supports 2K resolution with improved detail rendering and robust typography control. Use it to generate photorealistic or stylized assets for product shots, slides, marketing visuals, and prototypes.

Veo 3 is a state of the art generative video model with native audio. It supports text prompts and image prompts, produces short HD clips with dialogue, sound effects and music, and delivers realistic motion with strong prompt adherence for cinematic video generation.

Imagen 3 is Google’s high quality text to image model. It produces detailed, photorealistic images with improved lighting and fewer artifacts. It offers strong prompt adherence, better text rendering, and supports editing workflows through the Gemini API and Vertex AI.

Veo 2 is a text to video model that produces high resolution clips with strong control over camera movement, composition, and scene dynamics. It supports cinematic framing, object aware motion, extended durations, and up to 4K outputs for production grade workflows.

Imagen 3 Fast is a streamlined text to image model that targets low latency use cases. It delivers bright images with strong contrast and improved prompt adherence. Ideal for apps that need fast image generation inside Vertex AI and Firebase with stable, predictable performance.