Google Veo 3.1
Google Veo 3.1 cinematic AI video with native audio

Google Veo 3.1 is a cinematic video generation model for developers. It turns text prompts or reference images into high-fidelity scenes with richer native audio, better prompt adherence, and granular shot control. Use it for story-driven clips with smoother motion and a consistent style.
README
Overview
Google Veo 3.1 is an advanced AI video generation model that turns natural language descriptions and optional image references into cinematic, story-driven video clips with rich, native audio. It’s built for creators, developers, and storytellers who need high-quality video output without manual animation or rendering.
Veo 3.1 enhances realism, motion coherence, and audiovisual coordination compared to earlier versions, enabling content that feels more immersive and expressive. It supports a range of creative workflows and is designed for rapid prototyping, concept visualisation, and creative storytelling.
How it Works
Veo 3.1 combines several generative techniques to produce cohesive video output from text and images:
Prompt Interpretation
The model parses your natural language prompt to understand subjects, actions, environments, camera movements, and audio cues.
Video Synthesis
A specialised temporal generation pipeline produces sequences of frames that maintain continuity and fluid motion. This ensures smooth transitions and consistent visual composition.
Audio Generation
Native audio tracks — including ambience, music, and sound effects — are generated to align with the visual content, enhancing feel, pacing, and immersion.
Key Features
- Text-to-Video and Image-to-Video
Create videos directly from descriptive prompts or use reference images to guide the visual style and composition.
- Reference Image Support
Use up to three asset images or a single style image to influence video content, with specific aspect ratio constraints.
- Frame Anchoring
Provide first and last frame images to guide motion and narrative direction.
- Audio Synchronisation
Generate audio that matches the rhythm and mood of the visuals without separate audio tools.
- Consistent Motion
Designed to handle smooth motion and transitions across all frames within the clip.
Technical Specifications
- Model ID: google:3@2
- Workflows Supported: Text-to-video, Image-to-video
- Supported Resolutions: 1280×720, 1920×1080 (standard and vertical where applicable)
- Frame Rate: 24 FPS
- Default Duration: 8 seconds
- Prompt Length: Typically up to 3000 characters
- Reference Image Constraints: Aspect ratio must match a supported video output; use up to three asset images or one style image; reference images cannot be combined with first/last frame images
- Enhanced Prompting: Always enabled to enrich user prompts for quality results
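The constraints in this list can be checked on the client before a request is sent. The following is a minimal sketch, not part of any official SDK: the `VideoSettings` class, its field names, and the `validate` helper are hypothetical. It also assumes that "vertical" means the listed sizes with width and height swapped, and that asset images and a style image are mutually exclusive.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MAX_PROMPT_CHARS = 3000          # "typically up to 3000 characters"
MAX_ASSET_IMAGES = 3
LANDSCAPE = {(1280, 720), (1920, 1080)}
# Assumption: vertical output uses the same sizes with the axes swapped.
VERTICAL = {(h, w) for (w, h) in LANDSCAPE}


@dataclass
class VideoSettings:
    """Hypothetical container for one generation request."""
    prompt: str
    width: int = 1280
    height: int = 720
    asset_images: List[str] = field(default_factory=list)
    style_image: Optional[str] = None
    first_frame: Optional[str] = None
    last_frame: Optional[str] = None


def validate(settings: VideoSettings) -> List[str]:
    """Return a list of problems; an empty list means the settings look valid."""
    problems = []
    if not settings.prompt.strip():
        problems.append("prompt is empty")
    if len(settings.prompt) > MAX_PROMPT_CHARS:
        problems.append(f"prompt is longer than {MAX_PROMPT_CHARS} characters")
    if (settings.width, settings.height) not in LANDSCAPE | VERTICAL:
        problems.append(f"unsupported resolution {settings.width}x{settings.height}")
    if len(settings.asset_images) > MAX_ASSET_IMAGES:
        problems.append(f"more than {MAX_ASSET_IMAGES} asset images")
    if settings.asset_images and settings.style_image:
        problems.append("use asset images or a style image, not both")
    uses_reference = bool(settings.asset_images or settings.style_image)
    uses_frames = bool(settings.first_frame or settings.last_frame)
    if uses_reference and uses_frames:
        problems.append("reference images cannot be combined with first/last frame images")
    return problems


ok = validate(VideoSettings(prompt="A lively urban plaza at sunset"))
mixed = validate(VideoSettings(prompt="A plaza", asset_images=["a.png"], first_frame="f.png"))
```

Returning a list of messages rather than raising on the first failure lets a caller report every problem at once. Because the prompt limit is described as typical rather than strict, a real client might treat that check as a warning.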
How to Use
- Write a clear prompt describing the scene, motion, camera style, and any audio cues.
- Choose your input style: text only, reference images, or start and end frame guidance.
- Send the request to the API or platform where Veo 3.1 is hosted.
- Retrieve the generated video output.
Example prompt:
A lively urban plaza at sunset, slow tracking camera circling around dancers, ambient street sounds with distant music.
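As a rough illustration of the request step, the sketch below assembles a text-to-video task as JSON without sending it. Only the model ID, resolution, and duration come from this page. The `build_video_task` helper and every field name (`taskType`, `taskUUID`, `positivePrompt`, and so on) are assumptions made for the example, as is wrapping the task in a JSON array. Check them against the documentation linked at the end of this page before use.

```python
import json
import uuid

MODEL_ID = "google:3@2"  # from Technical Specifications


def build_video_task(prompt: str, width: int = 1280, height: int = 720,
                     duration: int = 8) -> dict:
    """Assemble one text-to-video task as a plain dict.

    All field names are illustrative assumptions, not a confirmed schema.
    """
    return {
        "taskType": "videoInference",
        "taskUUID": str(uuid.uuid4()),   # unique ID to match the response later
        "model": MODEL_ID,
        "positivePrompt": prompt,
        "width": width,
        "height": height,
        "duration": duration,            # seconds
    }


task = build_video_task(
    "A lively urban plaza at sunset, slow tracking camera circling around "
    "dancers, ambient street sounds with distant music."
)
# Assumption: the endpoint accepts a JSON array of tasks.
body = json.dumps([task], indent=2)
print(body)
```

The resulting `body` string would be sent with an authenticated HTTP POST. The endpoint, the authentication scheme, and how the finished video is retrieved are left out because this page does not specify them.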
Tips for Better Results
- Start with the main subject, then layer in environment and motion.
- Add mood, lighting, and audio cues towards the end of the prompt.
- Use reference or frame images for tighter control over composition and motion.
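The ordering suggested in these tips can be captured in a small helper. This is a sketch only: `compose_prompt` and its parameter names are hypothetical, and joining the parts with commas is one simple choice among many.

```python
from typing import Optional


def compose_prompt(subject: str,
                   environment: Optional[str] = None,
                   motion: Optional[str] = None,
                   mood: Optional[str] = None,
                   lighting: Optional[str] = None,
                   audio: Optional[str] = None) -> str:
    """Join prompt parts in a fixed order: subject first, then environment
    and motion, with mood, lighting, and audio cues at the end."""
    parts = [subject, environment, motion, mood, lighting, audio]
    # Skip parts that were not provided or are blank.
    return ", ".join(p.strip() for p in parts if p and p.strip())


prompt = compose_prompt(
    "Dancers in a lively urban plaza",
    motion="slow tracking camera circling the dancers",
    lighting="sunset light",
    audio="ambient street sounds with distant music",
)
print(prompt)
```

Keyword arguments keep the call readable, and the fixed order inside the function means the main subject always leads, however the caller lists the other parts.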
Notes & Limitations
- Veo 3.1 is optimised for short, high-quality video clips.
- Longer narratives may require multiple generations or stitched outputs.
- Image and aspect ratio constraints apply when using reference inputs.
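For a longer narrative, a first step is to work out how many clips are needed. The sketch below splits a target running time into segments of the default 8-second duration. The `plan_segments` helper is hypothetical, and the page does not say whether durations other than the default are available, so a shorter final segment may need to be generated at full length and trimmed afterwards.

```python
import math
from typing import List, Tuple

CLIP_SECONDS = 8  # default duration from Technical Specifications


def plan_segments(total_seconds: float,
                  clip_seconds: float = CLIP_SECONDS) -> List[Tuple[float, float]]:
    """Split a target running time into (start, end) windows, one per generation."""
    if total_seconds <= 0 or clip_seconds <= 0:
        raise ValueError("durations must be positive")
    count = math.ceil(total_seconds / clip_seconds)
    return [
        (i * clip_seconds, min((i + 1) * clip_seconds, total_seconds))
        for i in range(count)
    ]


segments = plan_segments(20)
for index, (start, end) in enumerate(segments, start=1):
    print(f"clip {index}: {start}s to {end}s")
```

Each window would get its own prompt and generation. Supplying the last frame of one clip as the first frame of the next is one plausible way to keep continuity between clips, using the frame anchoring feature described above.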
Documentation
You can find full usage details, parameters, and examples here: https://runware.ai/docs/en/providers/google#veo-31