SOTA Models
These models represent the current state of the art in generative AI, spanning image, video, audio, and multimodal generation. Each model in this collection is selected for top-tier performance in its category, whether that’s visual fidelity, motion coherence, audio realism, prompt adherence, or overall output quality. Some are highly specialised, excelling at a specific task or modality; others are general-purpose and handle complex, multi-modal workflows. Together, they reflect the latest advances in generative model research and deployment, offering cutting-edge capabilities that are actively shaping how modern AI-powered products are built, tested, and scaled.
Featured Models
Top-performing models in this category, recommended by our community and performance benchmarks.

Seedream 4.5
by ByteDance
Seedream 4.5 is a ByteDance image model for precise 2K to 4K generation and editing. It improves multi-image composition, preserves reference detail, and renders small text more reliably. It supports up to 14 reference images for stable characters and design-heavy layouts.

Kling IMAGE O1
Kling IMAGE O1 is a high-control image generation model for stable characters and precise edits. It supports detailed composition control, strong style handling, and localized modifications without structural drift. Ideal for pipelines that need repeatable shots and complex visual continuity.

PixVerse v5.5
by PixVerse
PixVerse v5.5 is a director-focused video model for story-driven clips. It supports multi-image fusion for character continuity, multi-shot sequences, and native audio. It delivers smooth motion, refined cinematic control, and precise text-guided video generation for complex scenes.

Kling VIDEO O1
Kling VIDEO O1 is a unified multimodal video foundation model for controllable generation and instruction-based editing. It supports text prompts, visual references, and video input so developers can build high-control pipelines for pacing, transitions, object changes, and style revisions.
FLUX.2 [pro]
by Black Forest Labs
FLUX.2 [pro] is a flow-matching latent transformer for precise text-to-image synthesis and reference-guided editing. It supports multi-image references, 4MP outputs, and Mistral-based text conditioning for controllable composition and robust iterative edits that preserve structure.
FLUX.2 [flex]
by Black Forest Labs
FLUX.2 [flex] is a configurable text-to-image and image editing model built for precise text placement and stable layouts. It exposes sampling and guidance controls and supports up to ten reference images for consistent characters or products across complex compositions.
FLUX.2 [dev]
by Black Forest Labs
FLUX.2 [dev] is an open-weight text-to-image and image editing model from Black Forest Labs. It targets developers who need precise control over prompts, references, and iteration. Use it for non-commercial research, workflow prototyping, and multi-conditioning image pipelines.

Nano Banana Pro
by Google
Nano Banana Pro (also known as Nano Banana 2) is a Gemini 3 Pro Image Preview model for controlled visual creation. It improves reasoning over lighting and camera angles. It supports high-resolution output and multi-image blending for production-ready design workflows and creative tools.

ImagineArt 1.5
by ImagineArt
ImagineArt 1.5 is a hyper-realistic image model for production visuals. It improves texture fidelity, light handling, and emotion capture. It supports detailed prompts, clean in-image text, and multimodal workflows that mix prompts with reference images for consistent style and layout.

P-Image-Edit
P-Image-Edit is a real-time image editing model from Pruna AI. It supports multi-image refinement, layout control, and style-safe transformations while following prompts with high accuracy. Ideal for production pipelines that need consistent edits and tight latency budgets.

MiniMax Hailuo 2.3
by MiniMax
MiniMax Hailuo 2.3 is a cinematic video model for short-form production. It accepts text prompts or image inputs and outputs 6- or 10-second clips at 768p or 1080p. It focuses on consistent motion, strong physics, and stable scenes for ads, social content, and creative shots.

Google Veo 3.1 Fast
by Google
Google Veo 3.1 Fast is a high-speed variant of Veo 3.1 for rapid creative iteration. It supports text prompts, image prompts, and reference images. It targets low-latency workflows while keeping cinematic quality for short-form and multi-shot video generation with native audio.

Google Veo 3.1
by Google
Google Veo 3.1 is a cinematic video generation model for developers. It turns text prompts or reference images into high-fidelity scenes with richer native audio, better prompt adherence, and granular shot control. Use it for story-driven clips with smoother motion and consistent style.

LTX-2 Pro
by Lightricks
LTX-2 Pro is a cinematic video model by Lightricks. It supports text prompts and image inputs. It outputs high-resolution clips with realistic motion and precise lighting. It targets professional workflows that need stable pacing, detailed subjects, and synchronized audio.

Sora 2
by OpenAI
Sora 2 is OpenAI’s flagship generative model for video and audio. It accepts text prompts and generates visually rich clips with synchronized dialogue and sound. It improves physical realism and scene control. It also supports editing and extension of existing video inputs.

Sora 2 Pro
by OpenAI
Sora 2 Pro is the higher-quality Sora 2 variant for precision video work. It supports text prompts and image inputs. It outputs synchronized video with sound, higher-resolution frames, and stronger temporal consistency. Ideal for production clips and demanding pipelines.

HunyuanImage-3.0
HunyuanImage-3.0 is an 80B-parameter MoE model for high-fidelity text-to-image generation. It uses an autoregressive multimodal framework for strong world-knowledge reasoning and sharp text rendering. It targets complex long prompts and precise layout control for production workloads.

Wan2.5-Preview Image
by Alibaba
Wan2.5-Preview Image is a single-frame generator built from the Wan2.5 video stack. It focuses on detailed depth structure, strong prompt following, multilingual text rendering, and video-grade visual quality for production-ready stills in creative or product workflows.

KlingAI 2.5 Turbo Pro
KlingAI 2.5 Turbo Pro is a high-performance video generation model for cinematic work. It converts prompts or stills into smooth 1080p clips with strong motion, precise camera control, and tight prompt adherence. Ideal for creative tools, ads, trailers, and sports scenes.

Qwen-Image-Edit-Plus
by Alibaba
Qwen-Image-Edit-Plus is a 20B image editing model that supports multi-image workflows and strong identity preservation. It improves consistency on single-image edits and adds native ControlNet-style conditioning for precise structure control, layout edits, and bilingual text manipulation.

Seedream 4.0
by ByteDance
Seedream 4.0 is ByteDance’s multimodal image model for fast 2K to 4K generation. It supports text prompts, image editing with natural language, and multi-image reference. It maintains style consistency across batches and handles bilingual Chinese and English workflows.

Qwen-Image-Edit
by Alibaba
Qwen-Image-Edit is an instruction-based image editing model built on the 20B Qwen-Image foundation. It performs semantic edits and local appearance changes while preserving layout and text fidelity. Ideal for programmatic asset cleanup, style tweaks, and precise bilingual text updates.

Qwen-Image
by Alibaba
Qwen-Image is a 20B-parameter vision-language model from Alibaba Cloud. It focuses on precise text-conditioned image generation and supports complex Chinese or English typography. It also enables accurate image editing workflows that need layout control and strong prompt following.

Wan2.2 A14B
by Alibaba
Wan2.2 A14B is a Mixture-of-Experts video model with two 14B experts for layout and detail. It supports text prompts or reference images to generate cinematic 480p or 720p clips with stable inference cost and consistent motion. Ideal for pipelines on high-end GPUs.

Imagen 4 Ultra
by Google
Imagen 4 Ultra is Google's highest-quality text-to-image model. It focuses on photorealism, sharp details, and accurate text rendering. It targets production workloads that need strict prompt adherence, optional higher-resolution output, and fast generation through the Gemini API.

MiniMax 02 Hailuo
by MiniMax
MiniMax 02 Hailuo is a 1080p AI video model for cinematic, high-motion scenes. It converts text prompts or still images into short, polished clips with strong instruction following and realistic physics. Ideal for commercial spots, trailers, music promos, and social shorts.

Midjourney Video
by Midjourney
Midjourney Video extends Midjourney visuals into motion. It animates still or generated images into short stylized clips with configurable motion. Ideal for concept artists, storytellers, and designers who need fast cinematic video from existing frames.

Eleven v3
by ElevenLabs
Eleven v3 is a premium text-to-speech model for production audio. It supports 70+ languages with studio-grade quality and precise expressive control using inline audio tags. Ideal for narration, podcasts, dialogue, audiobooks, and game voiceover where stable prosody matters.
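As an illustration of inline audio tags, the sketch below assembles a TTS request body in Python. The field names, model identifier, and voice ID are illustrative assumptions rather than the exact ElevenLabs API schema; the point is that tags such as `[whispers]` sit directly in the text to steer delivery.

```python
import json


def build_tts_request(text: str, voice_id: str) -> str:
    # Hypothetical request body: field names and the "eleven_v3" identifier
    # are illustrative assumptions, not the documented ElevenLabs schema.
    payload = {
        "model_id": "eleven_v3",
        "voice_id": voice_id,
        "text": text,  # inline audio tags are embedded in the text itself
    }
    return json.dumps(payload)


# Tags mark expressive shifts mid-utterance; the voice ID is made up here.
body = build_tts_request(
    "[whispers] The launch is tonight. [excited] Don't be late!",
    "narrator-01",
)
```

A delivery-control tag applies from where it appears, so one request can mix several styles without splitting the script into multiple calls.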

SeedEdit 3.0
by ByteDance
SeedEdit 3.0 is ByteDance's high-resolution image editing model for precise, prompt-driven control. It preserves subjects and backgrounds while editing local regions. It supports 4K output, fast inference, and handles portrait edits, background changes, perspective shifts, and lighting tweaks.

Seedance 1.0 Pro
by ByteDance
Seedance 1.0 Pro is a ByteDance video model for 5- to 10-second clips at up to 1080p. It supports text prompts and first-frame images. It delivers smooth motion with strong temporal consistency. Ideal for multi-shot storytelling, ads, and design previews in real-time pipelines.

Imagen 4 Preview
by Google
Imagen 4 Preview is Google's next-generation text-to-image model for developers. It supports 2K resolution with improved detail rendering and robust typography control. Use it to generate photorealistic or stylized assets for product shots, slides, marketing visuals, and prototypes.

Google Veo 3
by Google
Google Veo 3 is a state-of-the-art generative video model with native audio. It supports text prompts and image prompts, produces short HD clips with dialogue, effects, and music, and is available through the Gemini API and Vertex AI for production workflows.

Runway Gen-4 Image
by Runway
Runway Gen-4 Image is a text-to-image model for production work. It offers strong prompt adherence, fine stylistic control, and visual consistency across scenes and characters. Ideal for pipelines that link still images into video while preserving look and layout.

Ideogram 3.0 Edit
by Ideogram
Ideogram 3.0 Edit lets you inpaint images with surgical control. Upload an image, mask a region, then refine layout or text while the rest stays intact. Ideal for typography fixes, layout tweaks, brand updates, and production-safe visual polish in existing assets.

Runway Gen-4 Turbo
by Runway
Runway Gen-4 Turbo is a high-speed variant of Gen-4 for rapid video ideation. It turns reference images into short cinematic clips with strong character consistency, smooth motion, and reduced credit cost. Ideal for fast iteration in production and previsualization pipelines.

Midjourney V7
by Midjourney
Midjourney V7 is a next-generation text-to-image model that targets high realism and precise control. It improves prompt coherence, anatomy, lighting, and cinematic framing. Draft Mode supports rapid, low-cost exploration, then refinement into detailed final renders.

Ideogram 3.0 Reframe
by Ideogram
Ideogram 3.0 Reframe performs style-consistent outpainting that extends images beyond their borders. It adapts visuals to new aspect ratios without breaking composition or look. Ideal for repurposing creative, social posts, and design assets for varied formats.

Luma Ray2 Flash
Luma Ray2 Flash is a distilled Ray2 variant tuned for rapid video creation. It accepts text prompts or reference images and generates short, realistic clips with smooth motion. Ideal for developers who need lower-latency video generation while keeping strong visual fidelity.

OmniHuman-1
by ByteDance
OmniHuman-1 is a ByteDance research model for human video generation from a single image and motion signals like audio. It focuses on accurate lip sync, expressive motion, and strong generalization across portraits, full body shots, cartoons, and stylized avatars.

MiniMax 01 Director
by MiniMax
MiniMax 01 Director generates short cinematic video clips from text prompts with director-level control. It supports detailed camera movement instructions, stable framing, and reduced motion randomness. Ideal for film previz, ads, and story beats inside production tools.

Luma Ray2
Luma Ray2 is a flagship video generation model for cinematic shots from text prompts. It renders coherent scenes with realistic motion and strong spatial awareness. Use it to build visual storytelling tools that output high-quality clips for creative and professional workflows.

Vidu 2.0
by Vidu
Vidu 2.0 is a generative video model for rapid 1080p clip creation. It targets 4- and 8-second shots with strong subject consistency and support for batch workflows. Developers can drive cinematic clips from text prompts and templates with improved speed and lower cost.

KlingAI 1.5 Pro
KlingAI 1.5 Pro is a text-to-video and image-to-video model for 1080p clips. It adds precise motion dynamics, camera movement control, and better color accuracy. Use it for prompts or image conditioning when you need sharper motion, stable characters, and cinematic framing.
FLUX.1 Expand [pro]
by Black Forest Labs
FLUX.1 Expand [pro] is an outpainting model that extends images beyond their original frame while preserving structure, lighting, and style. It supports controlled expansion from real or generated inputs and integrates into image editing or generative workflows that need precise, coherent borders.
FLUX.1 Canny [pro]
by Black Forest Labs
FLUX.1 Canny [pro] uses Canny edge maps as structural guidance. It lets you regenerate or transform images while preserving layout and contours. Ideal for style transfer, redesigns, and controlled edits where you must keep shapes consistent across outputs.
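For intuition about what an edge map is, here is a minimal pure-Python sketch of gradient-based edge extraction. It is a deliberately simplified stand-in for true Canny, which adds Gaussian smoothing, non-maximum suppression, and hysteresis thresholding (in practice you would use something like OpenCV's `cv2.Canny` to produce the control image).

```python
def edge_map(gray, threshold=0.5):
    """Binary edge map from central-difference intensity gradients.

    gray: 2D list of floats in [0, 1]. Simplified stand-in for Canny,
    which adds smoothing, non-maximum suppression, and hysteresis.
    """
    h, w = len(gray), len(gray[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Horizontal and vertical gradients, zero at the borders.
            gx = gray[y][x + 1] - gray[y][x - 1] if 0 < x < w - 1 else 0.0
            gy = gray[y + 1][x] - gray[y - 1][x] if 0 < y < h - 1 else 0.0
            if (gx * gx + gy * gy) ** 0.5 > threshold:
                edges[y][x] = 1
    return edges


# A vertical brightness step (dark left half, bright right half)
# yields a vertical edge line along the transition.
img = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]
edges = edge_map(img)
```

The resulting binary map keeps only contours, which is exactly the structural skeleton an edge-conditioned model locks onto while regenerating texture and style around it.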
FLUX.1.1 [pro] Ultra
by Black Forest Labs
FLUX.1.1 [pro] Ultra is a high-resolution text-to-image model from Black Forest Labs. It generates images up to 4 megapixels in about 10 seconds. Ultra mode targets sharp outputs. Raw mode targets natural photographic style. Built for API integration in real products.
FLUX.1 Fill [pro]
by Black Forest Labs
FLUX.1 Fill [pro] provides advanced inpainting and outpainting for real and generated images. Supply an input image, mask, and text prompt. The model fills or extends regions with seamless content that matches context and style. Ideal for edits, layout fixes, and content-aware expansion.
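As a concrete illustration of the image + mask + prompt flow, here is a minimal sketch of assembling such a request in Python. The model identifier and field names (`model`, `image`, `mask`, `prompt`) are illustrative assumptions, not the actual BFL API schema, and the endpoint is omitted entirely.

```python
import base64
import json


def build_fill_request(image_bytes: bytes, mask_bytes: bytes, prompt: str) -> str:
    # Hypothetical payload shape: the identifier and field names below are
    # illustrative assumptions, not the documented BFL API schema.
    payload = {
        "model": "flux.1-fill-pro",
        "prompt": prompt,  # describes what should appear in the filled region
        "image": base64.b64encode(image_bytes).decode("ascii"),
        # Convention assumed here: white mask pixels mark regions to fill.
        "mask": base64.b64encode(mask_bytes).decode("ascii"),
    }
    return json.dumps(payload)


# Example: request a sky replacement in a masked region
# (placeholder bytes stand in for real PNG data).
body = build_fill_request(b"<png bytes>", b"<mask png bytes>", "overcast sky, soft light")
```

Outpainting follows the same shape: the mask simply covers the area beyond the original frame that the model should extend into.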

Ideogram 2.0 Remix
by Ideogram
Ideogram 2.0 Remix lets you rework existing images while preserving structure and layout. Change styles or mood, adjust composition, and iterate quickly from a reference image. Ideal for designers who need fast visual variants and style exploration from prior outputs.
FLUX.1.1 [pro]
by Black Forest Labs
FLUX.1.1 [pro] is a flagship text-to-image model from Black Forest Labs. It improves on FLUX.1 with sharper detail, stronger prompt adherence, and faster sampling. Ideal for production image pipelines, product visuals, and creative tools that require consistent high-quality output.
FLUX.1 [schnell]
by Black Forest Labs
FLUX.1 [schnell] is an open-source text-to-image model from Black Forest Labs. It uses 4-step distillation for very fast generation with strong visual quality. Ideal for local deployment, rapid prototyping, batch image production, and integration into custom creative pipelines.
FLUX.1 [pro]
by Black Forest Labs
FLUX.1 [pro] is the flagship text-to-image model from Black Forest Labs. It targets production workflows that need strong prompt adherence, high visual quality, and diverse styles. Use it through the BFL API to generate robust images for design tools, apps, and creative pipelines.

Midjourney V6
by Midjourney
Midjourney V6 is a flagship text-to-image model for high-fidelity visual generation. It improves prompt following, coherence, text rendering, and upscaling. Ideal for designers and developers who need cinematic depth, nuanced lighting, and reliable style control from natural language prompts.

DALL·E 3
by OpenAI
DALL·E 3 converts natural language prompts into detailed images with strong caption fidelity. It improves handling of complex instructions and visual details. It integrates with ChatGPT and the OpenAI API for programmatic image creation and workflow automation.