
Alibaba
Enterprise-grade multimodal AI for advanced visual and video creation
Alibaba develops large-scale multimodal AI systems that combine visual understanding, image generation, and high-quality video synthesis in a single research and production ecosystem. Its teams release powerful image models, visual editing tools, and fast-evolving video generation technology, backed by extensive infrastructure and long-term investment in foundational AI. Through Runware, Alibaba becomes a flexible provider for image creation, image-to-image workflows, and next-generation video generation, letting teams integrate enterprise-level visual AI without changing their pipelines as new models and capabilities arrive.
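As a rough illustration of that integration path, the sketch below submits an image generation task to a Runware-style REST endpoint. The endpoint URL, task and field names, and the model identifier are assumptions for illustration only; the official Runware documentation defines the exact schema.

```python
# Minimal sketch of requesting an image from an Alibaba model through a
# Runware-style REST endpoint. The URL, field names, and the model
# identifier below are assumptions; check the official docs for the
# exact schema.
import uuid
import requests

API_KEY = "your-api-key"  # placeholder

task = {
    "taskType": "imageInference",      # assumed task name
    "taskUUID": str(uuid.uuid4()),     # client-generated task id
    "positivePrompt": "a product shot of a ceramic teapot, studio lighting",
    "model": "qwen:image@1",           # hypothetical model identifier
    "width": 1024,
    "height": 1024,
    "numberResults": 1,
}

resp = requests.post(
    "https://api.runware.ai/v1",       # assumed endpoint
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=[task],                       # a list of task objects
)
resp.raise_for_status()
print(resp.json())
```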
Models by Alibaba

Wan2.6 Flash is a distilled, low-latency variant of the Wan2.6 multimodal video model designed for rapid image-to-video generation with fluid motion, visual stability, and optional synchronized audio. It produces HD clips from detailed static images while preserving subject structure and motion realism, making it suitable for preview workflows and high-throughput creative pipelines.

API Only
Qwen-Image-2512 is an improved version of the Qwen-Image foundation model, with enhanced prompt understanding, superior text rendering accuracy, and more realistic visual detail. It generates high-fidelity images from text prompts across diverse subjects and styles.

API Only
Qwen-Image-Layered decomposes a static image into multiple RGBA layers, enabling independent editing of semantically distinct components without interfering with other parts of the image. This layered representation supports high-fidelity image editing tasks like resizing, repositioning, recoloring, and object manipulation with consistent detail and transparency handling.
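To make the layered output concrete, here is a small Pillow sketch that edits one RGBA layer independently and recomposites the stack. The file names are hypothetical stand-ins for the model's per-layer outputs.

```python
# Illustrative sketch: edit one RGBA layer without touching the others,
# then flatten the stack back into a single image. File names are
# hypothetical placeholders for the model's layered output.
from PIL import Image

layer_files = ["background.png", "subject.png", "text_overlay.png"]
layers = [Image.open(f).convert("RGBA") for f in layer_files]

# Example edit: reposition the subject layer independently.
shifted = Image.new("RGBA", layers[1].size, (0, 0, 0, 0))
shifted.paste(layers[1], (40, 0), layers[1])  # shift subject 40 px right
layers[1] = shifted

# Alpha-composite the layers bottom-up into the final image.
canvas = Image.new("RGBA", layers[0].size, (0, 0, 0, 0))
for layer in layers:
    canvas = Image.alpha_composite(canvas, layer)
canvas.save("recomposed.png")
```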

Wan2.6 Image is a single-frame image generation model derived from the Wan2.6 multimodal video architecture. It focuses on strong prompt adherence, clean spatial structure, and visually coherent results, delivering video-grade image quality for creative, editorial, and product-oriented workflows.

Wan2.6 is a multimodal video model for text-to-video and image-to-video generation with support for multi-shot sequencing and native sound. It emphasizes temporal stability, consistent visual structure across shots, and reliable alignment between visuals and audio in short-form video generation.
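A text-to-video request could mirror the image sketch shown earlier, swapping the task type and adding clip-level parameters. The task name, duration field, audio flag, and model identifier below are all assumptions for illustration.

```python
# Hypothetical text-to-video task with native audio enabled. All task and
# field names here are assumptions, not a confirmed API schema.
import uuid
import requests

task = {
    "taskType": "videoInference",      # assumed task name
    "taskUUID": str(uuid.uuid4()),
    "positivePrompt": "a chef plating a dish, slow push-in, soft kitchen ambience",
    "model": "alibaba:wan2.6@1",       # hypothetical model identifier
    "duration": 5,                     # clip length in seconds (assumed field)
    "includeAudio": True,              # assumed flag for native sound
}

resp = requests.post(
    "https://api.runware.ai/v1",       # assumed endpoint, as above
    headers={"Authorization": "Bearer your-api-key"},
    json=[task],
)
print(resp.json())
```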

Z-Image-Turbo is a distilled image generation model built for sub-second inference. It produces sharp, photorealistic results and renders accurate Chinese and English text inside images. It follows complex layout instructions with stable structure for UI mockups, posters, and scenes.

API Only
Qwen-Image-Edit-2511 is an image editing model that applies text instructions to modify existing images with precise semantic and appearance control. It preserves visual consistency during edits, maintains identity across multi-person scenes and recurring characters, and adds features that improve object manipulation, geometric reasoning, and layout coherence.
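For orientation, a minimal sketch of instruction-based editing with diffusers follows. It assumes diffusers exposes a QwenImageEditPipeline for this model family and that the checkpoint id below is correct; both should be verified against current documentation.

```python
# Minimal instruction-based editing sketch. The pipeline class, checkpoint
# id, and argument names are assumptions; verify against current diffusers
# documentation before use.
import torch
from diffusers import QwenImageEditPipeline  # assumed pipeline class
from PIL import Image

pipe = QwenImageEditPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit",            # assumed checkpoint id
    torch_dtype=torch.bfloat16,
).to("cuda")

source = Image.open("product_photo.png").convert("RGB")  # hypothetical input
edited = pipe(
    image=source,
    prompt="replace the background with plain studio gray, keep the product unchanged",
    num_inference_steps=40,
).images[0]
edited.save("product_photo_edited.png")
```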

Wan2.5-Preview is Alibaba’s multimodal video model in research preview. It supports text-to-video and image-to-video generation with native audio for clips of around 10 seconds, offering strong prompt adherence, smooth motion, and multilingual audio for narrative scenes.

Wan2.5-Preview Image is a single-frame generator built from the Wan2.5 video stack. It focuses on detailed depth and structure, strong prompt following, multilingual text rendering, and video-grade visual quality for production-ready stills in creative or product workflows.

Qwen-Image-Edit-Plus is a 20B-parameter image editing model that supports multi-image workflows and strong identity preservation. It improves consistency on single-image edits and adds native ControlNet-style conditioning for precise structure control, layout edits, and bilingual text manipulation.

Wan2.2 Animate Turbo is an accelerated variant of Wan2.2 Animate designed for faster character animation and replacement in video. It generates coherent character motion and expression from images or existing footage while prioritizing reduced inference time for rapid iteration workflows.

API Only
Wan2.2 Animate is a unified video model that produces character-focused animations from static images and reference videos or replaces characters in existing footage while preserving motion, expressions, and scene consistency. It uses the Wan2.2 mixture-of-experts architecture to generate coherent character movement and seamless integration with background video.

Qwen-Image-Edit is an instruction-based image editing model built on the 20B Qwen-Image foundation. It performs semantic edits and local appearance changes while preserving layout and text fidelity. Ideal for programmatic asset cleanup, style tweaks, and precise bilingual text updates.

Qwen-Image-Lightning 8 Steps V1.1 is a distilled text-to-image LoRA for Qwen-Image. It targets 8-step inference for near-real-time rendering, improves quality consistency over V1.0, and preserves complex text layout. Ideal for high-throughput image services and interactive UIs.

Qwen-Image-Lightning 4 Steps is a distilled LoRA for Qwen-Image that targets minimal sampling steps with strong visual fidelity. It delivers up to 25× faster image generation. Ideal for real-time applications and batch pipelines that need low-latency inference.

Qwen-Image-Lightning 8 Steps V1.0 is a distilled LoRA for Qwen-Image. It targets faster inference with strong text rendering and visual fidelity. Use it to generate high-resolution images from prompts with fewer sampling steps and lower GPU cost.
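A sketch of how these Lightning LoRAs might be applied in a diffusers workflow follows. The base pipeline call uses the standard diffusers LoRA-loading pattern, but the repository id and weight file name are assumptions to substitute from the actual model cards.

```python
# Sketch: few-step generation with a Lightning LoRA on the Qwen-Image base
# model. Repo ids and the weight file name are assumptions; substitute the
# actual values from the model cards.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image",                 # assumed base checkpoint id
    torch_dtype=torch.bfloat16,
).to("cuda")

# Load the distilled LoRA so a handful of sampling steps suffice.
pipe.load_lora_weights(
    "lightx2v/Qwen-Image-Lightning",   # assumed LoRA repo id
    weight_name="Qwen-Image-Lightning-8steps-V1.1.safetensors",  # assumed file
)

image = pipe(
    prompt="a poster with the headline 'Summer Sale' in bold red type",
    num_inference_steps=8,             # matches the 8-step distillation
).images[0]
image.save("poster.png")
```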

Qwen-Image is a 20B-parameter vision-language model from Alibaba Cloud. It focuses on precise text-conditioned image generation and supports complex Chinese and English typography. It also enables accurate image editing workflows that need layout control and strong prompt following.

Wan2.2 A14B is a Mixture-of-Experts video model with two 14B experts for layout and detail. It supports text prompts or reference images to generate cinematic 480p or 720p clips with stable inference cost and consistent motion. Ideal for pipelines on high-end GPUs.

Wan2.2 5B is a compact hybrid text- and image-to-video model that targets 720p output at 24 fps with strong motion coherence. It supports text-only prompts or image-guided generation and is optimized for fast inference on consumer GPUs, fitting production video workflows.

Qwen2.5-VL-3B-Instruct is a multimodal model that processes images and text together to perform visual reasoning, captioning, question answering, and structured output tasks. It integrates a vision encoder with a 3B instruction-tuned language backbone to support complex visual understanding and interactive multimodal responses.

Qwen2.5-VL-7B-Instruct is a multimodal model that processes images and text together to perform visual reasoning, captioning, question answering, and structured output generation. It integrates a vision encoder with a 7B instruction-tuned language backbone to support rich interactive multimodal understanding.
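To show the interaction pattern, the sketch below runs a single image question through the 7B model with transformers, following the usage documented for the Qwen2.5-VL family; the input image path is a placeholder.

```python
# Sketch: image question answering with Qwen2.5-VL-7B-Instruct via
# transformers, following the pattern documented for this model family.
# "photo.jpg" is a placeholder input.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # helper package from the model card

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "photo.jpg"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=64)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```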

Wan2.2 A14B Turbo accelerates Wan2.2 with a fused Lightning LoRA for ultra-fast diffusion. It cuts inference to 8 steps while preserving cinematic structure and detail. Ideal for rapid 480p to 720p video prototyping and iteration in production workflows.

Qwen2.5-VL-7B Age Detector is a multimodal model that analyzes a facial image to estimate age. It uses the vision encoder from Qwen2.5-VL-7B and its instruction-tuned language backbone to interpret visual features and output age predictions or age categories.
