
Alibaba
Enterprise grade multimodal AI for advanced visual and video creation
Alibaba develops large scale multimodal AI systems that combine visual understanding, image generation, and high quality video synthesis inside a single research and production ecosystem. Its teams release powerful image models, visual editing tools, and fast evolving video generation technology, backed by extensive infrastructure and long term investment in foundational AI. Through Runware, Alibaba becomes a flexible provider for image creation, image to image workflows, and next generation video generation, so teams can integrate enterprise level visual AI without reworking their pipelines as new models and capabilities arrive.
Models by Alibaba

Wan2.6
Wan2.6 is a multimodal video model for text to video and image to video generation with support for multi-shot sequencing and native sound. It emphasizes temporal stability, consistent visual structure across shots, and reliable alignment between visuals and audio in short form video generation.
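As an illustration, a text to video request to a hosted endpoint could look like the sketch below. The endpoint, task fields, and model identifier are assumptions for demonstration, not the confirmed Runware schema; consult the Runware API documentation for the exact contract.

```python
import os
import uuid
import requests

# Hypothetical text to video request. The endpoint, task fields,
# and model identifier are illustrative assumptions, not the
# confirmed Runware schema.
API_URL = "https://api.runware.ai/v1"        # assumed endpoint
API_KEY = os.environ["RUNWARE_API_KEY"]

task = {
    "taskType": "videoInference",            # assumed task name
    "taskUUID": str(uuid.uuid4()),
    "model": "alibaba:wan2.6",               # hypothetical model ID
    "positivePrompt": "aerial shot of a lighthouse at dawn, waves and gulls audible",
    "duration": 10,                          # seconds, assumed parameter
    "width": 1280,
    "height": 720,
}

response = requests.post(
    API_URL,
    json=[task],
    headers={"Authorization": f"Bearer {API_KEY}"},
)
response.raise_for_status()
print(response.json())  # response shape depends on the provider
```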

Wan2.5-Preview
Wan2.5-Preview is Alibaba’s multimodal video model in research preview. It supports text to video and image to video with native audio generation for clips of around 10 seconds. It offers strong prompt adherence, smooth motion, and multilingual audio for narrative scenes.

Wan2.5-Preview Image
Wan2.5-Preview Image is a single frame generator built from the Wan2.5 video stack. It focuses on detailed depth and structure, strong prompt following, multilingual text rendering, and video grade visual quality for production ready stills in creative or product workflows.

Qwen-Image-Edit-Plus
Qwen-Image-Edit-Plus is a 20B image editing model that supports multi image workflows and strong identity preservation. It improves consistency on single image edits and adds native ControlNet style conditioning for precise structure control, layout edits, and bilingual text manipulation.

Qwen-Image-Edit
Qwen-Image-Edit is an instruction based image editing model built on the 20B Qwen-Image foundation. It performs semantic edits and local appearance changes while preserving layout and text fidelity. Ideal for programmatic asset cleanup, style tweaks, and precise bilingual text updates.
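A minimal sketch of an instruction based edit, assuming a REST style task payload: one source image plus a natural language instruction. Field names, the reference image field, and the model identifier are hypothetical placeholders; the real schema lives in the Runware docs.

```python
import os
import uuid
import requests

API_URL = "https://api.runware.ai/v1"        # assumed endpoint
API_KEY = os.environ["RUNWARE_API_KEY"]

# Hypothetical instruction-based edit: field names and model ID
# are illustrative, not the confirmed schema.
task = {
    "taskType": "imageInference",
    "taskUUID": str(uuid.uuid4()),
    "model": "alibaba:qwen-image-edit",      # hypothetical model ID
    "positivePrompt": "replace the headline text with 'Summer Sale' and keep the layout",
    "referenceImages": ["https://example.com/poster.png"],  # assumed field
    "width": 1024,
    "height": 1024,
}

response = requests.post(
    API_URL,
    json=[task],
    headers={"Authorization": f"Bearer {API_KEY}"},
)
response.raise_for_status()
print(response.json())
```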

Qwen-Image-Lightning (8 steps V1.1)
Qwen-Image-Lightning (8 steps V1.1) is a distilled text to image LoRA for Qwen-Image. It targets 8 step inference for near real time rendering. It improves quality consistency over V1.0 and preserves complex text layout. Ideal for high throughput image services and interactive UIs.
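The practical effect of a distilled LoRA is that you attach it to the base model and cut the sampling step count. A sketch under assumed field names, with hypothetical model and LoRA identifiers:

```python
import os
import uuid
import requests

API_URL = "https://api.runware.ai/v1"        # assumed endpoint
API_KEY = os.environ["RUNWARE_API_KEY"]

# Distilled LoRAs trade sampling steps for latency: attach the
# Lightning LoRA to the base model and drop steps to 8.
# Model and LoRA identifiers here are hypothetical.
task = {
    "taskType": "imageInference",
    "taskUUID": str(uuid.uuid4()),
    "model": "alibaba:qwen-image",           # hypothetical base model ID
    "lora": [                                # assumed LoRA schema
        {"model": "alibaba:qwen-image-lightning-8step", "weight": 1.0}
    ],
    "positivePrompt": "storefront banner with the text 'Grand Opening', flat design",
    "steps": 8,       # distilled target; the base model typically needs far more
    "width": 1024,
    "height": 1024,
}

response = requests.post(
    API_URL,
    json=[task],
    headers={"Authorization": f"Bearer {API_KEY}"},
)
response.raise_for_status()
print(response.json())
```

The same pattern applies to the 4 step variant below: swap the LoRA identifier and set steps to 4.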

Qwen-Image-Lightning (4 steps)
Qwen-Image-Lightning (4 steps) is a distilled LoRA for Qwen-Image that targets minimal sampling steps with strong visual fidelity. It delivers up to 25× faster image generation. Ideal for real time applications and batch pipelines that need low latency inference.

Qwen-Image-Lightning (8 steps V1.0)
Qwen-Image-Lightning (8 steps V1.0) is a distilled LoRA for Qwen-Image. It targets faster inference with strong text rendering and visual fidelity. Use it to generate high resolution images from prompts with fewer sampling steps and lower GPU cost.

Qwen-Image
Qwen-Image is a 20B parameter image generation foundation model from Alibaba Cloud. It focuses on precise text conditioned image generation and supports complex Chinese and English typography. It also enables accurate image editing workflows that need layout control and strong prompt following.

Wan2.2 A14B
Wan2.2 A14B is a Mixture of Experts video model with two 14B experts for layout and detail. It supports text prompts or reference images to generate cinematic 480p or 720p clips with stable inference cost and consistent motion. Ideal for pipelines on high end GPUs.

Wan2.2 5B
Wan2.2 5B is a compact hybrid text and image to video model that targets 720p 24fps output with strong motion coherence. It supports text only prompts or image guided generation. It is optimized for fast inference on consumer GPUs and fits production video workflows.
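For image to video, the request anchors generation on a still frame and lets the prompt drive motion. As before, the task name, frame image field, and model identifier are assumptions for illustration, not the confirmed schema.

```python
import os
import uuid
import requests

API_URL = "https://api.runware.ai/v1"        # assumed endpoint
API_KEY = os.environ["RUNWARE_API_KEY"]

# Hypothetical image-guided video request: a still frame anchors the
# first frame and the prompt drives motion. Field names and the
# model identifier are illustrative assumptions.
task = {
    "taskType": "videoInference",            # assumed task name
    "taskUUID": str(uuid.uuid4()),
    "model": "alibaba:wan2.2-5b",            # hypothetical model ID
    "positivePrompt": "camera pulls back slowly, leaves drifting in the wind",
    "frameImages": [                         # assumed field for the anchor frame
        {"inputImage": "https://example.com/still.jpg"}
    ],
    "width": 1280,
    "height": 720,
    "fps": 24,                               # assumed parameter
}

response = requests.post(
    API_URL,
    json=[task],
    headers={"Authorization": f"Bearer {API_KEY}"},
)
response.raise_for_status()
print(response.json())
```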

Qwen-Image-Edit Lightning (8 steps)
Qwen-Image-Edit Lightning (8 steps) provides rapid, localized image editing with stable outputs. It suits bulk workflows that need consistent structure and layout. Developers can run quick iteration loops while keeping fine control over regions and edit strength.

