Qwen2.5-VL-3B-Instruct

Instruction-tuned vision-language model for image and text understanding

Qwen2.5-VL-3B-Instruct

Qwen2.5-VL-3B-Instruct is a multimodal model that processes images and text together to perform visual reasoning, captioning, question answering, and structured output tasks. It integrates a vision encoder with an instruction-tuned language backbone to support complex visual understanding and interactive multimodal responses.

Commercial use

Image to TextCaption

90 - 118 tokens$0.0026

More models from this creator

Qwen-Image-2.0

Qwen-Image-2.0

Qwen-Image-2.0 is an advanced image generation and editing model from Alibaba that produces high-quality images at native 2K resolution and renders professional-grade text within visuals. It unifies text-to-image and image-to-image editing into a single model with strong semantic understanding and adheres to detailed prompt instructions. The model excels at generating images that include complex textual content, infographics, posters, and layout-driven visuals.

Qwen-Image-2.0-Pro

Qwen-Image-2.0-Pro

Qwen-Image-2.0-Pro builds on Qwen-Image-2.0 with optimized visual fidelity, improved layout and typography handling, and advanced editing control for professional creative and enterprise applications. It delivers richer detail, more accurate text and iconography rendering, and refined editing semantics across a wide range of visual styles, making it suitable for advertising, branding, design systems, and high-impact visual content.

Wan2.6 Flash

Wan2.6 Flash

Wan2.6 Flash is a distilled, low-latency variant of the Wan2.6 multimodal video model designed for rapid image to video generation with fluid motion, visual stability, and optional synchronized audio. It produces HD clips from detailed static images while preserving subject structure and motion realism, making it suitable for preview workflows and high-throughput creative pipelines.

Qwen-Image-2512

Api Only

Qwen-Image-2512

Qwen-Image-2512 is an improved version of the Qwen-Image image foundation model with enhanced prompt understanding, superior text rendering accuracy, and more realistic visual details. It generates high-fidelity images from text prompts across diverse subjects and styles.

Qwen-Image-Layered

Api Only

Qwen-Image-Layered

Qwen-Image-Layered decomposes a static image into multiple RGBA layers, enabling independent editing of semantically distinct components without interfering with other parts of the image. This layered representation supports high-fidelity image editing tasks like resizing, repositioning, recoloring, and object manipulation with consistent detail and transparency handling.

Wan2.6 Image

Wan2.6 Image

Wan2.6 Image is a single-frame image generation model derived from the Wan2.6 multimodal video architecture. It focuses on strong prompt adherence, clean spatial structure, and visually coherent results, delivering video-grade image quality for creative, editorial, and product-oriented workflows.

Wan2.6

Wan2.6

Wan2.6 is a multimodal video model for text to video and image to video generation with support for multi-shot sequencing and native sound. It emphasizes temporal stability, consistent visual structure across shots, and reliable alignment between visuals and audio in short form video generation.

Z-Image-Turbo

Z-Image-Turbo

Z-Image-Turbo is a distilled vision model for sub second image generation. It produces sharp photorealistic results and supports accurate Chinese text and English text inside images. It follows complex layout instructions with stable structure for UI, posters, and scenes.

Qwen-Image-Edit-2511

Api Only

Qwen-Image-Edit-2511

Qwen-Image-Edit-2511 is an image editing model that applies text instructions to modify existing images with precise semantic and appearance control. It preserves visual consistency during edits, supports multi-person and character consistency, and integrates selected features and extensions that enhance object manipulation, geometric reasoning, and layout coherence.

Wan2.5-Preview

Wan2.5-Preview

Wan2.5-Preview is Alibaba’s multimodal video model in research preview. It supports text to video and image to video with native audio generation for clips around 10 seconds. It offers strong prompt adherence, smooth motion, and multilingual audio for narrative scenes.

Wan2.5-Preview Image

Wan2.5-Preview Image

Wan2.5-Preview Image is a single frame generator built from the Wan2.5 video stack. It focuses on detailed depth structure, strong prompt following, multilingual text rendering, and video grade visual quality for production ready stills in creative or product workflows.

Qwen-Image-Edit-Plus

Qwen-Image-Edit-Plus

Qwen-Image-Edit-Plus is a 20B image editing model that supports multi image workflows and strong identity preservation. It improves consistency on single image edits and adds native ControlNet style conditioning for precise structure control, layout edits, and bilingual text manipulation.

Wan2.2 Animate

Api Only

Wan2.2 Animate

Wan2.2 Animate is a unified video model that produces character-focused animations from static images and reference videos or replaces characters in existing footage while preserving motion, expressions, and scene consistency. It uses the Wan2.2 mixture-of-experts architecture to generate coherent character movement and seamless integration with background video.

Wan2.2 Animate Turbo

Wan2.2 Animate Turbo

Wan2.2 Animate Turbo is an accelerated variant of Wan2.2 Animate designed for faster character animation and replacement in video. It generates coherent character motion and expression from images or existing footage while prioritizing reduced inference time for rapid iteration workflows.

Qwen‑Image‑Edit

Qwen‑Image‑Edit

Qwen‑Image‑Edit is an instruction based image editing model built on the 20B Qwen‑Image foundation. It performs semantic edits and local appearance changes while preserving layout and text fidelity. Ideal for programmatic asset cleanup, style tweaks, and precise bilingual text updates.

Qwen‑Image-Lightning 8 Steps V1.1

Qwen‑Image-Lightning 8 Steps V1.1

Qwen‑Image-Lightning 8 Steps V1.1 is a distilled text to image LoRA for Qwen‑Image. It targets 8 step inference for near real time rendering. It improves quality consistency over V1.0 and preserves complex text layout. Ideal for high throughput image services and interactive UIs.

Qwen‑Image-Lightning (4 steps)

Qwen‑Image-Lightning (4 steps)

Qwen‑Image-Lightning 4 steps is a distilled LoRA for Qwen‑Image that targets minimal sampling steps with strong visual fidelity. It delivers up to 25× faster image generation. Ideal for real time applications and batch pipelines that need low latency inference.

Qwen‑Image-Lightning (8 steps V1.0)

Qwen‑Image-Lightning (8 steps V1.0)

Qwen‑Image-Lightning 8 steps V1.0 is a distilled LoRA for Qwen‑Image. It targets faster inference with strong text rendering and visual fidelity. Use it to generate high resolution images from prompts with fewer sampling steps and lower GPU cost.

Qwen-Image

Qwen-Image

Qwen-Image is a 20B parameter vision language model from Alibaba Cloud. It focuses on precise text conditioned image generation and supports complex Chinese or English typography. It also enables accurate image editing workflows that need layout control and strong prompt following.

Wan2.2 5B

Wan2.2 5B

Wan2.2 5B is a compact hybrid text and image to video model that targets 720p 24fps output with strong motion coherence. It supports text only prompts or image guided generation. It is optimized for fast inference on consumer GPUs and fits production video workflows.

Wan2.2 A14B

Wan2.2 A14B

Wan2.2 A14B is a Mixture of Experts video model with two 14B experts for layout and detail. It supports text prompts or reference images to generate cinematic 480p or 720p clips with stable inference cost and consistent motion. Ideal for pipelines on high end GPUs.

Qwen2.5-VL-7B-Instruct

Qwen2.5-VL-7B-Instruct

Qwen2.5-VL-7B-Instruct is a multimodal model that processes images and text together to perform visual reasoning, captioning, question answering, and structured output generation. It integrates a vision encoder with a 7B instruction-tuned language backbone to support rich interactive multimodal understanding.

Qwen2.5-VL-7B Age Detector

Qwen2.5-VL-7B Age Detector

Qwen2.5-VL-7B Age Detector is a multimodal model that analyzes a facial image to estimate age. It uses the vision encoder from Qwen2.5-VL-7B and its instruction-tuned language backbone to interpret visual features and output age predictions or age categories.

Wan2.2 A14B Turbo

Wan2.2 A14B Turbo

Wan2.2 A14B Turbo accelerates Wan2.2 with fused Lightning LoRA for ultra fast diffusion. It cuts inference to 8 steps while preserving cinematic structure and detail. Ideal for rapid 480p to 720p video prototyping and iteration in production workflows.

Qwen-Image-Edit Lightning (8 steps)

Qwen-Image-Edit Lightning (8 steps)

Qwen-Image-Edit Lightning (8 steps) provides rapid, localized image editing with stable outputs. It suits bulk workflows that need consistent structure and layout. Developers can run quick iteration loops while keeping fine control over regions and edit strength.