Qwen2.5-VL-3B-Instruct

Instruction-tuned vision-language model for image and text understanding

Qwen2.5-VL-3B-Instruct

Qwen2.5-VL-3B-Instruct is a multimodal model that processes images and text together to perform visual reasoning, captioning, question answering, and structured output tasks. It integrates a vision encoder with an instruction-tuned language backbone to support complex visual understanding and interactive multimodal responses.

Commercial use
Image To TextCaption