Qwen2.5-VL-7B-Instruct

by Alibaba, August 24, 2023

Qwen2.5-VL-7B-Instruct is a multimodal model that processes images and text together to perform visual reasoning, captioning, question answering, and structured output generation. It integrates a vision encoder with a 7B-parameter instruction-tuned language backbone to support interactive multimodal understanding.

API Options

Platform-level options for task execution and delivery.

taskType string (required, value: caption): Identifier for the type of task being performed.

taskUUID string (required, UUID v4): UUID v4 identifier for tracking tasks and matching async responses. Must be unique per task.
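A minimal sketch of generating a suitable taskUUID value with Python's standard library. The helper name new_task_uuid is hypothetical and only used for illustration.

```python
import uuid


def new_task_uuid() -> str:
    """Return a fresh random UUID v4 string for the taskUUID field."""
    return str(uuid.uuid4())


task_uuid = new_task_uuid()

# Parsing the string back confirms it is a well-formed version-4 UUID.
assert uuid.UUID(task_uuid).version == 4
```

Generating a new value for every task, rather than reusing one, keeps async responses unambiguous.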

webhookURL string (URI)

Specifies a webhook URL where JSON responses will be sent via HTTP POST when generation tasks complete. For batch requests with multiple results, each completed item triggers a separate webhook call as it becomes available.

Learn more: Webhooks (platform documentation)
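On the receiving side, a webhook endpoint only needs to accept an HTTP POST with a JSON body. The following is a rough sketch using Python's standard library. The exact fields of the delivered JSON are not described in this section, so the handler stores each decoded payload as-is. The names parse_webhook_body, WebhookHandler, received, and serve are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # decoded payloads, in arrival order


def parse_webhook_body(raw: bytes):
    """Decode a webhook POST body as JSON; return None if it is not valid JSON."""
    try:
        return json.loads(raw)
    except ValueError:  # covers JSONDecodeError and bad UTF-8
        return None


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = parse_webhook_body(self.rfile.read(length))
        if payload is None:
            self.send_response(400)
        else:
            # Each completed item arrives as its own POST, so append rather
            # than overwrite.
            received.append(payload)
            self.send_response(200)
        self.end_headers()

    def log_message(self, format, *args):
        pass  # silence default per-request logging


def serve(port: int = 8000) -> None:
    """Block forever, handling webhook calls on the given port."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Calling serve(8000) starts the listener. Because batch results arrive one call at a time, the receiver should not assume that a single POST contains every result.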

deliveryMethod string (default: sync)

Determines how the API delivers task results.

Allowed values:

sync: Returns complete results directly in the API response.
async: Returns an immediate acknowledgment with the task UUID. Poll for results using getResponse.

Learn more: Task Polling (platform documentation)
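When results are not returned directly, the client has to poll until the task completes. The loop below is a generic sketch of that pattern. The shape of a getResponse request and reply is not specified in this section, so the actual call is passed in as a hypothetical fetch callable that is assumed to return None while the task is still pending and the result dict once it is ready.

```python
import time
from typing import Callable, Optional


def poll_until_done(
    fetch: Callable[[str], Optional[dict]],
    task_uuid: str,
    interval: float = 1.0,
    timeout: float = 60.0,
) -> dict:
    """Call fetch(task_uuid) repeatedly until it returns a result or time runs out.

    fetch is a caller-supplied stand-in for the getResponse request.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch(task_uuid)
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_uuid} did not complete within {timeout}s")
        time.sleep(interval)
```

Keeping the transport outside the loop makes the retry logic easy to test with a fake fetch function, and the timeout guards against polling indefinitely.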

includeCost boolean: Include task cost in the response.

Inputs

Input resources for the task (images, audio, etc.). These must be nested inside the inputs object.

inputs » image string (required): Image input (UUID, URL, data URI, or Base64).
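For a local file, one of the accepted forms is a data URI. The sketch below builds one with the standard library. The helper name image_to_data_uri is hypothetical, and the MIME type is guessed from the file extension rather than from the file contents.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_uri(path: str) -> str:
    """Encode a local image file as a data URI string."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot determine an image MIME type for {path!r}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The returned string can then be used as the image value nested inside the inputs object.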

Core Parameters

Primary parameters that define the task output.

model string (required, value: alibaba-qwen2-5-vl-7b-instruct)

Identifier of the model to use for generation.

prompt string: Instructions or questions to guide the image analysis.
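Putting the parameters on this page together, a single task object might be assembled as in the sketch below. The function name build_caption_task and the example image URL are placeholders. How the object is sent (endpoint, authentication, and any wrapping of tasks for batch requests) is not covered in this section and is deliberately left out.

```python
import json
import uuid
from typing import Optional

MODEL_ID = "alibaba-qwen2-5-vl-7b-instruct"


def build_caption_task(
    image: str,
    prompt: Optional[str] = None,
    delivery_method: str = "sync",
    include_cost: bool = False,
    webhook_url: Optional[str] = None,
) -> dict:
    """Assemble one task object from the parameters documented on this page."""
    task = {
        "taskType": "caption",
        "taskUUID": str(uuid.uuid4()),  # must be unique per task
        "model": MODEL_ID,
        "inputs": {"image": image},  # image must be nested inside inputs
        "deliveryMethod": delivery_method,
        "includeCost": include_cost,
    }
    # Optional fields are only included when the caller sets them.
    if prompt is not None:
        task["prompt"] = prompt
    if webhook_url is not None:
        task["webhookURL"] = webhook_url
    return task


task = build_caption_task(
    "https://example.com/photo.jpg",  # placeholder image URL
    prompt="Describe the scene in one sentence.",
)
print(json.dumps(task, indent=2))
```

Omitting unset optional fields keeps the request minimal and lets the platform apply its own defaults.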