LLaVA-1.6-Mistral-7B

Vision-language model for image understanding and captioning

Image to TextCaption

LLaVA-1.6-Mistral-7B Overview

LLaVA-1.6-Mistral-7B is a multimodal vision-language model that processes images alongside text to generate descriptive and reasoning-based responses. It enables image captioning and visual understanding by combining a vision encoder with a Mistral 7B language backbone.

From $0.00190/ tokens
80 - 100 tokens$0.0019

Commercial use