LLaVA-1.6-Mistral-7B

Vision-language model for image understanding and captioning

LLaVA-1.6-Mistral-7B is a multimodal vision-language model that processes images alongside text to generate descriptive and reasoning-based responses. It enables image captioning and visual understanding by combining a vision encoder with a Mistral 7B language backbone.

Commercial use

Image to TextCaption

80 - 100 tokens$0.0019