OpenAI CLIP ViT-L/14

Vision encoder for text-image representation and similarity

OpenAI CLIP ViT-L/14 is a contrastive vision-language model that embeds images and text into a shared representation space. It enables tasks like zero-shot image classification, semantic search, and similarity scoring by computing aligned feature vectors for images and texts.

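A minimal usage sketch with the Hugging Face transformers library, assuming the openai/clip-vit-large-patch14 checkpoint and a zero-shot classification setup; the sample image URL and candidate labels are placeholders, and exact preprocessing may differ depending on the deployment.

    import requests
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    # Load the ViT-L/14 checkpoint (assumed Hugging Face model id).
    model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

    # Example image and candidate labels for zero-shot classification.
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    texts = ["a photo of a cat", "a photo of a dog"]

    # Encode image and texts into the shared representation space.
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # Image-text similarity scores; softmax turns them into label probabilities.
    probs = outputs.logits_per_image.softmax(dim=-1)
    print(dict(zip(texts, probs[0].tolist())))

The same image and text features (model.get_image_features / model.get_text_features) can be stored and compared with cosine similarity for semantic search or similarity scoring.
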
Commercial use
Image To Text
Caption