OpenAI CLIP ViT-L/14

Vision encoder for text-image representation and similarity

OpenAI CLIP ViT-L/14 is a contrastive vision-language model that embeds images and text into a shared representation space. It enables tasks like zero-shot image classification, semantic search, and similarity scoring by computing aligned feature vectors for images and texts.
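The similarity scoring step can be sketched as follows. This is a minimal illustration of how CLIP-style embeddings are compared, not the model itself: the embedding vectors below are made-up stand-ins (real CLIP ViT-L/14 embeddings are 768-dimensional and produced by the image and text encoders), and the logit scale of 100 matches the value CLIP learns in practice.

```python
import numpy as np

# Illustrative stand-ins for CLIP embeddings; a real deployment would obtain
# these from the CLIP ViT-L/14 image and text encoders.
image_emb = np.array([0.2, 0.9, 0.1])
text_embs = np.array([
    [0.1, 0.8, 0.2],   # e.g. "a photo of a cat"
    [0.9, 0.1, 0.3],   # e.g. "a photo of a truck"
])

def normalize(x):
    # CLIP compares L2-normalized vectors, so the dot product below
    # is cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

sims = normalize(text_embs) @ normalize(image_emb)

# Zero-shot classification: softmax over scaled similarities
# (shift by the max for numerical stability).
logits = 100.0 * sims
probs = np.exp(logits - logits.max())
probs /= probs.sum()

best = int(np.argmax(probs))  # index of the best-matching caption
```

In a zero-shot classification setup, each candidate class label is wrapped in a prompt (e.g. "a photo of a {label}"), embedded as text, and the image is assigned to the label with the highest similarity.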

License: Commercial use
Tokens: 45 – 60
Price: $0.0032 – $0.0045
Tags: image-to-text, caption