
Baidu
Foundation models for language, image generation, and multimodal AI
Baidu develops the ERNIE family of foundation models for language, image generation, and multimodal understanding. Its lineup spans general-purpose LLMs, vision-language systems, and open image-generation checkpoints designed for both research and production workflows.
Models by Baidu
ERNIE-Image is Baidu's 8B-parameter text-to-image model built on a single-stream Diffusion Transformer architecture. It is designed for strong prompt adherence, reliable text rendering, and structured visual generation, making it well suited to posters, comics, storyboards, multi-panel layouts, and other workflows where content accuracy and composition matter as much as aesthetics. The standard model emphasizes general-purpose capability and instruction fidelity, and typically runs at around 50 inference steps.
ERNIE-Image-Turbo is Baidu's distilled, fast variant of ERNIE-Image. It is optimized for substantially faster generation, typically requiring only 8 inference steps, while performing comparably to the full model in many scenarios. It is best suited to high-throughput ideation, rapid visual iteration, and workflows that prioritize speed and polished aesthetics over the stronger general-purpose instruction fidelity of the base model.
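The trade-off between the two variants can be summarized as a small selection helper. This is a minimal sketch, not an official Baidu API: the `Preset` dataclass, the `choose_preset` function, and the `prioritize_speed` flag are hypothetical names introduced for illustration, and the step counts are only the typical values quoted in the descriptions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    """Hypothetical container pairing a model name with a typical step count."""
    model: str
    num_inference_steps: int


# Typical step counts taken from the model descriptions; these are starting
# points for experimentation, not fixed requirements.
ERNIE_IMAGE = Preset(model="ERNIE-Image", num_inference_steps=50)
ERNIE_IMAGE_TURBO = Preset(model="ERNIE-Image-Turbo", num_inference_steps=8)


def choose_preset(prioritize_speed: bool) -> Preset:
    """Pick the Turbo variant when speed matters most, else the standard model."""
    return ERNIE_IMAGE_TURBO if prioritize_speed else ERNIE_IMAGE


if __name__ == "__main__":
    for fast in (False, True):
        preset = choose_preset(prioritize_speed=fast)
        print(f"prioritize_speed={fast}: {preset.model}, "
              f"{preset.num_inference_steps} steps")
```

In this sketch, a poster or multi-panel layout that depends on accurate text rendering would use `choose_preset(False)`, while a rapid ideation loop would use `choose_preset(True)`.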

