Baidu

Foundation models for language, image generation, and multimodal AI

Baidu develops the ERNIE family of foundation models, which covers language, image generation, and multimodal understanding. Its lineup spans general-purpose LLMs, vision-language systems, and open image-generation checkpoints designed for both research and production workflows.

Models by Baidu

ERNIE-Image is Baidu's 8B text-to-image model built on a single-stream Diffusion Transformer architecture. It is designed for strong prompt adherence, reliable text rendering, and structured visual generation, making it well suited to posters, comics, storyboards, multi-panel layouts, and other workflows where content accuracy and composition matter as much as aesthetics. The standard model prioritizes general-purpose capability and instruction fidelity, and typically runs at around 50 inference steps.

ERNIE-Image-Turbo is Baidu's distilled fast variant of ERNIE-Image. It is optimized for substantially faster generation, typically requiring only 8 inference steps, while retaining performance comparable to the full model in many scenarios. It is best suited to high-throughput ideation, rapid visual iteration, and workflows that prioritize speed and polished aesthetics over the stronger general-purpose instruction fidelity of the base model.
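
To make the trade-off between the two variants concrete, here is a minimal Python sketch that records the typical step counts quoted above and picks a variant by priority. The names VariantPreset, pick_variant, and step_ratio are hypothetical helpers written for this illustration and are not part of any Baidu SDK or API. The step counts are the "typical" figures from the descriptions, not fixed requirements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantPreset:
    """Hypothetical record of a model variant and its typical step count."""
    name: str
    typical_steps: int


# Typical inference step counts as quoted in the model descriptions.
ERNIE_IMAGE = VariantPreset("ERNIE-Image", 50)
ERNIE_IMAGE_TURBO = VariantPreset("ERNIE-Image-Turbo", 8)


def pick_variant(prioritize_speed: bool) -> VariantPreset:
    """Choose Turbo for fast iteration, the base model for instruction fidelity."""
    return ERNIE_IMAGE_TURBO if prioritize_speed else ERNIE_IMAGE


def step_ratio(base: VariantPreset, fast: VariantPreset) -> float:
    """Ratio of denoising steps between two variants (not a measured speedup)."""
    return base.typical_steps / fast.typical_steps


if __name__ == "__main__":
    chosen = pick_variant(prioritize_speed=True)
    print(f"{chosen.name}: about {chosen.typical_steps} steps")
    print(f"step ratio: {step_ratio(ERNIE_IMAGE, ERNIE_IMAGE_TURBO):.2f}")
```

The step ratio only compares the number of denoising steps. Actual wall-clock speedup also depends on per-step cost, text encoding, and image decoding, so it should be measured rather than inferred from this figure.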