Text to image: Turning words into pictures with AI
The foundational pipeline: generate images from text prompts using diffusion models, with control over every step of the process.
Introduction
Text-to-image generation transforms textual descriptions into visual content, allowing you to create images from just words. While the concept is simple, understanding the various parameters and how they interact gives you precise control over your results.
This page covers the core pipeline, how text becomes an image, how to structure requests, and how to choose the right model. For deep dives into individual parameters, each one has its own concept page linked below.
Basic request example
Here's a simple text-to-image request to get you started:
[
  {
    "taskType": "imageInference",
    "taskUUID": "a770f077-f413-47de-9dac-be0b26a35da6",
    "model": "runware:101@1",
    "positivePrompt": "An astronaut floating inside a giant hourglass in space, surrounded by stars and glowing dust, with galaxies swirling faintly above and golden sand below. Dreamy, surreal, cinematic",
    "width": 1024,
    "height": 1024,
    "steps": 30
  }
]

Response:

{
  "data": [
    {
      "taskType": "imageInference",
      "imageUUID": "ca6b2d39-5f83-47b9-b22b-71f9afc935e8",
      "taskUUID": "a770f077-f413-47de-9dac-be0b26a35da6",
      "seed": 9202427981074766178,
      "imageURL": "https://im.runware.ai/image/os/a14d18/ws/2/ii/ca6b2d39-5f83-47b9-b22b-71f9afc935e8.jpg"
    }
  ]
}

How text-to-image works
Text-to-image generation converts textual descriptions into visual content through a multi-stage process where the model gradually constructs an image based on your prompt. At its core, the process involves three key phases:
- Text understanding: The input prompt is processed by a text encoder that converts natural language into a numerical representation called embeddings. These embeddings capture the semantic meaning, conceptual relationships, and stylistic cues present in your text.
- Latent space generation: Rather than manipulating raw pixels, modern systems operate in a latent space, which is an abstract, compressed representation of images. Most advanced models use a diffusion process, which begins with random noise and gradually refines it into a meaningful image. This denoising is guided by your text embeddings and carried out by a neural network, typically a U-Net or a Transformer-based architecture like DiT. Some models instead follow an autoregressive approach, generating images token by token.
- Image decoding: The final latent representation is converted into a pixel-based image using a decoder, often part of a Variational Autoencoder (VAE). This step handles texture, color, and fine detail, producing the full-resolution image you see.
Together, these phases enable AI to generate images that closely match the meaning and style of your original prompt.
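The three phases above can be pictured as a simple data flow. The sketch below is a toy, not a real diffusion model: the function names (`encode_text`, `denoise`, `decode`), the 8-value "latent", and the fixed 10% refinement rate are all assumptions made for illustration. A real system replaces each function with a trained neural network.

```python
import random


def encode_text(prompt: str, dim: int = 8) -> list[float]:
    """Toy 'text encoder': a deterministic pseudo-embedding derived from the prompt."""
    rng = random.Random(prompt)  # seeding with a string is deterministic
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def denoise(embedding: list[float], steps: int, seed: int) -> list[float]:
    """Toy 'diffusion': start from seeded noise and refine it step by step.

    A real model predicts noise with a U-Net or Transformer; here each step
    simply moves the latent 10% of the way toward the embedding.
    """
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in embedding]  # pure noise
    for _ in range(steps):
        latent = [x + 0.1 * (target - x) for x, target in zip(latent, embedding)]
    return latent


def decode(latent: list[float]) -> list[int]:
    """Toy 'VAE decoder': map latent values to clamped 0-255 pixel values."""
    return [max(0, min(255, round((x + 1.0) * 127.5))) for x in latent]


def generate(prompt: str, steps: int = 30, seed: int = 0) -> list[int]:
    """Run the three phases in order: encode, denoise, decode."""
    return decode(denoise(encode_text(prompt), steps, seed))
```

Even in this toy version, two properties of the real pipeline show up: the same prompt and seed always give the same output, and more steps leave less of the initial noise in the result.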
Model selection
The model parameter specifies which AI model to use for generation. Models are organized by architecture families, each with different capabilities:
- SD 1.5 architecture models: Models like civitai:4384@128713 (Dreamshaper v1) or specialized variants for particular styles or subjects. These models typically excel at artistic and creative imagery.
- SDXL architecture models: Models like civitai:133005@782002 (Juggernaut XL XI) that offer higher resolution capabilities and better photorealism compared to SD 1.5 models.
- FLUX architecture models: Models like runware:101@1 (FLUX.1 Dev) deliver faster generation times, better compositional understanding, improved handling of complex scenes, and more consistent quality across different parameter settings. They're notable for their detail preservation in faces and intricate structures.
- HiDream architecture models: Models like HiDream-I1 Full are built on a Transformer-based diffusion architecture with a Mixture-of-Experts (MoE) backbone. They combine high-quality text understanding with fine-grained visual control, producing state-of-the-art results in both creative and photorealistic styles. HiDream models are especially strong in complex prompts, object interactions, and cinematic compositions.
Within each architecture, individual models may be fine-tuned for specific styles, subjects, or use cases. The model you choose significantly impacts not just the aesthetic of your results, but also how your prompt is interpreted and which parameters will be most effective.
A fierce female warrior with intricate silver armor reflecting warm sunset light, holding a glowing sword with runes carved into the blade, standing on a rocky cliff overlooking a vast fantasy valley, wind blowing through her dark hair, cinematic atmosphere
You can browse available models using our Model Search API or the models directory.
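One way to see how much the model choice matters is to run the same prompt through several models while holding everything else constant. The sketch below only builds the task objects; it does not send them. The `comparison_tasks` helper and the `MODELS` mapping are hypothetical names, the model identifiers are the ones quoted in this section, and reusing one `seed` across models is an assumption about how you might set up a fair comparison.

```python
import uuid

# Model identifiers quoted in this section, keyed by architecture family.
MODELS = {
    "SD 1.5": "civitai:4384@128713",
    "SDXL": "civitai:133005@782002",
    "FLUX": "runware:101@1",
}


def comparison_tasks(prompt: str, seed: int, width: int = 1024,
                     height: int = 1024, steps: int = 30) -> list[dict]:
    """Build one imageInference task per model, identical except for `model`."""
    tasks = []
    for model in MODELS.values():
        tasks.append({
            "taskType": "imageInference",
            "taskUUID": str(uuid.uuid4()),  # fresh UUID per task
            "model": model,
            "positivePrompt": prompt,
            "width": width,
            "height": height,
            "steps": steps,
            "seed": seed,  # same seed for every model
        })
    return tasks
```

The same seed does not produce the same image across architectures, since each model interprets the noise differently, but it removes one source of variation. SD 1.5 models are generally trained at lower resolutions than SDXL or FLUX, so you may want to pass a smaller `width` and `height` for them.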
Generation parameters
Every parameter in the request controls a different aspect of the generation process. Each one has its own concept page with visual examples and detailed guidance:
| Parameter | What it controls | Concept page |
|---|---|---|
| positivePrompt / negativePrompt | What to generate and what to avoid | Prompts |
| width / height | Canvas size and aspect ratio | Dimensions |
| steps | Number of refinement iterations | Steps |
| CFGScale | How strictly the model follows your prompt | CFG Scale |
| scheduler | Denoising algorithm (speed vs quality) | Schedulers |
| seed | Deterministic starting point for reproducibility | Seed |
| vae | Visual decoder for the final image | VAE |
| clipSkip | Text encoder layer selection | CLIP Skip |
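As a rough sketch of how these parameters come together in a request, the helper below starts from the fields used in the basic example and adds only the optional parameters you actually set. `build_task` is a hypothetical helper, not part of any SDK, and treating `width`, `height`, and `steps` as always present is an assumption based on the basic example. Valid value ranges are not covered on this page, so nothing is range-checked here.

```python
import uuid

# Optional parameter names taken from the table above.
OPTIONAL_PARAMS = {"negativePrompt", "CFGScale", "scheduler", "seed", "vae", "clipSkip"}


def build_task(model: str, positive_prompt: str, width: int = 1024,
               height: int = 1024, steps: int = 30, **optional) -> dict:
    """Build an imageInference task, rejecting misspelled parameter names."""
    unknown = set(optional) - OPTIONAL_PARAMS
    if unknown:
        raise ValueError(f"Unknown parameter(s): {sorted(unknown)}")
    task = {
        "taskType": "imageInference",
        "taskUUID": str(uuid.uuid4()),
        "model": model,
        "positivePrompt": positive_prompt,
        "width": width,
        "height": height,
        "steps": steps,
    }
    # Leave out anything passed as None so the API can apply its own default.
    task.update({key: value for key, value in optional.items() if value is not None})
    return task
```

For example, `build_task("runware:101@1", "An astronaut floating in space", seed=1234)` yields a task with a fixed seed, while a typo such as `cfgScale=7` raises a `ValueError` instead of being silently sent.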
Advanced features
These features extend text-to-image with additional conditioning and control:
| Feature | What it does | Concept page |
|---|---|---|
| LoRAs | Lightweight style/subject adapters that modify a base model | LoRAs |
| ControlNet | Structural guidance via edge maps, depth maps, and pose detection | ControlNet |
| IP Adapters | Reference image conditioning for style transfer and subject consistency | IP Adapters |
| Embeddings | Custom text tokens for specialized concepts (SD 1.5/SDXL only) | Embeddings |
| Refiner | Two-stage generation for enhanced detail (SDXL only) | Refiner |