Text to image: Turning words into pictures with AI

The foundational pipeline: generate images from text prompts using diffusion models, with control over every step of the process.

Introduction

Text-to-image generation transforms textual descriptions into visual content, allowing you to create images from just words. While the concept is simple, understanding the various parameters and how they interact gives you precise control over your results.

This page covers the core pipeline: how text becomes an image, how to structure requests, and how to choose the right model. For deep dives into individual parameters, each one has its own concept page linked below.

Basic request example

Here's a simple text-to-image request to get you started:

[
  {
    "taskType": "imageInference",
    "taskUUID": "a770f077-f413-47de-9dac-be0b26a35da6",
    "model": "runware:101@1",
    "positivePrompt": "An astronaut floating inside a giant hourglass in space, surrounded by stars and glowing dust, with galaxies swirling faintly above and golden sand below. Dreamy, surreal, cinematic",
    "width": 1024,
    "height": 1024,
    "steps": 30
  }
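]

The response echoes the taskUUID and returns the seed that was used along with the URL of the generated image:

{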
  "data": [
    {
      "taskType": "imageInference",
      "imageUUID": "ca6b2d39-5f83-47b9-b22b-71f9afc935e8",
      "taskUUID": "a770f077-f413-47de-9dac-be0b26a35da6",
      "seed": 9202427981074766178,
      "imageURL": "https://im.runware.ai/image/os/a14d18/ws/2/ii/ca6b2d39-5f83-47b9-b22b-71f9afc935e8.jpg"
    }
  ]
}
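
The same request can be assembled programmatically. The Python sketch below builds a task payload with the fields shown in the request above, generating a fresh taskUUID each time, and pulls the image URLs out of a response of the shape shown above. The helper names build_image_request and extract_image_urls are hypothetical, and the code deliberately stops short of sending anything, since the endpoint and authentication are not covered on this page.

```python
import json
import uuid


def build_image_request(prompt, model="runware:101@1", width=1024, height=1024, steps=30):
    """Return a one-task request list mirroring the example request (hypothetical helper)."""
    task = {
        "taskType": "imageInference",
        "taskUUID": str(uuid.uuid4()),  # fresh identifier for each task
        "model": model,
        "positivePrompt": prompt,
        "width": width,
        "height": height,
        "steps": steps,
    }
    return [task]


def extract_image_urls(response, task_uuid):
    """Collect imageURL values from response["data"] entries that match task_uuid."""
    return [
        item["imageURL"]
        for item in response.get("data", [])
        if item.get("taskUUID") == task_uuid
    ]


request = build_image_request("An astronaut floating inside a giant hourglass in space")
print(json.dumps(request, indent=2))
```

Matching on taskUUID is a design choice for this sketch: it lets one response handler pick out its own results when several tasks share a connection.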

How text-to-image works

Generation is a multi-stage process in which the model gradually constructs an image from your prompt. At its core, the process involves three key phases:

  1. Text understanding: The input prompt is processed by a text encoder that converts natural language into a numerical representation called embeddings. These embeddings capture the semantic meaning, conceptual relationships, and stylistic cues present in your text.

  2. Latent space generation: Rather than manipulating raw pixels, modern systems operate in a latent space, which is an abstract, compressed representation of images. Most advanced models use a diffusion process, which begins with random noise and gradually refines it into a meaningful image. This denoising is guided by your text embeddings and carried out by a neural network, typically a U-Net or a Transformer-based architecture like DiT. Some models instead follow an autoregressive approach, generating images token by token.

  3. Image decoding: The final latent representation is converted into a pixel-based image using a decoder, often part of a Variational Autoencoder (VAE). This step handles texture, color, and fine detail, producing the full-resolution image you see.

Together, these phases enable AI to generate images that closely match the meaning and style of your original prompt.
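
To make the data flow between the three phases concrete, here is a deliberately tiny Python sketch. It is a toy, not a diffusion model: the "text encoder" is a hash, the "denoiser" simply nudges seeded random noise toward the embedding instead of running a neural network, and the "decoder" rescales eight numbers into pixel intensities. All function names are hypothetical. Only the sequence (encode, iteratively refine noise, decode) mirrors the description above.

```python
import hashlib
import random


def encode_text(prompt, dim=8):
    """Phase 1 (toy): turn a prompt into a fixed-size vector with values in [-1, 1]."""
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    return [digest[i] / 127.5 - 1.0 for i in range(dim)]


def denoise(embedding, steps, seed):
    """Phase 2 (toy): start from seeded random noise and refine it step by step."""
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in embedding]
    for _ in range(steps):
        # A real model uses a neural network to predict what to remove.
        # This stand-in just moves 20% of the way toward the embedding.
        latent = [z + 0.2 * (e - z) for z, e in zip(latent, embedding)]
    return latent


def decode(latent):
    """Phase 3 (toy): map latent values to 0-255 pixel intensities."""
    return [round((max(-1.0, min(1.0, z)) + 1.0) * 127.5) for z in latent]


def generate(prompt, steps=30, seed=0):
    """Run the three toy phases in order."""
    return decode(denoise(encode_text(prompt), steps, seed))


print(generate("a floating island with waterfalls"))
```

Even in this toy, two properties of the real pipeline are visible: the same prompt, step count, and seed always give the same output, and more refinement steps bring the latent closer to what the text embedding asks for.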

Model selection

The model parameter specifies which AI model to use for generation. Models are grouped into architecture families, each with different capabilities:

  • SD 1.5 architecture models: Models like civitai:4384@128713 (Dreamshaper v1) or specialized variants for particular styles or subjects. These models typically excel at artistic and creative imagery.

  • SDXL architecture models: Models like civitai:133005@782002 (Juggernaut XL XI) that offer higher resolution capabilities and better photorealism compared to SD 1.5 models.

  • FLUX architecture models: Models like runware:101@1 (FLUX.1 Dev) deliver faster generation times, better compositional understanding, improved handling of complex scenes, and more consistent quality across different parameter settings. They're notable for their detail preservation in faces and intricate structures.

  • HiDream architecture models: Models like HiDream-I1 Full are built on a Transformer-based diffusion architecture with a Mixture-of-Experts (MoE) backbone. They combine high-quality text understanding with fine-grained visual control, producing state-of-the-art results in both creative and photorealistic styles. HiDream models are especially strong in complex prompts, object interactions, and cinematic compositions.

Within each architecture, individual models may be fine-tuned for specific styles, subjects, or use cases. The model you choose significantly impacts not just the aesthetic of your results, but also how your prompt is interpreted and which parameters will be most effective.

You can browse available models using our Model Search API or the models directory.
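
The identifiers quoted in this section (civitai:4384@128713, civitai:133005@782002, runware:101@1) all have the shape source:id@version. The Python sketch below splits an identifier into those parts. The shape is inferred from the examples on this page rather than from a format specification, so treat the pattern as an assumption; the names ModelId and parse_model_id are hypothetical.

```python
import re
from typing import NamedTuple


class ModelId(NamedTuple):
    source: str
    model: str
    version: str


# Pattern inferred from the examples on this page: <source>:<id>@<version>
_MODEL_ID = re.compile(r"^([A-Za-z0-9_-]+):(\d+)@(\d+)$")


def parse_model_id(identifier):
    """Split an identifier into source, model, and version, or raise ValueError."""
    match = _MODEL_ID.match(identifier)
    if match is None:
        raise ValueError(f"unexpected model identifier: {identifier!r}")
    return ModelId(*match.groups())


for example in ("civitai:4384@128713", "civitai:133005@782002", "runware:101@1"):
    print(parse_model_id(example))
```

Validating the identifier before building a request catches typos locally instead of after a round trip.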

Generation parameters

Every parameter in the request controls a different aspect of the generation process. Each one has its own concept page with visual examples and detailed guidance:

| Parameter | What it controls | Concept page |
| --- | --- | --- |
| positivePrompt / negativePrompt | What to generate and what to avoid | Prompts |
| width / height | Canvas size and aspect ratio | Dimensions |
| steps | Number of refinement iterations | Steps |
| CFGScale | How strictly the model follows your prompt | CFG Scale |
| scheduler | Denoising algorithm (speed vs quality) | Schedulers |
| seed | Deterministic starting point for reproducibility | Seed |
| vae | Visual decoder for the final image | VAE |
| clipSkip | Text encoder layer selection | CLIP Skip |
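
As a sketch of how these parameters combine in one task, the Python snippet below layers optional settings on top of a base task and rejects names that are not in the table, which guards against casing slips such as cfgScale. The helper with_parameters is hypothetical, and the values are illustrative only; sensible ranges depend on the model and are covered on the concept pages.

```python
import uuid

# Parameter names taken from the table above.
KNOWN_PARAMETERS = {
    "positivePrompt", "negativePrompt", "width", "height", "steps",
    "CFGScale", "scheduler", "seed", "vae", "clipSkip",
}


def with_parameters(base_task, **overrides):
    """Return a copy of base_task with overrides applied, rejecting unknown names."""
    unknown = set(overrides) - KNOWN_PARAMETERS
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {**base_task, **overrides}


base = {
    "taskType": "imageInference",
    "taskUUID": str(uuid.uuid4()),
    "model": "civitai:133005@782002",
    "positivePrompt": "A lighthouse on a cliff at dusk",
    "width": 1024,
    "height": 1024,
}

# Illustrative values only; suitable settings depend on the model.
task = with_parameters(
    base, negativePrompt="blurry, low quality", steps=30, CFGScale=7, seed=42
)
print(task)
```

Fixing the seed while changing one other parameter at a time is a simple way to see what each parameter does in isolation.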

Advanced features

These features extend text-to-image with additional conditioning and control:

| Feature | What it does | Concept page |
| --- | --- | --- |
| LoRAs | Lightweight style/subject adapters that modify a base model | LoRAs |
| ControlNet | Structural guidance via edge maps, depth maps, and pose detection | ControlNet |
| IP Adapters | Reference image conditioning for style transfer and subject consistency | IP Adapters |
| Embeddings | Custom text tokens for specialized concepts (SD 1.5/SDXL only) | Embeddings |
| Refiner | Two-stage generation for enhanced detail (SDXL only) | Refiner |
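
Two of these features are restricted to particular architectures. A minimal Python sketch of a pre-flight check based only on the restrictions stated in the table follows; the function name allowed_by_table is hypothetical. A True result means only that this page states no restriction, not that the combination is guaranteed to be supported.

```python
# Restrictions copied from the table above. Features not listed here
# carry no stated restriction on this page.
FEATURE_ARCHITECTURES = {
    "Embeddings": {"SD 1.5", "SDXL"},
    "Refiner": {"SDXL"},
}


def allowed_by_table(feature, architecture):
    """Return False only when the table restricts the feature to other architectures."""
    allowed = FEATURE_ARCHITECTURES.get(feature)
    return allowed is None or architecture in allowed


for feature, architecture in [("Refiner", "SDXL"), ("Refiner", "FLUX"), ("Embeddings", "SD 1.5")]:
    print(feature, architecture, allowed_by_table(feature, architecture))
```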