Text to image: Turning words into pictures with AI

The foundational pipeline: generate images from text prompts using diffusion models, with control over every step of the process.

Introduction

Text-to-image generation transforms textual descriptions into visual content, letting you create images from words alone. The concept is simple, but understanding the parameters and how they interact gives you precise control over your results.

A floating island with waterfalls, pink trees, and a glowing 'GENERATE' button in front of a dark screen, surrounded by clouds

This page covers the core pipeline, how text becomes an image, how to structure requests, and how to choose the right model. For deep dives into individual parameters, each one has its own concept page linked below.

Basic request example

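Here's a simple text-to-image request to get you started. The same request is shown five ways: with the JavaScript SDK, with the Python SDK, as a cURL call, with the CLI, and as the raw JSON task: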

import { createClient } from '@runware/sdk'

const client = await createClient({ apiKey: process.env.RUNWARE_API_KEY })
await client.connect()

const [result] = await client.run({
  model: 'runware:101@1',
  positivePrompt: 'An astronaut floating inside a giant hourglass in space, surrounded by stars and glowing dust, with galaxies swirling faintly above and golden sand below. Dreamy, surreal, cinematic',
  width: 1024,
  height: 1024,
  steps: 30
})

import asyncio
import os

from runware import Runware


async def main():
    async with Runware(api_key=os.environ["RUNWARE_API_KEY"]) as client:
        results = await client.run({
            "model": "runware:101@1",
            "positivePrompt": "An astronaut floating inside a giant hourglass in space, surrounded by stars and glowing dust, with galaxies swirling faintly above and golden sand below. Dreamy, surreal, cinematic",
            "width": 1024,
            "height": 1024,
            "steps": 30
        })


asyncio.run(main())

curl https://api.runware.ai/v1 \
  -H "Authorization: Bearer $RUNWARE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '[
    {
      "taskType": "imageInference",
      "taskUUID": "a770f077-f413-47de-9dac-be0b26a35da6",
      "model": "runware:101@1",
      "positivePrompt": "An astronaut floating inside a giant hourglass in space, surrounded by stars and glowing dust, with galaxies swirling faintly above and golden sand below. Dreamy, surreal, cinematic",
      "width": 1024,
      "height": 1024,
      "steps": 30
    }
  ]'

runware run runware:101@1 \
  positivePrompt="An astronaut floating inside a giant hourglass in space, surrounded by stars and glowing dust, with galaxies swirling faintly above and golden sand below. Dreamy, surreal, cinematic" \
  width=1024 \
  height=1024 \
  steps=30

{
  "taskType": "imageInference",
  "taskUUID": "a770f077-f413-47de-9dac-be0b26a35da6",
  "model": "runware:101@1",
  "positivePrompt": "An astronaut floating inside a giant hourglass in space, surrounded by stars and glowing dust, with galaxies swirling faintly above and golden sand below. Dreamy, surreal, cinematic",
  "width": 1024,
  "height": 1024,
  "steps": 30
}

Response

{
  "data": [
    {
      "taskType": "imageInference",
      "imageUUID": "ca6b2d39-5f83-47b9-b22b-71f9afc935e8",
      "taskUUID": "a770f077-f413-47de-9dac-be0b26a35da6",
      "seed": 9202427981074766178,
      "imageURL": "https://im.runware.ai/image/os/a14d18/ws/2/ii/ca6b2d39-5f83-47b9-b22b-71f9afc935e8.jpg"
    }
  ]
}
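
Once the response arrives, you usually want just the image URL and the seed. The sketch below parses the example response shown above with Python's standard `json` module. The helper name `extract_images` is hypothetical and not part of any SDK, and the filtering on `taskType` and `taskUUID` assumes responses keep the shape shown on this page.

```python
import json

# Response body copied from the example above.
raw_response = """
{
  "data": [
    {
      "taskType": "imageInference",
      "imageUUID": "ca6b2d39-5f83-47b9-b22b-71f9afc935e8",
      "taskUUID": "a770f077-f413-47de-9dac-be0b26a35da6",
      "seed": 9202427981074766178,
      "imageURL": "https://im.runware.ai/image/os/a14d18/ws/2/ii/ca6b2d39-5f83-47b9-b22b-71f9afc935e8.jpg"
    }
  ]
}
"""


def extract_images(body: str, task_uuid: str) -> list[dict]:
    """Return the URL and seed of every image produced for one task.

    Hypothetical helper: assumes the response shape shown on this page.
    """
    payload = json.loads(body)
    return [
        {"url": item["imageURL"], "seed": item["seed"]}
        for item in payload.get("data", [])
        if item.get("taskType") == "imageInference"
        and item.get("taskUUID") == task_uuid
    ]


images = extract_images(raw_response, "a770f077-f413-47de-9dac-be0b26a35da6")
for image in images:
    print(image["url"], image["seed"])
```

Matching on `taskUUID` lets you pair each result with the request that produced it. Note that the seed in this example is larger than 2^53: Python integers handle it exactly, but languages that parse JSON numbers as 64-bit floats may round it.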

How text-to-image works

Text-to-image generation is a multi-stage process in which the model gradually constructs an image from your prompt. At its core, the process involves three key phases:

Text understanding: The input prompt is processed by a text encoder that converts natural language into a numerical representation called embeddings. These embeddings capture the semantic meaning, conceptual relationships, and stylistic cues present in your text.
Latent space generation: Rather than manipulating raw pixels, modern systems operate in a latent space, which is an abstract, compressed representation of images. Most advanced models use a diffusion process, which begins with random noise and gradually refines it into a meaningful image. This denoising is guided by your text embeddings and carried out by a neural network, typically a U-Net or a Transformer-based architecture like DiT. Some models instead follow an autoregressive approach, generating images token by token.
Image decoding: The final latent representation is converted into a pixel-based image using a decoder, often part of a Variational Autoencoder (VAE). This step handles texture, color, and fine detail, producing the full-resolution image you see.

Together, these phases enable AI to generate images that closely match the meaning and style of your original prompt.
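
The control flow of these three phases can be sketched in a few lines of Python. This is a toy illustration only, not a real model: a hash stands in for the text encoder, a fixed-rate update stands in for the neural network that predicts each denoising step, and nearest-neighbor upsampling stands in for the VAE decoder. Every function name and constant below is hypothetical.

```python
import hashlib
import random

LATENT_SIZE = 4  # toy latent grid is 4x4
SCALE = 2        # toy decoder turns each latent cell into 2x2 pixels
RATE = 0.2       # fraction of the remaining gap closed per step (assumption)


def encode_text(prompt: str) -> list[float]:
    """Phase 1 stand-in: map the prompt to a fixed-length vector in [0, 1]."""
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[: LATENT_SIZE * LATENT_SIZE]]


def denoise(embedding: list[float], steps: int, seed: int) -> list[float]:
    """Phase 2 stand-in: start from seeded noise and refine it step by step."""
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in embedding]
    for _ in range(steps):
        # Nudge each value toward the text-conditioned target. A real
        # diffusion model predicts this update with a U-Net or Transformer.
        latent = [x + RATE * (t - x) for x, t in zip(latent, embedding)]
    return latent


def decode(latent: list[float]) -> list[list[int]]:
    """Phase 3 stand-in: upsample the latent grid to 8-bit grayscale pixels."""
    size = LATENT_SIZE * SCALE
    return [
        [
            max(0, min(255, round(
                latent[(row // SCALE) * LATENT_SIZE + col // SCALE] * 255
            )))
            for col in range(size)
        ]
        for row in range(size)
    ]


def generate(prompt: str, steps: int = 30, seed: int = 0) -> list[list[int]]:
    """Run the three phases in order: encode, denoise, decode."""
    return decode(denoise(encode_text(prompt), steps, seed))


image = generate("an astronaut inside a giant hourglass", steps=30, seed=42)
print(len(image), "x", len(image[0]), "pixels")
```

Even in this toy version, the roles of two request parameters are visible: `seed` fixes the starting noise, so the same seed reproduces the same output, and more `steps` leave less residual noise in the latent before decoding.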

Model selection

The model parameter specifies which AI model to use for generation. Models are grouped into architecture families, each with different capabilities:

SD 1.5 architecture models: Models like civitai:4384@128713 (Dreamshaper v1) or specialized variants for particular styles or subjects. These models typically excel at artistic and creative imagery.
SDXL architecture models: Models like civitai:133005@782002 (Juggernaut XL XI) that offer higher resolution capabilities and better photorealism compared to SD 1.5 models.
FLUX architecture models: Models like runware:101@1 (FLUX.1 Dev) deliver faster generation times, better compositional understanding, improved handling of complex scenes, and more consistent quality across different parameter settings. They're notable for their detail preservation in faces and intricate structures.
HiDream architecture models: Models like HiDream-I1 Full are built on a Transformer-based diffusion architecture with a Mixture-of-Experts (MoE) backbone. They combine high-quality text understanding with fine-grained visual control, producing state-of-the-art results in both creative and photorealistic styles. HiDream models are especially strong with complex prompts, object interactions, and cinematic compositions.

Within each architecture, individual models may be fine-tuned for specific styles, subjects, or use cases. The model you choose significantly impacts not just the aesthetic of your results, but also how your prompt is interpreted and which parameters will be most effective.

A fierce female warrior in silver armor holding a glowing runed sword on a rocky cliff overlooking a fantasy valley — Juggernaut Pro FLUX

A pixel-art style fierce female warrior in silver armor holding a glowing sword on a cliff — PixelWave FLUX.1-dev 03

You can browse available models using our Model Search API or the models directory.
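
The model identifiers used on this page, such as `civitai:4384@128713` and `runware:101@1`, appear to follow a `source:model@version` pattern. The sketch below splits an identifier into those three parts so you can log or validate it before sending a request. The pattern is inferred from the examples on this page rather than taken from a specification, and `ModelId` and `parse_model_id` are hypothetical names.

```python
from typing import NamedTuple


class ModelId(NamedTuple):
    source: str   # e.g. "civitai" or "runware"
    model: str    # model number within that source
    version: str  # version number of that model


def parse_model_id(identifier: str) -> ModelId:
    """Split 'source:model@version' into its parts.

    Assumption: the pattern is inferred from the identifiers on this page.
    """
    source, colon, rest = identifier.partition(":")
    model, at, version = rest.partition("@")
    if not (colon and at and source and model and version):
        raise ValueError(f"Expected 'source:model@version', got {identifier!r}")
    return ModelId(source, model, version)


for identifier in ("civitai:4384@128713", "civitai:133005@782002", "runware:101@1"):
    print(parse_model_id(identifier))
```

A check like this catches a malformed identifier locally, before the request is sent.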

Generation parameters

Every parameter in the request controls a different aspect of the generation process. Each one has its own concept page with visual examples and detailed guidance:

Parameter	What it controls	Concept page
`positivePrompt` / `negativePrompt`	What to generate and what to avoid	Prompts
`width` / `height`	Canvas size and aspect ratio	Dimensions
`steps`	Number of refinement iterations	Steps
`CFGScale`	How strictly the model follows your prompt	CFG Scale
`scheduler`	Denoising algorithm (speed vs quality)	Schedulers
`seed`	Deterministic starting point for reproducibility	Seed
`vae`	Visual decoder for the final image	VAE
`clipSkip`	Text encoder layer selection	CLIP Skip
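
To see how these parameters fit together in one request, here is a minimal sketch that assembles a task object in the same shape as the raw JSON example earlier on this page. The helper `build_task` is hypothetical, the parameter values are illustrative rather than recommended settings, and rejecting unknown names is a convenience of this sketch, not documented API behavior.

```python
import json
import uuid

# Parameter names taken from the table above.
KNOWN_PARAMETERS = {
    "positivePrompt", "negativePrompt", "width", "height", "steps",
    "CFGScale", "scheduler", "seed", "vae", "clipSkip",
}


def build_task(model: str, **parameters) -> dict:
    """Assemble one imageInference task (hypothetical helper)."""
    unknown = set(parameters) - KNOWN_PARAMETERS
    if unknown:
        # Catches typos such as 'cfgScale' before the request is sent.
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")
    return {
        "taskType": "imageInference",
        "taskUUID": str(uuid.uuid4()),  # fresh identifier for each task
        "model": model,
        **parameters,
    }


task = build_task(
    "civitai:133005@782002",
    positivePrompt="A lighthouse on a cliff at sunset, dramatic clouds",
    negativePrompt="blurry, low quality",  # illustrative value
    width=1024,
    height=1024,
    steps=30,
    CFGScale=7,  # illustrative value, not a recommendation
    seed=42,     # fixed seed for reproducibility
)
print(json.dumps([task], indent=2))
```

The printed array can be used as the request body in the cURL example. Which parameters have the most effect depends on the model you choose, so treat these values as a starting point.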

Advanced features

These features extend text-to-image with additional conditioning and control:

Feature	What it does	Concept page
LoRAs	Lightweight style/subject adapters that modify a base model	LoRAs
ControlNet	Structural guidance via edge maps, depth maps, and pose detection	ControlNet
IP Adapters	Reference image conditioning for style transfer and subject consistency	IP Adapters
Embeddings	Custom text tokens for specialized concepts (SD 1.5/SDXL only)	Embeddings
Refiner	Two-stage generation for enhanced detail (SDXL only)	Refiner