Text to image: Turning words into pictures with AI
Master the art of text-to-image generation with Runware's API. Learn how key parameters work and how to fine-tune your results for optimal image quality.
Introduction
Text-to-image generation transforms textual descriptions into visual content, allowing you to create images from just words. While the concept is simple, understanding the various parameters and how they interact gives you powerful control over your results.
This guide breaks down the key parameters and techniques for getting the most from text-to-image generation with Runware's API, with practical explanations for developers who integrate this workflow into their applications.
Basic request example
Here's a simple text-to-image request to get you started:
[
{
"taskType": "imageInference",
"taskUUID": "a770f077-f413-47de-9dac-be0b26a35da6",
"model": "runware:101@1",
"positivePrompt": "An astronaut floating inside a giant hourglass in space, surrounded by stars and glowing dust, with galaxies swirling faintly above and golden sand below. Dreamy, surreal, cinematic",
"width": 1024,
"height": 1024,
"steps": 30
}
]
The API responds with the details of the generated image:
{
"data": [
{
"taskType": "imageInference",
"imageUUID": "ca6b2d39-5f83-47b9-b22b-71f9afc935e8",
"taskUUID": "a770f077-f413-47de-9dac-be0b26a35da6",
"seed": 9202427981074766178,
"imageURL": "https://im.runware.ai/image/ws/2/ii/ca6b2d39-5f83-47b9-b22b-71f9afc935e8.jpg"
}
]
}
As we explore each parameter in this guide, you'll learn how to customize requests to achieve exactly the results you want.
How text-to-image works
Text-to-image generation converts textual descriptions into visual content through a multi-stage process where the model gradually constructs an image based on your prompt. At its core, the process involves three key phases:
- Text understanding: The input prompt is processed by a text encoder that converts natural language into a numerical representation called embeddings. These embeddings capture the semantic meaning, conceptual relationships, and stylistic cues present in your text.
- Latent space generation: Rather than manipulating raw pixels, modern systems operate in a latent space, which is an abstract, compressed representation of images. Most advanced models use a diffusion process, which begins with random noise and gradually refines it into a meaningful image. This denoising is guided by your text embeddings and carried out by a neural network, typically a U-Net or a Transformer-based architecture like DiT. Some models instead follow an autoregressive approach, generating images token by token.
- Image decoding: The final latent representation is converted into a pixel-based image using a decoder, often part of a Variational Autoencoder (VAE). This step handles texture, color, and fine detail, producing the full-resolution image you see.
Together, these phases enable AI to generate images that closely match the meaning and style of your original prompt.
From a practical perspective, this technical process translates into a sequence of actions:
- First, craft your prompts to clearly define what you want to see (and avoid).
- Then, choose an appropriate model suited to your desired style and content.
- Next, set the generation parameters that influence the overall image creation process.
- Finally, evaluate the results and refine your approach as needed.
Parameters like steps, CFGScale, and scheduler directly control the generation process, determining how many iterations occur, how strictly your prompt is followed, and which mathematical approach guides the generation. Meanwhile, parameters like seed influence the initial conditions, ensuring reproducibility.
Different models may implement variations of this process, but the fundamental approach of translating text understanding into visual generation remains consistent. Understanding these parameters is key to getting consistent, high-quality results.
Key parameters
Prompts: Guiding the generation
The prompt parameters provide the textual guidance that steers the image generation process. These text strings are processed by the model's text encoder(s) to create embeddings that influence the generation at each step.
The positivePrompt parameter defines what you want to see in the image. During generation, this text is tokenized (broken into word pieces) and encoded into a high-dimensional representation that guides the model toward specific visual concepts, styles, and attributes. The model has learned associations between language and imagery during training, allowing it to translate your textual descriptions into visual elements.
Conversely, the negativePrompt parameter specifies what you want to avoid. It works through a similar embedding process but exerts an opposing influence, steering the generation away from undesired characteristics or elements. This can be particularly useful for avoiding common artifacts, unwanted styles, or problematic content.
The position of terms in your prompt can affect their influence, with earlier terms typically receiving more emphasis in most models. Additionally, the semantic relationships between words matter, as the model interprets phrases and combinations differently than isolated terms.
Take note that some model architectures (like FLUX) don't support negative prompts at all. When using these models, the negativePrompt parameter will be ignored in your request.
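As a minimal sketch, both prompt parameters slot into the same request (the prompt text here is purely illustrative, and other parameters are omitted):
[
  {
    // other parameters...
    "positivePrompt": "A cozy wooden cabin in a snowy forest at dusk, warm light glowing from the windows, cinematic",
    "negativePrompt": "blurry, low quality, watermark, extra limbs"
  }
]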
Model selection: The foundation of generation
The model parameter specifies which specific AI model to use for generation.
Models are organized by architecture families, each with different capabilities:
- SD 1.5 architecture models: Models like civitai:4384@128713 (Dreamshaper v1) or specialized variants for particular styles or subjects. These models typically excel at artistic and creative imagery.
- SDXL architecture models: Models like civitai:133005@782002 (Juggernaut XL XI) that offer higher resolution capabilities and better photorealism compared to SD 1.5 models.
- FLUX architecture models: Models like runware:101@1 (FLUX.1 Dev) deliver faster generation times, better compositional understanding, improved handling of complex scenes, and more consistent quality across different parameter settings. They're particularly notable for their detail preservation in faces and intricate structures.
- HiDream architecture models: Models like HiDream-I1 Full are built on a Transformer-based diffusion architecture with a Mixture-of-Experts (MoE) backbone. They combine high-quality text understanding with fine-grained visual control, producing state-of-the-art results in both creative and photorealistic styles. HiDream models are especially strong in complex prompts, object interactions, and cinematic compositions.
Within each architecture, individual models may be fine-tuned for specific styles, subjects, or use cases. The model you choose significantly impacts not just the aesthetic of your results, but also how your prompt is interpreted and which parameters will be most effective.
You can browse available models using our Model Search API or using our Model Explorer tool.
Image dimensions: Canvas size and ratio
The width and height parameters define your image's dimensions and aspect ratio.
While square formats (1:1) are common for general purposes, specific aspect ratios can enhance certain types of content:
- Portrait dimensions (taller than wide, like 768×1024) typically produce better results for character portraits, fashion images, and full-body shots.
- Landscape dimensions (wider than tall, like 1024×768) excel at scenic views, environments, and panoramic compositions.
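For instance, switching to a portrait composition only requires adjusting the dimensions (a minimal sketch with other parameters omitted):
[
  {
    // other parameters...
    "width": 768,
    "height": 1024
  }
]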
Our API supports a wide range of dimensions, enabling you to generate ultra-wide panoramas or tall vertical images that would be difficult to create with standard aspect ratios. This flexibility is particularly valuable for specialized use cases like banner images, mobile app content, or widescreen presentations.
AI models are trained on images with specific dimensions, which creates "sweet spots" where they perform best. While some traditional models work best between 512-1024 pixels per side, newer architectures like FLUX models can produce excellent results at larger dimensions. Experiment with different sizes for your chosen model to find the sweet spot that balances quality and generation speed for your specific needs.
Remember that you can always upscale your lower-resolution images using our API, which allows you to generate higher-resolution images without sacrificing quality.
Steps: Trading quality for speed
The steps parameter defines how many iterations the model performs during image generation. While different model architectures use varying internal mechanisms, the steps parameter consistently controls the level of refinement in the generation process.
In diffusion models, each step typically removes a bit of noise, gradually turning random input into a detailed image. In transformer-based or autoregressive models, steps guide how many refinement cycles or generation passes the model performs. Regardless of the internal method, higher step counts usually lead to more coherent and detailed results, though they may also increase generation time.
The generation process generally follows these phases regardless of architecture:
- Early steps: Establish basic composition, rough shapes, and color palette distribution.
- Middle steps: Form recognizable objects, define spatial relationships, and develop textural foundations.
- Later steps: Refine details, enhance coherence between elements, and develop subtle lighting nuances.
- Final steps: Polish fine details and smooth transitions, often with increasingly subtle changes.
The optimal step count varies by model architecture and generation algorithm (scheduler), directly impacting both generation time and image quality.
Some models are created through a process called knowledge distillation, where a smaller and more efficient model is trained to mimic the outputs of a larger model. Distilled model architectures like LCM (Latent Consistency Model) or FLUX.1 Schnell can generate high-quality images in significantly fewer steps (4-8) compared to their non-distilled counterparts. This optimization makes them particularly valuable for applications where generation speed is critical, though they may occasionally trade some detail quality or prompt adherence for this efficiency.
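As a sketch, lowering the step count trades some refinement for speed; the values below are illustrative starting points rather than recommendations for any specific model:
[
  {
    // other parameters...
    // standard models often work well around 20-30 steps;
    // distilled models (LCM, FLUX.1 Schnell) may need only 4-8
    "steps": 20
  }
]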
CFG scale: Balancing creativity and control
The CFGScale (Classifier-Free Guidance Scale) parameter controls how strictly the model follows your prompt during image generation. Technically, it's a weighting factor that determines the influence of your prompt on the generation process.
At each step of the generation process, the model computes two predictions:
- Unconditioned prediction: What the model would generate with an empty prompt.
- Conditioned prediction: What the model would generate following your specific prompt.
CFG Scale then amplifies the difference between these two predictions, pushing the generation toward what your prompt describes: conceptually, the final prediction is the unconditioned prediction plus CFGScale times the difference between the two. Higher values give more weight to your prompt's guidance at the expense of creative freedom and, at extremes, correctness. You can effectively turn off this guidance by setting CFGScale to 0 or 1.
In practice, lower CFG values allow for more creative freedom, while higher values enforce stricter prompt adherence. However, extremely high settings can lead to overguidance, causing artifacts, unnatural saturation, or distorted layouts.
Different model architectures handle CFG in distinct ways. Newer architectures like FLUX use a "CFG-distilled" approach where the parameter still exists, but has a much more subtle effect on generation. For FLUX models, the entire CFG range tends to produce more consistent outputs compared to the dramatic changes seen in traditional diffusion models like SD 1.5 or SDXL, which typically work best with lower CFG values.
For precise control, start with a model's recommended range and adjust based on your specific requirements.
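For example, a mid-range setting can be expressed like this (a sketch; 7 is a common starting point for SD 1.5 and SDXL models, while CFG-distilled models like FLUX are far less sensitive to this value):
[
  {
    // other parameters...
    "CFGScale": 7
  }
]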
Scheduler: The algorithmic path to your image
The scheduler parameter (sometimes called "sampler") defines the mathematical algorithm that guides the image generation process in diffusion models.
Each scheduler defines a different denoising trajectory, which can be more linear, stochastic, or adaptive. Different model architectures support different schedulers optimized for their structure. Schedulers control how noise is removed over time, affecting both quality and generation time.
Popular schedulers include:
- DPM++ 2M Karras: A great all-around choice with excellent detail and balanced results.
- Euler A: Very fast and tends to produce more creative results, making it great for experimentation.
- DPM++ 3M SDE: A newer scheduler that offers even better quality at high steps, perfect for detailed or large renders.
- UniPC: A good middle ground between speed and image quality, slightly faster than DPM++ 2M Karras without losing much detail.
For a complete list of available schedulers, check our Schedulers page.
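A request that pins the scheduler might look like the sketch below; the exact identifier strings accepted by the API are listed on the Schedulers page, so treat the value here as an assumed placeholder:
[
  {
    // other parameters...
    "scheduler": "DPM++ 2M Karras"
  }
]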
Seed: Controlling randomness deterministically
The seed parameter provides a deterministic starting point for the pseudo-random processes in image generation.
In diffusion models, the seed determines the initial noise pattern from which the image is gradually refined. While different architectures may interpret and process that noise differently, the seed consistently enables reproducibility and controlled variation across generations.
Seed values serve several important purposes:
- Reproducibility: The same seed and parameters will always produce the same image.
- Controlled experimentation: Change specific parameters while keeping the composition consistent.
- Iterations: Find a good composition, then save the seed for further refinement.
If you don't specify a seed, a random one will be generated. When you find an image you like, note its seed value (returned in the response object) for future use.
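For example, reusing the seed returned in the earlier response reproduces that composition while you experiment with other parameters (a minimal sketch):
[
  {
    // other parameters...
    "seed": 9202427981074766178
  }
]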
VAE: Visual decoder
The vae parameter specifies which Variational Autoencoder to use for converting the model's internal representations into the final image.
A VAE consists of two parts:
- An encoder that compresses images into a low-dimensional latent space (used during model training).
- A decoder that reconstructs images from latent representations (used during inference).
The model doesn't work directly with pixels during generation, but instead operates in a compressed latent space, a lower-dimensional representation that's more computationally efficient. The VAE's decoder is responsible for the crucial final step of converting these abstract latent representations back into a visible image with proper colors, textures, and details.
The VAE parameter allows you to specify an alternative decoder to use for the final conversion step.
Custom VAEs can affect several aspects of the final image:
- Color reproduction: Different VAEs can produce more vibrant or accurate colors.
- Detail preservation: Some VAEs better preserve fine details in the latent-to-pixel conversion.
- Artifact reduction: Specialized VAEs can reduce common issues like color banding or blotches.
Not all architectures support custom VAEs. Some models, including FLUX, use their own integrated decoding methods and don't support VAE customization.
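A custom VAE is specified by its AIR identifier, as in the sketch below; the identifier shown is a hypothetical placeholder, so use the Model Search API or Model Explorer to find real VAE identifiers:
[
  {
    // other parameters...
    "vae": "civitai:00000@00000" // hypothetical placeholder identifier
  }
]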
Clip Skip: Adjusting text interpretation
The clipSkip parameter controls which layer of the CLIP text encoder is used to interpret your prompt.
To understand this parameter, it helps to know how text is processed during image generation.
Most diffusion models use a component called CLIP (Contrastive Language-Image Pre-training), which contains a text encoder that translates your prompt into a numerical representation the image generation model can understand. This text encoder is a neural network that processes text through multiple layers, each extracting different levels of meaning and context.
Clip Skip determines how many layers from the end of this text encoder to skip when extracting embeddings:
- Lower skip values include more of the deeper layers, which tend to focus on abstract or stylistic aspects of your prompt.
- Higher skip values push the model to rely on earlier layers, which emphasize more literal and concrete interpretations.
Tuning this setting can affect how strictly the model follows your prompt versus how much creative interpretation it applies.
For sticker-style images, a clipSkip value of 2 is often preferred, leading to a simpler, cleaner result that better fits the minimalistic style expected of a sticker.
For photorealistic portrait images, where capturing fine details and realism is more important, leaving clipSkip unset produces a richer and more detailed image that better matches the intended outcome.
ClipSkip only applies to models that use the CLIP text encoder, such as SD 1.5 and SDXL-based models. Other models that rely on different text encoders (like T5 or LLaMA) will not be affected by this parameter.
Note that SDXL models already skip one layer by default, so setting ClipSkip to 2 with SDXL effectively skips three layers from the original encoder.
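For example, the sticker-style case described above could be expressed as (a minimal sketch):
[
  {
    // other parameters...
    "clipSkip": 2
  }
]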
Advanced features
Beyond the core parameters, several advanced features can significantly enhance your text-to-image generations.
Refiner: Two-stage generation
SDXL refiner models implement a two-stage generation process that can significantly enhance image quality. While the base model creates the initial image with overall composition and content, a refiner model specializes in improving details and textures.
Our refiner model implementation uses the Ensemble of Expert Denoisers method, a technical approach where image generation begins with the base model and concludes with the refiner model. Importantly, this is a continuous process with no intermediate image generated. Instead, the base model processes the latent tensor for a specified number of steps, then hands it off to the refiner model to complete the remaining steps.
The process works as follows:
- The SDXL base model begins denoising the random latent tensor.
- At a specified transition point (controlled by startStep or startStepPercentage), the refiner model takes over and continues denoising from this exact point, specializing in enhancing details, textures, and overall coherence.
- The final image is generated only after the refiner completes its processing.
The refiner parameter is an object that contains several sub-parameters:
[
{
// other parameters...
"refiner": {
"model": "civitai:101055@128080",
"startStepPercentage": 90
}
}
]
The refiner model is specifically trained to excel at detail enhancement in the final denoising stages, not for the entire generation process. Starting the refiner too early can produce poor results, as these models lack the capability to properly form basic compositions and structures. For optimal results, limit the refiner to the final 5-15% of steps.
ControlNet: Structural guidance
ControlNet provides precise structural control over the generation process by using conditioning images (guide images) to direct how the model creates specific aspects of the output. It works by integrating additional visual guidance into the model's generation pipeline, allowing specific visual elements to influence the creation process alongside your text prompt.
These conditioning mechanisms can interpret various types of visual guidance, including edge maps (like Canny or MLSD) for structural guidance, depth maps for spatial composition, pose detection for human positioning, segmentation maps for object placement, among others.
Each ControlNet model is trained to work with a specific type of preprocessed guidance image. The workflow typically involves:
- First, preprocessing your reference image using our ControlNet preprocessing tools to generate the appropriate guidance image (edge map, depth map, pose detection, or any other type of guidance image).
- Then, providing this preprocessed guidance image as the guideImage parameter along with the corresponding ControlNet model and settings.
- During generation, the system uses this preprocessed guidance to influence the creation process, balancing this structural guidance with your text prompt based on the weight parameter.
This two-step process (preprocessing + inference) gives you precise control over how the structural guidance is prepared and applied.
The controlNet parameter is an array that can contain multiple ControlNet models. Each model can have its own settings.
[
{
// other parameters...
"controlNet": [{
"model": "runware:25@1",
"guideImage": "56f8916f-1a33-49cb-b67f-2c4f48472563",
"startStep": 1,
"endStep": 10,
"weight": 1.0,
"controlMode": "balanced"
}]
}
]
The weight parameter controls how strongly the ControlNet guidance influences the generation process. The more weight you give to the ControlNet guidance, the more it will influence the final image.
The timing parameters determine when the ControlNet guidance is applied during the generation process. The startStep/startStepPercentage and endStep/endStepPercentage parameters define the specific steps when guidance begins and ends (e.g., steps 1-10 of a 30-step generation).
These timing controls offer strategic advantages:
- Starting guidance later (higher startStep) allows more creative initial formation before structural guidance kicks in.
- Ending guidance earlier (lower endStep) lets your prompt take control for final detailing.
Different timing strategies produce distinctly different results, making these parameters powerful tools for fine-tuning exactly how and when structural guidance shapes your image. Play with different timing strategies to discover the perfect balance.
The controlMode parameter determines how the ControlNet guidance is applied relative to the base model's generation process. This parameter works alongside weight to fine-tune exactly how structural guidance interacts with text instructions.
LoRAs: Style and subject adapters
LoRAs (Low-Rank Adaptations) are lightweight neural network adjustments that modify a base model's behavior to enhance specific styles, subjects, or concepts. Technically, LoRAs work by applying small, targeted changes to the weights of specific layers in the generation model, effectively "teaching" it new capabilities without changing the entire model.
Each LoRA model contains specialized knowledge that can significantly influence the output when combined with your prompt. This knowledge often focuses on particular artistic styles. Other LoRAs may be trained on specific subject matter, such as a particular person or character. Some go even further, embedding abstract visual concepts such as certain composition techniques, color palettes, lighting dynamics, or aesthetic rules. By integrating a LoRA into the generation process, you effectively inject this visual expertise into your prompt, allowing for greater control and consistency in the output.
The lora parameter is an array that can contain multiple LoRA models. Each model can have its own settings.
[
{
// other parameters...
"lora": [{
"model": "civitai:120096@135931",
"weight": 1.0
}]
}
]
Mixing multiple LoRAs allows for fascinating combinations, such as mixing an artistic style LoRA with a subject matter LoRA. When using multiple LoRAs together, consider using slightly lower weights for each to prevent them from competing too strongly, as shown in the sketch below.
LoRAs achieve their effect through mathematically low-rank decomposition of weight changes, which is why they can be so small (typically 50-150MB) compared to full models (6-25GB). This efficiency allows for mixing multiple specialized adaptations without the computational cost of full model swapping.
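Here's a sketch of that kind of combination, with reduced weights for each LoRA; the first identifier reuses the example above, while the second is a hypothetical placeholder for a subject LoRA:
[
  {
    // other parameters...
    "lora": [
      { "model": "civitai:120096@135931", "weight": 0.8 },
      { "model": "civitai:00000@00000", "weight": 0.7 } // hypothetical placeholder
    ]
  }
]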
Embeddings: Custom concepts
Embeddings (also called Textual Inversions) are specialized text tokens that encapsulate complex visual concepts, styles, or subjects. Unlike LoRAs which modify the model's weights, embeddings work by teaching the model's text encoder to recognize new tokens that represent specific visual ideas.
An embedding creates a representation of a visual concept derived from training images. When the embedding is applied, the model interprets it as an instruction to either include or avoid the associated visual concept, depending on whether it's used positively or negatively.
Embeddings are particularly useful when you need to generate consistent results across multiple runs or capture concepts that are difficult to express with plain text prompts. They are often used for the accurate and repeatable generation of specific characters or subjects, ensuring that key facial features, outfits, or poses remain stable. Embeddings can also encode distinctive artistic styles, allowing you to apply a unique aesthetic even if it's hard to describe explicitly. In workflows that require visual consistency across multiple generations, embeddings provide a compact and powerful way to anchor those visual traits.
Negative embeddings are especially useful for fixing common issues like distorted hands, unrealistic anatomy, or other artifacts. For example, a "hand-fixing" embedding used with a negative weight can significantly improve hand details without changing your overall image concept.
Embeddings are directly added to your request through the embeddings array parameter, which can include multiple embeddings simultaneously.
[
{
// other parameters...
"embeddings": [
{ "model": "civitai:118418@134583", "weight": 1.5 },
{ "model": "civitai:98259@539032", "weight": 0.8 }
]
}
]
The weight parameter controls how strongly the embedding influences the generation, with a range from -4.0 to 4.0. Positive weights enhance or add the embedded concept to your generation, while negative weights suppress or remove that concept. Higher absolute values create stronger influence in either direction.
While both LoRAs and embeddings can influence style and content, they work differently. Embeddings integrate directly with the prompt processing pipeline, while LoRAs modify the generation model itself. For maximum control, these techniques can be combined in the same generation, for example using a style LoRA with a negative artifact-fixing embedding (see the sketch below).
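As a final sketch, such a combination might look like this; the identifiers reuse the examples above purely as placeholders, and the negative weight is illustrative:
[
  {
    // other parameters...
    "lora": [
      { "model": "civitai:120096@135931", "weight": 1.0 }
    ],
    "embeddings": [
      { "model": "civitai:118418@134583", "weight": -2.0 } // negative weight to suppress the embedded concept
    ]
  }
]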