ControlNet: Structural guidance for image generation

Guide the structure of generated images with conditioning inputs such as edge maps, depth maps, and pose skeletons.

Introduction

ControlNet provides precise structural control over generation by using conditioning images (guide images) to direct how the model renders specific aspects of the output. This visual guidance is applied inside the generation pipeline alongside your text prompt, so both shape the result.

While ControlNet originated in image generation, the underlying concept of conditioning a model with structural guidance has expanded to video generation as well. Video-aware ControlNet models can enforce consistent poses, depth, or edges across frames to maintain temporal coherence.

ControlNet models can interpret several types of visual guidance: edge maps (such as Canny or MLSD) for outlines and line structure, depth maps for spatial composition, pose skeletons for human positioning, and segmentation maps for object placement.

Girl with a Pearl Earring by Johannes Vermeer — Original

Canny edge detection map of Girl with a Pearl Earring, showing white outlines on a black background — Canny edge map

Depth map of Girl with a Pearl Earring, showing grayscale depth information — Depth map

OpenPose skeleton overlay of Girl with a Pearl Earring, showing detected body keypoints — OpenPose image

Blurred version of Girl with a Pearl Earring used as structural guidance — Blurred image

Grayscale version of Girl with a Pearl Earring used as tonal guidance — Grayscale image

A futuristic woman with iridescent skin wearing a chrome headwrap and pearl earring in cyberpunk style — runware:25@1

How it works

Each ControlNet model is trained to work with a specific type of preprocessed guidance image. The workflow involves two steps:

1. Preprocess your reference image using our ControlNet preprocessing tools to generate the appropriate guidance image (edge map, depth map, pose skeleton, or other guidance type).
2. Provide this preprocessed guidance image as the guideImage parameter, along with the corresponding ControlNet model and settings. During generation, the system uses this guidance to influence the output, balancing structural control with your text prompt based on the weight parameter.

This preprocessing + inference workflow gives you explicit control over how the structural guidance is prepared and applied.
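The two steps above can be sketched as a small Python function. This is a minimal illustration under stated assumptions, not the actual client API: `preprocess` and `generate` are hypothetical callables standing in for the preprocessing and inference calls, and the `prompt` key is a placeholder name. Only `controlNet`, `model`, `guideImage`, and `weight` come from this page.

```python
from typing import Callable


def run_controlnet_workflow(
    reference_image: str,
    guide_type: str,
    controlnet_model: str,
    prompt: str,
    preprocess: Callable[[str, str], str],
    generate: Callable[[dict], str],
    weight: float = 1.0,
) -> str:
    """Preprocess a reference image, then generate with the result as guideImage."""
    # Step 1: turn the reference image into a guidance image (edge map, depth map, ...).
    guide_image = preprocess(reference_image, guide_type)

    # Step 2: pass the guidance image to inference with a matching ControlNet model.
    request = {
        "prompt": prompt,  # hypothetical key name, not taken from this page
        "controlNet": [
            {"model": controlnet_model, "guideImage": guide_image, "weight": weight}
        ],
    }
    return generate(request)


# Stand-in implementations so the sketch runs without any network access.
def fake_preprocess(image: str, guide_type: str) -> str:
    return f"{guide_type}-guide-of-{image}"


sent_requests = []


def fake_generate(request: dict) -> str:
    sent_requests.append(request)
    return "generated-image-id"


result = run_controlnet_workflow(
    "vermeer.png",
    "canny",
    "example-canny-controlnet",  # hypothetical model ID
    "a futuristic woman in cyberpunk style",
    fake_preprocess,
    fake_generate,
)
```

Passing the two calls in as functions keeps the order of operations visible: the guide image must exist before the inference request is built.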

Request structure

The controlNet parameter is an array of objects, one per ControlNet model, and each object carries its own settings.

[
  {
    // other parameters...
    "controlNet": [{
      "model": "runware:25@1",
      "guideImage": "56f8916f-1a33-49cb-b67f-2c4f48472563",
      "startStep": 1,
      "endStep": 10,
      "weight": 1.0,
      "controlMode": "balanced"
    }]
  }
]
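As a rough sketch, a single entry of the controlNet array can be assembled and checked in Python before it is sent. The helper name `make_controlnet_entry` is hypothetical, and the validation rules (known controlMode values, startStep not after endStep) are assumptions of this sketch rather than documented server behavior. The field names match the example above.

```python
import json

# The three controlMode values described on this page.
CONTROL_MODES = {"balanced", "prompt", "controlnet"}


def make_controlnet_entry(
    model,
    guide_image,
    weight=1.0,
    start_step=None,
    end_step=None,
    control_mode="balanced",
):
    """Build one controlNet array entry as a plain dict (hypothetical helper)."""
    if control_mode not in CONTROL_MODES:
        raise ValueError(f"unknown controlMode: {control_mode!r}")
    # Assumption of this sketch: a window that ends before it starts is a mistake.
    if start_step is not None and end_step is not None and start_step > end_step:
        raise ValueError("startStep must not be greater than endStep")

    entry = {
        "model": model,
        "guideImage": guide_image,
        "weight": weight,
        "controlMode": control_mode,
    }
    # Only include timing fields that were actually set.
    if start_step is not None:
        entry["startStep"] = start_step
    if end_step is not None:
        entry["endStep"] = end_step
    return entry


entry = make_controlnet_entry(
    "runware:25@1",
    "56f8916f-1a33-49cb-b67f-2c4f48472563",
    start_step=1,
    end_step=10,
)
payload_fragment = json.dumps({"controlNet": [entry]}, indent=2)
```

Here `payload_fragment` holds only the controlNet portion of a request. The other request parameters are omitted, as in the JSON example.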

Weight and timing

The weight parameter controls how strongly the ControlNet guidance influences the generation process. Higher values force the output to follow the guide image more closely, while lower values allow the text prompt to dominate.

The timing parameters determine when ControlNet guidance is active during generation. Set the start with startStep or startStepPercentage and the end with endStep or endStepPercentage, that is, either as an absolute step number or as a percentage of the total steps (for example, steps 1-10 of a 30-step generation).

These timing controls offer strategic advantages:

Starting guidance later (higher startStep) allows more creative initial formation before structural guidance kicks in.
Ending guidance earlier (lower endStep) lets your prompt take control for final detailing.

Different timing strategies produce distinctly different results. A common approach is applying ControlNet only during the first 30-50% of steps to lock in composition, then letting the prompt and model handle textures and fine detail in the remaining steps.
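To make the arithmetic concrete, here is a small sketch that converts a percentage window into absolute step numbers. The function name `guidance_window` is hypothetical, and the sketch assumes steps are numbered from 1, as in the "steps 1-10 of a 30-step generation" example. How the service itself rounds percentages is not stated here.

```python
from typing import Tuple


def guidance_window(
    total_steps: int, start_percent: int = 0, end_percent: int = 40
) -> Tuple[int, int]:
    """Map a percentage window onto 1-indexed (start_step, end_step)."""
    if total_steps < 1:
        raise ValueError("total_steps must be at least 1")
    if not 0 <= start_percent < end_percent <= 100:
        raise ValueError("need 0 <= start_percent < end_percent <= 100")

    # Integer arithmetic avoids floating-point rounding surprises.
    start_step = total_steps * start_percent // 100 + 1
    # Never let the window end before it starts (matters for very few steps).
    end_step = max(start_step, total_steps * end_percent // 100)
    return start_step, end_step


# Guidance during the first 40% of a 30-step generation.
start_step, end_step = guidance_window(30, 0, 40)
timing = {"startStep": start_step, "endStep": end_step}
```

With 30 steps and a 0-40% window, this gives startStep 1 and endStep 12, leaving the remaining 18 steps to the prompt and base model.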

Control mode

The controlMode parameter determines how the ControlNet guidance is applied relative to the base model's generation process:

balanced: Equal influence between the ControlNet guidance and the text prompt. A safe default.
prompt: Gives more weight to the text prompt, useful when you want the structure as a loose guide rather than a hard constraint.
controlnet: Gives more weight to the guide image, useful when structural accuracy is more important than prompt creativity.

Tips

Match the preprocessor to your intent. A Canny edge map preserves fine outlines (good for architecture, product shots), while a depth map preserves spatial relationships without edge detail (good for scenes and compositions).
Use lower weights for creative work. Weights of 0.4-0.6 let the model interpret the guide loosely, which works well for artistic output. Weights above 0.8 enforce strict adherence, better suited for technical or reproduction tasks.
Combine multiple ControlNets. You can stack a depth map with a pose skeleton to control both scene layout and character positioning simultaneously. When stacking, lower the individual weights (0.3-0.5 each) to prevent the guides from competing.
Preprocess at the right resolution. The guide image should match your target output dimensions. Mismatched resolutions can cause alignment issues between the guide and the generated content.
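The stacking and resolution tips can be sketched as two small helpers. Both function names, the model IDs, and the guide image IDs below are hypothetical placeholders. The 0.5 cap mirrors the upper end of the 0.3-0.5 range suggested above and is a choice of this sketch, not an enforced limit.

```python
def stack_controlnets(guides, max_stacked_weight=0.5):
    """Build a controlNet array from (model, guide_image, weight) tuples.

    When more than one guide is stacked, each weight is capped so the
    guides are less likely to compete with each other.
    """
    stacking = len(guides) > 1
    entries = []
    for model, guide_image, weight in guides:
        if stacking:
            weight = min(weight, max_stacked_weight)
        entries.append({"model": model, "guideImage": guide_image, "weight": weight})
    return entries


def guide_matches_output(guide_size, output_size):
    """Return True when the guide's (width, height) equals the target output size."""
    return tuple(guide_size) == tuple(output_size)


# Depth controls scene layout, pose controls character positioning.
stacked = stack_controlnets(
    [
        ("example-depth-controlnet", "depth-guide-id", 0.8),  # hypothetical IDs
        ("example-pose-controlnet", "pose-guide-id", 0.4),
    ]
)
```

A single guide passes through with its weight unchanged. The cap applies only when two or more guides share the request.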