IP Adapters: Reference-based image generation
Use reference images to guide generation, enabling style transfer and visual consistency across outputs.
Introduction
IP-Adapters (Image Prompt Adapters) enable reference-based generation, where one or more input images guide the visual output alongside your text prompt. Unlike standard image-to-image, which directly transforms the input, IP-Adapters extract visual features from reference images and inject them into the generation process, producing entirely new content that inherits those features.
This makes IP-Adapters the right tool when you want to transfer a visual quality (a style, a color palette, a subject's appearance) into a new composition without being constrained by the reference image's layout or structure.
[Images: "Close-up of a realistic honeybee standing on a wooden surface" and "robot"]
How it works
IP-Adapters work by passing your reference images through a vision encoder (typically a CLIP image encoder) that extracts high-level visual features: color distributions, textures, shapes, stylistic patterns, and subject characteristics. These features are then injected into the generation model's attention layers, where they influence the output alongside your text prompt.
The key difference from image-to-image is structural. Image-to-image uses the reference as a pixel-level starting point, preserving layout and composition while modifying details. IP-Adapters use the reference as a feature-level conditioning signal, preserving visual qualities while allowing entirely new compositions.
This means you can:
- Provide a photo of a product and generate it in a completely different scene.
- Use an artwork as a style reference and apply that aesthetic to any subject.
- Feed multiple references to blend visual characteristics from different sources.
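The feature injection described above can be illustrated with a toy version of decoupled cross-attention, the mechanism from the original IP-Adapter design: text tokens and image features each get their own attention pass, and the image result is scaled by the weight before the two are added. This is a minimal pure-Python sketch for intuition only. The function names are hypothetical, and the implementation behind the API may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def decoupled_cross_attention(query, text_kv, image_kv, weight):
    """Add weighted image-feature attention to text attention (illustrative)."""
    text_out = attend(query, *text_kv)
    image_out = attend(query, *image_kv)
    return [t + weight * i for t, i in zip(text_out, image_out)]


# One text token and one image token, each given as (keys, values).
text_kv = ([[1.0, 0.0]], [[1.0, 2.0]])
image_kv = ([[0.0, 1.0]], [[10.0, 20.0]])
print(decoupled_cross_attention([0.5, 0.5], text_kv, image_kv, weight=0.5))
# prints [6.0, 12.0]
```

With weight set to 0 the image branch contributes nothing and the result equals plain text attention, which is why lower weights leave more room for the prompt.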
Request structure
The ipAdapters parameter is an array of IP-Adapter configurations. Each configuration specifies a model, reference images, and influence settings.
[
  {
    "taskType": "imageInference",
    "model": "civitai:101055@128078",
    "positivePrompt": "robot",
    "ipAdapters": [{
      "model": "runware:55@4",
      "guideImages": ["a1b2c3d4-e5f6-7890-abcd-ef1234567890"],
      "weight": 0.8
    }],
    "steps": 30,
    "width": 1024,
    "height": 1024
  }
]
The required fields are model (which IP-Adapter architecture to use) and guideImages (an array of image UUIDs, URLs, or base64 data). Optional parameters include:
- weight: Controls the strength of the reference image's influence (0.0 to 2.0). Higher values make the output look more like the reference.
- combineMethod: How multiple guide images are combined when you provide more than one.
- embedScaling: Controls how the visual embeddings are scaled during injection.
- weightType: Determines how weight is distributed across the model's attention layers.
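As a rough illustration of the request structure, the sketch below assembles one ipAdapters entry in Python and checks the constraints stated above (required model and guideImages, weight between 0.0 and 2.0). The helper build_ip_adapter is hypothetical and not part of any SDK. The accepted values for combineMethod, embedScaling, and weightType are not listed here, so the sketch passes them through unvalidated.

```python
import json

# Optional fields named in the text; their accepted values are not listed
# there, so this sketch passes them through without validating them.
OPTIONAL_FIELDS = {"combineMethod", "embedScaling", "weightType"}


def build_ip_adapter(model, guide_images, weight=None, **optional):
    """Return one entry for the ipAdapters array (hypothetical helper)."""
    if not model:
        raise ValueError("model is required")
    if not guide_images:
        raise ValueError("guideImages needs at least one image")
    unknown = set(optional) - OPTIONAL_FIELDS
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")

    config = {"model": model, "guideImages": list(guide_images)}
    if weight is not None:
        if not 0.0 <= weight <= 2.0:
            raise ValueError("weight must be between 0.0 and 2.0")
        config["weight"] = weight
    config.update(optional)
    return config


adapter = build_ip_adapter(
    "runware:55@4",
    ["a1b2c3d4-e5f6-7890-abcd-ef1234567890"],
    weight=0.8,
)
request = [{
    "taskType": "imageInference",
    "model": "civitai:101055@128078",
    "positivePrompt": "robot",
    "ipAdapters": [adapter],
    "steps": 30,
    "width": 1024,
    "height": 1024,
}]
print(json.dumps(request, indent=2))
```

Validating on the client side like this catches an out-of-range weight or a missing reference before the request is sent.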
IP Adapters vs. image-to-image vs. ControlNet
These three features all use reference images, but they serve different purposes:
| Feature | What it preserves | What changes | Best for |
|---|---|---|---|
| Image-to-image | Layout, composition, overall structure | Colors, textures, style, detail level | Restyling, upscaling, prompt-guided modifications |
| ControlNet | Specific structural elements (edges, depth, pose) | Everything else | Precise structural control, pose matching |
| IP Adapters | Visual style, subject features, color palette | Composition, layout, scene | Style transfer, subject consistency, visual references |
The choice depends on what you want to keep from the reference. If you want to keep the layout, use image-to-image. If you want to keep specific structural elements like edges or poses, use ControlNet. If you want to keep the visual "feel" while generating something new, use IP Adapters.
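The decision rule above can be written as a small lookup. This is only a sketch of the comparison table; the goal labels ("layout", "structure", "look") are hypothetical names chosen for the example.

```python
# Maps what you want to keep from the reference to the feature the table suggests.
FEATURE_BY_GOAL = {
    "layout": "image-to-image",   # keep layout and composition
    "structure": "ControlNet",    # keep edges, depth, or pose
    "look": "IP-Adapters",        # keep style, subject features, palette
}


def choose_feature(keep):
    """Return the suggested feature for what should be preserved."""
    try:
        return FEATURE_BY_GOAL[keep]
    except KeyError:
        options = ", ".join(sorted(FEATURE_BY_GOAL))
        raise ValueError(f"unknown goal {keep!r}; expected one of: {options}") from None


print(choose_feature("look"))  # prints IP-Adapters
```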
FLUX Redux
FLUX Redux (runware:105@1) is an IP-Adapter model built specifically for the FLUX architecture. It enables image variation generation: given a reference image, it reproduces the image with variations, letting you refine existing images or create multiple alternatives from a single reference.
FLUX Redux is configured differently from standard IP-Adapters:
- Use runware:105@1 as the IP-Adapter model.
- Provide the input image in the guideImage parameter inside the ipAdapters object.
- Use a FLUX base model (typically runware:101@1) as the generation model. Other FLUX models work too.
- There is no weight parameter. The positivePrompt has no effect on the output.
[
  {
    "taskType": "imageInference",
    "model": "runware:101@1",
    "positivePrompt": "__BLANK__",
    "ipAdapters": [{
      "guideImage": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "model": "runware:105@1"
    }],
    "steps": 30,
    "width": 1024,
    "height": 1024
  }
]
Use __BLANK__ as your positivePrompt to generate pure variations without any text guidance. This tells the model to focus exclusively on the visual information from the reference image.
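The FLUX Redux differences can be captured in a small builder. The sketch below mirrors the example request: a single guideImage, no weight, and __BLANK__ as the prompt. The function name build_redux_request and its default arguments are assumptions for illustration, not part of any SDK.

```python
import json

REDUX_MODEL = "runware:105@1"          # FLUX Redux IP-Adapter model
DEFAULT_FLUX_BASE = "runware:101@1"    # typical FLUX base model


def build_redux_request(guide_image, base_model=DEFAULT_FLUX_BASE,
                        steps=30, width=1024, height=1024):
    """Build a FLUX Redux variation request (hypothetical helper)."""
    if not guide_image:
        raise ValueError("guide_image is required")
    return [{
        "taskType": "imageInference",
        "model": base_model,
        # No text guidance: the reference image drives the output.
        "positivePrompt": "__BLANK__",
        # Redux takes a single guideImage and has no weight parameter.
        "ipAdapters": [{"guideImage": guide_image, "model": REDUX_MODEL}],
        "steps": steps,
        "width": width,
        "height": height,
    }]


redux_request = build_redux_request("a1b2c3d4-e5f6-7890-abcd-ef1234567890")
print(json.dumps(redux_request, indent=2))
```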
Tips
- Start with weight 0.5-0.7. Full weight (1.0+) often overpowers the text prompt, making the output a near-copy of the reference. Lower weights give a balanced blend of reference style and prompt content.
- Use clear, well-lit reference images. The vision encoder extracts features better from images with good contrast and clear subjects. Dark, noisy, or cluttered references produce weaker conditioning.
- Combine with text prompts for control. IP-Adapters work best when paired with a descriptive prompt: the prompt defines what to generate, and the reference defines how it should look. Without a prompt, the model tends to reproduce the reference too literally.
- Multiple references for style blending. You can pass multiple images in the guideImages array to blend characteristics from different sources. Use lower weights (0.3-0.5) when blending to prevent one reference from dominating.
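The weight guidance in these tips can be turned into a starting-point helper. This is a sketch under stated assumptions: the values 0.6 and 0.4 are simply midpoints of the suggested ranges (0.5-0.7 for a single reference, 0.3-0.5 when blending), the helper names are hypothetical, and the image URLs are placeholders.

```python
def starting_weight(num_references):
    """Pick a first-try weight from the ranges suggested in the tips."""
    if num_references < 1:
        raise ValueError("at least one reference image is required")
    # Single reference: middle of 0.5-0.7. Blending: middle of 0.3-0.5.
    return 0.6 if num_references == 1 else 0.4


def build_blend_adapter(model, guide_images):
    """Build an ipAdapters entry with a conservative starting weight."""
    images = list(guide_images)
    return {
        "model": model,
        "guideImages": images,
        "weight": starting_weight(len(images)),
    }


# Placeholder URLs; replace with your own image UUIDs, URLs, or base64 data.
blend = build_blend_adapter(
    "runware:55@4",
    ["https://example.com/style-a.png", "https://example.com/style-b.png"],
)
print(blend["weight"])  # prints 0.4
```

From there, adjust the weight up or down in small steps and compare outputs rather than jumping straight to 1.0 or higher.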