Model ID: openai:gpt-image@2 (live)

GPT Image 2

by OpenAI

GPT Image 2 is a general-purpose GPT Image family model for text-to-image generation and image editing. Its strengths include strong prompt adherence, readable embedded text, detailed edits, photorealistic rendering, and structured visual outputs such as posters, packaging, product comps, diagrams, and other layout-sensitive images.

Prompting GPT Image 2

How to get the most out of GPT Image 2. Covers prompt format tricks, photorealism, text rendering, infographics, world knowledge, ad creatives, and multi-image workflows for editing, style transfer, character consistency, and compositing.

Introduction

GPT Image 2 is an LLM-based image model. Unlike diffusion models that compress your prompt into fixed-size embeddings, this model reads the full text the same way a chat model does. That changes what you can put in a prompt: structured formats, inline negation, pseudocode, and long design briefs all work because the model reads the prompt as language instead of reducing it to a single embedding.

This guide covers what GPT Image 2 does differently: photorealism, text rendering, infographics, world knowledge, ad creatives, multi-image workflows, and prompt format tricks.

Photorealism

GPT Image 2 renders photorealistic images with natural lighting, real skin texture, plausible imperfections, and surface detail that most diffusion models smooth away. Two prompt habits push it further:

Say "photorealistic" explicitly. The word strongly engages the model's photorealistic mode. Phrases like "real photograph" or "shot on a real camera" also work, but "photorealistic" is the most reliable single trigger.

Use camera language for composition, not precision. Lens focal lengths, depth of field, film stock, and exposure hints guide the overall look. The model interprets them loosely (a "50mm lens" won't produce optically accurate focal-length behavior), but the compositional intent comes through clearly.

The first prompt reads like a documentary brief: subject, detail cues (stubble, frayed cuffs), film stock, and lighting. The second is a moody urban scene with atmospheric lighting. Neither prompt micromanages the composition. The model fills in plausible reflections, shadow angles, surface wear, and material textures because it understands what these scenes look like in reality.
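As a separate illustration of the two habits above (not a reconstruction of the prompts just discussed), here is a minimal Python sketch that composes a prompt naming photorealism explicitly and then appending loose camera language. The helper name `photoreal_prompt` and the example subject are hypothetical.

```python
def photoreal_prompt(subject, details, camera_hints):
    """Compose a prompt that says "photorealistic" up front and adds loose camera cues.

    Hypothetical helper for illustration; the wording of the prompt is an assumption.
    """
    parts = [f"Photorealistic photograph of {subject}."]
    if details:
        parts.append("Details: " + ", ".join(details) + ".")
    if camera_hints:
        # Camera terms steer the overall look; the model treats them loosely.
        parts.append("Camera: " + ", ".join(camera_hints) + ".")
    return " ".join(parts)


prompt = photoreal_prompt(
    "an elderly fisherman mending a net on a dock",
    ["weathered hands", "frayed cuffs"],
    ["35mm lens", "shallow depth of field", "overcast morning light"],
)
print(prompt)
```

Keeping the detail cues and the camera cues in separate lists makes it easy to vary one while holding the other fixed.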

Text in images

GPT Image 2 renders in-image text more reliably than most image models. Three habits improve it further:

Quote the text literally. Wrap any text that must appear verbatim in straight quotes inside the prompt. Without quotes, the model treats your words as suggestions and will rewrite them. With quotes, the model treats them as literal output to render.

Specify the typographic treatment. Examples: "bold serif gold typography across the lower third", "small sans-serif white tagline", "single-column list, one item per line". The model can render the same string in dozens of ways, and the prompt is where you pick.

Use quality: "high" for dense or small text. Menu items, infographic labels, slide footnotes, packaging copy: anything that will be read rather than glanced at benefits from high. Larger headline text usually renders well at medium.

When the model rewrites a string or splices in extra letters, two prompt-level fixes usually work: add "render text verbatim, exactly as written, no extra characters" after the quoted string, and spell unusual words letter-by-letter the first time you mention them.
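The three habits and the verbatim fix can be combined in a small prompt builder. This is a minimal sketch in Python; the function names `verbatim_text_prompt` and `pick_quality` and the example scene are hypothetical, and only the quoting convention, the "verbatim" sentence, and the quality values come from the guidance above.

```python
def verbatim_text_prompt(scene, text, treatment):
    """Wrap the literal text in straight quotes and append the verbatim instruction."""
    if '"' in text:
        # A straight quote inside the text would end the quoted span early.
        raise ValueError("text must not contain straight double quotes")
    return (
        f'{scene} Render the text "{text}" in {treatment}. '
        "Render text verbatim, exactly as written, no extra characters."
    )


def pick_quality(dense_or_small_text):
    """Use "high" for text that will be read closely, "medium" for large headlines."""
    return "high" if dense_or_small_text else "medium"


prompt = verbatim_text_prompt(
    "Minimal poster for a science museum.",
    "Stay Curious",
    "bold serif gold typography across the lower third",
)
provider_settings = {"openai": {"quality": pick_quality(dense_or_small_text=False)}}
print(prompt)
```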

Infographics and structured visuals

Infographics, diagrams, slides, and charts are where GPT Image 2 pulls ahead of most image models. The combination of reliable text rendering and layout reasoning means you can generate structured visuals with real labels and readable data in a single pass.

Prompt these like a design brief, not an illustration request: name the deliverable (infographic, flowchart, dashboard), describe the data to cover, set the visual system (color palette, typography, chart types), and add constraints (no filler, no watermark). Use quality: "high" for these. Dense labels and small text need it.

The prompt above names the deliverable, lists six data blocks to include, sets a color system, and lets the model generate the actual numbers. A wide canvas (1792 × 1024) gives the model room for a map and dashboard side by side.

For process-oriented infographics, you can prompt conversationally: describe what you want to learn rather than specifying every label. Use a tall vertical canvas (1024 × 1792) to give the model room for many stages.

This prompt doesn't specify water volumes or wavelength numbers. The model supplies them because it understands tree biology. For technical subjects where you want accurate content but don't want to research the exact data points yourself, a conversational prompt lets the model do the research.
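To make the canvas choice concrete, the sketch below builds an infographic task in Python with the wide or tall dimensions mentioned above and quality set to "high". The helper `infographic_task`, the `CANVAS` table, and the example prompt are illustrative assumptions; the field names follow the sample request shown later in this guide.

```python
import json
import uuid

# Canvas sizes named in this guide: wide for side-by-side layouts, tall for many stages.
CANVAS = {"wide": (1792, 1024), "tall": (1024, 1792), "square": (1024, 1024)}


def infographic_task(prompt, layout):
    """Build one imageInference task for a structured visual (hypothetical helper)."""
    width, height = CANVAS[layout]
    return {
        "taskType": "imageInference",
        "taskUUID": str(uuid.uuid4()),
        "model": "openai:gpt-image@2",
        "positivePrompt": prompt,
        "width": width,
        "height": height,
        # Dense labels and small text benefit from high quality.
        "providerSettings": {"openai": {"quality": "high"}},
    }


task = infographic_task(
    "Vertical infographic: explain step by step how a tree moves water from "
    "its roots to its leaves. Clean flat style, green and blue palette, "
    "no filler, no watermark.",
    layout="tall",
)
print(json.dumps([task], indent=2))
```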

World knowledge

Because GPT Image 2 is built on an LLM, it carries factual knowledge into image generation. You can reference real events and historical periods by context rather than detailed visual description, and the model fills in the rest.

The prompt says "Bethel, New York on August 16, 1969" without ever mentioning Woodstock, tie-dye, or a concert stage. The model infers the event from the date and location and renders a period-accurate crowd scene. This is the kind of reasoning that diffusion models cannot do: connecting factual knowledge to visual output.

Ad creatives

Ad generation combines photorealism, text rendering, compositional reasoning, and brand direction into a single practical workflow. Prompt these like a creative brief: name the brand, describe the audience, set the visual tone, and include the exact copy.

The model handles brand positioning (youth streetwear), art direction (golden-hour, concrete, layered outfits), text rendering (clean tagline), and layout in a single pass. For campaign exploration, request numberResults: 3 or 4 to get visual variety without re-prompting.
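A campaign-exploration request might look like the following Python sketch. It assumes numberResults sits at the top level of the task next to width and height, which this guide does not show explicitly; the helper `ad_task`, the brand name, and the brief are hypothetical.

```python
import json
import uuid


def ad_task(brief, variations=3):
    """Build an ad-creative task that asks for several variations in one call.

    Assumption: "numberResults" is a top-level task field.
    """
    if variations < 1:
        raise ValueError("variations must be at least 1")
    return {
        "taskType": "imageInference",
        "taskUUID": str(uuid.uuid4()),
        "model": "openai:gpt-image@2",
        "positivePrompt": brief,
        "width": 1024,
        "height": 1024,
        "numberResults": variations,
    }


# "Northline" is a made-up brand used only for this example.
brief = (
    "Photorealistic ad for the youth streetwear brand Northline. Audience: "
    "city teenagers. Golden-hour light, concrete backdrop, layered outfits. "
    'Render the tagline "Wear the Street" verbatim in clean white sans-serif type.'
)
payload = ad_task(brief, variations=3)
print(json.dumps([payload], indent=2))
```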

Multi-image workflows

The inputs.referenceImages array accepts up to 16 reference images per request. The prompt language decides what the model does with them. Below are the patterns that come up most: style transfer, character consistency, product composites, and targeted edits.

Two conventions make multi-image prompts more reliable:

  • Label references by index when there's more than one. "Image 1 is the scene to preserve. Image 2 is the style reference." The model is much better at obeying spatial and stylistic instructions when references have explicit roles.
  • Be explicit about what to preserve and what to change. Multi-image work is where preserve lists matter most. "Preserve the bottle's shape, cap, label, and exact proportions" is the difference between a clean composite and a redesigned product.
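Both conventions are easy to automate. The Python sketch below generates the "Image N is ..." labels from a list of roles and appends a preserve list; the helper names `label_references` and `multi_image_prompt` are hypothetical, and the 16-image cap is the limit stated above.

```python
MAX_REFERENCE_IMAGES = 16  # per-request limit stated in this guide


def label_references(roles):
    """Turn a list of roles into "Image 1 is ... Image 2 is ..." sentences."""
    if len(roles) > MAX_REFERENCE_IMAGES:
        raise ValueError("too many reference images for one request")
    return " ".join(f"Image {i} is {role}." for i, role in enumerate(roles, start=1))


def multi_image_prompt(roles, instruction, preserve):
    """Combine role labels, the instruction, and an explicit preserve list."""
    preserve_clause = "Preserve " + ", ".join(preserve) + "."
    return f"{label_references(roles)} {instruction} {preserve_clause}"


prompt = multi_image_prompt(
    roles=["the scene to preserve", "the style reference"],
    instruction="Repaint the scene from image 1 in the style of image 2.",
    preserve=["the composition", "the position of every object"],
)
print(prompt)
```

The order of the roles must match the order of the URLs in inputs.referenceImages, since the labels refer to images by index.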

The examples below each use one reference image plus a prompt.

Style transfer

Apply the visual language of one image to a new subject. The reference carries palette, brushwork, paper texture, line weight, and media grain. The prompt supplies the new subject.

Character consistency

Reuse a specific character across multiple compositions. The reference fixes the character's design. The prompt places them in a new scene or activity.

For multi-page work (children's books, comic strips, illustrated docs), use the first generation as the anchor reference for every subsequent page. Don't re-prompt the character from scratch each time, because the model drifts.
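The anchor-reference pattern can be sketched as a loop that reuses one reference URL for every page. In this Python example the helper `page_tasks`, the anchor URL, and the scene descriptions are placeholders; the point is only that every task carries the same first-generation image in inputs.referenceImages.

```python
import uuid


def page_tasks(anchor_url, scenes):
    """Build one task per page, each anchored to the same character reference."""
    tasks = []
    for scene in scenes:
        tasks.append({
            "taskType": "imageInference",
            "taskUUID": str(uuid.uuid4()),
            "model": "openai:gpt-image@2",
            "positivePrompt": (
                "Use the exact character from the reference image, keeping "
                f"the same design, colors, and proportions. New scene: {scene}"
            ),
            "width": 1024,
            "height": 1024,
            # Always the first generation, never the previous page, so drift
            # does not accumulate from page to page.
            "inputs": {"referenceImages": [anchor_url]},
        })
    return tasks


# Placeholder URL standing in for the first generated image of the character.
anchor = "https://example.com/character-page-1.jpg"
tasks = page_tasks(anchor, ["riding a bicycle in the rain", "baking bread at night"])
print(len(tasks))
```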

Product composite

Place a specific product (with its real shape, label, and proportions intact) into a new scene. This is the workflow for swapping backgrounds on product photography without losing the identity of the product itself.

The preserve list ("preserve the bottle's shape, cap, label, and exact proportions, do not alter the bottle in any way") is doing most of the work here. Without it, the model often "improves" the product by simplifying the cap or rewriting the label.

Edit with a preserve list

The same multi-image surface handles edits. Pass the source image as the reference, then write the prompt as two halves: what changes, and what must stay exactly the same. The more explicit the preserve list, the cleaner the edit.

Edit prompts benefit from explicit redundancy. "Preserve the lettering exactly as it appears, in the same position, with the same kerning and font" reads heavy, but each clause prevents a different drift mode. Trying to be concise here ("keep the sign") leaves room for the model to "improve" the text.
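A two-part edit prompt can be assembled mechanically, which also makes it easy to restate the preserve list on every iteration. This Python sketch uses a hypothetical helper `edit_prompt`; the "Change:" and "Preserve exactly:" labels are an illustrative convention, not a syntax the model requires.

```python
def edit_prompt(changes, preserve):
    """Write an edit prompt as two halves: what changes, and what must stay the same."""
    if not changes or not preserve:
        raise ValueError("an edit prompt needs both a change list and a preserve list")
    change_part = "Change: " + "; ".join(changes) + "."
    preserve_part = "Preserve exactly: " + "; ".join(preserve) + "."
    return f"{change_part} {preserve_part}"


prompt = edit_prompt(
    changes=["replace the daytime sky with a clear night sky"],
    preserve=[
        "the lettering on the sign exactly as it appears",
        "its position, kerning, and font",
        "the building, street, and camera angle",
    ],
)
print(prompt)
```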

Sample request

A composite request showing the full shape:

[
  {
    "taskType": "imageInference",
    "taskUUID": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "model": "openai:gpt-image@2",
    "positivePrompt": "Place the exact perfume bottle from the reference image into a moody marble-bathroom scene...",
    "width": 1024,
    "height": 1024,
    "inputs": {
      "referenceImages": [
        "https://example.com/perfume-bottle.jpg"
      ]
    },
    "providerSettings": {
      "openai": {
        "quality": "high"
      }
    }
  }
]

Prompt format tricks

Because the model is an LLM, it understands structured input that diffusion models ignore. Three formats stand out.

In-prompt negative prompting

GPT Image 2 doesn't have a negativePrompt parameter, but you can write "negative prompt:" directly inside the positive prompt and the model respects it. Append it after the main description, separated by a line break.

The model treats the negative section as an exclusion list. This works for removing objects, styles, colors, or artifacts without needing a dedicated parameter.
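In code, the pattern amounts to appending one extra line to positivePrompt. A minimal Python sketch, with a hypothetical helper name `with_negative` and example exclusions chosen only for illustration:

```python
def with_negative(prompt, exclusions):
    """Append an in-prompt exclusion list on its own line after the description."""
    if not exclusions:
        return prompt
    joined = ", ".join(exclusions)
    return f"{prompt}\nnegative prompt: {joined}"


positive_prompt = with_negative(
    "Photorealistic studio shot of a ceramic teapot on a linen tablecloth.",
    ["steam", "people", "text", "watermark"],
)
print(positive_prompt)
```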

Pseudocode and function syntax

The model interprets function-like syntax as generative instructions. You can write full function calls with named parameters or drop constructs like pick(), random_color(), random_pose(), random_texture(), or any variation you define, inline within natural prompts.

The tarot prompt uses sum(3, 8) to produce the numeral and random_mythological() to pick the figure, while locking the palette and border. The animal prompt reads as natural English with pseudocode slots dropped in. Both are useful for batch generation where you want variety in specific dimensions while keeping the style locked.
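For batch generation, the locked parts of the prompt can be filled in by code while the pseudocode slots are passed through untouched for the model to resolve. The Python sketch below is an illustration of that split, not the prompts described above; the helper `pseudocode_prompt`, the animal list, and the style string are hypothetical.

```python
def pseudocode_prompt(animals, style):
    """Build a prompt whose variety comes from pseudocode slots the model resolves.

    Python only fills in the locked parts (the option list and the style);
    pick(), random_pose(), and random_color() are left as literal text.
    """
    options = ", ".join(f'"{animal}"' for animal in animals)
    return (
        f"A pick({options}) in a random_pose(), "
        f"wearing a random_color() scarf. Style: {style}."
    )


prompt = pseudocode_prompt(
    ["fox", "owl", "hedgehog"],
    "flat vector illustration, thick outlines, pastel palette",
)
print(prompt)
```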

JSON-structured prompts

You can pass a raw JSON object as the prompt. The model parses the keys and generates accordingly.

JSON prompts are most useful when you're generating images programmatically and want a structured, predictable interface between your code and the model. They also make it easy to swap individual values (change the subject, keep the lighting) without rewriting prose.
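One way to use this programmatically is to keep a base specification as a dictionary and serialize it into positivePrompt, overriding single values per request. In the Python sketch below, the key names (subject, lighting, style, composition) are illustrative assumptions, since the guide does not prescribe a schema, and `json_prompt` is a hypothetical helper.

```python
import json

# Key names are free-form; these four are only an example.
BASE_SPEC = {
    "subject": "a lighthouse on a rocky coast",
    "lighting": "low golden-hour sun from the left",
    "style": "photorealistic",
    "composition": "wide shot, horizon on the lower third",
}


def json_prompt(spec, **overrides):
    """Serialize a prompt spec to JSON text, swapping individual values."""
    merged = {**spec, **overrides}
    return json.dumps(merged, indent=2)


# Change the subject, keep the lighting, style, and composition.
positive_prompt = json_prompt(BASE_SPEC, subject="a red fishing boat at anchor")
print(positive_prompt)
```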

The 32,000-character prompt limit gives you room for detailed briefs in any of these formats.
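When prompts are generated by code, a pre-flight check can catch oversized requests before they are sent. This Python sketch checks the two limits stated in this guide (32,000 prompt characters and 16 reference images); the helper `check_limits` is hypothetical, and it assumes the limit counts characters as Python's len() does.

```python
MAX_PROMPT_CHARS = 32_000
MAX_REFERENCE_IMAGES = 16


def check_limits(task):
    """Return a list of human-readable problems; an empty list means the task fits."""
    problems = []
    prompt = task.get("positivePrompt", "")
    if len(prompt) > MAX_PROMPT_CHARS:
        problems.append(f"prompt has {len(prompt)} characters, limit is {MAX_PROMPT_CHARS}")
    references = task.get("inputs", {}).get("referenceImages", [])
    if len(references) > MAX_REFERENCE_IMAGES:
        problems.append(f"{len(references)} reference images, limit is {MAX_REFERENCE_IMAGES}")
    return problems


ok_task = {"positivePrompt": "a short brief", "inputs": {"referenceImages": ["a.jpg"]}}
too_long = {"positivePrompt": "x" * 32_001}
print(check_limits(ok_task), check_limits(too_long))
```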

Tips

  1. Default to medium, switch to high for small text or fine detail. The cost difference is real and the quality difference is invisible on most outputs.

  2. Say "photorealistic" for realism. This single word is the strongest trigger for the model's photorealistic mode. Add camera language (lens, film stock, lighting direction) for compositional control.

  3. Lock text with quotes plus "verbatim". An instruction such as 'Render the tagline "Stay Curious" verbatim, exactly as written, no extra characters' prevents the model from rewriting your copy.

  4. Use one reference image unless you need more. Each additional reference adds room for the model to lose track of the anchor. Two well-labeled references beat five vague ones.

  5. Restate the preserve list on every iteration. Drift compounds across follow-ups. Repeating "preserve X, Y, Z" each turn is cheaper than fixing a botched second pass.

  6. Pick quality explicitly before shipping. The "auto" setting is fine for prototyping, but you lose control over latency and cost in a pipeline.

  7. Use numberResults for exploration. Request 3-4 variations in a single call. The model's spread within a batch is wider than its consistency across separate calls, so a single batch shows more useful variety.

  8. Let the model reason about content. For technical infographics and historical scenes, describe what you want to learn rather than dictating every detail. The model's world knowledge fills in accurate data points, period details, and domain-specific content.