MODEL ID ideogram:4@0
live

Ideogram 4.0

by Ideogram

Ideogram 4.0 is Ideogram's most capable text-to-image model for design-heavy image generation. It is built for frontier text rendering across languages, structured prompt control through natural language or JSON, bounding-box layout control, transparent background generation, and high-fidelity 2K output. It is well suited to posters, branded graphics, packaging, product visuals, typography-led compositions, and other workflows where design precision matters as much as visual quality.

Structured prompts

How Ideogram 4.0's two prompting modes work, the full JSON schema the model was trained on (top-level keys, style_description, compositional_deconstruction, element types, bbox, color_palette), when to send natural language and let Magic Prompt expand it, and when to hand-craft the JSON for explicit control.

Introduction

Image models are usually opaque about how they read a complex prompt. You write a sentence describing a layered scene, the model interprets it however it interprets it, and you have no way to see what it actually understood until the image comes back. Multi-element compositions lose track of their elements. Typography-led layouts garble the text. Repeatable brand work isn't actually repeatable because the model rolls fresh dice every run.

Ideogram 4.0 closes that gap. The model is driven by a structured JSON prompt with a fixed schema of reserved keys. The schema names the high-level description, the visual style, the background, and each individual element as an object or text entry. You can write natural language and let the provider's Magic Prompt step expand it into that schema automatically, or you can hand-craft the JSON yourself and skip the expansion. Either way, the model operates on the structure, not the sentence.

This guide covers the two prompting modes, the full reserved-key schema, the iteration loop that lets you start in natural language and tighten with structured edits, when each mode earns its place, and concrete patterns for typography-heavy and design-focused scenes.

The two prompting modes

Every Ideogram 4.0 request takes one of two inputs: a natural-language positivePrompt or a structured settings.structuredPrompt JSON object. They are mutually exclusive at the API level.

positivePrompt is the familiar path. You write the scene in plain language, the provider's Magic Prompt step expands your sentence into the structured JSON, and the model generates against that expansion. There is no Magic Prompt toggle. It runs automatically whenever a natural-language prompt is sent. The expanded JSON comes back in the response so you can capture it and feed it back as settings.structuredPrompt to iterate. See Iterating from the expansion below.

settings.structuredPrompt is the explicit path. You provide the structured JSON object yourself, and the model receives it without any expansion step. What you write is what gets generated against, with no language model in between.

Sending both positivePrompt and settings.structuredPrompt in the same request is rejected. Pick one path per request.

[
  {
    "taskType": "imageInference",
    "taskUUID": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "model": "ideogram:4@0",
    "positivePrompt": "An ornate vintage apothecary label wrapped around a tall amber glass medicine bottle on a weathered walnut shelf, with the title 'Dr. Faukland's Tincture of Quassia' set in Victorian slab-serif type.",
    "width": 2048,
    "height": 2048,
    "outputType": "URL"
  }
]

[
  {
    "taskType": "imageInference",
    "taskUUID": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "model": "ideogram:4@0",
    "settings": {
      "structuredPrompt": {
        "high_level_description": "An ornate vintage apothecary label on an amber glass bottle...",
        "style_description": {
          "aesthetics": "Late Victorian apothecary product photography...",
          "lighting": "Soft warm window light from the upper left...",
          "photo": "Editorial product still life, 35mm film, 50mm prime lens...",
          "medium": "Photograph.",
          "color_palette": ["#4A2F1B", "#A8753A", "#D4B080", "#2B1A0E"]
        },
        "compositional_deconstruction": {
          "background": "Aged dark walnut shelf...",
          "elements": [
            { "type": "text", "text": "DR. FAUKLAND'S", "desc": "Top of the label in small condensed serif capitals..." },
            { "type": "text", "text": "TINCTURE OF QUASSIA", "desc": "Main product name in Victorian Tuscan slab-serif..." }
          ]
        }
      }
    },
    "width": 2048,
    "height": 2048,
    "outputType": "URL"
  }
]
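
The two request shapes above can be produced by one small helper. This is a minimal Python sketch, not an official client: the function name build_request is hypothetical, generating taskUUID with uuid4 is an assumption, and the field names are copied from the examples in this guide. It only builds the payload and leaves sending it to whatever client you already use.

```python
import uuid
from typing import Any, Optional


def build_request(
    positive_prompt: Optional[str] = None,
    structured_prompt: Optional[dict] = None,
    width: int = 2048,
    height: int = 2048,
) -> list:
    """Build an imageInference task list carrying exactly one prompt input."""
    # The two inputs are mutually exclusive, so refuse both-or-neither locally
    # instead of waiting for the API to reject the request.
    if (positive_prompt is None) == (structured_prompt is None):
        raise ValueError("provide exactly one of positive_prompt or structured_prompt")

    task: dict = {
        "taskType": "imageInference",
        "taskUUID": str(uuid.uuid4()),  # assumption: a fresh UUID per task
        "model": "ideogram:4@0",
    }
    if positive_prompt is not None:
        # Natural-language path: Magic Prompt expands this on the provider side.
        task["positivePrompt"] = positive_prompt
    else:
        # Explicit path: the JSON is passed through without expansion.
        task["settings"] = {"structuredPrompt": structured_prompt}
    task.update({"width": width, "height": height, "outputType": "URL"})
    return [task]
```

Checking exclusivity before the call keeps the failure close to the code that caused it.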

Inside the structured JSON

The JSON has three top-level keys, with the bulk of the work happening inside compositional_deconstruction.elements[]. The full schema (every field, the photo/art_style switch, key order rules, hex format) lives on the Ideogram 4.0 model page. The compact version of the overall shape:

{
  "high_level_description": "",
  "style_description": {
    "aesthetics": "", "lighting": "", "photo": "", "medium": "", "color_palette": []
  },
  "compositional_deconstruction": {
    "background": "",
    "elements": [
      { "type": "obj", "bbox": [], "color_palette": [], "desc": "" },
      { "type": "text", "bbox": [], "color_palette": [], "text": "", "desc": "" }
    ]
  }
}

Six things about the schema are easy to miss even after you've read the field list, and are worth keeping in mind while you write prompts:

type: "obj" versus type: "text" is a hard dispatch. Text elements get their text field rendered literally, byte for byte. Obj elements get their desc interpreted as natural language the model paints from. This is the difference between Ideogram and image models that treat all content as interpretable language. The brand name in a text element comes back exactly as you wrote it. That same brand name buried inside an obj desc comes back as the model's best guess.
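
A short Python sketch of that dispatch. The helper names are hypothetical; the keys and their order follow the compact schema above. The last function is a simple substring check (an assumption about what is worth linting, not a provider feature) that flags copy which must render verbatim but has been left inside an obj desc.

```python
def text_element(text: str, desc: str) -> dict:
    """A text element: `text` is the literal copy, `desc` is the treatment."""
    return {"type": "text", "text": text, "desc": desc}


def obj_element(desc: str) -> dict:
    """An obj element: `desc` is natural language the model paints from."""
    return {"type": "obj", "desc": desc}


def copy_hidden_in_obj(elements: list, exact_copy: list) -> list:
    """Return indexes of obj elements whose desc mentions must-be-exact copy."""
    lowered = [copy.lower() for copy in exact_copy]
    return [
        index
        for index, element in enumerate(elements)
        if element["type"] == "obj"
        and any(copy in element["desc"].lower() for copy in lowered)
    ]


elements = [
    text_element("TINCTURE OF QUASSIA", "Main product name in Victorian Tuscan slab-serif."),
    obj_element("Amber glass bottle with a paper label reading Tincture of Quassia."),
]
# Index 1 is flagged: the product name is also buried in an obj description,
# where it would be interpreted instead of rendered literally.
flagged = copy_hidden_in_obj(elements, ["Tincture of Quassia"])
```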

desc is freeform natural language inside an otherwise rigid container. Everywhere else the model expects specific keys in specific orders with specific types. Inside desc you write whatever sentence you would write to a person. That paradox is the whole shape of the structured prompt, with rigid scaffolding around freeform content slots. It is what lets you build production templates where the structure is fixed and only the content varies.

bbox is the heavier positioning touch, per-element and optional. Most layouts work fine with positional language in desc ("centered along the top", "lower-right corner"). When the descriptive position keeps drifting and you need the element to land inside a specific region, add a bbox array of four integers in 0–1000 normalised coordinates as [y_min, x_min, y_max, x_max]: row-first, y before x. It works on text and obj alike, and coexists with desc on the same element: the bbox declares the rectangle, desc still carries the treatment notes. See Labelled spatial composition below for a transit map worked example.
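
If you plan layouts in pixels, the conversion to the 0–1000 grid is a single scaling step. The Python sketch below is illustrative only: the function name is hypothetical, and it assumes the origin is the top-left corner of the canvas and that rounding to the nearest integer is acceptable, neither of which this guide states explicitly.

```python
def pixel_rect_to_bbox(left: int, top: int, right: int, bottom: int,
                       width: int, height: int) -> list:
    """Convert a pixel rectangle to [y_min, x_min, y_max, x_max] on a 0-1000 grid."""
    if not (0 <= left < right <= width and 0 <= top < bottom <= height):
        raise ValueError("rectangle must lie inside the canvas and have positive area")

    def scale(value: int, extent: int) -> int:
        # Normalise against the canvas extent, then round to an integer.
        return round(value * 1000 / extent)

    # Row-first ordering: y values come before x values.
    return [scale(top, height), scale(left, width),
            scale(bottom, height), scale(right, width)]


# A title band across the top of a 2000 x 1000 canvas.
title_bbox = pixel_rect_to_bbox(left=200, top=100, right=1800, bottom=300,
                                width=2000, height=1000)
```

Because the coordinates are normalised, the same bbox can be reused when the output dimensions change, as long as the aspect ratio of the layout still makes sense.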

photo versus art_style is a model-behaviour fork, not a label. The model uses one rendering path for photographic prompts and another for illustrative ones. Picking the wrong one for your subject is one of the most common ways to get a worse-than-expected output. If you are generating something that would exist as a print, drawing, or graphic, reach for art_style even when "it kind of looks like a photo of a poster" is technically true.

color_palette is the only non-language conditioning channel, and it works at two scopes. Everything else in the JSON is text the model reads. The palette array is a separate signal entirely: hex values the model is trained to treat as colours to favour. At the image level (style_description.color_palette, up to 16 hex values) it anchors the colour story of the whole composition. At the element level (per obj or text, up to 5 hex values) it scopes a colour to one element, useful when a brand colour or product colour must hold without bleeding into the rest of the scene. Describing colours in desc or aesthetics is interpretation. Passing hex through the palette is direction.
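
The two palette scopes and their limits are easy to check before sending. This is a hedged Python sketch: the function name is hypothetical, the limits (16 at the image level, 5 per element) come from the paragraph above, and the #RRGGBB pattern is an assumption based on the examples in this guide, since the exact hex format rules live on the model page.

```python
import re

HEX_COLOUR = re.compile(r"#[0-9A-Fa-f]{6}")  # assumed format, matching the examples
IMAGE_LIMIT = 16    # style_description.color_palette
ELEMENT_LIMIT = 5   # per obj or text element


def palette_problems(prompt: dict) -> list:
    """Return human-readable problems found in image- and element-level palettes."""
    problems = []

    def check(palette: list, limit: int, where: str) -> None:
        if len(palette) > limit:
            problems.append(f"{where}: {len(palette)} colours, limit is {limit}")
        for colour in palette:
            if not (isinstance(colour, str) and HEX_COLOUR.fullmatch(colour)):
                problems.append(f"{where}: {colour!r} is not a #RRGGBB hex value")

    style = prompt.get("style_description", {})
    check(style.get("color_palette", []), IMAGE_LIMIT, "style_description")

    elements = prompt.get("compositional_deconstruction", {}).get("elements", [])
    for index, element in enumerate(elements):
        check(element.get("color_palette", []), ELEMENT_LIMIT, f"elements[{index}]")
    return problems
```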

Snake_case spellings only. color_palette, not colorPalette or color-palette. compositional_deconstruction, not composition or breakdown. Key order is also part of the contract, and the validator checks it. The model page schema enumerates every field and order rule.
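
A local spelling check can catch camelCase or hyphenated keys before the request goes out. The Python sketch below is an assumption-laden illustration: the reserved key sets are taken only from the compact shape shown in this guide and may be incomplete, the function name is hypothetical, and key order is deliberately not checked because the order rules are documented on the model page, not here.

```python
TOP_KEYS = {"high_level_description", "style_description", "compositional_deconstruction"}
STYLE_KEYS = {"aesthetics", "lighting", "photo", "art_style", "medium", "color_palette"}
COMPOSITION_KEYS = {"background", "elements"}
ELEMENT_KEYS = {"type", "bbox", "color_palette", "text", "desc"}


def _normalise(key: str) -> str:
    # colorPalette, color-palette and color_palette all collapse to "colorpalette".
    return key.replace("_", "").replace("-", "").lower()


def key_problems(prompt: dict) -> list:
    """Report unknown keys, with a hint when a reserved spelling is close."""
    problems = []

    def check(mapping: dict, allowed: set, where: str) -> None:
        by_normalised = {_normalise(key): key for key in allowed}
        for key in mapping:
            if key in allowed:
                continue
            hint = by_normalised.get(_normalise(key))
            suffix = f" (did you mean {hint!r}?)" if hint else ""
            problems.append(f"{where}: unknown key {key!r}{suffix}")

    check(prompt, TOP_KEYS, "top level")
    style = prompt.get("style_description", {})
    check(style, STYLE_KEYS, "style_description")
    if "photo" in style and "art_style" in style:
        problems.append("style_description: photo and art_style are mutually exclusive")
    composition = prompt.get("compositional_deconstruction", {})
    check(composition, COMPOSITION_KEYS, "compositional_deconstruction")
    for index, element in enumerate(composition.get("elements", [])):
        check(element, ELEMENT_KEYS, f"elements[{index}]")
    return problems
```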

Natural language vs structured

For simple scenes, natural language carries everything. The Magic Prompt expansion is fast and good enough that hand-crafting JSON is wasted effort on a single-subject portrait or a clean landscape.

Complexity is where the two paths separate. When the scene has multiple text elements with specific copy, when typography hierarchy must be exact, when the same scene needs to render consistently across runs, the natural-language sentence loses precision against the JSON it produces.

The two cards below were generated from the same subject. The first prompt is natural language. The second is the structured JSON path with the same scene broken down explicitly.

The natural-language card is plausible, not specified. The model picked a goldcrest and a workable layout, then rendered the text it could identify. The exact size of the common name, the position of the card number, the colour of the series mark, and the precise wording of the caption all came back as the Magic Prompt's best interpretation, not as your spec. The structured card encodes each of those decisions explicitly, so the caption reads as written, the series mark lands at the bottom in green, and the card number sits in the lower-right corner because the model received those instructions directly.

Iterating from the expansion

Magic Prompt runs every time you send a natural-language positivePrompt. The structured JSON it produces is returned in the response alongside the image, so you can capture it and use it as the starting point for a structured iteration.

The loop:

  1. Send a natural-language positivePrompt. Get back an image and the structured JSON Magic Prompt generated to produce it.
  2. Read the JSON. Find the element you want to change.
  3. Edit that element. Leave the rest of the JSON unchanged.
  4. Send the edited JSON as settings.structuredPrompt on the next call.

The image comes back with your change applied and the rest of the scene staying close to the previous render, because the JSON the model receives is the same as the previous call except for the element you touched. The unchanged elements won't be pixel-identical from one run to the next, but they keep the same role in the composition.
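
The loop can be sketched in a few lines of Python. This is an illustration under stated assumptions: the response is assumed to carry the expansion under a structuredPrompt key, as in the example response in this guide; the helper names replace_text and structured_request are hypothetical; and no network call is made.

```python
import copy
import uuid


def replace_text(structured_prompt: dict, old_text: str, new_text: str) -> dict:
    """Return a copy of the prompt with one text element's copy replaced."""
    edited = copy.deepcopy(structured_prompt)  # leave the captured JSON untouched
    elements = edited["compositional_deconstruction"]["elements"]
    matches = [
        element for element in elements
        if element.get("type") == "text" and element.get("text") == old_text
    ]
    if len(matches) != 1:
        raise ValueError(f"expected one text element {old_text!r}, found {len(matches)}")
    matches[0]["text"] = new_text
    return edited


def structured_request(structured_prompt: dict, width: int, height: int) -> list:
    """Wrap an edited prompt as the next imageInference request."""
    return [{
        "taskType": "imageInference",
        "taskUUID": str(uuid.uuid4()),  # assumption: a fresh UUID per call
        "model": "ideogram:4@0",
        "settings": {"structuredPrompt": structured_prompt},
        "width": width,
        "height": height,
        "outputType": "URL",
    }]


# Step 1 result, trimmed to the parts this sketch needs.
response = [{
    "structuredPrompt": {
        "compositional_deconstruction": {
            "elements": [
                {"type": "text", "text": "17 OF 50", "desc": "Card number..."},
                {"type": "text", "text": "BRITISH SONGBIRDS", "desc": "Series mark..."},
            ]
        }
    }
}]
returned = response[0]["structuredPrompt"]              # step 2: read the JSON
edited = replace_text(returned, "17 OF 50", "18 OF 50")  # step 3: edit one element
request = structured_request(edited, width=1664, height=2496)  # step 4: re-send
```

Working on a deep copy keeps the captured expansion intact, so you can branch several variations from the same starting JSON.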

The example below picks up from the natural-language goldcrest card generated earlier. The response from that initial call included the structured JSON that Magic Prompt expanded the prompt into. To turn the card into a bullfinch, you update the high_level_description to name the new species, then rewrite five element entries: the bird's obj description, the common name, the Latin species name, the field-guide caption, and the card number. The style block, the series mark, the frame border, and the background description stay close to where they were, because the JSON describing them didn't change.

[
  {
    "taskType": "imageInference",
    "taskUUID": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "imageUUID": "9c1b2d3a-4e5f-6789-abcd-ef0123456789",
    "imageURL": "https://im.runware.ai/image/os/a14d18/ws/2/ii/9c1b2d3a-4e5f-6789-abcd-ef0123456789.jpg",
    "structuredPrompt": {
      "high_level_description": "A vintage 1950s collectible natural-history trading card showing a goldcrest bird perched on a pine sprig...",
      "style_description": {
        "aesthetics": "Mid-century British natural-history collectible card design...",
        "lighting": "Soft even studio light...",
        "medium": "Print on cream cardstock...",
        "art_style": "Vintage 1950s halftone field-guide illustration...",
        "color_palette": ["#F4E9CC", "#1F3B26", "#7A5A2F", "#1A1A1A"]
      },
      "compositional_deconstruction": {
        "background": "Cream cardstock filling the frame...",
        "elements": [
          { "type": "text", "text": "GOLDCREST", "desc": "Common name in bold black serif capitals..." },
          { "type": "obj", "desc": "Small detailed halftone illustration of a male goldcrest..." },
          { "type": "text", "text": "Regulus regulus", "desc": "Latin species name in italic serif..." },
          { "type": "text", "text": "Smallest of the British songbirds, around 9 cm in length...", "desc": "Field-guide caption..." },
          { "type": "text", "text": "17 OF 50", "desc": "Card number..." },
          { "type": "text", "text": "BRITISH SONGBIRDS", "desc": "Series mark..." },
          { "type": "obj", "desc": "Thin dark green ornamental frame border..." }
        ]
      }
    }
  }
]

[
  {
    "taskType": "imageInference",
    "taskUUID": "b2c3d4e5-f6a7-8901-bcde-f23456789012",
    "model": "ideogram:4@0",
    "settings": {
      "structuredPrompt": {
        "high_level_description": "A vintage 1950s collectible natural-history trading card showing a bullfinch perched on a hawthorn branch, printed on cream cardstock with classic British natural-history series typography.",
        "style_description": {
          "aesthetics": "Mid-century British natural-history collectible card design, restrained ornamentation, modest paper aging.",
          "lighting": "Soft even studio light with no directional shadow.",
          "medium": "Print on cream cardstock with halftone illustration.",
          "art_style": "Vintage 1950s halftone field-guide illustration with thin line work and flat colour fills.",
          "color_palette": ["#F4E9CC", "#1F3B26", "#7A5A2F", "#1A1A1A"]
        },
        "compositional_deconstruction": {
          "background": "Cream cardstock filling the frame, very slight foxing along the edges and a faint paper grain texture.",
          "elements": [
            { "type": "text", "text": "BULLFINCH", "desc": "Common name in bold black serif capitals centered across the top of the card." },
            { "type": "obj", "desc": "Small detailed halftone illustration of a male bullfinch with its glossy black cap and bright rose-pink breast, perched on a fine hawthorn branch with a few berries, occupying the central upper two-thirds of the card." },
            { "type": "text", "text": "Pyrrhula pyrrhula", "desc": "Latin species name in italic serif directly beneath the bird illustration." },
            { "type": "text", "text": "A stout, rose-breasted finch of hedgerows and orchards, around 15 cm long. Pairs are usually seen together throughout the year.", "desc": "Field-guide caption in small condensed serif beneath the Latin name, set as two short justified lines." },
            { "type": "text", "text": "18 OF 50", "desc": "Card number in small bold caps in the lower-right corner." },
            { "type": "text", "text": "BRITISH SONGBIRDS", "desc": "Series mark in small spaced caps along the very bottom edge of the card, dark green ink." },
            { "type": "obj", "desc": "Thin dark green ornamental frame border running just inside the edges of the card." }
          ]
        }
      }
    },
    "width": 1664,
    "height": 2496,
    "outputType": "URL"
  }
]

The bullfinch sits in the same composition as the goldcrest because the JSON skeleton didn't change. The cream cardstock, the halftone treatment, the green frame, and the field-guide layout all carry through. The series mark still reads BRITISH SONGBIRDS because the iteration didn't touch it, even though a bullfinch is arguably not a songbird in the narrow sense. That is the workflow's point: the JSON is the contract, and the elements you don't edit keep their role in the scene because the model receives the same instructions for them.
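
Before re-sending, it can help to confirm that only the intended elements changed. A minimal Python sketch, assuming the edit keeps the elements array the same length and in the same order; the function name is hypothetical.

```python
def changed_elements(before: dict, after: dict) -> list:
    """Return indexes of elements that differ between two structured prompts."""
    old = before["compositional_deconstruction"]["elements"]
    new = after["compositional_deconstruction"]["elements"]
    if len(old) != len(new):
        raise ValueError("element lists differ in length; compare them by hand")
    return [index for index, (a, b) in enumerate(zip(old, new)) if a != b]


goldcrest = {"compositional_deconstruction": {"elements": [
    {"type": "text", "text": "GOLDCREST", "desc": "Common name..."},
    {"type": "text", "text": "BRITISH SONGBIRDS", "desc": "Series mark..."},
]}}
bullfinch = {"compositional_deconstruction": {"elements": [
    {"type": "text", "text": "BULLFINCH", "desc": "Common name..."},
    {"type": "text", "text": "BRITISH SONGBIRDS", "desc": "Series mark..."},
]}}
# Only the common name was edited; the series mark is untouched.
touched = changed_elements(goldcrest, bullfinch)
```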

This is consistency, not pixel-for-pixel reproducibility. The angle of the bird, the exact font weight on the common name, the precise green of the border, and the small framing details all shift between runs because the model regenerates the whole image from scratch every time. Two runs of the same positivePrompt produce two different Magic Prompt expansions and therefore two different scenes. Two runs of the same structuredPrompt start from the same instructions, so the output stays close to itself across runs, similar in composition without being identical. The iteration loop only works because the JSON is what the model receives.

Re-running the prompt with edited JSON fields regenerates the entire image from scratch, so the details of the layout, characters, and background can all shift even when only one field changed. For genuinely localized edits that preserve the rest of the frame (swap the text on a sign, change a product label, replace one object), reach for Ideogram 3.0 Edit instead, which takes a seed image plus a mask and only repaints the masked region.

When to reach for structured prompts

Natural language is the right starting point. The Magic Prompt expansion is fast and the result is usually close to what you wanted. Reach for the structured JSON when one of the following is true:

  1. The exact copy matters. Headlines, brand names, product names, prices, dates, addresses, identification numbers. Anything where a paraphrase is a failure. Putting the copy inside a text element guarantees the model receives it verbatim.
  2. There are multiple text elements with hierarchy. A poster with a main title, a subtitle, a date line, and a venue mark is asking for a layout the sentence can't describe cleanly. Each text element lets you set position, weight, and treatment per piece of copy.
  3. The composition has many distinct objects. A scene with five or more discrete elements is hard to keep straight in a single sentence. Listing them explicitly stops the model from collapsing or omitting them.
  4. You need precise colour control. color_palette at the image level (and per element) gives the model explicit hex colours to favour, which is a different and tighter signal than describing colours in prose.
  5. You need positional precision. bbox coordinates pin elements to specific regions of the canvas. Descriptive positioning ("centered along the top") is a useful approximation. bbox is a target.
  6. You need to iterate on one part without rewriting the whole prompt. Structured JSON lets you change a single element's desc and re-run while the instructions for the rest of the scene stay the same.

For everything else, write the sentence and let Magic Prompt do its work.

Patterns

Bold typography hierarchy

A 1960s Polish Cyrk poster reduces a complex performance to a single image and four lines of type. The structured prompt puts the title, the subtitle, and the tour-stop line each in their own text element with size, weight, and position separately specified. The acrobat lives as a single obj element with its painting style described. A style_description.color_palette of cream-red-yellow pins the iconic palette of the school directly.

Labelled spatial composition

A transit map is mostly text annotations attached to specific positions. Sentence-based prompting has no real way to ask for a station label at a precise junction. The structured prompt lets each station name live as its own text element with a positional description, the route ribbons live as obj elements with their colours and directions specified, and the style_description.color_palette names the Vignelli-era subway colour identity directly.
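
When descriptive positions are not tight enough for station labels, each label can carry a bbox. The Python sketch below is purely illustrative: the station names, box sizes, and desc wording are hypothetical, the helper name is made up, and the only things taken from this guide are the element keys and the [y_min, x_min, y_max, x_max] ordering on the 0–1000 grid.

```python
def station_label(name: str, centre_y: int, centre_x: int,
                  box_height: int = 40, box_width: int = 200) -> dict:
    """Build a text element whose bbox is centred on a junction, clamped to the grid."""
    def clamp(value: int) -> int:
        return max(0, min(1000, value))

    bbox = [
        clamp(centre_y - box_height // 2),  # y_min
        clamp(centre_x - box_width // 2),   # x_min
        clamp(centre_y + box_height // 2),  # y_max
        clamp(centre_x + box_width // 2),   # x_max
    ]
    return {
        "type": "text",
        "bbox": bbox,
        "text": name,
        "desc": "Station name in bold sans-serif capitals beside its junction marker.",
    }


# Hypothetical stations at normalised (y, x) junction positions.
stations = [("CENTRAL", 500, 500), ("HARBOUR", 990, 950)]
labels = [station_label(name, y, x) for name, y, x in stations]
```

Clamping keeps labels near the canvas edge inside the grid; whether a clipped box still leaves enough room for the text is something to verify on the rendered image.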

Precise multi-line typography

A movie poster's title and credit block lives or dies on the type hierarchy. The structured prompt names the tagline, the title, the director credit, the cast line, the below-the-line credits, the studio imprint, and the release date as separate elements, so the IMAX-era typography ladder is the model's brief instead of its guess. A style_description.color_palette of deep navy, near-black, pale cream, and cool blue anchors the action-sci-fi mood.

Tips

  1. Use the exact reserved spellings. Snake_case, in the documented order. color_palette, not colorPalette. art_style, not artStyle. The validator treats unknown keys as warnings, and keys beyond the reserved set aren't expected to affect the result.
  2. Pick photo or art_style, not both. Photographs use photo. Everything else (illustration, paint, screen-print, 3D render) uses art_style. They are mutually exclusive inside style_description.
  3. Write the high_level_description first. It is the global frame the rest of the JSON lives inside. A vague summary leaves room for the model to drift in directions the elements don't anticipate.
  4. Keep background to the setting, not the subject. Surface, light, atmosphere. If something is a discrete object or piece of text, it belongs in elements, not in background.
  5. Quote the literal text content. The text field is the copy that gets rendered. Apostrophes and accents are reproduced as written. If you want "Dr. Faukland's" with a curly apostrophe, write a curly apostrophe.
  6. Order elements in roughly reading order. Descriptions that reference "directly beneath the title block" need the title block to come first. List elements top-to-bottom or background-to-foreground.
  7. Draft in natural language, iterate via the returned JSON. Magic Prompt is fast enough that the first generation should almost always come from a positivePrompt. Capture the JSON it returns, edit the elements that need tightening, and re-send as structuredPrompt. Hand-authoring the full JSON from scratch is rarely the right starting point.