MODEL ID ideogram:4@0

Ideogram 4.0

Ideogram
by Ideogram

Ideogram 4.0 is Ideogram's most capable text-to-image model for design-heavy image generation. It is built for frontier text rendering across languages, structured prompt control through natural language or JSON, bounding-box layout control, transparent background generation, and high-fidelity 2K output. It is well suited to posters, branded graphics, packaging, product visuals, typography-led compositions, and other workflows where design precision matters as much as visual quality.

Text and design output

How to use Ideogram 4.0 for typography-heavy designs: rendering long and dense text, multilingual and handwritten scripts, descriptive and bbox-anchored layout, image-level and per-element colour palettes, transparent backgrounds, aspect-ratio presets, and the three rendering-speed tiers.

Introduction

Text inside images is the historical weak spot of generative models. Most image models treat letters as visual texture rather than content. They recognise that text looks a certain way without understanding spelling or character order. Headlines come back garbled and brand names get warped, so any design that depends on legible copy ends up unusable.

Ideogram 4.0 is built for the opposite problem. It treats text as a first-class element with explicit content rather than visual decoration, which makes it reliable for designs where the words have to be exactly right. It handles dense small copy, multilingual scripts, handwritten lettering, and text that's been rotated or inverted as a deliberate design choice.

This guide covers the text-rendering capabilities, then the design output features that depend on them: descriptive and bbox-anchored layout, image-level and per-element colour palettes, transparent-background output, the aspect-ratio presets, and the three rendering-speed tiers.

Text rendering

The four sub-sections below each isolate one capability. The same text element type drives all of them. What changes is what the model is being asked to render and how the surrounding desc describes it.
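A text element can be sketched as a small Python dict. This is a minimal illustration, assuming the element carries a `type` key alongside the `text` and `desc` fields named in this guide; the key name `type` and the helper `text_element` are assumptions for illustration, and the model's request schema is the authority on exact field names.

```python
import json


def text_element(text: str, desc: str) -> dict:
    """Build one text element: `text` is the literal copy, `desc` the treatment."""
    if not text:
        raise ValueError("text must contain the literal copy to render")
    # "type" is an assumed key name; "text" and "desc" come from the guide
    return {"type": "text", "text": text, "desc": desc}


element = text_element(
    text="Dr. Faukland’s",  # curly apostrophe, written exactly as it should render
    desc="Bold black serif capitals across the top",
)

# ensure_ascii=False keeps non-ASCII characters readable in the serialized prompt
serialized = json.dumps(element, ensure_ascii=False)
print(serialized)
```

Keeping the literal copy in `text` and the styling in `desc` means the same builder serves dense copy, non-Latin scripts, and handwriting alike; only the two strings change.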

Dense small text

Most image models can render a headline. Almost none can render a paragraph. As soon as the copy gets long, the text degrades into shapes that look like writing but aren't actually words. Ideogram 4.0 handles paragraph-length copy and small annotations at sizes close to the limit of legibility, even when the layout is a wall of labels at different positions.

The star chart below carries more than fifteen distinct labels: five constellation names inside the disc, eleven small star annotations packed around its rim, the plate caption, and the publisher mark.

The five constellation names hold their italic engraved character inside the disc. The eleven star annotations packed around the rim stay legible at engraving-plate sizes, even with leader lines threading between them. The plate number and publisher mark at the bottom edge sit cleanly in their assigned positions.

Multilingual scripts

Text rendering in image models has historically meant Latin script. Anything else returns shapes that look approximately like the script in question but don't spell anything. Ideogram 4.0 handles the major non-Latin scripts with the same precision, which matters for international signage and multilingual packaging in any brand work that has to cross markets.

Each line of the sign is a separate text element with its own content. The same baggage-claim instruction written four ways at equal weight is a fairly extreme test, and each script's character carries through to the final render.

Handwritten and stylized lettering

Handwriting is harder than print because every glyph is an organic shape rather than a typographic instance. Most image models render "handwriting" as a fake script font with no actual hand quality. Ideogram 4.0 treats handwriting as an aesthetic to render, not a typeface to substitute, and renders dip-pen cursive, calligraphy, and sketched lettering as if a real hand made them.

The recipe page below uses handwriting where it would actually appear in 1924: a title, two section headings, an ingredient list, a method paragraph, and a signature, all in a single hand.

The slight forward slant and the variation in stroke weight read as a real hand. The flourish under Beatrice's signature is a small piece of personality the model added because the structure asked for it.

Inverted and rotated text

Text doesn't always sit horizontally on the canvas. Coins, seals, stamps, and curved type lockups all require letters that follow a curve, and a good portion of design work asks for text rotated or inverted as a deliberate choice. Most image models handle this badly, producing horizontal text visually distorted to fit a shape, rather than letters that actually fit the curve.

The wax seal below has two arcs of text following its rim. The top arc reads upright. The bottom arc is rotated to follow the curve, the way real engraved seals are designed.

The motto along the top arcs upright. The institutional name along the bottom arcs with each glyph rotated to follow the curve, so the text reads correctly to someone tilting the seal. This is the typographic discipline an industrial-design pipeline expects, and what most generators flatten into horizontal type that's been bent into a shape.

Layout control

Layout in a structured prompt has two levers: descriptive positioning in each element's desc field, and explicit bbox coordinates that pin an element to a region of the canvas.

Descriptive positioning is the lighter touch. The model reads phrases like "centered along the top", "in the lower-right corner", or "directly beneath the title block" and places elements accordingly. It works well when the layout has clear hierarchy and the model has enough room to make small decisions.

bbox is the heavier touch. It's an array of four integers, [y_min, x_min, y_max, x_max], in 0–1000 normalised coordinates with the origin at the top-left. The model honours the box through its shared positional embedding, so the element lands inside the named region rather than approximately near it.

The bbox order is row-first (y, x rather than x, y). Designers normally think in (x, y). Build the bbox as [top, left, bottom, right] to keep the order straight. Values must be integers, in [0, 1000], with y_min ≤ y_max and x_min ≤ x_max.
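The constraints above can be checked before a request is sent. The following is a minimal sketch, not part of any official SDK: `make_bbox` and `bbox_from_xywh` are hypothetical helper names, and the second one assumes the designer starts from a pixel rectangle given as (x, y, width, height).

```python
def make_bbox(top: int, left: int, bottom: int, right: int) -> list[int]:
    """Return [y_min, x_min, y_max, x_max] after checking the stated constraints."""
    values = [top, left, bottom, right]
    for v in values:
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(v, bool) or not isinstance(v, int):
            raise TypeError(f"bbox values must be integers, got {v!r}")
        if not 0 <= v <= 1000:
            raise ValueError(f"bbox values must be in [0, 1000], got {v}")
    if top > bottom or left > right:
        raise ValueError("need y_min <= y_max and x_min <= x_max")
    return values


def bbox_from_xywh(x: float, y: float, w: float, h: float,
                   canvas_w: int, canvas_h: int) -> list[int]:
    """Convert a designer-style pixel rectangle (x, y, width, height) to a bbox."""
    def scale(value: float, extent: int) -> int:
        # Map a pixel offset onto the 0-1000 normalised grid
        return round(value / extent * 1000)

    return make_bbox(
        top=scale(y, canvas_h),
        left=scale(x, canvas_w),
        bottom=scale(y + h, canvas_h),
        right=scale(x + w, canvas_w),
    )
```

Naming the arguments `top`, `left`, `bottom`, `right` makes the row-first order hard to get wrong at the call site, which is the main reason to route every bbox through one helper.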

The concert ticket below is generated with explicit bbox coordinates on every element. The ASCII diagram on the left is a separate, illustrative pass through Nano Banana to sketch roughly where each element lands. The photograph on the right is what Ideogram rendered from the actual bbox coordinates.

ADMIT ONE sits at the top because its bbox is [40, 220, 110, 480]: top edge near y=40, near-centred horizontally between x=220 and x=480. The venue name fills the upper title block from [140, 60, 240, 660]. The performer name dominates the lower half via [360, 40, 540, 680]. The perforation obj is a tall narrow rectangle running full-height at [0, 700, 1000, 720]. Each right-stub element carves out its own small box inside the x=750-and-right zone, with a duplicate ADMIT ONE at the top of the stub and the price sitting just below the seat assignment so the stub reads as four evenly weighted rows. The elements land inside the named rectangles rather than approximately near them, and the perforation cleanly separates the stub from the main ticket area without the model negotiating where it should sit.
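To sanity-check a layout like this before generating, the normalised boxes can be mapped to pixels and tested for overlap. This is an illustrative sketch: the helper names are hypothetical, the dictionary keys are labels chosen here, and the 2000 × 1000 canvas is an assumption because the ticket's real dimensions are not stated.

```python
def bbox_to_pixels(bbox: list[int], width: int, height: int) -> tuple[int, int, int, int]:
    """Map [y_min, x_min, y_max, x_max] in 0-1000 units to (left, top, right, bottom) pixels."""
    y_min, x_min, y_max, x_max = bbox
    return (
        round(x_min * width / 1000),
        round(y_min * height / 1000),
        round(x_max * width / 1000),
        round(y_max * height / 1000),
    )


def overlaps(a: list[int], b: list[int]) -> bool:
    """True when two bboxes share interior area (touching edges do not count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


# Coordinates quoted in the ticket walkthrough above
ticket = {
    "admit_one": [40, 220, 110, 480],
    "venue": [140, 60, 240, 660],
    "performer": [360, 40, 540, 680],
    "perforation": [0, 700, 1000, 720],
}

# Canvas size is an assumption for illustration only
for name, box in ticket.items():
    print(name, bbox_to_pixels(box, width=2000, height=1000))
```

None of the four quoted boxes overlap, which is consistent with the description of the perforation separating the stub from the main ticket area.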

You can mix the two approaches. Pin the elements whose position is non-negotiable with bbox, and let the rest fall through descriptive positioning. Inside a single element, both fields can coexist: bbox declares the rectangle, and desc still carries the style and treatment notes.

Colour palette control

Colour conditioning in the structured prompt is explicit. Instead of describing colours in language ("warm sunset tones with terracotta and cream"), you list hex values the model treats as the colours to favour in the composition.

There are two places color_palette can appear in the JSON:

  • Inside style_description, at the image level. Up to 16 colours. The global palette for the entire image.
  • Inside an individual element, at the per-element level. Up to 5 colours per element. Targeted conditioning for one specific object or piece of text.

Both fields take an array of uppercase #RRGGBB hex strings. Shorthand #RGB and #RRGGBBAA formats are rejected by the verifier.
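The format and size limits above are easy to enforce client-side. The sketch below is an assumption about how one might pre-check a palette, not the verifier's actual implementation; `check_palette` and `normalise_hex` are hypothetical names.

```python
import re

HEX_COLOUR = re.compile(r"#[0-9A-F]{6}")
IMAGE_LEVEL_MAX = 16   # limit for style_description.color_palette
PER_ELEMENT_MAX = 5    # limit for a color_palette inside one element


def check_palette(colours: list[str], per_element: bool = False) -> list[str]:
    """Raise ValueError if the palette breaks the documented limits or format."""
    limit = PER_ELEMENT_MAX if per_element else IMAGE_LEVEL_MAX
    if len(colours) > limit:
        raise ValueError(f"palette has {len(colours)} colours, limit is {limit}")
    for colour in colours:
        if not HEX_COLOUR.fullmatch(colour):
            raise ValueError(f"{colour!r} is not an uppercase #RRGGBB string")
    return colours


def normalise_hex(colour: str) -> str:
    """Expand #RGB shorthand and uppercase the result; alpha forms are refused."""
    body = colour.lstrip("#")
    if len(body) == 3:
        body = "".join(ch * 2 for ch in body)
    if len(body) != 6:
        raise ValueError(f"cannot normalise {colour!r} to #RRGGBB")
    return "#" + body.upper()
```

Running design-tool exports through `normalise_hex` first and `check_palette` second catches lowercase and shorthand values before they reach the API.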

The two posters below come from the same structured prompt, distinguished only by a different style_description.color_palette and a one-word change in the medium description.

Both posters carry the same scene description: a mountain ridge under a wide sky, a lone pine, a sun on the horizon, the title block above, the imprint along the bottom. The font weights, the exact silhouette of the pine, the mountain peaks, and the foreground will vary between any two runs of an image model. What does not vary is the mood, and the mood is what the palette controls. The model treats the array as a target conditioning signal, not as a hint to interpret in language.

Per-element color_palette gives one element its own conditioning channel. A text element with its own palette can hold a brand colour that the rest of the scene doesn't have. An obj element with its own palette can carry a product colour without bleeding into the background. Up to 5 colours per element.

Transparent backgrounds

Design pipelines rarely use generated images as the final composition. They need elements that drop into a layered file: logos, icons, monograms, ornaments, badges. Ideogram 4.0 produces these with a transparent background when you ask for one explicitly in the prompt's background field and request outputFormat: "PNG" so the alpha channel survives.

The monogram above is a single image element delivered with no scene behind it. Drop it into a card design or a letterhead template without having to mask anything by hand.

Transparency only survives in formats that carry an alpha channel. Always set outputFormat: "PNG" when you want the background to come through transparent.
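One way to confirm that a delivered file can actually hold transparency is to read its PNG header. The sketch below relies only on the PNG file layout (an 8-byte signature followed by an IHDR chunk that records the colour type); the function name is illustrative and nothing here is part of the Ideogram API.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_has_alpha_channel(data: bytes) -> bool:
    """Report whether PNG bytes declare an alpha channel in the IHDR chunk."""
    if len(data) < 26 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    # IHDR payload starts at byte 16: width, height, bit depth, colour type
    width, height, bit_depth, colour_type = struct.unpack(">IIBB", data[16:26])
    # Colour type 4 is greyscale+alpha, 6 is RGBA. Palette images (type 3) can
    # carry transparency through a tRNS chunk, which this sketch does not inspect.
    return colour_type in (4, 6)
```

Passing the bytes of a downloaded result to `png_has_alpha_channel` gives a quick check in an asset pipeline; a `ValueError` signals that the file came back in a different format altogether.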

Aspect ratios

Every output is approximately 4 million pixels. At 1:1 that's 2048 × 2048. Wider or taller ratios keep roughly the same pixel budget and redistribute it between width and height, and the API accepts only the predefined presets. The common ones for design work:

Ratio | Dimensions  | Typical use
1:1   | 2048 × 2048 | Social squares, album covers, packshots
16:9  | 2560 × 1440 | Landscape banners, video stills
9:16  | 1440 × 2560 | Vertical video, story posts
3:2   | 2496 × 1664 | 35mm photo proportions, posters
2:3   | 1664 × 2496 | Portrait posters, book covers
4:5   | 1792 × 2240 | Editorial portrait, Instagram portrait
5:4   | 2240 × 1792 | Specimen cards, museum labels
8:5   | 2560 × 1600 | Widescreen banners

There are 23 presets in total, including extreme aspect ratios like 22:9, 23:9, 8:3, 12:5, and the very long 3:1 and 1:3 for ultra-wide banners or pillar formats. The full list lives in the model's request schema. The API rejects any width and height outside the presets, so pick the preset closest to your target output.
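Choosing the nearest preset can be automated. The sketch below covers only the eight presets listed in the table above, not all 23, and `closest_preset` is a hypothetical helper; comparing ratios in log space is a design choice made here, not something the API prescribes.

```python
import math

# Only the eight presets from the table above; the request schema defines more
PRESETS = {
    "1:1": (2048, 2048),
    "16:9": (2560, 1440),
    "9:16": (1440, 2560),
    "3:2": (2496, 1664),
    "2:3": (1664, 2496),
    "4:5": (1792, 2240),
    "5:4": (2240, 1792),
    "8:5": (2560, 1600),
}


def closest_preset(target_w: int, target_h: int) -> str:
    """Return the preset name whose aspect ratio is nearest the target's."""
    if target_w <= 0 or target_h <= 0:
        raise ValueError("target dimensions must be positive")
    target = math.log(target_w / target_h)
    # Comparing in log space treats 2:1 and 1:2 as equally far from square
    return min(
        PRESETS,
        key=lambda name: abs(math.log(PRESETS[name][0] / PRESETS[name][1]) - target),
    )


width, height = PRESETS[closest_preset(1080, 1350)]
print(width, height)
```

A 1080 × 1350 target has a 4:5 ratio, so the lookup lands on the 1792 × 2240 preset and the result can be downscaled without cropping.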

Quality tiers

Three rendering-speed tiers determine how much compute the model spends per generation:

  • TURBO is the fastest tier. Use it for the first iterations of an idea, low-stakes content, and anything where the next pass is more important than this one.
  • DEFAULT is the middle tier and the right choice for most production work.
  • QUALITY is the slowest tier. Use it for final delivery, typography-dense compositions, and hero assets.

The same prompt at each tier produces visibly different output. The differences are most pronounced in fine text and material rendering, the kind of work where the extra compute earns its price.

The vintage watchmaker's storefront window below comes from the same structured prompt at each tier, a photoreal scene with hand-painted gold-leaf lettering on dark green glass.

TURBO is fast enough for rapid iteration. The proprietor line, the establishment date, and the brush-stroke detail on the gold leaf tend to drift at this tier. DEFAULT recovers most of the small-text crispness and the reflected street behind the glass. QUALITY holds the condensed capitals sharply through to the corner lettering and renders the gold-leaf grain and the glass reflections with finer detail.

Tips

  1. Reach for Ideogram when the words have to be right. Other text-to-image models have caught up on subject rendering and style. Copy fidelity is still the differentiator. Brand work, packaging, posters, anywhere the words are the brief.
  2. Quote the literal copy inside text fields. The text field is exactly what gets rendered. Apostrophes, accents, special characters, all reproduced as written. If you want a curly apostrophe in "Dr. Faukland's", write a curly apostrophe.
  3. Describe the position, weight, and treatment in desc. "Bold black serif capitals across the top", "small italic underneath", "lower-right corner". The desc is where you spend the typographic discipline a sentence prompt can't carry.
  4. Use bbox when descriptive positioning isn't tight enough. [y_min, x_min, y_max, x_max] in 0-1000 normalised coordinates, row-first ordering, written as [top, left, bottom, right]. Pin the elements that must land exactly, leave the rest descriptive.
  5. Use color_palette for directed colour. Describing colours in prose is interpretation. The color_palette array is a conditioning signal. 16 colours image-level, 5 per element, uppercase #RRGGBB.
  6. Use PNG output for any asset that needs transparency. JPG has no alpha channel, so the image is flattened onto a white or black background where you wanted transparency. The outputFormat: "PNG" setting is one line of payload.
  7. Pick the aspect ratio that matches the final canvas, not the one that happens to suit your composition. Resizing or cropping a 16:9 to fit a 4:5 Instagram portrait loses the framing the model composed for. Generate at the target ratio.
  8. Use QUALITY tier for hero assets and typography-dense work. The extra compute pays off where small text precision matters. For thumbnails and exploratory iterations, TURBO is enough.