Model ID: ideogram:4@0

live

Ideogram 4.0

by Ideogram, June 3, 2026

Ideogram 4.0 is Ideogram's most capable text-to-image model for design-heavy image generation. It is built for frontier text rendering across languages, structured prompt control through natural language or JSON, bounding-box layout control, transparent background generation, and high-fidelity 2K output. It is well suited to posters, branded graphics, packaging, product visuals, typography-led compositions, and other workflows where design precision matters as much as visual quality.

Text and design output

How to use Ideogram 4.0 for typography-heavy design where the text has to be readable and exactly right, the layout has to land, and the palette has to lock to brand.

Introduction

Text inside images is the historical weak spot of generative models. Most image models treat letters as visual texture rather than content. They recognise that text looks a certain way without understanding spelling or character order. Headlines come back garbled and brand names get warped, so any design that depends on legible copy ends up unusable.

Ideogram 4.0 is built for the opposite problem. It treats text as a first-class element with explicit content rather than visual decoration, which makes it reliable for designs where the words have to be exactly right. It handles dense small copy, multilingual scripts, handwritten lettering, and text that's been rotated or inverted as a deliberate design choice.

A Swiss-style modernist art exhibition poster on cream paper with a large vermilion red circle on the right and the title 'FIELDS OF FOLD' set in tightly leaded black sans-serif type on the upper left — Hero: a Swiss-style exhibition poster with title, subtitle, date line, and gallery imprint as four distinct text elements, anchored by a three-colour palette

This guide covers the text-rendering capabilities, then the design output features that depend on them: descriptive and bbox-anchored layout, image-level and per-element colour palettes, transparent-background output, the aspect-ratio presets, and the three rendering-speed tiers.

Text rendering

The four sub-sections below each isolate one capability. The same text element type drives all of them. What changes is what the model is being asked to render and how the surrounding desc describes it.

Dense small text

Most image models can render a headline. Almost none can render a paragraph. As soon as the copy gets long, the text degrades into shapes that look like writing but aren't actually words. Ideogram 4.0 handles paragraph-length copy and small annotations at near-readable sizes, even when the layout is a wall of labels at different positions.

The star chart below carries more than fifteen distinct labels: five constellation names inside the disc, eleven small star annotations packed around its rim, the plate caption, and the publisher mark.

A vintage 19th-century almanac plate showing a circular dark-blue star chart of the northern summer sky, with five constellations (Lyra, Cygnus, Aquila, Hercules, Draco) labelled inside the disc and eleven named stars annotated in fine sepia ink around the disc — A 19th-century almanac plate with five constellation names and eleven small star annotations packed around the disc

The five constellation names hold their italic engraved character inside the disc. The eleven star annotations packed around the rim stay legible at engraving-plate sizes, even with leader lines threading between them. The plate number and publisher mark at the bottom edge sit cleanly in their assigned positions.

Multilingual scripts

Text rendering in image models has historically meant Latin script. Anything else returns shapes that look approximately like the script in question but don't spell anything. Ideogram 4.0 handles the major non-Latin scripts with the same precision, which matters for international signage, multilingual packaging, and any brand work that has to cross markets.

A brushed-aluminium airport wayfinding sign reading 'Baggage Claim' in English, 手荷物受取所 in Japanese, استلام الأمتعة in Arabic, and 'Recogida de equipajes' in Spanish, alongside a white suitcase pictogram and downward arrow — An airport wayfinding sign rendering the same instruction in English, Japanese, Arabic, and Spanish at equal weight

Each line of the sign is a separate text element with its own content. The same baggage-claim instruction written four ways at equal weight is a fairly extreme test, and each script's character carries through to the final render.

Handwritten and stylized lettering

Handwriting is harder than print because every glyph is an organic shape rather than a typographic instance. Most image models render "handwriting" as a fake script font with no actual hand quality. Ideogram 4.0 treats handwriting as an aesthetic to render, not a typeface to substitute, and renders dip-pen cursive, calligraphy, and sketched lettering as if a real hand made them.

The recipe page below uses handwriting where it would actually appear in 1924: a title, two section headings, an ingredient list, a method paragraph, and a signature, all in a single hand.

An open page from a vintage cookbook with a handwritten dip-pen cursive recipe reading 'Aunt Beatrice's Lemon Sponge', followed by an Ingredients section, a Method section, and the signature 'Beatrice Holloway, Easter 1924', with a small ink sketch of a frosted sponge cake in the upper-right corner — A 1920s family-cookbook page with a recipe title, two underlined section headings, an ingredient list, a method paragraph, and a signed attribution all hand-lettered

The slight forward slant and the variation in stroke weight read as a real hand. The flourish under Beatrice's signature is a small piece of personality the model added because the structure asked for it.

Inverted and rotated text

Text doesn't always sit horizontally on the canvas. Coins, seals, stamps, and curved type lockups all require letters that follow a curve, and a good portion of design work asks for text rotated or inverted as a deliberate choice. Most image models handle this badly, producing horizontal text visually distorted to fit a shape, rather than letters that actually fit the curve.

The wax seal below has two arcs of text following its rim. The top arc reads upright. The bottom arc is rotated to follow the curve, the way real engraved seals are designed.

A crimson wax seal on cream parchment, with 'VINCULUM ET LIBERTAS' arcing along the top rim upright and 'RESPVBLICA AQVITANIAE' arcing along the bottom rim with the letters rotated to follow the curve, surrounding an embossed oak-branch-and-key emblem and the Roman numerals 'MMXII' — A wax seal with a top arc reading upright and a bottom arc rotated to follow the rim

The motto along the top arcs upright. The institutional name along the bottom arcs with each glyph rotated to follow the curve, so the text reads correctly to someone tilting the seal. This is the typographic discipline an industrial-design pipeline expects, and what most generators flatten into horizontal type that's been bent into a shape.

Layout control

Layout in a structured prompt has two levers: descriptive positioning in each element's desc field, and explicit bbox coordinates that pin an element to a region of the canvas.

Descriptive positioning is the lighter touch. The model reads phrases like "centered along the top", "in the lower-right corner", or "directly beneath the title block" and places elements accordingly. It works well when the layout has clear hierarchy and the model has enough room to make small decisions.

bbox is the heavier touch. It's an array of four integers, [y_min, x_min, y_max, x_max], in 0–1000 normalised coordinates with the origin at the top-left. The model honours the box through its shared positional embedding, so the element lands inside the named region rather than approximately near it.

The bbox order is row-first (y, x rather than x, y). Designers normally think in (x, y). Build the bbox as [top, left, bottom, right] to keep the order straight. Values must be integers, in [0, 1000], with y_min ≤ y_max and x_min ≤ x_max.
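The ordering and range rules above are easy to get wrong by hand, so it can help to build every box through a small helper. The sketch below is illustrative only: `make_bbox` and `bbox_from_xywh` are hypothetical names, not part of any Ideogram SDK, and the checks simply mirror the constraints stated in this section.

```python
from typing import List


def make_bbox(top: int, left: int, bottom: int, right: int) -> List[int]:
    """Return [y_min, x_min, y_max, x_max] after checking the stated rules."""
    values = [top, left, bottom, right]
    for v in values:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(v, bool) or not isinstance(v, int):
            raise TypeError(f"bbox values must be integers, got {v!r}")
        if not 0 <= v <= 1000:
            raise ValueError(f"bbox values must lie in [0, 1000], got {v}")
    if top > bottom:
        raise ValueError("y_min (top) must not exceed y_max (bottom)")
    if left > right:
        raise ValueError("x_min (left) must not exceed x_max (right)")
    return values


def bbox_from_xywh(x: float, y: float, w: float, h: float) -> List[int]:
    """Convert designer-style (x, y, width, height), given as fractions of
    the canvas in 0.0-1.0, into the row-first 0-1000 bbox."""
    def scale(fraction: float) -> int:
        return int(round(fraction * 1000))

    return make_bbox(scale(y), scale(x), scale(y + h), scale(x + w))
```

Calling `make_bbox(40, 220, 110, 480)` returns the list unchanged, while swapping two arguments so that top exceeds bottom raises a `ValueError` before the prompt is ever sent.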

The concert ticket below is generated with explicit bbox coordinates on every element. The ASCII diagram on the left is a separate, illustrative pass through Nano Banana to sketch roughly where each element lands. The photograph on the right is what Ideogram rendered from the actual bbox coordinates.

An ASCII art diagram of a concert ticket layout, drawn with dashes, pipes, and plus signs on a white background. Five stacked rectangles in the left main area labelled ADMIT ONE, THE FILLMORE WEST, AN EVENING WITH, JONAS HARWELL TRIO, and FRIDAY · APRIL 12 · 8:00 PM. A vertical column of colons in the centre marks the perforation. Three small rectangles in the right stub labelled NO. 0274, ROW G · SEAT 14, and $4.50 — Illustrative ASCII sketch by Nano Banana (approximate, not coordinate-accurate)

A vintage 1970s concert ticket reading 'ADMIT ONE / THE FILLMORE WEST / AN EVENING WITH / JONAS HARWELL TRIO / FRIDAY · APRIL 12 · 8:00 PM' with a perforated right stub showing the ticket number, row, seat, and price, every element positioned by bbox coordinates — Ideogram's rendering from the same coordinates

ADMIT ONE sits at the top because its bbox is [40, 220, 110, 480]: top edge near y=40, near-centred horizontally between x=220 and x=480. The venue name fills the upper title block from [140, 60, 240, 660]. The performer name dominates the lower half via [360, 40, 540, 680]. The perforation obj is a tall narrow rectangle running full-height at [0, 700, 1000, 720]. Each right-stub element carves out its own small box inside the x=750-and-right zone, with a duplicate ADMIT ONE at the top of the stub and the price sitting just below the seat assignment so the stub reads as four evenly weighted rows. The elements land inside the named rectangles rather than approximately near them, and the perforation cleanly separates the stub from the main ticket area without the model negotiating where it should sit.
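To sanity-check where the boxes quoted above fall on a real canvas, the normalised values can be mapped to pixels. This is a minimal sketch: `bbox_to_pixels` is a hypothetical helper, and the canvas size used at the end is a placeholder, since the ticket's actual aspect ratio is not stated here.

```python
from typing import List, Tuple


def bbox_to_pixels(bbox: List[int], width: int, height: int) -> Tuple[int, int, int, int]:
    """Map [y_min, x_min, y_max, x_max] in 0-1000 units to
    (left, top, right, bottom) in pixels for a width x height canvas."""
    y_min, x_min, y_max, x_max = bbox
    left = x_min * width // 1000
    right = x_max * width // 1000
    top = y_min * height // 1000
    bottom = y_max * height // 1000
    return left, top, right, bottom


# Boxes quoted in the walkthrough above.
ticket = {
    "ADMIT ONE": [40, 220, 110, 480],
    "venue": [140, 60, 240, 660],
    "performer": [360, 40, 540, 680],
    "perforation": [0, 700, 1000, 720],
}

# Placeholder canvas: the 3:2 preset (2496 x 1664). Swap in the real size.
for name, box in ticket.items():
    print(name, bbox_to_pixels(box, 2496, 1664))
```

The return order is deliberately (left, top, right, bottom), the convention most image tools use, which makes the row-first to column-first swap explicit in one place.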

You can mix the two approaches. Pin the elements whose position is non-negotiable with bbox, and let the rest fall through descriptive positioning. Inside a single element, both fields can coexist: bbox declares the rectangle, and desc still carries the style and treatment notes.
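A mixed layout might look roughly like the sketch below. The `text`, `desc`, and `bbox` fields are the ones this guide names. The `type` key, the `elements` wrapper, and the wording of each `desc` are assumptions made for illustration, so check the model's request schema for the real structure.

```python
import json

elements = [
    {
        "type": "text",  # assumed key name for the element type
        "text": "JONAS HARWELL TRIO",
        "bbox": [360, 40, 540, 680],  # pinned: position is non-negotiable
        "desc": "bold condensed capitals, the dominant line on the ticket",
    },
    {
        "type": "obj",
        "bbox": [0, 700, 1000, 720],  # pinned: full-height perforation
        "desc": "vertical perforation line separating the stub",
    },
    {
        "type": "text",
        "text": "AN EVENING WITH",
        # No bbox: placement is left to descriptive positioning.
        "desc": "small italic line directly above the performer name",
    },
]

# Split the list to see at a glance which elements are pinned.
pinned = [e for e in elements if "bbox" in e]
free = [e for e in elements if "bbox" not in e]

# "elements" as the wrapper key is hypothetical.
print(json.dumps({"elements": elements}, indent=2))
```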

Colour palette control

Colour conditioning in the structured prompt is explicit. Instead of describing colours in language ("warm sunset tones with terracotta and cream"), you list hex values the model treats as the colours to favour in the composition.

There are two places color_palette can appear in the JSON:

Inside style_description, at the image level. Up to 16 colours. The global palette for the entire image.
Inside an individual element, at the per-element level. Up to 5 colours per element. Targeted conditioning for one specific object or piece of text.

Both fields take an array of uppercase #RRGGBB hex strings. Shorthand #RGB and #RRGGBBAA formats are rejected by the verifier.
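Because the format rules are strict, it can be worth normalising palettes before they go into the prompt. The following is a sketch under the limits stated above (16 colours at image level, 5 per element, uppercase #RRGGBB). The function names are hypothetical, and the real verifier may apply checks not covered here.

```python
import re
from typing import List

HEX6 = re.compile(r"#[0-9A-F]{6}")
HEX3 = re.compile(r"#[0-9A-F]{3}")
IMAGE_LEVEL_MAX = 16
ELEMENT_LEVEL_MAX = 5


def normalize_hex(color: str) -> str:
    """Uppercase a colour and expand #RGB shorthand to #RRGGBB.
    Anything else, including #RRGGBBAA, is refused rather than
    silently truncated."""
    c = color.strip().upper()
    if HEX3.fullmatch(c):
        c = "#" + "".join(ch * 2 for ch in c[1:])
    if not HEX6.fullmatch(c):
        raise ValueError(f"not a #RRGGBB colour: {color!r}")
    return c


def check_palette(colors: List[str], per_element: bool = False) -> List[str]:
    """Return a normalised palette, enforcing the documented size limit."""
    limit = ELEMENT_LEVEL_MAX if per_element else IMAGE_LEVEL_MAX
    if len(colors) > limit:
        raise ValueError(f"palette has {len(colors)} colours, limit is {limit}")
    return [normalize_hex(c) for c in colors]
```

For example, `check_palette(["#c0392b", "#fa0"])` yields `["#C0392B", "#FFAA00"]`, and a six-colour list passed with `per_element=True` raises a `ValueError`.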

The two posters below come from the same structured prompt, distinguished only by a different style_description.color_palette and a one-word change in the medium description.

A WPA-style national-park travel poster for Black Pine Ridge National Park, with a craggy mountain ridge silhouette, a lone pine in the foreground, and a setting sun on the horizon, under a warm sunset sky, in terracotta, peach, deep brown, and cream — Warm palette

The same Black Pine Ridge National Park travel poster in a cool palette of teal, powder blue, navy, and pale grey-cream, with the mountain ridge under a pre-dawn sky — Cool palette

Both posters carry the same scene description: a mountain ridge under a wide sky, a lone pine, a sun on the horizon, the title block above, the imprint along the bottom. The font weights, the exact silhouette of the pine, the mountain peaks, and the foreground will vary between any two runs of an image model. What does not vary is the mood, and the mood is what the palette controls. The model treats the array as a target conditioning signal, not as a hint to interpret in language.

Per-element color_palette gives one element its own conditioning channel. A text element with its own palette can hold a brand colour that the rest of the scene doesn't have. An obj element with its own palette can carry a product colour without bleeding into the background. Up to 5 colours per element.

Transparent backgrounds

Design pipelines rarely use generated images as the final composition. They need elements that drop into a layered file: logos, icons, monograms, ornaments, badges. Ideogram 4.0 produces these on a transparent background when you ask for one explicitly in the prompt's background field and request outputFormat: "PNG" so the alpha channel survives.

An elegant typographic monogram of interlocking serif letters M and T in deep navy blue with gold-leaf serif detail, surrounded by a thin gold-leaf decorative oval frame, on a transparent background — An interlocking M-and-T monogram delivered with no background, ready to drop into a layered design

The monogram above is a single image element delivered with no scene behind it. Drop it into a card design or a letterhead template without having to mask anything by hand.

Transparency only survives in formats that carry an alpha channel. Always set outputFormat: "PNG" when you want the background to come through transparent.
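One way to confirm that a downloaded file can actually carry transparency is to read the colour type from the PNG header. This is a minimal standard-library sketch, not part of any Ideogram tooling. It reports only whether the file has an alpha channel, not whether any pixel is actually transparent.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_has_alpha_channel(data: bytes) -> bool:
    """Return True if the PNG's IHDR colour type includes an alpha channel.

    Colour type 4 is greyscale + alpha and 6 is RGB + alpha. Palette and
    plain RGB files can still carry transparency through a tRNS chunk,
    which this simplified check does not inspect.
    """
    # Signature (8) + chunk length and type (8) + IHDR data (13) + CRC (4).
    if len(data) < 33 or data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("malformed PNG: IHDR must be the first chunk")
    _width, _height, _bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    return color_type in (4, 6)


# Usage, with a hypothetical file name:
#     with open("monogram.png", "rb") as f:
#         print(png_has_alpha_channel(f.read()))
```

A JPG passed to this function fails the signature check and raises a `ValueError`, which is a useful early signal that the output format was not set to PNG.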

Aspect ratios

Every output is approximately 4 million pixels. At 1:1 that's 2048 × 2048. Wider or taller ratios keep roughly the same pixel budget and redistribute it between width and height. The API accepts only the predefined presets. The common ones for design work:

Ratio	Dimensions	Typical use
1:1	2048 × 2048	Social squares, album covers, packshots
16:9	2560 × 1440	Landscape banners, video stills
9:16	1440 × 2560	Vertical video, story posts
3:2	2496 × 1664	35mm photo proportions, posters
2:3	1664 × 2496	Portrait posters, book covers
4:5	1792 × 2240	Editorial portrait, Instagram portrait
5:4	2240 × 1792	Specimen cards, museum labels
8:5	2560 × 1600	Widescreen banners

There are 23 presets in total, including extreme aspect ratios like 22:9, 23:9, 8:3, and 12:5, and the very long 3:1 and 1:3 for ultra-wide banners or pillar formats. The full list lives in the model's request schema. The API rejects any width and height pair outside the presets, so pick the preset closest to your target output.
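Picking the closest preset can be automated by comparing aspect ratios on a log scale, so that wide and tall mismatches are penalised symmetrically. The sketch below covers only the eight presets from the table above, not the full set of 23, and `closest_preset` is a hypothetical helper name.

```python
import math
from typing import Dict, Tuple

# Subset of presets taken from the table in this section: ratio -> (width, height).
PRESETS: Dict[str, Tuple[int, int]] = {
    "1:1": (2048, 2048),
    "16:9": (2560, 1440),
    "9:16": (1440, 2560),
    "3:2": (2496, 1664),
    "2:3": (1664, 2496),
    "4:5": (1792, 2240),
    "5:4": (2240, 1792),
    "8:5": (2560, 1600),
}


def closest_preset(target_width: int, target_height: int) -> str:
    """Return the preset whose aspect ratio is nearest the target's,
    measured as the absolute difference of log(width / height)."""
    target = math.log(target_width / target_height)

    def distance(name: str) -> float:
        w, h = PRESETS[name]
        return abs(math.log(w / h) - target)

    return min(PRESETS, key=distance)


# A 1080 x 1350 Instagram portrait maps to the 4:5 preset.
name = closest_preset(1080, 1350)
print(name, PRESETS[name])
```

Generate at the returned preset's dimensions and downscale afterwards, rather than requesting the target pixel size directly.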

Quality tiers

Three rendering-speed tiers determine how much compute the model spends per generation:

TURBO is the fastest tier. Use it for the first iterations of an idea, for low-stakes content, and for anything where the next pass matters more than this one.
DEFAULT is the middle tier and the right choice for most production work.
QUALITY is the slowest tier. Use it for final delivery, typography-dense compositions, and hero assets.

The same prompt at each tier produces visibly different output. The differences are most pronounced in fine text and material rendering, the kind of work where the extra compute earns its price.

The vintage watchmaker's storefront window below comes from the same structured prompt at each tier, a photoreal scene with hand-painted gold-leaf lettering on dark green glass.

A close-up photograph of a vintage watchmaker's shopfront window with gold-leaf lettering on dark green glass, at TURBO quality, with the proprietor line and establishment date slightly soft — TURBO

The same watchmaker's shopfront window at DEFAULT quality, with the proprietor line and establishment date crisper — DEFAULT

The same watchmaker's shopfront window at QUALITY quality, with sharp gold-leaf brush-stroke detail, a crisp proprietor line, and a finely reflected brick street behind the glass — QUALITY

TURBO is fast enough for rapid iteration. The proprietor line, the establishment date, and the brush-stroke detail on the gold leaf tend to drift at this tier. DEFAULT recovers most of the small-text crispness and the reflected street behind the glass. QUALITY holds the condensed capitals sharply through to the corner lettering and renders the gold-leaf grain and the glass reflections with finer detail.

Tips

Reach for Ideogram when the words have to be right. Other text-to-image models have caught up on subject rendering and style, but copy fidelity is still the differentiator. That covers brand work, packaging, posters, and anywhere else the words are the brief.
Quote the literal copy inside text fields. The text field is exactly what gets rendered. Apostrophes, accents, and special characters are all reproduced as written. If you want a curly apostrophe in "Dr. Faukland's", write a curly apostrophe.
Describe the position, weight, and treatment in desc. "Bold black serif capitals across the top", "small italic underneath", "lower-right corner". The desc is where you spend the typographic discipline a sentence prompt can't carry.
Use bbox when descriptive positioning isn't tight enough. [y_min, x_min, y_max, x_max] in 0-1000 normalised coordinates, row-first ordering, written as [top, left, bottom, right]. Pin the elements that must land exactly, leave the rest descriptive.
Use color_palette for directed colour. Describing colours in prose is interpretation. The color_palette array is a conditioning signal. 16 colours image-level, 5 per element, uppercase #RRGGBB.
Use PNG output for any asset that needs transparency. JPG flattens the alpha channel and gives you a white or black background where you wanted transparency. The outputFormat: "PNG" setting is one line of payload.
Pick the aspect ratio that matches the final canvas, not the closest one to your composition. Resizing or cropping a 16:9 to fit a 4:5 Instagram portrait loses the framing the model composed for the original ratio. Generate at the target ratio.
Use QUALITY tier for hero assets and typography-dense work. The extra compute pays off where small text precision matters. For thumbnails and exploratory iterations, TURBO is enough.