CLIP Skip: Adjusting text interpretation depth

Adjusts which text encoder layer interprets your prompt, shifting between literal and abstract output.

Introduction

The clipSkip parameter controls which layer of the CLIP text encoder is used to interpret your prompt. By skipping deeper layers, you change how the model reads your text, shifting between literal interpretation and more abstract, stylistic output.

Most diffusion models use a component called CLIP (Contrastive Language-Image Pre-training), which contains a text encoder that translates your prompt into a numerical representation the model can understand. This text encoder is a neural network with multiple layers, each extracting different levels of meaning:

Deeper layers (lower skip values) focus on abstract, semantic, and stylistic aspects of your prompt.
Earlier layers (higher skip values) emphasize literal, concrete interpretations.

Adjusting clipSkip lets you tune whether the model follows your prompt closely or takes more creative liberties with style and composition.
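As a rough illustration of the layer selection described above, the sketch below models the text encoder as a list of per-layer outputs and picks one based on clipSkip. The function name and the toy layer labels are hypothetical. The indexing assumes the convention used on this page, where 0 means no layers are skipped. Real encoders return tensors rather than strings, and some tools count the skip starting from 1.

```python
def select_layer_output(layer_outputs, clip_skip=0):
    """Return the output of the layer that remains after skipping
    `clip_skip` layers from the end of the encoder (0 = use the last layer)."""
    if not isinstance(clip_skip, int) or clip_skip < 0:
        raise ValueError("clip_skip must be a non-negative integer")
    if clip_skip >= len(layer_outputs):
        raise ValueError("cannot skip every layer of the encoder")
    return layer_outputs[len(layer_outputs) - 1 - clip_skip]


# Toy stand-in for a 12-layer encoder: each "output" is just a label.
layers = [f"layer_{i}" for i in range(1, 13)]

print(select_layer_output(layers, 0))  # layer_12
print(select_layer_output(layers, 2))  # layer_10
```

With a skip of 2, the prompt representation comes from the tenth of twelve layers instead of the last one.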

For sticker images, a clipSkip value of 2 tends to work best: it produces a simpler, cleaner result that suits the minimalistic style expected of a sticker.

Image: A smiling avocado with sunglasses sticker (clipSkip 0)

Image: A smiling avocado with sunglasses sticker (clipSkip 1)

Image: A simpler smiling avocado with sunglasses sticker (clipSkip 2)

For photorealistic portraits, where fine detail and realism matter most, leaving clipSkip at 0 produces a richer, more detailed image that better matches the intended outcome.

Image: Stylized portrait of a woman (clipSkip 0)

Image: Stylized portrait of a woman (clipSkip 1)

Image: Stylized portrait of a woman (clipSkip 2)

Request structure

The clipSkip parameter is an integer passed at the top level of your generation request.

import { createClient } from '@runware/sdk'

const client = await createClient({ apiKey: process.env.RUNWARE_API_KEY })
await client.connect()

const [result] = await client.run({
  model: 'civitai:101055@128078',
  positivePrompt: 'Smiling avocado with sunglasses emoji, stickers pack',
  clipSkip: 2,
  steps: 30,
  width: 1024,
  height: 1024
})

import asyncio
import os

from runware import Runware


async def main():
    async with Runware(api_key=os.environ["RUNWARE_API_KEY"]) as client:
        results = await client.run({
            "model": "civitai:101055@128078",
            "positivePrompt": "Smiling avocado with sunglasses emoji, stickers pack",
            "clipSkip": 2,
            "steps": 30,
            "width": 1024,
            "height": 1024
        })


asyncio.run(main())

curl https://api.runware.ai/v1 \
  -H "Authorization: Bearer $RUNWARE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '[
    {
      "taskType": "imageInference",
      "model": "civitai:101055@128078",
      "positivePrompt": "Smiling avocado with sunglasses emoji, stickers pack",
      "clipSkip": 2,
      "steps": 30,
      "width": 1024,
      "height": 1024
    }
  ]'

runware run civitai:101055@128078 \
  positivePrompt="Smiling avocado with sunglasses emoji, stickers pack" \
  clipSkip=2 \
  steps=30 \
  width=1024 \
  height=1024

{
  "taskType": "imageInference",
  "model": "civitai:101055@128078",
  "positivePrompt": "Smiling avocado with sunglasses emoji, stickers pack",
  "clipSkip": 2,
  "steps": 30,
  "width": 1024,
  "height": 1024
}

Recommended values

The optimal clipSkip value depends on your use case and the content you're generating:

Use case	Recommended value	Why
Photorealism, portraits	0 (disabled)	Deeper layers preserve fine detail, skin texture, and accurate color reproduction
Anime, illustrations	1 - 2	Many anime-trained models respond well to skipping 1-2 layers. The output becomes more stylized and compositionally bold
Stickers, flat art	2	Skipping more layers produces cleaner lines and simpler forms that suit flat design
Abstract, experimental	2 - 3	Higher skip values push the model toward looser, more interpretive output
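
The table above can be turned into a small lookup when building requests programmatically. This is only a sketch: the use-case keys and the `build_request` helper are hypothetical, and starting from the low end of each range is an arbitrary choice. The payload fields mirror the JSON example in the request structure section.

```python
# Ranges copied from the table above as (low, high) pairs.
# The dictionary keys are hypothetical labels, not API values.
RECOMMENDED_CLIP_SKIP = {
    "photorealism": (0, 0),
    "anime": (1, 2),
    "sticker": (2, 2),
    "abstract": (2, 3),
}


def build_request(use_case, prompt, model="civitai:101055@128078"):
    """Build an imageInference payload using the low end of the recommended range."""
    low, _high = RECOMMENDED_CLIP_SKIP[use_case]
    return {
        "taskType": "imageInference",
        "model": model,
        "positivePrompt": prompt,
        "clipSkip": low,
        "steps": 30,
        "width": 1024,
        "height": 1024,
    }


request = build_request(
    "sticker", "Smiling avocado with sunglasses emoji, stickers pack"
)
print(request["clipSkip"])  # 2
```

Treat the recommended value as a starting point and compare a few values within the range for your model.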

Architecture notes

clipSkip only applies to models that use the CLIP text encoder, such as SD 1.5 and SDXL-based models. Other models that rely on different text encoders (like T5 or LLaMA) will not be affected by this parameter.

SDXL models already skip one layer by default, so setting clipSkip to 2 on an SDXL model effectively skips three layers of the original encoder.
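
The arithmetic behind the SDXL note can be written out as a tiny helper. The architecture labels are hypothetical, and the value of 0 for SD 1.5 is an assumption inferred from the text, which mentions a default skip only for SDXL.

```python
# Layers skipped by default, per the note above.
# "sd15": 0 is an assumption; the labels themselves are hypothetical.
DEFAULT_SKIPPED_LAYERS = {"sd15": 0, "sdxl": 1}


def effective_skipped_layers(architecture, clip_skip):
    """Total layers skipped: the architecture's default plus the requested clipSkip."""
    return DEFAULT_SKIPPED_LAYERS[architecture] + clip_skip


print(effective_skipped_layers("sdxl", 2))  # 3
print(effective_skipped_layers("sd15", 2))  # 2
```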

Models like FLUX, Recraft, and other non-CLIP architectures ignore this parameter entirely. If you're unsure whether your model uses CLIP, omit clipSkip and let the default behavior take effect.
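
One way to follow this advice in code is to attach clipSkip only when the model is known to be CLIP-based and leave it out otherwise. In this sketch the architecture labels and the `with_clip_skip` helper are hypothetical, and your application would need its own way of knowing which architecture a model uses.

```python
# Hypothetical labels for the CLIP-based families named above.
CLIP_BASED_ARCHITECTURES = {"sd15", "sdxl"}


def with_clip_skip(request, architecture, clip_skip):
    """Return a copy of `request` that includes clipSkip only for
    known CLIP-based architectures; otherwise the key is omitted."""
    updated = dict(request)
    updated.pop("clipSkip", None)
    if architecture in CLIP_BASED_ARCHITECTURES:
        updated["clipSkip"] = clip_skip
    return updated


base = {"taskType": "imageInference", "positivePrompt": "Stylized portrait of a woman"}

print("clipSkip" in with_clip_skip(base, "sdxl", 2))  # True
print("clipSkip" in with_clip_skip(base, "flux", 2))  # False
print("clipSkip" in with_clip_skip(base, None, 2))    # False
```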

Tips

Default to 0 for most work. Unless you're targeting a specific stylistic effect, the default (no skip) gives you the most faithful interpretation of your prompt.
Match the model's training. Many community models on Civitai list a recommended clipSkip value. If the model was fine-tuned with clipSkip: 2, using that value will produce output closest to the model's intended aesthetic.
Pair with LoRAs carefully. Some LoRAs are trained with a specific clipSkip value. Mismatching can produce unexpected style shifts. Check the LoRA's documentation if available.
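
The last two tips can be checked mechanically if you keep a note of the values that model and LoRA authors recommend. The sketch below assumes you have collected those recommendations by hand; the helper name and the example entries are hypothetical.

```python
def find_clip_skip_mismatches(clip_skip, recommendations):
    """Return the entries whose documented clipSkip differs from the one in use.

    `recommendations` maps a model or LoRA name to its documented clipSkip,
    or to None when the author gives no recommendation.
    """
    return {
        name: recommended
        for name, recommended in recommendations.items()
        if recommended is not None and recommended != clip_skip
    }


# Example values are made up for illustration.
documented = {"base-model": 2, "style-lora": 1, "detail-lora": None}

print(find_clip_skip_mismatches(2, documented))  # {'style-lora': 1}
```

An empty result means every documented recommendation matches the value you plan to send.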