CLIP Skip: Adjusting text interpretation depth

Adjusts which text encoder layer interprets your prompt, shifting between literal and abstract output.

Introduction

The clipSkip parameter controls which layer of the CLIP text encoder is used to interpret your prompt. By skipping deeper layers, you change how the model reads your text, shifting between literal interpretation and more abstract, stylistic output.

Many diffusion models use a component called CLIP (Contrastive Language-Image Pre-training), which contains a text encoder that translates your prompt into a numerical representation the model can understand. This text encoder is a neural network with multiple layers, each extracting different levels of meaning:

  • Deeper layers (lower skip values) focus on abstract, semantic, and stylistic aspects of your prompt.
  • Earlier layers (higher skip values) emphasize literal, concrete interpretations.

Adjusting clipSkip lets you tune whether the model follows your prompt closely or takes more creative liberties with style and composition.
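The layer selection described above can be pictured with a toy sketch. It assumes the encoder exposes one output per layer and that a skip value of 0 means the final layer is used. The function select_layer_output and the layer labels are hypothetical and do not come from any real library.

```python
def select_layer_output(layer_outputs, clip_skip=0):
    """Return the output of the layer that remains after skipping
    `clip_skip` layers from the end of the encoder.

    Assumed convention for this sketch: 0 = final layer,
    1 = second to last, and so on.
    """
    if not isinstance(clip_skip, int) or clip_skip < 0:
        raise ValueError("clip_skip must be a non-negative integer")
    if clip_skip >= len(layer_outputs):
        raise ValueError("clip_skip exceeds the number of encoder layers")
    return layer_outputs[-(clip_skip + 1)]


# Stand-in for a 12-layer text encoder: each entry labels one layer's output.
layers = [f"layer_{i}" for i in range(1, 13)]

final = select_layer_output(layers, clip_skip=0)    # "layer_12"
skipped = select_layer_output(layers, clip_skip=2)  # "layer_10"
```

In a real encoder each entry would be a tensor of token embeddings rather than a string, but the indexing idea is the same.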

Request structure

The clipSkip parameter is an integer passed at the top level of the task object in your generation request, alongside fields such as model and positivePrompt.

[
  {
    "taskType": "imageInference",
    "model": "civitai:101055@128078",
    "positivePrompt": "Smiling avocado with sunglasses emoji, stickers pack",
    "clipSkip": 2,
    "steps": 30,
    "width": 1024,
    "height": 1024
  }
]
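For readers who assemble requests in code, here is a minimal Python sketch that builds the same payload and checks that clipSkip is an integer before serializing. The helper build_image_request is hypothetical, the non-negative check is an assumption rather than a documented rule, and nothing is sent anywhere.

```python
import json


def build_image_request(model, positive_prompt, clip_skip=None, **extra):
    """Build a one-task request list shaped like the example above.

    Hypothetical helper: it only assembles data and sends nothing.
    """
    task = {
        "taskType": "imageInference",
        "model": model,
        "positivePrompt": positive_prompt,
    }
    if clip_skip is not None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(clip_skip, bool) or not isinstance(clip_skip, int):
            raise TypeError("clipSkip must be an integer")
        if clip_skip < 0:
            # Assumption of this sketch: negative values are not meaningful.
            raise ValueError("clipSkip must not be negative")
        task["clipSkip"] = clip_skip
    task.update(extra)  # e.g. steps, width, height
    return [task]


payload = build_image_request(
    "civitai:101055@128078",
    "Smiling avocado with sunglasses emoji, stickers pack",
    clip_skip=2,
    steps=30,
    width=1024,
    height=1024,
)
print(json.dumps(payload, indent=2))
```

Leaving clip_skip as None omits the field, so the default behavior applies.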

Architecture notes

clipSkip only applies to models that use the CLIP text encoder, such as SD 1.5 and SDXL-based models. Other models that rely on different text encoders (like T5 or LLaMA) will not be affected by this parameter.

SDXL models already skip one layer by default, and the two values add up: setting clipSkip to 2 with SDXL effectively skips three layers of the original encoder.
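The arithmetic can be written out as a small sketch. The architecture labels "sd15" and "sdxl" are hypothetical names chosen here, and the value of 0 for SD 1.5 is an assumption inferred from the note applying only to SDXL.

```python
# Layers skipped before clipSkip is applied. The SDXL value comes from
# the note above; the SD 1.5 value of 0 is an assumption of this sketch.
BUILT_IN_SKIP = {"sd15": 0, "sdxl": 1}


def effective_skip(architecture, clip_skip):
    """Total layers skipped relative to the original encoder."""
    return BUILT_IN_SKIP[architecture] + clip_skip


sdxl_total = effective_skip("sdxl", 2)  # 1 built in + 2 requested = 3
sd15_total = effective_skip("sd15", 2)  # 0 built in + 2 requested = 2
```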

Models like FLUX, Recraft, and other non-CLIP architectures ignore this parameter entirely. If you're unsure whether your model uses CLIP, omit clipSkip and let the default behavior take effect.
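One way to follow this advice in code is to add clipSkip only when the architecture is known to use CLIP. This is a sketch under stated assumptions: with_clip_skip and the architecture labels are hypothetical, and how you determine a model's architecture is left to you.

```python
# Hypothetical labels for architectures that use the CLIP text encoder,
# following the grouping in the notes above.
CLIP_ARCHITECTURES = {"sd15", "sdxl"}


def with_clip_skip(task, architecture, clip_skip):
    """Return a copy of `task`, adding clipSkip only for CLIP-based
    architectures. Unknown or non-CLIP architectures leave it out."""
    result = dict(task)
    if architecture in CLIP_ARCHITECTURES:
        result["clipSkip"] = clip_skip
    return result


base = {"taskType": "imageInference", "positivePrompt": "a red bicycle"}

sdxl_task = with_clip_skip(base, "sdxl", 2)   # includes "clipSkip": 2
flux_task = with_clip_skip(base, "flux", 2)   # clipSkip omitted
unknown_task = with_clip_skip(base, None, 2)  # clipSkip omitted
```

Copying the task rather than mutating it keeps the base request reusable across models.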

Tips

  1. Default to 0 for most work. Unless you're targeting a specific stylistic effect, the default (no skip) gives you the most faithful interpretation of your prompt.
  2. Match the model's training. Many community models on Civitai list a recommended clipSkip value. If the model was fine-tuned with clipSkip: 2, using that value will produce output closest to the model's intended aesthetic.
  3. Pair with LoRAs carefully. Some LoRAs are trained with a specific clipSkip value. Mismatching can produce unexpected style shifts. Check the LoRA's documentation if available.
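These tips can be combined into a small pre-flight check. The sketch below is illustrative only: pick_clip_skip, its arguments, and the LoRA names are hypothetical, and the recommended values would have to come from each model's or LoRA's own documentation.

```python
def pick_clip_skip(model_recommended=None, lora_trained=()):
    """Choose a clipSkip value and collect warnings about mismatches.

    model_recommended: value listed for the base model, or None if unknown.
    lora_trained: iterable of (name, value) pairs for LoRAs that document
        the clipSkip value they were trained with.
    """
    # Tips 1 and 2: use the model's recommendation, otherwise default to 0.
    chosen = 0 if model_recommended is None else model_recommended
    # Tip 3: flag LoRAs trained with a different value.
    warnings = [
        f"{name} was trained with clipSkip {value}, request uses {chosen}"
        for name, value in lora_trained
        if value != chosen
    ]
    return chosen, warnings


value, notes = pick_clip_skip(
    model_recommended=2,
    lora_trained=[("sticker-style", 2), ("line-art", 1)],
)
# value is 2; notes holds one warning, for "line-art"
```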