ACE-Step v1.5 XL Base

4B music generation model with higher audio quality and full editing task support

Text to Audio

ACE-Step v1.5 XL Base Overview

ACE-Step v1.5 XL Base is the 4B DiT (diffusion transformer) variant of ACE-Step 1.5 for high-quality music generation and editing. It supports six tasks: text-to-music, cover generation, repaint, extract, lego, and complete. It runs 50 inference steps with classifier-free guidance (CFG), generates audio up to 10 minutes long, and accepts prompts in a broad range of languages.

How to Use ACE-Step v1.5 XL Base

Overview

ACE-Step v1.5 XL Base is the higher-capacity 4B variant of the ACE-Step 1.5 music model family. Compared with the smaller 2B models, it is intended for stronger audio quality while keeping the broader task coverage of the base configuration.

This model is a good fit when audio quality matters more than generation speed and the workflow needs editing tasks beyond plain text-to-music output.

Capabilities

Text-to-Music Generation

Generate music from text prompts, including songs with lyrics, structured musical descriptions, and genre or instrumentation guidance.

Full Editing Task Coverage

The XL Base variant supports the widest task set in the ACE-Step XL line: text-to-music, cover generation, repaint, extract, lego, and complete.

Longer Audio Generation

The ACE-Step 1.5 family generates audio from 10-second clips up to 10 minutes long, which makes it suitable for full songs, extended compositions, and longer-form background music.

Multilingual Prompting and Lyrics

The model family follows prompts in more than 50 languages, including in lyric-driven workflows and structured song control.

Higher-Capacity XL Decoder

This variant uses a 4B DiT decoder for stronger audio quality than the standard 2B models. It is the quality-oriented base option within the XL line.

Input and Output

  • Model ID: acestep-v15-xl-base
  • Input: text prompts, lyrics, and task-specific conditioning depending on workflow
  • Output: generated audio
  • Inference profile: 50 steps with CFG
  • Supported duration: 10 seconds to 10 minutes
  • Task coverage: text-to-music, cover, repaint, extract, lego, complete
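
The limits listed above can be checked before a job is submitted. The following is a minimal sketch, not an official client: the GenerationRequest class, its field names, and the validate function are hypothetical and chosen only for illustration. The model ID, task names, step count, and duration range are taken from the list above.

```python
from dataclasses import dataclass
from typing import List

# Values taken from the Input and Output list above.
MODEL_ID = "acestep-v15-xl-base"
SUPPORTED_TASKS = {"text-to-music", "cover", "repaint", "extract", "lego", "complete"}
MIN_DURATION_S = 10        # 10 seconds
MAX_DURATION_S = 10 * 60   # 10 minutes


@dataclass
class GenerationRequest:
    """Hypothetical request shape; field names are assumptions, not an official API."""
    prompt: str
    task: str = "text-to-music"
    duration_s: float = 30.0
    lyrics: str = ""
    steps: int = 50            # inference profile listed above
    model_id: str = MODEL_ID


def validate(req: GenerationRequest) -> List[str]:
    """Return a list of problems; an empty list means the request is within the listed limits."""
    errors = []
    if req.task not in SUPPORTED_TASKS:
        errors.append(f"unsupported task: {req.task!r}")
    if not MIN_DURATION_S <= req.duration_s <= MAX_DURATION_S:
        errors.append(
            f"duration must be between {MIN_DURATION_S} and {MAX_DURATION_S} seconds"
        )
    return errors


if __name__ == "__main__":
    request = GenerationRequest(prompt="warm lo-fi piano with soft drums", duration_s=120)
    print(validate(request))
```

Returning a list of problems rather than raising on the first one lets a caller report every out-of-range field at once.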

Hardware Notes

The XL line has materially higher memory requirements than the 2B models. The official model card lists support starting at 12 GB VRAM with offload and quantization, with 20 GB or more recommended for running without offload.
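
As a rough illustration of the thresholds above, the sketch below maps available VRAM to a suggested run mode. The function name and the returned labels are hypothetical. Only the 12 GB and 20 GB figures come from the text, and real memory use will also depend on audio duration and runtime settings.

```python
def suggest_run_mode(vram_gb: float) -> str:
    """Map available VRAM (in GB) to a run mode, using the thresholds stated above.

    The returned labels are illustrative names, not flags of any real tool.
    """
    if vram_gb >= 20:
        # 20 GB or more: recommended for running without offload.
        return "no-offload"
    if vram_gb >= 12:
        # 12 GB and up: listed as supported with offload and quantization.
        return "offload-and-quantization"
    # Below the lowest figure listed for the XL line.
    return "below-listed-minimum"


if __name__ == "__main__":
    for gb in (8, 12, 16, 24):
        print(gb, "GB ->", suggest_run_mode(gb))
```

On a machine with PyTorch and a CUDA device, the input could be derived from torch.cuda.get_device_properties(0).total_memory, which reports bytes and would need converting to GB first.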

Typical Use Cases

  • Higher-quality text-to-music generation
  • Full-song generation with lyrics and structure
  • Audio editing and reconstruction workflows
  • Cover generation and arrangement transfer
  • Longer-form soundtrack or background music generation