
ACE-Step v1.5 XL Base
4B music generation model with higher audio quality and full editing task support
ACE-Step v1.5 XL Base Overview
ACE-Step v1.5 XL Base is the 4B DiT (diffusion transformer) variant of ACE-Step 1.5 for high-quality music generation and editing. It supports text-to-music, cover generation, repaint, extract, lego, and complete workflows, and runs 50 inference steps with classifier-free guidance (CFG). It is designed for audio from short clips up to 10 minutes long, with broad multilingual prompt support.
How to Use ACE-Step v1.5 XL Base
Overview
ACE-Step v1.5 XL Base is the higher-capacity 4B variant of the ACE-Step 1.5 music model family. Compared with the smaller 2B models, it is intended for stronger audio quality while keeping the broader task coverage of the base configuration.
This model is a good fit when quality matters more than raw generation speed, and when the workflow needs more than plain text-to-music output, such as cover, repaint, or extract tasks.
Capabilities
Text-to-Music Generation
Generate music from text prompts, including songs with lyrics, structured musical descriptions, and genre or instrumentation guidance.
Full Editing Task Coverage
The XL Base variant supports the widest task set in the ACE-Step XL line, including text-to-music, cover generation, repaint, extract, lego, and complete workflows.
Longer Audio Generation
The ACE-Step 1.5 family supports generation from short clips up to 10 minutes, which makes it suitable for full songs, extended compositions, and longer-form background music.
Multilingual Prompting and Lyrics
The model family follows prompts in more than 50 languages, including lyric-driven workflows and structured song control.
Higher-Capacity XL Decoder
This variant uses a 4B DiT decoder for stronger audio quality than the standard 2B models. It is the quality-oriented base option within the XL line.
Input and Output
- Model ID: acestep-v15-xl-base
- Input: text prompts, lyrics, and task-specific conditioning depending on workflow
- Output: generated audio
- Inference profile: 50 steps with CFG
- Supported duration: 10 seconds to 10 minutes
- Task coverage: text-to-music, cover, repaint, extract, lego, complete
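The limits listed above can be checked before a request is sent. The following is a minimal sketch in Python, not an official client: the `GenerationRequest` class, its field names, and the `validate` helper are hypothetical and chosen only for illustration. The model ID, task names, step count, and duration range come from the list above.

```python
from dataclasses import dataclass

# Values taken from the Input and Output list above.
MODEL_ID = "acestep-v15-xl-base"
SUPPORTED_TASKS = frozenset(
    {"text-to-music", "cover", "repaint", "extract", "lego", "complete"}
)
MIN_DURATION_S = 10        # 10 seconds
MAX_DURATION_S = 10 * 60   # 10 minutes
DEFAULT_STEPS = 50         # inference profile: 50 steps with CFG


@dataclass(frozen=True)
class GenerationRequest:
    """Hypothetical request shape; field names are assumptions."""
    prompt: str = ""
    lyrics: str = ""
    task: str = "text-to-music"
    duration_s: float = 30.0
    steps: int = DEFAULT_STEPS
    model_id: str = MODEL_ID


def validate(req: GenerationRequest) -> list[str]:
    """Return a list of problems; an empty list means the request is in range."""
    errors = []
    if req.task not in SUPPORTED_TASKS:
        errors.append(f"unsupported task: {req.task!r}")
    if not (MIN_DURATION_S <= req.duration_s <= MAX_DURATION_S):
        errors.append(
            f"duration {req.duration_s}s is outside "
            f"{MIN_DURATION_S}-{MAX_DURATION_S}s"
        )
    return errors
```

Calling `validate(GenerationRequest(prompt="warm lo-fi piano", duration_s=120))` returns an empty list, while a 5-second request or an unknown task name yields one message each. Prompt and lyrics are left optional here because the required conditioning depends on the workflow.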
Hardware Notes
The XL line has materially higher memory requirements than the 2B models. The official model card lists support starting at 12 GB VRAM with offload and quantization, with 20 GB or more recommended for running without offload.
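As a rough illustration of these thresholds, the sketch below maps available VRAM to a run mode. The function name and the returned labels are hypothetical; only the 12 GB and 20 GB figures come from the note above, and real memory use will also depend on audio duration and runtime settings.

```python
MIN_VRAM_GB = 12          # listed starting point, with offload and quantization
NO_OFFLOAD_VRAM_GB = 20   # recommended for running without offload


def pick_memory_profile(vram_gb: float) -> str:
    """Map available VRAM (in GB) to a coarse run mode (labels are illustrative)."""
    if vram_gb < MIN_VRAM_GB:
        return "below-listed-minimum"
    if vram_gb < NO_OFFLOAD_VRAM_GB:
        return "offload+quantization"
    return "no-offload"
```

For example, a 16 GB card falls in the "offload+quantization" range and a 24 GB card in the "no-offload" range. With PyTorch, one way to obtain the input is `torch.cuda.get_device_properties(0).total_memory`, which reports bytes and would need converting to GB.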
Typical Use Cases
- Higher-quality text-to-music generation
- Full-song generation with lyrics and structure
- Audio editing and reconstruction workflows
- Cover generation and arrangement transfer
- Longer-form soundtrack or background music generation