
ACE-Step v1.5 XL SFT
Highest-quality 4B music generation model with CFG-controlled prompt adherence
ACE-Step v1.5 XL SFT Overview
ACE-Step v1.5 XL SFT is the supervised fine-tuned 4B DiT variant in the ACE-Step 1.5 XL line. It is positioned as the highest-quality XL option, combining 50-step CFG inference with stronger prompt adherence and refined audio quality. It targets text-to-music, cover, and repaint workflows where final output quality matters more than speed or broader editing task coverage.
How to Use ACE-Step v1.5 XL SFT
Overview
ACE-Step v1.5 XL SFT is the highest-quality 4B variant in the ACE-Step 1.5 XL music model family.
Compared with XL Base and XL Turbo, it is the quality-first option: a supervised fine-tuned model with CFG support, very high audio quality, and stronger prompt adherence. It fits best when the goal is final output quality rather than the broadest task set or the fastest turnaround.
Capabilities
Text-to-Music Generation
Generate music from text prompts, including songs with lyrics, structured musical descriptions, and genre or instrumentation guidance.
Quality-First XL Variant
Within the ACE-Step 1.5 XL line, the SFT variant is explicitly positioned as the highest-quality option. It is the strongest fit when the generation should sound as polished as possible and prompt adherence matters.
CFG-Based Prompt Control
ACE-Step v1.5 XL SFT supports classifier-free guidance, which allows finer-grained control over prompt adherence than the distilled turbo path.
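As a rough illustration of what classifier-free guidance does, the sketch below shows the generic CFG blending rule in plain Python. It is not taken from the ACE-Step codebase: the function name `cfg_combine` and the `guidance_scale` parameter are assumptions made for this example, and the model's actual guidance implementation may differ.

```python
from typing import List


def cfg_combine(cond: List[float], uncond: List[float], guidance_scale: float) -> List[float]:
    """Generic classifier-free guidance blend (illustrative, not ACE-Step's code).

    cond:   model prediction when conditioned on the prompt
    uncond: model prediction with the prompt dropped
    guidance_scale: 0 ignores the prompt, 1 reproduces the conditional
                    prediction, values above 1 push further toward the prompt
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    # Start from the unconditional prediction and move along the
    # direction that the prompt adds, scaled by guidance_scale.
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    # Toy two-element "predictions" purely for demonstration.
    print(cfg_combine([1.0, 2.0], [0.0, 1.0], guidance_scale=2.0))
```

A distilled turbo path typically bakes a fixed guidance behavior into the model, which is why an explicit scale like this is associated with the non-distilled variants.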
Standard XL Task Coverage
The SFT variant supports the standard ACE-Step XL generation workflow set, including text-to-music, cover generation, and repaint tasks.
Longer Audio Generation
Like the rest of the ACE-Step 1.5 family, the XL line supports generation from 10-second clips up to 10 minutes, making it suitable for full songs, extended compositions, and longer-form audio work.
Multilingual Prompting and Lyrics
The model family follows prompts in 50+ languages, including lyric-driven workflows and structured song control.
Input and Output
- AIR ID: runware:[email protected]
- Input: text prompts, lyrics, and task-specific conditioning depending on workflow
- Output: generated audio
- Inference profile: 50 steps with CFG
- Supported duration: 10 seconds to 10 minutes
- Task coverage: text-to-music, cover, repaint
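The limits listed above can be checked before a request is sent. The following is a minimal sketch, assuming a hypothetical `GenerationRequest` structure; the field names (`task`, `duration_s`, `steps`) are invented for illustration and are not the actual API parameter names. Only the numeric limits and task names come from the list above.

```python
from dataclasses import dataclass
from typing import List

# Values taken from the Input and Output list above.
SUPPORTED_TASKS = {"text-to-music", "cover", "repaint"}
MIN_DURATION_S = 10          # 10 seconds
MAX_DURATION_S = 10 * 60     # 10 minutes
DEFAULT_STEPS = 50           # 50-step inference profile


@dataclass
class GenerationRequest:
    """Hypothetical request shape; field names are assumptions."""
    prompt: str
    task: str = "text-to-music"
    duration_s: float = 60.0
    lyrics: str = ""
    steps: int = DEFAULT_STEPS


def validate(req: GenerationRequest) -> List[str]:
    """Return a list of human-readable problems; empty means the request fits the documented limits."""
    errors = []
    if req.task not in SUPPORTED_TASKS:
        errors.append(f"unsupported task: {req.task!r}")
    if not MIN_DURATION_S <= req.duration_s <= MAX_DURATION_S:
        errors.append(
            f"duration must be between {MIN_DURATION_S} and {MAX_DURATION_S} seconds"
        )
    return errors


if __name__ == "__main__":
    print(validate(GenerationRequest(prompt="warm lo-fi piano", duration_s=5)))
```

Returning a list of errors rather than raising on the first problem lets a caller report every issue with a request at once.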
Hardware Notes
The XL line has materially higher memory requirements than the 2B models. The official model card lists support starting at 12 GB VRAM with CPU offload and INT8 quantization, with 20 GB or more recommended for running without offload and 24 GB for full-quality XL plus the 4B LM.
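The VRAM tiers above can be expressed as a small lookup. This is only a sketch of the figures quoted from the model card: the function name `suggest_profile` and the returned labels are assumptions for illustration, and real memory use will also depend on factors such as audio duration and batch size.

```python
def suggest_profile(vram_gb: float) -> str:
    """Map available VRAM (in GB) to the tier described in the hardware notes.

    Thresholds (12, 20, 24 GB) come from the text above; the labels are illustrative.
    """
    if vram_gb >= 24:
        return "full-quality XL plus 4B LM"
    if vram_gb >= 20:
        return "XL without offload"
    if vram_gb >= 12:
        return "XL with CPU offload and INT8 quantization"
    return "below the listed 12 GB minimum"


if __name__ == "__main__":
    for gb in (8, 12, 16, 20, 24):
        print(gb, "GB:", suggest_profile(gb))
```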
Typical Use Cases
- Highest-quality text-to-music generation in the XL family
- Final-pass song generation from prompts and lyrics
- Cover generation with stronger prompt adherence
- Repaint workflows where quality matters more than speed
- Longer-form soundtrack or background music generation