Best Audio

The strongest audio generation models available, selected for clarity, naturalness, and timing across voice, narration, sound effects, and music output.

Top Pick

Best rated

Fish Audio S2.1 Pro

by Fish Audio

Fish Audio S2.1 Pro is a flagship text-to-speech model built for highly expressive, low-latency speech generation. It supports natural-language bracket cues for emotion and delivery control, multi-speaker dialogue in a single generation, 80+ languages with automatic language detection, and realtime streaming with very fast time to first audio.

Featured Models

Top-performing models in this category, recommended by our community and performance benchmarks.

Inworld Realtime TTS-2

by Inworld AI

Inworld Realtime TTS-2 is a conversational text-to-speech model built for realtime voice interaction rather than static narration. It supports free-form voice direction, carries tone and pacing forward from prior audio in a session, preserves one voice identity across 100+ languages, and is designed for expressive, low-latency speech in assistants, characters, support agents, and interactive products.

Gemini 3.1 Flash TTS

by Google

Gemini 3.1 Flash TTS is a text-to-speech model for expressive spoken audio generation from text. It supports granular control over delivery through audio tags, native multi-speaker dialogue, and speech generation across 70+ languages, making it suitable for narration, conversational voice apps, podcasts, audiobooks, and other production-oriented voice workflows.

ACE-Step v1.5 XL SFT

ACE-Step v1.5 XL SFT is the supervised fine-tuned 4B DiT variant in the ACE-Step 1.5 XL line. It is positioned as the highest-quality XL option, combining 50-step CFG inference with stronger prompt adherence and refined audio quality for text-to-music, cover, and repaint workflows when final output quality matters more than speed or broader editing task coverage.

ACE-Step v1.5 XL Base

ACE-Step v1.5 XL Base is the 4B DiT variant of ACE-Step 1.5 for high-quality music generation and editing. It supports text-to-music, cover generation, repaint, extract, lego, and complete workflows, uses 50 inference steps with CFG, and is designed for longer-form audio generation up to 10 minutes with broad multilingual prompt support.

ACE-Step v1.5 XL Turbo

ACE-Step v1.5 XL Turbo is the accelerated 4B DiT variant of ACE-Step 1.5. It is optimized for faster music generation with 8-step distilled inference while retaining the higher-capacity XL architecture. It supports text-to-music, cover generation, and repaint workflows, making it suitable for rapid iteration when the audio quality of the 2B Turbo model is not sufficient.

MiniMax Music 2.6

by MiniMax

MiniMax Music 2.6 is MiniMax’s latest music generation model for full vocal songs and instrumentals from text prompts. It supports natural-language prompts or detailed production-style instructions, follows specified BPM and key with high reliability, and exposes fine-grained song structure control through section tags. The same Music API also supports instrumental generation, lyrics-assisted workflows, and synchronous or streaming delivery.

MiniMax Music Cover

by MiniMax

MiniMax Music Cover is MiniMax’s song-to-song transformation model for reimagining an existing track in a new style. It preserves the original vocal melody while changing voice timbre, instrumentation, genre, and arrangement through a text prompt. It supports one-step generation from reference audio or a two-step workflow with preprocessing and optional lyric editing.

MiniMax Speech 2.8

by MiniMax

MiniMax Speech 2.8 is an advanced text-to-speech model that turns text into natural, expressive audio in multiple languages. It delivers broadcast-ready speech with rich prosody, emotional control, and a diverse voice library. The model accepts long text inputs and can be used for voiceovers, narration, accessibility tools, and interactive voice applications.

#10

xAI Text-to-Speech

by xAI

xAI Text-to-Speech converts text into natural-sounding spoken audio with a single API call. It offers five expressive voices (Eve, Ara, Leo, Rex, and Sal), inline speech tags for fine-grained control over pauses, laughter, whispers, and emphasis, and supports over 20 auto-detected languages.

#11

Inworld TTS-1.5 Max

by Inworld AI

Inworld TTS-1.5 Max is a high-fidelity text-to-speech model engineered for expressive voice synthesis with rich prosody, nuanced emotional range, and broadcast-ready audio quality. It supports a wide set of languages and delivers more natural pronunciation and expressive variation suitable for narration, content creation, and immersive character voices. The Max variant prioritizes audio quality and expressiveness while still supporting responsive generation.

#12

Inworld TTS-1.5 Mini

by Inworld AI

Inworld TTS-1.5 Mini is a lightweight text-to-speech model designed for real-time voice experiences with ultra-low latency and efficient performance. It delivers natural, expressive audio suitable for interactive agents, voice assistants, and conversational applications where responsiveness is critical. The Mini variant balances speed and quality, enabling responsive speech output even under constrained compute conditions.

#13

Qwen3-TTS 1.7B Base

by Alibaba

Qwen3-TTS 1.7B Base is the foundation text-to-speech model from Alibaba's Qwen3-TTS family. It generates human-like speech across 10+ languages including Chinese, English, Japanese, Korean, and European languages. It supports voice cloning from a 3-second audio sample and achieves latency as low as 97ms for real-time applications.

#14

Qwen3-TTS 1.7B CustomVoice

by Alibaba

Qwen3-TTS 1.7B CustomVoice is a text-to-speech model from Alibaba that offers nine premium preset timbres across various combinations of gender, age, language, and dialect. It provides precise style control over target voices through user instructions, supports voice cloning from a 3-second sample, and generates speech in 10+ languages with latency as low as 97ms.

#15

Qwen3-TTS 1.7B VoiceDesign

by Alibaba

Qwen3-TTS 1.7B VoiceDesign is a text-to-speech model from Alibaba that creates custom voices from natural language descriptions specifying emotion, tone, and prosody. It supports voice cloning from a 3-second audio sample, generates speech in 10+ languages including Chinese, English, Japanese, Korean, and European languages, and achieves latency as low as 97ms.

#16

ACE-Step v1.5 Base

ACE-Step v1.5 Base is an open-source music generation foundation model built on a hybrid LLM planner and Diffusion Transformer architecture. It generates full tracks from text prompts with support for voice cloning, lyric editing, remixing, cover generation, and compositions up to 10 minutes. It supports over 50 languages and runs on consumer hardware with under 4GB VRAM.

#17

Dia2 2B

Dia2 2B is a 2-billion-parameter streaming text-to-speech model from Nari Labs designed for real-time conversational AI. It begins generating audio immediately from partial text input and supports multi-speaker dialogue via speaker tags, voice cloning from a few seconds of reference audio, and non-verbal cues such as laughter, sighs, and coughs. It is released under the Apache 2.0 license, which permits commercial use.

#18

lipsync-2-pro

by sync.

lipsync-2-pro extends lipsync-2 with diffusion-based enhancement for studio-grade lip synchronization. It preserves fine facial details such as teeth, facial hair, and micro-expressions while supporting high-resolution output suitable for professional post-production workflows.
