Your AI video pipeline is silent. Here's the fix.
Mirelo SFX v1.5 adds synchronized sound effects to silent AI video clips without regeneration, now available on Runware.

Most AI video pipelines have the same gap. You generate a clip, it looks good, and then you realize it has no sound, or the sound is just not good enough. Your options at that point are not great: regenerate and hope the next model delivers something better, spend time manually sourcing and syncing effects, or ship something that feels unfinished.
Mirelo SFX v1.5 is a post-processing step that removes that decision. You pass it a silent video, it returns the same clip with synchronized sound effects. No regenerating. No writing sound descriptions for every clip. The video is the prompt.
It's now available on Runware.
How Mirelo SFX v1.5 works
Mirelo SFX v1.5 reads your video frames, infers what's happening on screen, and generates sound effects timed to match. Footsteps, impacts, ambient noise, mechanical sounds, or weather. Whatever the scene calls for. You can steer it with a text prompt if you want, or leave it to interpret the footage on its own.
A few things worth knowing upfront:
- Input: silent video clips up to 10 seconds (longer clips are truncated)
- Output: the same video with synced SFX, or an audio-only track for compositing
- Steering: optional text prompt to guide the sound style or character
- Variations: 2 to 4 samples per call
- Pricing on Runware: $0.01 per second of output audio per sample (a three-second clip with three variations costs nine cents)
The model won't generate speech or music. If you need those, you're handling them separately. For most pipeline use cases, that's actually preferable: clean, compositable SFX with nothing bleeding in.
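The pricing is a straight multiplication, which makes budgeting at volume easy to sanity-check. A throwaway helper (the function name and rounding are my own, not part of any SDK):

```python
def sfx_cost(duration_seconds: float, variations: int,
             rate_per_second: float = 0.01) -> float:
    """Runware pricing: $0.01 per second of output audio, per sample."""
    return round(duration_seconds * variations * rate_per_second, 4)

# A three-second clip with three variations: 3 s x 3 samples x $0.01 = $0.09
print(sfx_cost(3, 3))
```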
[Demo: Explosion clip, silent input vs. output with Mirelo SFX]
[Demo: Cooking clip, silent input vs. output with Mirelo SFX]
How it compares
MMAudio V2 is the closest like-for-like alternative: video in, synced audio out. It's a solid research model, published at CVPR 2025, with capable output and API availability on third-party platforms.
The practical differences are worth knowing. MMAudio's weights are released under CC-BY-NC 4.0, which means commercial use isn't guaranteed and needs to be assessed before you ship. It also works best with a text prompt alongside the video, so someone or something in your pipeline needs to describe the sound for each clip. And it returns one output per call.
Mirelo is a commercial API with clear usage rights, requires no prompt, and returns up to four variations per call. For pipelines running at volume without a human in the loop, those aren't minor differences.
| | Mirelo SFX v1.5 | MMAudio V2 |
|---|---|---|
| Input | Silent video | Video + text prompt |
| Output | Clean SFX track | Full audio (SFX, ambient, sometimes music) |
| Commercial use | Yes | Not guaranteed (CC-BY-NC) |
| Prompt required | No | Recommended |
| Variations per call | 2 to 4 | 1 |
| Max duration | 10 seconds | 8 seconds (optimized) |
[Audio comparison demos: Mirelo SFX v1.5 versus MMAudio V2, three paired examples]
Where it fits in real pipelines
The simplest case
Generate video, call Mirelo, get back a clip with sound. For use cases that don't need voice or music, that's the whole pipeline. It works well for:
- Product showcases and demos
- Social content and short-form video
- Game cutscenes and UI previews
- Any context where a watchable clip matters more than a fully produced piece
It won't replace a sound designer on a major production. But as a default step that runs on every clip, it holds up.
Prompting when you have context
The model interprets video frames on its own, but it responds well to text prompts when you give them. If your pipeline already tracks generation metadata, use it.
A clip generated with a prompt like "underwater chase sequence" will produce noticeably better audio when you pass something like "muffled impact, water movement, low ambient pressure" to Mirelo alongside the video. Stylized or non-photorealistic content especially benefits here: the visual cues are harder for any model to read, and a prompt anchors the output in the right sonic territory. Passing your generation prompt through to Mirelo adds almost no overhead and improves results on edge cases the model would otherwise misread.
The full audio stack
When you need all three layers (sound effects, voice, and music), keep them separate. Each layer stays independent, so if a clip gets regenerated, you're re-running Mirelo on that clip, not rebuilding the whole mix.
A typical setup:
- SFX: Mirelo SFX v1.5, using audio-only output mode
- Voice: ElevenLabs or similar
- Music: your background score model of choice
- Composition: FFmpeg or your audio tool of choice
Batch with variation selection
Mirelo returns two to four variants per call. In a high-volume pipeline, run each clip with three variations and build lightweight selection logic to pick the best one before delivery. It's not a substitute for quality review on anything high-stakes, but for pipelines generating at volume, it reduces outliers without adding a full review step.
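The selection logic can be as thin as a scoring callable over the returned variants. What the scorer measures (loudness range, clipping, silence ratio) depends on your content; here it's injected, and the variant dicts are a hypothetical shape, not the real API response:

```python
from typing import Callable

def pick_best(variants: list[dict],
              score: Callable[[dict], float]) -> dict:
    """Pick one of the 2-4 variants Mirelo returns per call.

    `variants` is assumed to be a list of result dicts from your API
    client; `score` encodes whatever heuristic you trust.
    """
    if not variants:
        raise ValueError("no variants to select from")
    return max(variants, key=score)
```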
A few honest caveats
The ten-second limit is real. For short-form content, it's fine. For anything longer, you're either trimming clips, chaining calls, or choosing a different approach.
Where it can struggle is with highly abstract or surreal video, where motion cues are ambiguous. Most standard AI video output won't hit this edge case, but some will. Worth testing on your specific content before committing it to production.
And as with any generative model, build variation selection into your pipeline rather than assuming the first output is the right one.
Getting started
Mirelo SFX v1.5 is in the Runware Playground now if you want to test it on your own clips before writing any API code. Same credentials, same billing as everything else on the platform.
Try it in the Playground or go straight to the API docs.
