
Sonic Inference Engine®

Custom AI-native hardware engineered to deliver the fastest AI inference at 10x lower cost. Designed from the PCB up for high inference throughput, with proprietary boards, servers, racks, networking, data center and cooling architecture.

+100%
Extra throughput per GPU chip
No Restrictions
Run any model natively
10x Efficiency
2x inference with 5x less OPEX
Geo Proximity
Deployed near your users

Engineered for AI Inference

  • Custom AI-native servers running best-in-class NVIDIA GPUs
  • End-to-end design of compute, storage, networking and cooling
  • Accelerated inference software with +100% compounded throughput
  • Proprietary Model Lake with sub-second cold starts for 400k+ models

Dominates Traditional Data Centers

  • 2x inference throughput for top open source AI models
  • 80% lower CAPEX and OPEX to deploy
  • 50x faster build-out, 3 weeks to deploy instead of 3+ years
  • Highest-density AI compute, 1 MW in a 20 ft. Inference Pod

meet the inference pod

The system powering Sonic Inference Engine, optimised for cost and performance

Front view of a pod

Inference Pod

  • Unified API – Scale inference to millions of users instantly
  • Inference Nodes (x6) – 2x throughput per node
  • High Performance Networking & Routing – Custom ultra-low-latency PCIe networking | Intelligent inference request routing to nodes
  • Realtime Model Lake – Sub-second cold starts for 400k+ models | Enterprise-grade redundant storage
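To make the single-entry-point idea concrete, here is a minimal Python sketch of a client submitting a request to a unified inference endpoint. The URL, payload fields and model name are illustrative placeholders, not Runware's documented API.

```python
# Hypothetical sketch: submitting an inference request to one unified
# endpoint that routes it to an available Inference Node inside a pod.
# The URL, payload fields and model name are illustrative assumptions.
import os
import requests

API_URL = "https://api.example-inference.ai/v1/inference"  # placeholder endpoint

payload = {
    "model": "org/some-open-source-model",            # any natively supported model
    "input": {"prompt": "A liquid-metal server rack"}, # task-specific input
}

response = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {os.environ['INFERENCE_API_KEY']}"},
    timeout=60,
)
response.raise_for_status()
print(response.json())
```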

scaling to 1M+ AI models in 2026

System architecture for on-demand loading, intelligent routing, and model residency

  • Model Lake keeps all models available for inference at any time
  • Models load on demand into GPU memory when not already resident
  • Intelligent routing sends each request to the best node based on availability, performance, and queue depth, as sketched in the example below
  • Models stay resident on nodes for as long as they keep receiving requests
  • Model Upload API loads compatible models directly into the Model Lake for instant access
Model Lake Global Scaling Diagram
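The loading, routing and residency behaviour described above can be condensed into a short illustrative sketch. The Node class, scoring rule and model names below are simplified assumptions, not the production scheduler.

```python
# Simplified, hypothetical sketch of on-demand loading, intelligent routing
# and model residency. Names and the scoring rule are illustrative only.
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    resident_models: set[str] = field(default_factory=set)
    queue_depth: int = 0
    available: bool = True

    def load_from_model_lake(self, model: str) -> None:
        # On-demand load into GPU memory when the model is not already resident.
        self.resident_models.add(model)


def route(request_model: str, nodes: list[Node]) -> Node:
    """Pick the best available node: prefer one that already holds the model
    (no cold start), then the one with the shortest queue."""
    candidates = [n for n in nodes if n.available]
    best = min(
        candidates,
        key=lambda n: (request_model not in n.resident_models, n.queue_depth),
    )
    if request_model not in best.resident_models:
        best.load_from_model_lake(request_model)  # cold start from the Model Lake
    best.queue_depth += 1  # the model stays resident while requests keep arriving
    return best


if __name__ == "__main__":
    pod = [Node("node-1", {"model-a"}), Node("node-2"), Node("node-3")]
    print(route("model-a", pod).name)  # "node-1": already resident, no cold start
    print(route("model-b", pod).name)  # "node-2": loaded on demand from the Model Lake
```

Preferring nodes that already hold the model avoids cold starts entirely, and queue depth breaks ties so load spreads across the pod.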

built for sustained performance

Parallel GPUs, high-frequency CPUs, and software tuned for throughput

+2x AI Inference Throughput per Node
Off-the-shelf GPU servers waste 40 to 60% of throughput due to CPU frequency and memory bottlenecks. Our Inference Nodes are built to sustain 100% GPU throughput across multiple GPUs, continuously.
Designed to Parallelise Large Model Inference
Store nearly any large model locally and parallelise inference across multiple GPUs, delivering the lowest end-to-end inference time in the industry across models and modalities.
No AI Model Limitations, Native Support
Run AI models natively on Inference Nodes with all supported capabilities. No customisation or adaptation is needed: our hardware runs any open-source model that runs on a standard GPU server.
Low-Level Software Optimisations
We optimise everything from the BIOS and kernel to the OS distribution and configuration to minimise latency and maximise throughput. These optimisations amplify hardware performance gains by up to 100%.

local access, global scale

A globally distributed inference platform built for low latency and unlimited scale

Local availability
Global points of presence with local Model Lakes, low latency and realtime cold starts
Universal compatibility
Any Pod can run any model, no restrictions on model or inference type
Elastic scaling
Capacity scaling via third-party GPU providers, with no scale or capacity limits
Direct power sourcing
Power bought directly at the source for the lowest cost, with no overheads or transport charges
Global map

// let's build

pay less, ship more

Join 200K+ devs using the most flexible, fastest, and lowest-cost API for media generation.