Beta

Scale AI without paying for idle infrastructure.

Deploy and scale your AI workloads on Runware's optimized hardware. Pay only for the runtime you use, billed by the second, with no infrastructure to manage and capacity that scales with demand.

$1.99 per GPU-hour at launch

Runtime pricing

Pay-per-second runtime billing at bare-metal price points.

Pay only for the time you use, not for cold starts or idle capacity. Stay flexible by default with no commitment, or reserve capacity for a lower rate. Runware's optimized hardware and software acceleration bring serverless price points down to levels usually reserved for long-term bare-metal rentals.

Typical market rate (comparable serverless GPU capacity): $3–4+

Runware launch price (RTX PRO 6000 · 96 GB GDDR7): $1.99

Per GPU-hour, workhorse inference capacity. Running our own optimized hardware lets us offer some of the best rates in the industry, and launch pricing is limited, so lock in your rate early. List rates move around, so check the math on your own workload.
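One way to check the math on your own workload is a short calculation like the sketch below. It assumes a flat per-GPU-hour rate billed by the second and uses the $1.99 launch price and the $3–4 market range quoted above. The function name and the 250-hour workload are illustrative assumptions, not part of Runware's tooling.

```python
def runtime_cost(gpu_seconds: float, rate_per_gpu_hour: float) -> float:
    """Cost in dollars of per-second billing at a given per-GPU-hour rate."""
    return gpu_seconds * rate_per_gpu_hour / 3600


# Hypothetical workload: 250 GPU-hours of actual runtime in a month.
gpu_seconds = 250 * 3600

runware = runtime_cost(gpu_seconds, 1.99)      # launch price quoted above
market_low = runtime_cost(gpu_seconds, 3.00)   # low end of the quoted range
market_high = runtime_cost(gpu_seconds, 4.00)  # high end of the quoted range

print(f"Runware launch: ${runware:,.2f}")
print(f"Market range:   ${market_low:,.2f} to ${market_high:,.2f}")
```

Swap in your own monthly runtime and the rate you pay today. Only runtime seconds go into `gpu_seconds`, so a bursty workload should be measured by time actually running or held warm, not by wall-clock hours.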

Profile | Memory / class | Typical use | Per second | Per hour | Terms
vCPU node | CPU service profile | Chatbot apps, routing, Comfy servers, orchestration | $0.0000044 / vCPU-sec | $0.016 | On-demand or reserved
RTX PRO 6000 (Launch) | 96 GB GDDR7 | Runware workhorse inference | $0.000553 / GPU-sec | $1.99 | As low as $0.99 with launch reserved pricing
H100 | 80 GB HBM | High-throughput inference and compatibility needs | $0.000767 / GPU-sec | $2.76 | Launch pricing preview
H200 | 141 GB HBM3e | Large-context and memory-heavy inference | $0.000883 / GPU-sec | $3.18 | Launch pricing preview
B200 | Premium reserved | Large reserved deployments and premium clusters | $0.001386 / GPU-sec | $4.99 | Reserved cluster pricing
B300 | Premium reserved | Frontier reserved clusters with custom specs | Custom | Custom | 1 to 2 year reserved terms
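The per-hour column is the per-second rate multiplied by 3,600. As a quick sanity check, the sketch below recomputes the hourly equivalents from the per-second rates listed in the table. The dictionary and function names are illustrative, and B300 is left out because its pricing is custom.

```python
SECONDS_PER_HOUR = 3600

# Per-second list rates copied from the table above (USD).
PER_SECOND_RATES = {
    "vCPU node": 0.0000044,
    "RTX PRO 6000": 0.000553,
    "H100": 0.000767,
    "H200": 0.000883,
    "B200": 0.001386,
}


def hourly_equivalent(rate_per_second: float) -> float:
    """Convert a per-second rate to its per-hour equivalent."""
    return rate_per_second * SECONDS_PER_HOUR


for profile, rate in PER_SECOND_RATES.items():
    print(f"{profile:<14} ${hourly_equivalent(rate):.3f} per hour")
```

The results agree with the listed per-hour prices to within rounding.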

You pay for runtime, not infrastructure ownership. The meter runs while your application does. A cost-efficient GPU covers most inference; higher-memory and frontier classes are available from our wider pool, so you can match the hardware to your workload.

Reserved capacity

Lower rates when you reserve capacity.

Run flexibly with no commitment and per-second billing, or reserve capacity for guaranteed availability, concurrency, and region at a lower rate. Reservation terms are optional and scoped to your scale.

Reserved capacity: up to 50% lower than on-demand

Lock in lower economics and deployment priority with a reservation when your workload is steady. No commitment required when it isn't.
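Whether a reservation pays off depends on how busy the capacity is, because reserved capacity is billed even while idle. The sketch below estimates the break-even point. It assumes a reservation is billed at a flat rate for every hour of the period, and it uses the $1.99 on-demand and $0.99 reserved figures listed for the RTX PRO 6000. Actual reserved rates depend on duration, availability, and scale, and the function names are illustrative.

```python
def break_even_utilization(on_demand_rate: float, reserved_rate: float) -> float:
    """Fraction of the period a GPU must be busy for reserved to cost less.

    On-demand cost = on_demand_rate * busy_hours
    Reserved cost  = reserved_rate * total_hours  (billed even while idle)
    Setting them equal gives busy_hours / total_hours = reserved / on_demand.
    """
    return reserved_rate / on_demand_rate


def cheaper_option(busy_hours: float, total_hours: float,
                   on_demand_rate: float, reserved_rate: float) -> str:
    """Return which billing mode costs less for the given usage."""
    on_demand_cost = on_demand_rate * busy_hours
    reserved_cost = reserved_rate * total_hours
    return "reserved" if reserved_cost < on_demand_cost else "on-demand"


threshold = break_even_utilization(1.99, 0.99)
print(f"Break-even utilization: {threshold:.1%}")

# Hypothetical 720-hour month at two utilization levels.
print(cheaper_option(200, 720, 1.99, 0.99))  # light, bursty usage
print(cheaper_option(500, 720, 1.99, 0.99))  # steady usage
```

Count time held warm as busy hours, since on-demand billing covers it too. The comparison covers price only and leaves out the guaranteed availability, concurrency, and region that a reservation also provides.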

One deploy.
We run the rest.

Bring a container, an existing server, or model code and get a production endpoint that scales with traffic, with support for both fast synchronous calls and long-running jobs. Scaling, routing, and recovery are ours to run, not yours.

Source, configure, compute: step through the deploy flow and you have a live, autoscaling endpoint.

Deployment wizard, step one: choose how to bring your workload to Runware, as an HTTP server, a container, or just code.

How it works

Serverless runtime for code, containers, and applications.

01 · Package

Package the workload

Code, a container, an existing server, or your application. Private registries and runtime secrets included.

02 · GPU

Choose a GPU class

A cost-efficient default covers most inference. Reach for higher-memory or frontier classes from the wider pool when a workload needs them.

03 · Scale

Scale with usage

Burst on demand for peaks, then settle back to normal. Reserve a capacity envelope when throughput must be guaranteed.

04 · Visibility

Full cost visibility

Logs, worker state, latency, errors, and live cost. You always know what's running and exactly what you're saving against the alternatives.

Platform features

Managed serverless toolkit.

Everything below ships with every application. No enterprise-tier gating on the basics.

Elasticity

Scale from zero to thousands

Burst on demand around your real traffic, then settle back to nothing. Reserve capacity when you need guaranteed throughput, region, or priority.

Pay for usage

Per-second runtime billing

You pay while your application runs, not for idle capacity you own. The clearest way to keep inference costs tied to demand.

Fully managed

No infrastructure to run

Scaling, routing, and recovery are handled for you. You deploy; we keep it healthy and serving traffic.

Long-running jobs

Sync and async, handled

Fast calls return immediately; long renders and big batches are queued and executed for you, without timeout gymnastics.

Observability

Logs, state, usage, cost

Application state, latency, errors, and live cost visibility built in. The dashboard you'd build yourself, already there.

Security

Private registries and secrets

Pull private images and inject runtime secrets safely. Your code and weights stay yours.

What you pay for

running
held warm
reserved capacity

You're billed by the second while your application runs or is held warm. Reserved capacity is billable even when idle, because you're holding it.

What you never pay for

scheduling wait
image pulls
our failures

On-demand waits, image pulls, and platform failures are our cost, not yours.
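The split between billed and unbilled time can be sketched as a small calculation over a deployment's timeline. The phase names below are hypothetical labels for the states described above, not identifiers from Runware's API, and the example covers on-demand usage only, since reserved capacity is billed while idle as well.

```python
from dataclasses import dataclass

# Phase names are hypothetical labels for the states described above.
BILLABLE_PHASES = {"running", "held_warm"}
FREE_PHASES = {"scheduling_wait", "image_pull", "platform_failure"}


@dataclass
class Phase:
    name: str
    seconds: float


def billable_seconds(timeline: list[Phase]) -> float:
    """Sum the seconds spent in billable phases; reject unknown phases."""
    total = 0.0
    for phase in timeline:
        if phase.name in BILLABLE_PHASES:
            total += phase.seconds
        elif phase.name not in FREE_PHASES:
            raise ValueError(f"unknown phase: {phase.name}")
    return total


def on_demand_cost(timeline: list[Phase], rate_per_second: float) -> float:
    """On-demand cost in dollars for one timeline at a per-second rate."""
    return billable_seconds(timeline) * rate_per_second


timeline = [
    Phase("scheduling_wait", 12),
    Phase("image_pull", 45),
    Phase("running", 300),
    Phase("held_warm", 60),
]
# 0.000553 is the RTX PRO 6000 per-second rate from the pricing table.
print(f"${on_demand_cost(timeline, 0.000553):.4f}")
```

In this timeline only the 360 seconds spent running or held warm are charged. The 57 seconds of scheduling wait and image pull add nothing to the bill.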

Got a model, want an API? Explore API Gateway: bring your model. We get it production-ready and you consume it as a dedicated API.

FAQs

Is this GPU rental?

No, and that's kind of the point. You never touch a node. You deploy an application; we run, scale, route, and meter it. You pay for runtime, not for babysitting infrastructure.

What can I run?

Code, containers, existing servers, and custom inference applications, plus CPU-heavy app workloads and GPU workloads across image, video, audio, 3D, LLM, and custom apps. If it serves traffic, it fits.

Can I reserve capacity?

Yes, and at scale you'll want to. On-demand covers bursts; reserved envelopes give you guaranteed availability, concurrency, region, and deployment priority, with pricing up to 50% lower depending on duration, availability, and scale.

How is billing calculated?

Per second of runtime against the GPU class you choose, shown as per-hour equivalents. You're billed while your application is running or held warm. On-demand scheduling waits, image pulls, and platform failures are our cost. Reserved capacity is billable even while idle, because you're holding warm capacity.

Which GPUs can I run on?

A cost-efficient default covers most inference. Higher-memory and frontier GPU classes are available from our wider pool, so you can match the hardware to your workload. Reserved clusters are scoped per deployment; talk to us if you need them.

Runware Serverless Beta

Stop paying for idle GPUs.

Deploy an application, pick a profile, pay for runtime. If our math doesn't beat your current bill, don't switch, but do check the math.
