Rate Limits

Understanding platform rate limits and best practices for building resilient integrations with Runware.

How we handle capacity

Runware operates on a shared infrastructure model with finite processing capacity. Rather than enforcing hard rate limits that reject requests arbitrarily, we use dynamic queue-based systems that balance load across all users while prioritizing active workloads.

This approach maximizes platform accessibility while maintaining service quality. You won't hit artificial walls, but you will experience graceful degradation under high load rather than immediate rejection.

Current approach:

  • No hard rate limits - We don't reject requests based on arbitrary thresholds.
  • Shared queues - Requests are processed through model-specific queues with finite capacity.
  • Fair allocation - Processing capacity is distributed fairly across active users.
  • Graceful degradation - Under load, you'll experience increased latency rather than immediate failures.

What to expect

Under normal conditions, requests are processed quickly and reliably. When platform capacity is reached, here's what happens:

1. Queueing - Requests enter a queue and wait for available processing capacity.

2. Increased latency - Generation time increases as queue depth grows. This is normal and expected.

3. Timeout potential - Extremely long queues may cause requests to time out before processing begins.

4. Transient failures - During peak demand, some requests may fail with retryable errors.
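The timeout behavior in steps 3-4 can also be handled client-side with your own deadline, so a stuck queue wait fails fast instead of hanging indefinitely. A minimal sketch: `coro_factory` is a stand-in for your SDK call (e.g. `lambda: runware.image_inference(params)`), and the 120-second default is an assumed value, not a documented platform limit.

```python
import asyncio

async def generate_with_timeout(coro_factory, timeout_s=120.0):
    """Wrap a generation call with a client-side deadline.

    coro_factory is a zero-argument callable returning the awaitable,
    e.g. lambda: runware.image_inference(params). The name and the
    120s default are illustrative, not part of the Runware SDK.
    """
    try:
        return await asyncio.wait_for(coro_factory(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Treat like a 504: the request waited too long in the queue.
        raise TimeoutError(f"generation exceeded {timeout_s}s client deadline")
```

Failing fast lets your retry logic kick in sooner instead of tying up a connection on a saturated queue.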

HTTP status codes

You may encounter these status codes when capacity is constrained:

  • 429 Too Many Requests - Queue capacity exceeded; retry with exponential backoff.
  • 503 Service Unavailable - Temporary capacity constraint; retry recommended.
  • 504 Gateway Timeout - Request exceeded the maximum queue wait time; retry recommended.

All of these are transient and should be handled with retry logic.

Third-party provider limits

We integrate with external AI providers like OpenAI, Black Forest Labs, ByteDance, and others. Their rate limits and capacity constraints affect our service, even when our infrastructure has available capacity.

Models from third-party providers are subject to the provider's own rate limits and capacity constraints. When a provider experiences high demand or enforces their limits, you may see reduced throughput or temporary unavailability for those specific models.

What this means:

  • Provider throttling surfaces as our errors - You'll see 429, 503, or increased latency when upstream providers hit their limits.
  • We can't control their capacity - Some providers enforce strict concurrency or rate limits that cascade to our service.
  • Model-specific constraints - Check individual model documentation for provider-specific limitations.

This is an inherent part of working with third-party AI services, and it's why we recommend implementing robust retry logic.

Best practices

Implement retry logic

Always implement exponential backoff for failed requests. This is critical for resilient integrations.

async function generateWithRetry(params: any, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await runware.imageInference(params);
    } catch (error: any) {
      // Only retry transient capacity errors (429, 503, 504),
      // and rethrow the last failure instead of swallowing it
      if (![429, 503, 504].includes(error.status) || i === maxRetries - 1) {
        throw error;
      }
      const delay = Math.pow(2, i) * 1000; // 1s, 2s, 4s
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  throw new Error('Max retries exceeded'); // reached only if maxRetries <= 0
}

import asyncio

async def generate_with_retry(params, max_retries=3):
    for i in range(max_retries):
        try:
            return await runware.image_inference(params)
        except Exception as error:
            # Only retry transient capacity errors (429, 503, 504),
            # and re-raise the last failure instead of swallowing it
            status = getattr(error, 'status', None)
            if status not in (429, 503, 504) or i == max_retries - 1:
                raise
            delay = 2 ** i  # 1s, 2s, 4s
            await asyncio.sleep(delay)  # never block the event loop with time.sleep
    raise RuntimeError('Max retries exceeded')  # reached only if max_retries <= 0

Manage concurrency

While we don't enforce hard concurrency limits, we recommend the following for optimal performance:

Standard usage - 2-4 concurrent requests provide optimal throughput for most use cases.

High-volume deployments - Contact our team to discuss capacity planning and dedicated infrastructure options.

Burst workloads - Implement request throttling on your end to avoid saturating queues during sudden spikes.

Sending hundreds of concurrent requests may saturate queues and result in timeouts. Implement concurrency limits in your application for better reliability.
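The throttling advice above can be implemented with a simple semaphore wrapper around your generation call. A sketch, assuming an async `generate` callable standing in for the SDK's inference method; the class name and the default cap of 4 (matching the recommendation above) are illustrative:

```python
import asyncio

class ThrottledClient:
    """Cap the number of in-flight requests with a semaphore.

    `generate` stands in for e.g. runware.image_inference; this
    wrapper is a sketch, not part of the Runware SDK.
    """

    def __init__(self, generate, max_concurrent=4):
        self._generate = generate
        self._sem = asyncio.Semaphore(max_concurrent)

    async def generate(self, params):
        # Excess callers wait here instead of saturating the queue.
        async with self._sem:
            return await self._generate(params)
```

With this in place you can `asyncio.gather` hundreds of calls safely: only `max_concurrent` of them are ever in flight at once.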

Monitor response times

Track generation latency as a proxy for platform capacity:

  • Consistent sub-30s responses - Healthy capacity; the system is operating normally.
  • Increasing latency trends - Approaching capacity limits; consider reducing concurrency.
  • Frequent timeouts - Queue saturation; implement backoff or reduce request volume.

Use latency monitoring to dynamically adjust your request patterns and avoid overwhelming the platform.
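One way to act on these signals is a rolling-window latency tracker whose output drives your concurrency setting. A sketch with illustrative thresholds: the 30-second figure mirrors the sub-30s guideline above, and the window size is arbitrary.

```python
from collections import deque

class LatencyMonitor:
    """Rolling-window latency tracker (illustrative, not an SDK feature)."""

    def __init__(self, window=20, slow_threshold_s=30.0):
        self.samples = deque(maxlen=window)  # keeps only the last `window` samples
        self.slow_threshold_s = slow_threshold_s

    def record(self, seconds):
        self.samples.append(seconds)

    def average(self):
        return sum(self.samples) / len(self.samples) if self.samples else 0.0

    def should_back_off(self):
        # Signal a concurrency reduction once the rolling average
        # crosses the slow threshold.
        return self.average() > self.slow_threshold_s
```

Record each generation's wall-clock time after it completes, and reduce your concurrency cap whenever `should_back_off()` returns true.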

Model-specific capacity

Some models have limited infrastructure due to hardware requirements, demand patterns, or provider constraints:

High-demand models - Popular models may experience longer queues during peak hours.

Resource-intensive models - Video generation and large models require significant GPU resources, limiting concurrent capacity. Check individual model documentation for specific constraints.

Third-party provider models - Models from external providers are subject to their capacity constraints and rate limits.

For production deployments requiring guaranteed capacity or SLAs on specific models, contact our sales team to discuss dedicated infrastructure options.

What's coming

We're actively developing improvements to provide more predictable performance:

Tiered concurrency limits - Predictable limits based on usage tier, with clear thresholds and automatic scaling.

Capacity reservations - Guaranteed throughput for enterprise users with dedicated infrastructure.

Real-time capacity metrics - API endpoints showing current queue depth and estimated processing times.

Serverless scaling - Dynamic model loading for improved availability and reduced cold starts.

These improvements will provide enterprise-grade reliability while maintaining platform accessibility for all users.

Summary

Current state:

  • No hard rate limits, but shared infrastructure with finite capacity
  • Graceful degradation through queueing under load
  • Third-party provider limits affect service availability

Your responsibilities:

  • Implement exponential backoff retry logic
  • Manage concurrency (2-4 concurrent requests recommended)
  • Monitor latency and adjust request patterns

Need guarantees?

  • Contact sales for dedicated capacity, SLAs, and custom infrastructure

This approach balances accessibility with reliability. By following these best practices, you'll build integrations that gracefully handle capacity constraints and maintain high uptime.