Streaming (SSE)

Stream text inference results token-by-token using Server-Sent Events. Learn how to enable streaming and parse the SSE response format.

Overview

Streaming lets you receive text inference responses token-by-token as they are generated, rather than waiting for the entire response to complete. Results are delivered via Server-Sent Events (SSE), a lightweight HTTP-based protocol designed for real-time, server-to-client data delivery.

This is particularly useful for chat interfaces and any application where perceived latency matters. Instead of a multi-second wait followed by a wall of text, your users see the response appear progressively.

Streaming is available for all text inference models and works alongside the existing sync and async delivery methods.

Enabling streaming

Add "deliveryMethod": "stream" to your request. Everything else stays the same:

[
  {
    "taskType": "textInference",
    "taskUUID": "a770f077-f413-47de-9dac-be0b26a35da6",
    "model": "minimax:m2.7@0",
    "deliveryMethod": "stream",
    "messages": [
      { "role": "user", "content": "Hello" }
    ],
    "settings": {
      "maxTokens": 4096,
      "temperature": 1.0
    }
  }
]

The response will be an SSE stream instead of a single JSON object. Each event contains a chunk of the generated text that you can display immediately.

Delivery methods compared

The deliveryMethod parameter controls how results are returned. Streaming is one of three options available for text inference tasks:

  • sync: Waits for the full response and returns it as a single JSON object. This is the default. Best for simple integrations and short responses.
  • stream: Streams tokens as SSE events as they are generated. Best for chat UIs and long-form generation.
  • async: Returns immediately with a task acknowledgment; poll for results using Task Polling. Best for background processing and long-running tasks.

SSE response format

The response is a standard SSE stream. Each event is a line prefixed with data:, followed by a JSON object, and terminated by a blank line. The stream ends with a data: [DONE] sentinel. The server may send : ping comments as keepalives, which should be ignored.

: ping

data: {"taskUUID":"a770f077-f413-47de-9dac-be0b26a35da6","taskType":"textInference","delta":{"text":"Hello"},"finishReason":null}

data: {"taskUUID":"a770f077-f413-47de-9dac-be0b26a35da6","taskType":"textInference","delta":{"text":" there"},"finishReason":null}

data: {"taskUUID":"a770f077-f413-47de-9dac-be0b26a35da6","taskType":"textInference","delta":{},"finishReason":"stop"}

data: [DONE]

Parsing rules

Follow these steps to parse the SSE stream:

  1. Skip blank lines and comment lines (lines starting with :).
  2. Strip the data: prefix from each event line.
  3. Stop when you see data: [DONE], which signals the end of the stream.
  4. Parse each remaining line as JSON.
  5. Read the text from delta.text.
  6. Check for errors by looking for an errors array in the parsed object.
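
Taken together, the rules above can be sketched as a small Python helper. This is an illustrative sketch, not part of any official SDK; the name parse_sse_lines is chosen here for the example:

```python
import json

def parse_sse_lines(lines):
    """Apply the parsing rules to an iterable of raw SSE lines,
    yielding the parsed JSON object for each data event."""
    for line in lines:
        # 1. Skip blank lines and comment lines.
        if not line.strip() or line.startswith(':'):
            continue
        # 3. Stop at the [DONE] sentinel.
        if line.strip() == 'data: [DONE]':
            break
        # 2 + 4. Strip the "data: " prefix and parse the JSON payload.
        event = json.loads(line.removeprefix('data: '))
        # 6. Surface any errors reported mid-stream.
        if 'errors' in event:
            raise RuntimeError(event['errors'][0]['message'])
        yield event

# 5. Read the text from delta.text and accumulate it.
sample = [
    ': ping',
    'data: {"delta":{"text":"Hello"},"finishReason":null}',
    'data: {"delta":{"text":" there"},"finishReason":null}',
    'data: {"delta":{},"finishReason":"stop"}',
    'data: [DONE]',
]
text = ''.join(e.get('delta', {}).get('text', '') for e in parse_sse_lines(sample))
print(text)  # Hello there
```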

Content chunks

During generation, each event contains a small piece of the response text in delta.text. Concatenate these chunks to build the full response:

data: {"taskUUID":"a770f077-f413-47de-9dac-be0b26a35da6","taskType":"textInference","delta":{"text":"The"},"finishReason":null}

data: {"taskUUID":"a770f077-f413-47de-9dac-be0b26a35da6","taskType":"textInference","delta":{"text":" answer"},"finishReason":null}

data: {"taskUUID":"a770f077-f413-47de-9dac-be0b26a35da6","taskType":"textInference","delta":{"text":" is"},"finishReason":null}

data: {"taskUUID":"a770f077-f413-47de-9dac-be0b26a35da6","taskType":"textInference","delta":{"text":" 42."},"finishReason":null}
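
Assuming the events above have already had their data: prefix stripped, the concatenation is a one-liner:

```python
import json

# The four example payloads, prefix already removed.
events = [
    '{"delta":{"text":"The"},"finishReason":null}',
    '{"delta":{"text":" answer"},"finishReason":null}',
    '{"delta":{"text":" is"},"finishReason":null}',
    '{"delta":{"text":" 42."},"finishReason":null}',
]
full = ''.join(json.loads(e)['delta'].get('text', '') for e in events)
print(full)  # The answer is 42.
```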

Reasoning chunks

Some models perform internal reasoning before generating the final response. For these models, reasoning tokens arrive first in delta.reasoningContent, followed by the actual response in delta.text.

// Reasoning chunks - delta.reasoningContent
data: {"taskUUID":"6e879837-4b2a-4c1d-ae5f-8f3c21b07a92","taskType":"textInference","delta":{"reasoningContent":"The user asks: \"What is 2+2? Be brief.\" They want a short answer. It's a simple arithmetic: 4. Provide"}}

data: {"taskUUID":"6e879837-4b2a-4c1d-ae5f-8f3c21b07a92","taskType":"textInference","delta":{"reasoningContent":" short answer."}}

// Actual response - switches to delta.text
data: {"taskUUID":"6e879837-4b2a-4c1d-ae5f-8f3c21b07a92","taskType":"textInference","delta":{"text":"4"},"finishReason":null}

You can display reasoning content in a collapsible section or debug panel, while streaming the final response directly to the user.
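
One way to route the two delta fields to separate handlers, sketched with a hypothetical route_chunk helper and hard-coded events:

```python
import json

def route_chunk(event, on_reasoning, on_text):
    """Send reasoning tokens and answer tokens to different handlers
    (illustrative sketch; names chosen for this example)."""
    delta = event.get('delta', {})
    if 'reasoningContent' in delta:
        on_reasoning(delta['reasoningContent'])
    elif delta.get('text'):
        on_text(delta['text'])

reasoning, answer = [], []
for raw in [
    '{"delta":{"reasoningContent":"Simple arithmetic: 4."}}',
    '{"delta":{"text":"4"},"finishReason":null}',
]:
    route_chunk(json.loads(raw), reasoning.append, answer.append)

print(''.join(reasoning))  # Simple arithmetic: 4.
print(''.join(answer))     # 4
```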

Multiple results

When you set numberResults greater than 1, multiple completions stream on the same connection. Each chunk includes a resultIndex field so you can tell which result it belongs to, since all results share the same taskUUID:

data: {"taskUUID":"a770f077-f413-47de-9dac-be0b26a35da6","taskType":"textInference","resultIndex":0,"delta":{"text":"Paris"},"finishReason":null}

data: {"taskUUID":"a770f077-f413-47de-9dac-be0b26a35da6","taskType":"textInference","resultIndex":1,"delta":{"text":"The capital"},"finishReason":null}

data: {"taskUUID":"a770f077-f413-47de-9dac-be0b26a35da6","taskType":"textInference","resultIndex":0,"delta":{},"finishReason":"stop"}

data: {"taskUUID":"a770f077-f413-47de-9dac-be0b26a35da6","taskType":"textInference","resultIndex":1,"delta":{"text":" is Paris."},"finishReason":null}

data: {"taskUUID":"a770f077-f413-47de-9dac-be0b26a35da6","taskType":"textInference","resultIndex":1,"delta":{},"finishReason":"stop"}

data: [DONE]

Group chunks by resultIndex to reconstruct each result independently. Results may finish at different times.
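
A minimal sketch of grouping by resultIndex, using the events above with the envelope fields trimmed for brevity:

```python
import json
from collections import defaultdict

results = defaultdict(str)   # resultIndex -> accumulated text
finished = {}                # resultIndex -> finishReason

stream = [
    '{"resultIndex":0,"delta":{"text":"Paris"},"finishReason":null}',
    '{"resultIndex":1,"delta":{"text":"The capital"},"finishReason":null}',
    '{"resultIndex":0,"delta":{},"finishReason":"stop"}',
    '{"resultIndex":1,"delta":{"text":" is Paris."},"finishReason":null}',
    '{"resultIndex":1,"delta":{},"finishReason":"stop"}',
]
for raw in stream:
    event = json.loads(raw)
    idx = event.get('resultIndex', 0)
    results[idx] += event.get('delta', {}).get('text', '')
    if event.get('finishReason'):
        finished[idx] = event['finishReason']

print(dict(results))  # {0: 'Paris', 1: 'The capital is Paris.'}
```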

Final chunk and finish reason

The last content-bearing event includes a finishReason value that tells you why the model stopped generating:

  • stop: The model completed its response naturally.
  • length: The response hit the maxTokens limit.
  • content_filter: Content was filtered by the safety system.
  • tool_calls: The model is requesting a tool call.
  • tool_use: The model is requesting a tool use.
  • unknown: The model stopped for an unrecognized reason.

data: {"taskUUID":"6e879837-4b2a-4c1d-ae5f-8f3c21b07a92","taskType":"textInference","delta":{},"finishReason":"stop"}

data: [DONE]
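
How you act on finishReason is application-specific; a hypothetical dispatch (the handle_finish name and return strings are invented for this sketch) might look like:

```python
def handle_finish(reason):
    """Map a finishReason value to an application-level outcome (sketch)."""
    if reason == 'stop':
        return 'complete'
    if reason == 'length':
        return 'truncated: consider raising maxTokens'
    if reason in ('tool_calls', 'tool_use'):
        return 'tool requested'
    if reason == 'content_filter':
        return 'filtered'
    return 'stopped: ' + (reason or 'unknown')

print(handle_finish('length'))  # truncated: consider raising maxTokens
```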

Cost and usage

Cost and token usage are reported in the final chunk of the stream, but only when explicitly requested:

  • Set includeCost: true to receive the cost field with the total price of the request in USD. Useful for tracking spend and billing.
  • Set includeUsage: true to receive the usage object with detailed token counts and processing metadata. Useful for monitoring context window usage and optimizing prompts.

Request with cost and usage enabled:
[
  {
    "taskType": "textInference",
    "taskUUID": "a770f077-f413-47de-9dac-be0b26a35da6",
    "model": "minimax:m2.7@0",
    "deliveryMethod": "stream",
    "messages": [
      { "role": "user", "content": "What is 2+2? Be brief." }
    ],
    "settings": {
      "maxTokens": 4096,
      "temperature": 1.0
    },
    "includeCost": true,
    "includeUsage": true
  }
]

The final chunk before [DONE] will include both fields:

data: {"taskUUID":"6e879837-4b2a-4c1d-ae5f-8f3c21b07a92","taskType":"textInference","delta":{},"finishReason":"stop","usage":{"promptTokens":51,"completionTokens":38,"totalTokens":89},"cost":0.000061}

data: [DONE]
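
Reading both fields defensively from the final chunk (payload values taken from the example above):

```python
import json

final = ('{"taskUUID":"6e879837-4b2a-4c1d-ae5f-8f3c21b07a92","taskType":"textInference",'
         '"delta":{},"finishReason":"stop",'
         '"usage":{"promptTokens":51,"completionTokens":38,"totalTokens":89},"cost":0.000061}')

event = json.loads(final)
# Both fields are absent unless includeUsage/includeCost were set, so use .get().
usage = event.get('usage', {})
cost = event.get('cost')
print(f"{usage.get('totalTokens', 0)} tokens, ${cost:.6f}")  # 89 tokens, $0.000061
```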

Error handling

If an error occurs during streaming, the event will contain an errors array instead of a delta object.

{
  "errors": [
    {
      "code": "timeoutProvider",
      "message": "The provider timed out while generating the response.",
      "taskType": "textInference",
      "taskUUID": "a770f077-f413-47de-9dac-be0b26a35da6"
    }
  ]
}

Check for the presence of errors in your parsing logic and handle them accordingly. Error fields follow the same structure as standard API errors.

Code examples

cURL:

# The -N flag disables curl's output buffering so chunks print as they arrive.

curl -N -X POST https://api.runware.ai/v1 \
  -H "Authorization: Bearer $RUNWARE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '[
    {
      "taskType": "textInference",
      "taskUUID": "550e8400-e29b-41d4-a716-446655440000",
      "model": "minimax:m2.7@0",
      "deliveryMethod": "stream",
      "messages": [{"role": "user", "content": "Tell me a joke"}],
      "settings": {"maxTokens": 512, "temperature": 1.0},
      "includeCost": true
    }
  ]'

JavaScript:

const response = await fetch('https://api.runware.ai/v1', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer ' + RUNWARE_API_KEY,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify([{
    taskType: 'textInference',
    taskUUID: crypto.randomUUID(),
    model: 'minimax:m2.7@0',
    deliveryMethod: 'stream',
    messages: [{ role: 'user', content: 'Tell me a joke' }],
    settings: { maxTokens: 512, temperature: 1.0 },
  }]),
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';

// A bare return is invalid outside a function, so use a labeled break
// to exit both loops when the stream ends or errors.
readLoop: while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop();

  for (const line of lines) {
    if (!line.trim() || line.startsWith(':')) continue;
    if (line.trim() === 'data: [DONE]') break readLoop;

    const json = JSON.parse(line.replace('data: ', ''));

    if (json.errors) {
      console.error(json.errors[0].message);
      break readLoop;
    }

    const text = json.delta?.text;
    if (text) process.stdout.write(text);
  }
}

Python:

import json
import uuid
import httpx

# httpx.post() buffers the entire body before returning; httpx.stream()
# yields lines as they arrive, which is what streaming requires.
with httpx.stream(
    'POST',
    'https://api.runware.ai/v1',
    headers={
        'Authorization': f'Bearer {RUNWARE_API_KEY}',
        'Content-Type': 'application/json',
    },
    json=[{
        'taskType': 'textInference',
        'taskUUID': str(uuid.uuid4()),
        'model': 'minimax:m2.7@0',
        'deliveryMethod': 'stream',
        'messages': [{'role': 'user', 'content': 'Tell me a joke'}],
        'settings': {'maxTokens': 512, 'temperature': 1.0},
    }],
    timeout=None,
) as response:
    for line in response.iter_lines():
        if not line or line.startswith(':'):
            continue
        if line == 'data: [DONE]':
            break

        data = json.loads(line.removeprefix('data: '))

        if 'errors' in data:
            raise Exception(data['errors'][0]['message'])

        text = data.get('delta', {}).get('text', '')
        if text:
            print(text, end='', flush=True)

Best practices

  • Buffer by line, not by byte. Network chunks may split a JSON event across multiple reads. Accumulate data in a buffer and process complete lines only.
  • Handle [DONE] explicitly. Always check for the data: [DONE] sentinel before attempting to parse JSON. Treating it as JSON will cause a parse error.
  • Separate reasoning from content. If you're working with reasoning models, track whether the stream is currently delivering delta.reasoningContent or delta.text and route them accordingly.
  • Implement timeouts. Set a reasonable timeout for the overall stream connection. If no events arrive within your timeout window, close the connection and retry.
  • Use the Fetch API for browser clients. The browser's native EventSource API only supports GET requests. Since text inference uses POST, use the Fetch API with ReadableStream instead.