Understanding Streaming LLM APIs: A Deep Dive into Server-Sent Events and HTTP POST Requests

The Engineer

27 Sept 2024

Explore the technical intricacies behind streaming Large Language Model APIs, focusing on server-sent events and HTTP POST requests, and learn how real-time communication powers interactive AI applications.

If you've been working with Large Language Models (LLMs), you might have noticed that many hosted LLM providers offer streaming APIs. These APIs send the response incrementally as the model generates it instead of waiting for the full completion, which matters for chatbots and other interactive applications where users expect to see output immediately. In this article, we’ll explore how these streaming APIs work under the hood and walk through a practical example using OpenAI’s GPT-4o Mini.

The General Pattern

Most of the LLM providers I looked at follow a similar pattern for their HTTP streaming APIs. They respond with a content-type: text/event-stream header, which is the media type used by server-sent events (SSE). The body is streamed as blocks separated by a blank line (\r\n\r\n; the SSE format also permits bare \n line endings, so a robust client should accept \n\n too), and each block contains a data: line with JSON content. Some providers, like Anthropic, also include an event: line to specify the event type.
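To make the block format concrete, here is a minimal Python sketch of a parser for it. It assumes the whole stream body is already available as one string, and the names SSEEvent and iter_sse_events, along with the sample payloads, are made up for illustration. It handles only the data: and event: fields described above, not the full SSE specification.

```python
import json
from typing import Iterator, NamedTuple, Optional


class SSEEvent(NamedTuple):
    event: Optional[str]  # value of the optional "event:" line, if any
    data: str             # value of the "data:" line(s), joined with newlines


def iter_sse_events(stream_text: str) -> Iterator[SSEEvent]:
    """Split a complete text/event-stream body into events."""
    # Normalise line endings so that a blank line is always "\n\n".
    normalized = stream_text.replace("\r\n", "\n").replace("\r", "\n")
    for block in normalized.split("\n\n"):
        event_type = None
        data_lines = []
        for line in block.split("\n"):
            if not line or line.startswith(":"):
                continue  # skip empty lines and SSE comment lines
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # one leading space after the colon is optional
            if field == "event":
                event_type = value
            elif field == "data":
                data_lines.append(value)
        if data_lines:
            yield SSEEvent(event_type, "\n".join(data_lines))


# Illustrative sample only: the event name and payloads are invented.
sample = (
    'event: example\r\ndata: {"type": "example"}\r\n\r\n'
    'data: {"choices": [{"delta": {"content": "Hi"}}]}\r\n\r\n'
)
events = list(iter_sse_events(sample))
for item in events:
    print(item.event, json.loads(item.data))
```

Normalising line endings before splitting keeps the parser indifferent to whether a provider sends \r\n\r\n or \n\n between blocks.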

One key limitation of this approach is that these APIs are called with HTTP POST requests, so they can't be consumed directly with the browser’s EventSource API, which only issues GET requests. Clients instead have to make the request themselves (for example with fetch() in the browser), read the response body incrementally, and split it into events by hand.
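The manual handling looks much the same outside the browser. As a rough sketch using only Python's standard library, the code below builds a POST request and then groups the response into blank-line-delimited blocks as lines arrive. The helper names build_post_request and iter_blocks, the example.com URL, and the token are all hypothetical, and an in-memory byte stream stands in for a live HTTP response so the example runs without network access.

```python
import io
import json
import urllib.request
from typing import Iterable, Iterator


def build_post_request(url: str, payload: dict, headers: dict) -> urllib.request.Request:
    """Build the kind of POST request that EventSource cannot send."""
    body = json.dumps(payload).encode("utf-8")
    all_headers = {"Content-Type": "application/json", **headers}
    return urllib.request.Request(url, data=body, headers=all_headers, method="POST")


def iter_blocks(response: Iterable[bytes]) -> Iterator[str]:
    """Yield each blank-line-terminated block as soon as it is complete."""
    lines = []
    for raw_line in response:  # file-like objects yield one line at a time
        line = raw_line.decode("utf-8").rstrip("\r\n")
        if line:
            lines.append(line)
        elif lines:  # a blank line closes the current block
            yield "\n".join(lines)
            lines = []
    if lines:  # stream ended without a trailing blank line
        yield "\n".join(lines)


# Hypothetical endpoint and token, used only to show the request shape.
request = build_post_request(
    "https://api.example.com/v1/stream",
    {"stream": True},
    {"Authorization": "Bearer test-token"},
)
print(request.get_method())

# Stand-in for a live response; against a real server this would be
#     with urllib.request.urlopen(request) as response: ...
fake_response = io.BytesIO(b"data: one\r\n\r\ndata: two\r\n\r\n")
blocks = list(iter_blocks(fake_response))
print(blocks)
```

Reading line by line means each block can be handed to the caller as soon as its closing blank line arrives, rather than after the whole response has been received.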

OpenAI Example: GPT-4o Mini

Let's dive into a practical example using OpenAI’s GPT-4o Mini model. The following curl command demonstrates how to send a prompt and request a streaming response:

curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Tell me a joke"}],
    "stream": true,
    "stream_options": {
      "include_usage": true
    }
  }' \
  --no-buffer

Breaking Down the Command

URL and Endpoint: https://api.openai.com/v1/chat/completions is the endpoint for OpenAI’s chat completions API.
Headers:
- Content-Type: application/json: Specifies that the request body is in JSON format.
- Authorization: Bearer $OPENAI_API_KEY: Authenticates the request using your API key. The shell expands $OPENAI_API_KEY, so set that environment variable to your API key before running the command.
Request Body (passed with -d, which also makes curl send the request as a POST):
- "model": "gpt-4o-mini": Specifies the model to use.
- "messages": [{"role": "user", "content": "Tell me a joke"}]: The prompt you want the model to respond to.
- "stream": true: Enables streaming mode.
- "stream_options": {"include_usage": true}: Asks the API to append a final chunk to the stream that reports token usage for the whole request.
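To illustrate what a client does with a stream requested this way, here is a hedged Python sketch that stitches the streamed text back together and picks up the usage report. It assumes each input string is the JSON that follows data: in one block, that text arrives in choices[].delta.content, that the usage chunk has an empty choices list, and that the stream ends with a [DONE] sentinel. The function name collect_stream is hypothetical, and the abbreviated sample payloads and token counts are invented.

```python
import json
from typing import Iterable, Optional, Tuple


def collect_stream(data_payloads: Iterable[str]) -> Tuple[str, Optional[dict]]:
    """Join streamed content deltas and return (text, usage)."""
    text_parts = []
    usage = None
    for payload in data_payloads:
        if payload.strip() == "[DONE]":  # assumed end-of-stream sentinel
            break
        chunk = json.loads(payload)
        if chunk.get("usage"):  # null on ordinary chunks, set on the usage chunk
            usage = chunk["usage"]
        for choice in chunk.get("choices", []):
            content = (choice.get("delta") or {}).get("content")
            if content:
                text_parts.append(content)
    return "".join(text_parts), usage


# Invented, abbreviated payloads for illustration only.
sample_payloads = [
    '{"choices":[{"index":0,"delta":{"role":"assistant","content":""}}],"usage":null}',
    '{"choices":[{"index":0,"delta":{"content":"Why did"}}],"usage":null}',
    '{"choices":[{"index":0,"delta":{"content":" the chicken"}}],"usage":null}',
    '{"choices":[],"usage":{"prompt_tokens":11,"completion_tokens":4,"total_tokens":15}}',
    "[DONE]",
]
text, usage = collect_stream(sample_payloads)
print(text)
print(usage)
```

In an interactive application you would normally display each delta as it arrives instead of joining them at the end; collecting them here just keeps the example short.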

Sample Response

The --no-buffer option ensures that curl outputs the stream to the console as it arrives. Here’s a truncated version of the response:

data: {"id":"chatcmpl-A8dyC7f6pKkQ516qqRHK6ep7Z3yG9","object":"chat.completion.chunk","created":1726623632,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_483d39d857","choices":[{"index":0,"delta":{"role":"assistant","content":"","refusal":null},"logprobs":null,"finish_reason":null}],"usage":null}

data: {"id":"chatcmpl-A8dyC7f6pKkQ516qqRHK6ep7Z3yG9","object":"chat.completion.chunk","created":1726623632,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_