
Explore the technical intricacies behind streaming Large Language Model APIs, focusing on server-sent events and HTTP POST requests, and learn how real-time communication powers interactive AI applications.
If you've been diving into the world of Large Language Models (LLMs), you might have noticed that many hosted LLM providers offer streaming APIs. These APIs allow for real-time data transmission, which is crucial for applications like chatbots or interactive systems where immediate responses are essential. In this article, we’ll explore how these streaming APIs work under the hood and provide a practical example using OpenAI’s GPT-4o Mini.
Most of the LLM providers I looked at follow a similar pattern in their HTTP streaming APIs. They return data with a Content-Type: text/event-stream header, which matches the server-sent events (SSE) mechanism. The data is streamed in blocks separated by a blank line (\r\n\r\n in these responses, although the SSE format also accepts bare \n or \r line endings), and each block contains a data: line with JSON content. Some providers, like Anthropic, also include an event: line to specify the event type.
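To make the block format concrete, here is a minimal Python sketch that splits an SSE payload into events. The function name parse_sse_blocks and the sample payload (including its event type and JSON fields) are invented for illustration and are not real provider output. The sketch also assumes every data: line carries JSON, which does not hold for non-JSON sentinel values.

```python
import json

def parse_sse_blocks(raw: str):
    """Split an SSE payload into events with an optional type and parsed JSON data."""
    # Normalise line endings first, since SSE permits \r\n, \n, or \r.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    events = []
    for block in text.split("\n\n"):
        event_type = None
        data_lines = []
        for line in block.split("\n"):
            if line.startswith("event:"):
                event_type = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data_lines.append(line[len("data:"):].strip())
        if not data_lines:
            continue  # skip empty blocks, such as the one after the final separator
        events.append({"event": event_type, "data": json.loads("\n".join(data_lines))})
    return events

# Hypothetical payload: one block with an event: line, one without.
sample = (
    'event: content_block_delta\r\n'
    'data: {"text": "Hello"}\r\n\r\n'
    'data: {"text": " world"}\r\n\r\n'
)
events = parse_sse_blocks(sample)
```

Blocks without an event: line come back with "event" set to None, which covers providers that send only data: lines.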
One key limitation of this approach is that these APIs use HTTP POST requests, so they can't be consumed directly with the browser’s EventSource API, which only issues GET requests. Instead, the client has to send the POST itself (for example with fetch), read the response body incrementally, and split it into events by hand.
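The manual handling mostly comes down to buffering: network chunks do not line up with event boundaries, so a client has to hold on to partial data until a separator arrives. The sketch below shows that idea in Python rather than browser JavaScript. The class name SSEBuffer is hypothetical, and it assumes the \r\n\r\n separator described earlier.

```python
class SSEBuffer:
    """Accumulates raw bytes and returns complete SSE blocks as they become available."""

    def __init__(self, separator: bytes = b"\r\n\r\n"):
        self.separator = separator  # assumed block separator
        self.pending = b""

    def feed(self, chunk: bytes):
        """Add a chunk and return the complete blocks (decoded) that it finished."""
        self.pending += chunk
        # Everything before the last separator is complete; the tail stays buffered.
        *complete, self.pending = self.pending.split(self.separator)
        return [block.decode("utf-8") for block in complete if block]

buffer = SSEBuffer()
# The second event is split across two chunks, as can happen on a real connection.
first = buffer.feed(b'data: {"n": 1}\r\n\r\ndata: {"n"')
second = buffer.feed(b': 2}\r\n\r\n')
```

Decoding only complete blocks also avoids errors when a multi-byte UTF-8 character is split across two chunks.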
Let's dive into a practical example using OpenAI’s GPT-4o Mini model. The following curl command demonstrates how to send a prompt and request a streaming response:
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Tell me a joke"}],
    "stream": true,
    "stream_options": {
      "include_usage": true
    }
  }' \
  --no-buffer

https://api.openai.com/v1/chat/completions: The endpoint for OpenAI’s chat completions API.
Content-Type: application/json: Specifies that the request body is in JSON format.
Authorization: Bearer $OPENAI_API_KEY: Authenticates the request using your API key. Replace $OPENAI_API_KEY with your actual API key.
"model": "gpt-4o-mini": Specifies the model to use.
"messages": [{"role": "user", "content": "Tell me a joke"}]: The prompt you want the model to respond to.
"stream": true: Enables streaming mode.
"stream_options": {"include_usage": true}: Requests that the final message in the stream include details of token usage.
--no-buffer: Ensures that curl outputs the stream to the console as it arrives.

Here’s a truncated version of the response:
data: {"id":"chatcmpl-A8dyC7f6pKkQ516qqRHK6ep7Z3yG9","object":"chat.completion.chunk","created":1726623632,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_483d39d857","choices":[{"index":0,"delta":{"role":"assistant","content":"","refusal":null},"logprobs":null,"finish_reason":null}],"usage":null}
data: {"id":"chatcmpl-A8dyC7f6pKkQ516qqRHK6ep7Z3yG9","object":"chat.completion.chunk","created":1726623632,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_
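Each data: line in a response like this carries a small fragment of the reply in choices[0].delta.content, so a client rebuilds the full message by concatenating those fragments. The Python sketch below shows one way to do that. The function name collect_stream and the sample lines are invented and heavily abbreviated, not real API output. It also assumes, as I understand OpenAI's documented behaviour, that the stream ends with a data: [DONE] line and that the usage chunk requested by include_usage arrives with an empty choices list.

```python
import json

def collect_stream(data_lines):
    """Join streamed text fragments and return (text, usage)."""
    parts = []
    usage = None
    for line in data_lines:
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed end-of-stream sentinel, not JSON
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
        if chunk.get("usage"):  # null on ordinary chunks, populated near the end
            usage = chunk["usage"]
    return "".join(parts), usage

# Invented, abbreviated chunks that mirror the shape of the response shown earlier.
lines = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}],"usage":null}',
    'data: {"choices":[{"index":0,"delta":{"content":"Why did"},"finish_reason":null}],"usage":null}',
    'data: {"choices":[{"index":0,"delta":{"content":" the chicken"},"finish_reason":null}],"usage":null}',
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":null}',
    'data: {"choices":[],"usage":{"prompt_tokens":11,"completion_tokens":3,"total_tokens":14}}',
    'data: [DONE]',
]
text, usage = collect_stream(lines)
```

An interactive application would typically print each fragment as it arrives instead of collecting them, which is what makes the reply appear to be typed out in real time.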
Original Sources
↗ https://til.simonwillison.net/llms/streaming-llm-apis?utm_source=tldrai
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
More from The Engineer →This Week's Edition
27 September 2024