Streaming AI Responses with Server-Sent Events
Users expect AI responses to appear progressively, token by token, instead of arriving all at once after generation finishes. This progressive rendering makes an application feel faster and more responsive, which dramatically improves the user experience. Server-sent events (SSE) are the standard protocol for implementing this pattern.
This article covers the architecture of SSE-based streaming for AI chat applications, including tool execution during streams, error handling, and production-ready patterns.
Why SSE for AI Streaming
There are three common approaches to real-time communication in web applications: polling, WebSockets, and server-sent events. Each has tradeoffs for AI streaming.
| Protocol | Direction | Connection Overhead | Complexity | AI Streaming Fit |
|---|---|---|---|---|
| Polling | Client-initiated | High (repeated requests) | Low | Poor |
| WebSockets | Bidirectional | Low (persistent) | High | Overkill |
| SSE | Server to client | Low (persistent) | Low | Ideal |
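SSE is ideal for AI streaming because the communication pattern is fundamentally unidirectional: the server sends tokens to the client, and the client sends messages through normal HTTP POST requests. SSE provides the persistent connection needed for streaming over plain HTTP, without a protocol upgrade or a custom message framing layer. Heartbeats and reconnection still matter, but they are lighter than their WebSocket equivalents: a heartbeat is a one-line comment, and the browser's native EventSource reconnects on its own.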
Basic SSE Architecture
The flow for a streaming AI response follows this sequence:
- Client sends user message via HTTP POST
- Server initiates AI model inference with streaming enabled
- Server responds with Content-Type: text/event-stream, keeps the response open, and begins sending events
- Each token (or chunk of tokens) from the model is sent as an SSE event
- Client renders tokens as they arrive
- Server sends a completion event when the response is finished
- Client closes the SSE connection
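The client half of this sequence can be sketched as a small parser plus a render loop. This is a minimal illustration, not a full implementation of the SSE specification: the function names and the `{"text": ...}` payload shape are assumptions made for the example, and the input is any iterable of already-decoded lines from the response body.

```python
import json
from typing import Iterable, Iterator, List, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_type, data) pairs from decoded SSE lines."""
    event, data = "message", []  # "message" is the default SSE event type
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the event; events without data are skipped.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, typically a heartbeat
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # one optional space after the colon
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


def render(lines: Iterable[str]) -> str:
    """Accumulate token text until the done event arrives."""
    parts: List[str] = []
    for event, data in iter_sse_events(lines):
        if event == "token":
            parts.append(json.loads(data)["text"])  # hypothetical payload shape
        elif event == "done":
            break  # the client would close the connection here
    return "".join(parts)
```

A real client would append each token to the UI as it arrives instead of joining the parts at the end.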
Event Format
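SSE events follow a simple text-based format. Each event is a block of "field: value" lines terminated by a blank line: an optional event type (event), an optional ID (id), and one or more data lines carrying the payload.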
The standard event types for AI chat streaming include:
- token: A chunk of generated text
- tool_call: The model is invoking a tool
- tool_result: A tool execution has completed
- thinking: The model's reasoning process (if exposed)
- error: An error occurred during generation
- done: The response is complete
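To make the wire format concrete, here is a small serializer for these event types. It is a sketch under assumptions: the helper name `format_sse` and the JSON payload shape are illustrative choices, not something the SSE format requires.

```python
import json
from typing import Any, Optional


def format_sse(event: str, data: Any, event_id: Optional[str] = None) -> str:
    """Serialize one event into SSE wire format."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event}")
    # json.dumps without indent emits no newlines, so one data line is enough.
    lines.append(f"data: {json.dumps(data)}")
    # The trailing blank line marks the end of the event.
    return "\n".join(lines) + "\n\n"


frame = format_sse("token", {"text": "Hi"}, event_id="7")
# frame is:
# id: 7
# event: token
# data: {"text": "Hi"}
# (blank line)
```

Encoding the payload as JSON keeps newlines inside generated text from being mistaken for event boundaries.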
Connection Management
SSE connections should include retry logic on the client side. The browser's native EventSource API reconnects automatically, but it can only issue GET requests. Clients that send the user message in a POST body therefore need a custom implementation, and that implementation must provide its own retry logic with exponential backoff.
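One common way to compute retry delays is exponential backoff with full jitter, sketched below. The base delay, cap, and function name are illustrative assumptions. The random source is injectable so the behavior can be checked deterministically.

```python
import random
from typing import Callable


def backoff_delay(
    attempt: int,
    base: float = 0.5,   # assumed first-retry ceiling, in seconds
    cap: float = 30.0,   # assumed upper bound on any single delay
    rng: Callable[[], float] = random.random,
) -> float:
    """Return the delay before retry number `attempt` (0-based)."""
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick a random point below the ceiling so that many
    # clients disconnected at the same moment do not retry in lockstep.
    return rng() * ceiling
```

With these defaults the ceiling doubles from 0.5 seconds on each attempt until it reaches the 30 second cap.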
The server should send periodic heartbeats to keep the connection alive through proxies and load balancers that might terminate idle connections. In the SSE protocol a heartbeat is a comment: a line that begins with a colon, which clients ignore.
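A heartbeat can be layered on top of the event stream without touching the producer. The sketch below assumes events reach the response writer through an asyncio.Queue and that None marks the end of the stream. Both are conventions chosen for this example, and the 15 second default is an arbitrary value that should sit well below the proxy idle timeout.

```python
import asyncio
from typing import AsyncIterator, Optional

HEARTBEAT = ": keep-alive\n\n"  # SSE comment line, ignored by clients


async def with_heartbeat(
    queue: "asyncio.Queue[Optional[str]]",
    interval: float = 15.0,
) -> AsyncIterator[str]:
    """Yield frames from the queue, inserting a comment when it is idle."""
    while True:
        try:
            frame = await asyncio.wait_for(queue.get(), timeout=interval)
        except asyncio.TimeoutError:
            # Nothing was produced within the interval: emit a heartbeat.
            yield HEARTBEAT
            continue
        if frame is None:  # assumed end-of-stream sentinel
            return
        yield frame
```

This matters most during tool execution, when the model is silent and the connection would otherwise look idle.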
Streaming with Tool Execution
The most complex part of AI streaming is handling tool calls mid-stream. The model might generate partial text, then decide it needs to call a tool, wait for the result, and then continue generating text. The stream must communicate this entire lifecycle to the client.
The Tool Execution Flow
- Model generates text tokens (streamed to client)
- Model decides to call a tool (tool_call event sent)
- Server executes the tool (client shows loading state)
- Tool returns results (tool_result event sent)
- Model continues generating with tool results in context
- Additional text tokens stream to client
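The lifecycle above can be sketched on the server side as a generator that emits one event per step. This is a simplification: the model is replaced by a scripted list of steps, whereas a real loop would send each tool result back to the model and let it decide what to generate next. The step dictionaries, `run_turn`, and the payload fields are all hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Tuple

Event = Tuple[str, Dict[str, Any]]


def run_turn(
    steps: Iterable[Dict[str, Any]],
    tools: Dict[str, Callable[..., Any]],
) -> Iterator[Event]:
    """Turn scripted model steps into the events the client will see."""
    for step in steps:
        if step["type"] == "text":
            yield "token", {"text": step["text"]}
        elif step["type"] == "tool":
            # Announce the call first so the client can show a loading state.
            yield "tool_call", {"name": step["name"], "args": step["args"]}
            result = tools[step["name"]](**step["args"])
            yield "tool_result", {"name": step["name"], "result": result}
    yield "done", {}
```

Because the tool_call event is emitted before the tool runs, the client learns about the call even when execution takes several seconds.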
Client-Side State Machine
The client needs a state machine to handle the different phases of a streaming response:
- Generating: Text tokens are arriving. Render them progressively.
- Tool calling: The model has initiated a tool call. Show a loading indicator with the tool name and parameters so the user knows what is happening.
- Tool executing: The tool is running. Show progress or a spinner. For long-running tools (database queries on large tables), show elapsed time.
- Resuming: Tool results have been received and the model is generating again. Resume progressive text rendering.
- Complete: The response is finished. Finalize the rendered content and enable user input.
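These phases map naturally onto a transition table. The sketch below is written in Python to show the logic; a browser client would express the same table in its own language. It rests on one assumption worth flagging: the event list earlier in this article has no event that marks the start of tool execution, so the example invents a hypothetical tool_start event and also lets a tool_result arrive without it. Handling of error events is left out.

```python
from enum import Enum


class Phase(Enum):
    GENERATING = "generating"
    TOOL_CALLING = "tool_calling"
    TOOL_EXECUTING = "tool_executing"
    RESUMING = "resuming"
    COMPLETE = "complete"


# (current phase, incoming event type) -> next phase
TRANSITIONS = {
    (Phase.GENERATING, "token"): Phase.GENERATING,
    (Phase.GENERATING, "tool_call"): Phase.TOOL_CALLING,
    (Phase.GENERATING, "done"): Phase.COMPLETE,
    (Phase.TOOL_CALLING, "tool_start"): Phase.TOOL_EXECUTING,  # hypothetical event
    (Phase.TOOL_CALLING, "tool_result"): Phase.RESUMING,       # fast tools
    (Phase.TOOL_EXECUTING, "tool_result"): Phase.RESUMING,
    (Phase.RESUMING, "token"): Phase.GENERATING,
    (Phase.RESUMING, "tool_call"): Phase.TOOL_CALLING,
    (Phase.RESUMING, "done"): Phase.COMPLETE,
}


def next_phase(phase: Phase, event_type: str) -> Phase:
    """Return the phase that follows `phase` when `event_type` arrives."""
    try:
        return TRANSITIONS[(phase, event_type)]
    except KeyError:
        raise ValueError(f"unexpected {event_type!r} while {phase.value}") from None
```

Rejecting unlisted transitions makes protocol bugs, such as a token arriving after done, surface immediately instead of corrupting the rendered response.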
Handling Multiple Tool Calls
A single response might involve multiple sequential or parallel tool calls. The stream should communicate each tool call independently, and the client should render them as distinct steps in the response. This transparency helps users understand what the AI is doing and builds trust in the results.
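To render each call as its own step, the client needs a way to match results to calls, especially when parallel calls finish out of order. The sketch below assumes each tool_call and tool_result event carries a shared identifier, here named `call_id`. That field and the class name are assumptions for the example.

```python
from typing import Any, Dict, List


class ToolSteps:
    """Track tool calls by ID so each renders as a distinct step."""

    def __init__(self) -> None:
        # Dicts preserve insertion order, so steps display in call order.
        self._steps: Dict[str, Dict[str, Any]] = {}

    def on_tool_call(self, call_id: str, name: str) -> None:
        self._steps[call_id] = {"name": name, "status": "running", "result": None}

    def on_tool_result(self, call_id: str, result: Any) -> None:
        step = self._steps[call_id]
        step["status"] = "done"
        step["result"] = result

    def summary(self) -> List[str]:
        return [f"{s['name']}: {s['status']}" for s in self._steps.values()]
```

Steps stay in the order the calls were issued, even if a later call finishes first.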
Progressive Rendering Patterns
How tokens are rendered affects the perceived quality of the streaming experience.
Word-Level Buffering
Rather than rendering each token individually (which can produce jarring partial words), buffer tokens until a complete word boundary is reached. This produces smoother text that reads naturally as it appears.
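A minimal version of this buffer holds back any trailing partial word and releases text only up to the last whitespace character. The class and method names are illustrative, and treating only spaces and newlines as word boundaries is a simplifying assumption.

```python
class WordBuffer:
    """Release streamed text only at word boundaries."""

    def __init__(self) -> None:
        self._pending = ""

    def feed(self, token: str) -> str:
        """Add a token and return the text that is safe to render now."""
        self._pending += token
        cut = max(self._pending.rfind(" "), self._pending.rfind("\n"))
        if cut == -1:
            return ""  # still inside the first word; render nothing yet
        ready = self._pending[: cut + 1]
        self._pending = self._pending[cut + 1:]
        return ready

    def flush(self) -> str:
        """Return whatever is left, for use when the stream ends."""
        ready, self._pending = self._pending, ""
        return ready
```

The client should call flush when the done event arrives, otherwise the final word is never rendered.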
Markdown Rendering
AI responses often include markdown formatting. Progressive markdown rendering is challenging because formatting elements span multiple tokens. For example, a table header might arrive as: "|", " Header", " 1 ", "|", " Header", " 2 ", "|".
The practical approach is to buffer markdown blocks (tables, code blocks, lists) and render them once complete, while streaming inline text (paragraphs, sentences) progressively.
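As one illustration of this split, the sketch below passes ordinary lines through immediately but holds table rows until the table ends. It assumes the text has already been split into lines and that any line starting with a pipe character belongs to a table. Real markdown detection is more involved, and lists and code blocks are not handled here.

```python
from typing import Iterable, Iterator, List, Tuple


def stream_blocks(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield ("inline", line) right away and ("table", block) once complete."""
    table: List[str] = []
    for line in lines:
        if line.lstrip().startswith("|"):
            table.append(line)  # hold the row until the table is finished
            continue
        if table:
            # The first non-table line closes the buffered table.
            yield "table", "\n".join(table)
            table = []
        yield "inline", line
    if table:
        # The stream ended while a table was still open.
        yield "table", "\n".join(table)
```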
Code Block Handling
Code blocks require special handling. Stream the code content progressively, but apply syntax highlighting only when the code block is complete (or at reasonable intervals). This avoids the visual jarring of syntax colors changing as more code arrives.
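Deciding when a code block is complete can be as simple as counting fence lines in the text received so far. This sketch assumes triple-backtick fences that start a line and ignores nested or tilde fences. The function name is hypothetical.

```python
FENCE = "`" * 3  # built this way to avoid a literal fence inside this example


def fence_is_open(text: str) -> bool:
    """Return True while the text ends inside an unfinished code block."""
    count = sum(
        1 for line in text.splitlines() if line.lstrip().startswith(FENCE)
    )
    # An odd number of fence lines means the last block was never closed.
    return count % 2 == 1
```

A renderer could show plain monospaced text while this returns True and run the syntax highlighter once it returns False.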
Error Handling in Streams
Errors during streaming require careful handling because the response is partially rendered when the error occurs.
Model Errors
If the AI model returns an error mid-stream, send an error event with the error details. The client should display the partial response (it might still be useful) along with the error message. Do not discard the partial response.
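The sketch below shows both halves of this behavior under simple assumptions: the model is any iterable of strings that may raise partway through, and events are (type, payload) tuples. The names `stream_tokens` and `collect` are made up for the example.

```python
from typing import Any, Dict, Iterable, Iterator, List, Optional, Tuple

Event = Tuple[str, Dict[str, Any]]


def stream_tokens(tokens: Iterable[str]) -> Iterator[Event]:
    """Server side: convert a model failure into an error event."""
    try:
        for text in tokens:
            yield "token", {"text": text}
    except Exception as exc:
        yield "error", {"message": str(exc)}
        return  # no done event after a failure
    yield "done", {}


def collect(events: Iterable[Event]) -> Tuple[str, Optional[str]]:
    """Client side: keep the partial text alongside any error."""
    parts: List[str] = []
    error: Optional[str] = None
    for event, payload in events:
        if event == "token":
            parts.append(payload["text"])
        elif event == "error":
            error = payload["message"]
    return "".join(parts), error
```

In production the error payload would usually carry a sanitized message and a code, not the raw exception text.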
Tool Execution Errors
If a tool call fails, send the error as a tool_result event with an error flag. The model can then decide how to proceed: retry with different parameters, try a different approach, or explain to the user what happened.
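A small wrapper can guarantee that every tool call produces a tool_result payload, whether it succeeds or fails. The payload fields, including the `is_error` flag name, are assumptions for this sketch.

```python
from typing import Any, Callable, Dict


def execute_tool(
    name: str,
    args: Dict[str, Any],
    tools: Dict[str, Callable[..., Any]],
) -> Dict[str, Any]:
    """Run a tool and return a tool_result payload, flagged on failure."""
    try:
        # The lookup is inside the try so an unknown tool name is
        # reported to the model the same way as a failing tool.
        content = tools[name](**args)
        return {"name": name, "is_error": False, "content": content}
    except Exception as exc:
        return {
            "name": name,
            "is_error": True,
            "content": f"{type(exc).__name__}: {exc}",
        }
```

The same payload goes to the client as a tool_result event and back to the model as context for its next step.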
Connection Drops
If the SSE connection drops, the client should attempt reconnection. The server should support resumption by accepting a Last-Event-ID header and replaying the events sent after that ID, which requires every event to carry an id field. Without resumption support, the client must discard the partial response and retry the entire request.
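Resumption requires the server to remember what it has already sent. A minimal sketch, assuming sequential integer IDs and an in-memory log per response (a real deployment would need shared storage and an expiry policy):

```python
from typing import Any, Dict, List, Optional, Tuple

LoggedEvent = Tuple[int, str, Dict[str, Any]]


class EventLog:
    """Keep sent events so a reconnecting client can catch up."""

    def __init__(self) -> None:
        self._events: List[LoggedEvent] = []

    def append(self, event: str, data: Dict[str, Any]) -> int:
        """Record an event and return the ID to send in its id field."""
        event_id = len(self._events) + 1
        self._events.append((event_id, event, data))
        return event_id

    def replay_after(self, last_event_id: Optional[str]) -> List[LoggedEvent]:
        """Return events newer than the Last-Event-ID header value."""
        last = int(last_event_id) if last_event_id else 0
        return [e for e in self._events if e[0] > last]
```

A client that reconnects with the ID of the last event it processed receives only what it missed. A missing header replays the whole response.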
Production Considerations
Load Balancer Configuration
SSE connections are long-lived, which can cause issues with load balancers configured for short HTTP request/response cycles. Configure your load balancer to:
- Allow long connection timeouts (at least 5 minutes for complex AI responses)
- Disable response buffering (which defeats the purpose of streaming)
- Support HTTP/2 multiplexing for efficient connection usage (browsers cap HTTP/1.1 at roughly six connections per origin, and each open stream holds one)
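Part of this configuration can be expressed by the application itself through response headers. The helper below is a sketch with a hypothetical name. The X-Accel-Buffering header is specific to nginx, so other proxies need their own setting, and timeouts must still be configured on the load balancer.

```python
from typing import Dict


def sse_response_headers(disable_nginx_buffering: bool = True) -> Dict[str, str]:
    """Build response headers for a streaming endpoint."""
    headers = {
        "Content-Type": "text/event-stream",
        # Stop intermediaries from caching or replaying the stream.
        "Cache-Control": "no-cache",
    }
    if disable_nginx_buffering:
        # Asks nginx not to buffer this response when it proxies the app.
        headers["X-Accel-Buffering"] = "no"
    return headers
```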
Rate Limiting
Rate limiting SSE connections requires tracking both the number of concurrent connections per user and the total token output rate. A user with too many concurrent streams can exhaust server resources even if each individual stream is within normal parameters.
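The concurrency half of this can be sketched as a per-user counter checked before a stream is opened. The limit of three streams is an arbitrary assumption. The class is not thread-safe, which suits a single event loop but not multiple workers, and it does not cover the token output rate.

```python
from collections import defaultdict
from typing import DefaultDict


class StreamLimiter:
    """Cap the number of concurrent streams per user."""

    def __init__(self, max_streams_per_user: int = 3) -> None:
        self._max = max_streams_per_user
        self._active: DefaultDict[str, int] = defaultdict(int)

    def try_acquire(self, user_id: str) -> bool:
        """Reserve a slot; return False if the user is at the limit."""
        if self._active[user_id] >= self._max:
            return False
        self._active[user_id] += 1
        return True

    def release(self, user_id: str) -> None:
        """Free a slot when a stream ends, whether or not it succeeded."""
        if self._active[user_id] > 0:
            self._active[user_id] -= 1
```

The release call belongs in a finally block around the streaming handler so that dropped connections do not leak slots.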
Monitoring
Track these metrics for SSE-based AI streaming:
- Time to first token (TTFT): How long the user waits before seeing any response
- Tokens per second: The streaming rate, which should be consistent
- Stream completion rate: Percentage of streams that complete successfully
- Average stream duration: How long responses take end-to-end
- Tool execution latency: How long tool calls add to the stream
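Two of these metrics, time to first token and tokens per second, can be captured with a small per-stream recorder. The sketch assumes the recorder is created when the request arrives and that on_token is called once per token. The clock is injectable so the arithmetic can be verified.

```python
import time
from typing import Callable, Optional


class StreamMetrics:
    """Record time to first token and streaming rate for one response."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._start = clock()  # assumed to be the moment the request arrived
        self._first: Optional[float] = None
        self._last: Optional[float] = None
        self.tokens = 0

    def on_token(self) -> None:
        now = self._clock()
        if self._first is None:
            self._first = now
        self._last = now
        self.tokens += 1

    @property
    def ttft(self) -> Optional[float]:
        """Seconds from request start to the first token."""
        return None if self._first is None else self._first - self._start

    @property
    def tokens_per_second(self) -> Optional[float]:
        """Rate measured between the first and last token."""
        if self._first is None or self._last == self._first:
            return None
        # tokens - 1 intervals separate the first and last token.
        return (self.tokens - 1) / (self._last - self._first)
```

Measuring the rate from the first token keeps model startup time out of the throughput figure, since TTFT already reports it.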
Platforms like Skopx implement SSE streaming across all AI interactions, handling the complexity of tool execution, error recovery, and progressive rendering so that end users experience fast, responsive conversations.
The User Experience Impact
Well-implemented streaming reduces perceived latency by 60-80% compared to waiting for complete responses. A response that takes 8 seconds to fully generate feels nearly instant when the first tokens appear within 200 milliseconds. For enterprise applications where users interact with AI dozens of times per day, this improvement in responsiveness compounds into meaningful productivity gains and higher user satisfaction.
Alexis Kelly
The Skopx engineering and product team