Why streaming chat is an architecture problem
Building a text box that sends a message and shows a response takes an afternoon. Building a streaming chat UI that handles cancellation, partial failures, mobile networks, scroll hijacking, markdown rendering, code blocks, accessibility, and conversation persistence takes weeks.
This guide covers the full architecture with production-ready patterns.
Transport: SSE vs fetch streaming vs WebSockets
┌───────────────────┬──────────────────────────────────────────┐
│ Transport │ When to use │
├───────────────────┼──────────────────────────────────────────┤
│ fetch + Response │ Default choice. Works with POST, sends │
│ body streaming │ structured payloads, supports │
│ │ AbortController natively. │
│ │ │
│ Server-Sent │ When you need automatic reconnection │
│ Events (SSE) │ and browser-managed event dispatch. │
│ │ GET-only (no POST body without wrapper). │
│ │ │
│ WebSockets │ When you need bidirectional streaming │
│ │ (voice, collaborative editing, real-time │
│ │ multiplayer). Overkill for one-way AI │
│ │ output. │
└───────────────────┴──────────────────────────────────────────┘
Recommendation: Use fetch streaming for AI chat. It gives you POST with a request body, native AbortController support, and no extra protocol overhead.
The complete fetch streaming implementation
interface StreamCallbacks {
onToken: (text: string) => void
onToolCall?: (call: ToolCall) => void
onError: (error: Error, partialText: string) => void
onComplete: (fullText: string) => void
}
async function streamChat(
messages: Message[],
signal: AbortSignal,
callbacks: StreamCallbacks
) {
const response = await fetch('/api/chat', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ messages }),
signal,
})
if (!response.ok) {
const body = await response.text()
throw new Error(`${response.status}: ${body}`)
}
const reader = response.body!.getReader()
const decoder = new TextDecoder()
let buffer = ''
let fullText = ''
try {
while (true) {
const { done, value } = await reader.read()
if (done) break
// Decode with stream: true to handle multi-byte chars
buffer += decoder.decode(value, { stream: true })
// Parse SSE-formatted lines from the buffer
const lines = buffer.split('\n')
buffer = lines.pop() || '' // Keep incomplete line
for (const line of lines) {
if (!line.startsWith('data: ')) continue
const payload = line.slice(6)
if (payload === '[DONE]') {
callbacks.onComplete(fullText)
return
}
try {
const event = JSON.parse(payload)
if (event.type === 'token') {
fullText += event.text
callbacks.onToken(event.text)
} else if (event.type === 'tool_call') {
callbacks.onToolCall?.(event)
}
} catch {
// Malformed JSON — skip this line
}
}
}
// Stream ended without [DONE] — still complete
callbacks.onComplete(fullText)
} catch (err) {
if ((err as Error).name === 'AbortError') return
callbacks.onError(err as Error, fullText)
} finally {
reader.releaseLock()
}
}
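Cancellation rides on the AbortSignal the function already accepts. A minimal wiring sketch, assuming a Stop button in your markup (the element id and the sendMessage name are placeholders, not part of the implementation above):

// Hypothetical element id — adjust to your own markup
const stopButton = document.querySelector<HTMLButtonElement>('#stop')!

let controller: AbortController | null = null

async function sendMessage(messages: Message[]) {
  // Abort any in-flight request before starting a new one
  controller?.abort()
  controller = new AbortController()

  await streamChat(messages, controller.signal, {
    onToken: (token) => {
      // Append the token to the active assistant message (app-specific)
    },
    onComplete: () => {
      controller = null
    },
    onError: (error, partialText) => {
      controller = null
      // Surface the error and keep partialText visible (app-specific)
    },
  })
}

// The Stop button aborts the fetch; streamChat swallows the AbortError
stopButton.addEventListener('click', () => controller?.abort())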
Buffering: the difference between smooth and jittery
Fast models produce 30-80 tokens per second. Rendering each token individually causes:
- Layout thrashing from constant DOM updates
- Broken words ("architec" → "architecture" in two frames)
- Dropped frames on mobile
// ❌ Naive: render every token immediately
onToken: (token) => {
messageEl.textContent += token // 60 DOM writes per second
}
// ✅ Buffer to word boundaries, flush on animation frame
class TokenBuffer {
private buffer = ''
private rafId: number | null = null
private onFlush: (text: string) => void
constructor(onFlush: (text: string) => void) {
this.onFlush = onFlush
}
append(token: string) {
this.buffer += token
if (this.rafId === null) {
this.rafId = requestAnimationFrame(() => {
// Flush on word boundary if possible
const lastSpace = this.buffer.lastIndexOf(' ')
if (lastSpace > 0 && this.buffer.length > 20) {
// Flush up to the last complete word
this.onFlush(this.buffer.slice(0, lastSpace + 1))
this.buffer = this.buffer.slice(lastSpace + 1)
} else {
// Short buffer or no word boundary — flush all
this.onFlush(this.buffer)
this.buffer = ''
}
this.rafId = null
})
}
}
// Call when stream ends to flush remaining text
flush() {
if (this.rafId !== null) {
cancelAnimationFrame(this.rafId)
this.rafId = null
}
if (this.buffer) {
this.onFlush(this.buffer)
this.buffer = ''
}
}
}
// Usage
const buffer = new TokenBuffer((text) => {
displayedText += text
renderMessage(displayedText)
})
streamChat(messages, signal, {
onToken: (token) => buffer.append(token),
onComplete: () => buffer.flush(),
onError: () => buffer.flush(),
})
Why word boundaries matter: "The event loop schedu" → "les tasks from the" reads worse than "The event loop " → "schedules tasks from the". Users read words, not characters.
Scroll behavior that does not fight the user
This is the #1 UX complaint about streaming chat UIs. Auto-scroll is expected when the user is at the bottom watching output arrive. Auto-scroll is infuriating when the user has scrolled up to read earlier messages.
class ChatScroller {
private container: HTMLElement
private isUserScrolledUp = false
private lastScrollTop = 0
constructor(container: HTMLElement) {
this.container = container
container.addEventListener('scroll', () => {
const { scrollTop, scrollHeight, clientHeight } = container
const distanceFromBottom = scrollHeight - scrollTop - clientHeight
// If user scrolled up more than 100px, they're reading history
this.isUserScrolledUp = distanceFromBottom > 100
// Detect scroll direction
const scrolledDown = scrollTop > this.lastScrollTop
this.lastScrollTop = scrollTop
// If user manually scrolled to bottom, re-enable auto-scroll
if (scrolledDown && distanceFromBottom < 20) {
this.isUserScrolledUp = false
}
}, { passive: true })
}
// Call this after updating message content
onContentUpdated() {
if (!this.isUserScrolledUp) {
// Smooth scroll to bottom during streaming
this.container.scrollTo({
top: this.container.scrollHeight,
behavior: 'smooth',
})
}
}
get shouldShowJumpToBottom() {
return this.isUserScrolledUp
}
}
The "Jump to latest" button should appear when shouldShowJumpToBottom is true, with an unread count badge if new messages arrived while scrolled up.
Message state machine
A chat message is not just text. It has a lifecycle:
┌──────────┐ ┌───────────┐ ┌───────────┐
│ sending │───▶│ streaming │───▶│ complete │
└──────────┘ └─────┬─────┘ └───────────┘
│
▼
┌─────────────┐ ┌───────────┐
│ interrupted │───▶│ retrying │──▶ streaming
└─────────────┘ └───────────┘
States:
- sending: User hit send, request in flight, no tokens yet
- streaming: Tokens arriving, partial text visible
- complete: Stream finished, full response rendered
- interrupted: Stream failed mid-response, partial text preserved
- retrying: User clicked retry, new request in flight
Model this explicitly in your state:
type MessageStatus =
| { type: 'sending' }
| { type: 'streaming'; partialText: string }
| { type: 'complete'; text: string }
| { type: 'interrupted'; partialText: string; error: string }
| { type: 'retrying'; partialText: string; attempt: number }
interface ChatMessage {
id: string
role: 'user' | 'assistant'
status: MessageStatus
timestamp: number
}
Why this matters: Without explicit states, you end up with a mess of boolean flags (isLoading && !isError && hasPartialText && ...). The state machine makes every UI branch clear.
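One way to see the payoff: a render helper can switch over status.type, and an exhaustiveness check makes the compiler complain if a new state is added but not handled. A sketch (the plain-string rendering is a stand-in for your real UI):

function describeStatus(message: ChatMessage): string {
  switch (message.status.type) {
    case 'sending':
      return 'Waiting for the first token…'
    case 'streaming':
      return message.status.partialText
    case 'complete':
      return message.status.text
    case 'interrupted':
      // Partial text stays visible, with an inline error and a retry affordance
      return `${message.status.partialText}\n[interrupted: ${message.status.error}]`
    case 'retrying':
      return `${message.status.partialText}\n[retrying, attempt ${message.status.attempt}]`
    default: {
      // Exhaustiveness check: fails to compile if a status is left unhandled
      const unhandled: never = message.status
      return unhandled
    }
  }
}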
Error recovery
Streaming errors are not edge cases. They are part of the normal flow. Networks drop. Providers rate-limit. Mobile users background the app.
// Error handling strategy for each failure type
function handleStreamError(
error: Error,
partialText: string,
message: ChatMessage
): MessageStatus {
// User cancelled — not an error
if (error.name === 'AbortError') {
return partialText
? { type: 'interrupted', partialText, error: 'Cancelled' }
      : { type: 'sending' } // Nothing received yet; the caller should remove the message
}
// Rate limited — tell the user to wait
if (error.message.includes('429')) {
return {
type: 'interrupted',
partialText,
error: 'Rate limited. Please wait a moment before retrying.',
}
}
// Server error — offer retry
if (error.message.match(/^5\d{2}/)) {
return {
type: 'interrupted',
partialText,
error: 'Server error. Your partial response has been saved.',
}
}
// Network error
if (!navigator.onLine) {
return {
type: 'interrupted',
partialText,
error: 'You appear to be offline. Retry when connected.',
}
}
// Unknown error
return {
type: 'interrupted',
partialText,
error: 'Something went wrong. Your partial response has been saved.',
}
}
Key UX decisions:
- Always preserve partial text. If the model generated 3 paragraphs before the error, do not throw them away. Show them with an "interrupted" indicator.
- Retry should append, not replace (when possible). Some APIs support partial context to continue from where they left off (see the retry sketch after this list).
- Show the error inline on the message, not as a toast. The user needs to see which message failed.
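A retry sketch under those rules, assuming updateMessage is your own state setter and that the backend can continue from a partial response (both assumptions, not part of the API above):

async function retryMessage(
  message: ChatMessage,
  history: Message[],
  updateMessage: (id: string, status: MessageStatus) => void
) {
  const partialText =
    message.status.type === 'interrupted' ? message.status.partialText : ''
  const attempt =
    message.status.type === 'retrying' ? message.status.attempt + 1 : 1

  updateMessage(message.id, { type: 'retrying', partialText, attempt })

  // Append to the preserved partial text instead of replacing it
  let text = partialText
  const controller = new AbortController()

  await streamChat(history, controller.signal, {
    onToken: (token) => {
      text += token
      updateMessage(message.id, { type: 'streaming', partialText: text })
    },
    onComplete: () =>
      updateMessage(message.id, { type: 'complete', text }),
    onError: (error) =>
      updateMessage(message.id, {
        type: 'interrupted',
        partialText: text,
        error: error.message,
      }),
  })
}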
Markdown and code block rendering during streaming
Rendering markdown while tokens are still arriving is tricky because the markdown is incomplete:
// The stream might deliver tokens that break markdown syntax:
// Frame 1: partial code fence opens but never closes
// Frame 2: fence still incomplete
// Frame 3: closing fence arrives — now it renders correctly
// The user saw 2 frames of broken rendering
Solutions:
- Detect incomplete blocks — if the text contains an odd number of code fences, the last code block is still open. Temporarily append a closing fence before rendering.
- Render code blocks only when complete — show raw text for in-progress blocks, render with syntax highlighting only after the closing fence arrives.
- Use a streaming-aware markdown parser that handles incomplete syntax gracefully.
function renderStreamingMarkdown(partialText: string): string {
let processedText = partialText
// Count triple-backtick fences
const fenceRegex = /```/g
const fenceCount = (processedText.match(fenceRegex) || []).length
if (fenceCount % 2 !== 0) {
// Add a temporary closing fence for valid rendering
processedText += '\n' + '```'
}
// Count unmatched ** pairs
const boldCount = (processedText.match(/\*\*/g) || []).length
if (boldCount % 2 !== 0) {
processedText += '**'
}
return markdownToHtml(processedText)
}
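Wiring this into the earlier TokenBuffer, the flush callback is the natural place to re-render (messageEl is a hypothetical element; sanitize the HTML before inserting it in production):

const messageEl = document.querySelector<HTMLElement>('#active-message')!
let streamedText = ''

const markdownBuffer = new TokenBuffer((chunk) => {
  streamedText += chunk
  // Re-render the whole message with temporary closers applied
  messageEl.innerHTML = renderStreamingMarkdown(streamedText)
})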
Accessibility
Streaming text is inherently challenging for screen readers. Every DOM update can trigger an announcement, creating a wall of noise.
<!-- The message container -->
<div
role="log"
aria-label="Chat messages"
aria-live="polite"
aria-relevant="additions"
>
<!-- Individual messages -->
<div role="article" aria-label="Assistant message">
<!-- Content updates here. aria-live="polite" on the
parent means the screen reader will announce new
additions at a natural pause, not on every token. -->
</div>
</div>
<!-- Status announcements (separate from content) -->
<div
role="status"
aria-live="assertive"
class="sr-only"
>
<!-- Announce: "Generating response...",
"Response complete", "Error: connection lost" -->
</div>
<!-- Stop button must be keyboard-accessible -->
<button
aria-label="Stop generating"
aria-keyshortcuts="Escape"
>
Stop
</button>
Key accessibility decisions:
- Use aria-live="polite" on the message container, not "assertive". Polite waits for a pause; assertive interrupts immediately.
- Announce state changes (streaming started, completed, error) in a separate status region.
- Make the Stop button keyboard-focusable and bind Escape as a shortcut.
- After streaming completes, move focus to the message or input field (a sketch follows below).
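A small sketch of the status region and focus handling, assuming the markup above plus hypothetical element ids; controller is the AbortController from the cancellation sketch earlier:

const statusRegion = document.querySelector<HTMLElement>('#chat-status')!
const chatInput = document.querySelector<HTMLTextAreaElement>('#chat-input')!

function announce(text: string) {
  // Changing the live region's text content triggers the announcement
  statusRegion.textContent = text
}

function onStreamingComplete() {
  announce('Response complete')
  // Return focus to the input so the user can reply immediately
  chatInput.focus()
}

// Escape stops generation, matching aria-keyshortcuts on the Stop button
document.addEventListener('keydown', (event) => {
  if (event.key === 'Escape') controller?.abort()
})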
Conversation persistence
If the user refreshes the page or navigates away, they expect their conversation to still be there.
// Minimal persistence with sessionStorage
function persistConversation(messages: ChatMessage[]) {
// Only persist completed messages — don't save streaming state
const completedMessages = messages.filter(
m => m.status.type === 'complete' || m.role === 'user'
)
sessionStorage.setItem(
'chat-messages',
JSON.stringify(completedMessages)
)
}
function restoreConversation(): ChatMessage[] {
const stored = sessionStorage.getItem('chat-messages')
if (!stored) return []
try {
return JSON.parse(stored)
} catch {
return []
}
}
For production: use IndexedDB for longer conversations, and sync to server for cross-device persistence.
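A minimal IndexedDB sketch, with no wrapper library (the database and store names are arbitrary):

const DB_NAME = 'chat'
const STORE = 'messages'

function openDatabase(): Promise<IDBDatabase> {
  return new Promise((resolve, reject) => {
    const request = indexedDB.open(DB_NAME, 1)
    request.onupgradeneeded = () => {
      request.result.createObjectStore(STORE, { keyPath: 'id' })
    }
    request.onsuccess = () => resolve(request.result)
    request.onerror = () => reject(request.error)
  })
}

async function saveMessage(message: ChatMessage) {
  const db = await openDatabase()
  const tx = db.transaction(STORE, 'readwrite')
  tx.objectStore(STORE).put(message)
  return new Promise<void>((resolve, reject) => {
    tx.oncomplete = () => resolve()
    tx.onerror = () => reject(tx.error)
  })
}

async function loadMessages(): Promise<ChatMessage[]> {
  const db = await openDatabase()
  const request = db.transaction(STORE).objectStore(STORE).getAll()
  return new Promise((resolve, reject) => {
    // getAll returns records in key order, so sort back into chat order
    request.onsuccess = () =>
      resolve(request.result.sort((a, b) => a.timestamp - b.timestamp))
    request.onerror = () => reject(request.error)
  })
}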
Putting it all together
The full architecture looks like this:
┌─────────────────────────────────────────────────────────┐
│ ChatContainer │
│ ├── MessageList (scroll container, ChatScroller) │
│ │ ├── UserMessage │
│ │ ├── AssistantMessage (state machine) │
│ │ │ ├── streaming → TokenBuffer → Markdown render │
│ │ │ ├── interrupted → partial text + retry button │
│ │ │ └── complete → full markdown + copy button │
│ │ └── JumpToBottomButton (conditional) │
│ ├── InputArea │
│ │ ├── Textarea (auto-resize) │
│ │ ├── SendButton / StopButton (toggles on state) │
│ │ └── ModelSelector (optional) │
│ └── StatusRegion (aria-live, screen reader only) │
│ │
│ State: ChatStore │
│ ├── messages: ChatMessage[] │
│ ├── activeRequestId: string | null │
│ ├── abortController: AbortController | null │
│ └── persistConversation() / restoreConversation() │
└─────────────────────────────────────────────────────────┘
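A condensed sketch of that ChatStore, pulling together streamChat, TokenBuffer, handleStreamError, and the persistence helpers from earlier sections. It assumes Message is the { role, content } shape /api/chat expects and that your view layer re-renders when a message's status changes; treat it as one possible arrangement, not the reference implementation:

class ChatStore {
  messages: ChatMessage[] = restoreConversation()
  activeRequestId: string | null = null
  abortController: AbortController | null = null

  async send(text: string) {
    // Cancel any in-flight request before starting a new one
    this.abortController?.abort()
    this.abortController = new AbortController()

    const userMessage: ChatMessage = {
      id: crypto.randomUUID(),
      role: 'user',
      status: { type: 'complete', text },
      timestamp: Date.now(),
    }
    const assistantMessage: ChatMessage = {
      id: crypto.randomUUID(),
      role: 'assistant',
      status: { type: 'sending' },
      timestamp: Date.now(),
    }
    this.messages.push(userMessage, assistantMessage)
    this.activeRequestId = assistantMessage.id

    let streamed = ''
    const buffer = new TokenBuffer((chunk) => {
      streamed += chunk
      assistantMessage.status = { type: 'streaming', partialText: streamed }
    })

    await streamChat(this.history(), this.abortController.signal, {
      onToken: (token) => buffer.append(token),
      onComplete: (fullText) => {
        buffer.flush()
        assistantMessage.status = { type: 'complete', text: fullText }
        persistConversation(this.messages)
      },
      onError: (error, partialText) => {
        buffer.flush()
        assistantMessage.status = handleStreamError(error, partialText, assistantMessage)
      },
    })
  }

  stop() {
    this.abortController?.abort()
  }

  private history(): Message[] {
    // Assumes Message is { role, content }; only completed turns go to the API
    return this.messages
      .filter((m) => m.status.type === 'complete')
      .map((m) => ({
        role: m.role,
        content: m.status.type === 'complete' ? m.status.text : '',
      }))
  }
}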
Where these patterns intersect with practice
The architecture patterns here directly apply to:
- RAG-Powered Smart FAQ — streaming grounded answers with citations
- Real-Time Notification Center — event streaming, cross-tab state, reconnection
The underlying concepts:
- The Event Loop — why buffering and scheduling matter for smooth streaming
- Async control flow — cancellation, retries, and error handling patterns
LLM-friendly summary
A complete frontend architecture guide for streaming AI chat UIs covering SSE vs fetch streaming, buffering, cancellation, error recovery, scroll behavior, and accessibility.