Building a Streaming AI Chat UI — The Complete Frontend Architecture Guide

A production-minded guide to streaming AI chat interfaces covering transport, buffering, cancellation, error recovery, and trust cues.

Tags: streaming AI chat UI · server sent events react · AI chat interface · token streaming UX
[Figure: Streaming AI Chat Architecture — the user sends a prompt via POST; the API server proxies to the LLM and streams tokens back (SSE-style), cancellable via AbortController. On the client, tokens pass through a token buffer (word boundaries), an rAF flush (batched DOM writes), markdown rendering, and scroll logic that auto-scrolls only if the user is at the bottom.]

Why streaming chat is an architecture problem

Building a text box that sends a message and shows a response takes an afternoon. Building a streaming chat UI that handles cancellation, partial failures, mobile networks, scroll hijacking, markdown rendering, code blocks, accessibility, and conversation persistence takes weeks.

This guide covers the full architecture with production-ready patterns.

Transport: SSE vs fetch streaming vs WebSockets

┌───────────────────┬──────────────────────────────────────────┐
│ Transport         │ When to use                              │
├───────────────────┼──────────────────────────────────────────┤
│ fetch + Response  │ Default choice. Works with POST, sends   │
│ body streaming    │ structured payloads, supports            │
│                   │ AbortController natively.                │
│                   │                                          │
│ Server-Sent       │ When you need automatic reconnection     │
│ Events (SSE)      │ and browser-managed event dispatch.      │
│                   │ GET-only (no POST body without wrapper). │
│                   │                                          │
│ WebSockets        │ When you need bidirectional streaming    │
│                   │ (voice, collaborative editing, real-time │
│                   │ multiplayer). Overkill for one-way AI    │
│                   │ output.                                  │
└───────────────────┴──────────────────────────────────────────┘

Recommendation: Use fetch streaming for AI chat. It gives you POST with a request body, native AbortController support, and no extra protocol overhead.

The complete fetch streaming implementation

// Minimal message and tool-call shapes (assumed here; adjust to your API)
interface Message {
  role: 'user' | 'assistant' | 'system'
  content: string
}

interface ToolCall {
  type: 'tool_call'
  name: string
  arguments: Record<string, unknown>
}

interface StreamCallbacks {
  onToken: (text: string) => void
  onToolCall?: (call: ToolCall) => void
  onError: (error: Error, partialText: string) => void
  onComplete: (fullText: string) => void
}

async function streamChat(
  messages: Message[],
  signal: AbortSignal,
  callbacks: StreamCallbacks
) {
  const response = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages }),
    signal,
  })

  if (!response.ok) {
    const body = await response.text()
    throw new Error(`${response.status}: ${body}`)
  }

  const reader = response.body!.getReader()
  const decoder = new TextDecoder()
  let buffer = ''
  let fullText = ''

  try {
    while (true) {
      const { done, value } = await reader.read()
      if (done) break

      // Decode with stream: true to handle multi-byte chars
      buffer += decoder.decode(value, { stream: true })

      // Parse SSE-formatted lines from the buffer
      const lines = buffer.split('\n')
      buffer = lines.pop() || ''  // Keep incomplete line

      for (const line of lines) {
        if (!line.startsWith('data: ')) continue
        const payload = line.slice(6)
        if (payload === '[DONE]') {
          callbacks.onComplete(fullText)
          return
        }

        try {
          const event = JSON.parse(payload)
          if (event.type === 'token') {
            fullText += event.text
            callbacks.onToken(event.text)
          } else if (event.type === 'tool_call') {
            callbacks.onToolCall?.(event)
          }
        } catch {
          // Malformed JSON — skip this line
        }
      }
    }

    // Stream ended without [DONE] — still complete
    callbacks.onComplete(fullText)
  } catch (err) {
    // Forward AbortError too, so the caller can keep the partial text
    // (handleStreamError in the error recovery section maps it to a status)
    callbacks.onError(err as Error, fullText)
  } finally {
    reader.releaseLock()
  }
}
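
Cancellation wiring is the other half: one AbortController per request, aborted by the Stop button. A minimal caller sketch, assuming hypothetical UI callbacks and a #stop button:

// One controller per request; Stop aborts the in-flight stream
let controller: AbortController | null = null

async function send(messages: Message[], ui: {
  appendToken(t: string): void
  showInterrupted(partial: string, err: Error): void
  showComplete(full: string): void
}) {
  controller?.abort()  // cancel any previous in-flight request
  controller = new AbortController()

  await streamChat(messages, controller.signal, {
    onToken: ui.appendToken,
    onError: (err, partial) => ui.showInterrupted(partial, err),
    onComplete: ui.showComplete,
  })
}

// Hypothetical Stop button
document.querySelector<HTMLButtonElement>('#stop')
  ?.addEventListener('click', () => controller?.abort())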

Buffering: the difference between smooth and jittery

Fast models produce 30-80 tokens per second. Rendering each token individually causes:

  • Layout thrashing from constant DOM updates
  • Broken words ("architec" → "architecture" in two frames)
  • Dropped frames on mobile

// ❌ Naive: render every token immediately
onToken: (token) => {
  messageEl.textContent += token  // one DOM write per token, up to 80/s
}

// ✅ Buffer to word boundaries, flush on animation frame
class TokenBuffer {
  private buffer = ''
  private rafId: number | null = null
  private onFlush: (text: string) => void

  constructor(onFlush: (text: string) => void) {
    this.onFlush = onFlush
  }

  append(token: string) {
    this.buffer += token

    if (this.rafId === null) {
      this.rafId = requestAnimationFrame(() => {
        // Flush on word boundary if possible
        const lastSpace = this.buffer.lastIndexOf(' ')
        if (lastSpace > 0 && this.buffer.length > 20) {
          // Flush up to the last complete word
          this.onFlush(this.buffer.slice(0, lastSpace + 1))
          this.buffer = this.buffer.slice(lastSpace + 1)
        } else {
          // Short buffer or no word boundary — flush all
          this.onFlush(this.buffer)
          this.buffer = ''
        }
        this.rafId = null
      })
    }
  }

  // Call when stream ends to flush remaining text
  flush() {
    if (this.rafId !== null) {
      cancelAnimationFrame(this.rafId)
      this.rafId = null
    }
    if (this.buffer) {
      this.onFlush(this.buffer)
      this.buffer = ''
    }
  }
}

// Usage (renderMessage stands in for whatever updates your message DOM)
let displayedText = ''
const buffer = new TokenBuffer((text) => {
  displayedText += text
  renderMessage(displayedText)
})

streamChat(messages, signal, {
  onToken: (token) => buffer.append(token),
  onComplete: () => buffer.flush(),
  onError: () => buffer.flush(),
})

Why word boundaries matter: "The event loop schedu" → "les tasks from the" reads worse than "The event loop " → "schedules tasks from the". Users read words, not characters.

Scroll behavior that does not fight the user

This is the #1 UX complaint about streaming chat UIs. Auto-scroll is expected when the user is at the bottom watching output arrive. Auto-scroll is infuriating when the user has scrolled up to read earlier messages.

class ChatScroller {
  private container: HTMLElement
  private isUserScrolledUp = false
  private lastScrollTop = 0

  constructor(container: HTMLElement) {
    this.container = container

    container.addEventListener('scroll', () => {
      const { scrollTop, scrollHeight, clientHeight } = container
      const distanceFromBottom = scrollHeight - scrollTop - clientHeight

      // Detect scroll direction
      const scrolledUp = scrollTop < this.lastScrollTop
      this.lastScrollTop = scrollTop

      // If the user scrolled up past 100px, they're reading history.
      // Checking direction means the smooth auto-scroll below never
      // trips this flag, since programmatic scrolling moves downward.
      if (scrolledUp && distanceFromBottom > 100) {
        this.isUserScrolledUp = true
      }

      // Once the user returns to the bottom, re-enable auto-scroll
      if (distanceFromBottom < 20) {
        this.isUserScrolledUp = false
      }
    }, { passive: true })
  }

  // Call this after updating message content
  onContentUpdated() {
    if (!this.isUserScrolledUp) {
      // Smooth scroll to bottom during streaming
      this.container.scrollTo({
        top: this.container.scrollHeight,
        behavior: 'smooth',
      })
    }
  }

  get shouldShowJumpToBottom() {
    return this.isUserScrolledUp
  }
}
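
To connect the pieces, a sketch extending the earlier TokenBuffer usage (containerEl is a placeholder for your scroll container):

// Re-run scroll logic after every buffered flush, not every token
const containerEl = document.querySelector<HTMLElement>('#messages')!
const scroller = new ChatScroller(containerEl)

const buffer = new TokenBuffer((text) => {
  displayedText += text
  renderMessage(displayedText)
  scroller.onContentUpdated()  // auto-scrolls only if at the bottom
})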

The "Jump to latest" button should appear when shouldShowJumpToBottom is true, with an unread count badge if new messages arrived while scrolled up.

Message state machine

A chat message is not just text. It has a lifecycle:

┌──────────┐    ┌───────────┐    ┌───────────┐
│ sending  │───▶│ streaming │───▶│ complete  │
└──────────┘    └─────┬─────┘    └───────────┘
                      │
                      ▼
               ┌─────────────┐    ┌───────────┐
               │ interrupted │───▶│ retrying  │──▶ streaming
               └─────────────┘    └───────────┘

States:
- sending:      User hit send, request in flight, no tokens yet
- streaming:    Tokens arriving, partial text visible
- complete:     Stream finished, full response rendered
- interrupted:  Stream failed mid-response, partial text preserved
- retrying:     User clicked retry, new request in flight

Model this explicitly in your state:

type MessageStatus =
  | { type: 'sending' }
  | { type: 'streaming'; partialText: string }
  | { type: 'complete'; text: string }
  | { type: 'interrupted'; partialText: string; error: string }
  | { type: 'retrying'; partialText: string; attempt: number }

interface ChatMessage {
  id: string
  role: 'user' | 'assistant'
  status: MessageStatus
  timestamp: number
}

Why this matters: Without explicit states, you end up with a mess of boolean flags (isLoading && !isError && hasPartialText && ...). The state machine makes every UI branch clear.
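
For example, the status indicator collapses into one exhaustive switch (a sketch; the copy is illustrative):

// Each UI branch maps one-to-one to a state, with no flag juggling
function statusLabel(status: MessageStatus): string {
  switch (status.type) {
    case 'sending':     return 'Thinking...'
    case 'streaming':   return 'Generating...'
    case 'complete':    return ''
    case 'interrupted': return `Interrupted: ${status.error}`
    case 'retrying':    return `Retrying (attempt ${status.attempt})...`
  }
}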

Error recovery

Streaming errors are not edge cases. They are part of the normal flow. Networks drop. Providers rate-limit. Mobile users background the app.

// Map each failure type to a message status (null means: remove the message)
function handleStreamError(
  error: Error,
  partialText: string
): MessageStatus | null {
  // User cancelled, which is not an error
  if (error.name === 'AbortError') {
    return partialText
      ? { type: 'interrupted', partialText, error: 'Cancelled' }
      : null  // Nothing received yet, so the caller removes the message
  }

  // Rate limited — tell the user to wait
  if (error.message.includes('429')) {
    return {
      type: 'interrupted',
      partialText,
      error: 'Rate limited. Please wait a moment before retrying.',
    }
  }

  // Server error — offer retry
  if (error.message.match(/^5\d{2}/)) {
    return {
      type: 'interrupted',
      partialText,
      error: 'Server error. Your partial response has been saved.',
    }
  }

  // Network error
  if (!navigator.onLine) {
    return {
      type: 'interrupted',
      partialText,
      error: 'You appear to be offline. Retry when connected.',
    }
  }

  // Unknown error
  return {
    type: 'interrupted',
    partialText,
    error: 'Something went wrong. Your partial response has been saved.',
  }
}

Key UX decisions:

  • Always preserve partial text. If the model generated 3 paragraphs before the error, do not throw them away. Show them with an "interrupted" indicator.
  • Retry should append, not replace (when possible). Some APIs can take the partial text as context and continue where they left off; most regenerate from scratch, as in the sketch after this list.
  • Show the error inline on the message, not as a toast. The user needs to see which message failed.
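
A sketch of the retry transition under these rules. This version regenerates from scratch, since append-style continuation depends on the API:

// interrupted → retrying → streaming, partial text stays visible
async function retryMessage(
  msg: ChatMessage,
  history: Message[],
  signal: AbortSignal,
  attempt = 1
) {
  if (msg.status.type !== 'interrupted') return
  msg.status = { type: 'retrying', partialText: msg.status.partialText, attempt }

  let text = ''
  await streamChat(history, signal, {
    onToken: (t) => {
      text += t
      msg.status = { type: 'streaming', partialText: text }
    },
    onComplete: (full) => { msg.status = { type: 'complete', text: full } },
    onError: (err, partial) => {
      msg.status = { type: 'interrupted', partialText: partial, error: err.message }
    },
  })
}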

Markdown and code block rendering during streaming

Rendering markdown while tokens are still arriving is tricky because the markdown is incomplete:

// The stream might deliver tokens that break markdown syntax:
// Frame 1: partial code fence opens but never closes
// Frame 2: fence still incomplete
// Frame 3: closing fence arrives — now it renders correctly
// The user saw 2 frames of broken rendering

Solutions:

  1. Detect incomplete blocks — if the text contains an odd number of code fences, the last code block is still open. Temporarily append a closing fence before rendering.

  2. Render code blocks only when complete — show raw text for in-progress blocks, render with syntax highlighting only after the closing fence arrives.

  3. Use a streaming-aware markdown parser that handles incomplete syntax gracefully.

// markdownToHtml stands in for your markdown renderer
// (e.g. marked or markdown-it)
function renderStreamingMarkdown(partialText: string): string {
  let processedText = partialText

  // Count triple-backtick fences
  const fenceCount = (processedText.match(/```/g) || []).length
  if (fenceCount % 2 !== 0) {
    // Odd count: a code block is still open. Append a temporary
    // closing fence so the partial text renders as valid markdown.
    processedText += '\n```'
  }

  // Balance unclosed bold markers the same way
  const boldCount = (processedText.match(/\*\*/g) || []).length
  if (boldCount % 2 !== 0) {
    processedText += '**'
  }

  return markdownToHtml(processedText)
}

Accessibility

Streaming text is inherently challenging for screen readers. Every DOM update can trigger an announcement, creating a wall of noise.

<!-- The message container -->
<div
  role="log"
  aria-label="Chat messages"
  aria-live="polite"
  aria-relevant="additions"
>
  <!-- Individual messages -->
  <div role="article" aria-label="Assistant message">
    <!-- Content updates here. aria-live="polite" on the
         parent means the screen reader will announce new
         additions at a natural pause, not on every token. -->
  </div>
</div>

<!-- Status announcements (separate from content) -->
<div
  role="status"
  class="sr-only"
>
  <!-- role="status" implies aria-live="polite".
       Announce: "Generating response...",
       "Response complete". Use role="alert" for
       errors that must interrupt immediately. -->
</div>

<!-- Stop button must be keyboard-accessible -->
<button
  aria-label="Stop generating"
  aria-keyshortcuts="Escape"
>
  Stop
</button>

Key accessibility decisions:

  • Use aria-live="polite" on the message container, not "assertive". Polite waits for a pause; assertive interrupts immediately.
  • Announce state changes (streaming started, completed, error) in a separate status region.
  • Make the Stop button keyboard-focusable and bind Escape as a shortcut.
  • After streaming completes, move focus to the message or the input field (the Escape binding and focus handoff are sketched below).
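
A sketch of both, assuming the controller from the cancellation sketch and an inputEl textarea:

// Escape cancels generation from anywhere on the page
document.addEventListener('keydown', (e) => {
  if (e.key === 'Escape') controller?.abort()
})

// Call when the assistant message reaches 'complete'
function onStreamFinished(inputEl: HTMLTextAreaElement) {
  inputEl.focus()  // hand focus back so the user can keep typing
}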

Conversation persistence

If the user refreshes the page or navigates away, they expect their conversation to still be there.

// Minimal persistence with sessionStorage (per-tab; survives refresh)
function persistConversation(messages: ChatMessage[]) {
  // Only persist completed messages — don't save streaming state
  const completedMessages = messages.filter(
    m => m.status.type === 'complete' || m.role === 'user'
  )
  sessionStorage.setItem(
    'chat-messages',
    JSON.stringify(completedMessages)
  )
}

function restoreConversation(): ChatMessage[] {
  const stored = sessionStorage.getItem('chat-messages')
  if (!stored) return []
  try {
    return JSON.parse(stored)
  } catch {
    return []
  }
}

For production: use IndexedDB for longer conversations, and sync to server for cross-device persistence.
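
A minimal IndexedDB sketch for the longer-conversation case (database and store names are arbitrary):

// Raw IndexedDB: one 'conversations' object store keyed by id
function openChatDb(): Promise<IDBDatabase> {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open('chat-db', 1)
    req.onupgradeneeded = () =>
      req.result.createObjectStore('conversations', { keyPath: 'id' })
    req.onsuccess = () => resolve(req.result)
    req.onerror = () => reject(req.error)
  })
}

async function saveConversation(id: string, messages: ChatMessage[]) {
  const db = await openChatDb()
  db.transaction('conversations', 'readwrite')
    .objectStore('conversations')
    .put({ id, messages })
}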

Putting it all together

The full architecture looks like this:

┌─────────────────────────────────────────────────────────┐
│  ChatContainer                                          │
│  ├── MessageList (scroll container, ChatScroller)       │
│  │   ├── UserMessage                                    │
│  │   ├── AssistantMessage (state machine)               │
│  │   │   ├── streaming → TokenBuffer → Markdown render  │
│  │   │   ├── interrupted → partial text + retry button  │
│  │   │   └── complete → full markdown + copy button     │
│  │   └── JumpToBottomButton (conditional)               │
│  ├── InputArea                                          │
│  │   ├── Textarea (auto-resize)                         │
│  │   ├── SendButton / StopButton (toggles on state)     │
│  │   └── ModelSelector (optional)                       │
│  └── StatusRegion (aria-live, screen reader only)       │
│                                                         │
│  State: ChatStore                                       │
│  ├── messages: ChatMessage[]                            │
│  ├── activeRequestId: string | null                     │
│  ├── abortController: AbortController | null            │
│  └── persistConversation() / restoreConversation()      │
└─────────────────────────────────────────────────────────┘


LLM-friendly summary

A complete frontend architecture guide for streaming AI chat UIs covering SSE vs fetch streaming, buffering, cancellation, error recovery, scroll behavior, and accessibility.