"Add AI to the product" is not a spec
When a PM says "let's add AI," they usually mean one of five completely different things. Each has different latency profiles, cost structures, trust requirements, and frontend architecture. Picking the wrong pattern costs months.
This guide breaks down the five patterns that are becoming standard in production apps, with enough technical depth that you can make the architecture call before writing code.
Pattern 1: Streaming generative UI
What it is: Progressive token rendering — the user sees the answer being "typed out" in real time.
Where you see it: ChatGPT, Cursor, Notion AI, customer support copilots, writing assistants.
The architecture:
Client                 Server                       LLM
  │                         │                          │
  │── POST /api/chat ──────▶│── prompt ───────────────▶│
  │                         │                          │
  │◀── SSE: token ──────────│◀── token ────────────────│
  │◀── SSE: token ──────────│◀── token ────────────────│
  │◀── SSE: token ──────────│◀── token ────────────────│
  │◀── SSE: [DONE] ─────────│◀── finish_reason: stop ──│
  │                         │                          │
  │  Buffer → rAF → render  │                          │
Frontend responsibilities:
- Transport: SSE or fetch streaming (not WebSockets — you rarely need bidirectional for this)
- Buffering: Do not render every token. Buffer to word boundaries or use requestAnimationFrame batching. On a fast model, you get 30-80 tokens/second — rendering each one individually causes visible jank on mobile (see the sketch after this list).
- Cancellation: AbortController on every request. When the user sends a new message, the old stream must die.
- Partial failure: If the stream drops at 60% completion, show what you have. Mark it as interrupted. Offer retry.
- Scroll management: Auto-scroll only when the user is at the bottom. Hijacking scroll while someone is reading earlier content is a top-3 AI UX complaint.
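A minimal sketch of the buffering and cancellation points above, assuming a POST /api/chat endpoint that streams plain text chunks (the endpoint and handler names are illustrative; real SSE framing adds a small parsing step):

// Sketch: streamed response with frame-batched rendering and cancellation
let controller: AbortController | null = null

async function streamChat(prompt: string, onUpdate: (text: string) => void) {
  controller?.abort() // kill the previous stream before starting a new one
  controller = new AbortController()

  const res = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt }),
    signal: controller.signal,
  })

  const reader = res.body!.getReader()
  const decoder = new TextDecoder()
  let buffered = ''
  let rafId = 0

  // One DOM update per animation frame instead of one per token
  const flush = () => { onUpdate(buffered); rafId = 0 }

  // Note: an aborted request rejects with AbortError; handle it at the call site
  while (true) {
    const { done, value } = await reader.read()
    if (done || !value) break
    buffered += decoder.decode(value, { stream: true })
    if (!rafId) rafId = requestAnimationFrame(flush)
  }
  flush()
}

Partial failure falls out of the same shape: if the read loop throws mid-stream, everything already in the buffer has been rendered, so you can mark the message as interrupted and offer a retry.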
Cost to watch: Streaming responses cannot be cached the same way as complete responses. Each token is a billing event. If your feature lets users casually regenerate, cost per user can spike 5-10x.
Real code: See the full streaming implementation pattern in Building a Streaming AI Chat UI.
Pattern 2: Retrieval-augmented generation (RAG)
What it is: Answer questions using retrieved source material instead of the model's training data.
Where you see it: Internal knowledge bases, smart FAQ, policy assistants, documentation search, customer support bots.
The architecture:
User question
      │
      ▼
┌──────────┐     ┌──────────────┐     ┌───────────┐
│  Embed   │────▶│  Vector DB   │────▶│   Top-K   │
│  query   │     │  similarity  │     │  chunks   │
└──────────┘     │  search      │     └─────┬─────┘
                 └──────────────┘           │
                                            ▼
                                   ┌──────────────────┐
                                   │  LLM generates   │
                                   │  answer grounded │
                                   │  in chunks       │
                                   └────────┬─────────┘
                                            │
                                            ▼
                                   ┌──────────────────┐
                                   │  Frontend shows  │
                                   │  answer +        │
                                   │  citations +     │
                                   │  confidence      │
                                   └──────────────────┘
The frontend's real job in RAG:
Most RAG guides focus on the retrieval pipeline. They skip the part that determines whether users actually trust the output:
- Citation UI: Every claim should link back to a source chunk. If the user cannot verify, they will not trust. A "Sources" accordion is the minimum; inline citations are better.
- Confidence signaling: When retrieval returns low-similarity results, the UI should say "I'm not confident about this" rather than presenting weak answers with full confidence (see the sketch after this list).
- Query refinement: Show the user what the system searched for. Let them see and edit the reformulated query.
- Empty state honesty: "I don't have information about this" is better than a hallucinated answer. The UI must distinguish between "no results found" and "results found but low quality."
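A minimal sketch of the confidence and empty-state handling above, assuming the retrieval step returns chunks with a 0-1 similarity score (the type names and the 0.75 threshold are illustrative):

// Sketch: turn raw retrieval scores into an explicit UI state, so
// "nothing found" and "found but weak" render differently
interface Chunk { id: string; text: string; source: string; score: number }

type RetrievalState =
  | { kind: 'no_results' }
  | { kind: 'low_confidence'; chunks: Chunk[] }
  | { kind: 'confident'; chunks: Chunk[] }

function classifyRetrieval(chunks: Chunk[], minScore = 0.75): RetrievalState {
  if (chunks.length === 0) return { kind: 'no_results' }
  const best = Math.max(...chunks.map(c => c.score))
  return best < minScore
    ? { kind: 'low_confidence', chunks }
    : { kind: 'confident', chunks }
}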
Client-side vs server-side RAG:
┌──────────────────┬───────────────────┬───────────────────┐
│                  │ Client-side       │ Server-side       │
├──────────────────┼───────────────────┼───────────────────┤
│ Corpus size      │ < 10K chunks      │ Any size          │
│ Privacy          │ Data stays local  │ Data hits server  │
│ Freshness        │ Manual updates    │ Real-time index   │
│ First load       │ Heavy (5-50MB)    │ Instant           │
│ Per-query cost   │ Zero (local)      │ Embedding + LLM   │
│ Offline          │ Works             │ Needs network     │
│ Good for         │ Personal tools,   │ Enterprise KB,    │
│                  │ static docs       │ support, search   │
└──────────────────┴───────────────────┴───────────────────┘
Deep dive: The Frontend Engineer's Guide to RAG
Pattern 3: Client-side inference
What it is: Running ML models directly in the browser using WebAssembly, WebGPU, or ONNX Runtime Web.
Where you see it: Smart compose suggestions, content classification, on-device embeddings, image processing, privacy-first features.
When it wins: small or static corpora, features the user triggers repeatedly, and data that should not leave the device.
// Real example: local semantic search with Transformers.js
import { pipeline } from '@xenova/transformers'
// One-time setup (~30MB download, cached after first load)
const embedder = await pipeline(
  'feature-extraction',
  'Xenova/all-MiniLM-L6-v2'
)
// Embed a query — runs entirely in the browser
const queryEmbedding = await embedder(
  'how do I handle auth in a React app',
  { pooling: 'mean', normalize: true }
)
// Compare against pre-computed document embeddings
const results = documents
  .map(doc => ({
    ...doc,
    score: cosineSimilarity(queryEmbedding.data, doc.embedding),
  }))
  .sort((a, b) => b.score - a.score)
  .slice(0, 5)
// With normalized vectors, cosine similarity reduces to a dot product
function cosineSimilarity(a: ArrayLike<number>, b: ArrayLike<number>) {
  let dot = 0
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i]
  return dot
}
The real trade-offs nobody talks about:
- Cold start is expensive. A 30MB model download on first visit is acceptable if the user will use the feature repeatedly. It is unacceptable for a one-time interaction. Consider: pre-load during onboarding, or use a service worker to cache.
- Mobile thermal throttling. Running inference on a phone heats the device. After 10-15 consecutive operations, performance drops significantly. Your UI needs to degrade gracefully.
- Quantized models are usually good enough. A quantized MiniLM model at 30MB gives you 95% of the quality of the full model at 130MB. For classification and embeddings, always start with the smallest model that meets your quality bar.
- Web Workers are mandatory. Never run inference on the main thread. Even a "fast" embedding takes 50-200ms, which is enough to drop multiple frames.
// Always run inference in a Worker
// main.ts
const worker = new Worker(new URL('./inference.worker.ts', import.meta.url), { type: 'module' })
worker.postMessage({ type: 'embed', text: query })
worker.onmessage = (e) => {
  if (e.data.type === 'embedding') {
    searchWithEmbedding(e.data.result)
  }
}

// inference.worker.ts
import { pipeline } from '@xenova/transformers'

let embedder = null
self.onmessage = async (e) => {
  if (!embedder) {
    embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2')
  }
  if (e.data.type === 'embed') {
    const result = await embedder(e.data.text, { pooling: 'mean', normalize: true })
    self.postMessage({ type: 'embedding', result: Array.from(result.data) })
  }
}
Pattern 4: AI form intelligence
What it is: Using AI to accelerate structured data entry — extraction, smart defaults, autofill from pasted content, field validation, and schema-aware suggestions.
Where you see it: CRM data entry, invoice processing, support ticket triage, job application forms, medical intake forms.
Why this pattern is underrated:
Chat UIs get all the attention, but form intelligence often delivers more measurable business value. Reducing a 5-minute form to 30 seconds of review has clear ROI. A chat interface that "helps you write" has fuzzy ROI.
The architecture pattern:
User pastes unstructured text
(email, invoice, description)
              │
              ▼
   ┌─────────────────────┐
   │ Extraction prompt   │
   │ + output schema     │──▶ LLM returns structured JSON
   │ (JSON mode)         │
   └─────────────────────┘
              │
              ▼
┌─────────────────────────────────────────┐
│ Frontend maps extracted values to form  │
│                                         │
│ ┌─────────┐  ┌─────────┐  ┌──────────┐  │
│ │ Name ✓  │  │ Email ✓ │  │ Phone ⚠  │  │
│ │ (high)  │  │ (high)  │  │ (low     │  │
│ │         │  │         │  │confidence│  │
│ └─────────┘  └─────────┘  └──────────┘  │
│                                         │
│ ⚠ = "needs review" — user must confirm  │
└─────────────────────────────────────────┘
Frontend responsibilities:
- Confidence thresholds: Low-confidence extractions should be visually distinct — yellow highlight, "review" badge, or unfilled with a suggestion tooltip. Never silently fill a field the model is not sure about (see the sketch after this list).
- User override is sacred: Every AI-filled field must be editable. The AI suggests; the human decides. If users cannot easily override, adoption drops to zero.
- Explanation on hover: "Extracted from line 3 of the pasted email" builds trust. Black-box autofill feels suspicious.
- Incremental extraction: For long forms, extract as the user fills in context. Do not wait for a single "extract all" button.
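A minimal sketch of the confidence-gated mapping above, assuming the extraction step returns per-field values with a 0-1 confidence score (the field shape and the 0.8 threshold are illustrative):

// Sketch: map one extracted field onto form state, gating on confidence
interface ExtractedField {
  value: string
  confidence: number   // 0-1, reported by the extraction step
  sourceSpan?: string  // e.g. "line 3 of the pasted email", for the hover explanation
}

type FieldState =
  | { status: 'filled'; value: string; source: string }            // still editable by the user
  | { status: 'needs_review'; suggestion: string; source: string } // visually flagged, must confirm
  | { status: 'empty' }

function toFieldState(field: ExtractedField | undefined, threshold = 0.8): FieldState {
  if (!field || !field.value) return { status: 'empty' }
  const source = field.sourceSpan ?? 'extracted from pasted text'
  return field.confidence >= threshold
    ? { status: 'filled', value: field.value, source }
    : { status: 'needs_review', suggestion: field.value, source }
}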
Pattern 5: Agentic tool use
What it is: The AI does not just answer — it takes actions by calling tools, reading data, and producing visible side effects.
Where you see it: Claude Code, GitHub Copilot Workspace, Devin, internal operations copilots, customer support agents that can actually modify accounts.
Why this is the hardest pattern for frontend:
The other four patterns have a clear request-response shape. Agentic UIs have a fundamentally different interaction model:
User: "Find all invoices over $10K from Q1 and flag them for review"
Agent thinking: I need to...
1. Search invoices (tool call: searchInvoices)
2. Filter by amount > 10000 and date range
3. Flag each one (tool call: flagInvoice × N)
4. Report what I did
Frontend must show:
├── What the agent is planning to do (before it does it)
├── Which tool is currently executing (live status)
├── What each tool returned (inspectable results)
├── Whether any step failed (error per step, not per session)
└── What the agent concluded (summary of actions taken)
The frontend design challenge:
- Approval gates: Some actions need user confirmation before execution. The UI needs a clear "the agent wants to do X — approve or deny" pattern, not just a wall of auto-executing actions (see the sketch after this list).
- Action history: A scrollable log of what the agent did, with expandable details per step. Users need to audit and understand.
- Partial rollback: If step 3 of 5 fails, what happens to steps 1-2? The UI must communicate this.
- Token/cost visibility: Agentic loops can burn through tokens fast. Show estimated or actual cost if the agent is running expensive multi-turn loops.
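One way to model this on the client is a per-step state record; the shapes below are an illustrative sketch, where each step carries its own status so the UI can render the plan, live progress, per-step results and errors, and approval gates:

// Sketch: per-step records drive the agent UI
interface AgentStep {
  id: string
  tool: string          // e.g. 'searchInvoices', 'flagInvoice'
  args: unknown
  status: 'planned' | 'awaiting_approval' | 'running' | 'succeeded' | 'failed'
  result?: unknown      // inspectable tool output, shown in an expandable row
  error?: string        // per-step error, not per-session
}

interface AgentSession {
  steps: AgentStep[]    // ordered, auditable action history
  summary?: string      // the agent's final report of what it did
  costUsd?: number      // surfaced when multi-turn loops get expensive
}

// Approval gate: nothing marked 'awaiting_approval' runs until the user says so
function approveStep(session: AgentSession, stepId: string): AgentSession {
  return {
    ...session,
    steps: session.steps.map(s =>
      s.id === stepId && s.status === 'awaiting_approval'
        ? { ...s, status: 'running' as const }
        : s
    ),
  }
}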
Choosing the right pattern: a decision tree
Instead of a checklist, walk through these questions in order:
1. Does the user need to VERIFY the output against sources?
   → Yes: RAG (Pattern 2)
   → No: continue
2. Does the user need to SEE the output being generated?
   → Yes: Streaming (Pattern 1)
   → No: continue
3. Does the AI need to TAKE ACTIONS, not just generate text?
   → Yes: Agentic (Pattern 5)
   → No: continue
4. Is the output STRUCTURED data that fills a known schema?
   → Yes: Form intelligence (Pattern 4)
   → No: continue
5. Must the data STAY ON DEVICE for privacy or cost reasons?
   → Yes: Client-side inference (Pattern 3)
   → No: Default to streaming (Pattern 1) — it's the most flexible starting point
Many real features combine patterns. A smart FAQ might use RAG for retrieval and streaming for answer generation. A support copilot might use form intelligence for ticket creation and agentic tool use for account modifications.
The cost reality
┌─────────────────────────┬──────────────────────────────────┐
│ Pattern                 │ Cost driver                      │
├─────────────────────────┼──────────────────────────────────┤
│ Streaming               │ ~$0.01-0.10 per response         │
│                         │ (scales with output length)      │
│                         │                                  │
│ RAG                     │ ~$0.005 embedding + $0.02-0.08   │
│                         │ generation per query             │
│                         │                                  │
│ Client-side inference   │ $0 per query after initial load  │
│                         │ (fixed CDN cost for model files) │
│                         │                                  │
│ Form intelligence       │ ~$0.01-0.03 per extraction       │
│                         │ (short prompts, JSON mode)       │
│                         │                                  │
│ Agentic                 │ ~$0.10-2.00 per session          │
│                         │ (multiple turns, tool calls)     │
└─────────────────────────┴──────────────────────────────────┘
These numbers shift with model pricing changes, but the relative magnitudes stay consistent. Client-side is cheapest at scale. Agentic is most expensive and hardest to predict.
Practice designing these
Each pattern maps to a practice problem:
- RAG-Powered Smart FAQ — design the retrieval, citation, and trust UX
- Real-Time Notification Center — streaming event delivery and state synchronization
- Dynamic Form Builder — structured data, intelligent defaults, schema-aware suggestions
For the implementation deep-dives, see Building a Streaming AI Chat UI and The Frontend Engineer's Guide to RAG.
LLM-friendly summary
A survey of five AI product patterns for frontend engineers: streaming interfaces, RAG search, client-side inference, AI form intelligence, and tool-using conversational agents.