What RAG actually is (in one paragraph)
Retrieval-Augmented Generation means: instead of asking a model to answer from its training data, you first search your own documents for relevant passages, then ask the model to answer using only those passages. The model generates; the retrieval grounds. That is the entire idea. Everything else is implementation.
This guide is for frontend engineers who need to build or design RAG features without becoming ML specialists. We will cover the full pipeline with real code, the architecture decisions you will actually face, and the frontend UX patterns that determine whether users trust the output.
The pipeline, step by step
Step 1: Chunk the source material
Your documents need to be broken into pieces small enough to retrieve precisely, but large enough to contain meaningful context.
// Naive chunking — splits on character count
// ❌ This is what most tutorials show, and it's almost always wrong
function naiveChunk(text, size = 500) {
  const chunks = []
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size))
  }
  return chunks
}
// Problems:
// - Cuts mid-sentence, mid-paragraph, mid-thought
// - A question about "refund policy" might match a chunk that
// starts with "...within 30 days" but not the chunk that
// starts with "Our refund policy is..."
// Better: semantic chunking with overlap
// ✅ Split on paragraph/section boundaries with overlap
function semanticChunk(text, {
  maxTokens = 300,
  overlapTokens = 50,
  separators = ['\n## ', '\n### ', '\n\n', '\n', '. '],
} = {}) {
  const chunks = []
  let remaining = text
  for (const separator of separators) {
    if (remaining.length === 0) break
    const sections = remaining.split(separator)
    remaining = ''
    let currentChunk = ''
    for (const section of sections) {
      const candidate = currentChunk
        ? currentChunk + separator + section
        : section
      if (estimateTokens(candidate) > maxTokens && currentChunk) {
        chunks.push(currentChunk.trim())
        // Overlap: keep the last N tokens of the previous chunk
        const overlapText = getLastNTokens(currentChunk, overlapTokens)
        currentChunk = overlapText + separator + section
      } else {
        currentChunk = candidate
      }
    }
    if (currentChunk) remaining = currentChunk
  }
  if (remaining.trim()) chunks.push(remaining.trim())
  return chunks
}
function estimateTokens(text) {
  // Rough estimate: 1 token ≈ 4 characters for English
  return Math.ceil(text.length / 4)
}

function getLastNTokens(text, n) {
  // Companion to estimateTokens: keep roughly the last n tokens
  // (~4 characters each), snapped to a word boundary
  const tail = text.slice(-n * 4)
  const firstSpace = tail.indexOf(' ')
  return firstSpace === -1 ? tail : tail.slice(firstSpace + 1)
}
Why overlap matters: Without overlap, a question like "What is the refund window?" might match a chunk that mentions "30 days" but lacks the context that this refers to refunds. The 50-token overlap ensures context bleeds across chunk boundaries.
Chunking strategies by content type:
┌──────────────────┬─────────────────────────────────────────┐
│ Content type │ Chunking strategy │
├──────────────────┼─────────────────────────────────────────┤
│ Documentation │ Split on headings (##, ###), keep the │
│ │ heading as prefix in each chunk │
│ │ │
│ FAQ │ One chunk per Q+A pair (natural units) │
│ │ │
│ Legal/Policy │ Split on numbered sections, large │
│ │ overlap (100+ tokens) for cross-refs │
│ │ │
│ Chat transcripts │ Split on speaker turns, include 2-3 │
│ │ previous turns for context │
│ │ │
│ Code │ Split on function/class boundaries, │
│ │ include imports and type definitions │
└──────────────────┴─────────────────────────────────────────┘
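For the documentation row above, a minimal heading-prefix chunker might look like this. The regex and chunk format are illustrative choices, not the only way to do it:

```javascript
// Split markdown on ## / ### headings and prefix each chunk with its
// heading, so the embedding captures the topic as well as the body text.
function chunkByHeadings(markdown) {
  const chunks = []
  let heading = ''
  let body = []
  for (const line of markdown.split('\n')) {
    if (/^##{1,2} /.test(line)) {
      if (body.join('\n').trim()) {
        chunks.push(`${heading}\n${body.join('\n').trim()}`.trim())
      }
      heading = line
      body = []
    } else {
      body.push(line)
    }
  }
  if (body.join('\n').trim()) {
    chunks.push(`${heading}\n${body.join('\n').trim()}`.trim())
  }
  return chunks
}
```

Each chunk now starts with its heading, so a query like "refund policy" can match the chunk even when the body only says "within 30 days."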
Step 2: Embed the chunks
Each chunk becomes a vector — a list of numbers that represents its semantic meaning. Similar meanings produce similar vectors.
// Server-side embedding with OpenAI
import OpenAI from 'openai'
const openai = new OpenAI()
async function embedChunks(chunks) {
  // Batch embedding — more efficient than one-at-a-time
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small', // 1536 dimensions, $0.02/1M tokens
    input: chunks,
  })
  return response.data.map((item, i) => ({
    text: chunks[i],
    embedding: item.embedding,
  }))
}
// Client-side embedding with Transformers.js (no API cost)
import { pipeline } from '@xenova/transformers'
let embedder = null
async function getEmbedder() {
  if (!embedder) {
    // ~30MB download, cached by the browser
    embedder = await pipeline(
      'feature-extraction',
      'Xenova/all-MiniLM-L6-v2'
    )
  }
  return embedder
}

async function embedText(text) {
  const model = await getEmbedder()
  const output = await model(text, {
    pooling: 'mean',
    normalize: true,
  })
  return Array.from(output.data)
}
Model choice matters more than you think:
┌─────────────────────────┬────────┬────────┬──────────────┐
│ Model │ Dims │ Size │ Use case │
├─────────────────────────┼────────┼────────┼──────────────┤
│ all-MiniLM-L6-v2 │ 384 │ ~30MB │ Client-side, │
│ (Transformers.js) │ │ │ good quality │
│ │ │ │ │
│ text-embedding-3-small │ 1536 │ API │ Server-side, │
│ (OpenAI) │ │ │ great quality│
│ │ │ │ │
│ text-embedding-3-large │ 3072 │ API │ When quality │
│ (OpenAI) │ │ │ is critical │
│ │ │ │ │
│ voyage-3 │ 1024 │ API │ Best for │
│ (Voyage AI) │ │ │ code search │
└─────────────────────────┴────────┴────────┴──────────────┘
Step 3: Store and retrieve
You need a way to find the most similar chunks to a query. This is vector similarity search.
// Simple in-memory cosine similarity (fine for < 10K chunks)
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

function searchChunks(queryEmbedding, chunks, topK = 5) {
  return chunks
    .map(chunk => ({
      ...chunk,
      score: cosineSimilarity(queryEmbedding, chunk.embedding),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
}
For production with larger corpora, use a vector database:
// With pgvector (Postgres extension)
// Store: INSERT INTO docs (content, embedding)
// VALUES ($1, $2)
//
// Search: SELECT content, 1 - (embedding <=> $1) as score
// FROM docs
// ORDER BY embedding <=> $1
// LIMIT 5
//
// The <=> operator computes cosine distance
// pgvector builds an IVFFlat or HNSW index for fast search
Step 4: Generate a grounded answer
Now you feed the retrieved chunks into a prompt:
async function generateAnswer(question, retrievedChunks) {
  const context = retrievedChunks
    .map((chunk, i) => `[Source ${i + 1}]: ${chunk.text}`)
    .join('\n\n')

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: `You are a helpful assistant. Answer the user's question
using ONLY the provided sources. If the sources don't contain enough
information, say so. Cite sources using [Source N] notation.

Sources:
${context}`,
      },
      { role: 'user', content: question },
    ],
    temperature: 0.2, // Lower = more faithful to sources
  })

  return {
    answer: response.choices[0].message.content,
    sources: retrievedChunks.map(c => ({
      text: c.text,
      score: c.score,
    })),
  }
}
Temperature matters: At temperature 0.7+, the model starts "elaborating" beyond the sources. For RAG, use 0.1-0.3 to keep it grounded.
The frontend UX that makes or breaks RAG
This is where most RAG implementations fail. The pipeline works, but the UI does not build trust.
Citation UI patterns (from worst to best)
Level 1 (bad):
"The refund window is 30 days."
→ No source. User has no reason to trust this.
Level 2 (okay):
"The refund window is 30 days."
Sources: [Refund Policy] [Terms of Service]
→ Sources listed but not connected to specific claims.
Level 3 (good):
"The refund window is 30 days [1]. Exceptions apply
for digital goods [2]."
[1] Refund Policy, Section 3.1
[2] Terms of Service, Section 8
→ Inline citations. User can verify each claim.
Level 4 (great):
Same as Level 3, plus:
- Click a citation to see the source chunk highlighted
- Source chunks show similarity score as a confidence bar
- "Based on 3 sources (high confidence)" header
→ Full transparency. User trusts because they can verify.
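The parsing step behind Levels 3 and 4 can be sketched as follows: turn the "[Source N]" markers the prompt asked for into structured segments your component layer can render as clickable citations. The segment shape here is an assumption, adapt it to your UI framework:

```javascript
// Split an answer containing "[Source N]" markers into renderable
// segments: plain-text parts and citation parts carrying the source index.
function parseCitations(answer) {
  const segments = []
  const re = /\[Source (\d+)\]/g
  let last = 0
  let match
  while ((match = re.exec(answer)) !== null) {
    if (match.index > last) {
      segments.push({ type: 'text', value: answer.slice(last, match.index) })
    }
    segments.push({ type: 'citation', source: Number(match[1]) })
    last = match.index + match[0].length
  }
  if (last < answer.length) {
    segments.push({ type: 'text', value: answer.slice(last) })
  }
  return segments
}
```

Your component then maps text segments to spans and citation segments to buttons that open the matching source chunk.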
Handling low-confidence results
// Don't just check if you got results — check if they're GOOD
function assessConfidence(retrievedChunks) {
  if (retrievedChunks.length === 0) {
    return {
      level: 'none',
      message: 'I don\'t have information about this topic.',
    }
  }

  const topScore = retrievedChunks[0].score
  const avgScore = retrievedChunks.reduce((s, c) => s + c.score, 0)
    / retrievedChunks.length

  if (topScore < 0.3) {
    return {
      level: 'low',
      message: 'I found some potentially related information, '
        + 'but I\'m not confident it answers your question.',
    }
  }

  if (topScore > 0.7 && avgScore > 0.5) {
    return {
      level: 'high',
      message: null, // No disclaimer needed
    }
  }

  return {
    level: 'medium',
    message: 'This answer is based on partially relevant sources. '
      + 'Please verify the details.',
  }
}
The five failure modes and how to debug them
1. Retrieval miss — right document exists, wrong chunks returned
Symptom: The answer is wrong or "I don't know," but you can see the correct info in your documents.
Debug: Log the similarity scores between the query embedding and your chunks. If the correct chunk scores below 0.3, your chunking is the problem — the information is split across chunks, or the chunk lacks enough context for the embedding to capture its meaning.
Fix: Re-chunk with larger context windows, add heading prefixes, or use a better embedding model.
2. Retrieval hit, generation miss — right chunks retrieved, wrong answer
Symptom: The sources shown are correct, but the generated answer misinterprets them.
Debug: Read the prompt. Is the context ordering confusing the model? Is the system prompt allowing "elaboration" beyond sources?
Fix: Lower temperature, restructure the prompt to put the most relevant chunk first, add explicit "only use these sources" constraints.
3. Hallucination despite grounding — model adds plausible-sounding details not in sources
Symptom: The answer looks right and cites sources, but includes specific numbers or claims that are not in any retrieved chunk.
Debug: Automate this — compare each claim in the answer against the source text. This is called "faithfulness evaluation."
Fix: Add post-processing that flags unsupported claims, or use a second LLM call to verify faithfulness.
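Before reaching for a second LLM call, a cheap first pass is a numeric-claim check: every number in the answer should appear somewhere in the retrieved text. This is a crude heuristic, not full faithfulness evaluation, but invented figures (prices, limits, dates) are among the most damaging hallucinations:

```javascript
// Flag numbers in an answer that appear in no retrieved chunk.
// Crude by design: it misses paraphrases and derived values, but it
// catches invented figures cheaply.
function unsupportedNumbers(answer, chunks) {
  const sourceText = chunks.map(c => c.text).join(' ')
  // Remove [Source N] markers so citation indices aren't treated as claims
  const cleaned = answer.replace(/\[Source \d+\]/g, '')
  const numbers = cleaned.match(/\d+(?:\.\d+)?/g) || []
  return [...new Set(numbers)].filter(n => !sourceText.includes(n))
}
```

A non-empty result is a strong signal to show a warning in the UI or escalate to an LLM-based faithfulness check.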
4. Stale knowledge — correct at indexing time, wrong now
Symptom: The answer references outdated information (old prices, deprecated APIs, former policies).
Debug: Check when your index was last updated. Add indexed_at timestamps to chunks.
Fix: Re-index on content change. Show "last updated" dates in the UI so users can assess freshness.
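If you tag chunks with the indexing timestamp suggested above, surfacing freshness to the UI takes only a few lines. The `indexedAt` field name and the 90-day threshold are arbitrary choices for illustration:

```javascript
// Given chunks tagged with an indexedAt timestamp (milliseconds), report
// the oldest source so the UI can show a "last updated" date and warn
// when the index has gone stale.
function indexFreshness(chunks, maxAgeDays = 90, now = Date.now()) {
  const oldest = Math.min(...chunks.map(c => c.indexedAt))
  const ageDays = (now - oldest) / (1000 * 60 * 60 * 24)
  return {
    oldestIndexedAt: new Date(oldest),
    stale: ageDays > maxAgeDays,
  }
}
```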
5. Scope confusion — user asks something outside the knowledge base
Symptom: The model tries to answer anyway, using tangentially related chunks.
Fix: Use the confidence assessment above. Below a similarity threshold, respond with "This topic isn't covered in our documentation" rather than forcing a weak answer.
Measuring RAG quality
Do not rely on "the answers sound good." Build a test suite:
// A RAG evaluation test case
const testCase = {
  question: "What is the refund policy for annual plans?",
  expectedChunkIds: ["refund-policy-section-3", "annual-billing-faq"],
  expectedAnswer: /30 days.*annual/i,
  expectedCitations: ["refund-policy-section-3"],
}

// Run retrieval
const chunks = await search(testCase.question)
const retrievedIds = chunks.map(c => c.id)

// Metrics
const recall = testCase.expectedChunkIds
  .filter(id => retrievedIds.includes(id)).length
  / testCase.expectedChunkIds.length
// recall = 1.0 means all expected chunks were retrieved

const precision = retrievedIds
  .filter(id => testCase.expectedChunkIds.includes(id)).length
  / retrievedIds.length
// precision = 1.0 means no irrelevant chunks were retrieved

// Generate and check answer
const result = await generateAnswer(testCase.question, chunks)
const answerMatchesExpected = testCase.expectedAnswer.test(result.answer)
Build 20-50 test cases from real user questions. Run them after every chunking or prompt change. This is your RAG regression suite.
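The loop over those test cases can be two pure functions, one per-case and one aggregate. The 0.8 recall threshold is an arbitrary example; tune it to your corpus:

```javascript
// Score one retrieval result against a test case: recall over expected
// chunks and precision over retrieved chunks (both in [0, 1]).
function scoreRetrieval(testCase, retrievedIds) {
  const hits = testCase.expectedChunkIds
    .filter(id => retrievedIds.includes(id)).length
  return {
    recall: hits / testCase.expectedChunkIds.length,
    precision: retrievedIds
      .filter(id => testCase.expectedChunkIds.includes(id)).length
      / retrievedIds.length,
  }
}

// Aggregate across the suite: results[i] holds the retrieved chunk ids
// for testCases[i]; fail the run when average recall regresses.
function runSuite(testCases, results, minRecall = 0.8) {
  const scores = testCases.map((tc, i) => scoreRetrieval(tc, results[i]))
  const avgRecall = scores.reduce((s, r) => s + r.recall, 0) / scores.length
  return { avgRecall, pass: avgRecall >= minRecall }
}
```

Run this in CI so a chunking or prompt change that quietly breaks retrieval fails the build instead of reaching users.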
Practice designing this
The most direct practice problem for RAG is RAG-Powered Smart FAQ — it covers the full architecture: chunking strategy, retrieval design, citation UX, and failure modes.
For broader AI architecture context:
- 5 AI Patterns Every Frontend Engineer Will Build in 2026 — where RAG fits among other patterns
- Building a Streaming AI Chat UI — how to stream the generated answer
LLM-friendly summary
A frontend-focused guide to retrieval-augmented generation that explains chunking, embedding, retrieval, answer grounding, and client-vs-server trade-offs.