The pitch sounds too good to be true
"Run AI models directly in the browser. No server costs. Complete privacy. Works offline."
That is the pitch for client-side ML inference, and it is real — but the gap between a demo and a production feature is enormous. The demo runs on a MacBook Pro with 32GB RAM. Your users are on a 2021 Android phone with 4GB RAM on a train in a tunnel.
This guide covers what actually matters: choosing the right model, loading it without ruining the first experience, running inference without freezing the UI, and degrading gracefully when the device cannot handle it.
Two frameworks, different strengths
The browser ML ecosystem has converged on two primary libraries:
TensorFlow.js — Google's port of TensorFlow to JavaScript. Best for vision tasks (image classification, object detection, pose estimation) and custom models trained in Python and converted.
Transformers.js — Hugging Face's port of the Transformers library to JavaScript via ONNX Runtime. Best for NLP tasks (sentiment analysis, text classification, feature extraction, translation) and using pre-trained Hugging Face models directly.
// TensorFlow.js — image classification
import * as tf from '@tensorflow/tfjs';
import * as mobilenet from '@tensorflow-models/mobilenet';
const model = await mobilenet.load({ version: 2, alpha: 1.0 });
const img = document.getElementById('photo');
const predictions = await model.classify(img);
// [{ className: 'golden retriever', probability: 0.93 }, ...]
// Transformers.js — sentiment analysis
import { pipeline } from '@xenova/transformers';
const classifier = await pipeline('sentiment-analysis');
const result = await classifier('This product is amazing!');
// [{ label: 'POSITIVE', score: 0.9998 }]
Both support GPU-accelerated computation: TensorFlow.js via its WebGL backend (or WebGPU where available), Transformers.js via ONNX Runtime Web, which runs on WASM by default with WebGPU where available. The choice depends on your task, not preference.
| Factor | TensorFlow.js | Transformers.js |
|---|---|---|
| Best for | Vision, custom models | NLP, HF model zoo |
| Model format | TF SavedModel, TFJS layers | ONNX (converted from PyTorch/HF) |
| GPU backend | WebGL, WebGPU, WASM | WASM, WebGPU (via ONNX Runtime Web) |
| Model zoo | ~20 pre-trained models | 1000s from Hugging Face |
| Bundle size | ~300KB core + model | ~50KB core + ONNX runtime (~2MB) + model |
| Quantization | Supported (int8, float16) | Supported (int8, int4 via ONNX) |
The model loading problem
This is the part that demo apps skip. A MobileNet v2 model is ~14MB. A sentiment classifier is ~60MB. A feature extraction model is ~90MB. Your user is staring at a blank screen while that downloads.
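The arithmetic is worth making concrete. A rough sketch of the wait you are imposing, with illustrative bandwidth numbers only — real throughput varies with latency, CDN, and network contention:

```typescript
// Rough download-time estimate: model size in MB, bandwidth in Mbps.
// Illustrative only — treat the result as an order of magnitude.
function estimateDownloadSeconds(modelSizeMB: number, bandwidthMbps: number): number {
  const megabits = modelSizeMB * 8; // 1 byte = 8 bits
  return megabits / bandwidthMbps;
}

// A 60MB sentiment model on a typical ~10 Mbps mobile connection:
estimateDownloadSeconds(60, 10); // 48 seconds of waiting
```

Forty-plus seconds is not a loading spinner problem; it is a product-design problem, which is why the strategies below exist.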
Strategy 1: Lazy load on first use
Do not load the model at app startup. Load it when the feature is first invoked.
// Lazy singleton pattern
let modelPromise: Promise<any> | null = null;
function getClassifier() {
if (!modelPromise) {
modelPromise = pipeline('sentiment-analysis', 'Xenova/distilbert-base-uncased-finetuned-sst-2-english', {
progress_callback: (progress: any) => {
// Update UI: "Loading AI model... 45%"
if (progress.status === 'progress') {
updateLoadingBar(progress.progress);
}
},
});
}
return modelPromise;
}
// First call: downloads + loads model (~3-8 seconds)
// Subsequent calls: instant (returns cached promise)
const classifier = await getClassifier();
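One failure mode of the singleton above: if the first download fails on a flaky connection, the rejected promise is cached forever and every later call fails instantly. A sketch of a retry-safe variant, written against a generic loader so it assumes nothing about the Transformers.js API:

```typescript
// Retry-safe lazy singleton: if the load fails (e.g. flaky network),
// forget the cached promise so the next call re-attempts the download.
function lazyOnce<T>(load: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | null = null;
  return () => {
    if (!cached) {
      cached = load().catch((err) => {
        cached = null; // clear the failed promise to allow a retry
        throw err;
      });
    }
    return cached;
  };
}

// Usage (hypothetical): const getClassifier = lazyOnce(() => pipeline('sentiment-analysis'));
```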
Strategy 2: Verify the browser cache
Both libraries cache models in browser storage automatically — Transformers.js via the Cache API, TF.js via IndexedDB. But you should verify the cache hit before deciding which UI to show:
// Check if model is already cached before showing loading UI
async function isModelCached(modelId: string): Promise<boolean> {
try {
const cache = await caches.open('transformers-cache');
const keys = await cache.keys();
return keys.some(req => req.url.includes(modelId));
} catch {
return false;
}
}
// Usage
const cached = await isModelCached('distilbert-base-uncased');
if (cached) {
// Show feature immediately — model loads in ~200ms from cache
enableSentimentFeature();
} else {
// Show "Enable AI analysis? (60MB download)" prompt
showModelDownloadPrompt();
}
Strategy 3: Preload during idle time
If you know the user will likely use the AI feature, preload during idle time:
// Preload model during idle periods
if ('requestIdleCallback' in window) {
requestIdleCallback(() => {
getClassifier(); // Start download without blocking anything
}, { timeout: 10000 });
}
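requestIdleCallback only signals that the CPU is idle; it says nothing about the network. On a metered or slow connection, silently preloading a 60MB model is hostile. A sketch of a preload gate using the Network Information API (navigator.connection, Chromium-only — the fields below are real but not universally available):

```typescript
// Shape of the (Chromium-only) Network Information API fields we care about.
interface ConnectionHint {
  effectiveType?: string; // '4g' | '3g' | '2g' | 'slow-2g'
  saveData?: boolean;     // user enabled data-saver mode
}

// Only preload large models on fast, unmetered connections.
function shouldPreloadModel(conn: ConnectionHint | undefined): boolean {
  if (!conn) return true;          // API unavailable (Safari/Firefox): no signal, allow
  if (conn.saveData) return false; // respect data-saver
  return conn.effectiveType === '4g';
}

// Usage sketch: if (shouldPreloadModel((navigator as any).connection)) { ...preload... }
```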
Moving inference off the main thread
This is non-negotiable for production. Even with a GPU backend, inference keeps the main thread busy — tokenization, tensor preparation, and result readback all run in JavaScript. A single classification can take 50-200ms. During that time, the main thread is blocked: no scrolling, no clicks, no animations.
The fix: Web Workers.
// inference.worker.ts
import { pipeline, type Pipeline } from '@xenova/transformers';
let classifier: Pipeline | null = null;
self.onmessage = async (e: MessageEvent) => {
const { type, payload, id } = e.data;
if (type === 'load') {
classifier = await pipeline('sentiment-analysis',
'Xenova/distilbert-base-uncased-finetuned-sst-2-english'
);
self.postMessage({ id, type: 'loaded' });
return;
}
if (type === 'classify') {
if (!classifier) {
self.postMessage({ id, type: 'error', error: 'Model not loaded' });
return;
}
const result = await classifier(payload.text);
self.postMessage({ id, type: 'result', data: result });
}
};
// useInference.ts — composable for Vue/React
const worker = new Worker(
new URL('./inference.worker.ts', import.meta.url),
{ type: 'module' }
);
let requestId = 0;
const pending = new Map<number, { resolve: Function; reject: Function }>();
worker.onmessage = (e) => {
const { id, type, data, error } = e.data;
const handler = pending.get(id);
if (!handler) return;
pending.delete(id);
if (type === 'error') handler.reject(new Error(error));
else handler.resolve(data);
};
export function classify(text: string): Promise<any> {
return new Promise((resolve, reject) => {
const id = ++requestId;
pending.set(id, { resolve, reject });
worker.postMessage({ type: 'classify', payload: { text }, id });
});
}
export function loadModel(): Promise<void> {
return new Promise((resolve, reject) => {
const id = ++requestId;
pending.set(id, { resolve, reject });
worker.postMessage({ type: 'load', id });
});
}
The main thread never blocks. The user can scroll, click, and interact while inference runs in the background.
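One gap in the RPC helper above: if the worker crashes or a reply never arrives, the entry sits in `pending` forever and the caller's promise never settles. A hedged sketch of a timeout wrapper that rejects and runs a cleanup callback (e.g. removing the pending entry) — the names here are illustrative, not part of any library:

```typescript
// Reject a pending worker request if no reply arrives within `ms`,
// invoking `onTimeout` for cleanup (e.g. pending.delete(id)).
function withTimeout<T>(
  promise: Promise<T>,
  ms: number,
  onTimeout: () => void
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      onTimeout();
      reject(new Error(`Inference timed out after ${ms}ms`));
    }, ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage sketch: withTimeout(classify(text), 5_000, () => pending.delete(id))
```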
Quantization — the size vs accuracy trade-off
Full-precision models are large. Quantization shrinks them by reducing numerical precision:
| Precision | Model size | Speed | Accuracy loss | When to use |
|---|---|---|---|---|
| float32 | 100% (baseline) | Baseline | None | Development, accuracy-critical |
| float16 | ~50% | ~1.2x faster | Negligible | Default for production |
| int8 | ~25% | ~1.5-2x faster | Small (~1-2%) | Mobile, bandwidth-constrained |
| int4 | ~12.5% | ~2-3x faster | Moderate (~3-5%) | Extremely constrained devices |
With Transformers.js (v2, @xenova/transformers), quantization is the default — you opt out rather than in:
// Quantized int8 (the default): ~23MB
const pipe = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
// Full precision float32: ~90MB — explicitly opt out of quantization
const pipeFull = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', {
  quantized: false,
});
For most frontend use cases, int8 quantization is the sweet spot. The accuracy loss is imperceptible for classification and search tasks.
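The size ratios in the table above fall out of simple arithmetic — parameter count times bytes per weight. A sketch (real files add tokenizer and config overhead, so treat the result as a floor):

```typescript
// Approximate on-disk model size from parameter count and precision.
const BYTES_PER_WEIGHT: Record<string, number> = {
  float32: 4,
  float16: 2,
  int8: 1,
  int4: 0.5,
};

function approxModelSizeMB(paramsMillions: number, precision: string): number {
  // 1M params × N bytes/param = N MB
  return paramsMillions * BYTES_PER_WEIGHT[precision];
}

// all-MiniLM-L6-v2 has roughly 23M parameters:
approxModelSizeMB(23, 'float32'); // ~92MB
approxModelSizeMB(23, 'int8');    // ~23MB
```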
Device detection and graceful degradation
Not every device can run ML models. You need to detect capabilities and degrade gracefully.
interface DeviceCapabilities {
webgl: boolean;
webgpu: boolean;
memory: number | null; // GB, if available
hardwareConcurrency: number;
tier: 'high' | 'medium' | 'low' | 'unsupported';
}
function detectCapabilities(): DeviceCapabilities {
const canvas = document.createElement('canvas');
const webgl = !!(canvas.getContext('webgl2') || canvas.getContext('webgl'));
const webgpu = 'gpu' in navigator;
const memory = (navigator as any).deviceMemory ?? null;
const cores = navigator.hardwareConcurrency || 1;
let tier: DeviceCapabilities['tier'] = 'unsupported';
if (webgpu && memory && memory >= 8) tier = 'high';
else if (webgl && memory && memory >= 4) tier = 'medium';
else if (webgl) tier = 'low';
return { webgl, webgpu, memory, hardwareConcurrency: cores, tier };
}
// Adapt the experience based on device tier
function getInferenceStrategy(caps: DeviceCapabilities) {
switch (caps.tier) {
case 'high':
return { model: 'full', quantized: false, batchSize: 16 };
case 'medium':
return { model: 'full', quantized: true, batchSize: 4 };
case 'low':
return { model: 'tiny', quantized: true, batchSize: 1 };
case 'unsupported':
return { model: 'server-fallback', quantized: false, batchSize: 0 };
}
}
The key principle: never assume the device can run your model. Always have a server fallback path, and always let the user know what is happening.
The thermal throttling trap
Mobile devices throttle CPU/GPU performance when they get hot. Your model might run inference in 100ms on the first call, but after 30 consecutive classifications, the device throttles and each call takes 500ms.
The fix: batch and throttle your inference calls.
// Throttled batch inference
class InferenceThrottle {
private queue: Array<{ text: string; resolve: Function }> = [];
private processing = false;
private readonly batchSize: number;
private readonly cooldownMs: number;
constructor(batchSize = 4, cooldownMs = 100) {
this.batchSize = batchSize;
this.cooldownMs = cooldownMs;
}
async enqueue(text: string): Promise<any> {
return new Promise((resolve) => {
this.queue.push({ text, resolve });
if (!this.processing) this.processQueue();
});
}
private async processQueue() {
this.processing = true;
while (this.queue.length > 0) {
const batch = this.queue.splice(0, this.batchSize);
const texts = batch.map(item => item.text);
const results = await classify(texts); // Worker call — assumes the worker's classifier accepts an array of texts
batch.forEach((item, i) => item.resolve(results[i]));
// Cool down between batches to prevent thermal throttling
if (this.queue.length > 0) {
await new Promise(r => setTimeout(r, this.cooldownMs));
}
}
this.processing = false;
}
}
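A fixed cooldown is a guess. A complementary sketch: measure inference latency and treat a sustained spike over baseline as a throttling signal, so you can lengthen the cooldown dynamically. This is an illustrative heuristic (the class name and threshold are assumptions, not an established API):

```typescript
// Detect slowdown (a rough proxy for thermal throttling) by comparing
// each latency sample against an exponentially-weighted baseline.
class LatencyMonitor {
  private baseline: number | null = null;
  private readonly alpha: number; // EWMA smoothing factor

  constructor(alpha = 0.2) {
    this.alpha = alpha;
  }

  // Record a sample; returns true if it looks throttled (>2x baseline).
  record(latencyMs: number): boolean {
    if (this.baseline === null) {
      this.baseline = latencyMs;
      return false;
    }
    const throttled = latencyMs > this.baseline * 2;
    // Fold only healthy samples into the baseline, so a hot device
    // does not drag the baseline up with it.
    if (!throttled) {
      this.baseline = this.alpha * latencyMs + (1 - this.alpha) * this.baseline;
    }
    return throttled;
  }
}
```

A `true` result could, for example, double the cooldown between batches until latency recovers.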
When to use client-side vs server inference
| Factor | Client-side | Server-side |
|---|---|---|
| Privacy | Data never leaves device | Data sent to server |
| Cost | Free per inference | Per-call API cost |
| Latency | No network round-trip | 200-2000ms network + inference |
| Model size | Limited by device (~100MB practical) | Unlimited |
| Model quality | Smaller, quantized models | State-of-the-art models |
| Offline | Works offline | Requires connectivity |
| Consistency | Varies by device | Consistent results |
| Good for | Classification, search, simple NLP | Generation, complex reasoning, large models |
The hybrid pattern is often the best answer: use client-side inference for fast, privacy-sensitive tasks and fall back to server APIs for complex tasks or weak devices.
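A minimal sketch of that hybrid routing, with the client and server paths injected so nothing is assumed about the model library or your API endpoint:

```typescript
// Hybrid inference: try on-device first, fall back to a server API when
// the device is unsupported or local inference fails (OOM, backend error).
async function classifyHybrid<T>(
  text: string,
  opts: {
    deviceSupported: boolean;
    runLocal: (text: string) => Promise<T>;
    runRemote: (text: string) => Promise<T>;
  }
): Promise<{ result: T; source: 'client' | 'server' }> {
  if (opts.deviceSupported) {
    try {
      return { result: await opts.runLocal(text), source: 'client' };
    } catch {
      // Local inference failed — fall through to the server path
    }
  }
  return { result: await opts.runRemote(text), source: 'server' };
}
```

Surfacing `source` in the return value lets the UI tell the user whether their text left the device — which matters if privacy is part of your pitch.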
Practice designing this
Ready to apply these concepts?
- On-Device Image Classifier — design a TF.js image classification system with model loading, Workers, and device degradation
- Browser Sentiment Dashboard — design a real-time dashboard with batch inference and server fallback
For broader AI integration patterns, see 5 AI Patterns Every Frontend Engineer Will Build in 2026.
LLM-friendly summary
A frontend engineering guide to running ML models in the browser using TensorFlow.js and Transformers.js, covering model loading, Web Worker offloading, quantization, device detection, and graceful degradation.