The pitch sounds too good to be true
"Run AI models directly in the browser. No server costs. Complete privacy. Works offline."
That is the pitch for client-side ML inference, and it is real — but the gap between a demo and a production feature is enormous. The demo runs on a MacBook Pro with 32GB RAM. Your users are on a 2021 Android phone with 4GB RAM on a train in a tunnel.
This guide covers what actually matters: choosing the right model, loading it without ruining the first experience, running inference without freezing the UI, and degrading gracefully when the device cannot handle it.
Two frameworks, different strengths
The browser ML ecosystem has converged on two primary libraries:
TensorFlow.js — Google's port of TensorFlow to JavaScript. Best for vision tasks (image classification, object detection, pose estimation) and custom models trained in Python and converted.
Transformers.js — Hugging Face's port of the Transformers library to JavaScript via ONNX Runtime. Best for NLP tasks (sentiment analysis, text classification, feature extraction, translation) and using pre-trained Hugging Face models directly.
// TensorFlow.js — image classification
import * as tf from '@tensorflow/tfjs';
import * as mobilenet from '@tensorflow-models/mobilenet';
const model = await mobilenet.load({ version: 2, alpha: 1.0 });
const img = document.getElementById('photo');
const predictions = await model.classify(img);
// [{ className: 'golden retriever', probability: 0.93 }, ...]
// Transformers.js — sentiment analysis
import { pipeline } from '@xenova/transformers';
const classifier = await pipeline('sentiment-analysis');
const result = await classifier('This product is amazing!');
// [{ label: 'POSITIVE', score: 0.9998 }]
Both support GPU-accelerated computation: TensorFlow.js via its WebGL backend (or WebGPU where available), Transformers.js via ONNX Runtime Web, which runs on WASM by default with WebGPU where available. The choice depends on your task, not preference.
| Factor | TensorFlow.js | Transformers.js |
|---|---|---|
| Best for | Vision, custom models | NLP, HF model zoo |
| Model format | TF SavedModel, TFJS layers | ONNX (converted from PyTorch/HF) |
| GPU backend | WebGL, WebGPU, WASM | WASM, WebGPU (via ONNX Runtime Web) |
| Model zoo | ~20 pre-trained models | 1000s from Hugging Face |
| Bundle size | ~300KB core + model | ~50KB core + ONNX runtime (~2MB) + model |
| Quantization | Supported (int8, float16) | Supported (int8, int4 via ONNX) |
The model loading problem
This is the part that demo apps skip. A MobileNet v2 model is ~14MB. A sentiment classifier is ~60MB. A feature extraction model is ~90MB. Your user is staring at a blank screen while that downloads.
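The arithmetic is worth making concrete. A rough sketch of the wait you are imposing, with illustrative bandwidth numbers only — real throughput varies with latency, CDN, and network contention:

```typescript
// Rough download-time estimate: model size in MB, bandwidth in Mbps.
// Illustrative only — treat the result as an order of magnitude.
function estimateDownloadSeconds(modelSizeMB: number, bandwidthMbps: number): number {
  const megabits = modelSizeMB * 8; // 1 byte = 8 bits
  return megabits / bandwidthMbps;
}

// A 60MB sentiment model on a typical ~10 Mbps mobile connection:
estimateDownloadSeconds(60, 10); // 48 seconds of waiting
```

Forty-plus seconds is not a loading spinner problem; it is a product-design problem, which is why the strategies below exist.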
Strategy 1: Lazy load on first use
Do not load the model at app startup. Load it when the feature is first invoked.
// Lazy singleton pattern
let modelPromise: Promise<any> | null = null;
function getClassifier() {
if (!modelPromise) {
modelPromise = pipeline('sentiment-analysis', 'Xenova/distilbert-base-uncased-finetuned-sst-2-english', {
progress_callback: (progress: any) => {
// Update UI: "Loading AI model... 45%"
if (progress.status === 'progress') {
updateLoadingBar(progress.progress);
}
},
});
}
return modelPromise;
}
// First call: downloads + loads model (~3-8 seconds)
// Subsequent calls: instant (returns cached promise)
const classifier = await getClassifier();
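One failure mode of the singleton above: if the first download fails on a flaky connection, the rejected promise is cached forever and every later call fails instantly. A sketch of a retry-safe variant, written against a generic loader so it assumes nothing about the Transformers.js API:

```typescript
// Retry-safe lazy singleton: if the load fails (e.g. flaky network),
// forget the cached promise so the next call re-attempts the download.
function lazyOnce<T>(load: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | null = null;
  return () => {
    if (!cached) {
      cached = load().catch((err) => {
        cached = null; // clear the failed promise to allow a retry
        throw err;
      });
    }
    return cached;
  };
}

// Usage (hypothetical): const getClassifier = lazyOnce(() => pipeline('sentiment-analysis'));
```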
Strategy 2: Verify the browser cache
Both libraries cache models in browser storage automatically — Transformers.js via the Cache API, TF.js via IndexedDB. But you should verify the cache hit before deciding which UI to show:
// Check if model is already cached before showing loading UI
async function isModelCached(modelId: string): Promise<boolean> {
try {
const cache = await caches.open('transformers-cache');
const keys = await cache.keys();
return keys.some(req => req.url.includes(modelId));
} catch {
return false;
}
}
// Usage
const cached = await isModelCached('distilbert-base-uncased');
if (cached) {
// Show feature immediately — model loads in ~200ms from cache
enableSentimentFeature();
} else {
// Show "Enable AI analysis? (60MB download)" prompt
showModelDownloadPrompt();
}
Strategy 3: Preload during idle time
If you know the user will likely use the AI feature, preload during idle time:
// Preload model during idle periods
if ('requestIdleCallback' in window) {
requestIdleCallback(() => {
getClassifier(); // Start download without blocking anything
}, { timeout: 10000 });
}
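requestIdleCallback only signals that the CPU is idle; it says nothing about the network. On a metered or slow connection, silently preloading a 60MB model is hostile. A sketch of a preload gate using the Network Information API (navigator.connection, Chromium-only — the fields below are real but not universally available):

```typescript
// Shape of the (Chromium-only) Network Information API fields we care about.
interface ConnectionHint {
  effectiveType?: string; // '4g' | '3g' | '2g' | 'slow-2g'
  saveData?: boolean;     // user enabled data-saver mode
}

// Only preload large models on fast, unmetered connections.
function shouldPreloadModel(conn: ConnectionHint | undefined): boolean {
  if (!conn) return true;          // API unavailable (Safari/Firefox): no signal, allow
  if (conn.saveData) return false; // respect data-saver
  return conn.effectiveType === '4g';
}

// Usage sketch: if (shouldPreloadModel((navigator as any).connection)) { ...preload... }
```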
Moving inference off the main thread
This is non-negotiable for production. Even with a GPU backend, inference keeps the main thread busy — tokenization, tensor preparation, and result readback all run in JavaScript. A single classification can take 50-200ms. During that time, the main thread is blocked: no scrolling, no clicks, no animations.
The fix: Web Workers.
// inference.worker.ts
import { pipeline, type Pipeline } from '@xenova/transformers';
let classifier: Pipeline | null = null;
self.onmessage = async (e: MessageEvent) => {
const { type, payload, id } = e.data;
if (type === 'load') {
classifier = await pipeline('sentiment-analysis',
'Xenova/distilbert-base-uncased-finetuned-sst-2-english'
);
self.postMessage({ id, type: 'loaded' });
return;
}
if (type === 'classify') {
if (!classifier) {
self.postMessage({ id, type: 'error', error: 'Model not loaded' });
return;
}
const result = await classifier(payload.text);
self.postMessage({ id, type: 'result', data: result });
}
};
// useInference.ts — composable for Vue/React
const worker = new Worker(
new URL('./inference.worker.ts', import.meta.url),
{ type: 'module' }
);
let requestId = 0;
const pending = new Map<number, { resolve: Function; reject: Function }>();
worker.onmessage = (e) => {
const { id, type, data, error } = e.data;
const handler = pending.get(id);
if (!handler) return;
pending.delete(id);
if (type === 'error') handler.reject(new Error(error));
else handler.resolve(data);
};
export function classify(text: string): Promise<any> {
return new Promise((resolve, reject) => {
const id = ++requestId;
pending.set(id, { resolve, reject });
worker.postMessage({ type: 'classify', payload: { text }, id });
});
}
export function loadModel(): Promise<void> {
return new Promise((resolve, reject) => {
const id = ++requestId;
pending.set(id, { resolve, reject });
worker.postMessage({ type: 'load', id });
});
}
The main thread never blocks. The user can scroll, click, and interact while inference runs in the background.
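One gap in the RPC helper above: if the worker crashes or a reply never arrives, the entry sits in `pending` forever and the caller's promise never settles. A hedged sketch of a timeout wrapper that rejects and runs a cleanup callback (e.g. removing the pending entry) — the names here are illustrative, not part of any library:

```typescript
// Reject a pending worker request if no reply arrives within `ms`,
// invoking `onTimeout` for cleanup (e.g. pending.delete(id)).
function withTimeout<T>(
  promise: Promise<T>,
  ms: number,
  onTimeout: () => void
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      onTimeout();
      reject(new Error(`Inference timed out after ${ms}ms`));
    }, ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage sketch: withTimeout(classify(text), 5_000, () => pending.delete(id))
```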
Quantization — the size vs accuracy trade-off
Full-precision models are large. Quantization shrinks them by reducing numerical precision:
| Precision | Model size | Speed | Accuracy loss | When to use |
|---|---|---|---|---|
| float32 | 100% (baseline) | Baseline | None | Development, accuracy-critical |
| float16 | ~50% | ~1.2x faster | Negligible | Default for production |
| int8 | ~25% | ~1.5-2x faster | Small (~1-2%) | Mobile, bandwidth-constrained |
| int4 | ~12.5% | ~2-3x faster | Moderate (~3-5%) | Extremely constrained devices |
With Transformers.js (v2, @xenova/transformers), quantization is the default — you opt out rather than in:
// Quantized int8 (the default): ~23MB
const pipe = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
// Full precision float32: ~90MB — explicitly opt out of quantization
const pipeFull = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', {
  quantized: false,
});
For most frontend use cases, int8 quantization is the sweet spot. The accuracy loss is imperceptible for classification and search tasks.
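The size ratios in the table above fall out of simple arithmetic — parameter count times bytes per weight. A sketch (real files add tokenizer and config overhead, so treat the result as a floor):

```typescript
// Approximate on-disk model size from parameter count and precision.
const BYTES_PER_WEIGHT: Record<string, number> = {
  float32: 4,
  float16: 2,
  int8: 1,
  int4: 0.5,
};

function approxModelSizeMB(paramsMillions: number, precision: string): number {
  // 1M params × N bytes/param = N MB
  return paramsMillions * BYTES_PER_WEIGHT[precision];
}

// all-MiniLM-L6-v2 has roughly 23M parameters:
approxModelSizeMB(23, 'float32'); // ~92MB
approxModelSizeMB(23, 'int8');    // ~23MB
```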
Device detection and graceful degradation
Not every device can run ML models. You need to detect capabilities and degrade gracefully.
interface DeviceCapabilities {
webgl: boolean;
webgpu: boolean;
memory: number | null; // GB, if available
hardwareConcurrency: number;
tier: 'high' | 'medium' | 'low' | 'unsupported';
}
function detectCapabilities(): DeviceCapabilities {
const canvas = document.createElement('canvas');
const webgl = !!(canvas.getContext('webgl2') || canvas.getContext('webgl'));
const webgpu = 'gpu' in navigator;
const memory = (navigator as any).deviceMemory ?? null;
const cores = navigator.hardwareConcurrency || 1;
let tier: DeviceCapabilities['tier'] = 'unsupported';
if (webgpu && memory && memory >= 8) tier = 'high';
else if (webgl && memory && memory >= 4) tier = 'medium';
else if (webgl) tier = 'low';
return { webgl, webgpu, memory, hardwareConcurrency: cores, tier };
}
// Adapt the experience based on device tier
function getInferenceStrategy(caps: DeviceCapabilities) {
switch (caps.tier) {
case 'high':
return { model: 'full', quantized: false, batchSize: 16 };
case 'medium':
return { model: 'full', quantized: true, batchSize: 4 };
case 'low':
return { model: 'tiny', quantized: true, batchSize: 1 };
case 'unsupported':
return { model: 'server-fallback', quantized: false, batchSize: 0 };
}
}
The key principle: never assume the device can run your model. Always have a server fallback path, and always let the user know what is happening.
The thermal throttling trap
Mobile devices throttle CPU/GPU performance when they get hot. Your model might run inference in 100ms on the first call, but after 30 consecutive classifications, the device throttles and each call takes 500ms.
The fix: batch and throttle your inference calls.
// Throttled batch inference
class InferenceThrottle {
private queue: Array<{ text: string; resolve: Function }> = [];
private processing = false;
private readonly batchSize: number;
private readonly cooldownMs: number;
constructor(batchSize = 4, cooldownMs = 100) {
this.batchSize = batchSize;
this.cooldownMs = cooldownMs;
}
async enqueue(text: string): Promise<any> {
return new Promise((resolve) => {
this.queue.push({ text, resolve });
if (!this.processing) this.processQueue();
});
}
private async processQueue() {
this.processing = true;
while (this.queue.length > 0) {
const batch = this.queue.splice(0, this.batchSize);
const texts = batch.map(item => item.text);
const results = await classify(texts); // Worker call — assumes the worker's classifier accepts an array of texts
batch.forEach((item, i) => item.resolve(results[i]));
// Cool down between batches to prevent thermal throttling
if (this.queue.length > 0) {
await new Promise(r => setTimeout(r, this.cooldownMs));
}
}
this.processing = false;
}
}
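A fixed cooldown is a guess. A complementary sketch: measure inference latency and treat a sustained spike over baseline as a throttling signal, so you can lengthen the cooldown dynamically. This is an illustrative heuristic (the class name and threshold are assumptions, not an established API):

```typescript
// Detect slowdown (a rough proxy for thermal throttling) by comparing
// each latency sample against an exponentially-weighted baseline.
class LatencyMonitor {
  private baseline: number | null = null;
  private readonly alpha: number; // EWMA smoothing factor

  constructor(alpha = 0.2) {
    this.alpha = alpha;
  }

  // Record a sample; returns true if it looks throttled (>2x baseline).
  record(latencyMs: number): boolean {
    if (this.baseline === null) {
      this.baseline = latencyMs;
      return false;
    }
    const throttled = latencyMs > this.baseline * 2;
    // Fold only healthy samples into the baseline, so a hot device
    // does not drag the baseline up with it.
    if (!throttled) {
      this.baseline = this.alpha * latencyMs + (1 - this.alpha) * this.baseline;
    }
    return throttled;
  }
}
```

A `true` result could, for example, double the cooldown between batches until latency recovers.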
When to use client-side vs server inference
| Factor | Client-side | Server-side |
|---|---|---|
| Privacy | Data never leaves device | Data sent to server |
| Cost | Free per inference | Per-call API cost |
| Latency | No network round-trip | 200-2000ms network + inference |
| Model size | Limited by device (~100MB practical) | Unlimited |
| Model quality | Smaller, quantized models | State-of-the-art models |
| Offline | Works offline | Requires connectivity |
| Consistency | Varies by device | Consistent results |
| Good for | Classification, search, simple NLP | Generation, complex reasoning, large models |
The hybrid pattern is often the best answer: use client-side inference for fast, privacy-sensitive tasks and fall back to server APIs for complex tasks or weak devices.
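A minimal sketch of that hybrid routing, with the client and server paths injected so nothing is assumed about the model library or your API endpoint:

```typescript
// Hybrid inference: try on-device first, fall back to a server API when
// the device is unsupported or local inference fails (OOM, backend error).
async function classifyHybrid<T>(
  text: string,
  opts: {
    deviceSupported: boolean;
    runLocal: (text: string) => Promise<T>;
    runRemote: (text: string) => Promise<T>;
  }
): Promise<{ result: T; source: 'client' | 'server' }> {
  if (opts.deviceSupported) {
    try {
      return { result: await opts.runLocal(text), source: 'client' };
    } catch {
      // Local inference failed — fall through to the server path
    }
  }
  return { result: await opts.runRemote(text), source: 'server' };
}
```

Surfacing `source` in the return value lets the UI tell the user whether their text left the device — which matters if privacy is part of your pitch.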
Practice designing this
Ready to apply these concepts?
- On-Device Image Classifier — design a TF.js image classification system with model loading, Workers, and device degradation
- Browser Sentiment Dashboard — design a real-time dashboard with batch inference and server fallback
For broader AI integration patterns, see 5 AI Patterns Every Frontend Engineer Will Build in 2026.
LLM-friendly summary
A frontend engineering guide to running ML models in the browser using TensorFlow.js and Transformers.js, covering model loading, Web Worker offloading, quantization, device detection, and graceful degradation.