Tier 2 · 14 min read · 2026-04-05

AI Content Moderation for Frontend Engineers — Pipelines, Classification, and Safety UX

How to design a multi-stage content moderation system that combines client-side pre-screening, server classification, and human review with transparent, trustworthy UX.

AI content moderation · content safety UX · moderation pipeline · trust and safety frontend · OpenAI moderation API

Every AI product needs moderation

If your app accepts user input and generates AI output, you need moderation on both sides. Users will submit harmful content. Models will occasionally generate harmful output. The question is not whether it will happen — it is whether your system catches it before another user sees it.

This guide covers the architecture of content moderation from a frontend engineer's perspective: the pipeline stages, the APIs, and most importantly the UX decisions that determine whether your moderation feels fair or hostile.

The multi-stage pipeline

No single approach catches everything. Production moderation uses multiple layers:

User Input
    │
    ▼
┌─────────────────────────────┐
│ Stage 1: CLIENT PRE-SCREEN  │  ← Instant, blocks obvious violations
│ Regex, word lists, length   │     before any server cost
│ checks, rate limiting       │
└──────────────┬──────────────┘
               ▼
┌─────────────────────────────┐
│ Stage 2: API CLASSIFICATION │  ← Fast ML classification
│ OpenAI Moderation API,      │     ~100-200ms, catches nuanced
│ Perspective API, custom     │     harmful content
└──────────────┬──────────────┘
               ▼
┌─────────────────────────────┐
│ Stage 3: CONTEXTUAL CHECK   │  ← LLM-based analysis for edge
│ Is this harmful IN CONTEXT? │     cases where words are fine
│ Sarcasm, quotes, education  │     but context is not
└──────────────┬──────────────┘
               ▼
┌─────────────────────────────┐
│ Stage 4: HUMAN REVIEW       │  ← Final arbiter for appeals
│ Queue for borderline cases  │     and model disagreements
└─────────────────────────────┘

Each stage has different latency, cost, and accuracy characteristics:

Stage Latency Cost per check Accuracy Catches
Client pre-screen <1ms Free Low Obvious slurs, spam patterns, length violations
API classification 100-200ms ~$0.001 High Hate speech, violence, sexual content, self-harm
Contextual check 500-2000ms ~$0.01 Very high Sarcasm, quotes, educational content, context-dependent
Human review Hours-days $0.10-1.00 Highest Edge cases, appeals, cultural nuance
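Orchestrating these stages is straightforward once each one exposes the same interface: run them cheapest-first and short-circuit on the first failure, so the expensive stages never run for content the cheap stages already rejected. A minimal sketch (the `Stage` type and stage functions are illustrative, not from a specific library):

```typescript
type StageResult = { passed: boolean; reason?: string };
type Stage = (text: string) => Promise<StageResult>;

// Run stages in order; stop at the first failure so cheap checks
// short-circuit before expensive ones are ever invoked.
async function runPipeline(text: string, stages: Stage[]): Promise<StageResult> {
  for (const stage of stages) {
    const result = await stage(text);
    if (!result.passed) return result;
  }
  return { passed: true };
}
```

Because every stage shares one signature, adding a new layer (or reordering existing ones) is a one-line change to the stage array.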

Stage 1: Client-side pre-screening

The fastest, cheapest layer. Catch obvious violations before hitting the network.

interface PreScreenResult {
  passed: boolean;
  reason?: string;
  severity?: 'block' | 'warn' | 'flag';
}

function preScreenContent(text: string): PreScreenResult {
  // 1. Length checks
  if (text.length > 10000) {
    return { passed: false, reason: 'Content exceeds maximum length', severity: 'block' };
  }
  if (text.trim().length === 0) {
    return { passed: false, reason: 'Empty content', severity: 'block' };
  }

  // 2. Rate limiting (client-side, not a security boundary)
  if (isRateLimited()) {
    return { passed: false, reason: 'Please wait before posting again', severity: 'block' };
  }

  // 3. Pattern matching for obvious violations
  // NOTE: This is NOT a security boundary — determined users bypass client checks.
  // This is a UX optimization to give instant feedback.
  const patterns = getBlockPatterns(); // Loaded from server, updated regularly
  for (const pattern of patterns) {
    if (pattern.regex.test(text)) {
      return { passed: false, reason: pattern.userMessage, severity: pattern.severity };
    }
  }

  return { passed: true };
}

Critical principle: client-side checks are a UX optimization, not a security boundary. All content must also be validated server-side. A user with DevTools can bypass any client check.
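The `isRateLimited()` helper above is assumed rather than shown. A minimal sliding-window version might look like this (the limits are illustrative, and because this is in-memory client state it resets on reload — the real limit must still be enforced server-side):

```typescript
// Timestamps of recent posts, oldest first. Module-level state is
// acceptable here because this guard is UX-only, not a security boundary.
const postTimestamps: number[] = [];

function isRateLimited(maxPosts = 5, windowMs = 60_000): boolean {
  const now = Date.now();
  // Drop timestamps that have aged out of the window.
  while (postTimestamps.length > 0 && now - postTimestamps[0] > windowMs) {
    postTimestamps.shift();
  }
  if (postTimestamps.length >= maxPosts) return true;
  postTimestamps.push(now);
  return false;
}
```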

Stage 2: API classification

The workhorse of moderation. These APIs are trained on massive datasets of harmful content.

// Using OpenAI Moderation API (free with any OpenAI API key)
async function moderateWithOpenAI(text: string): Promise<ModerationResult> {
  const response = await fetch('https://api.openai.com/v1/moderations', {
    method: 'POST',
    headers: {
      'Authorization': 'Bearer ' + process.env.OPENAI_API_KEY,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ input: text }),
  });
  const data = await response.json();
  const result = data.results[0];

  return {
    flagged: result.flagged,
    categories: result.categories, // { hate: true, violence: false, ... }
    scores: result.category_scores, // { hate: 0.92, violence: 0.01, ... }
  };
}
// Using Google Perspective API (focus on toxicity)
async function moderateWithPerspective(text: string): Promise<ModerationResult> {
  const response = await fetch(
    'https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze?key=' + API_KEY,
    {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        comment: { text },
        requestedAttributes: {
          TOXICITY: {},
          SEVERE_TOXICITY: {},
          IDENTITY_ATTACK: {},
          INSULT: {},
          THREAT: {},
        },
      }),
    }
  );
  const data = await response.json();
  const scores = Object.fromEntries(
    Object.entries(data.attributeScores).map(([key, val]: [string, any]) => [
      key,
      val.summaryScore.value,
    ])
  );
  return {
    flagged: scores.TOXICITY > 0.7,
    scores,
  };
}

Which API to choose:

API Best for Latency Cost Categories
OpenAI Moderation General content ~100ms Free (with API key) 11 categories (hate, violence, sexual, etc.)
Perspective API Comments, forums ~200ms Free (quota limits) 6 attributes (toxicity, insult, threat, etc.)
Azure Content Safety Enterprise, compliance ~150ms $1 per 1K calls Text + image, severity levels
Custom classifier Domain-specific Varies Model hosting cost Whatever you train

For most apps, start with OpenAI Moderation — it is free, fast, and covers the standard categories well.
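Both functions above return a `ModerationResult`, so it is worth pinning down that shape explicitly — and once providers share a type, wrapping them so an outage in one falls back to another is a few lines. The interface and wrapper below are assumptions for this article, not a published API:

```typescript
interface ModerationResult {
  flagged: boolean;
  categories?: Record<string, boolean>; // per-category booleans, if the API provides them
  scores: Record<string, number>;       // per-category confidence scores, 0..1
}

// Try the primary provider; if it throws (timeout, outage, quota),
// fall back to the secondary rather than failing open or closed blindly.
async function moderateWithFallback(
  text: string,
  primary: (t: string) => Promise<ModerationResult>,
  secondary: (t: string) => Promise<ModerationResult>
): Promise<ModerationResult> {
  try {
    return await primary(text);
  } catch {
    return await secondary(text);
  }
}
```

Whether to fail open (allow on total provider outage) or fail closed (block) is a product decision; the wrapper just delays having to make it.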

The UX that makes or breaks moderation

The hardest part of moderation is not the ML — it is the UX. How you communicate rejection determines whether users feel protected or persecuted.

Bad moderation UX:

  • "Your message was blocked." (no reason, no recourse)
  • Silent deletion (user thinks it posted, it didn't)
  • Overly aggressive blocking (blocking the word "kill" in a gaming context)

Good moderation UX:

// Moderation result UI patterns
interface ModerationFeedback {
  action: 'allow' | 'warn' | 'block' | 'queue_review';
  userMessage: string;
  details?: string;
  canAppeal: boolean;
  appealUrl?: string;
}

function getModerationFeedback(result: ModerationResult): ModerationFeedback {
  if (!result.flagged) {
    return { action: 'allow', userMessage: '', canAppeal: false };
  }

  const maxScore = Math.max(...Object.values(result.scores));

  // High confidence violation
  if (maxScore > 0.9) {
    return {
      action: 'block',
      userMessage: 'This content cannot be posted as it may violate our community guidelines.',
      details: 'Our automated system detected potential issues. If you believe this is an error, you can request a review.',
      canAppeal: true,
      appealUrl: '/support/content-appeal',
    };
  }

  // Medium confidence — warn but allow
  if (maxScore > 0.6) {
    return {
      action: 'warn',
      userMessage: 'This content may be flagged by our safety system. Would you like to revise it before posting?',
      canAppeal: false,
    };
  }

  // Low confidence — queue for review, post with notice
  return {
    action: 'queue_review',
    userMessage: 'Your content has been posted and is pending a brief review.',
    canAppeal: false,
  };
}
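The feedback object still needs wiring into the submit flow. One sketch of the action-to-UI mapping — note that `warn` produces a confirmation step rather than a hard stop, which is what keeps medium-confidence flags from feeling hostile (the function and return values are illustrative):

```typescript
type Action = 'allow' | 'warn' | 'block' | 'queue_review';

// Map a moderation action to what the submit handler should do next.
function nextStep(action: Action): 'publish' | 'confirm' | 'reject' {
  switch (action) {
    case 'allow':
      return 'publish';
    case 'queue_review':
      return 'publish'; // post immediately, review happens after
    case 'warn':
      return 'confirm'; // show a dialog: revise, or post anyway
    case 'block':
      return 'reject';  // show the message and the appeal link
  }
}
```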

Handling false positives

Every moderation system has false positives. The question is how you handle them.

  1. Threshold tuning. Do not use the API default threshold for everything. Tune per category — you might want a low threshold (0.5) for CSAM but a higher threshold (0.85) for profanity in an adult-audience app.

  2. Contextual allowlists. A medical app should not block anatomical terms. A gaming app should not block "kill." Maintain per-context allowlists.

  3. Appeal flow. Every block should have an appeal path. Appeals go to the human review queue with the original content, the model's classification, and the user's explanation.

  4. Transparency reports. Track and share: total flags, false positive rate, average appeal resolution time, overturn rate. This builds trust.
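Point 1 — per-category thresholds — can be made concrete with a small lookup table. The category names and numbers below are illustrative, not recommended values; tune them against your own false positive data:

```typescript
// Lower threshold = stricter. A score at or above its category's
// threshold flags the content; unknown categories use a default.
const THRESHOLDS: Record<string, number> = {
  'sexual/minors': 0.3, // near-zero tolerance
  violence: 0.75,
  harassment: 0.8,
  profanity: 0.85,      // lenient, e.g. for an adult-audience app
};

const DEFAULT_THRESHOLD = 0.7;

// Return the list of categories whose scores exceed their thresholds.
function flaggedCategories(scores: Record<string, number>): string[] {
  return Object.entries(scores)
    .filter(([cat, score]) => score >= (THRESHOLDS[cat] ?? DEFAULT_THRESHOLD))
    .map(([cat]) => cat);
}
```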

Moderating AI output

Do not forget to moderate what your AI generates, not just user input.

async function generateWithSafetyNet(
  prompt: string,
  generateFn: (p: string) => Promise<string>,
  moderateFn: (text: string) => Promise<ModerationResult>
): Promise<{ text: string; safe: boolean; fallback?: string }> {
  // 1. Moderate the user's input
  const inputCheck = await moderateFn(prompt);
  if (inputCheck.flagged) {
    return {
      text: '',
      safe: false,
      fallback: 'I cannot respond to this type of request.',
    };
  }

  // 2. Generate the response
  const response = await generateFn(prompt);

  // 3. Moderate the AI's output
  const outputCheck = await moderateFn(response);
  if (outputCheck.flagged) {
    return {
      text: '',
      safe: false,
      fallback: 'I generated a response but it was flagged by our safety system. Please try rephrasing your question.',
    };
  }

  return { text: response, safe: true };
}

Practice designing this

For the broader landscape of AI features in frontend apps, see 5 AI Patterns Every Frontend Engineer Will Build in 2026.

LLM-friendly summary

A guide to building content moderation systems for user-generated content, covering multi-stage pipelines (client pre-screen, server classify, human review), moderation APIs, false positive handling, and trust & safety UX patterns.