Every AI product needs moderation
If your app accepts user input and generates AI output, you need moderation on both sides. Users will submit harmful content. Models will occasionally generate harmful output. The question is not whether it will happen — it is whether your system catches it before another user sees it.
This guide covers the architecture of content moderation from a frontend engineer's perspective: the pipeline stages, the APIs, and most importantly the UX decisions that determine whether your moderation feels fair or hostile.
The multi-stage pipeline
No single approach catches everything. Production moderation uses multiple layers:
User Input
│
▼
┌─────────────────────────────┐
│ Stage 1: CLIENT PRE-SCREEN │ ← Instant, blocks obvious violations
│ Regex, word lists, length │ before any server cost
│ checks, rate limiting │
└──────────────┬──────────────┘
▼
┌─────────────────────────────┐
│ Stage 2: API CLASSIFICATION │ ← Fast ML classification
│ OpenAI Moderation API, │ ~100-200ms, catches nuanced
│ Perspective API, custom │ harmful content
└──────────────┬──────────────┘
▼
┌─────────────────────────────┐
│ Stage 3: CONTEXTUAL CHECK │ ← LLM-based analysis for edge
│ Is this harmful IN CONTEXT? │ cases where words are fine
│ Sarcasm, quotes, education │ but context is not
└──────────────┬──────────────┘
▼
┌─────────────────────────────┐
│ Stage 4: HUMAN REVIEW │ ← Final arbiter for appeals
│ Queue for borderline cases │ and model disagreements
└─────────────────────────────┘
Each stage has different latency, cost, and accuracy characteristics:
| Stage | Latency | Cost per check | Accuracy | Catches |
|---|---|---|---|---|
| Client pre-screen | <1ms | Free | Low | Obvious slurs, spam patterns, length violations |
| API classification | 100-200ms | ~$0.001 | High | Hate speech, violence, sexual content, self-harm |
| Contextual check | 500-2000ms | ~$0.01 | Very high | Sarcasm, quotes, educational content, context-dependent |
| Human review | Hours-days | $0.10-1.00 | Highest | Edge cases, appeals, cultural nuance |
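Wired together, the pipeline is a series of increasingly expensive checks that short-circuits as soon as a stage reaches a confident verdict. A minimal orchestration sketch, using the stage functions developed in the sections below (contextualCheck and queueForReview are hypothetical placeholders for Stages 3 and 4):
type PipelineVerdict = 'allow' | 'block' | 'needs_review';
async function moderatePipeline(text: string): Promise<PipelineVerdict> {
  // Stage 1: instant pre-screen (also re-run on the server; see below)
  const preScreen = preScreenContent(text);
  if (!preScreen.passed) return 'block';
  // Stage 2: fast ML classification via a moderation API
  const classification = await moderateWithOpenAI(text);
  if (!classification.flagged) return 'allow';
  // Stage 3: slower LLM-based check: is the flagged text harmful in context?
  // (contextualCheck is a hypothetical helper returning 'harmful' | 'safe' | 'unclear')
  const context = await contextualCheck(text, classification);
  if (context === 'harmful') return 'block';
  if (context === 'safe') return 'allow';
  // Stage 4: genuinely ambiguous content goes to the human review queue
  await queueForReview(text, classification); // hypothetical queue call
  return 'needs_review';
}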
Stage 1: Client-side pre-screening
The fastest, cheapest layer. Catch obvious violations before hitting the network.
interface PreScreenResult {
passed: boolean;
reason?: string;
severity?: 'block' | 'warn' | 'flag';
}
function preScreenContent(text: string): PreScreenResult {
// 1. Length checks
if (text.length > 10000) {
return { passed: false, reason: 'Content exceeds maximum length', severity: 'block' };
}
if (text.trim().length === 0) {
return { passed: false, reason: 'Empty content', severity: 'block' };
}
// 2. Rate limiting (client-side, not a security boundary)
if (isRateLimited()) {
return { passed: false, reason: 'Please wait before posting again', severity: 'block' };
}
// 3. Pattern matching for obvious violations
// NOTE: This is NOT a security boundary — determined users bypass client checks.
// This is a UX optimization to give instant feedback.
const patterns = getBlockPatterns(); // Loaded from server, updated regularly
for (const pattern of patterns) {
if (pattern.regex.test(text)) {
return { passed: false, reason: pattern.userMessage, severity: pattern.severity };
}
}
return { passed: true };
}
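The isRateLimited and getBlockPatterns helpers above are assumed rather than shown. One plausible shape for them (illustrative only; the endpoint name is made up, and in practice the pattern list should come from your server so it can be updated without shipping code):
interface BlockPattern {
  regex: RegExp;
  userMessage: string;
  severity: 'block' | 'warn' | 'flag';
}
let cachedPatterns: BlockPattern[] = [];
// Fetch patterns from a hypothetical endpoint at startup, refresh periodically
async function loadBlockPatterns(): Promise<void> {
  const res = await fetch('/api/moderation/patterns');
  const raw: { source: string; userMessage: string; severity: BlockPattern['severity'] }[] = await res.json();
  cachedPatterns = raw.map((p) => ({
    regex: new RegExp(p.source, 'i'),
    userMessage: p.userMessage,
    severity: p.severity,
  }));
}
function getBlockPatterns(): BlockPattern[] {
  return cachedPatterns;
}
// Simple sliding window: at most 5 submissions per minute; checking counts as an attempt
const submissionTimes: number[] = [];
function isRateLimited(): boolean {
  const cutoff = Date.now() - 60_000;
  while (submissionTimes.length > 0 && submissionTimes[0] < cutoff) {
    submissionTimes.shift();
  }
  if (submissionTimes.length >= 5) return true;
  submissionTimes.push(Date.now());
  return false;
}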
Critical principle: client-side checks are a UX optimization, not a security boundary. All content must also be validated server-side. A user with DevTools can bypass any client check.
Stage 2: API classification
The workhorse of moderation. These APIs are trained on massive datasets of harmful content.
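Both snippets below return a ModerationResult, which this guide otherwise leaves implicit. A minimal shape, with field names chosen to match how the results are used later in the article, might be:
interface ModerationResult {
  flagged: boolean;                      // did the provider (or your thresholds) flag it?
  categories?: Record<string, boolean>;  // per-category booleans (OpenAI)
  scores: Record<string, number>;        // per-category scores between 0 and 1
}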
// Using OpenAI Moderation API (free with any OpenAI API key)
async function moderateWithOpenAI(text: string): Promise<ModerationResult> {
const response = await fetch('https://api.openai.com/v1/moderations', {
method: 'POST',
headers: {
'Authorization': 'Bearer ' + process.env.OPENAI_API_KEY,
'Content-Type': 'application/json',
},
body: JSON.stringify({ input: text }),
});
const data = await response.json();
const result = data.results[0];
return {
flagged: result.flagged,
categories: result.categories, // { hate: true, violence: false, ... }
scores: result.category_scores, // { hate: 0.92, violence: 0.01, ... }
};
}
// Using Google Perspective API (focus on toxicity)
async function moderateWithPerspective(text: string): Promise<ModerationResult> {
const response = await fetch(
'https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze?key=' + API_KEY,
{
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
comment: { text },
requestedAttributes: {
TOXICITY: {},
SEVERE_TOXICITY: {},
IDENTITY_ATTACK: {},
INSULT: {},
THREAT: {},
},
}),
}
);
const data = await response.json();
const scores = Object.fromEntries(
Object.entries(data.attributeScores).map(([key, val]: [string, any]) => [
key,
val.summaryScore.value,
])
);
return {
flagged: scores.TOXICITY > 0.7,
scores,
};
}
Which API to choose:
| API | Best for | Latency | Cost | Categories |
|---|---|---|---|---|
| OpenAI Moderation | General content | ~100ms | Free (with API key) | 11 categories (hate, violence, sexual, etc.) |
| Perspective API | Comments, forums | ~200ms | Free (quota limits) | 6 attributes (toxicity, insult, threat, etc.) |
| Azure Content Safety | Enterprise, compliance | ~150ms | $1 per 1K calls | Text + image, severity levels |
| Custom classifier | Domain-specific | Varies | Model hosting cost | Whatever you train |
For most apps, start with OpenAI Moderation — it is free, fast, and covers the standard categories well.
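Wherever you call it from, the Stage 1 rule still applies: the server is the enforcement point. A sketch of a server-side submit handler that re-checks content before persisting (persistPost is a hypothetical storage call; the response shape is illustrative):
async function handleCreatePost(
  userId: string,
  text: string
): Promise<{ ok: boolean; message?: string }> {
  // Never trust the client pre-screen; run the same checks again on the server
  const preScreen = preScreenContent(text);
  if (!preScreen.passed) {
    return { ok: false, message: preScreen.reason };
  }
  // Stage 2 classification runs here, where the API key lives
  const moderation = await moderateWithOpenAI(text);
  if (moderation.flagged) {
    return { ok: false, message: 'This content may violate our community guidelines.' };
  }
  await persistPost({ userId, text }); // hypothetical storage call
  return { ok: true };
}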
The UX that makes or breaks moderation
The hardest part of moderation is not the ML — it is the UX. How you communicate rejection determines whether users feel protected or persecuted.
Bad moderation UX:
- "Your message was blocked." (no reason, no recourse)
- Silent deletion (user thinks it posted, it didn't)
- Overly aggressive blocking (e.g. rejecting the word "kill" in a gaming context)
Good moderation UX:
// Moderation result UI patterns
interface ModerationFeedback {
action: 'allow' | 'warn' | 'block' | 'queue_review';
userMessage: string;
details?: string;
canAppeal: boolean;
appealUrl?: string;
}
function getModerationFeedback(result: ModerationResult): ModerationFeedback {
if (!result.flagged) {
return { action: 'allow', userMessage: '', canAppeal: false };
}
const maxScore = Math.max(...Object.values(result.scores));
// High confidence violation
if (maxScore > 0.9) {
return {
action: 'block',
userMessage: 'This content cannot be posted as it may violate our community guidelines.',
details: 'Our automated system detected potential issues. If you believe this is an error, you can request a review.',
canAppeal: true,
appealUrl: '/support/content-appeal',
};
}
// Medium confidence — warn but allow
if (maxScore > 0.6) {
return {
action: 'warn',
userMessage: 'This content may be flagged by our safety system. Would you like to revise it before posting?',
canAppeal: false,
};
}
// Low confidence — queue for review, post with notice
return {
action: 'queue_review',
userMessage: 'Your content has been posted and is pending a brief review.',
canAppeal: false,
};
}
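How that feedback surfaces in the UI is a product decision, but a simple mapping might look like this (the show* helpers are hypothetical stand-ins for your own components):
function renderModerationFeedback(feedback: ModerationFeedback): void {
  switch (feedback.action) {
    case 'allow':
      break; // nothing to show; the content posts normally
    case 'warn':
      // Inline and non-blocking: keep the draft editable and let the user decide
      showInlineWarning(feedback.userMessage);
      break;
    case 'queue_review':
      // Post optimistically, with a subtle "pending review" badge on the item
      showPendingBadge(feedback.userMessage);
      break;
    case 'block':
      // Blocking dialog with the reason and, crucially, the appeal link
      showBlockDialog(feedback.userMessage, feedback.details, feedback.appealUrl);
      break;
  }
}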
Handling false positives
Every moderation system has false positives. The question is how you handle them.
Threshold tuning. Do not use the API default threshold for everything. Tune per category: severe categories such as CSAM or self-harm warrant an aggressive threshold (0.5 or lower), while profanity in an adult-audience app can take a higher one (0.85).
Contextual allowlists. A medical app should not block anatomical terms. A gaming app should not block "kill." Maintain per-context allowlists.
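A sketch of the threshold part, replacing the provider's blanket flagged boolean with per-category cutoffs (the numbers are placeholders to tune against your own false-positive data); a contextual allowlist works the same way, as a per-context filter over the Stage 1 pattern list:
// Aggressive thresholds for severe categories, looser ones elsewhere
const CATEGORY_THRESHOLDS: Record<string, number> = {
  'self-harm': 0.5,
  violence: 0.6,
  harassment: 0.75,
};
const DEFAULT_THRESHOLD = 0.85;
function isFlaggedForApp(result: ModerationResult): boolean {
  return Object.entries(result.scores).some(
    ([category, score]) => score >= (CATEGORY_THRESHOLDS[category] ?? DEFAULT_THRESHOLD)
  );
}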
Appeal flow. Every block should have an appeal path. Appeals go to the human review queue with the original content, the model's classification, and the user's explanation.
Transparency reports. Track and share: total flags, false positive rate, average appeal resolution time, overturn rate. This builds trust.
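Two small data shapes make the appeal flow and the transparency report concrete (both are illustrative; store them wherever your review tooling lives):
interface AppealRecord {
  contentId: string;
  originalText: string;
  classification: ModerationResult;  // what the model decided, and why
  userExplanation: string;           // the user's side of the story
  submittedAt: string;               // ISO timestamp
  status: 'pending' | 'upheld' | 'overturned';
}
interface TransparencyMetrics {
  totalFlags: number;
  appealsReceived: number;
  appealsOverturned: number;
  avgAppealResolutionHours: number;
}
// Overturn rate is the most useful health metric for your thresholds:
// if reviewers keep overturning a category, its threshold is too aggressive.
function overturnRate(m: TransparencyMetrics): number {
  return m.appealsReceived === 0 ? 0 : m.appealsOverturned / m.appealsReceived;
}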
Moderating AI output
Moderation runs in both directions: check what your AI generates, not just what users submit.
async function generateWithSafetyNet(
prompt: string,
generateFn: (p: string) => Promise<string>,
moderateFn: (text: string) => Promise<ModerationResult>
): Promise<{ text: string; safe: boolean; fallback?: string }> {
// 1. Moderate the user's input
const inputCheck = await moderateFn(prompt);
if (inputCheck.flagged) {
return {
text: '',
safe: false,
fallback: 'I cannot respond to this type of request.',
};
}
// 2. Generate the response
const response = await generateFn(prompt);
// 3. Moderate the AI's output
const outputCheck = await moderateFn(response);
if (outputCheck.flagged) {
return {
text: '',
safe: false,
fallback: 'I generated a response but it was flagged by our safety system. Please try rephrasing your question.',
};
}
return { text: response, safe: true };
}
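Calling it from a chat handler then looks roughly like this (callModel and appendAssistantMessage are hypothetical stand-ins for your generation call and rendering code):
async function handleChatMessage(userPrompt: string): Promise<void> {
  const result = await generateWithSafetyNet(
    userPrompt,
    (p) => callModel(p),                 // your actual LLM call
    (text) => moderateWithOpenAI(text)   // Stage 2 classifier from above
  );
  if (result.safe) {
    appendAssistantMessage(result.text);
  } else {
    // Show the fallback as a normal assistant message rather than an error state,
    // so the conversation keeps flowing
    appendAssistantMessage(result.fallback ?? 'Sorry, I cannot help with that.');
  }
}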
Practice designing this
- AI Content Moderation Pipeline — design a multi-stage moderation system with classification, human review, and appeal UX
- AI-Powered Smart Search — content safety applies to search results too
For the broader landscape of AI features in frontend apps, see 5 AI Patterns Every Frontend Engineer Will Build in 2026.
LLM-friendly summary
A guide to building content moderation systems for user-generated content and AI output, covering multi-stage pipelines (client pre-screen, API classification, contextual checks, human review), moderation APIs, false positive handling, and trust & safety UX patterns.