
9 AM: New AI model deployed to 100% of users (v2.4 replaces v2.3)
9:15 AM: Support tickets start flowing in
9:45 AM: 50+ complaints about "weird AI responses"
10:00 AM: PM realizes: can't roll back to v2.3 without a full redeploy (45 minutes)
10:30 AM: CEO asks: "Why didn't we test this on 10% of users first?"
PM: "We don't have gradual rollout. It's all-or-nothing."
The Fix That Should've Been There: Multi-layer feature flags for AI.
if (!featureFlags.aiEnabled) {
return fallbackBehavior(); // Manual mode
}
Use: Emergency disable
Control: PM, on-call engineer
Response Time: Under 2 minutes
const rolloutPercent = featureFlags.aiRolloutPercent; // 0, 10, 50, 100
if (userHash % 100 < rolloutPercent) {
return getAISuggestion();
} else {
return fallbackBehavior();
}
Use: Gradual rollout (10% → 50% → 100%)
Control: PM
Response Time: 5 minutes
const minConfidence = featureFlags.aiMinConfidence; // 0.7, 0.8, 0.9
if (aiConfidence >= minConfidence) {
return aiSuggestion;
} else {
return null; // Don't show low-confidence predictions
}
Use: Reduce false positives without full disable
Control: PM, data scientist
Response Time: 5 minutes
const modelVersion = featureFlags.aiModelVersion; // "v2.3" or "v2.4"
const model = loadModel(modelVersion);
Use: A/B test new models, instant rollback
Control: ML engineer, PM
Response Time: 10 minutes

Feature: AI suggests relevant case law
Rollout Plan:
Week 1: Launch to 10% of users
aiEnabled = true, rolloutPercent = 10
Week 1 (Day 3): Raise confidence threshold
minConfidence = 0.7 → 0.8
Week 2: Expand to 50%
rolloutPercent = 50
Week 3: Full rollout
rolloutPercent = 100
What If We'd Gone 0→100% on Day 1?
Phase 1: Internal Alpha (1% or 100 users)
Phase 2: Beta (10%)
Phase 3: Majority (50%)
Phase 4: General Availability (100%)
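One way to run the phase plan above is to encode each phase as a flag payload, so advancing a phase is a config change rather than a redeploy. A sketch under that assumption; the flag names follow the document, but the per-phase values and `flagsForPhase` helper are illustrative:

```javascript
// Flag values for each rollout phase. Moving between phases
// means updating the flag store, not shipping new code.
const phases = {
  alpha:    { aiEnabled: true, aiRolloutPercent: 1,   aiMinConfidence: 0.7 },
  beta:     { aiEnabled: true, aiRolloutPercent: 10,  aiMinConfidence: 0.8 },
  majority: { aiEnabled: true, aiRolloutPercent: 50,  aiMinConfidence: 0.8 },
  ga:       { aiEnabled: true, aiRolloutPercent: 100, aiMinConfidence: 0.8 },
};

function flagsForPhase(name) {
  return phases[name];
}
```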
Stopping Criteria (rollback if any):
User reports: "AI is often wrong"
├─ Check: What's the false positive rate?
│  ├─ FP rate <5% → Not a model issue (user expectation calibration)
│  └─ FP rate >10% → Model issue
│     └─ Action: Raise minConfidence (0.7 → 0.8)
│        ├─ FP rate drops to <5% → Keep new threshold
│        └─ FP rate still high → Rollback to previous model
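The triage tree above can be sketched as a small function. The thresholds (5%, 10%) and the 0.1 threshold bump come from the tree; the function name and return shape are assumptions:

```javascript
// Decide what to do when users report "AI is often wrong",
// given the measured false-positive rate and current threshold.
function triageFalsePositives(fpRate, minConfidence) {
  if (fpRate < 0.05) {
    // Not a model issue -- calibrate user expectations instead.
    return { action: "none", minConfidence };
  }
  if (fpRate > 0.10) {
    // Model issue: raise the threshold first. If the FP rate is
    // still high afterward, roll back to the previous model.
    return { action: "raise-threshold", minConfidence: minConfidence + 0.1 };
  }
  // In between: watch before touching anything.
  return { action: "monitor", minConfidence };
}
```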

If you're missing any, you're flying blind.
Scenario: New model (v2.4) claims 3% accuracy improvement over v2.3.
Bad Approach: Deploy v2.4 to 100%, hope it works.
Good Approach: A/B test for 2 weeks.
const userCohort = assignCohort(userId); // "control" or "treatment"
if (userCohort === "treatment") {
model = loadModel("v2.4");
} else {
model = loadModel("v2.3");
}
Measure:
Decision Criteria:
Timeline: 2 weeks (sufficient sample size for statistical significance).
Problem: Error rate spikes overnight (you're asleep). By morning, 500 users affected.
Solution: Auto-rollback trigger.
// Monitoring job runs every 5 minutes
if (errorRate > 2 * baseline) {
featureFlags.aiEnabled = false; // Auto-disable
alertPM("AI auto-disabled due to error spike");
}
Why This Works: 5-minute detection + instant disable = max 5 users affected (vs. 500).
Tradeoff: False positives (auto-disable when not needed) → PM re-enables after checking.
Verdict: Better to auto-disable and check than to let errors compound.
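The monitoring check above can be factored into a testable function. The 2x-over-baseline trigger comes from the snippet; the function signature, the injected flag store, and the alert callback are assumptions for the sketch:

```javascript
// Run every 5 minutes by a monitoring job: compare the recent
// error rate against the baseline and auto-disable on a 2x spike.
function checkErrorSpike(recentErrorRate, baselineErrorRate, flags, alertPM) {
  if (recentErrorRate > 2 * baselineErrorRate) {
    flags.aiEnabled = false; // instant kill switch, no deploy needed
    alertPM("AI auto-disabled due to error spike");
    return true;
  }
  return false;
}
```

Injecting the flag store and alert function keeps the trigger logic unit-testable, so the rollback path can be verified before the 3 AM incident that needs it.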

Mistake 1: No Rollout Percentage
Mistake 2: Hardcoded Confidence Threshold
Mistake 3: No Model Version Control
Alex Welcing is a Senior AI Product Manager in New York who deploys AI features with 4-layer feature flags. His rollouts are gradual, his rollbacks are instant, and his incidents are rare.