
General Counsel: "We have 500,000 documents to review for this litigation. At $400/hour for contract attorneys, that's $2M. Can your AI reduce that?"
You (PM): "Our AI achieves 90% recall—it'll find 90% of relevant documents."
GC: "What about the other 10%? If we miss a smoking-gun email, we lose the case. Not acceptable."
You: "We can tune for higher recall, but precision drops—you'll review more irrelevant documents."
GC: "So what's the right tradeoff?"
You: Googles "precision recall tradeoff legal". Finds TREC Legal Track. Spends 3 hours reading. Returns with an answer.
TREC = Text Retrieval Conference (NIST-sponsored, 1992-present)
Legal Track (2006-2011): Benchmark competition for eDiscovery AI systems.
The Task: Given a litigation topic (e.g., "Find all emails about Project X price-fixing"), rank 500,000+ documents by relevance.
The Metrics: Precision, recall, F1, and cost-effectiveness (how much money does the AI save vs. manual review?).
Why PMs Care: TREC Legal Track spent six years (2006-2011) working the precision-recall tradeoff for a high-stakes enterprise use case. The lessons generalize to any AI feature where flagged items must be reviewed by an expensive human and missed items carry a large downside.

For the eDiscovery case above:
Review Cost = # of flagged docs × review cost per doc (~$4/doc at $400/hour, roughly 100 docs reviewed per hour)
Miss Cost = # of missed docs × $10,000 per miss (court sanctions, case loss, reputational harm)
Baseline Cost = 500,000 docs × ~$4/doc = $2M (full manual review)
The Optimization Problem: Minimize Review Cost + Miss Cost while keeping the total under Baseline Cost.
Bad Metric:
Our AI achieves:
- Precision: 85%
- Recall: 90%
- F1: 0.87
So What? Is that good enough to deploy? Should we tune for higher recall? The metrics don't tell you.
Good Metric:
Scenario 1: Tune for Recall (95%)
- Review 100,000 docs (20% of corpus)
- Find 950 of 1,000 relevant docs (miss 50)
- Review Cost: $400k
- Miss Cost: $500k (50 × $10k)
- Total Cost: $900k (saves $1.1M vs. manual review)

Scenario 2: Tune for Precision (90%)
- Review 20,000 docs (4% of corpus)
- Find 800 of 1,000 relevant docs (miss 200)
- Review Cost: $80k
- Miss Cost: $2M (200 × $10k)
- Total Cost: $2.08M (WORSE than manual review!)

Scenario 3: Balanced (Precision 85%, Recall 90%)
- Review 50,000 docs (10% of corpus)
- Find 900 of 1,000 relevant docs (miss 100)
- Review Cost: $200k
- Miss Cost: $1M (100 × $10k)
- Total Cost: $1.2M (saves $800k vs. manual review)
Which scenario do you ship? Depends on the miss cost your stakeholder will tolerate.
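To make the arithmetic reproducible, here is a minimal Python sketch of the three scenarios. The unit costs are the assumptions stated above (~$4 to review a flagged doc, $10,000 in expected exposure per missed doc), not measured figures:

```python
# Cost model assumed above: ~$4 to review one flagged doc (contract attorney at
# $400/hour, ~100 docs/hour) and $10,000 in expected exposure per missed doc.
REVIEW_COST_PER_DOC = 4.0
MISS_COST_PER_DOC = 10_000.0
BASELINE_COST = 500_000 * REVIEW_COST_PER_DOC   # full manual review: $2M

def total_cost(docs_reviewed, relevant_missed):
    """Cost of reviewing everything the AI flags, plus expected cost of what it misses."""
    return docs_reviewed * REVIEW_COST_PER_DOC + relevant_missed * MISS_COST_PER_DOC

scenarios = {
    "Scenario 1: tune for recall":    (100_000, 50),
    "Scenario 2: tune for precision": (20_000, 200),
    "Scenario 3: balanced":           (50_000, 100),
}

for name, (reviewed, missed) in scenarios.items():
    cost = total_cost(reviewed, missed)
    savings = BASELINE_COST - cost
    verdict = "saves" if savings > 0 else "loses"
    print(f"{name}: total ${cost:,.0f} ({verdict} ${abs(savings):,.0f} vs. manual)")
```

Running it reproduces the table above: the precision-tuned scenario is the only one that ends up more expensive than full manual review.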
Step 1: Define the Cost Function
Ask your stakeholder:
- What does it cost to review one item the AI flags incorrectly (reviewer time × hourly rate)?
- What does it cost to miss one true positive (sanctions, lost case, financial or reputational exposure)?
- What does the fully manual baseline cost today?
Step 2: Map Precision-Recall to Costs
Use this formula:
Total Cost = (False Positives × Review Cost per Item) + (False Negatives × Miss Cost per Item)
Step 3: Find the Optimal Operating Point
Sweep the model's confidence threshold: at each threshold, estimate false positives and false negatives on a labeled validation set, compute Total Cost, and pick the threshold that minimizes it.
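A minimal sketch of that sweep, using the eDiscovery unit costs above; the scores, labels, and candidate thresholds are illustrative, not TREC data:

```python
# Minimal sketch: pick the score threshold that minimizes cost-weighted errors on a
# labeled validation set. Unit costs follow the eDiscovery example above; the scores,
# labels, and candidate thresholds are illustrative assumptions.
REVIEW_COST_PER_FP = 4.0      # cost to review one wrongly flagged doc
MISS_COST_PER_FN = 10_000.0   # expected cost of one missed relevant doc

def total_cost(scores, labels, threshold):
    """Total Cost = FP x review cost + FN x miss cost at a given flagging threshold."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp * REVIEW_COST_PER_FP + fn * MISS_COST_PER_FN

def best_threshold(scores, labels, candidates):
    """Sweep candidate thresholds and return the one with the lowest total cost."""
    return min(candidates, key=lambda t: total_cost(scores, labels, t))

# Toy validation set: model scores and true relevance labels (1 = relevant).
scores = [0.95, 0.90, 0.80, 0.65, 0.60, 0.40, 0.30, 0.10]
labels = [1,    1,    0,    1,    0,    0,    1,    0]
t = best_threshold(scores, labels, candidates=[0.2, 0.4, 0.6, 0.8])
print(f"Cost-optimal threshold: {t} (total cost ${total_cost(scores, labels, t):,.0f})")
```

With a $10k miss cost and a $4 review cost, the sweep predictably lands on a low, recall-heavy threshold: a handful of extra reviews is cheaper than a single miss.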
Step 4: Validate with Stakeholders
Show them:
- Two or three candidate operating points, each with review cost, miss cost, and total cost (like the table below)
- Savings vs. the manual baseline at each point
- The miss rate they would be accepting, in plain numbers (e.g., "about 2 missed risky clauses per month")

Use Case: AI flags risky clauses in vendor contracts (e.g., unlimited liability, auto-renewal).
Stakeholder: Legal team (15 attorneys, 200 contracts/month)
False Positive (AI flags safe clause as risky): an attorney spends ~10 minutes confirming the clause is actually safe, roughly $67 per false positive at $400/hour.
False Negative (AI misses risky clause): the risky clause survives into a signed contract, estimated at $50k in expected exposure per miss (renegotiation, liability, disputes).
Baseline (manual review): all 200 contracts/month reviewed by hand at roughly 2 attorney-hours each ($400/hour), about $160k/month.
| Precision | Recall | FP (per 200 contracts) | FN (per 200 contracts) | Review Cost | Miss Cost | Total Cost | Savings vs. Manual |
|---|---|---|---|---|---|---|---|
| 70% | 95% | 60 | 1 | $4k | $50k | $54k | $106k |
| 85% | 90% | 30 | 2 | $2k | $100k | $102k | $58k |
| 95% | 80% | 10 | 4 | $670 | $200k | $201k | -$41k |
Insight: High-precision tuning (95%) loses money because miss cost dominates. Optimal: 70% precision, 95% recall.
GC's Reaction: "I'll tolerate 60 false positives/month if we catch 95% of risky clauses. Ship it."
Tune for Recall When: missing a true positive is far more expensive than reviewing a false positive.
Examples: eDiscovery smoking-gun searches, contract risk flagging, safety and compliance screening.
Tradeoff: Users review more false positives, but tolerate it because missing a true positive is unacceptable.
Tune for Precision When: false positives waste user time and a missed item is recoverable or low-cost.
Examples: spam filtering, content recommendations, lead prioritization.
Tradeoff: Some true positives get missed, but users tolerate it because false positives waste their time.
Key Finding from TREC Legal 2009-2011:
Continuous Active Learning (CAL) cuts review cost by roughly 10x compared to traditional keyword search.
How It Works: Start with a small seed set of reviewed documents, train a ranking model, have reviewers label the highest-ranked unreviewed documents, feed those labels back into the model, and repeat until a stopping rule triggers.
Result: Reviewers reach high recall after reading only a fraction of the corpus, because the model keeps pushing the likeliest-relevant documents to the front of the review queue.
PM Takeaway: If your AI ranks items by relevance, prioritize review of high-confidence items first. Don't force users to review in random order.
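Below is a toy sketch of that ranked-review loop. The corpus is synthetic and a running mean of known-relevant features stands in for a trained classifier; it is an illustration of the prioritize-and-retrain pattern, not the TREC systems:

```python
import random

# Toy CAL-style prioritized review: each doc has one numeric feature, and docs with
# feature > 0.7 are truly relevant. A real system would train a classifier on reviewer
# labels; here the mean feature of known-relevant docs stands in as the "model".
random.seed(0)
corpus = [{"id": i, "feature": random.random()} for i in range(10_000)]
truth = {d["id"]: d["feature"] > 0.7 for d in corpus}
total_relevant = sum(truth.values())

labeled = {}   # doc id -> reviewer label
BATCH = 500

# Seed set: label a small random sample first (in practice, keyword-search hits).
for d in random.sample(corpus, 100):
    labeled[d["id"]] = truth[d["id"]]

while len(labeled) < len(corpus):
    # "Retrain": recompute the center of known-relevant docs from labels so far.
    relevant_feats = [corpus[i]["feature"] for i, y in labeled.items() if y]
    center = sum(relevant_feats) / len(relevant_feats) if relevant_feats else 0.5
    # Rank unreviewed docs by closeness to that center and review the top batch.
    unreviewed = sorted(
        (d for d in corpus if d["id"] not in labeled),
        key=lambda d: abs(d["feature"] - center),
    )
    for d in unreviewed[:BATCH]:
        labeled[d["id"]] = truth[d["id"]]   # reviewer supplies the label
    found = sum(labeled.values())
    print(f"Reviewed {len(labeled):>6} docs, found {found}/{total_relevant} relevant")
    if found >= 0.95 * total_relevant:      # demo stop; real systems apply a stopping rule
        break
```

In this toy run the loop reaches 95% of the relevant documents after reviewing only a few thousand of the 10,000 docs; random-order review would have required reading most of the corpus.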
Question: When do you stop reviewing AI-flagged items?
Bad Answer: "Review until we hit 95% recall."
Problem: You don't know when you've hit 95% recall until you review everything (defeats the purpose).
TREC Solution: Stopping rules based on diminishing returns.
Rule 1: Knee of the Curve
Stop when newly reviewed batches stop turning up relevant documents (the marginal yield has flattened).
Rule 2: Budget-Based
Stop when the agreed review budget is spent, and report the estimated recall reached at that point.
Rule 3: Risk-Based
Stop when the expected miss cost of the remaining unreviewed documents falls below the cost of continuing to review them.
PM Takeaway: Ship AI with a recommended stopping rule. Don't make users guess when to stop reviewing.
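Here is a minimal sketch of Rule 1, assuming review happens in ranked batches as above. The min_yield knob is a hypothetical parameter to tune against the stakeholder's miss cost:

```python
def should_stop(relevant_per_batch, batch_size, min_yield=0.01):
    """Rule 1 (knee of the curve): stop once the most recent batch's yield of relevant
    documents drops below min_yield (e.g., under 1% of the batch). min_yield is an
    assumption to tune against the stakeholder's miss cost."""
    if not relevant_per_batch:
        return False
    return relevant_per_batch[-1] / batch_size < min_yield

# Example: relevant documents found in each successive 1,000-doc ranked batch.
history = [410, 280, 120, 40, 9]
print(should_stop(history, batch_size=1_000))   # True: the last batch yielded under 1%
```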
Mistake 1: Optimizing F1 Without Knowing the Cost Ratio
F1 weights false positives and false negatives equally, which is only right when they cost the same. Optimize the actual cost instead (the sketch below shows how the two optima can diverge):
Total Cost = FP × C1 + FN × C2
Mistake 2: Shipping Without a Stopping Rule
Users either review everything the AI flags (no savings) or stop at an arbitrary point (unknown risk). Recommend a stopping rule with the feature.
Mistake 3: Ignoring Active Learning
Forcing users to review flagged items in arbitrary order throws away the relevance ranking. Surface high-confidence items first and feed reviewer decisions back into the model.
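A minimal sketch of Mistake 1 on toy data (the scores, labels, thresholds, and C1/C2 values are all illustrative): the F1-optimal threshold and the cost-optimal threshold can point in opposite directions.

```python
# Toy illustration: F1-optimal vs. cost-optimal threshold on the same validation data.
# C_FP/C_FN and the scores/labels below are illustrative assumptions, not TREC data.
C_FP, C_FN = 4.0, 10_000.0   # review cost per false positive, miss cost per false negative

def confusion(scores, labels, t):
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
    return tp, fp, fn

def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def cost(tp, fp, fn):
    return fp * C_FP + fn * C_FN

scores = [0.95, 0.90, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.50, 0.45, 0.40, 0.35, 0.30]
labels = [1,    1,    0,    0,    0,    0,    0,    0,    0,    0,    0,    1,    1]
thresholds = [0.85, 0.25]

best_f1 = max(thresholds, key=lambda t: f1(*confusion(scores, labels, t)))
best_cost = min(thresholds, key=lambda t: cost(*confusion(scores, labels, t)))
print(f"F1-optimal threshold: {best_f1}")      # 0.85: cleaner precision, but misses 2 positives
print(f"Cost-optimal threshold: {best_cost}")  # 0.25: 9 cheap false positives beat 2 costly misses
```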
2006: Keyword search dominates eDiscovery. AI is experimental.
2011: Technology-Assisted Review (AI + active learning) is standard practice.
2025: Courts accept TAR as defensible (no longer "experimental").
Timeline for Enterprise AI: 5-10 years from "experimental" to "industry standard."
If you're building enterprise AI in 2025, you're at the 2011 TREC moment. The companies that master precision-recall optimization now will set the industry standard for the next decade.
Alex Welcing is a Senior AI Product Manager who optimizes for cost-weighted precision-recall, not F1 scores. His features ship with stopping rules because users need to know when they've reviewed enough, not when they've reviewed everything.