Case Study: Scaling an AI Recommendation Engine to 100M Users

Personalization is no longer a "nice to have"—it is the primary driver of retention for modern digital platforms. This case study details the journey of re-architecting a legacy recommendation system for a media platform with 100 million monthly active users (MAU), moving from simple heuristics to a state-of-the-art deep learning pipeline.

Executive Summary

  • The Challenge: A legacy rule-based system was failing to scale, resulting in stagnant engagement metrics and high churn among new users.
  • The Solution: We built a hybrid "Two-Tower" recommendation architecture capable of processing billions of events in real time.
  • The Outcome: 42% increase in daily engagement, 15% boost in Day-30 retention, and a 35% lift in Click-Through Rate (CTR).

The Problem Space

Our legacy system relied on "Collaborative Filtering" (Matrix Factorization) calculated once every 24 hours.

  1. Staleness: If a user started watching a new genre in the morning, their recommendations wouldn't update until the next day.
  2. Scalability: The matrix factorization job was taking 18 hours to run, threatening to exceed the 24-hour window.
  3. Latency: The serving layer struggled to respond in under 200ms during peak traffic.

Goal: Build a real-time system with <50ms latency at P99.

Solution Architecture

We adopted a classic Retrieval & Ranking funnel, common in high-scale systems like YouTube and TikTok.

1. Data Pipeline (The Nervous System)

We moved from batch processing to streaming.

  • Ingestion: Apache Kafka captures clickstream data (clicks, likes, dwell time).
  • Processing: Apache Flink aggregates features in real time (e.g., "User X just watched 3 sci-fi videos in the last 10 minutes"); a minimal sketch of this step follows the list.
  • Feature Store: Redis stores these real-time user features for low-latency access.
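
To make the aggregation step concrete, here is a minimal Python sketch of the windowed feature logic. The event shape, window size, and genre-count feature are illustrative assumptions; in production this logic runs inside Flink, fed by Kafka, with the output written to Redis.

```python
# Minimal sketch of the windowed feature aggregation described above.
# Assumptions: event shape, a 10-minute window, and genre counts as the feature.
# In production this runs inside Flink and the result lands in Redis.
from collections import defaultdict, deque
import time

WINDOW_SECONDS = 600  # 10-minute sliding window (illustrative)

class GenreWindowAggregator:
    """Counts how many items of each genre a user consumed in the last window."""

    def __init__(self):
        self.events = defaultdict(deque)  # user_id -> deque of (timestamp, genre)

    def on_event(self, user_id, genre, ts=None):
        ts = ts if ts is not None else time.time()
        window = self.events[user_id]
        window.append((ts, genre))
        # Evict events that have fallen out of the window.
        while window and ts - window[0][0] > WINDOW_SECONDS:
            window.popleft()
        # The feature the serving layer would read back from Redis, e.g. {"sci-fi": 3}.
        counts = {}
        for _, g in window:
            counts[g] = counts.get(g, 0) + 1
        return counts

agg = GenreWindowAggregator()
for genre in ["sci-fi", "sci-fi", "sci-fi"]:
    features = agg.on_event("user_x", genre)
print(features)  # {'sci-fi': 3} -> "watched 3 sci-fi videos in the last 10 minutes"
```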

2. Candidate Generation (Retrieval)

The goal: Narrow down 10 million items to 500 candidates.

  • Architecture: A "Two-Tower" Neural Network. One tower encodes User features, the other encodes Item features; the dot product of the two embeddings represents affinity (sketched after this list).
  • Serving: We used Milvus (a vector database) for Approximate Nearest Neighbor (ANN) search. This allowed us to retrieve relevant items in <10ms.
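
Here is a minimal PyTorch sketch of the two-tower idea. The layer sizes, feature dimensions, and toy item corpus are assumptions, not the production configuration.

```python
# Minimal two-tower sketch. Dimensions and the toy corpus are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    def __init__(self, in_dim, embed_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, x):
        # L2-normalize so the dot product behaves like cosine similarity.
        return F.normalize(self.net(x), dim=-1)

user_tower = Tower(in_dim=32)   # encodes user features (dimension assumed)
item_tower = Tower(in_dim=48)   # encodes item features (dimension assumed)

users = torch.randn(4, 32)      # a batch of user feature vectors
items = torch.randn(1000, 48)   # a tiny stand-in for the 10M-item corpus

# Affinity = dot product between user and item embeddings.
scores = user_tower(users) @ item_tower(items).T   # shape (4, 1000)
candidates = scores.topk(k=500, dim=-1).indices    # top 500 candidates per user

# At serving time the item embeddings are precomputed and indexed in Milvus;
# only the user embedding is computed online, and ANN search replaces the
# exhaustive dot product above.
```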

3. Ranking Layer (Precision)

The goal: Sort the 500 candidates to find the top 10 to show the user.

  • Model: A Deep Learning Recommendation Model (DLRM) that captures complex feature interactions (e.g., "User likes Sci-Fi, but only on weekends"); a simplified sketch follows this list.
  • Optimization: We served the model with NVIDIA Triton Inference Server, quantizing it to FP16 (half precision) to speed up inference without losing accuracy.
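
The sketch below illustrates the ranking step with a small feature-crossing MLP scored over the retrieved candidates in FP16. The feature layout and layer sizes are assumptions; the real model is a full DLRM exported to Triton rather than called in-process like this.

```python
# Simplified ranking sketch: a small MLP scores each retrieved candidate
# against the user and context. Feature layout and layer sizes are assumptions.
import torch
import torch.nn as nn

class Ranker(nn.Module):
    def __init__(self, user_dim=64, item_dim=64, ctx_dim=8):
        super().__init__()
        # Concatenating user, item, and context features lets the MLP learn
        # cross-feature effects such as "sci-fi, but only on weekends".
        self.mlp = nn.Sequential(
            nn.Linear(user_dim + item_dim + ctx_dim, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, user, items, ctx):
        n = items.shape[0]
        x = torch.cat([user.expand(n, -1), items, ctx.expand(n, -1)], dim=-1)
        return self.mlp(x).squeeze(-1)  # one relevance score per candidate

# FP16 inference mirrors the Triton setup; fall back to FP32 on CPU, where
# half-precision kernels may not be available.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
ranker = Ranker().to(device=device, dtype=dtype).eval()

user = torch.randn(1, 64, device=device, dtype=dtype)     # user features
items = torch.randn(500, 64, device=device, dtype=dtype)  # 500 retrieved candidates
ctx = torch.randn(1, 8, device=device, dtype=dtype)        # e.g. time of day, device type

with torch.no_grad():
    scores = ranker(user, items, ctx)
top10 = scores.topk(10).indices  # the 10 items actually shown to the user
```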

Key Challenges & Solutions

The Cold Start Problem

New users have no history. Our collaborative filtering failed them.

  • Solution: We implemented a Multi-Armed Bandit algorithm for new users. It explores different popular categories (Exploration) while slowly converging on what the user clicks (Exploitation). This improved new user activation by 20%.
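
For illustration, an epsilon-greedy bandit over content categories might look like the sketch below. The category list, epsilon, and click-based reward are assumptions; the production system could equally use a different variant such as Thompson sampling or UCB.

```python
# Illustrative epsilon-greedy bandit over content categories for cold-start users.
import random

class EpsilonGreedyBandit:
    def __init__(self, arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {arm: 0 for arm in arms}    # times each category was shown
        self.values = {arm: 0.0 for arm in arms}  # running mean click rate

    def select(self):
        if random.random() < self.epsilon:                # Exploration
            return random.choice(list(self.counts))
        return max(self.values, key=self.values.get)      # Exploitation

    def update(self, arm, reward):
        # Incremental mean update after observing a click (1.0) or a skip (0.0).
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

bandit = EpsilonGreedyBandit(["sci-fi", "comedy", "documentary", "sports"])
category = bandit.select()           # category surfaced to the new user
bandit.update(category, reward=1.0)  # the user clicked
```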

Bias & Echo Chambers

The model became too good at giving users what they wanted, trapping them in feedback loops.

  • Solution: We added a Diversity Re-ranking layer. If the top 10 results were all from the same category, the system would force-inject highly-rated items from adjacent categories to encourage discovery.
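
A simplified version of that re-ranking rule, with the dominance threshold and the pool of adjacent-category items as assumptions:

```python
# Simplified diversity re-ranking: if one category dominates the slate,
# swap in highly-rated items from adjacent categories.
from collections import Counter

def rerank_for_diversity(top_items, adjacent_pool, max_per_category=7, slate_size=10):
    """Replace the lowest-ranked items of a dominant category with items
    drawn from adjacent categories to encourage discovery."""
    slate = list(top_items[:slate_size])
    counts = Counter(item["category"] for item in slate)
    dominant, n = counts.most_common(1)[0]
    if n <= max_per_category:
        return slate  # already diverse enough

    injections = [it for it in adjacent_pool if it["category"] != dominant]
    for idx in range(slate_size - 1, -1, -1):
        if n <= max_per_category or not injections:
            break
        if slate[idx]["category"] == dominant:
            slate[idx] = injections.pop(0)  # force-inject an adjacent-category item
            n -= 1
    return slate

slate = [{"id": i, "category": "sci-fi"} for i in range(10)]
pool = [{"id": 100 + i, "category": "documentary"} for i in range(5)]
print(rerank_for_diversity(slate, pool))  # 7 sci-fi items, 3 injected documentaries
```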

Results & Impact

The migration took 9 months, but the ROI was immediate.

  • Engagement: Total time spent on platform increased by 42%.
  • Latency: P99 latency dropped from 200ms to 45ms, despite the model being 10x more complex.
  • Cost: By optimizing our vector search and using GPU inference, we reduced infrastructure spend per request by 30%.

Lessons Learned

  1. Data > Models: The biggest gains didn't come from tweaking the neural network architecture, but from engineering better real-time features (like "time of day" or "device type").
  2. Progressive Delivery: We didn't flip a switch. We used "Shadow Deployment" (running the new model in the background) to verify performance, then slowly ramped up traffic via A/B testing.
  3. Observability is Key: Debugging a deep learning model is hard. We invested heavily in monitoring "Feature Drift" to know when our model was becoming stale.
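
As an illustration of the drift checks mentioned in point 3, the Population Stability Index (PSI) is one common way to flag when a feature's live distribution has moved away from its training-time distribution. The sample data and thresholds below are illustrative, not figures from this project.

```python
# PSI between a feature's training-time distribution and its live distribution.
import numpy as np

def psi(expected, actual, bins=10):
    """PSI between a baseline sample and a live sample of one feature."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf   # catch out-of-range live values
    expected_frac = np.histogram(expected, edges)[0] / len(expected)
    actual_frac = np.histogram(actual, edges)[0] / len(actual)
    expected_frac = np.clip(expected_frac, 1e-6, None)  # avoid log(0)
    actual_frac = np.clip(actual_frac, 1e-6, None)
    return float(np.sum((actual_frac - expected_frac) * np.log(actual_frac / expected_frac)))

baseline = np.random.normal(0.0, 1.0, 10_000)  # feature values at training time
live = np.random.normal(0.3, 1.0, 10_000)      # feature values in production
print(psi(baseline, live))  # values above ~0.25 are commonly treated as serious drift
```
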
Alex Welcing
AI Product Expert