Counterfactual evaluation for recommendation systems

https://news.ycombinator.com/rss Hits: 2
Summary

When I first started working on recommendation systems, I thought there was something weird about the way we did offline evaluation. First, we split customer interaction data into training and validation sets. Then, we train our recommenders on the training set before evaluating them on the validation set, usually on metrics such as recall, precision, and NDCG. This is similar to how we evaluate supervised machine learning models and doesn’t seem unusual at first glance. But don’t our recommendations change how customers click or purchase? If customers can only interact with items shown to them, why do we perform offline evaluation on static historical data? Observational vs. interventional problems It took me a while to put a finger on it but I think this is why it felt weird: We’re treating recommendations as an observational problem when it really is an interventional problem. Problems solved via supervised machine learning are usually observational problems. Given an observation such as product title, description, and image, we try to predict the product category. Our model learns P(category=phone|title=“…”, description=“…”, image=image01.jpeg). On the other hand, recommendations are an interventional problem. We want to learn how different interventions (i.e., item recommendations) lead to different outcomes (i.e., clicks, purchases). By using logged customer interaction data as labels, the observational offline evaluation approach ignores the interventional nature of recommendations. As a result, we’re not evaluating if users would click or purchase more due to our new recommendations; we’re evaluating how well the new recommendations fit logged data. Thus, what our model learns is P(view3=iphone|view1=pixel, view2=galaxy) when what we really want is P(click=True|recommend=iphone, view1=pixel, view2=galaxy). Evaluating recsys as an interventional problem The straightforward way to evaluate recommendations as an interventional problem is via A/B testing. Our in...

First seen: 2026-01-17 17:24

Last seen: 2026-01-17 18:24