A dirty dozen: twelve common metric interpretation pitfalls in online controlled experiments
P Dmitriev, S Gupta, DW Kim, G Vaz - Proceedings of the 23rd ACM …, 2017 - dl.acm.org
Proceedings of the 23rd ACM SIGKDD international conference on knowledge …, 2017 • dl.acm.org
Online controlled experiments (e.g., A/B tests) are now regularly used to guide product development and accelerate innovation in software. Product ideas are evaluated as scientific hypotheses, and tested in web sites, mobile applications, desktop applications, services, and operating systems. One of the key challenges for organizations that run controlled experiments is to come up with the right set of metrics [1] [2] [3]. Having good metrics, however, is not enough.
In our experience of running thousands of experiments with many teams across Microsoft, we observed again and again how incorrect interpretations of metric movements may lead to wrong conclusions about the experiment's outcome, which, if deployed, could hurt the business by millions of dollars. Inspired by Steven Goodman's twelve p-value misconceptions [4], in this paper we share twelve common metric interpretation pitfalls that we observed repeatedly in our experiments. We illustrate each pitfall with a puzzling example from a real experiment, and describe processes, metric design principles, and guidelines that can be used to detect and avoid the pitfall.
With this paper, we aim to increase the experimenters' awareness of metric interpretation issues, leading to improved quality and trustworthiness of experiment results and better data-driven decisions.