research-article

Trustworthy Analysis of Online A/B Tests: Pitfalls, challenges and solutions

Authors:

Jonathan Litz

WSDM '17: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining

Pages 641 - 649

https://doi.org/10.1145/3018661.3018677

Published: 02 February 2017

Abstract

A/B tests (or randomized controlled experiments) play an integral role in the research and development cycles of technology companies. As in classic randomized experiments (e.g., clinical trials), the underlying statistical analysis of A/B tests is based on assuming the randomization unit is independent and identically distributed (i.i.d.). However, the randomization mechanisms utilized in online A/B tests can be quite complex and may render this assumption invalid. Analysis that unjustifiably relies on this assumption can yield untrustworthy results and lead to incorrect conclusions. Motivated by challenging problems arising from actual online experiments, we propose a new method of variance estimation that relies only on practically plausible assumptions, is directly applicable to a wide range of randomization mechanisms, and can be implemented easily. We examine its performance and illustrate its advantages over two commonly used methods of variance estimation on both simulated and empirical datasets. Our results lead to a deeper understanding of the conditions under which the randomization unit can be treated as i.i.d. In particular, we show that for purposes of variance estimation, the randomization unit can be approximated as i.i.d. when the individual treatment effect variation is small; however, this approximation can lead to variance under-estimation when the individual treatment effect variation is large.
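The under-estimation risk described in the abstract can be illustrated with a small simulation. The sketch below is not the variance estimator proposed in the paper. It assumes a hypothetical setting in which users are the randomization unit, page views are the analysis unit, and the metric is a click-through rate. It compares a naive estimate that treats every page view as i.i.d. with a standard delta-method estimate that aggregates to the user level first. The function names, the simulated click model, and the choice of the delta method are all assumptions made for illustration.

```python
import random
from statistics import mean, variance


def naive_iid_variance(clicks, views):
    """Variance of sum(clicks)/sum(views), wrongly treating each page view as i.i.d."""
    total_views = sum(views)
    rate = sum(clicks) / total_views
    return rate * (1.0 - rate) / total_views


def delta_method_variance(clicks, views):
    """Delta-method variance of the same ratio, treating users (not views) as i.i.d.

    Uses the linearization Var(R) ~= Var(S_i - R * K_i) / (n * mean(K)^2),
    where S_i and K_i are the clicks and page views of user i.
    """
    n_users = len(views)
    rate = sum(clicks) / sum(views)
    residuals = [c - rate * k for c, k in zip(clicks, views)]
    return variance(residuals) / (n_users * mean(views) ** 2)


def simulate_users(n_users, seed=0):
    """Hypothetical data: each user has a personal click probability and 1 to 20 views."""
    rng = random.Random(seed)
    clicks, views = [], []
    for _ in range(n_users):
        p = rng.random()          # user-specific click probability (assumed model)
        k = rng.randint(1, 20)    # number of page views for this user
        clicks.append(sum(rng.random() < p for _ in range(k)))
        views.append(k)
    return clicks, views


if __name__ == "__main__":
    clicks, views = simulate_users(2000, seed=1)
    print("naive i.i.d. variance :", naive_iid_variance(clicks, views))
    print("delta-method variance :", delta_method_variance(clicks, views))
```

Because page views from the same user share a click probability in this simulated model, they are positively correlated, and the naive estimate comes out noticeably smaller than the user-level one. When users are homogeneous, the two estimates are much closer.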

Index Terms

Trustworthy Analysis of Online A/B Tests: Pitfalls, challenges and solutions
1. General and reference
  1. Cross-computing tools and techniques
2. Mathematics of computing
  1. Probability and statistics
    1. Probabilistic inference problems
      1. Hypothesis testing and confidence interval computation

Published In

WSDM '17: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining

February 2017

868 pages

ISBN:9781450346757

DOI:10.1145/3018661

General Chairs: Maarten de Rijke (University of Amsterdam), Milad Shokouhi (Microsoft)

Program Chairs: Andrew Tomkins (Google), Min Zhang (Tsinghua University)

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

WSDM 2017

Sponsor:

WSDM 2017: Tenth ACM International Conference on Web Search and Data Mining

February 6 - 10, 2017

Cambridge, United Kingdom

Acceptance Rates

WSDM '17 Paper Acceptance Rate 80 of 505 submissions, 16%;

Overall Acceptance Rate 498 of 2,863 submissions, 17%
