DOI: 10.1145/3038912.3052664

Using the Delay in a Treatment Effect to Improve Sensitivity and Preserve Directionality of Engagement Metrics in A/B Experiments

Published: 03 April 2017

Abstract

State-of-the-art user engagement metrics (such as sessions-per-user) are widely used by modern Internet companies to evaluate ongoing updates of their web services via A/B testing. These metrics are predictive of companies' long-term goals, but this very property comes at a price: users learn an evaluated treatment slowly, which delays the treatment effect. The delay, in turn, lowers the sensitivity of the metrics and forces A/B experiments to run longer or to draw a larger share of users from the limited experimentation traffic. In this paper, we study how the delay inherent in user learning can be exploited to improve the sensitivity of several popular metrics of user loyalty and activity. We consider both novel and previously known modifications of these metrics, including different methods of quantifying a trend in a metric's time series and of delaying its calculation. These modifications are analyzed with respect to their sensitivity and directionality on a large set of A/B tests run on real users of Yandex. We find that it is mostly loyalty metrics that benefit from the considered modifications, and we identify modifications that both increase the sensitivity of the source metric and remain consistent with the sign of its average treatment effect.
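
Two ideas in the abstract lend themselves to a concrete illustration: delaying a metric's calculation (skipping the first days of an experiment so that user learning has time to take effect) and quantifying a trend in a metric's per-user time series. Below is a minimal sketch of what such modifications could look like for a sessions-per-user metric; the function names, the per-user daily-count matrix, and the use of Welch's t-test for the sensitivity check are illustrative assumptions, not the paper's actual definitions.

    # Illustrative sketch (assumptions, not the paper's definitions): delayed and
    # trend-based variants of a per-user engagement metric, compared by the
    # p-value of Welch's two-sample t-test between control and treatment.
    import numpy as np
    from scipy import stats

    def delayed_metric(daily_sessions, skip_days):
        # Average sessions per day, computed only after `skip_days` days have
        # passed, giving users time to learn the treatment before measurement.
        return daily_sessions[:, skip_days:].mean(axis=1)

    def trend_metric(daily_sessions):
        # Per-user slope of a least-squares linear fit to the daily series:
        # one way to quantify a trend caused by a delayed treatment effect.
        n_days = daily_sessions.shape[1]
        t = np.arange(n_days)
        t_dev = t - t.mean()
        y_dev = daily_sessions - daily_sessions.mean(axis=1, keepdims=True)
        return (y_dev @ t_dev) / (t_dev @ t_dev)

    def compare(control, treatment):
        # Average treatment effect and t-test p-value; on the same experiment,
        # a smaller p-value indicates a more sensitive metric.
        _, p_value = stats.ttest_ind(treatment, control, equal_var=False)
        return treatment.mean() - control.mean(), p_value

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        days = np.arange(14)
        # Synthetic 14-day experiment in which the treatment effect grows over time.
        control = rng.poisson(3.0, size=(5000, 14)).astype(float)
        treatment = rng.poisson(3.0 + 0.02 * days, size=(5000, 14)).astype(float)
        for name, metric in [("plain mean", lambda x: x.mean(axis=1)),
                             ("delayed (skip 7 days)", lambda x: delayed_metric(x, 7)),
                             ("trend (slope)", trend_metric)]:
            effect, p = compare(metric(control), metric(treatment))
            print(f"{name:>22}: effect={effect:+.4f}  p={p:.4f}")

On such synthetic data, the delayed and trend variants tend to produce smaller p-values than the plain mean while keeping the same sign of the effect, which is the combination of sensitivity and directionality that the paper evaluates on real Yandex A/B tests.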


          Published In

          WWW '17: Proceedings of the 26th International Conference on World Wide Web
          April 2017
          1678 pages
          ISBN:9781450349130

          Sponsors

          • IW3C2: International World Wide Web Conference Committee

          Publisher

          International World Wide Web Conferences Steering Committee

          Republic and Canton of Geneva, Switzerland

          Publication History

          Published: 03 April 2017

          Author Tags

          1. a/b test
          2. delay
          3. dft
          4. directionality
          5. online controlled experiment
          6. quality metric
          7. sensitivity
          8. time series
          9. trend
          10. user engagement

          Qualifiers

          • Research-article

          Conference

          WWW '17
          Sponsor:
          • IW3C2

          Acceptance Rates

          WWW '17 Paper Acceptance Rate 164 of 966 submissions, 17%;
          Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

          Cited By

          • (2024) A/B testing. Journal of Systems and Software, 211(C). https://doi.org/10.1016/j.jss.2024.112011
          • (2022) Using Survival Models to Estimate User Engagement in Online Experiments. Proceedings of the ACM Web Conference 2022, pages 3186-3195. https://doi.org/10.1145/3485447.3512038
          • (2020) Prediction of Hourly Earnings and Completion Time on a Crowdsourcing Platform. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3172-3182. https://doi.org/10.1145/3394486.3403369
          • (2019) Top Challenges from the first Practical Online Controlled Experiments Summit. ACM SIGKDD Explorations Newsletter, 21(1):20-35. https://doi.org/10.1145/3331651.3331655
          • (2019) Effective Online Evaluation for Web Search. Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1399-1400. https://doi.org/10.1145/3331184.3331378
          • (2018) Consistent Transformation of Ratio Metrics for Efficient Online Controlled Experiments. Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 55-63. https://doi.org/10.1145/3159652.3159699
          • (2017) Machine Learning Powered A/B Testing. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, page 1365. https://doi.org/10.1145/3077136.3096468
          • Estimation of Average Treatment Effect on Residuals: Bias Derivation. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.3953160
