DOI: 10.1145/3404835.3462892

Counterfactual Reward Modification for Streaming Recommendation with Delayed Feedback

Published: 11 July 2021

Abstract

User feedback can be delayed in many streaming recommendation scenarios. For example, the feedback on a recommended coupon consists of immediate feedback on the click event and delayed feedback on the resultant conversion. Delayed feedback poses the challenge of training recommendation models on instances with incomplete labels. In real products the challenge becomes even more severe, since streaming recommendation models need to be retrained very frequently and the training instances have to be collected over very short time scales. Existing approaches either simply ignore the unobserved feedback or heuristically adjust the feedback on a static instance set, which biases the training data and hurts the accuracy of the learned recommenders. In this paper, we propose a novel and theoretically sound counterfactual approach to adjusting the user feedback and learning the recommendation models, called CBDF (Counterfactual Bandit with Delayed Feedback). CBDF formulates streaming recommendation with delayed feedback as a sequential decision-making problem and models it with a batched bandit. To deal with delayed feedback, at each iteration (episode) a counterfactual importance sampling model is employed to re-weight the original feedback and generate modified rewards. Based on the modified rewards, a batched bandit is learned for conducting online recommendation in the next iteration. Theoretical analysis shows that the modified rewards are statistically unbiased and that the learned bandit policy enjoys a sub-linear regret bound. Experimental results demonstrate that CBDF outperforms state-of-the-art baselines on a synthetic dataset, the Criteo dataset, and a dataset from Tencent's WeChat app.
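To make the re-weighting idea concrete, below is a minimal sketch of an importance-sampling correction for delayed conversions at the end of an episode. This is not the authors' exact CBDF formulation: the exponential delay model, the function name, and the parameters are illustrative assumptions introduced here only to show how observed rewards can be re-weighted so that they are unbiased estimates of the eventual rewards.

```python
import numpy as np

def reweight_delayed_rewards(observed_reward, elapsed_time, delay_rate=1.0):
    """Illustrative importance-sampling correction for delayed feedback.

    observed_reward[i] in {0, 1}: whether a conversion has been observed for
        instance i by the end of the current episode.
    elapsed_time[i]: time between the recommendation and the end of the episode.
    delay_rate: rate of an assumed exponential conversion-delay distribution.
    """
    # Probability that an eventual conversion would already be visible within
    # the observation window (under the assumed exponential delay model).
    p_observed = 1.0 - np.exp(-delay_rate * np.asarray(elapsed_time, dtype=float))
    # Inverse-propensity style re-weighting: observed positives are up-weighted
    # so that, in expectation, the modified reward equals the eventual reward.
    return np.asarray(observed_reward, dtype=float) / np.clip(p_observed, 1e-6, None)

# Example: a conversion seen shortly after the recommendation is up-weighted
# more strongly than one that had a long observation window.
modified = reweight_delayed_rewards([1, 0, 1], elapsed_time=[0.5, 2.0, 4.0])
```

In a batched-bandit setting such as the one described above, the policy for the next episode would then be updated with these modified rewards in place of the raw, possibly incomplete, observed ones.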

Supplementary Material

MP4 File (video-453.mp4)
Presentation video.

Information & Contributors

Information

Published In

SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2021
2998 pages
ISBN:9781450380379
DOI:10.1145/3404835

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. batched bandit
  2. counterfactual learning
  3. delayed feedback

Qualifiers

  • Research-article

Funding Sources

  • Tencent WeChat Rhino-Bird Focused Research Program
  • National Natural Science Foundation of China
  • National Key R&D Program of China

Conference

SIGIR '21

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%
