DOI: 10.1145/3404835.3462892

Counterfactual Reward Modification for Streaming Recommendation with Delayed Feedback

Published: 11 July 2021

Abstract

User feedback can be delayed in many streaming recommendation scenarios. For example, the feedback on a recommended coupon consists of immediate feedback on the click event and delayed feedback on the resultant conversion. Delayed feedback poses the challenge of training recommendation models on instances with incomplete labels. In real products the challenge becomes even more severe, since streaming recommendation models need to be retrained very frequently and the training instances have to be collected over very short time scales. Existing approaches either simply ignore the unobserved feedback or heuristically adjust the feedback on a static instance set, which biases the training data and hurts the accuracy of the learned recommenders. In this paper, we propose a novel and theoretically sound counterfactual approach to adjusting the user feedback and learning the recommendation models, called CBDF (Counterfactual Bandit with Delayed Feedback). CBDF formulates streaming recommendation with delayed feedback as a sequential decision-making problem and models it with a batched bandit. To deal with delayed feedback, at each iteration (episode) a counterfactual importance sampling model is employed to re-weight the original feedback and generate modified rewards. Based on the modified rewards, a batched bandit is learned for conducting online recommendation in the next iteration. Theoretical analysis shows that the modified rewards are statistically unbiased and that the learned bandit policy enjoys a sub-linear regret bound. Experimental results demonstrate that CBDF outperforms state-of-the-art baselines on a synthetic dataset, the Criteo dataset, and a dataset from Tencent's WeChat app.
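To make the re-weighting idea concrete, below is a minimal sketch of an importance-sampling correction for delayed conversions at the end of an episode. This is not the authors' exact CBDF formulation: the exponential delay model, the function name, and the parameters are illustrative assumptions introduced here only to show how observed rewards can be re-weighted so that they are unbiased estimates of the eventual rewards.

```python
import numpy as np

def reweight_delayed_rewards(observed_reward, elapsed_time, delay_rate=1.0):
    """Illustrative importance-sampling correction for delayed feedback.

    observed_reward[i] in {0, 1}: whether a conversion has been observed for
        instance i by the end of the current episode.
    elapsed_time[i]: time between the recommendation and the end of the episode.
    delay_rate: rate of an assumed exponential conversion-delay distribution.
    """
    # Probability that an eventual conversion would already be visible within
    # the observation window (under the assumed exponential delay model).
    p_observed = 1.0 - np.exp(-delay_rate * np.asarray(elapsed_time, dtype=float))
    # Inverse-propensity style re-weighting: observed positives are up-weighted
    # so that, in expectation, the modified reward equals the eventual reward.
    return np.asarray(observed_reward, dtype=float) / np.clip(p_observed, 1e-6, None)

# Example: a conversion seen shortly after the recommendation is up-weighted
# more strongly than one that had a long observation window.
modified = reweight_delayed_rewards([1, 0, 1], elapsed_time=[0.5, 2.0, 4.0])
```

In a batched-bandit setting such as the one described above, the policy for the next episode would then be updated with these modified rewards in place of the raw, possibly incomplete, observed ones.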

Supplementary Material

MP4 File (video-453.mp4)
Presentation video.

Information & Contributors

Information

Published In

SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2021
2998 pages
ISBN:9781450380379
DOI:10.1145/3404835

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. batched bandit
  2. counterfactual learning
  3. delayed feedback

Qualifiers

  • Research-article

Funding Sources

  • Tencent WeChat Rhino-Bird Focused Research Program
  • National Natural Science Foundation of China
  • National Key R&D Program of China

Conference

SIGIR '21

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%
