research-article

Safe Exploration for Optimizing Contextual Bandits

Published: 21 April 2020

Abstract

Contextual bandit problems are a natural fit for many information retrieval tasks, such as learning to rank, text classification, and recommendation. However, existing learning methods for contextual bandit problems have one of two drawbacks: They either do not explore the space of all possible document rankings (i.e., actions) and, thus, may miss the optimal ranking, or they present suboptimal rankings to a user and, thus, may harm the user experience. We introduce a new learning method for contextual bandit problems, Safe Exploration Algorithm (SEA), which overcomes the above drawbacks. SEA starts by using a baseline (or production) ranking system (i.e., policy), which does not harm the user experience and, thus, is safe to execute but has suboptimal performance and, thus, needs to be improved. Then SEA uses counterfactual learning to learn a new policy based on the behavior of the baseline policy. SEA also uses high-confidence off-policy evaluation to estimate the performance of the newly learned policy. Once the performance of the newly learned policy is at least as good as the performance of the baseline policy, SEA starts using the new policy to execute new actions, allowing it to actively explore favorable regions of the action space. This way, SEA never performs worse than the baseline policy and, thus, does not harm the user experience, while still exploring the action space and, thus, being able to find an optimal policy. Our experiments on text classification and document retrieval confirm these properties by comparing SEA (and a boundless variant called BSEA) to online and offline learning methods for contextual bandit problems.
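The abstract describes SEA's core control flow: act with the baseline policy, learn a candidate policy counterfactually from the logged interactions, and switch only once a high-confidence estimate of the candidate's performance reaches the baseline's. The following is a minimal Python sketch of that safety check, assuming a simple inverse-propensity-scoring (IPS) estimator and a normal-approximation lower bound; the function names (ips_estimates, lower_confidence_bound, should_switch) and the toy data are illustrative assumptions, not the paper's implementation, which uses a dedicated high-confidence off-policy evaluator.

# Sketch of SEA's switch condition (illustrative; names and the crude
# normal-approximation bound are assumptions, not taken from the paper).
import numpy as np

def ips_estimates(logged, candidate_probs):
    # logged: list of (action, baseline_propensity, reward) tuples collected
    # while the baseline policy was acting.
    # candidate_probs[i]: probability the candidate policy assigns to the
    # logged action in round i.
    return np.array([
        (pi_new / pi_log) * reward
        for (action, pi_log, reward), pi_new in zip(logged, candidate_probs)
    ])

def lower_confidence_bound(samples, z=1.645):
    # Crude one-sided 95% lower bound on the mean via a normal approximation;
    # it stands in for the paper's high-confidence off-policy estimator.
    n = len(samples)
    std = samples.std(ddof=1) if n > 1 else float("inf")
    return samples.mean() - z * std / np.sqrt(n)

def should_switch(logged, candidate_probs, baseline_value):
    # Deploy the candidate only if its estimated value is, with high
    # confidence, at least as good as the baseline's value.
    return lower_confidence_bound(ips_estimates(logged, candidate_probs)) >= baseline_value

# Toy usage: three logged rounds under the baseline and a candidate policy
# that concentrates on the actions that earned reward.
logged = [("doc_a", 0.5, 1.0), ("doc_b", 0.5, 0.0), ("doc_a", 0.5, 1.0)]
candidate_probs = [0.9, 0.1, 0.9]
print(should_switch(logged, candidate_probs, baseline_value=0.66))

With only three logged rounds the confidence bound is loose, so the check returns False and the baseline keeps acting; this mirrors the safety property in the abstract, where SEA defers to the baseline until the off-policy evidence is strong enough.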

    Published In

ACM Transactions on Information Systems, Volume 38, Issue 3
    July 2020
    311 pages
    ISSN:1046-8188
    EISSN:1558-2868
    DOI:10.1145/3394096
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 21 April 2020
    Accepted: 01 February 2020
    Revised: 01 December 2019
    Received: 01 July 2019
    Published in TOIS Volume 38, Issue 3

    Author Tags

    1. Counterfactual learning
    2. Exploration
    3. Learning to rank

    Qualifiers

    • Research-article
    • Research
    • Refereed

