research-article

Safe Exploration for Optimizing Contextual Bandits

Published: 21 April 2020

Abstract

Contextual bandit problems are a natural fit for many information retrieval tasks, such as learning to rank, text classification, and recommendation. However, existing learning methods for contextual bandit problems have one of two drawbacks: They either do not explore the space of all possible document rankings (i.e., actions) and, thus, may miss the optimal ranking, or they present suboptimal rankings to a user and, thus, may harm the user experience. We introduce a new learning method for contextual bandit problems, Safe Exploration Algorithm (SEA), which overcomes the above drawbacks. SEA starts by using a baseline (or production) ranking system (i.e., policy), which does not harm the user experience and, thus, is safe to execute but has suboptimal performance and, thus, needs to be improved. Then SEA uses counterfactual learning to learn a new policy based on the behavior of the baseline policy. SEA also uses high-confidence off-policy evaluation to estimate the performance of the newly learned policy. Once the performance of the newly learned policy is at least as good as the performance of the baseline policy, SEA starts using the new policy to execute new actions, allowing it to actively explore favorable regions of the action space. This way, SEA never performs worse than the baseline policy and, thus, does not harm the user experience, while still exploring the action space and, thus, being able to find an optimal policy. Our experiments on text classification and document retrieval confirm these properties by comparing SEA (and a boundless variant called BSEA) to online and offline learning methods for contextual bandit problems.
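The abstract describes SEA's core control flow: act with the baseline policy, learn a candidate policy counterfactually from the logged interactions, and switch only once a high-confidence estimate of the candidate's performance reaches the baseline's. The following is a minimal Python sketch of that safety check, assuming a simple inverse-propensity-scoring (IPS) estimator and a normal-approximation lower bound; the function names (ips_estimates, lower_confidence_bound, should_switch) and the toy data are illustrative assumptions, not the paper's implementation, which uses a dedicated high-confidence off-policy evaluator.

# Sketch of SEA's switch condition (illustrative; names and the crude
# normal-approximation bound are assumptions, not taken from the paper).
import numpy as np

def ips_estimates(logged, candidate_probs):
    # logged: list of (action, baseline_propensity, reward) tuples collected
    # while the baseline policy was acting.
    # candidate_probs[i]: probability the candidate policy assigns to the
    # logged action in round i.
    return np.array([
        (pi_new / pi_log) * reward
        for (action, pi_log, reward), pi_new in zip(logged, candidate_probs)
    ])

def lower_confidence_bound(samples, z=1.645):
    # Crude one-sided 95% lower bound on the mean via a normal approximation;
    # it stands in for the paper's high-confidence off-policy estimator.
    n = len(samples)
    std = samples.std(ddof=1) if n > 1 else float("inf")
    return samples.mean() - z * std / np.sqrt(n)

def should_switch(logged, candidate_probs, baseline_value):
    # Deploy the candidate only if its estimated value is, with high
    # confidence, at least as good as the baseline's value.
    return lower_confidence_bound(ips_estimates(logged, candidate_probs)) >= baseline_value

# Toy usage: three logged rounds under the baseline and a candidate policy
# that concentrates on the actions that earned reward.
logged = [("doc_a", 0.5, 1.0), ("doc_b", 0.5, 0.0), ("doc_a", 0.5, 1.0)]
candidate_probs = [0.9, 0.1, 0.9]
print(should_switch(logged, candidate_probs, baseline_value=0.66))

With only three logged rounds the confidence bound is loose, so the check returns False and the baseline keeps acting; this mirrors the safety property in the abstract, where SEA defers to the baseline until the off-policy evidence is strong enough.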

    Published In

ACM Transactions on Information Systems, Volume 38, Issue 3
    July 2020
    311 pages
    ISSN:1046-8188
    EISSN:1558-2868
    DOI:10.1145/3394096
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 21 April 2020
    Accepted: 01 February 2020
    Revised: 01 December 2019
    Received: 01 July 2019
    Published in TOIS Volume 38, Issue 3

    Author Tags

    1. Counterfactual learning
    2. Exploration
    3. Learning to rank

    Qualifiers

    • Research-article
    • Research
    • Refereed

