DOI: 10.1145/3589334.3645501
Research article
Open access

Off-Policy Evaluation for Large Action Spaces via Policy Convolution

Published: 13 May 2024

Abstract

Developing accurate off-policy estimators is crucial for both evaluating and optimizing new policies. The main challenge in off-policy estimation is the distribution shift between the logging policy that generates the data and the target policy that we aim to evaluate. Techniques for correcting this distribution shift typically involve some form of importance sampling, which yields unbiased value estimates but often at the cost of high variance. Furthermore, importance sampling relies on the common support assumption, which becomes impractical when the action space is large. To address these challenges, we introduce the Policy Convolution (PC) family of estimators for the contextual bandit setting. These methods leverage latent structure within actions (made available through action embeddings) to strategically convolve the logging and target policies. This convolution introduces a unique bias-variance trade-off that can be controlled via the amount of convolution. Our experiments on synthetic and benchmark datasets demonstrate substantial mean squared error (MSE) improvements when using PC, especially when either the action space or the policy mismatch becomes large, with gains of up to 5-6 orders of magnitude over existing estimators.
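
To make the convolution concrete, below is a minimal sketch of one way such an estimator could be implemented. This is an illustration under our own assumptions (a Gaussian kernel over the action embeddings serves as the convolution, and the bandwidth tau plays the role of the "amount of convolution"), not the paper's exact algorithm; all function and variable names are hypothetical.

import numpy as np

def convolve_policy(policy_probs, action_emb, tau):
    # Smooth per-context action probabilities with a Gaussian kernel defined on
    # the action embeddings. As tau -> 0 the kernel approaches the identity and
    # the original (unsmoothed) policy is recovered.
    sq_dists = ((action_emb[:, None, :] - action_emb[None, :, :]) ** 2).sum(-1)
    kernel = np.exp(-sq_dists / (2.0 * tau ** 2))   # (n_actions, n_actions)
    kernel /= kernel.sum(axis=1, keepdims=True)     # rows sum to 1
    return policy_probs @ kernel                    # spread mass to nearby actions

def pc_estimate(rewards, actions, pi_log, pi_tgt, action_emb, tau):
    # Importance-sampling value estimate using the convolved propensities.
    # rewards, actions: logged rewards and chosen action indices, shape (n,)
    # pi_log, pi_tgt:   per-sample action-probability rows, shape (n, n_actions)
    pi_log_s = convolve_policy(pi_log, action_emb, tau)
    pi_tgt_s = convolve_policy(pi_tgt, action_emb, tau)
    idx = np.arange(len(actions))
    weights = pi_tgt_s[idx, actions] / pi_log_s[idx, actions]
    return float(np.mean(weights * rewards))

In this sketch, larger values of tau pool propensity mass across nearby actions in embedding space, which tames the importance weights (and relaxes the need for exact common support) at the price of added bias, while tau close to 0 reduces to ordinary inverse propensity scoring.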

Supplemental Material

MP4 File: Supplemental video



    Published In

    WWW '24: Proceedings of the ACM Web Conference 2024
    May 2024, 4826 pages
    ISBN: 9798400701719
    DOI: 10.1145/3589334
    This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 13 May 2024


    Author Tags

    1. inverse propensity score
    2. off-policy evaluation

    Qualifiers

    • Research-article

    Conference

    WWW '24: The ACM Web Conference 2024
    May 13 - 17, 2024
    Singapore, Singapore

    Acceptance Rates

    Overall acceptance rate: 1,899 of 8,196 submissions (23%)

    Article Metrics

    • Downloads (last 12 months): 420
    • Downloads (last 6 weeks): 88
    Reflects downloads up to 09 Jan 2025
