DOI: 10.1145/3589334.3645501
Research article
Open access

Off-Policy Evaluation for Large Action Spaces via Policy Convolution

Published: 13 May 2024

Abstract

Developing accurate off-policy estimators is crucial for both evaluating and optimizing new policies. The main challenge in off-policy estimation is the distribution shift between the logging policy that generates the data and the target policy that we aim to evaluate. Techniques for correcting this distribution shift typically involve some form of importance sampling, which yields unbiased value estimates but often at the cost of high variance. Furthermore, importance sampling relies on the common support assumption, which becomes impractical when the action space is large. To address these challenges, we introduce the Policy Convolution (PC) family of estimators for the contextual bandit setting. These methods leverage latent structure within actions (made available through action embeddings) to strategically convolve the logging and target policies. This convolution introduces a unique bias-variance trade-off that can be controlled via the amount of convolution. Our experiments on synthetic and benchmark datasets demonstrate substantial mean squared error (MSE) improvements when using PC, especially when either the action space or the policy mismatch becomes large, with gains of up to 5-6 orders of magnitude over existing estimators.
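
To make the convolution concrete, below is a minimal sketch of one way such an estimator could be implemented. This is an illustration under our own assumptions (a Gaussian kernel over the action embeddings serves as the convolution, and the bandwidth tau plays the role of the "amount of convolution"), not the paper's exact algorithm; all function and variable names are hypothetical.

import numpy as np

def convolve_policy(policy_probs, action_emb, tau):
    # Smooth per-context action probabilities with a Gaussian kernel defined on
    # the action embeddings. As tau -> 0 the kernel approaches the identity and
    # the original (unsmoothed) policy is recovered.
    sq_dists = ((action_emb[:, None, :] - action_emb[None, :, :]) ** 2).sum(-1)
    kernel = np.exp(-sq_dists / (2.0 * tau ** 2))   # (n_actions, n_actions)
    kernel /= kernel.sum(axis=1, keepdims=True)     # rows sum to 1
    return policy_probs @ kernel                    # spread mass to nearby actions

def pc_estimate(rewards, actions, pi_log, pi_tgt, action_emb, tau):
    # Importance-sampling value estimate using the convolved propensities.
    # rewards, actions: logged rewards and chosen action indices, shape (n,)
    # pi_log, pi_tgt:   per-sample action-probability rows, shape (n, n_actions)
    pi_log_s = convolve_policy(pi_log, action_emb, tau)
    pi_tgt_s = convolve_policy(pi_tgt, action_emb, tau)
    idx = np.arange(len(actions))
    weights = pi_tgt_s[idx, actions] / pi_log_s[idx, actions]
    return float(np.mean(weights * rewards))

In this sketch, larger values of tau pool propensity mass across nearby actions in embedding space, which tames the importance weights (and relaxes the need for exact common support) at the price of added bias, while tau close to 0 reduces to ordinary inverse propensity scoring.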

Supplemental Material

MP4 File: Supplemental video



    Published In

    WWW '24: Proceedings of the ACM Web Conference 2024
    May 2024, 4826 pages
    ISBN: 9798400701719
    DOI: 10.1145/3589334
    This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 13 May 2024


    Author Tags

    1. inverse propensity score
    2. off-policy evaluation

    Qualifiers

    • Research-article

    Conference

    WWW '24: The ACM Web Conference 2024
    May 13 - 17, 2024
    Singapore, Singapore

    Acceptance Rates

    Overall acceptance rate: 1,899 of 8,196 submissions (23%)

    Article Metrics

    • Downloads (last 12 months): 420
    • Downloads (last 6 weeks): 88
    Reflects downloads up to 09 Jan 2025
