
On the choice of effectiveness measures for learning to rank

Published: 01 June 2010

Abstract

Most current machine learning methods for building search engines are based on the assumption that there is a target evaluation metric that measures the quality of the search engine with respect to an end user, and that the engine should be trained to optimize that metric. Treating the target evaluation metric as given, many different approaches (e.g., LambdaRank, SoftRank, RankingSVM) have been proposed for optimizing retrieval metrics directly. The target metric used in optimization acts as a bottleneck that summarizes the training data, and it is known that some evaluation metrics are more informative than others. In this paper, we consider the effect of the target evaluation metric on learning to rank. In particular, we question the current assumption that retrieval systems should be designed to directly optimize the metric that is assumed to measure user satisfaction. We show that even if user satisfaction can be measured by a metric X, optimizing the engine on a training set for a more informative metric Y may result in better test performance according to X than optimizing the engine directly for X on the training set. We analyze when the difference between the two approaches is significant, in terms of the amount of available training data and the dimensionality of the feature space.
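
To make the comparison concrete, the sketch below (not the authors' code; the synthetic data, the toy linear ranker, and the crude random-search learner are all illustrative assumptions) follows the protocol the abstract describes: one ranker is trained to maximize the target metric X (here Precision@10) directly on a small training set, another is trained to maximize a more informative metric Y (here NDCG@10), and both are then evaluated with X on held-out queries.

# Illustrative sketch only: a toy linear ranker on synthetic data, trained by
# crude random search to maximize either P@10 or NDCG@10 on the training
# queries, then evaluated with P@10 on held-out queries. Everything here is
# an assumption for illustration, not the paper's experimental setup.
import numpy as np

rng = np.random.default_rng(0)
N_FEATURES = 5
W_TRUE = rng.normal(size=N_FEATURES)  # hidden "true" relevance model

def dcg_at_k(labels, scores, k=10):
    """DCG@k of graded labels when documents are sorted by descending score."""
    order = np.argsort(-scores)[:k]
    gains = 2.0 ** labels[order] - 1.0
    discounts = 1.0 / np.log2(np.arange(2, len(order) + 2))
    return float(np.sum(gains * discounts))

def ndcg_at_k(labels, scores, k=10):
    ideal = dcg_at_k(labels, labels, k)  # ideal ordering ranks by the labels themselves
    return dcg_at_k(labels, scores, k) / ideal if ideal > 0 else 0.0

def precision_at_k(labels, scores, k=10):
    """Fraction of relevant documents (label > 0) among the top k."""
    order = np.argsort(-scores)[:k]
    return float(np.mean(labels[order] > 0))

def make_queries(n_queries, n_docs=50):
    """Synthetic queries: graded relevance is a noisy function of W_TRUE."""
    queries = []
    for _ in range(n_queries):
        X = rng.normal(size=(n_docs, N_FEATURES))
        noisy = X @ W_TRUE + rng.normal(scale=2.0, size=n_docs)
        y = np.digitize(noisy, np.quantile(noisy, [0.6, 0.85, 0.95]))  # grades 0..3
        queries.append((X, y.astype(float)))
    return queries

def mean_metric(metric, queries, w):
    """Average of a per-query metric over a query set, for weight vector w."""
    return float(np.mean([metric(y, X @ w) for X, y in queries]))

def optimize_for(metric, train_queries, n_trials=300):
    """Crude random-search stand-in for a metric-optimizing learner."""
    best_w, best_val = None, -np.inf
    for _ in range(n_trials):
        w = rng.normal(size=N_FEATURES)
        val = mean_metric(metric, train_queries, w)
        if val > best_val:
            best_w, best_val = w, val
    return best_w

train, test = make_queries(20), make_queries(200)

w_x = optimize_for(precision_at_k, train)  # optimize the target metric X directly
w_y = optimize_for(ndcg_at_k, train)       # optimize the more informative metric Y

print("test P@10 after training on P@10 :", round(mean_metric(precision_at_k, test, w_x), 3))
print("test P@10 after training on NDCG :", round(mean_metric(precision_at_k, test, w_y), 3))

Whether the NDCG-trained weights actually win on test P@10 depends on the seed and on how little training data is available; the sketch is meant only to show the evaluation protocol, i.e. that the training-time metric and the test-time metric need not be the same.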

References

[1]
Aslam, J. A., Yilmaz, E., & Pavlu, V. (2005). The maximum entropy method for analyzing retrieval measures. In Marchionini, G., Moffat, A., Tait, J., Baeza-Yates, R., & Ziviani, N. (Eds.), Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 27–34). ACM Press.
[2]
Burges, C. J. C., Ragno, R., & Le, Q. V. (2006). Learning to rank with nonsmooth cost functions. In Schölkopf, B., Platt, J. C., & Hoffman, T. (Eds.), Advances in Neural Information Processing Systems (NIPS 2006) (pp. 193–200). Cambridge, MA: MIT Press.
[3]
Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., et al. (2005). Learning to rank using gradient descent. In ICML ’05: Proceedings of the 22nd international conference on machine learning (pp. 89–96). New York, NY, USA: ACM.
[4]
Donmez, P., Svore, K., & Burges, C. J. (2008). On the optimality of LambdaRank. Technical report, Microsoft Research.
[5]
Järvelin, K., & Kekäläinen, J. (2000). IR evaluation methods for retrieving highly relevant documents. In SIGIR ’00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval (pp. 41–48). New York, NY, USA: ACM.
[6]
Joachims, T. (2005). A support vector method for multivariate performance measures. In ICML ’05: Proceedings of the 22nd international conference on machine learning (pp. 377–384). New York, NY, USA: ACM.
[7]
Le, Q. V., & Smola, A. J. (2007). Direct optimization of ranking measures. CoRR, abs/0704.3359.
[8]
Ling, C. X., Huang, J., & Zhang, H. (2003). AUC: A statistically consistent and more discriminating measure than accuracy. In IJCAI ’03: Proceedings of the 18th international joint conference on artificial intelligence (pp. 329–341).
[9]
Liu, T.-Y., & He, Y. (2008). Are algorithms directly optimizing IR measures really direct? Technical report, Microsoft Research.
[10]
Robertson, S. (2008). A new interpretation of average precision. In SIGIR ’08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval (pp. 689–690). New York, NY, USA: ACM.
[11]
Robertson, S., & Zaragoza, H. (2007). On rank-based effectiveness measures and optimization. Information Retrieval, 10(3), 321–339.
[12]
Sanderson, M., & Zobel, J. (2005). Information retrieval system evaluation: Effort, sensitivity, and reliability. In Baeza-Yates, R. A., Ziviani, N., Marchionini, G., Moffat, A., & Tait, J. (Eds.), SIGIR (pp. 162–169). ACM.
[13]
Taylor, M., Guiver, J., Robertson, S., & Minka, T. (2008). SoftRank: Optimizing non-smooth rank metrics. In WSDM ’08: Proceedings of the international conference on Web search and web data mining (pp. 77–86). New York, NY, USA: ACM.
[14]
Taylor, M., Zaragoza, H., Craswell, N., Robertson, S., & Burges, C. (2006). Optimisation methods for ranking functions with multiple parameters. In CIKM ’06: Proceedings of the 15th ACM international conference on Information and knowledge management (pp. 585–593). New York, NY, USA: ACM.
[15]
Webber, W., Moffat, A., Zobel, J., & Sakai, T. (2008). Precision-at-ten considered redundant. In SIGIR (pp. 695–696). New York, NY, USA: ACM.
[16]
Xu, J., & Li, H. (2007). AdaRank: A boosting algorithm for information retrieval. In SIGIR ’07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 391–398). New York, NY, USA: ACM.
[17]
Yilmaz, E., & Aslam, J. A. (2008). Estimating average precision when judgments are incomplete. Knowledge and Information Systems, 16(2), 173–211.
[18]
Yue, Y., Finley, T., Radlinski, F., & Joachims, T. (2007). A support vector method for optimizing average precision. In SIGIR ’07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 271–278). New York, NY, USA: ACM.

        Published In

Information Retrieval, Volume 13, Issue 3
        Jun 2010
        118 pages

        Publisher

Kluwer Academic Publishers, United States

        Publication History

        Published: 01 June 2010
        Accepted: 28 August 2009
        Received: 24 April 2009

        Author Tags

        1. Evaluation
        2. Evaluation metrics
        3. Learning to rank
        4. Training
        5. Empirical risk minimization

