A Revisiting Study of Appropriate Offline Evaluation for Top-N Recommendation Algorithms

Published: 21 December 2022

Abstract

In recommender systems, top-N recommendation is an important task with implicit feedback data. Although the recent success of deep learning has largely pushed forward research on top-N recommendation, there are increasing concerns about the appropriate evaluation of recommendation algorithms. It is therefore important to study how recommendation algorithms can be reliably evaluated and thoroughly verified. This work presents a large-scale, systematic study of six important factors from three aspects for evaluating recommender systems. We carefully select 12 top-N recommendation algorithms and eight recommendation datasets, and we design and conduct extensive experiments with them. In particular, all the experiments in our work are implemented with an open-source recommendation library, RecBole [139], which ensures the reproducibility and reliability of our results. Based on the large-scale experiments and detailed analysis, we derive several key findings on the experimental settings for evaluating recommender systems. Our findings show that some settings can lead to substantial or significant differences in the performance ranking of the compared algorithms. In response to recent evaluation concerns, we also suggest several settings that are especially important for performance comparison.

References

[1]
Fabio Aiolli. 2013. Efficient top-n recommendation for very large scale binary rated datasets. In Proceedings of the 7th ACM Conference on Recommender Systems (RecSys’13). Association for Computing Machinery, New York, NY, 273–280.
[2]
Zafar Ali, Irfan Ullah, Amin Khan, Asim Ullah Jan, and Khan Muhammad. 2021. An overview and evaluation of citation recommendation models. Scientometrics 126, 5 (2021), 4083–4119.
[3]
Vito Walter Anelli, Alejandro Bellogín, Antonio Ferrara, Daniele Malitesta, Felice Antonio Merra, Claudio Pomo, Francesco Maria Donini, and Tommaso Di Noia. 2021. Elliot: A comprehensive and rigorous framework for reproducible recommender systems evaluation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’21), Fernando Diaz, Chirag Shah, Torsten Suel, Pablo Castells, Rosie Jones, and Tetsuya Sakai (Eds.). ACM, 2405–2414.
[4]
Prasanna Balaprakash, Michael Salim, Thomas D. Uram, Venkat Vishwanath, and Stefan M. Wild. 2018. DeepHyper: Asynchronous hyperparameter search for deep neural networks. In Proceedings of the IEEE 25th International Conference on High Performance Computing (HiPC’18). 42–51.
[5]
Oren Barkan, Yonatan Fuchs, Avi Caciularu, and Noam Koenigstein. 2020. Explainable recommendations via attentive multi-persona collaborative filtering. In Proceedings of the 14th ACM Conference on Recommender Systems (RecSys’20). Association for Computing Machinery, New York, NY, 468–473.
[6]
T. Bartz-Beielstein, C. W. G. Lasarczyk, and M. Preuss. 2005. Sequential parameter optimization. In Proceedings of the IEEE Congress on Evolutionary Computation, Vol. 1. 773–780.
[7]
James Bennett, Stan Lanning, et al. 2007. The Netflix prize. In Proceedings of the KDD Cup and Workshop, Vol. 2007. Citeseer, 35.
[8]
James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13 (Feb. 2012), 281–305.
[9]
Rocío Cañamares and Pablo Castells. 2020. On target item sampling in offline recommender system evaluation. In Proceedings of the 14th ACM Conference on Recommender Systems (RecSys’20). Association for Computing Machinery, New York, NY, 259–268.
[10]
Pedro G. Campos, Fernando Díez, and Manuel Sánchez-Montañés. 2011. Towards a more realistic evaluation: Testing the ability to predict future tastes of matrix factorization-based recommenders. In Proceedings of the 5th ACM Conference on Recommender Systems (RecSys’11). Association for Computing Machinery, New York, NY, 309–312.
[11]
Rocío Cañamares, Pablo Castells, and Alistair Moffat. 2020. Offline evaluation options for recommender systems. Inf. Retriev. J. (2020), 1–24.
[12]
Iván Cantador, Peter Brusilovsky, and Tsvi Kuflik. 2011. 2nd workshop on information heterogeneity and fusion in recommender systems (hetrec 2011). In Proceedings of the 5th ACM Conference on Recommender Systems (RecSys’11). ACM, New York, NY.
[13]
Sonny Han Seng Chee, Jiawei Han, and Ke Wang. 2001. Rectree: An efficient collaborative filtering method. In Proceedings of the International Conference on Data Warehousing and Knowledge Discovery. Springer, 141–151.
[14]
Chong Chen, Min Zhang, Yiqun Liu, and Shaoping Ma. 2019. Social attentional memory network: Modeling aspect- and friend-level differences in recommendation. In Proceedings of the 12th ACM International Conference on Web Search and Data Mining (WSDM’19). Association for Computing Machinery, New York, NY, 177–185.
[15]
Chong Chen, Min Zhang, Yongfeng Zhang, Yiqun Liu, and Shaoping Ma. 2020. Efficient neural matrix factorization without sampling for recommendation. ACM Trans. Inf. Syst. 38, 2, Article 14 (Jan. 2020), 28 pages.
[16]
Chong Chen, Min Zhang, Yongfeng Zhang, Weizhi Ma, Yiqun Liu, and Shaoping Ma. 2020. Efficient heterogeneous collaborative filtering without negative sampling for recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 19–26.
[17]
Jiawei Chen, Hande Dong, Xiang Wang, Fuli Feng, Meng Wang, and Xiangnan He. 2020. Bias and debias in recommender system: A survey and future directions. arXiv:2010.03240. Retrieved from https://arxiv.org/abs/2010.03240.
[18]
Weijian Chen, Yulong Gu, Zhaochun Ren, Xiangnan He, Hongtao Xie, Tong Guo, Dawei Yin, and Yongdong Zhang. 2019. Semi-supervised user profiling with heterogeneous graph attention networks. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI’19), Vol. 19. 2116–2122.
[19]
Yihong Chen, Bei Chen, Xiangnan He, Chen Gao, Yong Li, Jian-Guang Lou, and Yue Wang. 2019. \(\lambda\)Opt: Learn to regularize recommender models in finer levels. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD’19). Association for Computing Machinery, New York, NY, 978–986.
[20]
Yifan Chen and Maarten de Rijke. 2018. A collective variational autoencoder for top-n recommendation with side information. In Proceedings of the 3rd Workshop on Deep Learning for Recommender Systems. 3–9.
[21]
Jin Yao Chin, Yile Chen, and Gao Cong. 2022. The datasets dilemma: How much do we really know about recommendation datasets? In Proceedings of the 15th ACM International Conference on Web Search and Data Mining. 141–149.
[22]
Evangelia Christakopoulou and George Karypis. 2018. Local latent space models for top-n recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD’18). Association for Computing Machinery, New York, NY, 1235–1243.
[23]
Maurizio Ferrari Dacrema, Simone Boglio, Paolo Cremonesi, and Dietmar Jannach. 2021. A troubling analysis of reproducibility and progress in recommender systems research. ACM Trans. Inf. Syst. 39, 2, Article 20 (Jan. 2021), 49 pages.
[24]
Maurizio Ferrari Dacrema, Paolo Cremonesi, and Dietmar Jannach. 2019. Are we really making much progress? A worrying analysis of recent neural recommendation approaches. In Proceedings of the 13th ACM Conference on Recommender Systems (RecSys’19). Association for Computing Machinery, New York, NY, 101–109.
[25]
Maurizio Ferrari Dacrema, Paolo Cremonesi, and Dietmar Jannach. 2020. Methodological issues in recommender systems research. In Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI’20). 4706–4710.
[26]
Yashar Deldjoo, Tommaso Di Noia, Eugenio Di Sciascio, and Felice Antonio Merra. 2020. How dataset characteristics affect the robustness of collaborative recommendation models. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 951–960.
[27]
Zhi-Hong Deng, Ling Huang, Chang-Dong Wang, Jian-Huang Lai, and Philip S. Yu. 2019. DeepCF: A unified framework of representation learning and matching function learning in recommender system. In Proceedings of the AAAI Conference on Artificial Intelligence. 61–68.
[28]
Mukund Deshpande and George Karypis. 2004. Item-based top-n recommendation algorithms. ACM Trans. Inf. Syst. 22, 1 (2004), 143–177.
[29]
Robin Devooght, Nicolas Kourtellis, and Amin Mantrach. 2015. Dynamic matrix factorization with priors on unknown values. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 189–198.
[30]
Jingtao Ding, Yuhan Quan, Quanming Yao, Yong Li, and Depeng Jin. 2020. Simplify and robustify negative sampling for implicit collaborative filtering. arXiv:2009.03376. Retrieved from https://arxiv.org/abs/2009.03376.
[31]
Jingtao Ding, Guanghui Yu, Yong Li, Xiangnan He, and Depeng Jin. 2020. Improving implicit recommender systems with auxiliary data. ACM Trans. Inf. Syst. 38, 1, Article 11 (Feb. 2020), 27 pages.
[32]
Travis Ebesu, Bin Shen, and Yi Fang. 2018. Collaborative memory network for recommendation systems. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR’18). Association for Computing Machinery, New York, NY, 515–524.
[33]
Ehtsham Elahi, Wei Wang, Dave Ray, Aish Fenton, and Tony Jebara. 2019. Variational low rank multinomials for collaborative filtering with side-information. In Proceedings of the 13th ACM Conference on Recommender Systems (RecSys’19). Association for Computing Machinery, New York, NY, 340–347.
[34]
Ehtsham Elahi, Wei Wang, Dave Ray, Aish Fenton, and Tony Jebara. 2019. Variational low rank multinomials for collaborative filtering with side-information. In Proceedings of the 13th ACM Conference on Recommender Systems. 340–347.
[35]
Ali Mamdouh Elkahky, Yang Song, and Xiaodong He. 2015. A multi-view deep learning approach for cross domain user modeling in recommendation systems. In Proceedings of the 24th International Conference on World Wide Web (WWW’15). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 278–288.
[36]
Soude Fazeli, Babak Loni, Alejandro Bellogin, Hendrik Drachsler, and Peter Sloep. 2014. Implicit vs. explicit trust in social matrix factorization. In Proceedings of the 8th ACM Conference on Recommender systems. 317–320.
[37]
Evgeny Frolov and Ivan Oseledets. 2019. HybridSVD: When collaborative information is not enough. In Proceedings of the 13th ACM Conference on Recommender Systems (RecSys’19). Association for Computing Machinery, New York, NY, 331–339.
[38]
Zeno Gantner, Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. 2011. MyMediaLite: A free recommender system library. In Proceedings of the 5th ACM Conference on Recommender Systems. 305–308.
[39]
Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham, and Simon Dollé. 2018. Offline A/B testing for recommender systems. In Proceedings of the 11th ACM International Conference on Web Search and Data Mining. 198–206.
[40]
Guibing Guo, Jie Zhang, Zhu Sun, and Neil Yorke-Smith. 2015. LibRec: A java library for recommender systems. In UMAP Workshops, Vol. 4. Citeseer.
[41]
Udit Gupta, Samuel Hsia, Vikram Saraph, Xiaodong Wang, Brandon Reagen, Gu-Yeon Wei, Hsien-Hsin S. Lee, David Brooks, and Carole-Jean Wu. 2020. Deeprecsys: A system for optimizing end-to-end at-scale neural recommendation inference. In Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA’20). IEEE, 982–995.
[42]
F. Maxwell Harper and Joseph A. Konstan. 2015. The movielens datasets: History and context. ACM Trans. Interact. Intell. Syst. 5, 4 (2015), 1–19.
[43]
Gaole He, Junyi Li, Wayne Xin Zhao, Peiju Liu, and Ji-Rong Wen. 2020. Mining implicit entity preference from user-item interaction data for knowledge graph completion via adversarial learning. In Proceedings of the Web Conference 2020. 740–751.
[44]
Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. 2020. Lightgcn: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 639–648.
[45]
Xiangnan He, Zhankui He, Xiaoyu Du, and Tat-Seng Chua. 2018. Adversarial personalized ranking for recommendation. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR’18). Association for Computing Machinery, New York, NY, 355–364.
[46]
X. He, Z. He, J. Song, Z. Liu, Y. Jiang, and T. Chua. 2018. NAIS: Neural attentive item similarity model for recommendation. IEEE Trans. Knowl. Data Eng. 30, 12 (2018), 2354–2366.
[47]
Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the World Wide Web Conference (WWW’17). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 173–182.
[48]
Xiangnan He, Hanwang Zhang, Min-Yen Kan, and Tat-Seng Chua. 2016. Fast matrix factorization for online recommendation with implicit feedback. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. 549–558.
[49]
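Xiangnan He, Hanwang Zhang, Min-Yen Kan, and Tat-Seng Chua. 2016. Fast matrix factorization for online recommendation with implicit feedback. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’16). Association for Computing Machinery, New York, NY, 549–558.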
[50]
Balázs Hidasi and Alexandros Karatzoglou. 2018. Recurrent neural networks with top-k gains for session-based recommendations. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM’18). Association for Computing Machinery, New York, NY, 843–852.
[51]
Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv:1511.06939. Retrieved from https://arxiv.org/abs/1511.06939.
[52]
Binbin Hu, Chuan Shi, Wayne Xin Zhao, and Philip S. Yu. 2018. Leveraging meta-path based context for top-n recommendation with a neural co-attention model. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD’18). Association for Computing Machinery, New York, NY, 1531–1540.
[53]
Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative filtering for implicit feedback datasets. In Proceedings of the 8th IEEE International Conference on Data Mining. IEEE, 263–272.
[54]
Olivier Jeunen. 2019. Revisiting offline evaluation for implicit-feedback recommender systems. In Proceedings of the 13th ACM Conference on Recommender Systems (RecSys’19). Association for Computing Machinery, New York, NY, 596–600.
[55]
Olivier Jeunen, Koen Verstrepen, and Bart Goethals. 2018. Fair offline evaluation methodologies for implicit-feedback recommender systems with MNAR data. In Proceedings of the REVEAL 18 Workshop on Offline Evaluation, October 2018, Vancouver, Canada.
[56]
Shuyi Ji, Yifan Feng, Rongrong Ji, Xibin Zhao, Wanwan Tang, and Yue Gao. 2020. Dual channel hypergraph collaborative filtering. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD’20). Association for Computing Machinery, New York, NY, 2020–2029.
[57]
Yitong Ji, Aixin Sun, Jie Zhang, and Chenliang Li. 2020. A re-visit of the popularity baseline in recommender systems. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1749–1752.
[58]
Yitong Ji, Aixin Sun, Jie Zhang, and Chenliang Li. 2021. A critical study on data leakage in recommender system offline evaluation. arXiv:2010.11060 [cs.IR]. Retrieved from https://arxiv.org/abs/2010.11060.
[59]
Yuchin Juan, Yong Zhuang, Wei-Sheng Chin, and Chih-Jen Lin. 2016. Field-aware factorization machines for CTR prediction. In Proceedings of the 10th ACM Conference on Recommender Systems. 43–50.
[60]
Santosh Kabbur, Xia Ning, and George Karypis. 2013. FISM: Factored item similarity models for top-n recommender systems. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’13). Association for Computing Machinery, New York, NY, 659–667.
[61]
Ron Kohavi and Roger Longbotham. 2017. Online Controlled Experiments and A/B Testing. 922–929.
[62]
Yehuda Koren. 2008. Factorization meets the neighborhood: A multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’08). Association for Computing Machinery, New York, NY, 426–434.
[63]
Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009), 30–37.
[64]
Walid Krichene and Steffen Rendle. 2020. On sampled metrics for item recommendation. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
[65]
Maciej Kula. 2015. Metadata embeddings for user and item cold-start recommendations. arXiv:1507.08439. Retrieved from https://arxiv.org/abs/1507.08439.
[66]
Volodymyr Kysenko, Karl Rupp, Oleksandr Marchenko, Siegfried Selberherr, and Anatoly Anisimov. 2012. GPU-accelerated non-negative matrix factorization for text mining. In Proceedings of the International Conference on Application of Natural Language to Information Systems. Springer, 158–163.
[67]
Dung D. Le and Hady W. Lauw. 2017. Indexable bayesian personalized ranking for efficient top-k recommendation. In Proceedings of the ACM on Conference on Information and Knowledge Management (CIKM’17). Association for Computing Machinery, New York, NY, 1389–1398.
[68]
DongSheng Li, Chao Chen, Qin Lv, Li Shang, Stephen Chu, and Hongyuan Zha. 2017. ERMMA: Expected risk minimization for matrix approximation-based recommender systems. In Proceedings of the 31st AAAI Conference on Artificial Intelligence.
[69]
Dong Li, Ruoming Jin, Jing Gao, and Zhi Liu. 2020. On sampling top-k recommendation evaluation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD’20). Association for Computing Machinery, New York, NY, 2114–2124.
[70]
Xiaopeng Li and James She. 2017. Collaborative variational autoencoder for recommender systems. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, New York, NY, 305–314.
[71]
Dawen Liang, Laurent Charlin, James McInerney, and David M. Blei. 2016. Modeling user exposure in recommendation. In Proceedings of the 25th International Conference on World Wide Web. 951–961.
[72]
Dawen Liang, Rahul G. Krishnan, Matthew D. Hoffman, and Tony Jebara. 2018. Variational autoencoders for collaborative filtering. In Proceedings of the World Wide Web Conference (WWW’18). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 689–698.
[73]
Chen Ma, Peng Kang, Bin Wu, Qinglong Wang, and Xue Liu. 2019. Gated attentive-autoencoder for content-aware recommendation. In Proceedings of the 12th ACM International Conference on Web Search and Data Mining (WSDM’19). Association for Computing Machinery, New York, NY, 519–527.
[74]
Jingwei Ma, Jiahui Wen, Mingyang Zhong, Liangchen Liu, Chaojie Li, Weitong Chen, Yin Yang, Hongkui Tu, and Xue Li. 2019. DBRec: Dual-bridging recommendation via discovering latent groups. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM’19). Association for Computing Machinery, New York, NY, 1513–1522.
[75]
Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. 43–52.
[76]
Sean M. McNee, John Riedl, and Joseph A. Konstan. 2006. Being accurate is not enough: How accuracy metrics have hurt recommender systems. In CHI’06 Extended Abstracts on Human Factors in Computing Systems. 1097–1101.
[77]
Lei Mei, Pengjie Ren, Zhumin Chen, Liqiang Nie, Jun Ma, and Jian-Yun Nie. 2018. An attentive interaction network for context-aware recommendations. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM’18). Association for Computing Machinery, New York, NY, 157–166.
[78]
Elisa Mena-Maldonado, Rocío Cañamares, Pablo Castells, Yongli Ren, and Mark Sanderson. 2020. Agreement and disagreement between true and false-positive metrics in recommender systems evaluation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’20). Association for Computing Machinery, New York, NY, 841–850.
[79]
Zaiqiao Meng, Richard McCreadie, Craig Macdonald, and Iadh Ounis. 2020. Exploring data splitting strategies for the evaluation of recommendation models. In Proceedings of the 14th ACM Conference on Recommender Systems (RecSys’20). Association for Computing Machinery, New York, NY, 681–686.
[80]
Zaiqiao Meng, Richard McCreadie, Craig Macdonald, and Iadh Ounis. 2020. Exploring data splitting strategies for the evaluation of recommendation models. In Proceedings of the 14th ACM Conference on Recommender Systems (RecSys’20). Association for Computing Machinery, New York, NY, 681–686.
[81]
Athanasios N. Nikolakopoulos, Dimitris Berberidis, George Karypis, and Georgios B. Giannakis. 2019. Personalized diffusions for top-n recommendation. In Proceedings of the 13th ACM Conference on Recommender Systems (RecSys’19). Association for Computing Machinery, New York, NY, 260–268.
[82]
Athanasios N. Nikolakopoulos and George Karypis. 2019. RecWalk: Nearly uncoupled random walks for top-n recommendation. In Proceedings of the 12th ACM International Conference on Web Search and Data Mining (WSDM’19). Association for Computing Machinery, New York, NY, 150–158.
[83]
Rasaq Otunba, Raimi A. Rufai, and Jessica Lin. 2017. MPR: Multi-objective pairwise ranking. In Proceedings of the 11th ACM Conference on Recommender Systems (RecSys’17). Association for Computing Machinery, New York, NY, 170–178.
[84]
R. Pan, Y. Zhou, B. Cao, N. N. Liu, R. Lukose, M. Scholz, and Q. Yang. 2008. One-class collaborative filtering. In Proceedings of the 8th IEEE International Conference on Data Mining. 502–511.
[85]
Bibek Paudel, Thilo Haas, and Abraham Bernstein. 2017. Fewer flops at the top: Accuracy, diversity, and regularization in two-class collaborative filtering. In Proceedings of the 11th ACM Conference on Recommender Systems (RecSys’17). Association for Computing Machinery, New York, NY, 215–223.
[86]
Changhua Pei, Xinru Yang, Qing Cui, Xiao Lin, Fei Sun, Peng Jiang, Wenwu Ou, and Yongfeng Zhang. 2019. Value-aware recommendation based on reinforced profit maximization in e-commerce systems. arXiv:1902.00851. Retrieved from https://arxiv.org/abs/1902.00851.
[87]
Nikolaos Polatidis, Stelios Kapetanakis, Elias Pimenidis, and Konstantinos Kosmidis. 2018. Reproducibility of experiments in recommender systems evaluation. In IFIP International Conference on Artificial Intelligence Applications and Innovations. Springer, 401–409.
[88]
Lutz Prechelt. 1998. Automatic early stopping using cross validation: Quantifying the criteria. Neural Netw. 11, 4 (1998), 761–767.
[89]
Steffen Rendle. 2010. Factorization machines. In Proceedings of the IEEE International Conference on Data Mining. IEEE, 995–1000.
[90]
Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence. 452–461.
[91]
Steffen Rendle, Walid Krichene, Li Zhang, and John Anderson. 2020. Neural collaborative filtering vs. matrix factorization revisited. In Proceedings of the 14th ACM Conference on Recommender Systems. 240–248.
[92]
Steffen Rendle, Li Zhang, and Yehuda Koren. 2019. On the difficulty of evaluating baselines: A study on recommender systems. arXiv:1905.01395. Retrieved from https://arxiv.org/abs/1905.01395.
[93]
Jasson D. M. Rennie and Nathan Srebro. 2005. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd International Conference on Machine Learning. 713–719.
[94]
Francesco Ricci, Lior Rokach, and Bracha Shapira. 2011. Introduction to recommender systems handbook. In Recommender Systems Handbook. Springer, 1–35.
[95]
Francesco Ricci, Lior Rokach, Bracha Shapira, and Paul B. Kantor. 2010. Recommender Systems Handbook (1st ed.). Springer-Verlag, Berlin.
[96]
Marco Rossetti, Fabio Stella, and Markus Zanker. 2016. Contrasting offline and online results when evaluating recommendation algorithms. In Proceedings of the 10th ACM Conference on Recommender Systems (RecSys’16). Association for Computing Machinery, New York, NY, 31–34.
[97]
Alan Said and Alejandro Bellogín. 2014. Comparative recommender system evaluation: Benchmarking recommendation frameworks. In Proceedings of the 8th ACM Conference on Recommender Systems. 129–136.
[98]
Alan Said and Alejandro Bellogín. 2015. Replicable evaluation of recommender systems. In Proceedings of the 9th ACM Conference on Recommender Systems (RecSys’15). Association for Computing Machinery, New York, NY, 363–364.
[99]
Guy Shani and Asela Gunawardana. 2011. Evaluating recommendation systems. In Recommender Systems Handbook. Springer, 257–297.
[100]
Lalita Sharma and Anju Gera. 2013. A survey of recommendation system: Research challenges. Int. J. Eng. Trends Technol. 4, 5 (2013), 1989–1992.
[101]
Thiago Silveira, Min Zhang, Xiao Lin, Yiqun Liu, and Shaoping Ma. 2019. How good your recommender system is? A survey on evaluations in recommendation. Int. J. Mach. Learn. Cybernet. 10, 5 (2019), 813–831.
[102]
Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. 2012. Practical bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), Vol. 25. Curran Associates, Inc., 2951–2959.
[103]
Harald Steck. 2013. Evaluation of recommendations: Rating-prediction and ranking. In Proceedings of the 7th ACM Conference on Recommender Systems. 213–220.
[104]
Jianing Sun, Wei Guo, Dengcheng Zhang, Yingxue Zhang, Florence Regol, Yaochen Hu, Huifeng Guo, Ruiming Tang, Han Yuan, Xiuqiang He, and Mark Coates. 2020. A framework for recommending accurate and diverse items using bayesian graph convolutional neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD’20). Association for Computing Machinery, New York, NY, 2030–2039.
[105]
Rui Sun, Xuezhi Cao, Yan Zhao, Junchen Wan, Kun Zhou, Fuzheng Zhang, Zhongyuan Wang, and Kai Zheng. 2020. Multi-modal knowledge graphs for recommender systems. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management (CIKM’20). Association for Computing Machinery, New York, NY, 1405–1414.
[106]
Zhu Sun, Jie Yang, Jie Zhang, Alessandro Bozzon, Yu Chen, and Chi Xu. 2017. MRLR: Multi-level representation learning for personalized ranking in recommendation. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI’17). 2807–2813.
[107]
Zhu Sun, Di Yu, Hui Fang, Jie Yang, Xinghua Qu, Jie Zhang, and Cong Geng. 2020. Are we evaluating rigorously? Benchmarking recommendation for reproducible evaluation and fair comparison. In Proceedings of the 14th ACM Conference on Recommender Systems. 23–32.
[108]
Thanh Tran, Xinyue Liu, Kyumin Lee, and Xiangnan Kong. 2019. Signed distance-based deep memory recommender. In Proceedings of the World Wide Web Conference (WWW’19). Association for Computing Machinery, New York, NY, 1841–1852.
[109]
Daniel Valcarce, Alejandro Bellogín, Javier Parapar, and Pablo Castells. 2018. On the robustness and discriminative power of information retrieval metrics for top-n recommendation. In Proceedings of the ACM Conference on Recommender Systems (RecSys’18). Association for Computing Machinery, New York, NY, 260–268.
[110]
Saúl Vargas. 2014. Novelty and diversity enhancement and evaluation in recommender systems and information retrieval. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval. 1281–1281.
[111]
Hongwei Wang, Fuzheng Zhang, Miao Zhao, Wenjie Li, Xing Xie, and Minyi Guo. 2019. Multi-task feature learning for knowledge graph enhanced recommendation. In Proceedings of the World Wide Web Conference (WWW’19). Association for Computing Machinery, New York, NY, 2000–2010.
[112]
Xiang Wang, Xiangnan He, Yixin Cao, Meng Liu, and Tat-Seng Chua. 2019. KGAT: Knowledge graph attention network for recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD’19). Association for Computing Machinery, New York, NY, 950–958.
[113]
Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. 2019. Neural graph collaborative filtering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’19). Association for Computing Machinery, New York, NY, 165–174.
[114]
Xiang Wang, Yaokun Xu, Xiangnan He, Yixin Cao, Meng Wang, and Tat-Seng Chua. 2020. Reinforced negative sampling over knowledge graph for recommendation. In Proceedings of the World Wide Web Conference 2020 (WWW’20). Association for Computing Machinery, New York, NY, 99–109.
[115]
Zengmao Wang, Yuhong Guo, and Bo Du. 2018. Matrix completion with preference ranking for top-n recommendation. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI’18). International Joint Conferences on Artificial Intelligence Organization, 3585–3591.
[116]
Bin Wu, Zhongchuan Sun, Xiangnan He, Xiang Wang, and Jonathan Staniforth. 2017. NeuRec. Retrieved from https://github.com/wubinzzu/NeuRec.
[117]
Ga Wu, Maksims Volkovs, Chee Loong Soon, Scott Sanner, and Himanshu Rai. 2019. Noise contrastive estimation for one-class collaborative filtering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’19). Association for Computing Machinery, New York, NY, 135–144.
[118]
Yao Wu, Christopher DuBois, Alice X. Zheng, and Martin Ester. 2016. Collaborative denoising auto-encoders for top-n recommender systems. In Proceedings of the 9th ACM International Conference on Web Search and Data Mining (WSDM’16). Association for Computing Machinery, New York, NY, 153–162.
[119]
Xin Xin, Bo Chen, Xiangnan He, Dong Wang, Yue Ding, and Joemon Jose. 2019. CFM: Convolutional factorization machines for context-aware recommendation. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI’19). International Joint Conferences on Artificial Intelligence Organization, 3926–3932.
[120]
Xin Xin, Xiangnan He, Yongfeng Zhang, Yongdong Zhang, and Joemon Jose. 2019. Relational collaborative filtering: Modeling multiple item relations for recommendation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’19). Association for Computing Machinery, New York, NY, 125–134.
[121]
Fengli Xu, Jianxun Lian, Zhenyu Han, Yong Li, Yujian Xu, and Xing Xie. 2019. Relation-aware graph convolutional networks for agent-initiated social e-commerce recommendation. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM’19). Association for Computing Machinery, New York, NY, 529–538.
[122]
Feng Xue, Xiangnan He, Xiang Wang, Jiandong Xu, Kai Liu, and Richang Hong. 2019. Deep item-based collaborative filtering for top-n recommendation. ACM Trans. Inf. Syst. 37, 3 (2019), 1–25.
[123]
Hong-Jian Xue, Xinyu Dai, Jianbing Zhang, Shujian Huang, and Jiajun Chen. 2017. Deep matrix factorization models for recommender systems. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI’17), Vol. 17. Melbourne, Australia, 3203–3209.
[124]
Longqi Yang, Yin Cui, Yuan Xuan, Chenyang Wang, Serge Belongie, and Deborah Estrin. 2018. Unbiased offline recommender evaluation for missing-not-at-random implicit feedback. In Proceedings of the 12th ACM Conference on Recommender Systems (RecSys’18). Association for Computing Machinery, New York, NY, 279–287.
[125]
Yonghui Yang, Le Wu, Richang Hong, Kun Zhang, and Meng Wang. 2021. Enhanced graph learning for collaborative filtering via mutual information maximization. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 71–80.
[126]
Hsiang-Fu Yu, Nikhil Rao, and Inderjit S. Dhillon. 2016. Temporal regularized matrix factorization for high-dimensional time series prediction. In Proceedings of the Conference and Workshop on Neural Information Processing Systems (NIPS’16). 847–855.
[127]
Junliang Yu, Min Gao, Jundong Li, Hongzhi Yin, and Huan Liu. 2018. Adaptive implicit friends identification over heterogeneous network for social recommendation. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. ACM, 357–366.
[128]
Lu Yu, Chuxu Zhang, Shichao Pei, Guolei Sun, and Xiangliang Zhang. 2018. WalkRanker: A unified pairwise ranking model with multiple relations for item recommendation.
[129]
Wenhui Yu and Zheng Qin. 2020. Sampler design for implicit feedback data by noisy-label robust learning. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’20). Association for Computing Machinery, New York, NY, 861–870.
[130]
Wenhui Yu, Huidi Zhang, Xiangnan He, Xu Chen, Li Xiong, and Zheng Qin. 2018. Aesthetic-based clothing recommendation. In Proceedings of the World Wide Web Conference (WWW’18). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 649–658.
[131]
Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. 2019. Deep learning based recommender system: A survey and new perspectives. ACM Comput. Surv. 52, 1 (2019), 5:1–5:38.
[132]
Shuai Zhang, Lina Yao, Lucas Vinh Tran, Aston Zhang, and Yi Tay. 2019. Quaternion collaborative filtering for recommendation. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI’19). International Joint Conferences on Artificial Intelligence Organization, 4313–4319.
[133]
Shuai Zhang, Lina Yao, and Xiwei Xu. 2017. AutoSVD++: An efficient hybrid collaborative filtering model via contractive auto-encoders. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 957–960.
[134]
Yongfeng Zhang, Qingyao Ai, Xu Chen, and W. Bruce Croft. 2017. Joint representation learning for top-n recommendation with heterogeneous information sources. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (CIKM’17). Association for Computing Machinery, New York, NY, 1449–1458.
[135]
Yuan Zhang, Xiaoran Xu, Hanning Zhou, and Yan Zhang. 2020. Distilling structured knowledge into embeddings for explainable and accurate recommendation. In Proceedings of the 13th International Conference on Web Search and Data Mining (WSDM’20). Association for Computing Machinery, New York, NY, 735–743.
[136]
Yan Zhang, Hongzhi Yin, Zi Huang, Xingzhong Du, Guowu Yang, and Defu Lian. 2018. Discrete deep learning for fast content-aware recommendation. In Proceedings of the 11th ACM International Conference on Web Search and Data Mining (WSDM’18). Association for Computing Machinery, New York, NY, 717–726.
[137]
Feipeng Zhao and Yuhong Guo. 2017. Learning discriminative recommendation systems with side information. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI’17). 3469–3475.
[138]
Wayne Xin Zhao, Junhua Chen, Pengfei Wang, Qi Gu, and Ji-Rong Wen. 2020. Revisiting alternative experimental settings for evaluating top-n item recommendation algorithms. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 2329–2332.
[139]
Wayne Xin Zhao, Shanlei Mu, Yupeng Hou, Zihan Lin, Yushuo Chen, Xingyu Pan, Kaiyuan Li, Yujie Lu, Hui Wang, Changxin Tian, et al. 2021. RecBole: Towards a unified, comprehensive and efficient framework for recommendation algorithms. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 4653–4664.
[140]
Han Zhu, Xiang Li, Pengye Zhang, Guozheng Li, Jie He, Han Li, and Kun Gai. 2018. Learning tree-based deep model for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD’18). Association for Computing Machinery, New York, NY, 1079–1088.

    Published In

    ACM Transactions on Information Systems  Volume 41, Issue 2
    April 2023
    770 pages
    ISSN:1046-8188
    EISSN:1558-2868
    DOI:10.1145/3568971

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 21 December 2022
    Online AM: 27 June 2022
    Accepted: 28 May 2022
    Revised: 10 March 2022
    Received: 14 May 2021
    Published in TOIS Volume 41, Issue 2

    Author Tags

    1. Top-N recommendation
    2. evaluation
    3. experimental setup

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • National Natural Science Foundation of China
    • Beijing Natural Science Foundation
    • Beijing Outstanding Young Scientist Program
    • Beijing Academy of Artificial Intelligence (BAAI)
