
On the Feasibility and Benefits of Extensive Evaluation

Published: 30 September 2024

Abstract

Benchmark and system parameters often have a significant impact on performance evaluation results, which raises a long-standing question: which settings should we use?
This paper studies the feasibility and benefits of extensive evaluation. A full extensive evaluation, which tests all possible settings, is usually too expensive. This work therefore investigates whether it is possible to sample a subset of the settings and, from that subset, generate observations that match those of a full extensive evaluation. Towards this goal, we have explored an incremental sampling approach (a sketch follows the abstract): it starts by measuring a small subset of random settings, builds a prediction model on these samples using the popular ANOVA approach, adds more samples if the model is not accurate enough, and terminates otherwise.
To summarize our findings: 1) Enhancing a research prototype to support extensive evaluation mostly involves changing hard-coded configurations, which does not take much effort. 2) Some systems are highly predictable, meaning they can achieve accurate predictions at a low sampling rate, while others are less predictable. 3) We have not found a method that consistently outperforms random sampling combined with ANOVA. Based on these findings, we provide recommendations to improve artifact predictability and strategies for selecting parameter values during evaluation.
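
The following is a minimal sketch of the incremental sampling loop described above. The parameter grid, the synthetic measure() stand-in, the batch size, and the 10% error target are illustrative assumptions rather than the paper's actual artifact or settings, and the ANOVA-style model is approximated here by a linear regression over main effects and pairwise interactions.

```python
# Minimal sketch of incremental sampling + an ANOVA-style prediction model.
# All parameter names, values, and thresholds below are illustrative assumptions.
import itertools
import random

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical parameter space: every combination is one "setting".
PARAM_GRID = {
    "clients":     [1, 4, 16, 64],
    "record_size": [64, 256, 1024],
    "read_ratio":  [0.5, 0.9, 0.99],
}
ALL_SETTINGS = [dict(zip(PARAM_GRID, vals))
                for vals in itertools.product(*PARAM_GRID.values())]


def measure(setting):
    """Stand-in for one benchmark run returning, e.g., throughput.
    The synthetic formula exists only to make the sketch executable;
    replace it with a real measurement of the system under test."""
    return (1000.0 * setting["read_ratio"] * setting["clients"]
            / (1.0 + setting["record_size"] / 512.0))


def to_features(settings):
    """Encode settings as main effects plus pairwise interactions,
    approximating an ANOVA-style model with a linear regression."""
    X = np.array([[s[k] for k in PARAM_GRID] for s in settings], dtype=float)
    return PolynomialFeatures(degree=2, interaction_only=True,
                              include_bias=False).fit_transform(X)


def incremental_sampling(batch_size=5, max_rate=0.5, target_error=0.10):
    """Measure random settings in batches until the model predicts held-out
    samples within target_error, or the sampling budget is exhausted."""
    pool = ALL_SETTINGS.copy()
    random.shuffle(pool)
    sampled, results = [], []
    while pool and len(sampled) < max_rate * len(ALL_SETTINGS):
        # 1) Measure a new batch of randomly chosen settings.
        batch = [pool.pop() for _ in range(min(batch_size, len(pool)))]
        sampled.extend(batch)
        results.extend(measure(s) for s in batch)
        # 2) Fit the model on most samples; hold out the rest for validation.
        X, y = to_features(sampled), np.array(results)
        n_val = max(1, len(sampled) // 5)
        model = LinearRegression().fit(X[:-n_val], y[:-n_val])
        # 3) Stop once the relative prediction error is below the target.
        rel_err = np.mean(np.abs(model.predict(X[-n_val:]) - y[-n_val:])
                          / np.abs(y[-n_val:]))
        if rel_err <= target_error:
            break
    # Refit on everything measured before returning.
    return LinearRegression().fit(to_features(sampled), np.array(results)), sampled


model, sampled = incremental_sampling()
print(f"stopped after measuring {len(sampled)} of {len(ALL_SETTINGS)} settings")
```

In practice, a cross-validated error estimate over the measured samples (or a separate set of measured validation settings) would be a more robust stopping check than the single hold-out slice used here for brevity.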

    Published In

Proceedings of the ACM on Management of Data, Volume 2, Issue 4 (SIGMOD), September 2024, 458 pages
EISSN: 2836-6573
DOI: 10.1145/3698442

    Publisher

Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. benchmarking
    2. database performance evaluation

    Qualifiers

    • Research-article
