
On the Feasibility and Benefits of Extensive Evaluation

Published: 30 September 2024

Abstract

Benchmark and system parameters often have a significant impact on performance evaluation results, which raises a long-standing question: which settings should we use?
This paper studies the feasibility and benefits of extensive evaluation. A full extensive evaluation, which tests all possible settings, is usually too expensive. This work therefore investigates whether it is possible to sample a subset of the settings and, from that subset, generate observations that match those of a full extensive evaluation. Towards this goal, we have explored an incremental sampling approach (a sketch follows the abstract): it starts by measuring a small subset of random settings, builds a prediction model on these samples using the popular ANOVA approach, adds more samples if the model is not accurate enough, and terminates otherwise.
To summarize our findings: 1) Enhancing a research prototype to support extensive evaluation mostly involves changing hard-coded configurations, which does not take much effort. 2) Some systems are highly predictable, meaning they can achieve accurate predictions at a low sampling rate, while others are less predictable. 3) We have not found a method that consistently outperforms random sampling combined with ANOVA. Based on these findings, we provide recommendations to improve artifact predictability and strategies for selecting parameter values during evaluation.
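
The following is a minimal sketch of the incremental sampling loop described above. The parameter grid, the synthetic measure() stand-in, the batch size, and the 10% error target are illustrative assumptions rather than the paper's actual artifact or settings, and the ANOVA-style model is approximated here by a linear regression over main effects and pairwise interactions.

```python
# Minimal sketch of incremental sampling + an ANOVA-style prediction model.
# All parameter names, values, and thresholds below are illustrative assumptions.
import itertools
import random

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical parameter space: every combination is one "setting".
PARAM_GRID = {
    "clients":     [1, 4, 16, 64],
    "record_size": [64, 256, 1024],
    "read_ratio":  [0.5, 0.9, 0.99],
}
ALL_SETTINGS = [dict(zip(PARAM_GRID, vals))
                for vals in itertools.product(*PARAM_GRID.values())]


def measure(setting):
    """Stand-in for one benchmark run returning, e.g., throughput.
    The synthetic formula exists only to make the sketch executable;
    replace it with a real measurement of the system under test."""
    return (1000.0 * setting["read_ratio"] * setting["clients"]
            / (1.0 + setting["record_size"] / 512.0))


def to_features(settings):
    """Encode settings as main effects plus pairwise interactions,
    approximating an ANOVA-style model with a linear regression."""
    X = np.array([[s[k] for k in PARAM_GRID] for s in settings], dtype=float)
    return PolynomialFeatures(degree=2, interaction_only=True,
                              include_bias=False).fit_transform(X)


def incremental_sampling(batch_size=5, max_rate=0.5, target_error=0.10):
    """Measure random settings in batches until the model predicts held-out
    samples within target_error, or the sampling budget is exhausted."""
    pool = ALL_SETTINGS.copy()
    random.shuffle(pool)
    sampled, results = [], []
    while pool and len(sampled) < max_rate * len(ALL_SETTINGS):
        # 1) Measure a new batch of randomly chosen settings.
        batch = [pool.pop() for _ in range(min(batch_size, len(pool)))]
        sampled.extend(batch)
        results.extend(measure(s) for s in batch)
        # 2) Fit the model on most samples; hold out the rest for validation.
        X, y = to_features(sampled), np.array(results)
        n_val = max(1, len(sampled) // 5)
        model = LinearRegression().fit(X[:-n_val], y[:-n_val])
        # 3) Stop once the relative prediction error is below the target.
        rel_err = np.mean(np.abs(model.predict(X[-n_val:]) - y[-n_val:])
                          / np.abs(y[-n_val:]))
        if rel_err <= target_error:
            break
    # Refit on everything measured before returning.
    return LinearRegression().fit(to_features(sampled), np.array(results)), sampled


model, sampled = incremental_sampling()
print(f"stopped after measuring {len(sampled)} of {len(ALL_SETTINGS)} settings")
```

In practice, a cross-validated error estimate over the measured samples (or a separate set of measured validation settings) would be a more robust stopping check than the single hold-out slice used here for brevity.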

    Published In

Proceedings of the ACM on Management of Data, Volume 2, Issue 4 (SIGMOD), September 2024, 458 pages
EISSN: 2836-6573
DOI: 10.1145/3698442

    Publisher

Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. benchmarking
    2. database performance evaluation

    Qualifiers

    • Research-article
