DOI: 10.1145/3458817.3476195
research-article
Open access

Resilient error-bounded lossy compressor for data transfer

Published: 13 November 2021

Abstract

Today's exascale scientific applications and advanced instruments produce vast volumes of data that must be shared or transferred over networks or devices with relatively low bandwidth (e.g., data sharing over a WAN or transfer from edge devices to supercomputers). Lossy compression is one candidate strategy for addressing this big-data problem. However, little work has been done to make it resilient against silent errors, which may occur during compression or data transfer. In this paper, we propose a resilient error-bounded lossy compressor based on the SZ compression framework. Specifically, we design a new independent-block-wise model that decomposes the entire dataset into many independent sub-blocks for compression. We then design and implement a series of error detection and correction strategies tailored to each stage of SZ. Our method is arguably the first algorithm-based fault tolerance (ABFT) solution for lossy compression. The proposed solution incurs negligible execution overhead in the fault-free case. When soft errors occur, it ensures that the decompressed data remain strictly within the user-specified error bound, with very limited degradation of compression ratio and low overhead.
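To make the independent-block-wise idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes a 1-D list of floats, plain uniform quantization in place of SZ's prediction and encoding stages, and a CRC32 per block in place of the paper's ABFT checks. All function names and the block_size default are hypothetical. It shows only that a silent error in one block's payload can be detected and confined to that block when blocks are compressed independently.

```python
import struct
import zlib


def compress_block(values, error_bound):
    """Quantize one block so that |x - x'| <= error_bound, then pack it."""
    # Uniform quantization with bin width 2*error_bound (an assumption;
    # SZ itself quantizes prediction residuals, which is not modeled here).
    codes = [round(v / (2.0 * error_bound)) for v in values]
    payload = zlib.compress(struct.pack(f"<{len(codes)}q", *codes))
    # A per-block CRC32 stands in for the paper's per-stage checks.
    return payload, zlib.crc32(payload)


def decompress_block(payload, checksum, error_bound):
    """Verify the block checksum, then reconstruct the values."""
    if zlib.crc32(payload) != checksum:
        raise ValueError("block checksum mismatch")
    raw = zlib.decompress(payload)
    codes = struct.unpack(f"<{len(raw) // 8}q", raw)
    return [c * 2.0 * error_bound for c in codes]


def compress(data, error_bound, block_size=256):
    """Split the data into independent blocks and compress each one."""
    return [
        compress_block(data[i:i + block_size], error_bound)
        for i in range(0, len(data), block_size)
    ]


def decompress(blocks, error_bound):
    """Decompress every block; report indices of corrupted blocks."""
    restored, corrupted = [], []
    for index, (payload, checksum) in enumerate(blocks):
        try:
            restored.append(decompress_block(payload, checksum, error_bound))
        except ValueError:
            # Blocks are independent, so only this one is lost and only
            # this one would need to be re-sent or recompressed.
            corrupted.append(index)
            restored.append(None)
    return restored, corrupted
```

In this sketch a detected error is only reported, not corrected. The abstract states that the proposed compressor also corrects errors at each stage of SZ, which is beyond what this illustration covers.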

Supplementary Material

MP4 File (Resilient Error-Bounded Lossy Compressor for Data Transfer.mp4.mp4)
Presentation video

Published In

SC '21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2021
1493 pages
ISBN: 9781450384421
DOI: 10.1145/3458817
This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. algorithm based fault tolerance
  2. data transfer
  3. lossy compression

Qualifiers

  • Research-article

Conference

SC '21
Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
