XORing elephants: novel erasure codes for big data

Authors:

Maheswaran Sathiamoorthy,

Megasthenis Asteris,

Dimitris Papailiopoulos,

Alexandros G. Dimakis,

Ramkumar Vadali,

Dhruba Borthakur

Proceedings of the VLDB Endowment, Volume 6, Issue 5

Pages 325 - 336

https://doi.org/10.14778/2535573.2488339

Published: 01 March 2013

Abstract

Distributed storage systems for large clusters typically use replication to provide reliability. Recently, erasure codes have been used to reduce the large storage overhead of three-replicated systems. Reed-Solomon codes are the standard design choice and their high repair cost is often considered an unavoidable price to pay for high storage efficiency and high reliability.

This paper shows how to overcome this limitation. We present a novel family of erasure codes that are efficiently repairable and offer higher reliability than Reed-Solomon codes. We show analytically that our codes are optimal on a recently identified tradeoff between locality and minimum distance.

We implement our new codes in Hadoop HDFS and compare them to a currently deployed HDFS module that uses Reed-Solomon codes. Our modified HDFS implementation shows a reduction of approximately 2× in repair disk I/O and repair network traffic. The disadvantage of the new coding scheme is that it requires 14% more storage than Reed-Solomon codes, an overhead shown to be information-theoretically optimal for obtaining locality. Because the new codes repair failures faster, they provide higher reliability, orders of magnitude higher than replication.
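The locality idea behind these numbers can be illustrated with a minimal sketch. It is not the authors' HDFS implementation. It assumes a stripe of 10 data blocks protected by 4 Reed-Solomon parities plus 2 stored local parities, which is our reading of the configuration in the full paper and is not stated in the abstract. It also simplifies each local parity to a plain XOR over a group of 5 data blocks, and every name in it (`xor_blocks`, `repair_local`, the block contents) is hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of a list of equally sized byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical stripe: 10 tiny data blocks, split into two local groups of 5.
data = [bytes([i] * 4) for i in range(1, 11)]
groups = [data[:5], data[5:]]

# One XOR local parity per group (a simplification of the paper's construction).
local_parities = [xor_blocks(group) for group in groups]

def repair_local(group, parity, lost_index):
    """Rebuild one lost block from the other blocks of its group plus the parity.

    Returns the rebuilt block and the number of blocks that had to be read.
    """
    survivors = [b for i, b in enumerate(group) if i != lost_index]
    return xor_blocks(survivors + [parity]), len(survivors) + 1

# Lose the third block of the first group and rebuild it locally.
repaired, blocks_read = repair_local(groups[0], local_parities[0], 2)

# Assumed block counts: Reed-Solomon stores 10 + 4, the local code stores 10 + 4 + 2.
rs_blocks = 10 + 4
lrc_blocks = 10 + 4 + 2
extra_storage = lrc_blocks / rs_blocks - 1   # fraction of additional storage
rs_repair_reads = 10                         # Reed-Solomon reads 10 blocks to rebuild one
```

Under these assumptions a single lost block is rebuilt from 5 blocks instead of 10, which is in line with the roughly 2× reduction in repair traffic, and storing 16 blocks instead of 14 accounts for the 14% storage overhead.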

Index Terms

XORing elephants: novel erasure codes for big data
1. Information systems
  1. Data management systems
    1. Database management system engines
  2. Information systems applications

Published In

Proceedings of the VLDB Endowment Volume 6, Issue 5

March 2013

60 pages

ISSN:2150-8097

Publisher

VLDB Endowment

Qualifiers

Article
