
MAPX: controlled data migration in the expansion of decentralized object-based storage systems

Published: 24 February 2020

Abstract

Data placement is critical to the scalability of decentralized object-based storage systems. The state-of-the-art CRUSH placement method is a decentralized algorithm that deterministically places object replicas onto storage devices without relying on a central directory. While enjoying the benefits of decentralization, such as high scalability, robustness, and performance, CRUSH-based storage systems suffer from uncontrolled data migration when clusters are expanded, which causes significant performance degradation when the expansion is nontrivial.
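To make the migration problem concrete, here is a minimal Python sketch of deterministic, directory-free placement. It deliberately uses a simple hash-modulo rule as a toy stand-in for CRUSH (which actually performs pseudo-random hierarchical bucket selection); the object name, device counts, and replica scheme are illustrative only.

```python
import hashlib

def place(obj_id: str, num_devices: int, replicas: int = 3) -> list[int]:
    # Toy stand-in for CRUSH: map an object to devices using only a hash
    # of its name and the current cluster size -- no central directory.
    h = int(hashlib.sha256(obj_id.encode()).hexdigest(), 16)
    return [(h + i) % num_devices for i in range(replicas)]

# Expanding the cluster changes the computed mapping for most existing
# objects, so their replicas must be migrated to the new devices.
print(place("vol-42/obj-7", num_devices=10))  # placement before expansion
print(place("vol-42/obj-7", num_devices=12))  # usually different afterwards
```

Because the mapping is a pure function of the object name and the cluster descriptor, growing the cluster silently re-maps most existing objects, and every re-mapped replica has to be migrated.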
This paper presents MAPX, a novel extension to CRUSH that uses an extra time-dimension mapping (from object creation times to cluster expansion times) to achieve controlled data migration during cluster expansions. Each expansion is viewed as a new layer of the CRUSH map, represented by a virtual node beneath the CRUSH root. MAPX controls the mapping from objects to layers by manipulating the timestamps of the intermediate placement groups (PGs). MAPX is applicable to a large variety of object-based storage scenarios where object timestamps can be maintained as higher-level metadata. For example, we apply MAPX to Ceph-RBD by extending the RBD metadata structure to maintain and retrieve approximate object creation times at the granularity of expansion layers. Experimental results show that the MAPX-based migration-free system outperforms the CRUSH-based system (which is busy migrating objects after expansions) by up to 4.25× in tail latency.
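Below is a hypothetical Python sketch of the time-dimension mapping described above, under assumed structures: expansion_times records when each layer was created, layer_devices lists each layer's devices, and the inner hash-based placement again stands in for CRUSH within a layer. The names and data layout are illustrative, not the paper's actual implementation.

```python
import bisect
import hashlib

# Hypothetical layer metadata (illustrative values, not the paper's format):
# each cluster expansion adds a layer with its own devices and a timestamp.
expansion_times = [0, 1_600_000_000, 1_650_000_000]
layer_devices = [
    list(range(0, 10)),    # initial cluster (layer 0)
    list(range(10, 16)),   # first expansion (layer 1)
    list(range(16, 24)),   # second expansion (layer 2)
]

def mapx_place(obj_id: str, obj_ctime: int, replicas: int = 3) -> list[int]:
    # Time-dimension mapping: the object's creation time selects the layer
    # (virtual node beneath the CRUSH root) that was current at creation.
    layer = bisect.bisect_right(expansion_times, obj_ctime) - 1
    devices = layer_devices[layer]
    # Deterministic placement restricted to that layer's devices, so a
    # later expansion never re-maps this object.
    h = int(hashlib.sha256(obj_id.encode()).hexdigest(), 16)
    return [devices[(h + i) % len(devices)] for i in range(replicas)]

# Objects created before the second expansion keep their old placement;
# objects created after it land on the new layer's devices.
print(mapx_place("vol-42/obj-7", obj_ctime=1_620_000_000))  # layer 1 devices
print(mapx_place("vol-42/obj-8", obj_ctime=1_700_000_000))  # layer 2 devices
```

Because an object's layer is fixed by its creation time, adding a new layer never re-maps existing objects; when migration is actually wanted, it can still be triggered in a controlled way by adjusting a PG's timestamp, as the abstract describes.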



Published In
FAST'20: Proceedings of the 18th USENIX Conference on File and Storage Technologies
February 2020
337 pages
ISBN:9781939133120

Sponsors

  • VMware
  • NetApp
  • Google Inc.
  • NSF

Publisher

USENIX Association

United States
