
MAPX: controlled data migration in the expansion of decentralized object-based storage systems

Published: 24 February 2020

Abstract

Data placement is critical to the scalability of decentralized object-based storage systems. The state-of-the-art CRUSH placement method is a decentralized algorithm that deterministically places object replicas onto storage devices without relying on a central directory. While enjoying the benefits of decentralization, such as high scalability, robustness, and performance, CRUSH-based storage systems suffer from uncontrolled data migration when clusters are expanded, which causes significant performance degradation when the expansion is nontrivial.
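To make the migration problem concrete, here is a minimal Python sketch of deterministic, directory-free placement. It deliberately uses a simple hash-modulo rule as a toy stand-in for CRUSH (which actually performs pseudo-random hierarchical bucket selection); the object name, device counts, and replica scheme are illustrative only.

```python
import hashlib

def place(obj_id: str, num_devices: int, replicas: int = 3) -> list[int]:
    # Toy stand-in for CRUSH: map an object to devices using only a hash
    # of its name and the current cluster size -- no central directory.
    h = int(hashlib.sha256(obj_id.encode()).hexdigest(), 16)
    return [(h + i) % num_devices for i in range(replicas)]

# Expanding the cluster changes the computed mapping for most existing
# objects, so their replicas must be migrated to the new devices.
print(place("vol-42/obj-7", num_devices=10))  # placement before expansion
print(place("vol-42/obj-7", num_devices=12))  # usually different afterwards
```

Because the mapping is a pure function of the object name and the cluster descriptor, growing the cluster silently re-maps most existing objects, and every re-mapped replica has to be migrated.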
This paper presents MAPX, a novel extension to CRUSH that uses an extra time-dimension mapping (from object creation times to cluster expansion times) to achieve controlled data migration during cluster expansions. Each expansion is viewed as a new layer of the CRUSH map, represented by a virtual node beneath the CRUSH root. MAPX controls the mapping from objects to layers by manipulating the timestamps of the intermediate placement groups (PGs). MAPX is applicable to a large variety of object-based storage scenarios where object timestamps can be maintained as higher-level metadata. For example, we apply MAPX to Ceph-RBD by extending the RBD metadata structure to maintain and retrieve approximate object creation times at the granularity of expansion layers. Experimental results show that the MAPX-based migration-free system outperforms the CRUSH-based system (which is busy migrating objects after expansions) by up to 4.25× in tail latency.
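Below is a hypothetical Python sketch of the time-dimension mapping described above, under assumed structures: expansion_times records when each layer was created, layer_devices lists each layer's devices, and the inner hash-based placement again stands in for CRUSH within a layer. The names and data layout are illustrative, not the paper's actual implementation.

```python
import bisect
import hashlib

# Hypothetical layer metadata (illustrative values, not the paper's format):
# each cluster expansion adds a layer with its own devices and a timestamp.
expansion_times = [0, 1_600_000_000, 1_650_000_000]
layer_devices = [
    list(range(0, 10)),    # initial cluster (layer 0)
    list(range(10, 16)),   # first expansion (layer 1)
    list(range(16, 24)),   # second expansion (layer 2)
]

def mapx_place(obj_id: str, obj_ctime: int, replicas: int = 3) -> list[int]:
    # Time-dimension mapping: the object's creation time selects the layer
    # (virtual node beneath the CRUSH root) that was current at creation.
    layer = bisect.bisect_right(expansion_times, obj_ctime) - 1
    devices = layer_devices[layer]
    # Deterministic placement restricted to that layer's devices, so a
    # later expansion never re-maps this object.
    h = int(hashlib.sha256(obj_id.encode()).hexdigest(), 16)
    return [devices[(h + i) % len(devices)] for i in range(replicas)]

# Objects created before the second expansion keep their old placement;
# objects created after it land on the new layer's devices.
print(mapx_place("vol-42/obj-7", obj_ctime=1_620_000_000))  # layer 1 devices
print(mapx_place("vol-42/obj-8", obj_ctime=1_700_000_000))  # layer 2 devices
```

Because an object's layer is fixed by its creation time, adding a new layer never re-maps existing objects; when migration is actually wanted, it can still be triggered in a controlled way by adjusting a PG's timestamp, as the abstract describes.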



Published In
FAST'20: Proceedings of the 18th USENIX Conference on File and Storage Technologies
February 2020
337 pages
ISBN:9781939133120

Sponsors

  • VMware
  • NetApp
  • Google Inc.
  • NSF

Publisher

USENIX Association

United States
