
Oasis: Controlling Data Migration in Expansion of Object-based Storage Systems

Published: 19 January 2023

Abstract

Object-based storage systems are widely used in scenarios such as file storage, block storage, and blob storage (e.g., large videos), where data is placed across a large number of object storage devices (OSDs). Data placement is critical for the scalability of decentralized object-based storage systems. The state-of-the-art CRUSH placement method is a decentralized algorithm that deterministically places object replicas onto storage devices without relying on a central directory. While enjoying the benefits of decentralization such as high scalability, robustness, and performance, CRUSH-based storage systems suffer from uncontrolled data migration when the capacity of the storage cluster is expanded (i.e., when new OSDs are added); this migration is inherent to CRUSH's deterministic placement and causes significant performance degradation when the expansion is nontrivial.
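The migration behavior follows from any scheme that derives placement purely from the current device set. The toy Python sketch below is only a hedged illustration of that point (a hash-ranking placement with made-up names; it is not Ceph's CRUSH code): once the OSD list grows, the computed placement changes for a large fraction of existing objects, and none of that movement is under the operator's control.

```python
# Toy illustration (NOT Ceph's CRUSH implementation): placement is a pure
# function of (object id, current OSD list), so growing the OSD list silently
# remaps many existing objects -- the uncontrolled migration the abstract refers to.
import hashlib

def place(obj_id: str, osds: list[str], replicas: int = 3) -> list[str]:
    """Rank OSDs by a per-object pseudo-random weight and keep the top ones."""
    ranked = sorted(
        osds,
        key=lambda osd: hashlib.sha1(f"{obj_id}:{osd}".encode()).hexdigest(),
    )
    return ranked[:replicas]

before = [f"osd.{i}" for i in range(10)]              # original cluster
after = before + [f"osd.{i}" for i in range(10, 15)]  # expansion adds 5 OSDs

objects = [f"obj-{i}" for i in range(10_000)]
moved = sum(place(o, before) != place(o, after) for o in objects)
print(f"{moved / len(objects):.0%} of objects change placement after the expansion")
```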
This article presents MapX, a novel extension to CRUSH that uses an extra time-dimension mapping (from object creation times to cluster expansion times) to control data migration after cluster expansions. Each expansion is viewed as a new layer of the CRUSH map, represented by a virtual node beneath the CRUSH root. MapX controls the mapping of objects onto layers by manipulating the timestamps of the intermediate placement groups (PGs). MapX is applicable to a large variety of object-based storage scenarios where object timestamps can be maintained as higher-level metadata. We have applied MapX to the state-of-the-art Ceph-RBD (RADOS Block Device) to implement a migration-controllable, decentralized object-based block store called Oasis. Oasis extends the RBD metadata structure to maintain and retrieve approximate object creation times (for migration control) at the granularity of expansion layers. Experimental results show that the MapX-based Oasis block store outperforms the CRUSH-based Ceph-RBD (which is busy migrating objects after expansions) by 3.17×–4.31× in tail latency, and by 76.3% (respectively, 83.8%) in IOPS for reads (respectively, writes).
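The layered design summarized above can be paraphrased with a small sketch. The code below is an illustrative approximation only (the Layer/choose_layer names, the integer timestamps, and the hash-ranking stand-in for intra-layer CRUSH are assumptions, not the MapX or Oasis implementation): an object's PG timestamp selects an expansion layer, and placement then runs only over that layer's OSDs, so objects created before an expansion are not remapped onto the new devices unless their PG timestamps are deliberately changed.

```python
# Hedged sketch of the time-dimension mapping described in the abstract
# (illustrative names and structures, not the MapX/Oasis code).
from dataclasses import dataclass
import bisect
import hashlib

@dataclass
class Layer:
    expansion_time: int   # when this expansion (a virtual node under the CRUSH root) was added
    osds: list[str]       # OSDs introduced by this expansion

def choose_layer(pg_timestamp: int, layers: list[Layer]) -> Layer:
    """Latest layer whose expansion time does not exceed the PG's timestamp."""
    times = [layer.expansion_time for layer in layers]   # assumed sorted ascending
    return layers[bisect.bisect_right(times, pg_timestamp) - 1]

def place(obj_id: str, pg_timestamp: int, layers: list[Layer], replicas: int = 3) -> list[str]:
    layer = choose_layer(pg_timestamp, layers)
    # Stand-in for running CRUSH within the chosen layer only.
    ranked = sorted(
        layer.osds,
        key=lambda osd: hashlib.sha1(f"{obj_id}:{osd}".encode()).hexdigest(),
    )
    return ranked[:replicas]

layers = [
    Layer(expansion_time=0,   osds=[f"osd.{i}" for i in range(10)]),      # initial cluster
    Layer(expansion_time=100, osds=[f"osd.{i}" for i in range(10, 15)]),  # later expansion
]
print(place("obj-42", pg_timestamp=50,  layers=layers))  # stays within the old layer
print(place("obj-99", pg_timestamp=120, layers=layers))  # lands on the new layer
```

In this reading, migration becomes an explicit choice rather than a side effect: rewriting a PG's timestamp moves its objects to another layer, while leaving the timestamp unchanged keeps them in place.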


Published In

ACM Transactions on Storage, Volume 19, Issue 1 (February 2023), 259 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/3578369

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 January 2023
Online AM: 19 November 2022
Accepted: 24 May 2022
Revised: 01 January 2022
Received: 02 March 2021
Published in TOS Volume 19, Issue 1

Author Tags

  1. Object storage
  2. cluster expansion
  3. data migration
  4. data placement
  5. CRUSH
  6. block storage system

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • National Key R&D Program of China
  • National Natural Science Foundation of China
  • Program of Shanghai Academic Research Leader
