research-article

Put an elephant into a fridge: optimizing cache efficiency for in-memory key-value stores

Published: 01 May 2020

Abstract

In today's data centers, in-memory key-value systems such as Memcached and Redis play an indispensable role in providing high-speed data services. The rapidly growing capacity and quickly falling price of DRAM in recent years have made it possible to build large memory-based key-value stores that serve hundreds of gigabytes, or even terabytes, of key-value data entirely from memory. Unfortunately, CPU caches in modern processors have not seen similar growth in capacity and remain at a few dozen megabytes. Such an extremely low cache-to-memory ratio (less than 0.1%) poses a significant new challenge: the limited CPU cache is becoming a severe performance bottleneck that prevents us from fully exploiting the potential of high-speed memory-based key-value stores.
To address this critical challenge, we propose a highly cache-efficient scheme, called Cavast, to optimize the cache utilization of large-capacity in-memory key-value stores. Our goal is to maximize cache efficiency and system performance without any hardware changes. We first present two lightweight, software-only mechanisms that enable users to indirectly control the cache content at the application level. We then propose a set of optimization policies to address several critical design issues that impair cache efficacy in current key-value store systems. By carefully reorganizing the data layout in memory, redesigning the hash indexing structure, and offloading garbage collection, we can effectively improve the utilization of the limited cache space. We have developed a Linux kernel module to provide kernel-level support, and implemented two prototypes of the proposed Cavast scheme based on Memcached and Redis. Our experimental studies show promising results. On a 6-core Intel Xeon processor with only 15 MB of cache, we can raise the cache hit ratio to as much as 82.7% at a very small cache-to-memory ratio (0.023%), and increase key-value system throughput by a factor of up to 4.2.
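
To make the cache-to-memory ratios quoted above concrete, the following minimal sketch works through the arithmetic. It is an illustration only, not part of Cavast: the function names are hypothetical, binary units (1 MB = 2^20 bytes) are assumed, and the 32 MB / 64 GB pair is an invented example rather than a configuration from the paper. The abstract does not state the data-set size used in the experiments; the sketch only derives the size implied by a 15-MB cache at a 0.023% ratio.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def cache_to_memory_ratio(cache_bytes: int, memory_bytes: int) -> float:
    """Return the cache capacity as a percentage of the memory capacity."""
    return 100.0 * cache_bytes / memory_bytes


def memory_for_ratio(cache_bytes: int, ratio_percent: float) -> float:
    """Return the memory size (in bytes) at which the cache is ratio_percent of it."""
    return cache_bytes / (ratio_percent / 100.0)


# Hypothetical example: a 32 MB cache in front of 64 GB of key-value data
# is already below the 0.1% ratio mentioned in the abstract.
example_ratio = cache_to_memory_ratio(32 * MIB, 64 * GIB)

# Memory size implied by the abstract's figures (15 MB cache, 0.023% ratio),
# assuming binary units.
implied_memory_gib = memory_for_ratio(15 * MIB, 0.023) / GIB

if __name__ == "__main__":
    print(f"example ratio: {example_ratio:.4f}%")
    print(f"implied memory: {implied_memory_gib:.1f} GiB")
```

Under these assumptions, a 0.023% ratio with a 15-MB cache corresponds to roughly 64 GB of in-memory data, which shows how small a share of the working data the cache can hold at any time.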

Published In

Proceedings of the VLDB Endowment, Volume 13, Issue 9
May 2020
295 pages
ISSN: 2150-8097

Publisher

VLDB Endowment
