DOI: 10.5555/3571885.3571896
research-article

P-massive: a real-time search engine for a multi-terabyte mass spectrometry database

Published: 18 November 2022

Abstract

Queries of multi-TB Mass Spectrometry (MS) repositories provide deep insights into biological processes and pose challenging data processing problems. The key bottleneck for running these queries is the number of small random reads. Byte-addressable persistent main memory (PMEM) technologies enable real-time MS search systems by delivering low-latency, high-bandwidth storage.
This work presents P-Massive, a real-time, multi-terabyte-scale MS search system. P-Massive takes advantage of PMEM and the underlying nature of its data access patterns to maximize performance. We evaluate P-Massive across various storage hierarchies and project forward over the next decade to understand how MS query systems might evolve.
Our evaluation shows that P-Massive offers a cost-effective solution that achieves near-DRAM performance. A single query takes 1.7 seconds in P-Massive, 69× faster than a state-of-the-art implementation. In an end-to-end, user-facing application, P-Massive delivers a 90% shorter wait time than the latest MS search tool, returning results within seconds rather than minutes.

Supplementary Material

MP4 File (SC22_Presentation_Batsoyol.mp4)
Presentation at SC '22



Published In

SC '22: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
November 2022
1277 pages
ISBN: 9784665454445

Sponsors

In-Cooperation

  • IEEE CS

Publisher

IEEE Press


Author Tags

  1. bioinformatics
  2. indexing
  3. mass spectrometry
  4. nonvolatile memory
  5. search engines

Qualifiers

  • Research-article

Conference

SC '22

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
