Mass spectrometry searches using MASST

M Wang, AK Jarmusch, F Vargas, AA Aksenov… - Nature …, 2020 - nature.com
Nature Biotechnology, 2020
To the Editor: We introduce a web-enabled mass spectrometry (MS) search engine, named Mass Spectrometry Search Tool (MASST; https://masst.ucsd.edu). By enabling searches of all small-molecule tandem MS (MS/MS) data in public metabolomics repositories, we posit that MASST will unlock these resources for clinical, environmental and natural product applications.

Introduced in 1990, a tool for discovering related protein or gene sequences named Basic Local Alignment Search Tool (BLAST) enabled researchers to query entire public sequence data repositories through a web interface (WebBLAST; https://blast.ncbi.nlm.nih.gov/Blast.cgi) [1]. WebBLAST is one of the most widely cited and used bioinformatics tools because it permits any researcher to answer simple questions, such as 'is a protein or DNA sequence common or rare?'. In the early days of public gene and protein databases, metadata, which include descriptions of sample, population or technical details, were limited. No deposition standards existed, except for the Short Read Archive and European Nucleotide Archive, which include experimental details for sequencing, instrumental details and sample description, such as the source of a sample. The current status of much MS data in the public domain is reminiscent of the DNA databanks of the 1990s.

To increase usage and unlock the potential of openly available MS resources, we set out to build an infrastructure to enable WebBLAST for MS. Algorithms developed for MS data, including molecular networking [2] and fragmentation trees [3], enable similarity searches against reference libraries of known molecules, whereas powerful metabolomics analysis software infrastructures, such as MS-DIAL [4], MetaboAnalyst [5], XCMS Online [6] and HMDB [7], focus on annotation of MS/MS spectra, or finding statistical relationships between molecular features.
However, none of the existing tools enables searching a single MS/MS spectrum against public repository data, including data from unknown molecules, for identical or analogous MS/MS spectra. Finding specific MS/MS spectra of interest, including unannotated spectra or structural analogs, in public repositories of metabolomics MS data and natural product MS data is not possible.

Deposition of untargeted MS data in the public domain is experiencing rapid growth. In March 2017, 910 metabolomics datasets were available [8]; by January 2019, there were >2,000 downloadable metabolomics datasets (about half of these datasets contain MS/MS data) [9]. Despite the availability of metabolomics and natural product data, including environmental and clinical MS datasets, public small-molecule MS data are rarely reused [10]. Now that a large number of small-molecule untargeted MS datasets are publicly available (~1,100 untargeted datasets and ~110,000,000 spectra in ~150,000 files as of December 11, 2018), we felt that the time was right to develop MASST, to enable reuse of these MS data.

MASST comprises a web-based system to search the public data repository part of the GNPS/MassIVE knowledge base [11] and an analysis infrastructure for a single MS/MS spectrum. The developments required for MASST searches included converting deposited public data to a uniform open format [12] (irrespective of instrument type and original data format), the ability to trace the file from which each MS/MS spectrum originated, and a reporting system that shows all identical or similar MS/MS spectra found in public data along with their associated metadata. MASST development has been possible for two main reasons: first, adoption of universal, non-vendor-specific MS data formats has increased, which …
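
To make the idea of a single-spectrum search concrete, the following is a minimal sketch of how one query MS/MS spectrum might be compared against a collection of stored spectra while keeping track of the file each hit came from. It is not MASST's implementation: the text above does not specify the scoring method, so a simple greedy cosine similarity is assumed here, and the `Spectrum` class, the `cosine_score` and `search` functions, the tolerances, the score threshold and the file names are all hypothetical.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Tuple


@dataclass
class Spectrum:
    source_file: str                   # originating file, kept for traceability
    precursor_mz: float                # precursor m/z of the MS/MS spectrum
    peaks: List[Tuple[float, float]]   # (fragment m/z, intensity) pairs


def cosine_score(a, b, tol=0.02):
    """Greedy cosine similarity (an assumed scoring scheme, not MASST's).

    Peaks match if their m/z values differ by at most `tol`; each peak is
    used at most once, taking the strongest pairs first.
    """
    norm_a = sqrt(sum(i * i for _, i in a))
    norm_b = sqrt(sum(i * i for _, i in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    pairs = sorted(
        ((ia * ib, x, y)
         for x, (ma, ia) in enumerate(a)
         for y, (mb, ib) in enumerate(b)
         if abs(ma - mb) <= tol),
        reverse=True,
    )
    used_a, used_b, dot = set(), set(), 0.0
    for prod, x, y in pairs:
        if x not in used_a and y not in used_b:
            used_a.add(x)
            used_b.add(y)
            dot += prod
    return dot / (norm_a * norm_b)


def search(query, library, precursor_tol=2.0, min_score=0.7):
    """Return (score, source_file) hits, best first (hypothetical thresholds)."""
    hits = []
    for s in library:
        # Skip spectra whose precursor m/z is too far from the query's.
        if abs(s.precursor_mz - query.precursor_mz) > precursor_tol:
            continue
        score = cosine_score(query.peaks, s.peaks)
        if score >= min_score:
            hits.append((round(score, 3), s.source_file))
    return sorted(hits, reverse=True)


# Hypothetical data: one query and three stored spectra.
query = Spectrum("query", 300.0, [(100.0, 50.0), (150.0, 100.0), (200.0, 20.0)])
library = [
    Spectrum("dataset_A/sample_01", 300.0,
             [(100.01, 50.0), (150.0, 100.0), (200.0, 20.0)]),
    Spectrum("dataset_B/sample_02", 300.5,
             [(100.0, 50.0), (150.0, 100.0), (250.0, 80.0)]),
    Spectrum("dataset_C/sample_03", 450.0,
             [(100.0, 50.0), (150.0, 100.0), (200.0, 20.0)]),
]
hits = search(query, library)
for score, source in hits:
    print(score, source)
```

In this toy example the third spectrum is excluded by the precursor filter even though its fragments match, and each remaining hit carries its source file, which is the hook a reporting layer would use to attach dataset metadata. A production system searching on the order of 10^8 spectra would need indexing rather than the linear scan shown here.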