skip to main content
research-article

Compressed suffix trees: Efficient computation and storage of LCP-values

Published: 17 May 2013 Publication History

Abstract

The suffix tree is a very important data structure in string processing, but typical implementations suffer from huge space consumption. In large-scale applications, compressed suffix trees (CSTs) are therefore used instead. A CST consists of three (compressed) components: the suffix array, the longest common prefix (LCP)-array and data structures for simulating navigational operations on the suffix tree. The LCP-array stores the lengths of the LCPs of lexicographically adjacent suffixes, and it can be computed in linear time. In this article, we present a new LCP-array construction algorithm that is fast and very space efficient. In practice, our algorithm outperforms alternative algorithms. Moreover, we introduce a new compressed representation of LCP-arrays.

References

[1]
Mohamed I. Abouelhoda, Stefan Kurtz, and Enno Ohlebusch. 2004. Replacing suffix trees with enhanced suffix arrays. J. Discrete Algor. 2, 1, 53--83.
[2]
Timo Beller, Simon Gog, Enno Ohlebusch, and Thomas Schnattinger. 2011. Computing the longest common prefix array based on the Burrows-Wheeler Transform. In Proceedings of the 16th International Symposium on String Processing and Information Retrieval (SPIRE'09). Lecture Notes in Computer Science, vol. 7024, Springer, 197--208.
[3]
Nieves R. Brisaboa, Susana Ladra, and Gonzalo Navarro. 2009. Directly addressable variable-length codes, string processing and information retrieval. In Proceedings of the 16th International Symposium (SPIRE'09). Lecture Notes in Computer Science, vol. 5721, Springer, 122--130.
[4]
Rodrigo Cánovas and Gonzalo Navarro. 2010. Practical compressed suffix trees. In Proceedings of the 9th International Symposium on Experimental Algorithms (SEA'10). Lecture Notes in Computer Science, vol. 6049, Springer, 94--105.
[5]
David R. Clark. 1996. Compact pat trees. Ph.D. dissertation. University of Waterloo.
[6]
Jasbir Dhaliwal, Simon J. Puglisi, and Andrew Turpin. 2012. Practical efficient string mining. IEEE Trans. Knowl. Data Eng. 24, 4, 735--744.
[7]
Paolo Ferragina and Giovanni Manzini. 2000. Opportunistic data Structures with applications. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science. 390--398.
[8]
Johannes Fischer. 2010a. Optimal Succinctness for range minimum queries. In Proceedings of the 4th Latin American Symposium on Theoretical Informatics (LATIN'00). Lecture Notes in Computer Science, vol. 6034, Springer, 158--169.
[9]
Johannes Fischer. 2010b. Wee LCP. Inform. Process. Lett. 110, 8--9, 317--320.
[10]
Johannes Fischer, Veli Mäkinen, and Gonzalo Navarro. 2009. Faster entropy-bounded compressed suffix trees. Theor. Comput. Sci. 410, 51, 5354--5364.
[11]
Simon Gog. 2011. Compressed suffix trees: Design, construction, and applications. Ph.D. dissertation. Ulm University.
[12]
Simon Gog and Johannes Fischer. 2010. Advantages of shared data structures for sequences of balanced parentheses. In Proceedings of the Data Compression Conference (DCC'10). IEEE, 406--415.
[13]
Simon Gog and Enno Ohlebusch. 2011. Fast and lightweight LCP-array construction algorithms. In Proceedings of the Workshop on Algorithm Engineering and Experiments (ALENEX'11). SIAM, 25--34.
[14]
Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. 2003. High-order entropy-compressed text indexes. In Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'03). 841--850.
[15]
Dan Gusfield. 1997. Algorithms on Strings, Trees, and Sequences. Cambridge University Press.
[16]
Wing-Kai Hon and Kunihiko Sadakane. 2002. Space-economical algorithms for finding maximal unique matches. In Proceedings of the 13th Annual Symposium on Combinatorial Pattern Matching (CPM'02). Lecture Notes in Computer Science, vol. 2373, Springer, 144--152.
[17]
David A. Huffman. 1952. A Method for the construction of minimum-redundancy codes. Proc. Inst. Radio Eng. 40, 9, 1098--1101.
[18]
Guy Jacobson. 1989. Space-efficient static trees and graphs. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science. IEEE, 549--554.
[19]
Juha Kärkkäinen, Giovanni Manzini, and Simon J. Puglisi. 2009. Permuted Longest-common-prefix array. In Proceedings of the 20th Annual Symposium on Combinatorial Pattern Matching (CPM'09). Lecture Notes in Computer Science, vol. 5577, Springer, 181--192.
[20]
Juha Kärkkäinen and Peter Sanders. 2003. Simple Linear work suffix array construction. In Proceedings of the 30th International Colloquium on Automata, Languages and Programming (ICALP'03). Lecture Notes in Computer Science, vol. 2719, Springer, 943--955.
[21]
Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa, and Kunsoo Park. 2001. Linear-time longest-common-prefix computation in suffix arrays and its applications. In Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching (CPM'01). Lecture Notes in Computer Science, vol. 2089, Springer, 181--192.
[22]
Pang Ko and Srinivas Aluru. 2003. Space efficient linear time construction of suffix arrays. In Proceedings of the 14th Annual Symposium on Combinatorial Pattern Matching (CPM'03). Lecture Notes in Computer Science, vol. 2676, Springer, 200--210.
[23]
Markus Kowarschik and Christian Weiss. 2002. An overview of cache optimization techniques and cache-aware numerical algorithms. In Algorithms for Memory Hierarchies, Lecture Notes in Computer Science, vol. 2625, Springer, 213--232.
[24]
Stefan Kurtz. 1999. Reducing the space requirement of suffix trees. Softw., Pract. Exper. 29, 13 (1999), 1149--1171.
[25]
Veli Mäkinen. 2003. Compact suffix array—a space-efficient full-text index. Foundamenta Informaticae 56, 1--2, 191--210.
[26]
Giovanni Manzini. 2004. Two space saving tricks for linear time LCP array computation. In Proceedings of the 9th Scandinavian Workshop on Algorithm Theory (SWAT'04). Lecture Notes in Computer Science, vol. 3111, Springer, 372--383.
[27]
Giovanni Manzini and Paolo Ferragina. 2004. Engineering a lightweight suffix array construction algorithm. Algorithmica 40, 1, 33--50.
[28]
Gonzalo Navarro and Veli Mäkinen. 2007. Compressed full-text indexes. ACM Comput. Surv. 39, 1.
[29]
Ge Nong, Sen Zhang, and Wai Hong Chan. 2009. Linear suffix array construction by almost pure induced-sorting. In Proceedings of the Data Compression Conference (DCC'09). IEEE, 193--202.
[30]
Enno Ohlebusch, Johannes Fischer, and Simon Gog. 2010. CST++. In Proceedings of the 17th International Symposium on String Processing and Information Retrieval (SPIRE'10). Lecture Notes in Computer Science, vol. 6393, Springer, 322--333.
[31]
Enno Ohlebusch and Simon Gog. 2009. A Compressed enhanced suffix array supporting fast string matching. In Proceedings of the 16th International Symposium on String Processing and Information Retrieval (SPIRE'09). Lecture Notes in Computer Science, vol. 5721, Springer, 51--62.
[32]
Daisuke Okanohara and Kunihiko Sadakane. 2009. A linear-time Burrows-Wheeler transform using induced sorting. In Proceedings of the 16th International Symposium on String Processing and Information Retrieval (SPIRE'09). Lecture Notes in Computer Science, vol. 5721, Springer, 90--101.
[33]
Simon J. Puglisi, William F. Smyth, and Andrew Turpin. 2007. A taxonomy of suffix array construction algorithms. ACM Comput. Surv. 39, 2.
[34]
Simon J. Puglisi and Andrew Turpin. 2008. Space-time tradeoffs for longest-common-prefix array computation. In Proceedings of the 19th International Symposium on Algorithms and Computation (ISAAC'09). Lecture Notes in Computer Science, vol. 5369, Springer, 124--135.
[35]
Luís M. S. Russo, Gonzalo Navarro, and Arlindo L. Oliveira. 2008. Fully-compressed suffix trees. In Proceedings of the 12th Latin American Symposium on Theoretical Informatics (LATIN'08). Lecture Notes in Computer Science, vol. 4957, Springer, 362--373.
[36]
Kunihiko Sadakane. 2002. Succinct representations of LCP information and improvements in the compressed suffix arrays. In Proceedings of the 2002 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'02). SIAM, 225--232.
[37]
Kunihiko Sadakane. 2007. Compressed suffix trees with full functionality. Theory Comput. Syst. 41, 4, 589--607.
[38]
Kunihiko Sadakane and Gonzalo Navarro. 2010. Fully-functional succinct trees. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'10). SIAM, 134--149.
[39]
Jouni Sirén. 2010. Sampled longest common prefix array. In Proceedings of the 21st Annual Symposium on Combinatorial Pattern Matching (CPM'10). Lecture Notes in Computer Science, vol. 6129, Springer, 227--237.
[40]
Niko Välimäki, Veli Mäkinen, Wolfgang Gerlach, and Kashyap Dixit. 2009. Engineering a compressed suffix tree implementation. ACM J. Exp. Algor. 14, 2.

Cited By

View all

Index Terms

  1. Compressed suffix trees: Efficient computation and storage of LCP-values

                      Recommendations

                      Comments

                      Information & Contributors

                      Information

                      Published In

                      cover image ACM Journal of Experimental Algorithmics
                      ACM Journal of Experimental Algorithmics  Volume 18, Issue
                      2013
                      329 pages
                      ISSN:1084-6654
                      EISSN:1084-6654
                      DOI:10.1145/2444016
                      Issue’s Table of Contents
                      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                      Publisher

                      Association for Computing Machinery

                      New York, NY, United States

                      Publication History

                      Published: 17 May 2013
                      Accepted: 01 March 2013
                      Revised: 01 February 2013
                      Received: 01 March 2012
                      Published in JEA Volume 18

                      Author Tags

                      1. Algorithm engineering
                      2. Burrows-Wheeler transform
                      3. LCP-array
                      4. balanced parentheses
                      5. text compression
                      6. text indexing
                      7. wavelet tree

                      Qualifiers

                      • Research-article
                      • Research
                      • Refereed

                      Contributors

                      Other Metrics

                      Bibliometrics & Citations

                      Bibliometrics

                      Article Metrics

                      • Downloads (Last 12 months)10
                      • Downloads (Last 6 weeks)1
                      Reflects downloads up to 06 Jan 2025

                      Other Metrics

                      Citations

                      Cited By

                      View all
                      • (2020)Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded SpaceJournal of the ACM10.1145/337589067:1(1-54)Online publication date: 15-Jan-2020
                      • (2020)Lightweight merging of compressed indices based on BWT variantsTheoretical Computer Science10.1016/j.tcs.2019.11.001812(214-229)Online publication date: Apr-2020
                      • (2020)Inducing the LCP ArrayConstruction of Fundamental Data Structures for Strings10.1007/978-3-030-55108-7_4(43-57)Online publication date: 8-Oct-2020
                      • (2020)BackgroundConstruction of Fundamental Data Structures for Strings10.1007/978-3-030-55108-7_2(9-21)Online publication date: 8-Oct-2020
                      • (2017)Relative Suffix TreesThe Computer Journal10.1093/comjnl/bxx10861:5(773-788)Online publication date: 21-Nov-2017
                      • (2017)Burrows–Wheeler transform and LCP array construction in constant spaceJournal of Discrete Algorithms10.1016/j.jda.2016.11.00342(14-22)Online publication date: Jan-2017
                      • (2017)Shared-Memory Parallelism Can Be Simple, Fast, and ScalableundefinedOnline publication date: 9-Jun-2017
                      • (2016)TextsCompact Data Structures10.1017/CBO9781316588284.012(395-449)Online publication date: 5-Sep-2016
                      • (2016)Tighter bounds for the sum of irreducible LCP valuesTheoretical Computer Science10.1016/j.tcs.2015.12.009656:PB(265-278)Online publication date: 20-Dec-2016
                      • (2015)The longest common substring problemMathematical Structures in Computer Science10.1017/S096012951500011027:2(277-295)Online publication date: 29-May-2015
                      • Show More Cited By

                      View Options

                      Login options

                      Full Access

                      View options

                      PDF

                      View or Download as a PDF file.

                      PDF

                      eReader

                      View online with eReader.

                      eReader

                      Media

                      Figures

                      Other

                      Tables

                      Share

                      Share

                      Share this Publication link

                      Share on social media