skip to main content
article

Indexing compressed text

Published: 01 July 2005 Publication History

Abstract

We design two compressed data structures for the full-text indexing problem that support efficient substring searches using roughly the space required for storing the text in compressed form.Our first compressed data structure retrieves the occ occurrences of a pattern P[1,p] within a text T[1,n] in O(p + occ log1+ε n) time for any chosen ε, 0<ε<1. This data structure uses at most 5nHk(T) + o(n) bits of storage, where Hk(T) is the kth order empirical entropy of T. The space usage is Θ(n) bits in the worst case and o(n) bits for compressible texts. This data structure exploits the relationship between suffix arrays and the Burrows--Wheeler Transform, and can be regarded as a compressed suffix array.Our second compressed data structure achieves O(p+occ) query time using O(nHk(T)logε n) + o(n) bits of storage for any chosen ε, 0<ε<1. Therefore, it provides optimal output-sensitive query time using o(nlog n) bits in the worst case. This second data structure builds upon the first one and exploits the interplay between two compressors: the Burrows--Wheeler Transform and the LZ78 algorithm.

References

[1]
Alstrup, S., Brodal, G. S., and Rauhe, T. 2000. New data structures for orthogonal range searching. In Proceedings of the 41st IEEE Symposium on Foundations of Computer Science. IEEE Computer Society Press, Los Alamitos, Calif., 198--207.]]
[2]
Andersson, A. 1996. Sorting and searching revisited. In Proceedings of the 5th Scandinavian Workshop on Algorithm Theory. Lecture Notes in Computer Science, vol. 1097. Springer-Verlag, New York, 185--197.]]
[3]
Bentley, J., Sleator, D., Tarjan, R., and Wei, V. 1986. A locally adaptive compression scheme. Commun. ACM 29, 4, 320--330.]]
[4]
Brodnik, A., and Munro, I. 1999. Membership in constant time and almost-minimum space. SIAM J. Comput. 28, 5, 1627--1640.]]
[5]
Burrows, M., and Wheeler, D. 1994. A block sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation.]]
[6]
Chan, H., Hon, W., and Lam, T. 2004. Compressed index for a dynamic collection of texts. In Proceedings of the 15th Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, vol. 3109. Springer-Verlag, New York, 445--456.]]
[7]
Clark, D. R., and Munro, I. 1996. Efficient suffix trees on secondary storage. In Proceedings of the 7th ACM-SIAM Symposium on Discrete Algorithms. ACM, New York, 383--391.]]
[8]
Colussi, L., and De Col, A. 1996. A time and space efficient data structure for string searching on large texts. Info. Proc. Lett. 58, 5, 217--222.]]
[9]
Farach, M., and Thorup, M. 1998. String matching in Lempel--Ziv compressed strings. Algorithmica 20, 4, 388--404.]]
[10]
Fenwick, P. 1996. The Burrows-Wheeler transform for block sorting text compression: principles and improvements. The Computer J. 39, 9, 731--740.]]
[11]
Ferragina, P., Giancarlo, R., Manzini, G., and Sciortino, M. 2005. Boosting textual compression in optimal linear time. J. ACM 52, 4 (July), 688--713.]]
[12]
Ferragina, P., and Grossi, R. 1999. The string B-tree: A new data structure for string search in external memory and its applications. J. ACM 46, 2, 236--280.]]
[13]
Ferragina, P., and Manzini, G. 2000. Opportunistic data structures with applications. In Proceedings of the 41st IEEE Symposium on Foundations of Computer Science. IEEE Computer Society Press, Los Alamitos, Calif., 390--398.]]
[14]
Ferragina, P., and Manzini, G. 2001. An experimental study of a compressed index. Information Sciences: Special issue on “Dictionary Based Compression” 135, 13--28.]]
[15]
Ferragina, P., Manzini, G., Mäkinen, V., and Navarro, G. 2004. An alphabet-friendly FM-index. In Proceedings of the 11th International Symposium on String Processing and Information Retrieval. Lecture Notes in Computer Science, vol. 3246. Springer-Verlag, New York, 150--160.]]
[16]
Gonnet, G. H., Baeza-Yates, R. A., and Snider, T. 1992. New indices for text: PAT trees and PAT arrays. In Information Retrieval: Data Structures and Algorithms, B. Frakes and R. A. Baeza-Yates Eds. Prentice-Hall, Englewood Cliffs, N.J., Chapter 5, 66--82.]]
[17]
Grabowski, S., Mäkinen, V., and Navarro, G. 2004. First Huffman, then Burrows-Wheeler: An alphabet-independent FM-index. In Proceedings of the 11th International Symposium on String Processing and Information Retrieval. Lecture Notes in Computer Science, vol. 3246. Springer-Verlag, New York, 210--211.]]
[18]
Grossi, R., Gupta, A., and Vitter, J. 2003. High-order entropy-compressed text indexes. In Proceedings of the 14th ACM-SIAM Symposium on Discrete Algorithms. ACM-SIAM Press, New York, 841--850.]]
[19]
Grossi, R., Gupta, A., and Vitter, J. 2004. When indexing equals compression: Experiments on compressing suffix arrays and applications. In Proceedings of the 15th ACM-SIAM Symposium on Discrete Algorithms. ACM-SIAM Press, New York, 636--645.]]
[20]
Grossi, R., and Vitter, J. 2000. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In Proceedings of the 32nd ACM Symposium on Theory of Computing. ACM, New York, 397--406.]]
[21]
Healy, J., Thomas, E. E., Schwartz, J. T., and Wigler, M. 2003. Annotating large genomes with exact word matches. Genome Research 13, 2306--2315.]]
[22]
Hon, W., Lam, T., Sadakane, K., Sung, W., and Yiu, S. 2004a. Compressed index for dynamic text. In Proceedings of the IEEE Data Compression Conference. IEEE Computer Society Press, Los Alamitos, Calif., 102--111.]]
[23]
Hon, W., Lam, T., Sung, W., Tse, W., Wong, C., and Yiu, S. 2004b. Practical aspects of compressed suffix arrays and FM-index in searching DNA sequences. In Proceedings of the 6th Workshop on Algorithm Engineering and Experiments. SIAM Press, Philadelphia, Pa., 31--38.]]
[24]
Huynh, N., Hon, W., Lam, T., and Sung, W. 2004. Approximate string matching using compressed suffix arrays. In Proceedings of the 15th Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, vol. 3109. Springer-Verlag, New York, 434--444.]]
[25]
Kärkkäinen, J., and Sutinen, E. 1998. Lempel-Ziv index for q-grams. Algorithmica 21, 137--154.]]
[26]
Kärkkäinen, J., and Ukkonen, E. 1996. Lempel-Ziv parsing and sublinear-size index structures for string matching. In Proceedings of the 3rd South American Workshop on String Processing, N. Ziviani, R. Baeza-Yates, and K. Guimarães, Eds. Carleton University Press, 141--155.]]
[27]
Kosaraju, R., and Manzini, G. 1999. Compression of low entropy strings with Lempel--Ziv algorithms. SIAM J. Comput. 29, 3, 893--911.]]
[28]
Kurtz, S. 1999. Reducing the space requirement of suffix trees. Software---Practice and Experience 29, 13, 1149--1171.]]
[29]
Mäkinen, V., and Navarro, G. 2004. Compressed compact suffix arrays. In Proceedings of the 15th Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, vol. 3109. Springer-Verlag, New York, 420--433.]]
[30]
Mäkinen, V., Navarro, G., and Sadakane, K. 2004. Advantages of backward searching---efficient secondary memory and distributed implementation of compressed suffix arrays. In Proceedings of the 15th International Symposium on Algorithms and Computation. Lecture Notes in Computer Science, Springer-Verlag, New York.]]
[31]
Manber, U., and Myers, G. 1993. Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22, 5, 935--948.]]
[32]
Manzini, G. 2001. An analysis of the Burrows-Wheeler transform. J. ACM 48, 3, 407--430.]]
[33]
McCreight, E. M. 1976. A space-economical suffix tree construction algorithm. J. ACM 23, 2, 262-- 272.]]
[34]
Munro, J. I., Raman, V., and Rao, S. S. 2001. Space efficient suffix trees. J. Algor. 39, 2, 205--222.]]
[35]
Navarro, G. 2004. Indexing text using the Ziv-Lempel trie. J. Discr. Algor. 2, 1, 87--114.]]
[36]
Pagh, R. 2001. Low redundancy in static dictionaries with constant query time. SIAM J. Comput. 31, 2, 353--363.]]
[37]
Raman, R., Raman, V., and Rao, S. S. 2002. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proceedings of the 13th ACM-SIAM Symposium on Discrete Algorithms. ACM-SIAM Press, New York, 233--242.]]
[38]
Rao, S. S. 2002. Time-space trade-offs for compressed suffix arrays. Inf. Proc. Lett. 82, 6, 307--311.]]
[39]
Sadakane, K. 2000. Compressed text databases with efficient query algorithms based on the compressed suffix array. In Proceeding of the 11th International Symposium on Algorithms and Computation. Lecture Notes in Computer Science, vol. 1969. Springer-Verlag, New York, 410--421.]]
[40]
Sadakane, K. 2002. Succinct representations of LCP information and improvements in the compressed suffix arrays. In Proceedings of the 13th ACM-SIAM Symposium on Discrete Algorithms. ACM-SIAM Press, New York, 225--232.]]
[41]
Sadakane, K. 2003. New text indexing functionalities of compressed suffix arrays. J. Algor. 48, 2, 294--313.]]
[42]
Sadakane, K., and Shibuya, T. 2001. Indexing huge genome sequences for solving various problems. Genome Informatics 12, 175--183.]]
[43]
Witten, I. H., Moffat, A., and Bell, T. C. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images, Second ed., Morgan Kaufmann Publishers, Los Altos, Calif.]]
[44]
Ziv, J., and Lempel, A. 1978. Compression of individual sequences via variable length coding. IEEE Trans. Info. Theory 24, 530--536.]]

Cited By

View all

Recommendations

Comments

Information & Contributors

Information

Published In

cover image Journal of the ACM
Journal of the ACM  Volume 52, Issue 4
July 2005
199 pages
ISSN:0004-5411
EISSN:1557-735X
DOI:10.1145/1082036
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 July 2005
Published in JACM Volume 52, Issue 4

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Burrows--Wheeler transform
  2. Lempel--Ziv compressor
  3. full-text indexing
  4. indexing data structure
  5. pattern searching
  6. suffix array
  7. suffix tree
  8. text compression

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)91
  • Downloads (Last 6 weeks)7
Reflects downloads up to 30 Jan 2025

Other Metrics

Citations

Cited By

View all

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media