skip to main content
research-article

Compressing and indexing labeled trees, with applications

Published: 27 November 2009 Publication History

Abstract

Consider an ordered, static tree T where each node has a label from alphabet Σ. Tree T may be of arbitrary degree and shape. Our goal is designing a compressed storage scheme of T that supports basic navigational operations among the immediate neighbors of a node (i.e. parent, ith child, or any child with some label,…) as well as more sophisticated path-based search operations over its labeled structure.
We present a novel approach to this problem by designing what we call the XBW-transform of the tree in the spirit of the well-known Burrows-Wheeler transform for strings [1994]. The XBW-transform uses path-sorting to linearize the labeled tree T into two coordinated arrays, one capturing the structure and the other the labels. For the first time, by using the properties of the XBW-transform, our compressed indexes go beyond the information-theoretic lower bound, and support navigational and path-search operations over labeled trees within (near-)optimal time bounds and entropy-bounded space.
Our XBW-transform is simple and likely to spur new results in the theory of tree compression and indexing, as well as interesting application contexts. As an example, we use the XBW-transform to design and implement a compressed index for XML documents whose compression ratio is significantly better than the one achievable by state-of-the-art tools, and its query time performance is order of magnitudes faster.

References

[1]
Adiego, J., de la Fuente, P., and Navarro, G. 2004. Merging prediction by partial matching with structural contexts model. In Proceedings of the IEEE Data Compression Conference (DCC). IEEE Computer Society Press, Los Alamitos, CA, 522.
[2]
Arion, A., Bonifati, A., Costa, G., D'Aguanno, S., Manolescu, I., and Pugliese, A. 2003. XQueC: pushing queries to compressed XML data. In Proceedings of the 29th International Conference on Very Large Data Bases (VLDB). 1065--1068.
[3]
Arroyuelo, D., Navarro, G., and Sadakane, K. 2006. Reducing the space requirement of LZ-index. In Proceedings of the 17th Combinatorial Pattern Matching Conference (CPM). Lecture Notes in Computer Science vol. 4009. Springer-Verlag, New York, 318--329.
[4]
Barbay, J., Golynski, A., Munro, J. I., and Rao, S. 2007. Adaptive searching in succinctly encoded binary relations and tree-structured documents. Theoret. Comput. Sci. 387, 3, 284--297.
[5]
Barbay, J., He, M., Munro, J., and Rao, S. 2008. Succinct indexes for string, bynary relations and multi-labeled trees. Submitted to Journal (Preliminary version appeared in Proceedings of the 18th ACM-SIAM Symposium on Discrete Algorithms (SODA 07)).
[6]
Benoit, D., Demaine, E., Munro, J. I., Raman, R., Raman, V., and Rao, S. 2005. Representing trees of higher degree. Algorithmica 43, 275--292.
[7]
Burrows, M., and Wheeler, D. 1994. A block-sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation.
[8]
Busatto, G., Lohrey, M., and Maneth, S. 2008. Efficient memory representation of XML document trees. Inf. Syst. 33, 4-5, 456--474.
[9]
Catania, B., Maddalena, A., and Vakali, A. 2005. XML document indexes: A classification. IEEE Internet Computing, 1, 5, 64--71.
[10]
Cheney, J. 2001. Compressing XML with multiplexed hierarchical PPM models. In Proceedings of the IEEE Data Compression Conference (DCC). IEEE Computer Society Press, Los Alamitos, CA, 163--172.
[11]
Cheng, J., and Ng, W. 2004. XQzip: Querying compressed XML using structural indexing. In Proceedings of the 9th International Conference on Extending Database Technology. Lecture Notes in Computer Science, vol. 2992. Springer-Verlag, Berlin, Germany, 219--236.
[12]
Farzan, A., and Munro, I. 2008. Succinct representations of arbitrary graphs. In Proceedings of the 16th European Symposium on Algorithms (ESA). Lecture Notes in Computer Science vol. 4009. Springer-Verlag, Berlin Germany.
[13]
Fernandez, M. F., Simeon, J., Choi, B., Marian, A., and Sur, G. 2003. Implementing Xquery 1.0: the Galax experience. In Proceedings of the 29th International Conference on Very Large Data Bases (VLDB). 1077--1080.
[14]
Ferragina, P., Giancarlo, R., and Manzini, G. 2006a. The engineering of a compression boosting library: Theory vs practice in BWT compression. In Proceedings of the 14th European Symposium on Algorithms (ESA). Lecture Notes in Computer Science, vol. 4168. Springer-Verlag, Berlin, Germany, 756--767.
[15]
Ferragina, P., Giancarlo, R., and Manzini, G. 2009. The myriad virtues of wavelet trees. Inf. Comput. 207, 849--866.
[16]
Ferragina, P., Giancarlo, R., Manzini, G., and Sciortino, M. 2005a. Boosting textual compression in optimal linear time. J. ACM 52, 688--713.
[17]
Ferragina, P., González, R., Navarro, G., and Venturini, R. 2008. Compressed text indexes: From theory to practice. ACM J. Exper. Algor. 13.
[18]
Ferragina, P., Luccio, F., Manzini, G., and Muthukrishnan, S. 2005b. Structuring labeled trees for optimal succinctness, and beyond. In Proceedings of the 46th IEEE Symposium on Foundations of Computer Science (FOCS). IEEE Computer Society Press, Los Alamitos, CA, 184--193.
[19]
Ferragina, P., Luccio, F., Manzini, G., and Muthukrishnan, S. 2006b. Compressing and searching XML data via two zips. In Proceedings of the 15th International World Wide Web Conference (WWW). 751--760.
[20]
Ferragina, P., and Manzini, G. 2001. An experimental study of a compressed index. Information Sciences: special issue on “Dictionary Based Compression” 135, 13--28.
[21]
Ferragina, P., and Manzini, G. 2005. Indexing compressed text. J. ACM 52, 4, 552--581.
[22]
Ferragina, P., Manzini, G., Mäkinen, V., and Navarro, G. 2007. Compressed representations of sequences and full-text indexes. ACM Trans. Algor. 3, 2.
[23]
Ferragina, P., and Navarro, G. 2007. The Pizza&Chili corpus home page. http://pizzachili.dcc.uchile.cl/or http://pizzachili.di.unipi.it/.
[24]
Ferragina, P., and Rao, S. 2008. Tree compression and indexing. In Encyclopedia of Algorithms, Springer. 964--967.
[25]
Ferragina, P., and Venturini, R. 2007. A simple storage scheme for strings achieving entropy bounds. Theoret. Comput. Sci. 372, 1, 115--121.
[26]
Geary, R., Raman, R., and Raman, V. 2006. Succinct ordinal trees with level-ancestor queries. ACM Trans. Algor. 2, 4, 510--534.
[27]
Goldman, R., and Widom, J. 1997. Dataguides: Enabling query formulation and optimization in semistructured databases. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB). 436--445.
[28]
Golynski, A., Munro, J. I., and Rao, S. 2006. Rank/select operations on large alphabets: A tool for text indexing. In Proceedings of the 17th ACM-SIAM Symposium on Discrete Algorithms (SODA). ACM, New York, 368--373.
[29]
Gupta, A., Hon, W., Shah, R., and Vitter, J. 2007. A framework for dynamizing succinct data structures. In Proceedings of the 34th International Colloquium on Automata, Languages and Programming (ICALP). Lecture Notes in Computer Science, vol. 4596. Springer-Verlag, Berlin, Germany, 521--532.
[30]
Jacobson, G. 1989. Space-efficient static trees and graphs. In Proceedings of the 30th IEEE Symposium on Foundations of Computer Science (FOCS). IEEE Computer Society Press, Los Alamitos, CA, 549--554.
[31]
Jansson, J., Sadakane, K., and Sung, W. 2007. Ultra-succinct representation of ordered trees. In Proceedings of the 18th ACM-SIAM Symposium on Discrete Algorithms (SODA). ACM, New York, 575--584.
[32]
Kärkkäinen, J., Sanders, P., and Burkhardt, S. 2006. Linear work suffix array construction. J. ACM 53, 6, 918--936.
[33]
Karp, R., Miller, R., and Rosenberg, A. 1972. Rapid identification of repeated patterns in strings, arrays and trees. In Proceedings of the ACM Symposium on Theory of Computation. ACM, New York, 125--136.
[34]
Kaushik, R., Krishnamurthy, R., Naughton, J., and Ramakrishnan, R. 2004. On the integration of structure indexes and inverted lists. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York, 779--790.
[35]
Kosaraju, S. R. 1989. Efficient tree pattern matching. In Proceedings of the 20th IEEE Foundations of Computer Science (FOCS). 178--183.
[36]
Kurtz, S. 1999. Reducing the space requirement of suffix trees. Software—Practice and Experience 29, 13, 1149--1171.
[37]
Liefke, H., and Suciu, D. 2000. XMill: An efficient compressor for xml data. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 153--164.
[38]
Mäkinen, V., and Navarro, G. 2006. Position-restricted substring searching. In Proceedings of the 7th Latin American Symposium on Theoretical Informatics (LATIN). Lecture Notes in Computer Science, vol. 3887. Springer-Verlag, Berlin, Germany, 703--714.
[39]
Manzini, G. 2001. An analysis of the Burrows-Wheeler transform. J. ACM 48, 3, 407--430.
[40]
Milo, T., and Suciu, D. 1999. Index structures for path expressions. In Proceedings of the 3rd International Conference on Database Theory. 277--295.
[41]
Min, J., Park, M., and Chung, C. 2003. Xpress: A queriable compression for XML data. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York, 122--133.
[42]
Moura, E., Navarro, G., Ziviani, N., and Baeza-Yates, R. 2000. Fast and flexible word searching on compressed text. ACM Trans. Inform. Syst. 18, 2, 113--139.
[43]
Munro, J. I. 1996. Tables. In Proceeding of the 16th Conference on Foundations of Software Technology and Theoretical Computer Science. Lecture Notes in Computer Science, vol. 1180. Springer-Verlag, Berlin, Germany, 37--42.
[44]
Munro, J. I., Raman, R., Raman, V., and Rao, S. 2003. Succinct representations of permutations. In Proceedings of the 30th International Colloquium on Automata, Languages and Programming (ICALP). Lecture Notes in Computer Science, vol. 2719. Springer-Verlag, Berlin, Germany, 345--356.
[45]
Munro, J. I., and Raman, V. 1997. Succinct representation of balanced parentheses, static trees and planar graphs. In Proceedings of the 38th IEEE Symposium on Foundations of Computer Science (FOCS). IEEE Computer Society Press, Los Alamitos, CA, 118--126.
[46]
Munro, J. I., and Raman, V. 2001. Succinct representation of balanced parentheses and static trees. SIAM J. Comput. 31, 762--776.
[47]
Munro, J. I., and Rao, S. 2004. Succinct representations of functions. In Proceedings of the 31st International Colloquium on Automata, Languages and Programming (ICALP). Lecture Notes in Computer Science, vol. 3142. Springer-Verlag, Germany, 1006--1015.
[48]
Navarro, G., and Mäkinen, V. 2007. Compressed full-text indexes. ACM Comput. Surv. 39, 1.
[49]
Ng, W., Lam, W., Wood, P., and Levene, M. 2006. XCQ: A queriable XML compression system. Know. Inform. Syst. 10, 4, 421--452.
[50]
Raman, R., Raman, V., and Rao, S. 2007. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algor. 3, 4.
[51]
Raw, P., and Moon, B. 2004. PRIX: Indexing and querying XML using Prüfer sequences. In Proceedings of the 20th International Conference on Data Engineering (ICDE). 288--300.
[52]
Tolani, P. M., and Haritsa, J. R. 2002. XGRIND: A query-friendly XML compressor. In Proceedings of the 18th International Conference on Data Engineering (ICDE). 225--234.
[53]
Wang, H., Park, S., Fan, W., and Yu, P. S. 2003. ViST: A dynamic index methd for querying XML data by tree structures. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York, 110--121.
[54]
Wang, W., Wang, H., Lu, H., Jang, H., Lin, X., and Li, J. 2005. Efficient processing of XML path queries using the disk-based F&B index. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB). 145--156.
[55]
Witten, I. H., Moffat, A., and Bell, T. C. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images, Second ed. Morgan Kaufmann, Los Altos, CA.
[56]
Ye, Z., and Berger, T. 1998. Information measures for discrete random fields. Science Press.

Cited By

View all

Recommendations

Comments

Information & Contributors

Information

Published In

cover image Journal of the ACM
Journal of the ACM  Volume 57, Issue 1
November 2009
149 pages
ISSN:0004-5411
EISSN:1557-735X
DOI:10.1145/1613676
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 November 2009
Accepted: 01 July 2009
Revised: 01 September 2008
Received: 01 March 2007
Published in JACM Volume 57, Issue 1

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Burrows-Wheeler transform
  2. XML compression
  3. XML indexing
  4. labeled tree compression
  5. labeled tree indexing
  6. path searching
  7. tree navigation

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)43
  • Downloads (Last 6 weeks)4
Reflects downloads up to 09 Feb 2025

Other Metrics

Citations

Cited By

View all

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media