Lazy Lempel-Ziv Factorization Algorithms

Published: 21 October 2016

Abstract

For decades the Lempel-Ziv (LZ77) factorization has been a cornerstone of data compression and string processing algorithms, and uses for it are still being uncovered. For example, LZ77 is central to several recent text indexing data structures designed to search highly repetitive collections. However, in many applications computation of the factorization remains a bottleneck in practice. In this article, we describe a number of simple and fast LZ77 factorization algorithms, which consistently outperform all previous methods in practice, use less memory, and still offer strong worst-case performance guarantees. A common feature of the new algorithms is that they compute longest common prefix information in a lazy fashion, with the degree of laziness in preprocessing characterizing different algorithms.
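To make the object of study concrete, the following is a minimal sketch of what an LZ77 factorization is, not a reproduction of the article's algorithms. It assumes the common definition in which each phrase is either a single character that has not occurred before, or the longest prefix of the remaining text that also occurs starting at an earlier position (overlaps allowed). The function name `lz77_factorize` and the `(position, length)` / `(character, 0)` output convention are illustrative choices. The search here is a naive quadratic scan; it does not use the lazy longest-common-prefix computation that the abstract attributes to the new algorithms.

```python
def lz77_factorize(text):
    """Naive LZ77 factorization (illustrative sketch, quadratic time or worse).

    Returns a list of phrases. A phrase is (char, 0) for a character with no
    earlier occurrence, or (pos, length) meaning "copy `length` characters
    starting at earlier position `pos`" (0-based, overlaps allowed).
    """
    n = len(text)
    phrases = []
    i = 0
    while i < n:
        best_pos, best_len = -1, 0
        # Try every earlier starting position j < i as a copy source.
        for j in range(i):
            length = 0
            # Extend the match; j + length < i + length < n, so no overrun.
            while i + length < n and text[j + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_pos, best_len = j, length
        if best_len == 0:
            # No earlier occurrence of text[i]: emit a literal phrase.
            phrases.append((text[i], 0))
            i += 1
        else:
            phrases.append((best_pos, best_len))
            i += best_len
    return phrases


if __name__ == "__main__":
    print(lz77_factorize("zzzzzapzap"))
```

For the input `zzzzzapzap` this yields the five phrases `z`, `zzzz`, `a`, `p`, `zap`. The inner scan over all earlier positions is the expensive step that index-based methods replace with suffix array and longest-common-prefix information.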

    Information

    Published In

    ACM Journal of Experimental Algorithmics  Volume 21, Issue
    Special Issue SEA 2014, Regular Papers and Special Issue ALENEX 2013
    2016
    404 pages
    ISSN:1084-6654
    EISSN:1084-6654
    DOI:10.1145/2888418
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 21 October 2016
    Accepted: 01 October 2014
    Revised: 01 August 2013
    Received: 01 April 2013
    Published in JEA Volume 21

    Author Tags

    1. Burrows-Wheeler transform
    2. LZ77
    3. Lempel-Ziv factorization
    4. Lempel-Ziv parsing
    5. data compression
    6. string processing
    7. suffix array

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • Academy of Finland
