Donag: Generating Efficient Patches and Diffs for Compressed Archives

Published: 21 September 2022

Abstract

Differencing between compressed archives is a common task in file management and synchronization. Applications include source code distribution, application updates, and document synchronization. General-purpose binary differencing tools can create and apply patches to compressed archives, but they do not consider the internal structure of the compressed archive or the file lifecycle. As a result, they miss opportunities to save space based on the archive’s internal structure and metadata. To address this gap, we develop a content-aware, format-independent theory for differencing on compressed archives and propose a canonical form and digest for compressed archives. Building on these, we present Donag, a content-aware differencing and patching algorithm that produces smaller patches than general-purpose binary differencing tools on versioned archives by exploiting the compressed archives’ internal structure. Donag uses the VCDiff and BSDiff engines internally. We compare Donag’s patches to those produced by bsdiff, xdelta3, and Delta++ on three classes of compressed archives: open-source code repositories, large and small applications, and office productivity documents (DOCX, XLSX, PPTX). Donag’s patches are typically 10% to 89% smaller than those produced by bsdiff, xdelta3, and Delta++, with reasonable memory overhead and throughput on commodity hardware. In the worst case, Donag’s patches are negligibly larger.
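To make the idea of a content-aware digest concrete, the following is a minimal sketch in Python. It is not the canonical form or digest defined in the paper, whose details are not given in the abstract. It only illustrates the general principle under a simple assumption: two ZIP archives count as equivalent when they hold the same entry names with the same decompressed bytes, regardless of entry order or compression method. The names `content_digest` and `make_zip` are hypothetical helpers introduced for this illustration.

```python
import hashlib
import io
import zipfile


def content_digest(archive_bytes: bytes) -> str:
    """Hash a ZIP archive by its decompressed contents, not its raw bytes.

    Hypothetical illustration only; NOT the digest defined in the paper.
    """
    h = hashlib.sha256()
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        # Sort by name so the order of entries in the archive is irrelevant.
        for name in sorted(zf.namelist()):
            data = zf.read(name)  # decompressed entry contents
            encoded = name.encode("utf-8")
            # Length-prefix each field so field boundaries are unambiguous.
            h.update(len(encoded).to_bytes(8, "big"))
            h.update(encoded)
            h.update(len(data).to_bytes(8, "big"))
            h.update(data)
    return h.hexdigest()


def make_zip(entries, compression):
    """Build an in-memory ZIP from (name, bytes) pairs (test helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression) as zf:
        for name, data in entries:
            zf.writestr(name, data)
    return buf.getvalue()


files = [("a.txt", b"hello " * 100), ("b.txt", b"world " * 100)]
# Same logical content, different entry order and compression method.
stored = make_zip(files, zipfile.ZIP_STORED)
deflated = make_zip(list(reversed(files)), zipfile.ZIP_DEFLATED)
# A genuinely different archive: one entry's content has changed.
changed = make_zip([("a.txt", b"HELLO"), ("b.txt", b"world " * 100)],
                   zipfile.ZIP_DEFLATED)
```

Here `stored` and `deflated` differ as raw byte strings, so a byte-level comparison treats them as unrelated files, while `content_digest` maps them to the same value and still separates them from `changed`. This sketch ignores entry metadata such as timestamps and permissions, which a real canonical form would need to address explicitly.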


Published In

ACM Transactions on Storage  Volume 18, Issue 3
August 2022
244 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/3555792
  • Editor: Sam H. Noh
Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 September 2022
Online AM: 27 July 2022
Accepted: 21 December 2021
Revised: 10 December 2021
Received: 08 April 2021
Published in TOS Volume 18, Issue 3

Author Tags

  1. Differencing
  2. delta files
  3. compression
  4. canonical forms
  5. ZIP archives

Qualifiers

  • Research-article
  • Refereed
