research-article

GEqO: ML-Accelerated Semantic Equivalence Detection

Authors:

Brandon Haynes,

Yuanyuan Tian

Proceedings of the ACM on Management of Data, Volume 1, Issue 4

Article No.: 223, Pages 1 - 25

https://doi.org/10.1145/3626710

Published: 12 December 2023

Abstract

Large-scale analytics engines have become a core dependency for modern data-driven enterprises to derive business insights and drive actions. These engines support a large number of analytic jobs that process huge volumes of data daily, and their workloads are often inundated with overlapping computations across multiple jobs. Reusing common computation is crucial for efficient cluster resource utilization and for reducing job execution time. Detecting common computation is the first and key step in reducing this computational redundancy. However, detecting equivalence on large-scale analytics engines requires efficient and scalable solutions that are fully automated. In addition, to maximize computation reuse, equivalence needs to be detected at the semantic level rather than only at the syntactic level (i.e., queries that look different but compute the same result must be recognized as equivalent). Unfortunately, existing solutions fall short of satisfying these requirements.
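To make the syntactic-versus-semantic distinction concrete, the following minimal sketch compares two queries that differ textually but return the same rows. The `sales` table, both queries, and the helper names are hypothetical and not taken from the paper. Agreement on a single sample database is only evidence of equivalence, not a proof.

```python
import hashlib
import sqlite3

# Two hypothetical queries: different text, same meaning.
Q1 = "SELECT id FROM sales WHERE amount > 10 AND region = 'EU'"
Q2 = "SELECT id FROM sales WHERE region = 'EU' AND NOT (amount <= 10)"


def text_signature(sql: str) -> str:
    """Purely syntactic signature: hash of the whitespace-normalized text."""
    normalized = " ".join(sql.split())
    return hashlib.sha256(normalized.encode()).hexdigest()


def run(sql: str, conn: sqlite3.Connection) -> list:
    """Execute a query and return its rows in a canonical (sorted) order."""
    return sorted(conn.execute(sql).fetchall())


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER, amount INTEGER NOT NULL, region TEXT)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, 5, "EU"), (2, 20, "EU"), (3, 30, "US"), (4, 11, "EU")],
)

# A syntactic comparison treats the queries as different ...
same_text = text_signature(Q1) == text_signature(Q2)
# ... although they return identical rows on this sample database.
same_rows = run(Q1, conn) == run(Q2, conn)

print("same text signature:", same_text)
print("same result rows:   ", same_rows)
```

A signature computed from query text alone would miss this pair, which is the kind of reuse opportunity that semantic detection aims to recover.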

In this paper, we take a major step towards filling this gap by proposing GEqO, a portable and lightweight machine-learning-based framework for efficiently identifying semantically equivalent computations at scale. GEqO introduces two machine-learning-based filters that quickly prune out nonequivalent subexpressions and employs a semi-supervised learning feedback loop to iteratively improve its model with an intelligent sampling mechanism. Further, with its novel database-agnostic featurization method, GEqO can transfer the learning from one workload and database to another. Our extensive empirical evaluation shows that, on TPC-DS-like queries, GEqO yields significant performance gains (up to 200x faster than automated verifiers) and finds up to 2x more equivalences than optimizer-based and signature-based equivalence detection approaches.
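The general idea of pruning candidate pairs with cheap filters before running an expensive check can be sketched as follows. This is an illustration only, not GEqO's implementation: the two filters here are simple hand-written heuristics standing in for the learned filters, the "verifier" is a toy conjunct comparison, and the assumption that surviving pairs are passed to a verifier is ours rather than something the abstract spells out. All function names are hypothetical.

```python
import re
from itertools import combinations


def same_columns(a: str, b: str) -> bool:
    """Cheap filter 1 (stand-in): both predicates reference the same columns."""
    cols = lambda e: set(re.findall(r"[a-z]+", e))
    return cols(a) == cols(b)


def similar_tokens(a: str, b: str, threshold: float = 0.8) -> bool:
    """Cheap filter 2 (stand-in for a learned score): token Jaccard similarity."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) >= threshold


def toy_verifier(a: str, b: str) -> bool:
    """Expensive check (stand-in): equal conjunct sets, ignoring order."""
    conjuncts = lambda e: sorted(c.strip() for c in e.split(" AND "))
    return conjuncts(a) == conjuncts(b)


def cascade(exprs, filters, verify):
    """Run the verifier only on pairs that survive every filter."""
    verified, verifier_calls = [], 0
    for a, b in combinations(exprs, 2):
        if all(f(a, b) for f in filters):
            verifier_calls += 1
            if verify(a, b):
                verified.append((a, b))
    return verified, verifier_calls


exprs = [
    "a > 5 AND b < 3",
    "b < 3 AND a > 5",  # same conjuncts as the first, reordered
    "a < 5 AND b > 3",  # same tokens, different meaning
    "a > 7 AND b < 9",  # same columns, different constants
    "c = 1",            # different columns
]

verified, calls = cascade(exprs, [same_columns, similar_tokens], toy_verifier)
total_pairs = len(exprs) * (len(exprs) - 1) // 2
print(f"verifier calls: {calls} of {total_pairs} pairs")
print("verified pairs:", verified)
```

In this toy run the filters discard most of the ten candidate pairs, and the verifier still rejects two pairs that the filters let through. The filters only need to be cheap and to avoid discarding true equivalences, since the final check decides correctness.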

Index Terms

GEqO: ML-Accelerated Semantic Equivalence Detection
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
        Query optimization
    2. Information integration

Published In

Proceedings of the ACM on Management of Data Volume 1, Issue 4

PACMMOD

December 2023

1317 pages

EISSN:2836-6573

DOI:10.1145/3637468

Editor:
Divyakant Agrawal
UC Santa Barbara, United States

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States
