
FusionQuery: On-demand Fusion Queries over Multi-source Heterogeneous Data

Authors:

Yunjun Gao

Proceedings of the VLDB Endowment, Volume 17, Issue 6

Pages 1337 - 1349

https://doi.org/10.14778/3648160.3648174

Published: 01 February 2024

Abstract

Centralised data management systems (e.g., data lakes) support queries over multi-source heterogeneous data. However, query results drawn from multiple sources commonly contain between-source conflicts, which make the results unreliable and confusing and degrade the usability of such systems. Resolving between-source conflicts is therefore one of the most important problems for centralised data management systems. Many batch data fusion methods have been proposed to solve it, but they require traversing all the data in the system, which causes scalability and flexibility issues.

To address these issues, this paper explores the problem of on-demand fusion queries, where between-source conflicts are resolved using only the query-related data. We propose an efficient on-demand fusion query framework, FusionQuery, which consists of a query stage and a fusion stage. In the query stage, we frame the heterogeneous data query problem as a knowledge graph matching problem and present a line graph-based method to accelerate it. In the fusion stage, we develop an Expectation Maximization-style algorithm that iteratively updates data veracity and source trustworthiness. Furthermore, we design an incremental estimation method for source trustworthiness to address the lack of sufficient observations. Extensive experiments on two real-world datasets demonstrate that FusionQuery outperforms state-of-the-art data fusion methods in terms of both effectiveness and efficiency.
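The abstract does not give the update rules of the fusion stage, so the following is only a minimal sketch of a generic truth-discovery iteration in the same spirit: it alternates between estimating the veracity of each claimed value from the trust of the sources asserting it, and re-estimating each source's trust from the veracity of its claims. The function name `fuse`, the `(source, item, value)` claim format, the uniform prior of 0.8, and both update formulas are assumptions made for illustration, not FusionQuery's actual method.

```python
from collections import defaultdict


def fuse(claims, iterations=20, prior_trust=0.8):
    """Generic alternating update (illustrative, not the paper's algorithm).

    claims: list of (source, item, value) triples.
    Returns (veracity, trust), where veracity[(item, value)] and
    trust[source] are scores in [0, 1].
    """
    sources = {s for s, _, _ in claims}
    # Assumed uniform prior: every source starts equally trustworthy.
    trust = {s: prior_trust for s in sources}

    # votes[item][value] is the set of sources asserting that value.
    votes = defaultdict(lambda: defaultdict(set))
    for s, item, value in claims:
        votes[item][value].add(s)

    veracity = {}
    for _ in range(iterations):
        # Veracity step: trust-weighted vote, normalised per item.
        for item, by_value in votes.items():
            scores = {v: sum(trust[s] for s in srcs)
                      for v, srcs in by_value.items()}
            total = sum(scores.values())
            for v, score in scores.items():
                veracity[(item, v)] = (
                    score / total if total > 0 else 1.0 / len(scores)
                )

        # Trust step: mean veracity of the values a source claims.
        sums = defaultdict(float)
        counts = defaultdict(int)
        for s, item, value in claims:
            sums[s] += veracity[(item, value)]
            counts[s] += 1
        trust = {s: sums[s] / counts[s] for s in sources}

    return veracity, trust


# Hypothetical query result: three sources, one conflict on item "x".
claims = [
    ("A", "x", 1), ("A", "y", 2),
    ("B", "x", 1), ("B", "y", 2),
    ("C", "x", 9), ("C", "y", 2),
]
veracity, trust = fuse(claims)
```

In this toy input, sources A and B agree on item "x" and C disagrees, so the majority value ends up with the higher veracity and C with the lower trust. Only the claims returned for a query are touched, which is the on-demand aspect the abstract emphasises.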



Published In

Proceedings of the VLDB Endowment, Volume 17, Issue 6, February 2024, 369 pages

Editors: Meihui Zhang (Beijing Institute of Technology), Cyrus Shahabi (University of Southern California)

Publisher: VLDB Endowment

Badges: Artifacts Available / v1.1

Qualifiers: Research-article
