BayesWipe: A Scalable Probabilistic Framework for Improving Data Quality

Published: 25 October 2016

Abstract

Recent efforts in data cleaning of structured data have focused exclusively on problems like data deduplication, record matching, and data standardization; none of the approaches addressing these problems focus on fixing incorrect attribute values in tuples. Correcting values in tuples is typically performed by a minimum cost repair of tuples that violate static constraints like Conditional Functional Dependencies (which have to be provided by domain experts or learned from a clean sample of the database). In this article, we provide a method for correcting individual attribute values in a structured database using a Bayesian generative model and a statistical error model learned from the noisy database directly. We thus avoid the necessity for a domain expert or clean master data. We also show how to efficiently perform consistent query answering using this model over a dirty database, in case write permissions to the database are unavailable. We evaluate our methods over both synthetic and real data.
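
The combination of a generative model and an error model described in the abstract can be read as a noisy-channel correction: for an observed tuple T, choose the candidate clean tuple T* that maximizes P(T*) · P(T | T*). The sketch below illustrates that idea only and is not the article's implementation. It assumes the generative model is approximated by the empirical frequency of whole tuples in the noisy data, and it assumes a fixed error model in which an attribute is left unchanged with probability `keep` and is otherwise penalized by its edit distance to the candidate. The names `edit_distance`, `error_likelihood`, and `clean_tuple`, and the parameters `keep` and `decay`, are hypothetical.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def error_likelihood(observed, clean, keep=0.8, decay=0.5):
    """Unnormalized stand-in for P(observed | clean).

    Assumption: attributes are corrupted independently; an attribute is kept
    with weight `keep`, otherwise weighted (1 - keep) * decay ** edit distance.
    """
    p = 1.0
    for o, c in zip(observed, clean):
        p *= keep if o == c else (1 - keep) * decay ** edit_distance(o, c)
    return p


def clean_tuple(observed, rows):
    """Return the candidate maximizing prior(candidate) * error_likelihood.

    Assumption: candidates are the distinct tuples already in the noisy data,
    and the prior is their empirical frequency in that same data.
    """
    prior = Counter(rows)
    total = len(rows)
    return max(prior,
               key=lambda c: (prior[c] / total) * error_likelihood(observed, c))


# Toy noisy table: one row carries a typo in the first attribute.
rows = ([("honda", "civic", "sedan")] * 20
        + [("honda", "accord", "sedan")] * 10
        + [("hobda", "civic", "sedan")])

repaired = clean_tuple(("hobda", "civic", "sedan"), rows)
unchanged = clean_tuple(("honda", "accord", "sedan"), rows)
```

In this toy setting the misspelled row is mapped to its frequent neighbor while an already plausible row is left alone. Both the prior and the error parameters are estimated from, or applied to, the noisy table itself, which mirrors the abstract's point that no clean master data or expert-supplied constraints are required.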

Published In

Journal of Data and Information Quality, Volume 8, Issue 1
Special Issue on Web Data Quality
November 2016
125 pages
ISSN:1936-1955
EISSN:1936-1963
DOI:10.1145/3012403
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 October 2016
Accepted: 01 August 2016
Revised: 01 July 2016
Received: 01 November 2015
Published in JDIQ Volume 8, Issue 1

Author Tags

  1. Data quality
  2. offline and online cleaning
  3. statistical data cleaning

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • ONR
  • Leir Charitable Foundations
  • ARO
  • NSF CAREER
  • Google Research
