Truth finding on the deep web: is the problem solved?

Published: 01 December 2012

Abstract

The amount of useful information available on the Web has been growing at a dramatic pace in recent years, and people rely more and more on the Web to fulfill their information needs. In this paper, we study the truthfulness of Deep Web data in two domains where we believed data are fairly clean and data quality is important to people's lives: Stock and Flight. To our surprise, we observed a large amount of inconsistency in data from different sources, and also some sources with quite low accuracy. We further applied state-of-the-art data fusion methods, which aim at resolving conflicts and finding the truth, to these two data sets, analyzed their strengths and limitations, and suggested promising research directions. We hope our study can increase awareness of the seriousness of conflicting data on the Web and in turn inspire more research in our community to tackle this problem.
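
To make the notion of resolving conflicts concrete, the sketch below shows naive majority voting over the values that several sources claim for one data item. It is only an illustrative baseline under our own assumptions, not one of the fusion methods evaluated in the paper; the function name `vote`, the source names, and the departure times are all hypothetical.

```python
from collections import Counter


def vote(claims):
    """Pick the value claimed by the most sources for a single data item.

    `claims` maps a (hypothetical) source name to the value it reports.
    Returns the winning value and the fraction of sources supporting it.
    """
    if not claims:
        raise ValueError("at least one claim is required")
    counts = Counter(claims.values())
    # most_common(1) returns the single most frequent value with its count.
    value, support = counts.most_common(1)[0]
    return value, support / len(claims)


# Hypothetical claims about one flight's scheduled departure time.
claims = {
    "source_a": "18:05",
    "source_b": "18:05",
    "source_c": "18:15",
}
value, share = vote(claims)
print(value, round(share, 2))
```

Voting of this kind treats every source as equally reliable and independent, which is exactly the assumption that inaccurate or mutually copying sources would violate.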

Published In

Proceedings of the VLDB Endowment, Volume 6, Issue 2
December 2012
120 pages

Publisher

VLDB Endowment
