DOI: 10.5555/648312.755367
Article

Cleansing Data for Mining and Warehousing

Published: 30 August 1999

Abstract

Given the rapid growth of data, it is important to extract, mine and discover useful information from databases and data warehouses. The process of data cleansing is crucial because of the "garbage in, garbage out" principle. "Dirty" data files are prevalent because of incorrect or missing data values, inconsistent value naming conventions, and incomplete information. Hence, we may have multiple records referring to the same real-world entity. In this paper, we examine the problem of detecting and removing duplicate records. We present several efficient techniques to pre-process the records before sorting them so that potentially matching records will be brought into a close neighbourhood. Based on these techniques, we implement a data cleansing system which can detect and remove more duplicate records than existing methods.
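
The abstract does not spell out the pre-processing techniques, so the following is only a minimal sketch of the general idea it describes: normalize each record, sort on the normalized form so that likely matches end up adjacent, then compare records within a small sliding window. Everything specific here is an assumption made for illustration and not the paper's method: the function names `normalize` and `find_duplicates`, the token-sorting normalization, the `window` and `threshold` values, and the use of Python's `difflib.SequenceMatcher` as the similarity measure.

```python
from difflib import SequenceMatcher


def normalize(record):
    """Hypothetical pre-processing: lowercase, drop punctuation, sort tokens."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in record.lower())
    return " ".join(sorted(cleaned.split()))


def find_duplicates(records, window=3, threshold=0.85):
    """Return sorted index pairs (i, j), i < j, of records judged duplicates."""
    # Sort on the normalized key so that similar records become neighbours.
    keyed = sorted((normalize(r), i) for i, r in enumerate(records))
    pairs = set()
    for pos, (key, idx) in enumerate(keyed):
        # Compare each record only with the next (window - 1) records.
        for other_key, other_idx in keyed[pos + 1 : pos + window]:
            if SequenceMatcher(None, key, other_key).ratio() >= threshold:
                pairs.add((min(idx, other_idx), max(idx, other_idx)))
    return sorted(pairs)


# Made-up sample records, not data from the paper.
records = [
    "John A. Smith, 12 Main Street",
    "Smith, John A., 12 Main Street",
    "Mary Jones, 4 Oak Avenue",
    "Jones Mary, 4 Oak Ave",
]
duplicates = find_duplicates(records)
print(duplicates)
```

Sorting the tokens inside each record is one way to make differently ordered fields ("John A. Smith" versus "Smith, John A.") produce the same sort key. The window keeps the number of comparisons roughly linear in the number of records, at the cost of missing duplicates whose keys sort far apart.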

Published In

DEXA '99: Proceedings of the 10th International Conference on Database and Expert Systems Applications
August 1999
1101 pages

Publisher

Springer-Verlag

Berlin, Heidelberg
