Article

Cleansing Data for Mining and Warehousing

Authors: Mong-Li Lee, Tok Wang Ling, Hongjun Lu, Yee Teng Ko

DEXA '99: Proceedings of the 10th International Conference on Database and Expert Systems Applications

Pages 751 - 760

Published: 30 August 1999

Abstract

Given the rapid growth of data, it is important to extract, mine and discover useful information from databases and data warehouses. The process of data cleansing is crucial because of the "garbage in, garbage out" principle. "Dirty" data files are prevalent because of incorrect or missing data values, inconsistent value naming conventions, and incomplete information. As a result, multiple records may refer to the same real-world entity. In this paper, we examine the problem of detecting and removing duplicate records. We present several efficient techniques to pre-process the records before sorting them so that potentially matching records are brought into a close neighbourhood. Based on these techniques, we implement a data cleansing system which can detect and remove more duplicate records than existing methods.
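The abstract does not spell out the pre-processing techniques themselves, so the following is only a minimal sketch of the general idea it describes: normalise each record, sort on the normalised form so that likely matches end up next to each other, then compare records only within a small sliding window. The function names (`normalize`, `find_duplicates`, `deduplicate`), the token-sorting rule, the window size, the similarity threshold, and the sample records are all assumptions made for illustration; none of them are taken from the paper.

```python
import re
from difflib import SequenceMatcher


def normalize(record: str) -> str:
    """Hypothetical pre-processing step: lowercase the record, drop
    punctuation, and sort its tokens so word order no longer matters."""
    tokens = re.findall(r"[a-z0-9]+", record.lower())
    return " ".join(sorted(tokens))


def find_duplicates(records, window=3, threshold=0.85):
    """Return index pairs (i, j) with i < j that look like duplicates.

    Records are sorted by their normalised key, and each record is
    compared only with the next `window - 1` records in sorted order.
    """
    keyed = sorted((normalize(r), i) for i, r in enumerate(records))
    pairs = set()
    for pos, (key_a, idx_a) in enumerate(keyed):
        for key_b, idx_b in keyed[pos + 1 : pos + window]:
            # Similarity of the two normalised keys, in [0, 1].
            if SequenceMatcher(None, key_a, key_b).ratio() >= threshold:
                pairs.add((min(idx_a, idx_b), max(idx_a, idx_b)))
    return sorted(pairs)


def deduplicate(records, window=3, threshold=0.85):
    """Keep the earlier record of each detected duplicate pair."""
    drop = {j for _, j in find_duplicates(records, window, threshold)}
    return [r for i, r in enumerate(records) if i not in drop]


# Made-up sample data: the first two entries name the same person.
sample = ["Smith, John A.", "John A. Smith", "Jane Doe", "Alan Turing"]
duplicate_pairs = find_duplicates(sample)
cleaned = deduplicate(sample)
```

Without the normalisation step, "Smith, John A." and "John A. Smith" would sort far apart and never share a window, which is the situation the pre-processing is meant to avoid. The window keeps the number of comparisons roughly linear in the number of records rather than quadratic.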




Published In

DEXA '99: Proceedings of the 10th International Conference on Database and Expert Systems Applications

August 1999

1101 pages

ISBN: 3540664483

Publisher

Springer-Verlag

Berlin, Heidelberg

