DOI: 10.1145/956750.956844
Article

Data quality through knowledge engineering

Published: 24 August 2003

Abstract

Traditionally, data quality programs have acted as a preprocessing stage to make data suitable for a data mining or analysis operation. Recently, data quality concepts have been applied to databases that support business operations such as provisioning and billing. Incorporating business rules that drive operations and their associated data processes is critically important to the success of such projects. However, there are many practical complications. For example, documentation on business rules is often meager. Rules change frequently. Domain knowledge is often fragmented across experts, and those experts do not always agree. Typically, rules have to be gathered from subject matter experts iteratively, and are discovered out of logical or procedural sequence, like a jigsaw puzzle. Our approach is to implement business rules as constraints on data in a classical expert system formalism sometimes called production rules. Our system works by allowing good data to pass through a system of constraints unchecked. Bad data violate constraints and are flagged, and then fed back after correction. Constraints are added incrementally as better understanding of the business rules is gained. We include a real-life case study.
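The flow described in the abstract (constraints expressed as rules, clean records passing through, violating records flagged and resubmitted after correction, and rules added incrementally) can be illustrated with a minimal Python sketch. This is not the authors' system: the record fields (`circuit_id`, `provisioned`, `billed`), the rule names, and the `Constraint` and `check` helpers are all hypothetical, chosen only to show the general shape of a condition-action rule applied to data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]


@dataclass
class Constraint:
    """Production-style rule: if `violated(record)` is true, flag the record."""
    name: str
    violated: Callable[[Record], bool]


def check(
    records: List[Record], constraints: List[Constraint]
) -> Tuple[List[Record], List[Tuple[Record, List[str]]]]:
    """Split records into those that pass and those that violate any rule."""
    passed, flagged = [], []
    for rec in records:
        hits = [c.name for c in constraints if c.violated(rec)]
        if hits:
            flagged.append((rec, hits))  # keep the rule names for the reviewer
        else:
            passed.append(rec)           # good data pass through untouched
    return passed, flagged


# Hypothetical first rule, as an expert might state it early on.
constraints = [
    Constraint("missing_circuit_id", lambda r: not r.get("circuit_id")),
]

# Hypothetical provisioning/billing records.
records = [
    {"circuit_id": "C1", "provisioned": True, "billed": True},
    {"circuit_id": "", "provisioned": True, "billed": True},
    {"circuit_id": "C3", "provisioned": False, "billed": True},
]

passed, flagged = check(records, constraints)

# A rule learned later is appended without touching the existing ones.
constraints.append(
    Constraint(
        "billed_but_not_provisioned",
        lambda r: bool(r.get("billed")) and not r.get("provisioned"),
    )
)
passed_v2, flagged_v2 = check(records, constraints)

# A flagged record is corrected and fed back through the same constraints.
corrected = dict(records[1], circuit_id="C2")
passed_fix, flagged_fix = check([corrected], constraints)
```

Keeping each rule as an independent entry in a list is one way to reflect the incremental, out-of-sequence rule gathering the abstract describes: a new rule can be appended at any time, and rerunning `check` shows which previously accepted records it now flags.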



    Published In

    KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
    August 2003
    736 pages
    ISBN:1581137370
    DOI:10.1145/956750

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. business operations databases
    2. data quality
    3. static and dynamic constraints


    Conference

    KDD03

    Acceptance Rates

    KDD '03 Paper Acceptance Rate 46 of 298 submissions, 15%;
    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
