
The Effects and Interactions of Data Quality and Problem Complexity on Classification

Published: 01 February 2011

Abstract

Data quality remains a persistent problem in practice and a challenge for research. In this study we focus on the four dimensions of data quality noted as the most important to information consumers, namely accuracy, completeness, consistency, and timeliness. These dimensions are of particular concern for operational systems, and most importantly for data warehouses, which are often used as the primary data source for analyses such as classification, a general type of data mining. However, the definitions and conceptual models of these dimensions have not been collectively considered with respect to data mining in general or classification in particular. Nor have they been considered for problem complexity. Conversely, these four dimensions of data quality have only been indirectly addressed by data mining research. Using definitions and constructs of data quality dimensions, our research evaluates the effects of both data quality and problem complexity on generated data and tests the results in a real-world case. Six different classification outcomes selected from the spectrum of classification algorithms show that data quality and problem complexity have significant main and interaction effects. From the findings of significant effects, the economics of higher data quality are evaluated for a frequent application of classification and illustrated by the real-world case.
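The kind of experiment the abstract describes can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' method: the abstract does not say how data quality or problem complexity were operationalized, so everything here is hypothetical. Class separation stands in for problem complexity, randomly overwritten attribute values stand in for accuracy defects, blanked values stand in for completeness defects, and a nearest-centroid rule stands in for the classification algorithms. Only the training data is degraded; the test data stays clean.

```python
import random

def make_data(n, separation, rng):
    """Generate n labelled 2-D points; class 1 is shifted by `separation`.
    A smaller separation is used here as a stand-in for a harder problem."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        centre = separation * label
        rows.append(([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)], label))
    return rows

def degrade(rows, error_rate, missing_rate, rng):
    """Return a copy in which each attribute value is blanked with
    probability missing_rate or overwritten with noise with probability
    error_rate (hypothetical proxies for completeness and accuracy)."""
    out = []
    for features, label in rows:
        new = []
        for value in features:
            r = rng.random()
            if r < missing_rate:
                new.append(None)                    # completeness defect
            elif r < missing_rate + error_rate:
                new.append(rng.uniform(-3.0, 6.0))  # accuracy defect
            else:
                new.append(value)
        out.append((new, label))
    return out

def fit_centroids(rows):
    """Per-class mean of each attribute, ignoring missing values."""
    centroids = {}
    for label in (0, 1):
        centroid = []
        for j in range(2):
            vals = [f[j] for f, y in rows if y == label and f[j] is not None]
            centroid.append(sum(vals) / len(vals) if vals else 0.0)
        centroids[label] = centroid
    return centroids

def predict(centroids, features):
    """Assign the class whose centroid is nearest in squared distance."""
    def dist(c):
        return sum((x - m) ** 2 for x, m in zip(features, c) if x is not None)
    return min(centroids, key=lambda label: dist(centroids[label]))

def experiment(error_rate, missing_rate, separation=3.0, seed=0):
    """Train on degraded data, test on clean data, return test accuracy."""
    rng = random.Random(seed)
    train = degrade(make_data(400, separation, rng), error_rate, missing_rate, rng)
    test = make_data(200, separation, rng)
    centroids = fit_centroids(train)
    hits = sum(predict(centroids, f) == y for f, y in test)
    return hits / len(test)

if __name__ == "__main__":
    # Cross two quality levels with two complexity levels.
    for separation in (3.0, 1.0):
        for rate in (0.0, 0.4):
            acc = experiment(rate, rate, separation)
            print(f"separation={separation} defect_rate={rate} accuracy={acc:.3f}")
```

Crossing the quality levels with the complexity levels, as the loop at the bottom does, is what would allow main effects and interaction effects to be separated in a fuller factorial design.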

Published In

Journal of Data and Information Quality, Volume 2, Issue 2
February 2011
102 pages
ISSN:1936-1955
EISSN:1936-1963
DOI:10.1145/1891879
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 February 2011
Accepted: 01 November 2010
Revised: 01 September 2010
Received: 01 December 2008
Published in JDIQ Volume 2, Issue 2

Author Tags

  1. Data quality
  2. data mining
  3. data quality metrics and measurements
  4. information quality

Qualifiers

  • Research-article
  • Research
  • Refereed
