skip to main content
research-article

Detecting outlying properties of exceptional objects

Published: 23 April 2009 Publication History

Abstract

Assume you are given a data population characterized by a certain number of attributes. Assume, moreover, you are provided with the information that one of the individuals in this data population is abnormal, but no reason whatsoever is given to you as to why this particular individual is to be considered abnormal. In several cases, you will be indeed interested in discovering such reasons. This article is precisely concerned with this problem of discovering sets of attributes that account for the (a priori stated) abnormality of an individual within a given dataset. A criterion is presented to measure the abnormality of combinations of attribute values featured by the given abnormal individual with respect to the reference population. In this respect, each subset of attributes is intended to somehow represent a “property” of individuals. We distinguish between global and local properties. Global properties are subsets of attributes explaining the given abnormality with respect to the entire data population. With local ones, instead, two subsets of attributes are singled out, where the former one justifies the abnormality within the data subpopulation selected using the values taken by the exceptional individual on those attributes included in the latter one. The problem of individuating abnormal properties with associated explanations is formally stated and analyzed. Such a formal characterization is then exploited in order to devise efficient algorithms for detecting both global and local forms of most abnormal properties. The experimental evidence, which is accounted for in the article, shows that the algorithms are both able to mine meaningful information and to accomplish the computational task by examining a negligible fraction of the search space.

References

[1]
Aggarwal, C. C. and Yu, P. 2001. Outlier detection for high dimensional data. In Proceedings of the International Conference on Managment of Data (SIGMOD), 37--46.
[2]
Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules in large databases. In Proceedings of the International Conference on Very Large Databases (VLDB), 487--499.
[3]
Angiulli, F. and Pizzuti, C. 2002. Fast outlier detection in large high-dimensional datasets. In Proceedings of the International Conference on Principles of Data Mining and Knowledge Discovery (PKDD), 15--26.
[4]
Angiulli, F. and Pizzuti, C. 2005. Outlier mining in large high-dimensional datasets. IEEE Trans. Knowl. Data Eng. 17, 2, 203--215.
[5]
Arning, A., Aggarwal, C., and Raghavan, P. 1996. A linear method for deviation detection in large databases. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD), 164--169.
[6]
Barnett, V. and Lewis, T. 1994. Outliers in Statistical Data. John Wiley & Sons.
[7]
Breiman, L., Friedman, J., Olshen, R., and Stone, C. 1984. Classification and Regression Trees. Wadsworth, Belmont.
[8]
Breunig, M. M., Kriegel, H., Ng, R., and Sander, J. 2000. Lof: Identifying density-based local outliers. In Proceedings of the International Conference on Managment of Data (SIGMOD), 93--104.
[9]
Chaudhuri, S. and Dayal, U. 1997. An overview of data warehousing and olap technology. SIGMOD Rec. 26, 1, 65--74.
[10]
Chen, Z. and Toda, S. 1995. The complexity of selecting maximal solutions. Inf. Comput. 1192, 2, 231--239.
[11]
Codd, E., Codd, S., and Salley, C. 1993. Providing OLAP (on-line analytical processing) to user-analysts: An it mandate. Tech. rep., Codd & Date, Inc.
[12]
De Benedictis, G., Rose, G., Carrieri, G., Luca, M. D., Falcone, E., Passarino, G., Bonafè, M., Monti, D., Baggio, G., Bertolini, S., Mari, D., Mattace, R., and Franceschi, C. 1999. Mitochondrial DNA inherited variants are associates with successful aging and longevity in humans. FASEB J. 13, 12, 1532--1536.
[13]
Garasto, S., Berardelli, M., Rango, F. D., Mari, V., Feraco, E., and Benedictis, G. D. 2004. A study of the average effect of the 3'apob-vntr polymorphism on lipidemic parameters could explain why the short alleles (< 35 repeats) are rare in centenarians. BMC Medical Genetics 5, 3.
[14]
Garey, M. R. and Johnson, D. S. 1979. Computer and Intractability. W. H. Freeman and Company, New York.
[15]
Gini, C. 1921. Measurement of inequality of incomes. The Economic J. 31, 124--126.
[16]
Griffiths, A. J. F., Miller, J. H., Suzuki, D. T., Lewontin, R. C., and Gelbart, W. M. 1996. An Introduction to Genetic Analysis. W. H. Freeman.
[17]
Hodge, V. and Austin, J. 2004. A survey of outlier detection methodologies. Artif. Intell. Rev. 22, 2, 85--126.
[18]
Karp, R. 1972. Reducibility among combinatorial problems. In Complexity of Computer Computations. Plenum, New York, 85--103.
[19]
Klösgen, W. 1996. Explora: A multipattern and multistrategy discovery assistant. In Advances in Knowledge Discovery and Data Mining (KDD), 249--271.
[20]
Knorr, E. and Ng, R. 1998. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the International Conference on Very Large Databases (VLDB), 392--403.
[21]
Knorr, E. and Ng, R. 1999. Finding intensional knowledge of distance-based outliers. In Proceedings of the International Conference on Very Large Databases(VLDB), 211--222.
[22]
Knorr, E., Ng, R., and Tucakov, V. 2000. Distance-based outlier: Algorithms and applications. VLDB J. 8, 3-4, 237--253.
[23]
Knuth, D. 1998. The Art of Computer Programming, Volume 3: Sorting and Searching, 2nd ed. Addison-Wesley.
[24]
Krentel, M. 1988. The complexity of optimization problems. J. Comput. Syst. Sci. 36, 3, 490--509.
[25]
Mehta, M., Agrawal, R., and Rissanen, J. 1996. Sliq: A fast scalable classifier for data mining. In International Conference on Extending Database Technology (EDBT), 18--32.
[26]
Newman, D. J., Hettich, S., Blake, C. L., and Merz, C. 1998. UCI repository of machine learning databases.
[27]
Papadimitriou, C. 1994. Computational Complexity. Addison-Wesley, USA.
[28]
Passarino, G., Montesanto, A., Dato, S., Giordano, S., Domma, F., Mari, V., Feraco, E., and Benedictis, G. D. 2006. Sex and age specificity of susceptibility genes modulating survival at old age. Human Heredity (Int. J. Hum. Medical Genetics) 62, 4, 213--220.
[29]
Ramaswamy, S., Rastogi, R., and Shim, K. 2000. Efficient algorithms for mining outliers from large datasets. In Proceedings of the International Conference on Managment of Data (SIGMOD), 427--438.
[30]
Rymon, R. 1992. Search through systematic set enumeration. In Proceedings of the International Conference on Principles of Knowledge and Reasoning (KR), 539--550.
[31]
Sarawagi, S., Agrawal, R., and Megiddo, N. 1998. Discovery-Driven exploration of OLAP data cubes. In Proceedings of the International Conference on Extending Database Technology (EDBT), 168--182.
[32]
Selman, A. 1994. A taxonomy of complexity classes of functions. J. Comput. Syst. Sci. 48, 2, 357--381.
[33]
Shafer, J., Agrawal, R., and Mehta, M. 1996. Sprint: A scalable parallel classifier for data mining. In Proceedings of the International Conference on Very Large Data Bases (VLDB), 544--555.
[34]
Suzuki, E. 2006. Data mining methods for discovering interesting exceptions from an unsupervised table. J. Universal Comput. Sci. 12, 6, 627--653.
[35]
Tan, P. N., Steinbach, M., and Kumar, V. 2005. Introduction to Data Mining. Addison-Wesley.
[36]
Wei, L., Qian, W., Zhou, A., Jin, W., and Yu, J. 2003. Hot: Hypergraph-Based outlier test for categorical data. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 399--410.
[37]
Zhang, J., Gao, Q., and Wang, H. 2006. A novel method for detecting outlying subspaces in high-dimensional databases using genetic algorithm. In Proceedings of the International Conference on Data Mining (ICDM), 731--740.
[38]
Zhang, J. and Wang, H. 2006. Detecting outlying subspaces for high-dimensional data: The new task, algorithms, and performance. Knowl. Inf. Syst. 10, 3, 333--355.
[39]
Zhu, C., Kitagawa, H., and Faloutsos, C. 2005. Example-Based robust outlier detection in high dimensional datasets. In Proceedings of the International Conference on Data Mining (ICDM), 829--832.

Cited By

View all

Index Terms

  1. Detecting outlying properties of exceptional objects

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Transactions on Database Systems
    ACM Transactions on Database Systems  Volume 34, Issue 1
    April 2009
    349 pages
    ISSN:0362-5915
    EISSN:1557-4644
    DOI:10.1145/1508857
    Issue’s Table of Contents
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 23 April 2009
    Accepted: 01 September 2008
    Revised: 01 May 2008
    Received: 01 September 2007
    Published in TODS Volume 34, Issue 1

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. Data mining
    2. knowledge discovery
    3. outlier characterization

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)11
    • Downloads (Last 6 weeks)1
    Reflects downloads up to 04 Jan 2025

    Other Metrics

    Citations

    Cited By

    View all

    View Options

    Login options

    Full Access

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media