DOI: 10.1145/956750.956761

Efficient data reduction with EASE

Published: 24 August 2003

Abstract

A variety of mining and analysis problems, ranging from association-rule discovery to contingency-table analysis to materialization of certain approximate datacubes, involve the extraction of knowledge from a set of categorical count data. Such data can be viewed as a collection of "transactions," where a transaction is a fixed-length vector of counts. Classical algorithms for solving count-data problems require one or more computationally intensive passes over the entire database and can be prohibitively slow. One effective method for dealing with this ever-worsening scalability problem is to run the algorithms on a small sample of the data. We present a new data-reduction algorithm, called EASE, for producing such a sample. Like the FAST algorithm introduced by Chen et al., EASE is specifically designed for count-data applications. Both EASE and FAST take a relatively large initial random sample and then deterministically produce a subsample whose appropriately defined "distance" from the complete database is minimal. Unlike FAST, which obtains the final subsample by quasi-greedy descent, EASE uses epsilon-approximation methods to obtain the final subsample by a process of repeated halving. Experiments in the context of both association-rule mining and classical χ² contingency-table analysis show that EASE outperforms both FAST and simple random sampling, sometimes dramatically.
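
The idea of repeated halving can be illustrated with a small sketch. This is not the EASE algorithm itself: the abstract does not give its penalty function or its distance measure, so the code below substitutes a plain greedy rule (minimize the sum of squared per-item discrepancies) and measures distance as the largest gap in single-item support. It also assumes binary transactions given as tuples of items rather than general count vectors. The names `halve`, `reduce_sample`, `supports`, and `max_support_error` are hypothetical.

```python
from collections import defaultdict


def halve(transactions):
    """Keep roughly half of the transactions so that every item's count
    is roughly halved too (greedy stand-in, NOT the paper's method)."""
    disc = defaultdict(int)  # per-item (kept - dropped) count so far
    balance = 0              # (kept - dropped) transaction count so far
    kept = []
    for t in transactions:
        items = set(t)
        # Keeping t changes the sum of squared discrepancies by
        # 2 * score + const, dropping it by -2 * score + const,
        # so the greedy choice is to move against the sign of score.
        score = balance + sum(disc[i] for i in items)
        sign = -1 if score > 0 else 1
        for i in items:
            disc[i] += sign
        balance += sign
        if sign == 1:
            kept.append(t)
    return kept


def reduce_sample(sample, target_size):
    """Halve repeatedly while the result can still hold target_size rows."""
    current = list(sample)
    while len(current) >= 2 * target_size and len(current) > 1:
        current = halve(current)
    return current


def supports(transactions):
    """Fraction of transactions that contain each item."""
    n = len(transactions)
    counts = defaultdict(int)
    for t in transactions:
        for i in set(t):
            counts[i] += 1
    return {i: c / n for i, c in counts.items()}


def max_support_error(full, sub):
    """Assumed distance: largest absolute gap in single-item support."""
    f, s = supports(full), supports(sub)
    return max((abs(f[i] - s.get(i, 0.0)) for i in f), default=0.0)
```

Calling `reduce_sample(initial_sample, target_size)` and then `max_support_error(initial_sample, result)` gives a rough sense of how well a halved subsample preserves item frequencies; the same check can be run against a simple random subsample of equal size for comparison.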

References

[1]
R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proceedings of ACM SIGMOD International Conference on Management of Data, 1993.
[2]
R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of International Conference on Very Large Databases (VLDB), 1994.
[3]
N. Alon and J. H. Spencer. The probabilistic method. Wiley Interscience, New York, 1992.
[4]
H. Bronnimann, B. Chazelle, and J. Matousek. Product range spaces, sensitive sampling, and derandomization. SIAM Journal on Computing, 1999.
[5]
H. Bronnimann, B. Chen, M. Dash, P. Haas, and P. Scheuermann. Efficient data reduction method for on-line association rule discovery. In Proceedings of NSF Workshop on New Generation of Data Mining, 2002.
[6]
B. Chazelle. The discrepancy method. Cambridge University Press, Cambridge, United Kingdom, 2000.
[7]
B. Chen, P. Haas, and P. Scheuermann. A new two-phase sampling based algorithm for discovering association rules. In Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.
[8]
J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total. In Proceedings of 12th International Conference on Data Engineering (ICDE), pages 152--159, 1996.
[9]
J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of ACM SIGMOD International Conference on Management of Data, 2000.
[10]
G. S. Manku and R. Motwani. Approximate frequency counts over data streams. In Proceedings of International Conference on Very Large Databases (VLDB), 2002.
[11]
J. Matousek. Derandomization in computational geometry. Journal of Algorithms, 20(3):545--580, 1996.
[12]
H. Toivonen. Sampling large databases for association rules. In Proceedings of International Conference on Very Large Databases (VLDB), 1996.
[13]
L. G. Valiant. A theory of the learnable. Communications of the Association for Computing Machinery, 27(11):1134--1142, 1984.
[14]
M. J. Zaki, S. Parthasarathy, W. Lin, and M. Ogihara. Evaluation of sampling for data mining of association rules. Technical Report RC 617, University of Rochester, Rochester, NY, 1996.

    Published In

    KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
    August 2003
    736 pages
    ISBN:1581137370
    DOI:10.1145/956750

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. OLAP
    2. association rules
    3. count dataset
    4. data streams
    5. frequency estimation
    6. sampling

    Acceptance Rates

    KDD '03 Paper Acceptance Rate 46 of 298 submissions, 15%;
    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
