Article

Efficient data reduction with EASE

Authors:

Peter Scheuermann

KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 59 - 68

https://doi.org/10.1145/956750.956761

Published: 24 August 2003

Abstract

A variety of mining and analysis problems --- ranging from association-rule discovery to contingency table analysis to materialization of certain approximate datacubes --- involve the extraction of knowledge from a set of categorical count data. Such data can be viewed as a collection of "transactions," where a transaction is a fixed-length vector of counts. Classical algorithms for solving count-data problems require one or more computationally intensive passes over the entire database and can be prohibitively slow. One effective method for dealing with this ever-worsening scalability problem is to run the algorithms on a small sample of the data. We present a new data-reduction algorithm, called EASE, for producing such a sample. Like the FAST algorithm introduced by Chen et al., EASE is especially designed for count data applications. Both EASE and FAST take a relatively large initial random sample and then deterministically produce a subsample whose "distance" --- appropriately defined --- from the complete database is minimal. Unlike FAST, which obtains the final subsample by quasi-greedy descent, EASE uses epsilon-approximation methods to obtain the final subsample by a process of repeated halving. Experiments both in the context of association rule mining and classical χ² contingency-table analysis show that EASE outperforms both FAST and simple random sampling, sometimes dramatically.
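The repeated-halving idea in the abstract can be illustrated with a small sketch. This is not the authors' EASE algorithm: it substitutes a simple greedy discrepancy coloring for the epsilon-approximation machinery, and the names `halve`, `reduce_sample`, `item_frequencies`, and the pseudo-item `_ALL` are hypothetical. Each transaction is colored +1 (keep) or -1 (drop) so that, for every item, kept and dropped occurrences stay roughly balanced, which keeps item frequencies in the surviving half close to those of the input.

```python
from collections import Counter

# Hypothetical pseudo-item treated as present in every transaction, so the
# coloring also balances the number of kept versus dropped transactions.
_ALL = object()

def halve(transactions):
    """Greedy stand-in for one halving step (an assumption, not EASE itself)."""
    disc = Counter()  # per-item signed count: (#kept) - (#dropped)
    kept = []
    for t in transactions:
        items = set(t) | {_ALL}
        # Coloring +1 changes the sum of squared discrepancies by sum(2d + 1),
        # coloring -1 by sum(-2d + 1); pick the sign that opposes sum(d).
        s = sum(disc[i] for i in items)
        color = -1 if s > 0 else 1
        for i in items:
            disc[i] += color
        if color == 1:
            kept.append(t)
    return kept

def reduce_sample(transactions, target_size):
    """Halve repeatedly until the sample has at most target_size transactions."""
    sample = list(transactions)
    while len(sample) > target_size:
        smaller = halve(sample)
        if len(smaller) == len(sample):  # no progress, stop
            break
        sample = smaller
    return sample

def item_frequencies(transactions):
    """Fraction of transactions that contain each item."""
    n = len(transactions)
    counts = Counter(i for t in transactions for i in set(t))
    return {i: c / n for i, c in counts.items()}

if __name__ == "__main__":
    initial_sample = [{"a", "b"}, {"a"}, {"b"}, {"a", "c"}] * 8
    final_sample = reduce_sample(initial_sample, target_size=8)
    print(len(final_sample), item_frequencies(final_sample))
```

Comparing `item_frequencies` on the initial and final samples gives a rough view of the "distance" the abstract refers to. The paper's actual distance function and coloring rule may differ from this sketch.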

References

[1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proceedings of ACM SIGMOD International Conference on Management of Data, 1993.

[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of International Conference on Very Large Databases (VLDB), 1994.

[3] N. Alon and J. H. Spencer. The probabilistic method. Wiley Interscience, New York, 1992.

[4] H. Bronnimann, B. Chazelle, and J. Matousek. Product range spaces, sensitive sampling, and derandomization. SIAM Journal on Computing, 1999.

[5] H. Bronnimann, B. Chen, M. Dash, P. Haas, and P. Scheuermann. Efficient data reduction method for on-line association rule discovery. In Proceedings of NSF Workshop on New Generation of Data Mining, 2002.

[6] B. Chazelle. The discrepancy method. Cambridge University Press, Cambridge, United Kingdom, 2000.

[7] B. Chen, P. Haas, and P. Scheuermann. A new two-phase sampling based algorithm for discovering association rules. In Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.

[8] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total. In Proceedings of 12th International Conference on Data Engineering (ICDE), pages 152--159, 1996.

[9] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of ACM SIGMOD International Conference on Management of Data, 2000.

[10] G. S. Manku and R. Motwani. Approximate frequency counts over data streams. In Proceedings of International Conference on Very Large Databases (VLDB), 2002.

[11] J. Matousek. Derandomization in computational geometry. Journal of Algorithms, 20(3):545--580, 1996.

[12] H. Toivonen. Sampling large databases for association rules. In Proceedings of International Conference on Very Large Databases (VLDB), 1996.

[13] L. G. Valiant. A theory of the learnable. Communications of the Association for Computing Machinery, 27(11):1134--1142, 1984.

[14] M. J. Zaki, S. Parthasarathy, W. Lin, and M. Ogihara. Evaluation of sampling for data mining of association rules. Technical Report RC 617, University of Rochester, Rochester, NY, 1996.

Index Terms

Efficient data reduction with EASE
1. Information systems
  1. Information systems applications

Published In

KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining

August 2003

736 pages

ISBN:1581137370

DOI:10.1145/956750

Conference Chair: Lise Getoor (University of Maryland, College Park)

General Chair: Ted Senator (DARPA)

Program Chairs: Pedro Domingos (University of Washington), Christos Faloutsos (Carnegie Mellon University)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 August 2003

Qualifiers

Article

Conference

KDD03

Sponsor:

KDD03: The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 24 - 27, 2003

Washington, D.C.

Acceptance Rates

KDD '03 Paper Acceptance Rate 46 of 298 submissions, 15%;

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Bibliometrics & Citations

Article Metrics

Total Citations: 37

Total Downloads: 694

Downloads (Last 12 months): 1

Downloads (Last 6 weeks): 0

Reflects downloads up to 06 Nov 2024

Cited By

Song S, Huang Y, Jiang P, Yu X, Zheng W, Di S, Cao Q, Feng Y, Xie Z, Cappello F, Mencagli G, Dazzi P, Lowenthal D, Badia R (2024). CereSZ: Enabling and Scaling Error-bounded Lossy Compression on Cerebras CS-2. Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, pages 309-321. DOI: 10.1145/3625549.3658691. Online publication date: 3-Jun-2024.
https://dl.acm.org/doi/10.1145/3625549.3658691

Zhang Z, Huang J (2023). Fast Frequent Patterns Mining by Multiple Sampling With Tight Guarantee Under Bayesian Statistics. IEEE Transactions on Cybernetics, 53:5, pages 2993-3006. DOI: 10.1109/TCYB.2021.3125196. Online publication date: May-2023.
https://doi.org/10.1109/TCYB.2021.3125196

Kassaie B, Irving E, Tompa F (2021). Computer-Assisted Cohort Identification in Practice. ACM Transactions on Computing for Healthcare, 3:2, pages 1-28. DOI: 10.1145/3483411. Online publication date: 20-Dec-2021.
https://dl.acm.org/doi/10.1145/3483411

Hamidi H, Mousavi R (2018). Analysis and Evaluation of a Framework for Sampling Database in Recommenders. Journal of Global Information Management, 26:1, pages 41-57. DOI: 10.4018/JGIM.2018010103. Online publication date: 1-Jan-2018.
https://dl.acm.org/doi/10.4018/JGIM.2018010103

Hamidi H, Hashemzadeh E (2017). An Approach to Improve Generation of Association Rules in Order to Be Used in Recommenders. International Journal of Data Warehousing and Mining, 13:4, pages 1-18. DOI: 10.4018/IJDWM.2017100101. Online publication date: 1-Oct-2017.
https://dl.acm.org/doi/10.4018/IJDWM.2017100101

Zhang Z, Pedrycz W, Huang J (2017). Efficient frequent itemsets mining through sampling and information granulation. Engineering Applications of Artificial Intelligence, 65:C, pages 119-136. DOI: 10.1016/j.engappai.2017.07.016. Online publication date: 1-Oct-2017.
https://dl.acm.org/doi/10.1016/j.engappai.2017.07.016

Riondato M, Upfal E (2014). Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees. ACM Transactions on Knowledge Discovery from Data, 8:4, pages 1-32. DOI: 10.1145/2629586. Online publication date: 29-Aug-2014.
https://dl.acm.org/doi/10.1145/2629586

Raj K, Padma P (2013). Application of Association Rule Mining: A case study on team India. 2013 International Conference on Computer Communication and Informatics, pages 1-6. DOI: 10.1109/ICCCI.2013.6466294. Online publication date: Jan-2013.
https://doi.org/10.1109/ICCCI.2013.6466294

Elsayed S, Rajasekaran S, Ammar R (2013). Integrating clonal selection and deterministic sampling for efficient associative classification. 2013 IEEE Congress on Evolutionary Computation, pages 3236-3243. DOI: 10.1109/CEC.2013.6557966. Online publication date: Jun-2013.
https://doi.org/10.1109/CEC.2013.6557966

Vaidehi V, Vardhini M, Yogeshwaran H, Inbasagar G, Bhargavi R, Hemalatha C (2013). Agent Based Health Monitoring of Elderly People in Indoor Environments Using Wireless Sensor Networks. Procedia Computer Science, 19, pages 64-71. DOI: 10.1016/j.procs.2013.06.014. Online publication date: 2013.
https://doi.org/10.1016/j.procs.2013.06.014
