Authors: Christian Ernst (1); Youssef Hmamouche (2) and Alain Casali (2)
Affiliations: (1) Ecole des Mines de St Etienne and LIMOS, France; (2) Aix Marseille Universite, France
Keyword(s): Data Mining, Data Preparation, Outliers, Discretization Methods, Parallelism and Multicore Encoding.
Related Ontology Subjects/Areas/Topics: Artificial Intelligence; Data Reduction and Quality Assessment; Knowledge Discovery and Information Retrieval; Knowledge-Based Systems; Pre-Processing and Post-Processing for Data Mining; Symbolic Systems
Abstract:
Because data preparation has a substantial impact on data mining results, we propose
an original framework for automatically preparing the data of any given database. For each
attribute of the database, our work addresses two points: (i) specifying an optimized outlier detection method, and
(ii) identifying the most appropriate discretization method. Concerning the former, we show that the detection of an outlier
depends on whether the data distribution is normal. Concerning the latter,
what matters is the shape of the density function of the attribute's distribution law. We therefore
propose to choose the discretization method automatically, based on a multi-criteria
(Entropy, Variance, Stability) evaluation. Processing is performed in parallel using multicore capabilities.
Experiments validate our approach and show that the best discretization method
is not always the same one.
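
The two per-attribute steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every name and threshold in it is an assumption made for illustration: normality is approximated by a crude skewness check (`skew_threshold`), the outlier rules are the common three-sigma rule and Tukey's interquartile fences, the candidate discretizations are limited to equal-width and equal-frequency binning, and only two of the three criteria (entropy and within-bin variance) are scored. The Stability criterion and the parallel execution are left out.

```python
import math
import statistics
from collections import Counter


def skewness(values):
    """Population skewness, used here as a crude cue for normality (assumption)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return 0.0
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def detect_outliers(values, skew_threshold=0.5):
    """Return the values flagged as outliers.

    Hypothetical rule: |skewness| < skew_threshold counts as "roughly normal".
    """
    if abs(skewness(values)) < skew_threshold:
        # Roughly symmetric data: three-sigma rule.
        mean = statistics.fmean(values)
        sd = statistics.pstdev(values)
        low, high = mean - 3 * sd, mean + 3 * sd
    else:
        # Skewed data: Tukey fences built from the interquartile range.
        q1, _, q3 = statistics.quantiles(values, n=4)
        iqr = q3 - q1
        low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


def equal_width_bins(values, k):
    """Assign each value to one of k intervals of equal length."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding about the same number of items."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


def entropy(bins):
    """Shannon entropy (in bits) of the bin occupancy."""
    n = len(bins)
    return -sum((c / n) * math.log2(c / n) for c in Counter(bins).values())


def within_bin_variance(values, bins):
    """Size-weighted average of the variance inside each bin."""
    groups = {}
    for v, b in zip(values, bins):
        groups.setdefault(b, []).append(v)
    n = len(values)
    return sum(len(g) / n * statistics.pvariance(g) for g in groups.values())


def score_discretizations(values, k):
    """Score each candidate method on both criteria; no winner is picked here."""
    candidates = {
        "equal_width": equal_width_bins(values, k),
        "equal_frequency": equal_frequency_bins(values, k),
    }
    return {
        name: {"entropy": entropy(b), "variance": within_bin_variance(values, b)}
        for name, b in candidates.items()
    }


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
    print(detect_outliers(data))
    print(score_discretizations(data, k=2))
```

On skewed data such as the sample list, the two criteria can rank the candidate methods differently, which is why a single criterion may not be enough to select a method. How the criteria are combined into one decision is deliberately left open in this sketch.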