
Exploiting partial decision trees for feature subset selection in e-mail categorization

Published: 23 April 2006
DOI: 10.1145/1141277.1141536

Abstract

In this paper we propose PARTfs, which adopts a supervised machine learning algorithm, namely partial decision trees, as a method for feature subset selection. In particular, we show that PARTfs achieves an aggressive reduction of the feature space while still producing classification results comparable to those obtained with conventional feature selection metrics. The approach is verified empirically by applying two different document representations and four different text classification algorithms to a document collection consisting of personal e-mail messages. The results show that a reduction of the feature space by roughly a factor of ten is achievable without loss of classification accuracy.
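To make the selection step concrete, the following is a minimal sketch of the general idea rather than the authors' implementation: a tree/rule learner is induced on the full feature space, and only the features that actually appear as split attributes in the learned model are retained for the downstream classifier. It uses scikit-learn's DecisionTreeClassifier as a stand-in for the PART partial-decision-tree learner (which ships with Weka, not scikit-learn); the tiny e-mail corpus, folder labels, and the Naive Bayes classifier used afterwards are placeholders.

# Sketch of tree-based feature subset selection for e-mail categorization.
# Assumptions: DecisionTreeClassifier stands in for the PART rule learner,
# and the three "e-mails" below are toy placeholders for a real mail corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

emails = ["meeting agenda attached", "cheap offer buy now", "project deadline update"]
folders = ["work", "spam", "work"]  # target categories (mail folders)

# Bag-of-words representation of the messages (the paper also evaluates n-grams).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Step 1: induce a decision tree on the full feature space.
tree = DecisionTreeClassifier(random_state=0).fit(X, folders)

# Step 2: keep only the features that occur as split attributes in the tree;
# internal nodes have feature index >= 0, leaves are marked with -2.
selected = sorted({f for f in tree.tree_.feature if f >= 0})
X_reduced = X[:, selected]
print("kept %d of %d features" % (len(selected), X.shape[1]))

# Step 3: train a conventional text classifier on the reduced feature space.
clf = MultinomialNB().fit(X_reduced, folders)
print(clf.predict(X_reduced))

Whether the reduced vocabulary preserves accuracy is then checked as in the paper's evaluation: by training the downstream classifiers on both the full and the reduced representations and comparing the results.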




    Published In

    SAC '06: Proceedings of the 2006 ACM symposium on Applied computing
    April 2006
    1967 pages
    ISBN: 1595931082
    DOI: 10.1145/1141277


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. feature selection
    2. indexing methods
    3. information filtering
    4. machine learning
    5. text categorization

    Qualifiers

    • Article

    Conference

    SAC '06

    Acceptance Rates

    Overall Acceptance Rate 1,650 of 6,669 submissions, 25%

