An optimal approach for text feature selection

Published: 01 July 2022

Highlights

This paper addresses text categorization and proposes MFX, a statistical feature selection method that treats all documents from the same category as one extended document and chooses the most discriminative terms: those that are frequent and common across all documents of a category but rarely present in other categories. MFX is language-independent and is backed by a mathematical formulation that finds the optimal number of features for accurate text categorization. Experimental results show that MFX performs as well as or better than existing state-of-the-art feature selection techniques. The method applies to tasks such as spam filtering, opinion mining, and topic spotting, among others.
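
As a rough illustration of the "extended document" idea, the Python sketch below pools each category's documents into one bag of terms and scores every term by how frequent it is in that pooled document, how many of the category's documents contain it, and how exclusive it is to the category. The scoring product, the function names `score_terms` and `top_terms`, and the toy data are assumptions made for illustration only; this is not the MFX formulation from the paper.

```python
from collections import Counter


def score_terms(docs_by_category):
    """Score (category, term) pairs with a simple illustrative heuristic.

    docs_by_category maps a category name to a list of tokenized documents.
    NOTE: the score below is a hypothetical stand-in, not the MFX formula.
    """
    # One "extended document" per category: pooled term counts.
    extended = {
        cat: Counter(term for doc in docs for term in doc)
        for cat, docs in docs_by_category.items()
    }
    # Number of documents in each category that contain each term.
    coverage = {
        cat: Counter(term for doc in docs for term in set(doc))
        for cat, docs in docs_by_category.items()
    }
    scores = {}
    for cat, counts in extended.items():
        total_terms = sum(counts.values())
        n_docs = len(docs_by_category[cat])
        for term, tf in counts.items():
            frequent = tf / total_terms            # share of the extended document
            common = coverage[cat][term] / n_docs  # share of category docs with the term
            overall = sum(other[term] for other in extended.values())
            exclusive = tf / overall               # share of all occurrences in this category
            scores[(cat, term)] = frequent * common * exclusive
    return scores


def top_terms(scores, category, k):
    """Return the k highest-scoring terms for one category (ties broken alphabetically)."""
    ranked = sorted(
        ((term, s) for (cat, term), s in scores.items() if cat == category),
        key=lambda pair: (-pair[1], pair[0]),
    )
    return [term for term, _ in ranked[:k]]


toy_corpus = {
    "sport": [["goal", "match", "the"], ["goal", "team", "the"]],
    "finance": [["stock", "market", "the"], ["stock", "bank", "the"]],
}
toy_scores = score_terms(toy_corpus)
print(top_terms(toy_scores, "sport", 2))
```

In this toy corpus, "the" is frequent and common in both categories, so the exclusivity factor pushes it below "goal" for the sport category, which is the qualitative behavior the highlights describe.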

Abstract

Traditionally, feature selection is conducted by first deriving a candidate list of features, then ranking them and selecting the top features based on a predefined threshold. These methods depend heavily on the choice of threshold and can therefore lead to sub-optimal text categorization results. In this paper, we address the selection problem with a one-step method designed to select the subset of features optimally. The selection is formulated mathematically as an optimization problem whose objective is to maximize classification accuracy while simultaneously deriving and choosing the most discriminative features. Our method, MFX, is applicable to many of the conventional methods and has two distinguishing aspects. First, it treats all documents from the same category as one extended document instead of analyzing individual documents. Second, it chooses the most discriminative terms: those that are frequent and common across all documents of the same category and minimally present in other categories. Moreover, MFX is language-independent. It was tested on the well-known Reuters RCV1 benchmark dataset and, to showcase its language independence, on Arabic datasets extracted from Arabic news sources. The results indicated that MFX always performed similarly to or better than other well-known feature selection methods. MFX with a Support Vector Machine (SVM) classifier was also shown to outperform recent text classification algorithms based on neural networks and word embeddings.
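
The threshold dependence of the traditional two-step pipeline can be made concrete with a small Python sketch. It ranks terms per category, keeps the top k, and measures held-out accuracy for several values of k. The ranking criterion (a document-frequency margin), the overlap-based classifier, the function names, and the toy data are all assumptions chosen for brevity; the sketch illustrates the conventional rank-then-threshold approach that the abstract criticizes, not the one-step MFX optimization.

```python
from collections import Counter


def rank_terms(train):
    """Rank terms per category by in-category minus out-of-category document frequency."""
    doc_freq = {}
    for tokens, label in train:
        doc_freq.setdefault(label, Counter()).update(set(tokens))
    ranked = {}
    for label, counts in doc_freq.items():
        def margin(term):
            elsewhere = sum(c[term] for other, c in doc_freq.items() if other != label)
            return counts[term] - elsewhere
        # Highest margin first; ties broken alphabetically for determinism.
        ranked[label] = sorted(counts, key=lambda term: (-margin(term), term))
    return ranked


def classify(tokens, features):
    """Pick the category whose selected terms overlap most with the document.

    Ties go to the alphabetically first category.
    """
    present = set(tokens)
    return max(sorted(features), key=lambda label: len(features[label] & present))


def accuracy_by_threshold(train, held_out, thresholds):
    """Return held-out accuracy for each top-k threshold."""
    ranked = rank_terms(train)
    results = {}
    for k in thresholds:
        features = {label: set(terms[:k]) for label, terms in ranked.items()}
        hits = sum(classify(tokens, features) == label for tokens, label in held_out)
        results[k] = hits / len(held_out)
    return results


train = [
    (["goal", "match", "the"], "sport"),
    (["goal", "team", "the"], "sport"),
    (["stock", "market", "the"], "finance"),
    (["stock", "bank", "the"], "finance"),
]
held_out = [
    (["the", "goal", "late"], "sport"),
    (["the", "stock", "fell"], "finance"),
    (["team", "won"], "sport"),
]
print(accuracy_by_threshold(train, held_out, thresholds=[1, 3]))
```

On this toy data the third held-out document contains only a lower-ranked sport term, so it is misclassified when k = 1 and classified correctly when k = 3: the outcome changes with the threshold alone.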

Published In

Computer Speech and Language, Volume 74, Issue C
Jul 2022
227 pages

Publisher

Academic Press Ltd.

United Kingdom

Author Tags

  1. Feature selection
  2. Text categorization
  3. Text mining
  4. Data mining
  5. Arabic text mining

Qualifiers

  • Research-article
