An optimal approach for text feature selection

Authors: Wassim El-Hajj, Hazem Hajj

Volume 74, Issue C

https://doi.org/10.1016/j.csl.2022.101364

Published: 01 July 2022

Highlights

• This paper addresses text categorization and proposes a statistical feature selection method (MFX) that treats all documents from the same category as one extended document and chooses the most discriminative terms: those that are frequent and common across all documents of the same category but rarely present in other categories. MFX is language-independent and backed by a mathematical formulation that finds the optimal number of features that guarantees accurate text categorization. Experimental results show that MFX outperforms existing state-of-the-art techniques. The work is timely given its applicability to tasks such as spam filtering, opinion mining, and topic spotting, among others.

Abstract

Traditionally, feature selection is conducted by first deriving a candidate list of features, then ranking them and selecting the top features based on a predefined threshold. These methods are highly dependent on the choice of threshold and therefore lead to sub-optimal text categorization results. In this paper, we address the selection problem by proposing a one-step method designed to select the subset of features optimally. The selection is formulated mathematically as an optimization problem whose objective is to maximize classification accuracy while simultaneously deriving and choosing the most discriminative features. Our method, MFX, is applicable to many of the conventional methods and has two distinguishing aspects. First, it considers all documents from the same category as one extended document, instead of analyzing individual documents. Second, it chooses the most discriminative terms: those that are frequent and common across all documents of the same category and minimally present in other categories. Moreover, MFX is language-independent. It was tested on the well-known Reuters RCV1 benchmark dataset. To showcase its language independence, MFX was also tested on Arabic datasets extracted from Arabic news sources. The results indicated that MFX always performed similarly to or better than other well-known feature selection methods. MFX with a Support Vector Machine (SVM) classifier was also shown to outperform recent text classification algorithms based on neural networks and word embeddings.
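The two distinguishing aspects described in the abstract can be illustrated with a small sketch. This is not the MFX formulation itself, which the abstract does not spell out; the scoring rule below (term frequency in the extended document, times the share of same-category documents containing the term, times one minus the share of other-category documents containing it) is an assumption chosen only to mirror the stated intuition. The function names `category_term_scores` and `select_features`, and the fixed cutoff `k`, are hypothetical. In particular, the paper derives the number of features from its optimization problem rather than from a preset `k`.

```python
from collections import Counter


def category_term_scores(docs_by_category):
    """Score terms per category (illustrative rule, not the paper's formula).

    docs_by_category maps a category name to a list of tokenized documents.
    """
    n_docs = {c: len(docs) for c, docs in docs_by_category.items()}
    term_freq = {}  # counts over the category's "extended document"
    doc_freq = {}   # number of documents in the category containing the term
    for c, docs in docs_by_category.items():
        term_freq[c] = Counter(tok for doc in docs for tok in doc)
        doc_freq[c] = Counter(tok for doc in docs for tok in set(doc))

    scores = {}
    for c in docs_by_category:
        total_tokens = sum(term_freq[c].values())
        other_docs = sum(n for o, n in n_docs.items() if o != c)
        scores[c] = {}
        for term, count in term_freq[c].items():
            # Frequent in the extended document of this category.
            frequency = count / total_tokens
            # Common across the documents of this category.
            coverage = doc_freq[c][term] / n_docs[c]
            # Rarely present in documents of other categories.
            outside_hits = sum(
                doc_freq[o][term] for o in docs_by_category if o != c
            )
            outside = outside_hits / other_docs if other_docs else 0.0
            scores[c][term] = frequency * coverage * (1.0 - outside)
    return scores


def select_features(docs_by_category, k):
    """Keep the k best-scoring terms per category (k is a placeholder cutoff)."""
    scores = category_term_scores(docs_by_category)
    return {
        c: [t for t, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for c, s in scores.items()
    }
```

Under this assumed rule, a term that appears in every document of one category and in no document of any other receives the highest score for its frequency level, while a term that appears in all documents of the other categories scores zero.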
