DOI: 10.1145/3056662.3056671
research-article

Improved information gain feature selection method for Chinese text classification based on word embedding

Published: 26 February 2017 Publication History

Abstract

Feature selection is an important step in text categorization: it reduces the dimensionality of the vector space model (VSM) text representation and helps avoid the "curse of dimensionality". The information gain (IG) algorithm is one of the most effective feature selection algorithms, but it tends to filter out feature words that have a low IG score yet strongly discriminate between text categories. These words are often very similar to words with high IG scores. To address this defect, we propose an improved feature selection method that uses word embeddings to find the words most similar to those in the dictionary selected by the IG algorithm, and expands the dictionary with these words under certain rules. The method achieves good experimental results on the Sogou and Fudan Chinese text classification corpora.
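The idea described in the abstract can be sketched roughly as follows: score terms by information gain, keep the top-scoring ones as the dictionary, then add words whose embeddings lie close to a dictionary word. This is a minimal sketch, not the authors' implementation. The toy corpus, the two-dimensional vectors, the function names, and the `threshold` and `max_new` parameters are all assumptions made for illustration; the abstract does not state the actual expansion rules.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """IG(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t), using document presence."""
    classes = sorted(set(labels))
    with_t = Counter(l for d, l in zip(docs, labels) if term in d)
    without_t = Counter(l for d, l in zip(docs, labels) if term not in d)
    n = len(docs)
    n_t = sum(with_t.values())
    h_c = entropy([labels.count(c) for c in classes])
    h_t = entropy([with_t[c] for c in classes])
    h_not_t = entropy([without_t[c] for c in classes])
    return h_c - (n_t / n) * h_t - ((n - n_t) / n) * h_not_t


def select_by_ig(docs, labels, k):
    """Return the k terms with the highest IG (ties broken alphabetically)."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:k]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def expand_dictionary(selected, vectors, threshold=0.9, max_new=10):
    """Add words whose embedding is close to any selected word.

    threshold and max_new are hypothetical stand-ins for the
    expansion rules, which the abstract does not specify.
    """
    candidates = []
    for word, vec in vectors.items():
        if word in selected:
            continue
        best = max(
            (cosine(vec, vectors[s]) for s in selected if s in vectors),
            default=0.0,
        )
        if best >= threshold:
            candidates.append((best, word))
    candidates.sort(reverse=True)
    return list(selected) + [w for _, w in candidates[:max_new]]


# Toy data (assumed): each document is a set of already-segmented tokens.
docs = [{"stock", "market"}, {"stock", "fund"}, {"match", "goal"}, {"match", "team"}]
labels = ["finance", "finance", "sports", "sports"]

# Toy embeddings (assumed); in practice these would come from a trained model.
vectors = {
    "stock": [1.0, 0.0],
    "shares": [0.9, 0.1],
    "match": [0.0, 1.0],
    "goal": [0.1, 0.9],
    "fund": [0.7, 0.7],
}

selected = select_by_ig(docs, labels, k=2)
expanded = expand_dictionary(selected, vectors, threshold=0.9)
print(selected, expanded)
```

In this toy setup, a word such as "goal" has a low IG score because it occurs in only one document, but it is recovered during expansion because its vector lies close to a high-IG word. In a realistic pipeline the vectors would come from a word2vec model trained on a large segmented corpus.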


    Published In

    ICSCA '17: Proceedings of the 6th International Conference on Software and Computer Applications
    February 2017
    339 pages
    ISBN:9781450348577
    DOI:10.1145/3056662
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. IG
    2. text classification
    3. word2vec model

    Qualifiers

    • Research-article

    Conference

    ICSCA 2017

    Cited By

    • (2024) A feature selection model for document classification using Tom and Jerry Optimization algorithm. Multimedia Tools and Applications 83:4 (10273-10295). DOI: 10.1007/s11042-023-15828-6. Online publication date: 1-Jan-2024
    • (2021) A brain-inspired information processing algorithm and its application in text classification. Expert Systems with Applications: An International Journal 177:C. DOI: 10.1016/j.eswa.2021.114828. Online publication date: 1-Sep-2021
    • (2021) Web service classification based on information gain theory and bidirectional long short‐term memory with attention mechanism. Concurrency and Computation: Practice and Experience 33:13. DOI: 10.1002/cpe.6202. Online publication date: 18-Mar-2021
    • (2020) Targeted Metabolomics as a Tool in Discriminating Endocrine From Primary Hypertension. The Journal of Clinical Endocrinology & Metabolism. DOI: 10.1210/clinem/dgaa954. Online publication date: 31-Dec-2020
    • (2019) Web Service Discovery Based on Information Gain Theory and BiLSTM with Attention Mechanism. Methionine Dependence of Cancer and Aging (643-658). DOI: 10.1007/978-3-030-12981-1_45. Online publication date: 7-Feb-2019
    • (2018) Inter-Category Distribution Enhanced Feature Extraction for Efficient Text Classification. Big Data – BigData 2018 (17-25). DOI: 10.1007/978-3-319-94301-5_2. Online publication date: 21-Jun-2018
