DOI: 10.1145/2661829.2662067
research-article

Concept-based Short Text Classification and Ranking

Published: 03 November 2014

Abstract

Most existing approaches to text classification represent texts as vectors of words, the "Bag-of-Words" model. This representation yields a very high-dimensional feature space and frequently suffers from surface mismatch, where texts with related meanings share few or no words. Short texts aggravate both problems because they are brief and sparse. In this paper, we propose a "Bag-of-Concepts" representation for short texts, which aims to avoid surface mismatch and to handle synonymy and polysemy. Based on "Bag-of-Concepts," we propose a novel framework for lightweight short text classification applications. Leveraging a large taxonomy knowledge base, the framework learns a concept model for each category and conceptualizes a short text into a set of relevant concepts. A concept-based similarity mechanism then assigns the short text to the most similar category. One advantage of this mechanism is that it facilitates ranking short texts after classification, which many applications need, such as query or ad recommendation. We demonstrate the framework through a real online application, Channel-based Query Recommendation. Experiments show that the framework maps queries to channels with high precision (average precision of 90.3%), which is critical for recommendation applications.
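The pipeline described in the abstract (conceptualize a short text, learn one concept model per category, classify by concept similarity, then rank) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the tiny `TAXONOMY` dictionary and its weights stand in for the large taxonomy knowledge base, the `autos` and `tech` categories and their training queries are invented, and cosine similarity is used only because the abstract does not specify the similarity mechanism. It is not the authors' implementation.

```python
import math
from collections import Counter

# Hypothetical toy isA taxonomy: term -> {concept: weight}.
# A real system would query a large taxonomy knowledge base instead.
TAXONOMY = {
    "python": {"programming language": 0.7, "snake": 0.3},
    "java": {"programming language": 0.8, "island": 0.2},
    "jaguar": {"animal": 0.5, "car brand": 0.5},
    "bmw": {"car brand": 1.0},
}

# Hypothetical labeled short texts used to learn one concept model per category.
TRAINING = {
    "autos": ["bmw dealer", "jaguar price"],
    "tech": ["python tutorial", "java download"],
}


def conceptualize(text):
    """Map a short text to a bag of concepts by summing the weights of its known terms."""
    bag = Counter()
    for term in text.lower().split():
        for concept, weight in TAXONOMY.get(term, {}).items():
            bag[concept] += weight
    return bag


def cosine(a, b):
    """Cosine similarity between two concept bags; 0.0 if either bag is empty."""
    dot = sum(weight * b.get(concept, 0.0) for concept, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def learn_concept_models(training):
    """Aggregate the concept bags of each category's training texts into one model."""
    models = {}
    for category, texts in training.items():
        model = Counter()
        for text in texts:
            model.update(conceptualize(text))  # adds weights concept by concept
        models[category] = model
    return models


def classify(text, models):
    """Return (category, score) for the most similar concept model, or (None, 0.0)."""
    bag = conceptualize(text)
    scores = {category: cosine(bag, model) for category, model in models.items()}
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] == 0.0:
        return None, 0.0
    return best, scores[best]


def rank_in_category(texts, category, models):
    """Order texts by decreasing similarity to one category's concept model."""
    model = models[category]
    return sorted(texts, key=lambda t: cosine(conceptualize(t), model), reverse=True)


if __name__ == "__main__":
    models = learn_concept_models(TRAINING)
    print(classify("jaguar bmw", models))
    print(rank_in_category(["python", "jaguar", "bmw"], "autos", models))
```

Because the same similarity score drives both steps, the ranking within a category comes at no extra modeling cost once classification is in place, which is the property the abstract highlights for recommendation. In this toy setup an ambiguous term such as "jaguar" contributes to several concepts, and the other terms in the text determine which category wins.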

Published In

CIKM '14: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management
November 2014
2152 pages
ISBN:9781450325981
DOI:10.1145/2661829
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. msn channel
  2. query recommendation
  3. short text classification
  4. taxonomy knowledge

Conference

CIKM '14
Acceptance Rates

CIKM '14 paper acceptance rate: 175 of 838 submissions (21%)
Overall acceptance rate: 1,861 of 8,427 submissions (22%)
