DOI: 10.1145/2632188.2632205
research-article

Short text categorization exploiting contextual enrichment and external knowledge

Published: 11 July 2014

Abstract

We address the categorization of short texts, such as those posted by users on social networks and microblogging platforms, focusing specifically on Twitter. Because short texts provide too few word occurrences and often contain abbreviations and acronyms, traditional classification methods such as "Bag-of-Words" have limitations. Our proposed method enriches the original text with a new set of words, adding semantic value by using information extracted from webpages of the same temporal context. We then use those words to query Wikipedia, as an external knowledge base, with the final goal of categorizing the original text using a predefined set of Wikipedia categories. We also present a first experimental evaluation that confirms the effectiveness of the algorithm design and implementation choices and highlights some critical issues with short texts.
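The two stages described in the abstract (contextual enrichment, then categorization against an external knowledge base) can be illustrated with a minimal sketch. This is not the authors' implementation, which the abstract does not detail. The names `tokenize`, `enrich`, and `categorize`, the stopword list, the term-overlap scoring, and the sample data are all hypothetical. The context pages are assumed to have been collected from the same time window as the short text, and a small in-memory dictionary stands in for querying Wikipedia.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "and", "as", "at", "for", "in", "is", "of", "on", "the", "to", "with"}


def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens, minus stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def enrich(short_text, context_pages, k=5):
    """Append up to k frequent new terms from pages that share a term with the text.

    Assumption: context_pages were already restricted to the same temporal
    context as the short text; this sketch does no time filtering itself.
    """
    original = tokenize(short_text)
    seen = set(original)
    counts = Counter()
    for page in context_pages:
        tokens = tokenize(page)
        if seen.intersection(tokens):  # treat the page as related to the text
            counts.update(t for t in tokens if t not in seen)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return original + [term for term, _ in ranked[:k]]


def categorize(terms, category_articles):
    """Return (best category or None, scores) using simple term overlap.

    Assumption: category_articles is a toy stand-in for Wikipedia, mapping
    each predefined category to some article text.
    """
    scores = {}
    for category, article in category_articles.items():
        vocab = set(tokenize(article))
        scores[category] = sum(1 for t in terms if t in vocab)
    best = max(sorted(scores), key=lambda c: scores[c])
    return (best if scores[best] > 0 else None), scores


# Made-up sample data for illustration only.
tweet = "CR7 hat-trick tonight"
context_pages = [
    "Ronaldo, known as CR7, scored a hat-trick in the football match; "
    "football fans cheered the goal.",
    "Parliament debated the budget and the election schedule.",
]
category_articles = {
    "Sports": "Football is a team sport; players score a goal and fans watch each match.",
    "Politics": "Parliament holds an election and votes on the budget.",
}

enriched = enrich(tweet, context_pages, k=5)
print(enriched)
print(categorize(tokenize(tweet), category_articles)[0])  # category without enrichment
print(categorize(enriched, category_articles)[0])         # category with enrichment
```

In this toy setting the raw tweet shares no terms with either category text, so it cannot be categorized, while the enriched term list overlaps with the sports text. This mirrors the motivation stated in the abstract: short texts alone supply too few word occurrences.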

    Published In

    SoMeRA '14: Proceedings of the first international workshop on Social media retrieval and analysis
    July 2014
    72 pages
    ISBN:9781450330220
    DOI:10.1145/2632188
    • Program Chairs:
    • Markus Schedl,
    • Peter Knees,
    • Jialie Shen
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. context-aware retrieval
    2. enrichment
    3. evaluation
    4. wikipedia

    Qualifiers

    • Research-article

    Conference

    SIGIR '14
    Acceptance Rates

    SoMeRA '14 Paper Acceptance Rate 13 of 19 submissions, 68%;
    Overall Acceptance Rate 13 of 19 submissions, 68%
