DOI: 10.1145/1645953.1646291
poster

Improving binary classification on text problems using differential word features

Published: 02 November 2009

Abstract

We describe an efficient technique for weighting word-based features in binary classification tasks and show that it significantly improves classification accuracy on a range of problems. The most common text classification approach uses a document's ngrams (words and short phrases) as its features and assigns feature values equal to their frequency or TFIDF score relative to the training corpus. Our approach instead computes each value as the product of an ngram's document frequency and the difference of its inverse document frequencies in the positive and negative training sets. Although the technique is remarkably easy to implement, it yields a statistically significant improvement over standard bag-of-words approaches using support vector machines on a range of classification tasks. Our results show that the technique is robust and broadly applicable. We provide an analysis of why the approach works and how it can generalize to other domains and problems.
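
The weighting described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it reads "document frequency" as the ngram's count within the document being featurized, uses log base 2, and applies add-one smoothing so that ngrams absent from one class do not cause a division by zero. All three choices, along with the function names `doc_freqs`, `delta_idf_weights`, and `featurize`, are assumptions made for this example.

```python
import math
from collections import Counter


def doc_freqs(docs):
    """Count, for each term, how many documents in `docs` contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # each document counts a term at most once
    return df


def delta_idf_weights(pos_docs, neg_docs):
    """Per-term difference of IDF in the positive and negative training sets.

    Add-one smoothing is an assumption of this sketch, not taken from the abstract.
    """
    pos_df, neg_df = doc_freqs(pos_docs), doc_freqs(neg_docs)
    n_pos, n_neg = len(pos_docs), len(neg_docs)
    weights = {}
    for term in set(pos_df) | set(neg_df):
        idf_pos = math.log2((n_pos + 1) / (pos_df[term] + 1))
        idf_neg = math.log2((n_neg + 1) / (neg_df[term] + 1))
        weights[term] = idf_pos - idf_neg
    return weights


def featurize(tokens, weights):
    """Feature value = count of the term in this document * its IDF difference."""
    counts = Counter(tokens)
    # Terms never seen in training have no weight and are skipped.
    return {t: c * weights[t] for t, c in counts.items() if t in weights}


# Toy, made-up training data (tokenized documents) for illustration only.
pos_docs = [["great", "movie"], ["great", "plot"]]
neg_docs = [["bad", "movie"], ["bad", "plot"]]

weights = delta_idf_weights(pos_docs, neg_docs)
features = featurize(["great", "great", "movie"], weights)
```

With this sign convention, ngrams concentrated in the positive set receive negative weights, ngrams concentrated in the negative set receive positive weights, and ngrams spread evenly across both classes score zero. The resulting feature dictionaries could then be vectorized and passed to a linear classifier such as a support vector machine.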


Published In

CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management
November 2009
2162 pages
ISBN:9781605585123
DOI:10.1145/1645953

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. sentiment analysis
  2. support vector machine
  3. svm
  4. text classification

Qualifiers

  • Poster

Conference

CIKM '09
Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
