
Tackling concept drift by temporal inductive transfer

Published: 06 August 2006

Abstract

Machine learning is the mainstay for text classification. However, even the most successful techniques are defeated by many real-world applications that have a strong time-varying component. To advance research on this challenging but important problem, we promote a natural experimental framework, the Daily Classification Task, which can be applied to large time-based datasets such as Reuters RCV1. In this paper we dissect concept drift into three main subtypes. We demonstrate via a novel visualization that the recurrent themes subtype is present in RCV1. This understanding led us to develop a new learning model that transfers induced knowledge through time to benefit future classifier learning tasks. The method avoids two main problems with existing work in inductive transfer: scalability and the risk of negative transfer. In empirical tests, it consistently showed an improvement of more than 10 F-measure points for each of four Reuters categories tested.
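The abstract does not spell out how knowledge is carried forward in time, so the following is only a minimal sketch of one plausible reading: each day a new classifier is trained on that day's labeled examples, and the scores of classifiers from earlier days are appended to every feature vector as extra inputs. The mechanism itself, the names `train_centroid_scorer`, `augment`, and `run_daily_task`, the `window` parameter, and the toy nearest-centroid learner (standing in for a real learner such as a support vector machine) are all assumptions made for illustration, not details taken from the paper.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Scorer = Callable[[Sequence[float]], float]
Day = Tuple[List[Vector], List[int], List[Vector]]  # (X_train, y_train, X_test)


def train_centroid_scorer(X: Sequence[Sequence[float]], y: Sequence[int]) -> Scorer:
    """Toy learner (assumption): positive score means closer to the positive centroid."""
    dim = len(X[0])

    def centroid(rows: List[Sequence[float]]) -> List[float]:
        if not rows:
            return [0.0] * dim
        return [sum(col) / len(rows) for col in zip(*rows)]

    pos = centroid([x for x, label in zip(X, y) if label == 1])
    neg = centroid([x for x, label in zip(X, y) if label != 1])
    w = [p - n for p, n in zip(pos, neg)]
    # The bias places the decision boundary midway between the two centroids.
    bias = -0.5 * sum(wi * (p + n) for wi, p, n in zip(w, pos, neg))
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + bias


def augment(x: Sequence[float], past: Sequence[Scorer]) -> Vector:
    """Append one extra feature per earlier classifier: its score on x."""
    return list(x) + [scorer(x) for scorer in past]


def run_daily_task(days: Sequence[Day], window: int = 3) -> List[List[int]]:
    """Train one classifier per day, reusing up to `window` earlier classifiers."""
    past: List[Scorer] = []  # earlier days' classifiers; each accepts a raw vector
    predictions: List[List[int]] = []
    for X_train, y_train, X_test in days:
        recent = tuple(past[-window:]) if window > 0 else ()
        model = train_centroid_scorer([augment(x, recent) for x in X_train], y_train)

        # Wrap the model so that it can be stored and later called on raw vectors.
        def scorer(x: Sequence[float], model: Scorer = model,
                   recent: Tuple[Scorer, ...] = recent) -> float:
            return model(augment(x, recent))

        predictions.append([1 if scorer(x) > 0 else 0 for x in X_test])
        past.append(scorer)
    return predictions
```

Under these assumptions, bounding the number of reused classifiers with `window` keeps the added cost per day fixed. Because the old classifiers enter only as additional features, the day's learner remains free to give them little weight when they no longer fit the current concept.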

Published In

SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
August 2006
768 pages
ISBN: 1595933697
DOI: 10.1145/1148170
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. concept drift
  2. inductive transfer
  3. machine learning
  4. support vector machine
  5. text classification
  6. time series
  7. topic identification

Qualifiers

  • Article

Conference

SIGIR06: The 29th Annual International SIGIR Conference
August 6 - 11, 2006
Seattle, Washington, USA

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%
