DOI:10.1145/2169095.2169098
research-article

Noise robust detection of the emergence and spread of topics on the web

Published: 17 April 2012

Abstract

Because the same information appears on many Web pages, we often want to know which page discussed it first, or how it has spread across the Web over time. In this paper, we develop two methods: one for detecting the first page that discussed the given information, and one for generating a graph showing how the number of pages discussing it has changed over time. To extract such information, we need to determine which pages discuss the given topic and when these pages were created. For the former step, we design a metric for estimating the inclusion degree between the information and a page. For the latter step, we develop a technique for extracting creation timestamps from web pages. Although timestamp extraction is a crucial component of temporal Web analysis, no prior research has shown in detail how to do it. Both steps are, however, still error-prone. To improve noise elimination, we examine not only the properties of each page but also the temporal relationships between pages. If the temporal relationship between a candidate page and other pages is unlikely under typical patterns of information spread on the Web, we eliminate the candidate page as noise. Our experimental results show that our methods achieve high precision and are suitable for practical use.
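The pipeline described in the abstract (decide which pages discuss the topic, date each page, then report the earliest page and a timeline of page counts) can be illustrated with a minimal sketch. This is not the authors' method: the term-overlap score standing in for the inclusion degree, the `YYYY-MM-DD` date pattern, the 0.8 threshold, and all function names are assumptions made for illustration, and the temporal-relationship noise filter is omitted.

```python
import re
from collections import Counter
from datetime import date
from itertools import accumulate

# Assumption: creation dates appear as YYYY-MM-DD or YYYY/MM/DD in the page text.
DATE_RE = re.compile(r"\b(\d{4})[-/](\d{1,2})[-/](\d{1,2})\b")


def extract_timestamp(text):
    """Return the first valid calendar date found in the text, or None."""
    for match in DATE_RE.finditer(text):
        try:
            return date(*map(int, match.groups()))
        except ValueError:
            continue  # e.g. month 13: not a real date, try the next match
    return None


def inclusion_degree(topic, text):
    """Hypothetical stand-in metric: share of distinct topic terms found in the page."""
    topic_terms = set(re.findall(r"\w+", topic.lower()))
    page_terms = set(re.findall(r"\w+", text.lower()))
    if not topic_terms:
        return 0.0
    return len(topic_terms & page_terms) / len(topic_terms)


def spread_timeline(topic, pages, threshold=0.8):
    """pages maps url -> text. Returns (earliest_url, [(date, cumulative_count), ...])."""
    dated = []
    for url, text in pages.items():
        timestamp = extract_timestamp(text)
        if timestamp is not None and inclusion_degree(topic, text) >= threshold:
            dated.append((timestamp, url))
    if not dated:
        return None, []
    dated.sort()  # earliest date first; ties broken by URL
    per_day = Counter(timestamp for timestamp, _ in dated)
    days = sorted(per_day)
    totals = accumulate(per_day[day] for day in days)
    return dated[0][1], list(zip(days, totals))


pages = {
    "a": "Posted 2012-03-01. Solar eclipse visible over Tokyo.",
    "b": "2012/03/03 Tokyo saw a solar eclipse today, visible everywhere",
    "c": "2012-03-03: eclipse solar visible Tokyo",
    "d": "2012-02-01 Unrelated cooking recipe",
    "e": "Solar eclipse visible over Tokyo, no date here",
}
first_page, timeline = spread_timeline("solar eclipse visible Tokyo", pages)
```

In this toy input, page `d` is dropped because it does not cover the topic terms and page `e` is dropped because no date can be extracted. A real system would still need the error handling the abstract alludes to, since both the relevance score and the extracted dates can be wrong.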

References

[1]
J. Allan et al. Topic detection and tracking pilot study final report. In DARPA Broadcast News Transcription and Understanding Workshop, p.194--218, 1998.
[2]
J. Allan, C. Wade, A. Bolivar. Retrieval and novelty detection at the sentence level. In SIGIR, p.314--321, 2003.
[3]
E. Amitay et al. Trend detection through temporal link analysis. JASIST, 55(14):1270--1281, 2004.
[4]
N. Balasubramanian, J. Allan, W. B. Croft. A comparison of sentence retrieval techniques. In SIGIR, p.813--814, 2007.
[5]
M. Bendersky, W. B. Croft. Finding text reuse on the Web. In WSDM, p.262--271, 2009.
[6]
Y. Bernstein, J. Zobel. A scalable system for identifying co-derivative documents. In SPIRE, p.55--67, 2004.
[7]
Internet Archive. http://web.archive.org/.
[8]
A. Jatowt, Y. Kawai, K. Tanaka. Detecting age of page content. In WIDM, p.137--144, 2007.
[9]
X. Jin, S. Spangler, R. Ma, J. Han. Topic initiator detection on the World Wide Web. In WWW, p.481--490, 2010.
[10]
D. Metzler et al. Similarity measures for tracking information flow. In CIKM, p.517--524, 2005.
[11]
D. Metzler, W. B. Croft. A Markov random field model for term dependencies. In SIGIR, p.472--479, 2005.
[12]
S. Nunes, C. Ribeiro, G. David. Using neighbors to date web documents. In WIDM, p.129--136, 2007.
[13]
M. Oita, P. Senellart. Deriving dynamics of Web pages: A survey. In TWAW, p.25--32, 2011.
[14]
I. Soboroff. Overview of the TREC 2004 novelty track. In TREC, 2004.
[15]
M. Toyoda, M. Kitsuregawa. What's really new on the web?: identifying new pages from a series of unstable web snapshots. In WWW, p.233--241, 2006.
[16]
https://github.com/kkjk21/Timestamp-Extractor.

    Published In

    TempWeb '12: Proceedings of the 2nd Temporal Web Analytics Workshop
    April 2012
    55 pages
    ISBN:9781450311885
    DOI:10.1145/2169095
    Publisher

    Association for Computing Machinery

    New York, NY, United States
