
The opposite of smoothing: a language model approach to ranking query-specific document clusters

Authors:

Eyal Krikon

Journal of Artificial Intelligence Research, Volume 41, Issue 2

Pages 367 - 395

Published: 01 May 2011

Abstract

Exploiting information induced from (query-specific) clustering of top-retrieved documents has long been proposed as a means for improving precision at the very top ranks of the returned results. We present a novel language model approach to ranking query-specific clusters by the presumed percentage of relevant documents that they contain. While most previous cluster ranking approaches focus on the cluster as a whole, our model also utilizes information induced from documents associated with the cluster. Our model substantially outperforms previous approaches for identifying clusters containing a high relevant-document percentage. Furthermore, using the model to produce a document ranking yields precision-at-top-ranks performance that is consistently better than that of the initial ranking upon which clustering is performed. The performance also compares favorably with that of a state-of-the-art pseudo-feedback-based retrieval method.

Index Terms

The opposite of smoothing: a language model approach to ranking query-specific document clusters
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
  2. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering
    2. Decision support systems
      1. Expert systems

Information & Contributors

Information

Published In

Journal of Artificial Intelligence Research Volume 41, Issue 2

May 2011

544 pages

ISSN:1076-9757

Publisher

AI Access Foundation

El Segundo, CA, United States

Qualifiers

Article

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 11
Total Downloads: 34
Downloads (last 12 months): 0
Downloads (last 6 weeks): 0

Reflects downloads up to 17 Jan 2025

Citations

Cited By

Markovskiy E, Raiber F, Sabach S, Kurland O, Amigo E, Castells P, Gonzalo J, Carterette B, Culpepper J, Kazai G (2022). From Cluster Ranking to Document Ranking. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2137-2141. DOI: 10.1145/3477495.3531819. Online publication date: 6-Jul-2022.
https://dl.acm.org/doi/10.1145/3477495.3531819

Vikraman L, Montazeralghaem A, Hashemi H, Croft W, Allan J, Hasibi F, Fang Y, Aizawa A (2021). Passage Similarity and Diversification in Non-factoid Question Answering. Proceedings of the 2021 ACM SIGIR International Conference on Theory of Information Retrieval, pp. 271-280. DOI: 10.1145/3471158.3472249. Online publication date: 11-Jul-2021.
https://dl.acm.org/doi/10.1145/3471158.3472249

Bernstein K, Raiber F, Kurland O, Culpepper J, Balog K, Setty V, Lioma C, Liu Y, Zhang M, Berberich K (2020). Cluster-Based Document Retrieval with Multiple Queries. Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval, pp. 33-40. DOI: 10.1145/3409256.3409825. Online publication date: 14-Sep-2020.
https://dl.acm.org/doi/10.1145/3409256.3409825

Alma’aitah W, Talib A, Osman M (2020). Opportunities and challenges in enhancing access to metadata of cultural heritage collections: a survey. Artificial Intelligence Review, 53:5, pp. 3621-3646. DOI: 10.1007/s10462-019-09773-w. Online publication date: 1-Jun-2020.
https://dl.acm.org/doi/10.1007/s10462-019-09773-w

Sheetrit E, Kurland O, Zhu W, Tao D, Cheng X, Cui P, Rundensteiner E, Carmel D, He Q, Xu Yu J (2019). Cluster-Based Focused Retrieval. Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 2305-2308. DOI: 10.1145/3357384.3358087. Online publication date: 3-Nov-2019.
https://dl.acm.org/doi/10.1145/3357384.3358087

Bouadjenek M, Sanner S, Azzopardi L, Halvey M, Ruthven I, Joho H, Murdock V, Qvarfordt P (2019). Relevance-driven Clustering for Visual Information Retrieval on Twitter. Proceedings of the 2019 Conference on Human Information Interaction and Retrieval, pp. 349-353. DOI: 10.1145/3295750.3298914. Online publication date: 8-Mar-2019.
https://dl.acm.org/doi/10.1145/3295750.3298914

Jin X, Agun D, Yang T, Wu Q, Shen Y, Zhao S, Mukhopadhyay S, Zhai C, Bertino E, Crestani F, Mostafa J, Tang J, Si L, Zhou X, Chang Y, Li Y, Sondhi P (2016). Hybrid Indexing for Versioned Document Search with Cluster-based Retrieval. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 377-386. DOI: 10.1145/2983323.2983733. Online publication date: 24-Oct-2016.
https://dl.acm.org/doi/10.1145/2983323.2983733

Raviv H, Kurland O, Carmel D, Perego R, Sebastiani F, Aslam J, Ruthven I, Zobel J (2016). Document Retrieval Using Entity-Based Language Models. Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 65-74. DOI: 10.1145/2911451.2911508. Online publication date: 7-Jul-2016.
https://dl.acm.org/doi/10.1145/2911451.2911508

Raiber F, Kurland O, Geva S, Trotman A, Bruza P, Clarke C, Järvelin K (2014). The correlation between cluster hypothesis tests and the effectiveness of cluster-based retrieval. Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, pp. 1155-1158. DOI: 10.1145/2600428.2609533. Online publication date: 3-Jul-2014.
https://dl.acm.org/doi/10.1145/2600428.2609533

Kurland O (2014). The Cluster Hypothesis in Information Retrieval. Proceedings of the 36th European Conference on IR Research on Advances in Information Retrieval - Volume 8416, pp. 823-826. DOI: 10.1007/978-3-319-06028-6_105. Online publication date: 13-Apr-2014.
https://dl.acm.org/doi/10.1007/978-3-319-06028-6_105
