The opposite of smoothing: a language model approach to ranking query-specific document clusters

Published: 01 May 2011

Abstract

Exploiting information induced from (query-specific) clustering of top-retrieved documents has long been proposed as a means of improving precision at the very top ranks of the returned results. We present a novel language model approach to ranking query-specific clusters by the presumed percentage of relevant documents that they contain. While most previous cluster ranking approaches focus on the cluster as a whole, our model also uses information induced from the documents associated with the cluster. Our model substantially outperforms previous approaches for identifying clusters containing a high percentage of relevant documents. Furthermore, using the model to produce a document ranking yields precision-at-top-ranks performance that is consistently better than that of the initial ranking upon which clustering is performed. The performance also compares favorably with that of a state-of-the-art pseudo-feedback-based retrieval method.
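The abstract does not spell out the scoring function, so the following is only a minimal illustrative sketch of the general idea it describes: scoring a cluster using both the cluster as a whole and its member documents. It is not the model from the article. The Dirichlet-smoothed unigram models, the query-likelihood scoring, the interpolation weight `lam`, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def make_lm(tokens, background, mu=10.0):
    """Build a Dirichlet-smoothed unigram model p(term | tokens) (assumed estimator)."""
    counts = Counter(tokens)
    bg_total = sum(background.values())
    vocab = len(background)

    def prob(term):
        # Add-one background estimate keeps unseen terms above zero probability.
        p_bg = (background.get(term, 0) + 1) / (bg_total + vocab + 1)
        return (counts[term] + mu * p_bg) / (len(tokens) + mu)

    return prob


def query_log_likelihood(query, prob):
    """Log-probability of generating every query term from the given model."""
    return sum(math.log(prob(term)) for term in query)


def score_cluster(query, cluster, background, lam=0.5, mu=10.0):
    """Score a non-empty cluster (a list of token lists).

    Hypothetical scoring rule: interpolate the query likelihood of the merged
    cluster text with the mean query likelihood of the member documents.
    """
    merged = [term for doc in cluster for term in doc]
    whole = query_log_likelihood(query, make_lm(merged, background, mu))
    per_doc = [query_log_likelihood(query, make_lm(doc, background, mu))
               for doc in cluster]
    members = sum(per_doc) / len(per_doc)
    return lam * whole + (1.0 - lam) * members


def rank_clusters(query, clusters, background, lam=0.5, mu=10.0):
    """Return (cluster name, score) pairs sorted from best to worst score."""
    scored = [(name, score_cluster(query, docs, background, lam, mu))
              for name, docs in clusters.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data, invented for this sketch only.
clusters = {
    "retrieval": [["cluster", "ranking", "language", "model"],
                  ["query", "cluster", "ranking", "documents"]],
    "cooking": [["pasta", "sauce", "tomato", "basil"],
                ["bread", "oven", "flour", "yeast"]],
}
background = Counter(t for docs in clusters.values() for doc in docs for t in doc)
ranking = rank_clusters(["cluster", "ranking"], clusters, background)
```

Setting `lam` to 1.0 reduces this sketch to scoring the cluster as a whole, while lower values give more weight to the individual documents; a document ranking could then be derived by listing the documents of higher-scored clusters first.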



Published In

Journal of Artificial Intelligence Research, Volume 41, Issue 2
May 2011
544 pages

Publisher

AI Access Foundation

El Segundo, CA, United States

Publication History

Published: 01 May 2011
Published in JAIR Volume 41, Issue 2

Qualifiers

  • Article

Cited By

  • (2022) From Cluster Ranking to Document Ranking. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2137-2141. doi:10.1145/3477495.3531819. Online publication date: 6-Jul-2022
  • (2021) Passage Similarity and Diversification in Non-factoid Question Answering. Proceedings of the 2021 ACM SIGIR International Conference on Theory of Information Retrieval, pp. 271-280. doi:10.1145/3471158.3472249. Online publication date: 11-Jul-2021
  • (2020) Cluster-Based Document Retrieval with Multiple Queries. Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval, pp. 33-40. doi:10.1145/3409256.3409825. Online publication date: 14-Sep-2020
  • (2020) Opportunities and challenges in enhancing access to metadata of cultural heritage collections: a survey. Artificial Intelligence Review, 53:5, pp. 3621-3646. doi:10.1007/s10462-019-09773-w. Online publication date: 1-Jun-2020
  • (2019) Cluster-Based Focused Retrieval. Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 2305-2308. doi:10.1145/3357384.3358087. Online publication date: 3-Nov-2019
  • (2019) Relevance-driven Clustering for Visual Information Retrieval on Twitter. Proceedings of the 2019 Conference on Human Information Interaction and Retrieval, pp. 349-353. doi:10.1145/3295750.3298914. Online publication date: 8-Mar-2019
  • (2016) Hybrid Indexing for Versioned Document Search with Cluster-based Retrieval. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 377-386. doi:10.1145/2983323.2983733. Online publication date: 24-Oct-2016
  • (2016) Document Retrieval Using Entity-Based Language Models. Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 65-74. doi:10.1145/2911451.2911508. Online publication date: 7-Jul-2016
  • (2014) The correlation between cluster hypothesis tests and the effectiveness of cluster-based retrieval. Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, pp. 1155-1158. doi:10.1145/2600428.2609533. Online publication date: 3-Jul-2014
  • (2014) The Cluster Hypothesis in Information Retrieval. Proceedings of the 36th European Conference on IR Research on Advances in Information Retrieval - Volume 8416, pp. 823-826. doi:10.1007/978-3-319-06028-6_105. Online publication date: 13-Apr-2014
