DOI: 10.1145/2911451.2911496
research-article

That's Not My Question: Learning to Weight Unmatched Terms in CQA Vertical Search

Published: 07 July 2016

Abstract

A fundamental task in Information Retrieval (IR) is term weighting. Early IR theory considered the presence or absence of every term in the lexicon for ranking, and therefore needed to weight them all. Yet as lexicons grew and models became too complex, common weighting models came to aggregate only the weights of the query terms that are matched in candidate documents. Thus, the contribution of unmatched terms in these models is considered only indirectly, for example through probability smoothing with the corpus distribution or weight normalization by document length. In this work we propose a novel term weighting model that directly assesses the weights of unmatched terms, and show its benefits. Specifically, we propose a Learning to Rank framework in which features corresponding to matched terms are "mirrored" by similar features that account only for unmatched terms. The relative importance of each feature is learned from a clickthrough query log. As a test case, we consider vertical search in Community-based Question Answering (CQA) sites from Web queries. Queries that result in viewing CQA content often contain fine-grained information needs and benefit more from unmatched term weighting. We assess our model via both manual evaluation and automatic evaluation over a clickthrough log. Our results show consistent improvement in retrieval when unmatched information is taken into account. This holds both when only identical terms are considered matched and when related terms are matched via distributional similarity.
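
The "mirroring" idea in the abstract can be illustrated with a minimal sketch. It is not the paper's feature set or learner: the IDF formula, the three feature names, the document frequencies, and the weights below are all assumptions made for illustration, and in the paper the feature weights are learned from a clickthrough log rather than set by hand. The sketch only shows how a matched-term feature can be paired with unmatched-term counterparts, so that two candidates with identical matched terms can still be separated.

```python
import math


def idf(term, doc_freq, num_docs):
    """Smoothed inverse document frequency (an assumed weighting scheme)."""
    return math.log((num_docs + 1) / (doc_freq.get(term, 0) + 1)) + 1.0


def mirrored_features(query_terms, doc_terms, doc_freq, num_docs):
    """Build one matched-term feature and its unmatched-term mirrors.

    Only identical terms count as matched here; matching related terms
    via distributional similarity is outside this sketch.
    """
    query, doc = set(query_terms), set(doc_terms)

    def idf_sum(terms):
        # Sort for a deterministic summation order.
        return sum(idf(t, doc_freq, num_docs) for t in sorted(terms))

    return {
        "matched_idf_sum": idf_sum(query & doc),
        "unmatched_query_idf_sum": idf_sum(query - doc),
        "unmatched_doc_idf_sum": idf_sum(doc - query),
    }


def score(features, weights):
    """Linear ranking score over the feature dictionary."""
    return sum(weights[name] * value for name, value in features.items())


if __name__ == "__main__":
    # Hypothetical corpus statistics and hand-set weights (illustrative only).
    doc_freq = {"fix": 120, "flat": 40, "tire": 30,
                "on": 900, "a": 950, "tractor": 5}
    weights = {"matched_idf_sum": 1.0,
               "unmatched_query_idf_sum": -0.5,
               "unmatched_doc_idf_sum": -0.2}
    query = "fix flat tire".split()
    for title in ("fix flat tire", "fix flat tire on a tractor"):
        feats = mirrored_features(query, title.split(), doc_freq, 1000)
        print(title, "->", round(score(feats, weights), 3))
```

Both candidate titles match every query term, so a model that aggregates only matched-term weights would tie them. The unmatched-document feature is what penalizes the second title for the rare extra term "tractor", which is the kind of "that's not my question" signal the abstract describes.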



Published In

SIGIR '16: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval
July 2016
1296 pages
ISBN: 9781450340694
DOI: 10.1145/2911451
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. community-based question answering
  2. document ranking
  3. unmatched terms

Qualifiers

  • Research-article

Conference

SIGIR '16

Acceptance Rates

SIGIR '16 Paper Acceptance Rate 62 of 341 submissions, 18%;
Overall Acceptance Rate 792 of 3,983 submissions, 20%
