research-article

Bayesian Browsing Model: Exact Inference of Document Relevance from Petabyte-Scale Data

Authors:

Christos FaloutsosAuthors Info & Claims

ACM Transactions on Knowledge Discovery from Data (TKDD), Volume 4, Issue 4

Article No.: 19, Pages 1 - 26

https://doi.org/10.1145/1857947.1857951

Published: 01 October 2010 Publication History

Abstract

A fundamental challenge in utilizing Web search click data is to infer user-perceived relevance from the search log. Not only is the inference a difficult problem involving statistical reasonings but the bulky size, together with the ever-increasing nature, of the log data imposes extra requirements on scalability. In this paper, we propose the Bayesian Browsing Model (BBM), which performs exact inference of the document relevance, only requires a single pass of the data (i.e., the optimal scalability), and is shown effective.

We present two sets of experiments to evaluate the model effectiveness and scalability. On the first set of over 50 million search instances of 1.1 million distinct queries, BBM outperforms the state-of-the-art competitor by 29.2% in log-likelihood while being 57 times faster. On the second click log set, spanning a quarter of petabyte, we showcase the scalability of BBM: we implemented it on a commercial MapReduce cluster, and it took only 3 hours to compute the relevance for 1.15 billion distinct query-URL pairs.

References

[1]

Aggarwal, G., Feldman, J., Muthukrishnan, S., and Pál, M. 2008. Sponsored search auctions with markovian users. In Proceedings of the 4th International Workshop on Internet and Network Economics (WINE’08). Springer-Verlag, Berlin, 621--628.

Digital Library

[2]

Agichtein, E., Brill, E., and Dumais, S. 2006a. Improving Web search ranking by incorporating user behavior information. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’06). ACM, New York, 19--26.

Digital Library

[3]

Agichtein, E., Brill, E., Dumais, S., and Ragno, R. 2006b. Learning user interaction models for predicting Web search result preferences. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’06). ACM, New York, 3--10.

Digital Library

[4]

Babcock, B., Babu, S., Datar, M., Motwani, R., and Widom, J. 2002. Models and issues in data stream systems. In Proceedings of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS’02). ACM, New York, 1--16.

Digital Library

[5]

Baeza-Yates, R. 2005. Applications of Web query mining. In Advances in Information Retrieval, 7--22.

Digital Library

[6]

Baeza-Yates, R., Hurtado, C., and Mendoza, M. 2004. Current trends in database technology - edbt 2004 workshop. In Proceedings of the EDBT Workshops on Clustering Information Over the Web. 588--596.

[7]

Bilenko, M. and White, R. W. 2008. Mining the search trails of surfing crowds: Identifying relevant Web sites from user activity. In Proceedings of the 17th International Conference on World Wide Web (WWW’08). ACM, New York, 51--60.

Digital Library

[8]

Bishop, C. M. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin.

Digital Library

[9]

Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., and Hullender, G. 2005. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning (ICML’05). ACM, New York, 89--96.

Digital Library

[10]

Chapelle, O., Metlzer, D., Zhang, Y., and Grinspan, P. 2009. Expected reciprocal rank for graded relevance. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM’09). ACM, New York, 621--630.

Digital Library

[11]

Chapelle, O. and Zhang, Y. 2009. A dynamic Bayesian network click model for Web search ranking. In Proceedings of the 18th International Conference on World Wide Web (WWW’09). ACM, New York, 1--10.

Digital Library

[12]

Cheng, H. and Cantu-Paz, E. 2010. Personalized click prediction in sponsored search. In Proceedings of the 3rd ACM International Conference on Web Search and Data Mining (WSDM’10).

Digital Library

[13]

Craswell, N., Zoeter, O., Taylor, M., and Ramsey, B. 2008. An experimental comparison of click position-bias models. In Proceedings of the International Conference on Web Search and Web Data Mining (WSDM’08). ACM, New York, 87--94.

Digital Library

[14]

Dean, J. and Ghemawat, S. 2004. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation (OSDI’’04). USENIX Association, Berkeley, CA, 10--10.

Digital Library

[15]

Domingos, P. and Hulten, G. 2000. Mining high-speed data streams. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’00). ACM, New York, 71--80.

Digital Library

[16]

Dupret, G. E. and Piwowarski, B. 2008. A user browsing model to predict search engine click data from past observations. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’08). ACM, New York, 331--338.

Digital Library

[17]

Freund, Y., Iyer, R., Schapire, R. E., and Singer, Y. 2003. An efficient boosting algorithm for combining preferences. J. Mach. Learn. Resear. 4, 933--969.

Digital Library

[18]

Gaber, M. M., Zaslavsky, A., and Krishnaswamy, S. 2005. Mining data streams: A review. SIGMOD Rec. 34, 2, 18--26.

Digital Library

[19]

Gao, J., Yuan, W., Li, X., Deng, K., and Nie, J.-Y. 2009. Smoothing clickthrough data for Web search ranking. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’09). ACM, New York, 355--362.

Digital Library

[20]

Giannella, C., Han, J., Pei, J., Yan, X., and Yu, P. S. 2003. Mining frequent patterns in data streams at multiple time granularities. In Data Mining: Next Generation Challenges and Future Directions.

[21]

Guha, S., Meyerson, A., Mishra, N., and Motwani, R. 2003. Clustering data streams: Theory and practice. IEEE Trans. Knowl. Data Engin. 15, 3, 515--528.

Digital Library

[22]

Guo, F., Li, L., and Faloutsos, C. 2009a. Tailoring click models to user goals. In Proceedings of the Workshop on Web Search Click Data (WSCD’09). ACM, New York, 88--92.

Digital Library

[23]

Guo, F., Liu, C., Kannan, A., Minka, T., Taylor, M., Wang, Y.-M., and Faloutsos, C. 2009b. Click chain model in Web search. In Proceedings of the 18th International Conference on World Wide Web (WWW’09). ACM, New York, 11--20.

Digital Library

[24]

Guo, F., Liu, C., and Wang, Y.-M. 2009c. Efficient multiple-click models in Web search. In Proceedings of the 2nd ACM International Conference on Web Search and Data Mining (WSDM’09). ACM, New York, 124--131.

Digital Library

[25]

Joachims, T. 2002. Optimizing search engines using clickthrough data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’02). ACM, New York, 133--142.

Digital Library

[26]

Joachims, T., Granka, L., Pan, B., Hembrooke, H., and Gay, G. 2005. Accurately interpreting clickthrough data as implicit feedback. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’05). ACM, New York, 154--161.

Digital Library

[27]

Joachims, T., Granka, L., Pan, B., Hembrooke, H., Radlinski, F., and Gay, G. 2007. Evaluating the accuracy of implicit feedback from clicks and query reformulations in Web search. ACM Trans. Inform. Syst. 25, 2, 7.

Digital Library

[28]

Kempe, D. and Mahdian, M. 2008. A cascade model for externalities in sponsored search. In Proceedings of the 4th International Workshop on Internet and Network Economics (WINE’08). Springer-Verlag, Berlin, 585--596.

Digital Library

[29]

Liu, C., Li, M., and Wang, Y.-M. 2009. Post-rank reordering: Resolving preference misalignments between search engines and end users. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM’09). ACM, New York, 641--650.

Digital Library

[30]

Liu, T.-Y. 2009. Learning to rank for information retrieval. Found. Trends Inform. Retr. 3, 3, 225--331.

Digital Library

[31]

Minka, T. 2001. Expectation propagation for approximate Bayesian inference. In Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence (UAI’01). Morgan Kaufmann Publishers Inc., San Francisco, CA, 362--369.

Digital Library

[32]

Murphy, K. P. 2001. An introduction to graphical models. www.cs.iastate.edu/~honavar/bayes0.pdf.

[33]

Radlinski, F. and Joachims, T. 2005. Query chains: Learning to rank from implicit feedback. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (KDD’05). ACM, New York, 239--248.

Digital Library

[34]

Richardson, M., Dominowska, E., and Ragno, R. 2007. Predicting clicks: Estimating the click-through rate for new ads. In Proceedings of the 16th International Conference on World Wide Web (WWW’07). ACM, New York, 521--530.

Digital Library

[35]

Shokouhi, M., Scholer, F., and Turpin, A. 2008. Investigating the effectiveness of clickthrough data for document reordering. In Proceedings of the 30th European Conference on Information Retrieval (ECIR’08). 591--595.

Digital Library

[36]

Silverstein, C., Marais, H., Henzinger, M., and Moricz, M. 1999. Analysis of a very large Web search engine query log. SIGIR Forum 33, 1, 6--12.

Digital Library

[37]

Tsai, M.-F., Liu, T.-Y., Qin, T., Chen, H.-H., and Ma, W.-Y. 2007. FRank: A ranking method with fidelity loss. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, 383--390.

Digital Library

[38]

Wainwright, M. J. and Jordan, M. I. 2008. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn. 1, 1-2, 1--305.

Digital Library

[39]

Wang, K., Walker, T., and Zheng, Z. 2009. Pskip: Estimating relevance ranking quality from Web search clickthrough data. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’09). ACM, New York, 1355--1364.

Digital Library

[40]

Wedig, S. and Madani, O. 2006. A large-scale analysis of query logs for assessing personalization opportunities. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’06). ACM, New York, 742--747.

Digital Library

[41]

Xue, G.-R., Zeng, H.-J., Chen, Z., Yu, Y., Ma, W.-Y., Xi, W., and Fan, W. 2004. Optimizing Web search using Web click-through data. In Proceedings of the 13th ACM International Conference on Information and Knowledge Management (CIKM’04). ACM, New York, 118--126.

Digital Library

[42]

Yedidia, J. S., Freeman, W. T., and Weiss, Y. 2005. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. Inform. Theor. 51, 7, 2282--2312.

Digital Library

[43]

Zhang, Z. and Nasraoui, O. 2006. Mining search engine query logs for query recommendation. In Proceedings of the 15th International Conference on World Wide Web (WWW’06). ACM, New York, 1039--1040.

Digital Library

[44]

Zhu, Z. A., Chen, W., Minka, T., Zhu, C., and Chen, Z. 2010. A novel click model and its applications to online advertising. In Proceedings of the 3rd ACM International Conference on Web Search and Data Mining (WSDM’10).

Digital Library

Cited By

Hsu CChen TChen H(2022)Experience: Analyzing Missing Web Page Visits and Unintentional Web Page Visits from the Client-side Web LogsJournal of Data and Information Quality10.1145/349039214:2(1-17)Online publication date: 23-Mar-2022
https://dl.acm.org/doi/10.1145/3490392
Yu HJatowt ABlanco RJose JZhou KAzzopardi LHalvey MRuthven IJoho HMurdock VQvarfordt P(2019)A Rank-biased Neural Network Model for Click ModelingProceedings of the 2019 Conference on Human Information Interaction and Retrieval10.1145/3295750.3298920(183-191)Online publication date: 8-Mar-2019
https://dl.acm.org/doi/10.1145/3295750.3298920
Khan IMandal AKumar P(2019)Contextual Price Features for e-Commerce Search Ranking2019 IEEE International Conference on Big Data (Big Data)10.1109/BigData47090.2019.9006033(1851-1857)Online publication date: Dec-2019
https://doi.org/10.1109/BigData47090.2019.9006033
Show More Cited By

Index Terms

Bayesian Browsing Model: Exact Inference of Document Relevance from Petabyte-Scale Data
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking

Recommendations

Click chain model in web search
WWW '09: Proceedings of the 18th international conference on World wide web

Given a terabyte click log, can we build an efficient and effective click model? It is commonly believed that web search click logs are a gold mine for search business, because they reflect users' preference over web documents presented by the search ...
BBM: bayesian browsing model from petabyte-scale data
KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining

Given a quarter of petabyte click log data, how can we estimate the relevance of each URL for a given query? In this paper, we propose the Bayesian Browsing Model (BBM), a new modeling technique with following advantages: (a) it does exact inference; (b)...
Efficient multiple-click models in web search
WSDM '09: Proceedings of the Second ACM International Conference on Web Search and Data Mining

Many tasks that leverage web search users' implicit feedback rely on a proper and unbiased interpretation of user clicks. Previous eye-tracking experiments and studies on explaining position-bias of user clicks provide a spectrum of hypotheses and ...

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Knowledge Discovery from Data

ACM Transactions on Knowledge Discovery from Data Volume 4, Issue 4

October 2010

121 pages

ISSN:1556-4681

EISSN:1556-472X

DOI:10.1145/1857947

Issue’s Table of Contents

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 October 2010

Accepted: 01 June 2010

Received: 01 January 2010

Published in TKDD Volume 4, Issue 4

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Funding Sources

Division of Biological Infrastructure

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

17
Total Citations
View Citations
642
Total Downloads

Downloads (Last 12 months)4
Downloads (Last 6 weeks)1

Reflects downloads up to 09 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Hsu CChen TChen H(2022)Experience: Analyzing Missing Web Page Visits and Unintentional Web Page Visits from the Client-side Web LogsJournal of Data and Information Quality10.1145/349039214:2(1-17)Online publication date: 23-Mar-2022
https://dl.acm.org/doi/10.1145/3490392
Yu HJatowt ABlanco RJose JZhou KAzzopardi LHalvey MRuthven IJoho HMurdock VQvarfordt P(2019)A Rank-biased Neural Network Model for Click ModelingProceedings of the 2019 Conference on Human Information Interaction and Retrieval10.1145/3295750.3298920(183-191)Online publication date: 8-Mar-2019
https://dl.acm.org/doi/10.1145/3295750.3298920
Khan IMandal AKumar P(2019)Contextual Price Features for e-Commerce Search Ranking2019 IEEE International Conference on Big Data (Big Data)10.1109/BigData47090.2019.9006033(1851-1857)Online publication date: Dec-2019
https://doi.org/10.1109/BigData47090.2019.9006033
Bell ASenthil Kumar PMiranda DCuzzocrea AAllan JPaton NSrivastava DAgrawal RBroder AZaki MCandan SLabrinidis ASchuster AWang H(2018)The Title Says It AllProceedings of the 27th ACM International Conference on Information and Knowledge Management10.1145/3269206.3272033(2233-2241)Online publication date: 17-Oct-2018
https://dl.acm.org/doi/10.1145/3269206.3272033
Markov IBorisov Ade Rijke MLim EWinslett MSanderson MFu ASun JCulpepper SLo EHo JDonato DAgrawal RZheng YCastillo CSun ATseng VLi C(2017)Online Expectation-Maximization for Click ModelsProceedings of the 2017 ACM on Conference on Information and Knowledge Management10.1145/3132847.3133053(2195-2198)Online publication date: 6-Nov-2017
https://dl.acm.org/doi/10.1145/3132847.3133053
Li LDeng HDong AChang YBaeza-Yates RZha HBarrett RCummings RAgichtein EGabrilovich E(2017)Exploring Query Auto-Completion and Click Logs for Contextual-Aware Web Search and Query SuggestionProceedings of the 26th International Conference on World Wide Web10.1145/3038912.3052593(539-548)Online publication date: 3-Apr-2017
https://dl.acm.org/doi/10.1145/3038912.3052593
Li LDeng HChen JChang Yde Rijke MShokouhi MTomkins AZhang M(2017)Learning Parametric Models for Context-Aware Query Auto-Completion via Hawkes ProcessesProceedings of the Tenth ACM International Conference on Web Search and Data Mining10.1145/3018661.3018698(131-139)Online publication date: 2-Feb-2017
https://dl.acm.org/doi/10.1145/3018661.3018698
Zhang PYu QHou YSong DLi JHu B(2017)A Distribution Separation Method Using Irrelevance Feedback Data for Information RetrievalACM Transactions on Intelligent Systems and Technology10.1145/29946088:3(1-26)Online publication date: 12-Jan-2017
https://dl.acm.org/doi/10.1145/2994608
Khan IKumar PMiranda DGoldberg D(2017)What is skipped: Finding desirable items in e-commerce search by discovering the worst title tokens2017 IEEE International Conference on Big Data (Big Data)10.1109/BigData.2017.8258078(1447-1456)Online publication date: Dec-2017
https://doi.org/10.1109/BigData.2017.8258078
Yu HJatowt ABlanco RJoho HJose J(2017)Decoding multi-click search behavior based on marginal utilityInformation Retrieval10.1007/s10791-016-9289-z20:1(25-52)Online publication date: 1-Feb-2017
https://dl.acm.org/doi/10.1007/s10791-016-9289-z
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents