skip to main content
research-article

Bayesian Browsing Model: Exact Inference of Document Relevance from Petabyte-Scale Data

Published: 01 October 2010 Publication History

Abstract

A fundamental challenge in utilizing Web search click data is to infer user-perceived relevance from the search log. Not only is the inference a difficult problem involving statistical reasonings but the bulky size, together with the ever-increasing nature, of the log data imposes extra requirements on scalability. In this paper, we propose the Bayesian Browsing Model (BBM), which performs exact inference of the document relevance, only requires a single pass of the data (i.e., the optimal scalability), and is shown effective.
We present two sets of experiments to evaluate the model effectiveness and scalability. On the first set of over 50 million search instances of 1.1 million distinct queries, BBM outperforms the state-of-the-art competitor by 29.2% in log-likelihood while being 57 times faster. On the second click log set, spanning a quarter of petabyte, we showcase the scalability of BBM: we implemented it on a commercial MapReduce cluster, and it took only 3 hours to compute the relevance for 1.15 billion distinct query-URL pairs.

References

[1]
Aggarwal, G., Feldman, J., Muthukrishnan, S., and Pál, M. 2008. Sponsored search auctions with markovian users. In Proceedings of the 4th International Workshop on Internet and Network Economics (WINE’08). Springer-Verlag, Berlin, 621--628.
[2]
Agichtein, E., Brill, E., and Dumais, S. 2006a. Improving Web search ranking by incorporating user behavior information. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’06). ACM, New York, 19--26.
[3]
Agichtein, E., Brill, E., Dumais, S., and Ragno, R. 2006b. Learning user interaction models for predicting Web search result preferences. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’06). ACM, New York, 3--10.
[4]
Babcock, B., Babu, S., Datar, M., Motwani, R., and Widom, J. 2002. Models and issues in data stream systems. In Proceedings of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS’02). ACM, New York, 1--16.
[5]
Baeza-Yates, R. 2005. Applications of Web query mining. In Advances in Information Retrieval, 7--22.
[6]
Baeza-Yates, R., Hurtado, C., and Mendoza, M. 2004. Current trends in database technology - edbt 2004 workshop. In Proceedings of the EDBT Workshops on Clustering Information Over the Web. 588--596.
[7]
Bilenko, M. and White, R. W. 2008. Mining the search trails of surfing crowds: Identifying relevant Web sites from user activity. In Proceedings of the 17th International Conference on World Wide Web (WWW’08). ACM, New York, 51--60.
[8]
Bishop, C. M. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin.
[9]
Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., and Hullender, G. 2005. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning (ICML’05). ACM, New York, 89--96.
[10]
Chapelle, O., Metlzer, D., Zhang, Y., and Grinspan, P. 2009. Expected reciprocal rank for graded relevance. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM’09). ACM, New York, 621--630.
[11]
Chapelle, O. and Zhang, Y. 2009. A dynamic Bayesian network click model for Web search ranking. In Proceedings of the 18th International Conference on World Wide Web (WWW’09). ACM, New York, 1--10.
[12]
Cheng, H. and Cantu-Paz, E. 2010. Personalized click prediction in sponsored search. In Proceedings of the 3rd ACM International Conference on Web Search and Data Mining (WSDM’10).
[13]
Craswell, N., Zoeter, O., Taylor, M., and Ramsey, B. 2008. An experimental comparison of click position-bias models. In Proceedings of the International Conference on Web Search and Web Data Mining (WSDM’08). ACM, New York, 87--94.
[14]
Dean, J. and Ghemawat, S. 2004. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation (OSDI’’04). USENIX Association, Berkeley, CA, 10--10.
[15]
Domingos, P. and Hulten, G. 2000. Mining high-speed data streams. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’00). ACM, New York, 71--80.
[16]
Dupret, G. E. and Piwowarski, B. 2008. A user browsing model to predict search engine click data from past observations. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’08). ACM, New York, 331--338.
[17]
Freund, Y., Iyer, R., Schapire, R. E., and Singer, Y. 2003. An efficient boosting algorithm for combining preferences. J. Mach. Learn. Resear. 4, 933--969.
[18]
Gaber, M. M., Zaslavsky, A., and Krishnaswamy, S. 2005. Mining data streams: A review. SIGMOD Rec. 34, 2, 18--26.
[19]
Gao, J., Yuan, W., Li, X., Deng, K., and Nie, J.-Y. 2009. Smoothing clickthrough data for Web search ranking. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’09). ACM, New York, 355--362.
[20]
Giannella, C., Han, J., Pei, J., Yan, X., and Yu, P. S. 2003. Mining frequent patterns in data streams at multiple time granularities. In Data Mining: Next Generation Challenges and Future Directions.
[21]
Guha, S., Meyerson, A., Mishra, N., and Motwani, R. 2003. Clustering data streams: Theory and practice. IEEE Trans. Knowl. Data Engin. 15, 3, 515--528.
[22]
Guo, F., Li, L., and Faloutsos, C. 2009a. Tailoring click models to user goals. In Proceedings of the Workshop on Web Search Click Data (WSCD’09). ACM, New York, 88--92.
[23]
Guo, F., Liu, C., Kannan, A., Minka, T., Taylor, M., Wang, Y.-M., and Faloutsos, C. 2009b. Click chain model in Web search. In Proceedings of the 18th International Conference on World Wide Web (WWW’09). ACM, New York, 11--20.
[24]
Guo, F., Liu, C., and Wang, Y.-M. 2009c. Efficient multiple-click models in Web search. In Proceedings of the 2nd ACM International Conference on Web Search and Data Mining (WSDM’09). ACM, New York, 124--131.
[25]
Joachims, T. 2002. Optimizing search engines using clickthrough data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’02). ACM, New York, 133--142.
[26]
Joachims, T., Granka, L., Pan, B., Hembrooke, H., and Gay, G. 2005. Accurately interpreting clickthrough data as implicit feedback. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’05). ACM, New York, 154--161.
[27]
Joachims, T., Granka, L., Pan, B., Hembrooke, H., Radlinski, F., and Gay, G. 2007. Evaluating the accuracy of implicit feedback from clicks and query reformulations in Web search. ACM Trans. Inform. Syst. 25, 2, 7.
[28]
Kempe, D. and Mahdian, M. 2008. A cascade model for externalities in sponsored search. In Proceedings of the 4th International Workshop on Internet and Network Economics (WINE’08). Springer-Verlag, Berlin, 585--596.
[29]
Liu, C., Li, M., and Wang, Y.-M. 2009. Post-rank reordering: Resolving preference misalignments between search engines and end users. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM’09). ACM, New York, 641--650.
[30]
Liu, T.-Y. 2009. Learning to rank for information retrieval. Found. Trends Inform. Retr. 3, 3, 225--331.
[31]
Minka, T. 2001. Expectation propagation for approximate Bayesian inference. In Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence (UAI’01). Morgan Kaufmann Publishers Inc., San Francisco, CA, 362--369.
[32]
Murphy, K. P. 2001. An introduction to graphical models. www.cs.iastate.edu/~honavar/bayes0.pdf.
[33]
Radlinski, F. and Joachims, T. 2005. Query chains: Learning to rank from implicit feedback. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (KDD’05). ACM, New York, 239--248.
[34]
Richardson, M., Dominowska, E., and Ragno, R. 2007. Predicting clicks: Estimating the click-through rate for new ads. In Proceedings of the 16th International Conference on World Wide Web (WWW’07). ACM, New York, 521--530.
[35]
Shokouhi, M., Scholer, F., and Turpin, A. 2008. Investigating the effectiveness of clickthrough data for document reordering. In Proceedings of the 30th European Conference on Information Retrieval (ECIR’08). 591--595.
[36]
Silverstein, C., Marais, H., Henzinger, M., and Moricz, M. 1999. Analysis of a very large Web search engine query log. SIGIR Forum 33, 1, 6--12.
[37]
Tsai, M.-F., Liu, T.-Y., Qin, T., Chen, H.-H., and Ma, W.-Y. 2007. FRank: A ranking method with fidelity loss. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, 383--390.
[38]
Wainwright, M. J. and Jordan, M. I. 2008. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn. 1, 1-2, 1--305.
[39]
Wang, K., Walker, T., and Zheng, Z. 2009. Pskip: Estimating relevance ranking quality from Web search clickthrough data. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’09). ACM, New York, 1355--1364.
[40]
Wedig, S. and Madani, O. 2006. A large-scale analysis of query logs for assessing personalization opportunities. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’06). ACM, New York, 742--747.
[41]
Xue, G.-R., Zeng, H.-J., Chen, Z., Yu, Y., Ma, W.-Y., Xi, W., and Fan, W. 2004. Optimizing Web search using Web click-through data. In Proceedings of the 13th ACM International Conference on Information and Knowledge Management (CIKM’04). ACM, New York, 118--126.
[42]
Yedidia, J. S., Freeman, W. T., and Weiss, Y. 2005. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. Inform. Theor. 51, 7, 2282--2312.
[43]
Zhang, Z. and Nasraoui, O. 2006. Mining search engine query logs for query recommendation. In Proceedings of the 15th International Conference on World Wide Web (WWW’06). ACM, New York, 1039--1040.
[44]
Zhu, Z. A., Chen, W., Minka, T., Zhu, C., and Chen, Z. 2010. A novel click model and its applications to online advertising. In Proceedings of the 3rd ACM International Conference on Web Search and Data Mining (WSDM’10).

Cited By

View all

Index Terms

  1. Bayesian Browsing Model: Exact Inference of Document Relevance from Petabyte-Scale Data

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Transactions on Knowledge Discovery from Data
    ACM Transactions on Knowledge Discovery from Data  Volume 4, Issue 4
    October 2010
    121 pages
    ISSN:1556-4681
    EISSN:1556-472X
    DOI:10.1145/1857947
    Issue’s Table of Contents
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 October 2010
    Accepted: 01 June 2010
    Received: 01 January 2010
    Published in TKDD Volume 4, Issue 4

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. Bayesian models
    2. Web search
    3. click log analysis

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)4
    • Downloads (Last 6 weeks)1
    Reflects downloads up to 09 Jan 2025

    Other Metrics

    Citations

    Cited By

    View all

    View Options

    Login options

    Full Access

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media