Processing Long Queries Against Short Text: Top-k Advertisement Matching in News Stream Applications

Published: 12 May 2017

Abstract

Many applications in real-time news stream advertising call for efficient processing of long queries against short text. In such applications, dynamic news feeds are treated as queries that are matched against an advertisement (ad) database to retrieve the k most relevant ads. Existing approaches to keyword retrieval do not perform well in this search scenario, where queries are triggered at a very high frequency. To address the problem, we introduce new techniques that significantly improve search performance. First, we devise a two-level partitioning for tight upper bound estimation and a lazy evaluation scheme that delays full evaluation of unpromising candidates; together these yield a three- to four-fold performance improvement on a database of 7 million ads. Second, we propose a novel rank-aware block-oriented inverted index to further improve performance. In this index scheme, each entry in an inverted list is assigned a rank according to its importance in the ad. Based on this index, we introduce a block-at-a-time search strategy that supports much tighter upper bound estimation and very early termination. We have conducted experiments with real datasets, and the results show that the rank-aware method can further improve performance by an order of magnitude.
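To make the idea of block-at-a-time search with upper bound estimation and early termination concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a simple dot-product relevance score with non-negative weights, orders each inverted list by term weight as a stand-in for the paper's rank, and fully scores every newly seen ad (it omits the two-level partitioning and lazy evaluation). All names (`build_index`, `top_k`, `ADS`, `QUERY`) and the toy data are hypothetical.

```python
import heapq
from collections import defaultdict


def build_index(ads, block_size=2):
    """Build {term: [block, ...]}; each block is a list of (weight, ad_id),
    and each list is sorted by weight in descending order before blocking."""
    postings = defaultdict(list)
    for ad_id, terms in ads.items():
        for term, weight in terms.items():
            postings[term].append((weight, ad_id))
    index = {}
    for term, plist in postings.items():
        plist.sort(key=lambda p: (-p[0], p[1]))
        index[term] = [plist[i:i + block_size]
                       for i in range(0, len(plist), block_size)]
    return index


def score(query, ad_terms):
    """Assumed relevance: dot product of query and ad term weights."""
    return sum(qw * ad_terms.get(t, 0.0) for t, qw in query.items())


def top_k(query, ads, index, k):
    """Read one block per query term per round; stop once no unseen ad
    can beat the current k-th best score."""
    lists = {t: index[t] for t in query if t in index}
    heap = []    # min-heap of (score, ad_id) holding the best k so far
    seen = set()
    depth = 0
    max_depth = max((len(blocks) for blocks in lists.values()), default=0)
    while depth < max_depth:
        for blocks in lists.values():
            if depth >= len(blocks):
                continue
            for _, ad_id in blocks[depth]:
                if ad_id in seen:
                    continue
                seen.add(ad_id)
                s = score(query, ads[ad_id])
                if len(heap) < k:
                    heapq.heappush(heap, (s, ad_id))
                elif s > heap[0][0]:
                    heapq.heapreplace(heap, (s, ad_id))
        depth += 1
        # Upper bound for any unseen ad: in every list its weight is at most
        # the first (largest) weight of the next unread block.
        bound = sum(query[t] * blocks[depth][0][0]
                    for t, blocks in lists.items() if depth < len(blocks))
        if len(heap) == k and heap[0][0] >= bound:
            break  # early termination
    return sorted(heap, key=lambda x: (-x[0], x[1]))


# Toy data: short ads, and a longer "news" query with several terms.
ADS = {
    "a1": {"phone": 3.0, "camera": 1.0},
    "a2": {"phone": 1.0, "battery": 2.0},
    "a3": {"flight": 2.0, "hotel": 2.0},
    "a4": {"camera": 3.0, "lens": 1.0},
    "a5": {"hotel": 1.0, "beach": 1.0},
}
QUERY = {"phone": 2.0, "camera": 1.0, "launch": 1.0,
         "battery": 1.0, "review": 1.0}

if __name__ == "__main__":
    print(top_k(QUERY, ADS, build_index(ADS, block_size=1), k=2))
```

With `block_size=1` on this toy data, the search stops after the first round: the second-best score found so far (4.0) already exceeds the bound (3.0) computed from the unread blocks. The tighter the per-block bound, the earlier such a search can stop, which is the effect the abstract attributes to the rank-aware index.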


    Published In

    ACM Transactions on Information Systems, Volume 35, Issue 3
    July 2017
    410 pages
    ISSN:1046-8188
    EISSN:1558-2868
    DOI:10.1145/3026478
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 12 May 2017
    Accepted: 01 December 2016
    Revised: 01 November 2016
    Received: 01 April 2016
    Published in TOIS Volume 35, Issue 3

    Author Tags

    1. Long queries
    2. inverted index
    3. rank-aware partitioning
    4. short text
    5. top-k retrieval

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • National Natural Science Foundation of China
    • CCF-Tencent Open Research Fund
    • Start-Up Research Grant of Renmin University of China
    • Fundamental Research Funds for the Central Universities
    • Priority Academic Program Development of Jiangsu Higher Education Institutions
    • Jiangsu Collaborative Innovation Center on Atmospheric Environment and Equipment Technology
