research-article

Improving graph-walk-based similarity with reranking: Case studies for personal information management

Authors:

William W. Cohen

ACM Transactions on Information Systems (TOIS), Volume 29, Issue 1

Article No.: 4, Pages 1 - 52

https://doi.org/10.1145/1877766.1877770

Published: 27 December 2010

Abstract

Relational or semistructured data is naturally represented by a graph, where nodes denote entities and directed typed edges represent the relations between them. Such graphs are heterogeneous, describing different types of objects and links. We represent personal information as a graph that includes messages, terms, persons, dates, and other object types, and relations like sent-to and has-term. Given the graph, we apply finite random graph walks to induce a measure of entity similarity, which can be viewed as a tool for performing search in the graph. Experiments conducted using personal email collections derived from the Enron corpus and other corpora show how the different tasks of alias finding, threading, and person name disambiguation can all be addressed as search queries in this framework, where the graph-walk-based similarity metric is preferable to alternative approaches, and further improvements are achieved with learning. While researchers have suggested tuning edge weight parameters to optimize the graph walk performance per task, we apply reranking to improve the graph walk results, using features that describe high-level information such as the paths traversed in the walk. High performance, together with practical runtimes, suggests that the described framework is a useful search system in the PIM domain, as well as in other semistructured domains.

Index Terms

Improving graph-walk-based similarity with reranking: Case studies for personal information management
1. Computing methodologies
  1. Machine learning
    1. Machine learning approaches
      1. Neural networks
2. Information systems
  1. Information retrieval
    1. Information retrieval query processing
    2. Retrieval models and ranking

Published In

ACM Transactions on Information Systems Volume 29, Issue 1

December 2010

232 pages

ISSN:1046-8188

EISSN:1558-2868

DOI:10.1145/1877766

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 December 2010

Accepted: 01 July 2010

Revised: 01 July 2010

Received: 01 September 2009

Published in TOIS Volume 29, Issue 1

Qualifiers

Research-article
Research
Refereed

Funding Sources

Defense Advanced Research Projects Agency
