Improving graph-walk-based similarity with reranking: Case studies for personal information management

Published: 27 December 2010

Abstract

Relational or semistructured data is naturally represented by a graph, where nodes denote entities and directed, typed edges represent the relations between them. Such graphs are heterogeneous, describing different types of objects and links. We represent personal information as a graph that includes messages, terms, persons, dates, and other object types, and relations like sent-to and has-term. Given the graph, we apply finite random graph walks to induce a measure of entity similarity, which can be viewed as a tool for performing search in the graph. Experiments conducted using personal email collections derived from the Enron corpus and other corpora show how the different tasks of alias finding, threading, and person name disambiguation can all be addressed as search queries in this framework, where the graph-walk-based similarity metric is preferable to alternative approaches, and further improvements are achieved with learning. While researchers have suggested tuning edge weight parameters to optimize graph walk performance per task, we apply reranking to improve the graph walk results, using features that describe high-level information such as the paths traversed in the walk. High performance, together with practical runtimes, suggests that the described framework is a useful search system in the PIM domain, as well as in other semistructured domains.
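
As an illustration of the idea in the abstract, the following is a minimal sketch of a finite random walk over a small typed graph, written in Python. It is not the authors' implementation. The function name `graph_walk`, the `stay` probability, the inverse edge labels, and the toy nodes are all assumptions made for this example, and outgoing edges are followed uniformly, whereas the abstract notes that edge weights may be tuned per task.

```python
from collections import defaultdict


def graph_walk(edges, query, steps=3, stay=0.5):
    """Finite lazy random walk over a directed, typed graph (illustrative sketch).

    edges: list of (source, edge_type, target) triples
    query: dict mapping start nodes to initial probability mass
    steps: number of walk steps (the walk is finite)
    stay:  assumed probability of remaining at the current node in each step
    Returns a dict mapping nodes to probability after `steps` steps.
    """
    outgoing = defaultdict(list)
    for src, _edge_type, dst in edges:
        # Edge types are ignored here: all outgoing edges get equal weight.
        outgoing[src].append(dst)

    dist = dict(query)
    for _ in range(steps):
        nxt = defaultdict(float)
        for node, mass in dist.items():
            targets = outgoing.get(node, [])
            if not targets:
                nxt[node] += mass  # a node with no out-edges keeps its mass
                continue
            nxt[node] += stay * mass
            share = (1.0 - stay) * mass / len(targets)
            for target in targets:
                nxt[target] += share
        dist = dict(nxt)
    return dist


# Toy personal-information graph: two messages, two persons, one term.
# Each relation is paired with a hypothetical inverse so the walk can go both ways.
edges = [
    ("m1", "sent-to", "alice"), ("alice", "sent-to-inv", "m1"),
    ("m2", "sent-to", "bob"), ("bob", "sent-to-inv", "m2"),
    ("m1", "has-term", "budget"), ("budget", "has-term-inv", "m1"),
    ("m2", "has-term", "budget"), ("budget", "has-term-inv", "m2"),
]

# Query: start all probability mass at the term node, then rank person nodes.
scores = graph_walk(edges, {"budget": 1.0}, steps=2)
persons = {"alice", "bob"}
ranked_persons = sorted(persons, key=lambda p: (-scores.get(p, 0.0), p))
```

Here the query is a distribution over start nodes, and the resulting probabilities over nodes of the desired type (persons) serve as the similarity ranking. A reranking stage, as described in the abstract, would take such a ranked list as input and reorder it using additional features.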

Published In

ACM Transactions on Information Systems, Volume 29, Issue 1
December 2010
232 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/1877766
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 December 2010
Accepted: 01 July 2010
Revised: 01 July 2010
Received: 01 September 2009
Published in TOIS Volume 29, Issue 1

Author Tags

  1. Graph walk
  2. PIM
  3. learning
  4. semistructured data

Qualifiers

  • Research-article
  • Research
  • Refereed
