research-article

Pangloss: Fast Entity Linking in Noisy Text Environments

Authors: Michael Conover, Scott Blackburn, Pete Skomoroch, Sam Shah

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

Pages 168 - 176

https://doi.org/10.1145/3219819.3219899

Published: 19 July 2018

Abstract

Entity linking is the task of mapping potentially ambiguous terms in text to their corresponding entities in a knowledge base such as Wikipedia. It is useful for organizing content, extracting structured data from textual documents, and in machine learning relevance applications such as semantic search, knowledge graph construction, and question answering. Traditionally, this work has focused on well-formed text such as news articles, but in common real-world datasets such as messaging, resumes, or short-form social media, non-grammatical, loosely structured text adds a new dimension to the problem. This paper presents Pangloss, a production system for entity disambiguation on noisy text. Pangloss combines a probabilistic linear-time key phrase identification algorithm with a semantic similarity engine based on context-dependent document embeddings to achieve better than state-of-the-art results (>5% in F1) compared to other research or commercially available systems. In addition, Pangloss uses a local embedded database with a tiered architecture to house its statistics and metadata, which allows rapid disambiguation in streaming contexts and on-device disambiguation in low-memory environments such as mobile phones.
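
The abstract does not spell out how the semantic similarity engine scores candidates, so the following is only a minimal sketch of the general idea of embedding-based disambiguation, not the Pangloss implementation. It assumes a context vector formed by averaging word vectors and ranks candidate entities by cosine similarity. The names `cosine`, `mean_vector`, `disambiguate`, and the toy two-dimensional vectors are all hypothetical and chosen purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(tokens, word_vectors):
    """Average the vectors of the tokens that have one; None if no token is known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def disambiguate(context_tokens, candidates, word_vectors, entity_vectors):
    """Return the candidate entity whose vector is closest to the context vector."""
    context = mean_vector(context_tokens, word_vectors)
    if context is None:
        return None  # no usable context, so no decision is made
    return max(candidates, key=lambda e: cosine(context, entity_vectors[e]))


# Toy, hand-made vectors (assumption): axis 0 ~ "animals", axis 1 ~ "software".
WORD_VECTORS = {
    "snake": [1.0, 0.0],
    "reptile": [0.9, 0.1],
    "code": [0.0, 1.0],
    "script": [0.1, 0.9],
}
ENTITY_VECTORS = {
    "Python_(programming_language)": [0.0, 1.0],
    "Pythonidae": [1.0, 0.0],
}
CANDIDATES = list(ENTITY_VECTORS)

if __name__ == "__main__":
    # Candidate entities for the ambiguous mention "python" in two noisy contexts.
    print(disambiguate(["wrote", "a", "script", "code"], CANDIDATES, WORD_VECTORS, ENTITY_VECTORS))
    print(disambiguate(["big", "snake", "reptile"], CANDIDATES, WORD_VECTORS, ENTITY_VECTORS))
```

A production system would presumably learn its embeddings from a large corpus and combine similarity with other signals; this sketch only shows the ranking step under the stated assumptions.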

Cited By

T. Al-Moslmi, M. Gallofre Ocana, A. L. Opdahl, and C. Veres. Named Entity Extraction for Knowledge Graphs: A Literature Overview. IEEE Access, 8: 32862--32881, 2020.
https://doi.org/10.1109/ACCESS.2020.2973928

Index Terms

Pangloss: Fast Entity Linking in Noisy Text Environments
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Information extraction
2. Information systems
  1. Information retrieval
    1. Document representation
      1. Content analysis and feature selection

Published In

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

July 2018

2925 pages

ISBN: 9781450355520

DOI: 10.1145/3219819

General Chairs: Yike Guo (Imperial College London), Faisal Farooq (IBM)

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

KDD '18

KDD '18: The 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 19 - 23, 2018

London, United Kingdom

Acceptance Rates

KDD '18 Paper Acceptance Rate 107 of 983 submissions, 11%;

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
