skip to main content
10.1145/3219819.3219899acmotherconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
research-article

Pangloss: Fast Entity Linking in Noisy Text Environments

Published: 19 July 2018 Publication History

Abstract

Entity linking is the task of mapping potentially ambiguous terms in text to their constituent entities in a knowledge base like Wikipedia. This is useful for organizing content, extracting structured data from textual documents, and in machine learning relevance applications like semantic search, knowledge graph construction, and question answering. Traditionally, this work has focused on text that has been well-formed, like news articles, but in common real world datasets such as messaging, resumes, or short-form social media, non-grammatical, loosely-structured text adds a new dimension to this problem. This paper presents Pangloss, a production system for entity disambiguation on noisy text. Pangloss combines a probabilistic linear-time key phrase identification algorithm with a semantic similarity engine based on context-dependent document embeddings to achieve better than state-of-the-art results (>5% in F1) compared to other research or commercially available systems. In addition, Pangloss leverages a local embedded database with a tiered architecture to house its statistics and metadata, which allows rapid disambiguation in streaming contexts and on-device disambiguation in low-memory environments such as mobile phones.

References

[1]
J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Learning to compose neural networks for question answering. arXiv preprint arXiv:1601.01705, 2016.
[2]
S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. Dbpedia: A nucleus for a web of open data. In The Semantic Web, pages 722--735. Springer, 2007.
[3]
T. Berners-Lee, J. Hendler, and O. Lassila. The semantic web. Scientific American, 284 (5): 34--43, 2001.
[4]
R. Blanco, G. Ottaviano, and E. Meij. Fast and space-efficient entity linking for queries. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pages 179--188. ACM, 2015.
[5]
D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3 (Jan): 993--1022, 2003.
[6]
A. Bordes, J. Weston, R. Collobert, Y. Bengio, et al. Learning structured embeddings of knowledge bases. In AAAI, volume 6, page 6, 2011.
[7]
A. Bordes, X. Glorot, J. Weston, and Y. Bengio. Joint learning of words and meaning representations for open-text semantic parsing. In Artificial Intelligence and Statistics, pages 127--135, 2012.
[8]
C. Cherry and H. Guo. The unreasonable effectiveness of word representations for Twitter named entity recognition. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 735--745, 2015.
[9]
K. Clark and C. D. Manning. Improving coreference resolution by learning entity-level distributed representations. arXiv preprint arXiv:1606.01323, 2016.
[10]
S. Cucerzan. Large-scale named entity disambiguation based on wikipedia data. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007.
[11]
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American society for information science, 41 (6): 391, 1990.
[12]
S. Dong, M. Callaghan, L. Galanis, D. Borthakur, T. Savor, and M. Strum. Optimizing space amplification in RocksDB. In CIDR, 2017.
[13]
M. Dredze, P. McNamee, D. Rao, A. Gerber, and T. Finin. Entity disambiguation for knowledge base population. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 277--285. Association for Computational Linguistics, 2010.
[14]
Y. Fang and M.-W. Chang. Entity linking on microblogs with spatial and temporal signals. Transactions of the Association for Computational Linguistics, 2: 259--272, 2014.
[15]
S. Guo, M.-W. Chang, and E. Kiciman. To link or not to link? A study on end-to-end tweet entity linking. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1020--1030, 2013 a.
[16]
Y. Guo, B. Qin, T. Liu, and S. Li. Microblog entity linking by leveraging extra posts. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 863--868, 2013 b.
[17]
B. Hachey, W. Radford, J. Nothman, M. Honnibal, and J. R. Curran. Evaluating entity linking with wikipedia. Artificial Intelligence, 194: 130--150, 2013.
[18]
F. Hasibi, K. Balog, and S. E. Bratsberg. On the reproducibility of the TAGME entity linking system. In Proceedings of 38th European Conference on Information Retrieval, ECIR '16, pages 436--449. Springer, 2016.
[19]
J. Hoffart, M. A. Yosef, I. Bordino, H. Fürstenau, M. Pinkal, M. Spaniol, B. Taneva, S. Thater, and G. Weikum. Robust disambiguation of named entities in text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 782--792. Association for Computational Linguistics, 2011.
[20]
S. Kulkarni, A. Singh, G. Ramakrishnan, and S. Chakrabarti. Collective annotation of Wikipedia entities in web text. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 457--466. ACM, 2009.
[21]
A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, V. Zhong, R. Paulus, and R. Socher. Ask me anything: Dynamic memory networks for natural language processing. In International Conference on Machine Learning, pages 1378--1387, 2016.
[22]
J. Liu, J. Shang, C. Wang, X. Ren, and J. Han. Mining quality phrases from massive text corpora. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1729--1744. ACM, 2015.
[23]
C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55--60, 2014.
[24]
L. McInnes, J. Healy, and S. Astels. HDBSCAN: Hierarchical density based clustering. The Journal of Open Source Software, 2 (11): 205, 2017.
[25]
R. Mihalcea and A. Csomai. Wikify!: Linking documents to encyclopedic knowledge. In Proceedings of the 16th ACM Conference on Information and Knowledge Management, pages 233--242. ACM, 2007.
[26]
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111--3119, 2013.
[27]
D. Milne and I. H. Witten. Learning to link with Wikipedia. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 509--518. ACM, 2008.
[28]
A. Moro, A. Raganato, and R. Navigli. Entity linking meets word sense disambiguation: A unified approach. Transactions of the Association for Computational Linguistics, 2: 231--244, 2014.
[29]
D. Nadeau and S. Sekine. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30 (1): 3--26, 2007.
[30]
R. Navigli. Word sense disambiguation: A survey. ACM Computing Surveys (CSUR), 41 (2): 10, 2009.
[31]
D. M. Powers. Applications and explanations of Zipf's law. In Proceedings of the joint conferences on new methods in language processing and computational natural language learning, pages 151--160. Association for Computational Linguistics, 1998.
[32]
A. Ritter, S. Clark, O. Etzioni, et al. Named entity recognition in tweets: an experimental study. In Proceedings of the conference on empirical methods in natural language processing, pages 1524--1534. Association for Computational Linguistics, 2011.
[33]
W. Shen, J. Wang, P. Luo, and M. Wang. Linking named entities in tweets with knowledge base via user interest modeling. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 68--76. ACM, 2013.
[34]
W. Shen, J. Wang, and J. Han. Entity linking with a knowledge base: Issues, techniques, and solutions. IEEE Transactions on Knowledge and Data Engineering, 27 (2): 443--460, 2015.
[35]
S. Singh, A. Subramanya, F. Pereira, and A. McCallum. Wikilinks: A large-scale cross-document coreference corpus labeled via links to wikipedia. University of Massachusetts, Amherst, Tech. Rep. UM-CS-2012, 15, 2012.
[36]
A. Singhal. Introducing the knowledge graph: things, not strings. Official google blog, 2012.
[37]
R. Socher, D. Chen, C. D. Manning, and A. Ng. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems, pages 926--934, 2013.
[38]
W. M. Soon, H. T. Ng, and D. C. Y. Lim. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27 (4): 521--544, 2001.
[39]
F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: A core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, pages 697--706. ACM, 2007.
[40]
Z. Wang, J. Zhang, J. Feng, and Z. Chen. Knowledge graph and text jointly embedding. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1591--1601, 2014.
[41]
I. Yamada, H. Takeda, and Y. Takefuji. Enhancing named entity recognition in Twitter messages using entity linking. In Proceedings of the Workshop on Noisy User-generated Text, pages 136--140, 2015.
[42]
I. Yamada, H. Shindo, H. Takeda, and Y. Takefuji. Joint learning of the embedding of words and entities for named entity disambiguation. arXiv preprint arXiv:1601.01343, 2016.
[43]
Z. Zheng, F. Li, M. Huang, and X. Zhu. Learning to link entities with knowledge base. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 483--491. Association for Computational Linguistics, 2010.
[44]
S. Zwicklbauer, C. Seifert, and M. Granitzer. Robust and collective entity disambiguation through semantic embeddings. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pages 425--434. ACM, 2016.

Cited By

View all

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Other conferences
KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
July 2018
2925 pages
ISBN:9781450355520
DOI:10.1145/3219819
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 July 2018

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. entity linking
  2. knowledge bases
  3. natural language understanding

Qualifiers

  • Research-article

Conference

KDD '18
Sponsor:

Acceptance Rates

KDD '18 Paper Acceptance Rate 107 of 983 submissions, 11%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)10
  • Downloads (Last 6 weeks)0
Reflects downloads up to 22 Dec 2024

Other Metrics

Citations

Cited By

View all

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media