DOI: 10.1145/3564121.3564818
research-article

Large-Scale Entity Extraction from Enterprise Data

Published: 16 May 2023

Abstract

Enterprise adoption of cloud computing has exploded in the last decade, and most applications used by enterprise users have moved to the cloud. These applications include collaboration software (e.g., Word, Excel), instant messaging (e.g., Chat), and asynchronous communication (e.g., Email). This has resulted in an exponential increase in the volume of data arising from users' interactions with online applications (documents edited, people interacted with, meetings attended, etc.). A user's activities provide strong insights about that user: the meetings a user attends indicate the people the user works closely with, and the documents a user edits indicate the topics the user works on. Typically, this data is private and confidential to the enterprise, to a part of the enterprise, or to the individual employee. To provide a better experience and assist employees in their activities, it is critical to mine certain entities from this data.

In this tutorial, we explain the various entities that can be extracted from enterprise data to improve employee productivity. Specifically, we define and extract enterprise entities such as tasks, commitments, calendar activity, acronyms, topics, and definitions. These entities are extracted using different techniques: tasks and commitments are extracted using intent mining techniques (e.g., sentiment extraction), definitions are extracted using sequence mining techniques, and calendars are updated using the user's flight and hotel booking entities. Entity extraction from enterprise data poses an interesting and complex challenge from the point of view of scalable information extraction: models must be built from little training data, because of privacy and access-control constraints, yet they must be accurate enough to run on a large amount of diverse data from across the enterprise. Specifically, we need to overcome the following challenges:
Privacy: For legal and trust reasons, an individual user's data should be accessible only to the people for whom it is intended. Thus, we cannot directly apply the openly available techniques for mining these entities, all of which require labeled data.
Efficiency: Enterprises need to process billions of emails, chats, and other documents every day, and these differ from user to user, so extraction models need to be very efficient.
Scalability: Information is presented in enterprise documents in a large number of ways. For example, different providers represent a flight itinerary differently, and the same topic can be defined differently in different documents. We should be able to extract entities irrespective of how they are presented in the documents.
Multi-lingual: Users are located across geographies, so information extraction needs to work across multiple languages.
Extracting these entities requires supervised data. How do we get labeled data in a privacy-preserving manner? How do we build models with a minimal amount of supervised data? A large amount of unsupervised (unlabeled) data is available, and we present techniques to learn from large unlabeled data together with small labeled data. Many techniques use user feedback (e.g., clicks) to refine information extraction models, but such feedback is difficult to come by in enterprise settings. Can we use weak supervision? Can we take an off-the-shelf model (say, for definition classification) and refine it for enterprise settings? We will cover all of these techniques and how they improve precision and recall in enterprise settings.
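
To make the weak-supervision question concrete, the following is a minimal sketch, not the method presented in the tutorial. It assumes a hypothetical commitment-detection task: a few hand-written heuristics ("labeling functions") each vote on a sentence or abstain, and a simple majority vote combines them into a noisy label, so no human has to read the private text. The function names, keyword lists, and label values are all illustrative assumptions, and practical weak-supervision systems usually replace the majority vote with a learned model of labeling-function accuracy.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# Hypothetical label values for a commitment-detection task.
COMMITMENT, NOT_COMMITMENT = 1, 0
ABSTAIN = None  # a labeling function returns this when it has no opinion


def lf_first_person_future(text: str) -> Optional[int]:
    """Assumed heuristic: "I will" / "I'll" often signals a commitment."""
    lowered = text.lower()
    return COMMITMENT if ("i will" in lowered or "i'll" in lowered) else ABSTAIN


def lf_question(text: str) -> Optional[int]:
    """Assumed heuristic: a question is usually a request, not a commitment."""
    return NOT_COMMITMENT if text.strip().endswith("?") else ABSTAIN


def lf_deadline(text: str) -> Optional[int]:
    """Assumed heuristic: an explicit deadline phrase suggests a commitment."""
    lowered = text.lower()
    phrases = ("by tomorrow", "by friday", "by eod")
    return COMMITMENT if any(p in lowered for p in phrases) else ABSTAIN


LABELING_FUNCTIONS: Sequence[Callable[[str], Optional[int]]] = (
    lf_first_person_future,
    lf_question,
    lf_deadline,
)


def weak_label(text: str, lfs=LABELING_FUNCTIONS) -> Optional[int]:
    """Combine labeling-function votes by majority; abstain on no votes or a tie."""
    votes = [v for v in (lf(text) for lf in lfs) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting evidence with equal support
    return ranked[0][0]
```

Sentences that receive a non-abstaining label could then serve as noisy training data for a downstream classifier, while abstentions are simply left unlabeled.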


Published In

AIMLSystems '22: Proceedings of the Second International Conference on AI-ML Systems
October 2022
209 pages
ISBN:9781450398473
DOI:10.1145/3564121
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

AIMLSystems 2022
