Abstract
Enterprise adoption of cloud computing has exploded in the last decade, and most of the applications used by enterprise users have moved to the cloud. These applications include collaboration software (e.g., Word, Excel), instant messaging (e.g., Chat), asynchronous communication (e.g., Email), etc. This has resulted in an exponential increase in the volume of data arising from users' interactions with these online applications (such as documents edited, people interacted with, meetings attended, etc.). A user's activities provide strong insights about that user: for example, the meetings a user attends indicate the set of people the user works closely with, and the documents a user edits indicate the topics the user works on. Typically, this data is private and confidential to the enterprise, a part of the enterprise, or the individual employee. To provide a better experience and assist employees in their activities, it is critical to mine certain entities from this data. In this tutorial, we explain various entities that can be extracted from enterprise data to help employees be more productive. Specifically, we define and extract enterprise entities such as tasks, commitments, calendar activity, acronyms, topics, definitions, etc. These entities are extracted using different techniques: tasks and commitments are extracted using intent mining techniques (e.g., sentiment extraction), definitions are extracted using sequence mining techniques, calendars are updated using the user’s flight/hotel booking entities, etc. Entity extraction from enterprise data poses an interesting and complex challenge from a scalable information extraction point of view: we must build information extraction models when there is little data to learn from, due to privacy and access-control constraints, yet these models must be highly accurate and run on a large amount of diverse data from the whole enterprise. Specifically, we need to overcome the following challenges:
Privacy: For legal and trust reasons, an individual user’s data should be accessible only to the people for whom it is intended. Thus, we cannot directly apply the openly available techniques for mining these entities, all of which require labeled data.
Efficiency: Enterprises need to process billions of emails, chats, and other documents every day, and this content differs from user to user, so extraction models need to be very efficient.
Scalability: There are many variations in the way information is presented in enterprise documents. For example, a flight itinerary is represented in different ways by different providers, and the definition of the same topic can be expressed differently in different documents. We should be able to extract entities irrespective of the way they are presented in the documents.
Multi-lingual: Users are located across geographies, and hence information extraction needs to work across multiple languages.
To extract these entities, one needs supervised data. How do we obtain labeled data in a privacy-preserving manner? How do we build models with the minimum amount of supervised data? We have a large amount of unsupervised data, so we present techniques to learn from large, unsupervised data along with small, supervised data. Many techniques use user feedback (e.g., clicks) to refine information extraction models, but such feedback is difficult to come by in enterprise settings. Can we use weak supervision? Can we take an off-the-shelf model (say, for definition classification) and refine it for enterprise settings? We will cover all of these techniques and how they improve precision and recall in enterprise settings.
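To make the weak supervision question concrete, here is a minimal sketch of how noisy heuristic rules could label sentences for definition classification without anyone reading private content. It is an illustration under stated assumptions, not the method presented in the tutorial: the rule names, the regular expressions, and the simple majority vote are all hypothetical stand-ins.

```python
import re
from collections import Counter

# Label values (hypothetical convention for this sketch).
DEFINITION, NOT_DEFINITION, ABSTAIN = 1, 0, -1

# Each "labeling function" is a cheap heuristic that votes or abstains.
def lf_definition_phrase(sentence):
    pattern = r"\b(is defined as|refers to|stands for)\b"
    return DEFINITION if re.search(pattern, sentence, re.I) else ABSTAIN

def lf_copula_opening(sentence):
    # Matches openings such as "A sprint is a ..." or "OKRs are the ...".
    pattern = r"^[A-Z][\w\- ]{0,40}? (is|are) (a|an|the) "
    return DEFINITION if re.match(pattern, sentence) else ABSTAIN

def lf_question(sentence):
    return NOT_DEFINITION if sentence.strip().endswith("?") else ABSTAIN

def lf_request(sentence):
    pattern = r"^(please|can you|could you)\b"
    return NOT_DEFINITION if re.match(pattern, sentence.strip(), re.I) else ABSTAIN

LABELING_FUNCTIONS = [lf_definition_phrase, lf_copula_opening, lf_question, lf_request]

def weak_label(sentence):
    """Combine the non-abstaining votes by majority; abstain on ties or no votes."""
    votes = [lf(sentence) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting evidence, leave unlabeled
    return ranked[0][0]

if __name__ == "__main__":
    for s in ["OKR stands for Objectives and Key Results.",
              "Can you send the deck by Friday?",
              "See you at lunch."]:
        print(weak_label(s), s)
```

Sentences that receive a non-abstain label could then serve as noisy training data for a small classifier, or for refining an off-the-shelf one. A production system would more likely weight the rules by estimated accuracy than use a plain majority vote.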