
PSYCHONET 2 - Contextualized and Enriched Psycholinguistic Commonsense Ontology

2011, Proceedings of the International Conference on Knowledge Engineering and Ontology Development

Citation for published version: Mohtasseb Billah, Haytham and Ahmed, Amr and Altadmri, Amjad and Cobham, David (2011) PSYCHONET 2: contextualized and enriched psycholinguistic commonsense ontology. In: International Conference on Knowledge Engineering and Ontology Development (KEOD), 25 - 29 October 2011, Paris, France. Link to record in the Kent Academic Repository (KAR): http://kar.kent.ac.uk/45945/

Haytham Mohtasseb, Amr Ahmed, Amjad AlTadmri and David Cobham
School of Computer Science, University of Lincoln, Brayford Pool, Lincoln, U.K.
{hmohtasseb, aahmed, atadmri, dcobham}@lincoln.ac.uk

Keywords: Commonsense knowledge base, Semantic network, Ontology development, Psycholinguistic, Text classification.
Abstract: PsychoNet 1 has demonstrated the feasibility of integrating a psycholinguistic taxonomy, represented by LIWC, with its semantic textual representation in the form of a commonsense ontology, represented by ConceptNet. However, PsychoNet 1 has various limitations, including the lack of any inference of the context of a concept's annotations. In this paper, we address most of those limitations and introduce a new, enhanced and enriched version, PsychoNet 2. PsychoNet 2 utilizes WordNet, in addition to LIWC and ConceptNet, to produce an integrated contextualized psycholinguistic ontology. The first and main contribution is that, in PsychoNet 2, each concept is annotated with the most representative contextual psycholinguistic categories, rather than with all applicable categories. The second contribution is the enrichment of LIWC using WordNet. This produces an enriched version of LIWC that may also be used independently in other applications. It also substantially enriches PsychoNet 2, as it allows the inclusion of concepts that were missing from PsychoNet 1 because the original LIWC lacked the corresponding words. A sample text classification application, a mood prediction task, is presented to demonstrate the introduced enhancements. The results confirm the improved performance of PsychoNet 2 over PsychoNet 1.

1 INTRODUCTION

The ontology engineering community is increasingly working towards integrating ontologies so that they can share and reuse each other's knowledge (Noy and Hafner, 1997). PsychoNet 1 (Mohtasseb and Ahmed, 2010b) introduced a novel commonsense knowledgebase that links the psycholinguistic taxonomy with its semantic textual representation. It allows the researcher to use one coherent knowledgebase that has the power of semantic commonsense and psycholinguistic taxonomy.
There are many types of tagging and integration (more details in Section 2), but this study presents the benefits of integrating LIWC, ConceptNet, and WordNet for a wide range of applications. This paper extends ConceptNet, a commonsense ontology (Liu and Singh, 2004), by adding a psycholinguistic layer derived from LIWC (Pennebaker et al., 2001), which is in turn enriched by the lexical semantic network WordNet (Miller, 1995). Furthermore, in PsychoNet 2, only the common, highly rated annotations are kept, as they represent the context of the concept.

The rest of the paper is organized as follows. Section 2 reviews recent work related to our domain. Section 3 presents PsychoNet 2, including the enrichment and contextualization processes. Section 4 shows the application of PsychoNet 2 to mood classification and its results. Finally, Section 5 concludes the paper.

2 BACKGROUND

This section presents an overview of related work and existing developments in the same area, including LIWC, ConceptNet, WordNet, and PsychoNet 1.

Figure 1: PsychoNet 2 Building Framework (components shown: ConceptNet, LIWC, WordNet, Concepts, Synsets, Annotations, Annotation, Enrichment, Contextualization, PsychoNet).

Linguistic Inquiry and Word Count (LIWC) (Pennebaker et al., 2001) was built by classifying a nominated set of 2000 words (and word stems) into several dozen psycholinguistic categories, based on the judgment of a group of linguistic experts. The categories include positive and negative emotional words, functional words (pronouns, articles, prepositions), health and biology categories, and other contextual categories (e.g., sport, family, religion, death). LIWC has been used successfully in numerous text analysis tasks for analyzing the emotions of users in blog
text (Gill et al., 2008; Hancock et al., 2008; Hancock et al., 2007), identifying the gender of bloggers (Nowson and Oberlander, 2006), recognizing personality (Gill, 2003; Mairesse et al., 2007), studying demographic differences across the styles of bloggers (Mohtasseb and Ahmed, 2010a), and authorship identification (Mohtasseb and Ahmed, 2009a; Mohtasseb and Ahmed, 2009b). However, all of these tasks have been applied at the word level rather than at the concept level, which is available in PsychoNet 1.

The ConceptNet knowledgebase is a semantic network that encompasses the spatial, physical, social, and temporal aspects of everyday life (Liu and Singh, 2004). ConceptNet is generated automatically from the 700,000 sentences of the Open Mind Common Sense corpus (footnote 1). ConceptNet is currently considered to be the largest commonsense semantic network, containing over 250,000 nodes. Nodes are semi-structured English fragments, interrelated by an ontology of twenty semantic relations (predicates). ConceptNet is very useful in describing real-life scenes, which makes it a good candidate for integration with LIWC, which adds the psycholinguistic dimension.

WordNet is a large lexical database of English (Miller, 1995). Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets). It is a very rich, domain-independent knowledgebase of lexical units that consist of various forms of synonyms. WordNet is effective for studying the relationships between similar words in terms of meaning, generalization or specialization.

PsychoNet 1, in turn, was the first extension of ConceptNet in a psycholinguistic direction, utilizing LIWC. It was built by a fully automated engine that performs lexical analysis on concepts and extracts the corresponding psycholinguistic categories.
It allows the researcher to use one coherent knowledgebase that has the power of semantic commonsense and psycholinguistic taxonomy. Moreover, PsychoNet 1 simplifies applying text classification tasks in ConceptNet and allows the huge concept graphs to be filtered by a key category for a specific application. PsychoNet 2 introduces further improvements over PsychoNet 1, as explained in the next section.

Footnote 1: http://web.media.mit.edu/push/OMCS-Research.html

3 PSYCHONET 2

In PsychoNet 1 (Mohtasseb and Ahmed, 2010b), each node is a concept associated with a psychometric field that contains the psycholinguistic categories (annotations) and their relevance degree. PsychoNet 2 addresses many limitations, including missing concepts and the lack of contextualization, and introduces more substantial improvements through the addition of two new stages, as depicted in Figure 1.

The first stage, Enrichment, utilizes WordNet to deal with those concepts in ConceptNet that do not have matching LIWC annotations. The resulting synonym sets for the original component words are then annotated using LIWC. This is explained in detail in Section 3.1. Section 3.2 presents the second stage, Contextualization, which starts by selecting the synonym sets that share the same set of annotations. Then, it deduces the highly ranked annotations that potentially represent the context of the concept. The following subsections explain the two new stages, Enrichment and Contextualization, respectively.

3.1 Enrichment

Our analysis of PsychoNet 1 found that 21498 concepts had not been included. Moreover, the analysis showed that 31863 words that belong to the commonsense concepts do not have matching LIWC categories. To address this, and to annotate and include most concepts, we had to develop a way to enrich LIWC with those missing words and their variations.
Therefore, WordNet is utilized here to expand and enrich the contents of LIWC based on the commonsense words of ConceptNet, as explained below.

Assume that $W = \{w_1, w_2, \ldots, w_n\}$ is the set of commonsense words that do not have LIWC annotations. For each word $w_i \in W$, all synsets (synonym sets) $\{S_1, S_2, \ldots, S_m\}$ of this word are extracted using WordNet. Hence, $S_j = \{s_1, s_2, \ldots, s_l\}$ represents one of the synsets, where $s_k$ is a synonym for $w_i$ within the context of the synset $S_j$. $A_{S_j} = \{a_1, a_2, \ldots, a_z\}$ is the list of LIWC annotations of $S_j$, provided that there are annotations common to all $s_k$. Then, the set of final LIWC annotations $A_i$ of $w_i$ is produced by the union of the annotations of its synsets: $A_i = A_{S_1} \cup A_{S_2} \cup \cdots \cup A_{S_v}$.

According to the approach described above, if a word $w_i$ has a non-empty annotation set $A_i$, then $w_i$ is added to the word lists of its relevant psycholinguistic categories. In addition, the annotation list $A_i$ contributes to the annotation of the concept from which $w_i$ originated (Section 3.2). Table 1 shows a snapshot of the resulting new words along with the assigned annotations.

Table 1: Snapshot of the result showing the added commonsense words to LIWC using WordNet.

Word        Synset                          Annotation
----------  ------------------------------  -------------------------------
earth       world, globe                    Relativity, Space
earth       ground                          Relativity, Space
absorbing   engross, engage, occupy         Affective, Pos.Emotion
live        alive                           Biological, Health, Death
live        exist, survive, subsist         Achievement
awake       alert, alive                    Biological, Health
cereal      food grain                      Ingestion, Biological
newspaper   paper                           Work
audience    hearing                         Perceptual, Social, Hearing
calculate   account                         Money
gift        endow, empower, invest, endue   Money
gift        endowment, talent               Affective, Pos.Emotion
crime       law-breaking, offense           Neg.Emotion, Affective, Anger
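The enrichment rule above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The dictionaries `LIWC` and `SYNSETS` are tiny hypothetical stand-ins for the real LIWC lexicon and for WordNet lookups, and the function names are invented for illustration. The sketch also assumes that "annotations common to all synonyms" means the intersection over those synonyms that LIWC actually covers, which the text does not state explicitly.

```python
from typing import Dict, List, Set

# Hypothetical toy lexicon: word -> LIWC categories (not the real LIWC content).
LIWC: Dict[str, Set[str]] = {
    "invest": {"Money"},
    "endow": {"Money", "Social"},
    "talent": {"Affective", "Pos.Emotion"},
    "paper": {"Work"},
}

# Hypothetical stand-in for WordNet: word -> list of synsets (synonym lists).
SYNSETS: Dict[str, List[List[str]]] = {
    "gift": [["endow", "empower", "invest", "endue"], ["endowment", "talent"]],
    "newspaper": [["paper"]],
}


def annotate_synset(synonyms: List[str], liwc: Dict[str, Set[str]]) -> Set[str]:
    """Annotation set of one synset: categories shared by all synonyms
    that the lexicon covers (assumed reading of the paper's rule)."""
    covered = [liwc[s] for s in synonyms if s in liwc]
    if not covered:
        return set()  # this synset has no representative annotation
    return set.intersection(*covered)


def enrich_word(word: str,
                synsets_of: Dict[str, List[List[str]]],
                liwc: Dict[str, Set[str]]) -> Set[str]:
    """Final annotation set of a word: union over its synsets' annotation sets."""
    annotations: Set[str] = set()
    for synonyms in synsets_of.get(word, []):
        annotations |= annotate_synset(synonyms, liwc)
    return annotations


print(sorted(enrich_word("gift", SYNSETS, LIWC)))
```

With these toy entries, the two synsets of "gift" contribute different annotation sets, which are then merged by the union step, in line with the two "gift" rows of Table 1.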
As a result of the above enrichment stage, 7772 words have been added to LIWC, 8663 new concepts have been included in PsychoNet 2, and 56615 concepts have been enriched with extra annotations. This benefits both those who want to use LIWC alone, in its enriched version, and those who need the full PsychoNet 2 knowledgebase.

It is worth mentioning that the number of annotation sets $A_{S_j}$ might not be equal to the number of synsets. This is because in some cases there might be a synset $S_j$ that has no representative psycholinguistic annotation (i.e. it has an empty annotation set $A_{S_j}$). Therefore, we can see in Table 1 that the enrichment process provides two matching synsets with different sets of annotations for the word live, whereas it provides only one annotation for the word newspaper.

3.2 Contextualization

PsychoNet 1 associates each concept with a list of psycholinguistic annotations and their corresponding frequencies, which arise because annotations recur across the words of the concept. Although a single word may carry multiple annotations, only the annotations related to the context should be selected. PsychoNet 2 is intended to select the psycholinguistic annotations based on the context of the representing words, keeping only the psycholinguistic annotations (categories) that suit the context of the concept.

Table 2 shows an example of annotation results before and after contextualization. We can see that the concept “a scream of freedom” has conflicting annotations, Neg.Emotion and Pos.Emotion, resulting from its component words. Moreover, it includes Hearing, which is related to one of the words but lies outside the context of this concept. The proposed algorithm ended up with the Affective annotation, which is more representative of the context of that concept. The same can be seen in the concept “The best way to commit a crime”.
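To make the idea concrete, the following Python sketch shows one simple way such a selection could work. It is a simplified stand-in and not the algorithm used in PsychoNet 2, whose ranking procedure is not specified in detail here: it merely keeps the annotations supported by the largest number of component words. The word-level annotations in the example are hypothetical, chosen so that their union matches the "previous annotations" of the concept “a scream of freedom”.

```python
from collections import Counter
from typing import Dict, Set


def contextualize(word_annotations: Dict[str, Set[str]]) -> Set[str]:
    """Keep the annotations shared by the most component words.

    Simplified majority-vote heuristic (an assumption for illustration),
    not the ranking actually used to build PsychoNet 2.
    """
    counts: Counter = Counter()
    for annotations in word_annotations.values():
        counts.update(set(annotations))  # each word votes once per category
    if not counts:
        return set()
    top = max(counts.values())
    return {category for category, n in counts.items() if n == top}


# Hypothetical word-level annotations for "a scream of freedom".
scream_of_freedom = {
    "scream": {"Affective", "Neg.Emotion", "Hearing", "Perceptual"},
    "freedom": {"Affective", "Pos.Emotion"},
}

print(sorted(contextualize(scream_of_freedom)))
```

Under these assumed inputs, the conflicting Neg.Emotion and Pos.Emotion votes and the out-of-context Hearing vote are each backed by a single word, so only the shared Affective category survives.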
Similarly, the concept “coffee shop” has Leisure as the context annotation. Many other concepts, such as “hit ball” and “zip code”, were not included in PsychoNet 1. In PsychoNet 2 they are included and annotated, thanks to the enrichment of LIWC using WordNet (Section 3.1).

Table 2: Snapshot of the result showing the previous and new annotations.

Concept                          Previous Annotations                                  New Annotations
-------------------------------  ----------------------------------------------------  ----------------------------------------
a scream of freedom              Affective, Hearing, Perceptual, Neg.Emotion,          Affective
                                 Pos.Emotion
coffee shop                      Ingestion, Biological, Leisure, Money                 Leisure
swimming pool                    Relativity, Motion, Leisure                           Leisure
The best way to commit a crime   Quantifier, Affective, Cognitive, Anger, Certainty,   Relativity, Cognitive, Certainty,
                                 Achievement, Relativity, Neg.Emotion, Pos.Emotion     Affective
hit balls                        Nil                                                   Leisure
zip code                         Nil                                                   Relativity, Space

4 APPLICATION: MOOD CLASSIFICATION

This section presents a sample text classification application using PsychoNet 2. The contribution lies in the accuracy improvement achieved using PsychoNet 2 compared to PsychoNet 1 and LIWC. We used the same mood experiment framework and corpus presented in (Mohtasseb and Ahmed, 2010b) to build a classification model that distinguishes between moods using LIWC and both versions of PsychoNet. The experiments differ in how the learning vectors are created: either by using LIWC to extract the features from words, or by applying the psycholinguistic-index function (Mohtasseb and Ahmed, 2010b) over the extracted concepts. For each mood, the F-Measure value of the classification result is calculated. The results presented in Table 3 show that PsychoNet 2 outperforms both LIWC and PsychoNet 1 in all moods. Section 4.1 discusses the results in more detail.

Table 3: Mood classification results using F-Measure.

Mood           LIWC   PNet 1   PNet 2
-------------  -----  -------  -------
amused         0.40   0.56     0.59
cheerful       0.39   0.48     0.49
busy           0.40   0.56     0.67
happy          0.42   0.56     0.61
calm           0.33   0.44     0.48
content        0.28   0.42     0.52
creative       0.36   0.24     0.41*
bored          0.39   0.50     0.53
contemplative  0.30   0.45     0.58
exhausted      0.44   0.30     0.48*

* Moods for which LIWC outperformed PsychoNet 1.

4.1 Discussion

LIWC has been used successfully in various classification/identification tasks where the target classes are objective facts, such as Gender, Age, or Authorship Identification. However, the results of using LIWC in mood classification are poor, as shown in Table 3. This is mainly because the target class (mood) is subjective rather than objective, and may not be accurately provided by the user. It is common for a user to tag a number of posts with different moods even where the contents are, to some extent, similar. Hence, this task is challenging and LIWC features alone are not enough to fulfill it. Previous studies in mood prediction confirm this difficulty, as they utilized various types of features in order to achieve reasonable results (Mishne, 2005; Leshed, 2006).

As demonstrated in the experiment above, using PsychoNet 2 improved the results of mood classification compared to both LIWC and PsychoNet 1. PsychoNet 1 enhanced the result for some moods and raised the F-Measure to 0.50 or above for others. PsychoNet 2, however, improved the result in all moods and raised the F-Measure to over 0.60 for some moods. Furthermore, we can see that LIWC outperformed PsychoNet 1 in some moods (marked with stars in Table 3), but the results confirm that PsychoNet 2 outperformed LIWC in all moods.

5 CONCLUSIONS

In this paper, we presented PsychoNet 2, a substantially contextualized and enriched psycholinguistic commonsense ontology.
The overall main contribution is the creation of one cohesive semantic network and ontology based on the integration of three important text analysis resources, namely ConceptNet, LIWC, and WordNet. This addresses various limitations of PsychoNet 1, including contextualization and missing concepts. The first contribution of this paper is the contextual annotation of nodes. This contextualization annotates each node with the most representative contextual psycholinguistic categories. The second contribution is the enrichment of LIWC using WordNet. This enrichment process added 7772 new words to the LIWC lexicon and associated them with the relevant psycholinguistic categories. Consequently, it enriched PsychoNet 2 with an additional 8663 concepts, which were missing from PsychoNet 1, and improved the annotation of another 56615 concepts.

PsychoNet 2 can be used in many applications in text engineering. We present here one application, mood classification. The results confirm the validity of PsychoNet 2 and show improvements in all moods compared to LIWC and PsychoNet 1.

REFERENCES

Gill, A. (2003). Personality and language: The projection and perception of personality in computer-mediated communication.
Gill, A. J., French, R. M., Gergle, D., and Oberlander, J. (2008). The language of emotion in short blog texts. In Proceedings of the ACM 2008 conference on Computer supported cooperative work, pages 299–302. ACM New York, NY, USA.
Hancock, J. T., Gee, K., Ciaccio, K., and Lin, J. M. H. (2008). I’m sad you’re sad: emotional contagion in CMC. In Proceedings of the ACM 2008 conference on Computer supported cooperative work, pages 295–298. ACM New York, NY, USA.
Hancock, J. T., Landrigan, C., and Silver, C. (2007). Expressing emotion in text-based communication. In Proceedings of the SIGCHI conference on Human factors in computing systems, pages 929–932. ACM New York, NY, USA.
Leshed, G. (2006). Understanding how bloggers feel: recognizing affect in blog posts. In Conference on Human Factors in Computing Systems, pages 1019–1024. ACM New York, NY, USA.
Liu, H. and Singh, P. (2004). ConceptNet: a practical commonsense reasoning tool-kit. BT Technology Journal, 22(4):211–226.
Mairesse, F., Walker, M. A., Mehl, M. R., and Moore, R. K. (2007). Using linguistic cues for the automatic recognition of personality in conversation and text. Journal of Artificial Intelligence Research, 30:457–500.
Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.
Mishne, G. (2005). Experiments with mood classification in blog posts. In Proceedings of ACM SIGIR 2005 Workshop on Stylistic Analysis of Text for Information Access.
Mohtasseb, H. and Ahmed, A. (2009a). Mining online diaries for blogger identification. In The 2009 International Conference of Data Mining and Knowledge Engineering (ICDMKE’09).
Mohtasseb, H. and Ahmed, A. (2009b). More blogging features for author identification. In The 2009 International Conference on Knowledge Discovery (ICKD’09).
Mohtasseb, H. and Ahmed, A. (2010a). The Affects of Demographics Differentiations on Authorship Identification, pages 409–417. Springer.
Mohtasseb, H. and Ahmed, A. (2010b). PsychoNet: a psycholinguistic commonsense ontology. In The International Conference on Knowledge Engineering and Ontology Development KEOD, pages 159–164.
Nowson, S. and Oberlander, J. (2006). The identity of bloggers: Openness and gender in personal weblogs. In Proceedings of the AAAI Spring Symposia on Computational Approaches to Analyzing Weblogs.
Noy, N. F. and Hafner, C. D. (1997). The state of the art in ontology design. AI Magazine, 18(3):53–74.
Pennebaker, J. W., Francis, M. E., and Booth, R. J. (2001). Linguistic inquiry and word count: LIWC 2001. Mahwah: Lawrence Erlbaum Associates.