
Link Mining Using Strength of Frequent Pattern of Interaction

Seema Mishra (1), G C Nandi (2)
[email protected], [email protected]
(1) Research Scholar, (2) Professor
Robotics and AI Lab, CC-1 Building, Indian Institute of Information Technology, Allahabad, Allahabad, India

Abstract: This work addresses the problem of discovering and analyzing social networks, and the links between people who frequently appear together, from surveillance video, where large amounts of video data are collected routinely. A computer vision approach makes it possible to address this problem with video data obtained from a fixed camera. Camera systems should be capable of acquiring high-resolution face images of people under challenging conditions. We perform "opportunistic" face recognition on the captured images. We present a novel approach based on frequent pattern mining to solve this frequent-association problem in social networks. Our approach is illustrated with promising results from a fully integrated camera system.

Keywords: Social network analysis, Link analysis, Link mining, Law enforcement, Homeland security, Closed world environment, Data mining, Frequent pattern mining, Knowledge Discovery, Computer Vision.

1. Introduction

There are several modern means of communication, such as email, chat rooms, weblogs, and telephone [12]. Here we discuss a newly emerging means of observing communication and interaction: computer vision. Computer vision techniques are used extensively in many areas of surveillance, including face recognition [1, 18] and the detection, recognition, and analysis of activities [9, 10] and events. More specifically, we attempt to discover a higher-level understanding of group behavior and of frequent relations within the context of social interaction. The discovered knowledge can be useful in law enforcement, homeland security, organizations, and closed-world environments.
To obtain knowledge about social interaction, we center our problem on two critical aspects. The first is the frequent relation of people appearing together. The second is the dynamism of a group, as its members change over time. In this research we propose a framework for automated frequent link analysis in terms of social network patterns. The framework includes tasks at two levels, low and high. The low-level task integrates the face recognition module, and the high-level task comprises pattern mining for frequent groups. The face recognition module should:
1. Persistently track video sequences and split them into key frames.
2. Detect and recognize faces unambiguously under challenging conditions. For this purpose, only high-resolution images of faces are captured.
The approach we adopted for discovering persons who are frequently grouped together is based on pattern mining [17, 20].

The paper is organized as follows. In Section 2, we describe the social network and its creation. In Section 3, we discuss the proposed approach in the context of social network data. Section 4 covers the implementation of the algorithm and discusses the data sets and the results produced by the fully integrated system. The paper ends with a conclusion in Section 5.

2. Social Network

Social network analysis is used to understand the pattern of interaction caused by social dynamics and events [2, 3, 4, 5, 6, 7]. A social network corresponds to a graph $G = (V, E)$, as in the mathematical literature, where $V$ is the set of nodes and $E$ the set of edges. Each node is an individual in society and possesses a name and some identifying information, such as face images and signatures. Suppose the strength of the social connection between two persons $i$ and $j$ is $S_{ij}$. To establish the strength of the tie between individuals $i$ and $j$, the following assumptions are required:
1. The persons involved in an interaction, $i_1, i_2, \ldots, i_m$, where $m$ is variable, should be positively identified and recognized.
2. The interaction is measured in a group.
3. The frequency with which groups of people are seen in proximity is observed appropriately over a period of time.

The low-level face recognition module computes a vector of probabilities $P = [p_1, p_2, \ldots, p_N]$, where each $p_i$ is the probability that the recognized face corresponds to individual $i$ in the original database of faces, in which the signatures of all individuals are stored. If a new face is detected, it is stored in the database of faces under the next index. The index of a face in the original database becomes the signature of the recognized face. Suppose three persons P, Q, and R are recognized by the system, with probability vectors $p$, $q$, and $r$; then we have $i' = \arg\max_i p_i$, $j' = \arg\max_j q_j$, and $k' = \arg\max_k r_k$.

Figure 1. Grouping of people: groups are obtained by capturing images of persons from fixed camera views. Here 1 indicates that a particular person is involved in a particular group.

Figure 2: System for calculating frequent persons and link mining.

Detected faces appear in groups, and as a consequence we obtain a set of group transactions (see figure 1) $G = \{g_1, \ldots, g_m\}$, where each $g_i \in G$ is a group and $p_1, p_2, \ldots, p_n$ are persons.

3. Frequent persons and their link mining

The proposed model of frequent links is pictured in figure 2. The proposed system is conceptualized in algorithm 1. We have divided the model into several modules: key frame extraction, face recognition, and frequent pattern mining over the recognized faces.

3.1 Key frame extraction

Our problem definition is concerned with capturing video and fragmenting it into frames. This results in a very large database of images, and much of the captured data may be redundant.
Hence, an efficient method is required to extract only those frames of the video that carry the desired information, called representative or key frames. Here, the key frames to be extracted depend on the relevant objects in the visual content, which should consist of a group of people. Let a video sequence starting at time $t$ and comprising $F$ frames be $V_t = \{ f(t+n) \mid n = 0, 1, \ldots, F-1 \}$, and let the set of key frames contain $KF$ frames, $1 \le KF \le F$, given by $kf_t = \{ f_{KF}(t+n_1), f_{KF}(t+n_2), \ldots, f_{KF}(t+n_{KF}) \mid 0 \le n_i < F \}$. Let $I(x, y)$ be the intensity value of the pixel at location $(x, y)$ in an $m \times n$ image. Our key frame selection process reads the image from left to right and calculates the gradient points where the intensity changes rapidly. The method is described in algorithm 2.

3.2 Face Recognition

The face recognition module computes the probabilities of recognition and returns the index of the most likely matching face in the watch list. As a consequence, we obtain the list of indices of faces stored in the watch list, to which we apply the method for computing faces that frequently appear together. Face detection is a prerequisite for any automated system of this kind [18].

3.3 Pattern Mining with Social Network

Frequent social network entities are those that appear in a group more frequently than others. This step analyzes a person's link behavior on the premise of how frequently they appear with other persons. The association rules problem was introduced by R. Agrawal et al. [14, 15, 16]. The problem of discovering all association rules can be decomposed into two subproblems [AGR 94]: First, find all sets of items, called frequent item sets, whose transaction support is above the minimum support.
Second, use the frequent item sets to generate the desired rules.

Let $D$ be the face database, whose transactions $T$ are the recorded groups, and let $F = \{f_1, f_2, \ldots, f_m\}$ be the set of all face images. For the sake of computation, the faces are numbered from 0 to $m-1$. Let $ms$ be the minimum support threshold, with $ms \in (0, 1]$. If $X$ is a nonempty subset of $F$, then $X$ is a set of frequent persons appearing together if and only if

$\frac{|\{T \in D \mid X \subseteq T\}|}{|D|} \ge ms$.

4. Experimentation

To describe the experimental results of the proposed system, we captured video sequences of groups of people, divided them into frames, and generated key frames from these data. The whole process of key frame generation, face detection, face recognition, and pattern mining is done offline to analyze the video data. The remainder of this section presents the results for the various parts of the system.

Key frame extraction

First, we demonstrate the performance of key frame extraction on video captured at the times when students, staff, and research scholars enter the laboratory. The subjects enter and leave constantly, forming groups with different persons at different times. Groups of this form are to be discovered as frequent groups.

Face Detection and Recognition

After key frame extraction, this module detects faces, recognizes them against the watch list of faces, and returns the index of each recognized face. The detected faces from each group are shown in figure 3. Only a few groups are shown, for the sake of understanding.

Figure 3: Detected faces for Group1 through Group6.

Pattern Mining with Social Network

The final result, the frequent persons, is given in figure 4. For verification, we confirmed that these three persons had a strong link because they were doing their master's project together.
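The support criterion of Section 3.3 can be illustrated with a short Python sketch. It assumes that each key frame has already been reduced to the set of watch-list indices recognized in it. The six groups, the function names `support` and `frequent_groups`, and the threshold 0.5 are made up for illustration; this is a generic level-wise (Apriori-style) search, not the authors' implementation or data.

```python
from itertools import combinations

def support(candidate, groups):
    """Fraction of groups (transactions) that contain every face in `candidate`."""
    candidate = frozenset(candidate)
    return sum(candidate <= g for g in groups) / len(groups)

def frequent_groups(groups, min_support):
    """Level-wise search for face sets whose support is >= min_support."""
    groups = [frozenset(g) for g in groups]
    faces = sorted(set().union(*groups))
    frequent = {}
    current = [frozenset([f]) for f in faces]
    while current:
        # Keep only the candidates of this size that meet the threshold.
        kept = {c: support(c, groups) for c in current}
        kept = {c: s for c, s in kept.items() if s >= min_support}
        frequent.update(kept)
        # Join frequent k-sets into (k+1)-sets and keep a candidate only if
        # all of its k-subsets are themselves frequent.
        size = len(next(iter(kept))) + 1 if kept else 0
        candidates = {a | b for a in kept for b in kept if len(a | b) == size}
        current = [c for c in candidates
                   if all(frozenset(s) in kept for s in combinations(c, size - 1))]
    return frequent

# Hypothetical watch-list indices seen together in six key frames.
groups = [{0, 1, 2}, {0, 1, 2, 3}, {1, 2}, {0, 1, 2}, {3, 4}, {0, 2}]
result = frequent_groups(groups, min_support=0.5)
```

With these made-up groups, persons 0, 1 and 2 appear together in three of the six frames, so the set {0, 1, 2} just meets the 0.5 threshold, while persons 3 and 4 are filtered out at the first level.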
5. Conclusion and future work

This paper addressed an emerging problem: discovering and mining frequent links in social networks. A computer vision and pattern mining approach is used to compute frequently linked people, which may have many applications in practical surveillance systems. On this premise, we presented an algorithm that extracts key frames from video sequences in order to identify individuals and analyze their interactions using face recognition. To implement the system, data were collected on video in a closed-world environment. Several challenges were encountered, such as modeling groups and detecting and recognizing faces in a cluttered environment. In the future, a more sophisticated and expressive strategy could be used to overcome this problem, for example by considering people who are close in space with some orientation. Fuzzy concepts could be employed to cluster people and to identify the degree of membership, that is, how strongly a person is associated with another cluster.

References

1. H. Schneiderman and T. Kanade. A statistical method for 3D object detection applied to faces and cars. In IEEE Computer Vision and Pattern Recognition, Hilton Head, SC, volume 1, pages 746–751, 2000.
2. M. E. J. Newman. Detecting community structure in networks. European Physical Journal B, 38: 321–330, 2004.
3. T. Yu, S. Lim, K. Patwardhan, and N. Krahnstoever. Monitoring, recognizing and discovering social networks. In CVPR, 2009.
4. M. E. Newman, "The structure and function of complex networks," SIAM Review, 45(2): 167–256, 2003.
5. T. Berger-Wolf and J. Saia, "A framework for analysis of dynamic social networks," DIMACS Technical Report, vol. 28, 2005.
6. J. Sinai, "Combating terrorism insurgency resolution software," Proc. IEEE Int. Conf. Intelligence and Security Informatics (ISI-2006), pp. 401–406.
7. L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan, "Group formation in large social networks: membership, growth, and evolution," Proc. 12th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2006, pp. 44–54.
8. M. Goldberg, M. Hayvanovych, A. Hoonlor, S. Kelley, M. Magdon-Ismail, K. Mertsalov, B. Szymanski, and W. Wallace, "Discovery, analysis and monitoring of hidden social networks and their evolution," 2008 IEEE Conference on Technologies for Homeland Security, pp. 1–6, 12–13 May 2008.
9. N. Vaswani, A. K. R. Chowdhury, and R. Chellappa. "Activity recognition using the dynamics of the configuration of interacting objects". In CVPR (2), pages 633–642, 2003.
10. P. Turaga, R. Chellappa, V. S. Subrahmanian, and O. Udrea. "Machine recognition of human activities: A survey". IEEE Transactions on Circuits and Systems for Video Technology 18, 11 (Nov), 1473–1488, 2008.
11. M. A. Turk and A. P. Pentland, "Face recognition using eigenfaces," Proceedings CVPR '91, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 586–591, 3–6 Jun 1991.
12. Seema Mishra, Udit Agrawal, and G C Nandi, "CVPD: A tool based on a social network analysis to combating viruses' propagation", 2012 International Conference on Communication, Information & Computing Technology (ICCICT), pp. 1–5, 19–20 Oct. 2012.
13. R. Agrawal and R. Srikant. "Fast algorithms for mining association rules". In J.B. Bocca, M. Jarke, and C. Zaniolo, editors, Proceedings 20th International Conference on Very Large Data Bases, pages 487–499. Morgan Kaufmann, 1994.
14. R. Agrawal and R. Srikant. "Fast algorithms for mining association rules", IBM Research Report RJ9839, IBM Almaden Research Center, San Jose, California, June 1994.
15. Jiawei Han and Micheline Kamber. "Data Mining: Concepts and Techniques". Morgan Kaufmann Publishers, 2000, pp. 225–278.
16. R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases". In: Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington, D.C., 1993, pp. 207–216.
17. M. Hegland. "The apriori algorithm - a tutorial". WSPC/Lecture Notes Series, 9(7), March 2005. http://www2.ims.nus.edu.sg/preprints/2005-29.pdf.
18. Ming-Hsuan Yang, D. J. Kriegman, and N. Ahuja, "Detecting faces in images: a survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 1, pp. 34–58, Jan 2002.
19. H. C. V. Lakshmi and S. PatilKulakarni, "Segmentation Algorithm for Multiple Face Detection for Color Images with Skin Tone Regions," International Conference on Signal Acquisition and Processing (ICSAP '10), pp. 162–166, 9–10 Feb. 2010.
20. J. Han, H. Cheng, D. Xin, and X. Yan, "Frequent pattern mining: current status and future directions", Data Mining and Knowledge Discovery, 10th Anniversary Issue, 2007, pp. 55–86.
21. Guozhu Liu and Junming Zhao, "Key Frame Extraction from MPEG Video Stream", Proceedings of the Second Symposium International Computer Science and Computational Technology (ISCSCT '09), Huangshan, P. R. China, 26–28 Dec. 2009, pp. 007–011.
22. G. Ciocca and R. Schettini, "An innovative algorithm for key frame extraction in video summarization," J. Real-Time Image Process., vol. 1, no. 1, pp. 69–88, 2006.
23. R. R. Schultz and R. L. Stevenson, "Extraction of high-resolution frames from video sequences," IEEE Trans. Image Processing, vol. 5, pp. 996–1011, June 1996.
Hierarchy of Communities in Dynamic Social Network

Seema Mishra (1), G. C. Nandi (2)
(1) Indian Institute of Information Technology, Allahabad
(2) Indian Institute of Information Technology, Allahabad
{[email protected]; [email protected]}

Abstract. Discovering the hierarchy of an organizational structure can unveil significant patterns that help in network analysis. In this paper, we derive the hierarchical structure of an organization by calculating an individual score for each person using the PageRank algorithm. A communication graph is then plotted that shows the power of each individual. Experiments were performed on the Enron data set, and the results were helpful in identifying the primary person over a period of months.

Keywords: Dynamic social network analysis, social network analysis, hierarchical structure.

1 Introduction

A network structure provides a formal way of representing data that emphasizes the associations between entities. This representation is important because it gives insight into the data. Many systems today are interconnected, and the behavior of individuals reflects, to a large extent, the functioning of the whole system. Networks are primarily studied in a mathematical framework, namely graphs [1]. Social network analysis is a growing area of research; it has existed for quite some time and is experiencing a surge in popularity as a way to understand the behavior of users at the individual and group level [Wasserman & Faust, 1994, Wellman, 1996]. Social network methods help analysts reveal hidden patterns in social communication. To model a social network mathematically, the most popular data structure is the graph, where nodes depict individuals, groups of persons, events, organizations, etc., and each link/edge represents a connection/relationship between two individuals.
Social network analysis attempts to understand the network and its components, such as nodes (social entities, commonly known as actors or events) and connections (inter-connections, ties, and links). Its main focus is on individuals and the interdependent relationships among them, rather than on individuals and their attributes as in conventional data structures.

2 Dynamic Network Analysis

The versatile power of social networks is being applied to mining patterns of social interaction in wide-ranging applications, including disease modeling [4], information transmission and behavior analysis [2, 3], and business management and behavior analysis. Network analysis gained prominence through its practical applications in intelligence and surveillance [5] and became a popular paradigm for uncovering anti-social networks, such as criminal, terrorist, and fraud networks, mainly after the tragic events of September 11th, 2001, which shook the whole world. Social interaction can take any form, depending on the type of data available [6]. It might be verbal or written communication (cell phones, emails, blogs, and chat), scientific collaboration (co-authorship networks, citation networks), browsed websites, or groups of animals. The mathematical network model is very successful in the analysis of social networks, but its major drawback is that it may miss the temporal aspect of interaction, because social interaction is inherently dynamic. A static model of interaction can give inaccurate information, and decisions based merely on this information might lead the analyst in the wrong direction. Several shortcomings arise with the static model of social networks that can prevent one from recognizing the causal relationships behind patterns of social interaction [6]:
• What is the rate at which a disease spreads, and who is the central person who should be vaccinated to control its spread within a group of people?
• What are the causes and consequences of the evolution of social structure?

Dynamic social network analysis is an emerging research area that plays a crucial role in filling the gap between traditional social network analysis and the time domain. The dynamic study of networks draws on classical network analysis, link analysis, and multi-agent systems. Dynamic network analysis facilitates the analysis of multiple types of nodes (multi-mode) and multiple types of links (multi-plex) simultaneously. In contrast, static network analysis can handle at most two-mode data and analyzes one type of link at a time. A dynamic network has several characteristics:
• The nodes are dynamic: their properties change over time.
• It deals with meta-networks.
• Network evolution is a consequence of agent-based modeling.

3.1 Community Detection

A community is a subpart of the whole network within which interaction is relatively more frequent and stronger than interaction with the rest of the network; that is, intra-community interaction is denser than inter-community interaction. It can take any form, for example a group, subgroup, or cluster. A community may represent, for example, a) related papers on a single topic in a citation network, or b) web pages on related topics. Community detection is a classical problem in social network analysis. Commonly, it can be posed as the problem of identifying sub-graphs of the original graph, called vertex sparsifiers [12]. These small networks retain the relevant information of the original graph. Four levels of analysis are conducted in community identification.

3.2 Analysis of Previous Work

Hierarchical methods for community detection fall into two categories: agglomerative and divisive. In the former, each node is initially assumed to be a community, and communities are repeatedly grouped together. In the latter, the whole network is initially considered one community and is subsequently divided into smaller ones.
Most methods are based on graph clustering and partitioning. Distance-based structural equivalence [7] uses distance metrics to identify similar entities. Several algorithms have been proposed among graph partitioning methods [8, 9]. The Newman-Girvan method and spectral clustering methods [10, 11] use the notion of modularity and the edge betweenness metric to divide the network into groups. For analyzing dynamic patterns, many methods use temporal snapshots of interaction over time [13, 14].

3.3 Discovering Hierarchy of group

Before analyzing community hierarchy, we define some basic terminology. The hierarchy of a community gives the power of each individual in a group. If we know this chain, we can find the leader of the group. For this we used the well-known PageRank algorithm, which calculates an individual score I_score for each person to represent that person's importance. The higher a person's score, the more powerful the person is.

Definition: Let $P = \{p_1, p_2, \ldots, p_k\}$ be the collection of persons involved in a communication. For any members $p_i$ and $p_j$, if I_score($p_i$) $>$ I_score($p_j$), then $p_i$ is more powerful than $p_j$.

A. Data Set

We performed the experiments on the Enron data set. The data cover the email communication of about 150 users, mostly senior managers, organized into folders; nodes are people and edges are email communications between them. This data set still has many issues with integrity and duplicate messages. We preprocessed the data into socio-matrices for 12 months, from January to December, and finally drew a graph of interaction over the months.

B. Experiments and Analysis

In this section, we evaluate the capability of the proposed approach to discover organizational structure and to explore the evolution of organizational structure in a dynamic social network. The approach is implemented in DEV C++. The experiments were conducted on a 2.1 GHz PC with a Core(TM)2 Dual-Core Pentium 4 processor and 2 GB RAM.
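To make the I_score ranking of Section 3.3 concrete, the sketch below runs a plain power-iteration PageRank over one month's socio-matrix. The four-person matrix, the damping factor 0.85, the use of email counts as edge weights, and the function name `i_scores` are illustrative assumptions; the DEV C++ implementation and the Enron matrices used in the experiments are not reproduced here.

```python
def i_scores(socio_matrix, damping=0.85, iterations=100):
    """Power-iteration PageRank; socio_matrix[i][j] = emails sent from i to j."""
    n = len(socio_matrix)
    scores = [1.0 / n] * n
    out_totals = [sum(row) for row in socio_matrix]
    for _ in range(iterations):
        new = [(1.0 - damping) / n] * n
        for i, row in enumerate(socio_matrix):
            if out_totals[i] == 0:
                # A person who sent nothing spreads their score evenly.
                for j in range(n):
                    new[j] += damping * scores[i] / n
            else:
                # Otherwise the score is shared in proportion to emails sent.
                for j, weight in enumerate(row):
                    new[j] += damping * scores[i] * weight / out_totals[i]
        scores = new
    return scores

# Hypothetical month: persons 1, 2 and 3 all mail person 0,
# and person 0 replies only to person 1.
month = [
    [0, 2, 0, 0],
    [3, 0, 0, 0],
    [1, 0, 0, 0],
    [4, 0, 0, 0],
]
scores = i_scores(month)
# Persons ordered from most to least powerful according to I_score.
ranking = sorted(range(len(scores)), key=lambda p: scores[p], reverse=True)
```

In this toy month person 0 receives mail from everyone and so comes first in `ranking`, followed by person 1. Repeating the computation on each monthly matrix would give one ranking per month, which is the kind of evolution the experiments track.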
On examining the results in figure 1, we found that the number of groups was highest in the month of May. Figure 2 shows that Jim headed the group from Jan to Feb. Monique was the leader from Jan to April.

Figure 1. Number of communities over months

Figure 2. Evolution of community from months Jan to Dec

5 Conclusion

In this paper we introduced the concept of a hierarchy of positions in a group, using temporal interaction data from 12 months in an organization, which shows how the positions of group members change as people join and leave the group. In the future we plan to address the integrity issues of the preprocessed Enron data in order to improve the experimental results.

6 References

[1] Hanneman, Robert A. and Mark Riddle. Introduction to social network methods. Riverside, CA: University of California (published in digital form at http://faculty.ucr.edu/~hanneman/), 2005.
[2] J. Baumes, M. Goldberg, M. Magdon-Ismail, and W. Wallace. Discovering hidden Inform., 2004.
[3] J. Tyler, D. Wilkinson, and B. Huberman. Email as spectroscopy: Automated discovery of community structure within organizations. Proc. 1st Intl. Conf. on Comm. and Tech., 2003.
[4] M. Kretzschmar and M. Morris. Measures of concurrency in networks and the spread of infectious disease. Math. Biosci. 133:165–195, 1996.
[5] J. Baumes, M. Goldberg, M. Magdon-Ismail, and W. Wallace. Discovering hidden groups in communication networks. Proc. 2nd NSF/NIJ Symp. on Intel. and Security Inform., 2004.
[6] T. Y. Berger-Wolf and J. Saia. A framework for analysis of dynamic social networks. In Proc. KDD'06, 523–528, 2006.
[7] Santo Fortunato and Claudio Castellano. Community Structure in Graphs, chapter of Springer's Encyclopedia of Complexity and System Science (2008).
[8] C. Chekuri, A. Goldberg, D. Karger, M. Levin and C. Stein. Experimental study of minimum cut algorithms. In Proc. 8th SIAM Symposium on Discrete Algorithms, pp. 324–333, 1997.
[9] Andrew Y. Wu, et al. Mining scale-free networks using geodesic clustering. In Proc. KDD'04, pages 719–724, 2004.
[10] M.E.J. Newman and M. Girvan. Finding and evaluating community structure in networks. Phys. Rev. E 69, 026113 (2004).
[11] M.E.J. Newman. Modularity and community structure in networks. PNAS June 6, 2006, vol. 103, no. 23, 8577–8582.
[12] A. Moitra. Approximation algorithms for multicommodity-type problems with guarantees independent of the graph size. FOCS, pages 3–12, 2009.
[13] Ding Zhou, Isaac Councill, Hongyuan Zha, C. Lee Giles. Discovering Temporal Communities from Social Network Documents. In Proc. of ICDM'07, pages 745–750, 2007.
[14] C. Tantipathananandh, Tanya Berger-Wolf, David Kempe. A Framework For Community Identification in Dynamic Social Networks. In Proc. of KDD'07, pages 717–726, 2007.
International Journal of Enhanced Research in Science Technology & Engineering, ISSN: 2319-7463, Vol. 2 Issue 6, June-2013, pp: (71-75), Available online at: www.erpublications.com

Discovery of frequent subgroup using data mining

Seema Mishra
Robotics and AI Lab, Indian Institute of Information Technology, Allahabad, India

Abstract: Social network analysis is an emerging technology that focuses on patterns of interaction/relation among people and organizations. A frequent subgroup is a subpart of a social network that reveals hidden knowledge and reflects the behavior of the entire social network. These subgroups are collections of nodes that share common characteristics and are densely connected with each other. In this paper, a novel approach to discovering frequent subgroups is presented, inspired by a well-known algorithm in the domain of association rule mining, the Continuous Association Rule Mining Algorithm (CARMA).

Keywords: Subgroup detection, Social network analysis, Dynamic network analysis.

I. INTRODUCTION

Social network analysis has existed for quite some time and is experiencing a surge in popularity as a way to understand the behavior of users, represented as nodes in a network. To model a social network, the most popular data structure is the graph, where nodes depict individuals, groups of persons, events, organizations, etc., and each link/edge represents a connection/relationship between two individuals [7, 8]. Social network analysis attempts to understand the network and its components, such as nodes (social entities, commonly known as actors or events) and connections (inter-connections, ties, and links). Its main focus is on individuals and the relationships among them, rather than on individuals and their attributes as in conventional data structures.
Social network analysis has long existed, but it is now used extensively to analyze the structure of, and the connections between, the various actors within an organization. The ability to detect community structure in a network could have practical applications. Communities in a network might represent real social groupings that interact often, perhaps by interest or background; communities in a citation network might represent related papers on a single topic; communities in a metabolic network might represent cycles and other functional groupings; communities on the web might represent pages on related topics; hidden communities might represent potentially suspicious activity. Being able to identify these communities could help us understand and exploit these networks more effectively. Communities of practice are the collaboration groups that naturally grow and coalesce within any kind of network. Any institution that provides opportunities for communication or interaction among its members is eventually threaded by communities that have similar goals and a shared understanding of their activities. These communities have been the subject of much research as a way to uncover the structure and interaction patterns within a network, in order to understand the collective behavior of the network from the individuals that constitute it. Recent research on these networks has focused on using a social network perspective to analyze them. A social network consists of both a set of actors, who may be arbitrary entities like persons or organizations, and one or more types of relations between them, such as information exchange or economic relationships. Subgroup detection aims at clustering the nodes of a graph into subgroups that share common characteristics. To some extent, subgraph discovery does the same job of finding interesting or common patterns in a graph.
One of the most common interests of social network analysis is the substructures that may be present in the network. Subgroups are subsets of actors among whom there are relatively strong, direct, intense, frequent, or positive ties. From the idea of subgroups within a network, we can understand social structure and the embeddedness of individuals. Finding frequent groups in a graph database can be modeled in (a) a graph transaction setting or (b) a single graph setting. The graph transaction setting takes as input relatively small graphs of user interaction, whereas the single graph setting deals with one large graph of the users' interactions [M. Kuramochi and G. Karypis, 2004]. The approach we adopted for discovering frequent subgroups is based on the continuous association rule mining algorithm [15]. We adopt this approach because a group of people is not static; it changes over time as members join and leave the group.

To model the interaction of users, we use an undirected graph where each vertex represents a user and each edge represents a relation between two users. With such a graph representation, the problem of finding frequent patterns becomes that of discovering subgroups that occur frequently enough over the entire set of graphs. The overall group of people is represented in figure 1.

Figure 1: Group of people

For example, subgroups are represented in figure 2.

Figure 2: Two subgroups

II. OVERVIEW OF THE PAPER

The paper is organized as follows. In Section 3, we present related work on subgroup detection in social networks. In Section 4, the CARMA algorithm is discussed in the context of social network data.
More precisely, it is discussed in terms of association/interaction among users. Section 5 covers the implementation of the algorithm and discusses the data sets and the results it produced. The paper ends with a conclusion in Section 6.

III. RELATED WORKS

A network is structurally composed of nodes and of edges that indicate relationships. One special kind of network is the social network, which has been studied for a long time [2, 3, 4, 5]. Modeling complex data sets as graphs has been recognized as a powerful tool in various research domains, such as chemistry [9, 10], computer vision [11], image and object retrieval [12], and machine learning [13]. In particular, Dehaspe et al. [10] applied Inductive Logic Programming (ILP) to obtain frequent patterns in the toxicology evaluation problem [14].

IV. OVERVIEW OF ASSOCIATION RULE MINING IN THE CONTEXT OF CARMA

To search for groups of people who communicate frequently, we followed the graph transaction setting. As in CARMA, the algorithm has two phases. Phase 1 constructs a lattice of large groups of people interacting/communicating with each other. For each group set $g \in G$, three variables are maintained:
count(g): the number of times group g has occurred in the communication since it was inserted into the lattice.
firstTrans(g): the index of the communication pattern at which g was inserted into the lattice.
maxMissed(g): an upper bound on the number of occurrences of g before g was inserted into the lattice.

V. DATA SETS AND IMPLEMENTATION

This section illustrates the approach. We considered a synthetic data set of user groups with the following attributes for each user.
• age
• location
• sex
• education
• marital status
• interests

The group interactions are prepared in a format that suits the experimental analysis.

Pattern of interaction:
g1: (user_1, user_2, user_3, user_4, user_10, user_11, user_12, user_13)
g2: (user_22, user_23, user_24, user_25, user_17, user_18, user_19, user_9)
g3: (user_22, user_23, user_24, user_25, user_17, user_18, user_19, user_9)
g4: (user_1, user_2, user_3, user_4, user_22, user_23, user_24, user_25)
g5: (user_1, user_2, user_3, user_4, user_13, user_14, user_24, user_25)
g6: (user_1, user_2, user_3, user_4, user_13, user_14, user_24, user_25)
g7: (user_1, user_2, user_3, user_4, user_22, user_23, user_24, user_25)
g8: (user_3, user_4, user_5, user_22, user_23, user_24, user_25, user_10)
g9: (user_13, user_14, user_15, user_22, user_23, user_24, user_25, user_20)
g10: (user_12, user_13, user_14, user_5, user_6, user_7, user_8, user_9)
g11: (user_12, user_13, user_14, user_15, user_24, user_25, user_1, user_2)
g12: (user_22, user_23, user_24, user_25, user_8, user_12, user_13, user_14)
g13: (user_1, user_2, user_3, user_11, user_12, user_13, user_24, user_25)
g14: (user_1, user_2, user_20, user_21, user_22, user_23, user_24, user_25)
g15: (user_8, user_9, user_10, user_11, user_12, user_13, user_23, user_24)
g16: (user_10, user_11, user_12, user_13, user_22, user_23, user_24, user_25)
g17: (user_10, user_11, user_12, user_13, user_22, user_23, user_24, user_25)
g18: (user_22, user_23, user_24, user_25, user_16, user_17, user_18, user_19)
g19: (user_22, user_23, user_24, user_25, user_7, user_18, user_21, user_11)
g20: (user_22, user_23, user_24, user_25, user_17, user_18, user_19, user_9)
g21: (user_22, user_23, user_24, user_25, user_5, user_6, user_7, user_8)
g22: (user_1, user_2, user_3, user_4, user_22, user_23, user_24, user_25)
g23: (user_1, user_2, user_3, user_4, user_13, user_14, user_24, user_25)
g24: (user_1, user_2, user_3, user_4, user_13, user_14, user_24, user_25)
g25: (user_1, user_2, user_3, user_4, user_22, user_23, user_24, user_25)

The support sequence is supplied as
$\sigma$ = [0 1 2 2 3 3 3 3 3 4 4 4 4 5 5 5 5 5 5 6 6 6]

The group of people communicating frequently, as returned by the algorithm, is
(user_1, user_2, user_3, user_4, user_10, user_11, user_12, user_13)

Each user is linked with some communities of their own interest; on that basis we formed groups of users having specific values for each attribute. The interaction of groups is defined as the pattern of communication in different groups $g_1, g_2, g_3, \dots, g_n$, and the support lattice of user groups is G.

For convenience and ease of explanation of how the algorithm works, we extracted a small part of the user pattern of interaction, $P = \{g_1, g_2, g_3\}$, where the people involved in these groups are:
$g_1 = \{u_1, u_2, u_4\}$
$g_2 = \{u_1, u_2, u_3\}$
$g_3 = \{u_1, u_2\}$

Throughout the implementation (shown in Figure 3), G is initialized to {} with corresponding integers (0, 0, 0), and for ease of calculation the support sequence is $\sigma = (0.3, 0.9, 0.7)$. Since maxSupport({}) $= 1 \ge \sigma_1$ (0.3), all singleton users are added and their associated integers are set to the updated values (0, 1, 1). User sets are not pruned in the first scan. Next, $g_2$ is scanned. Since $\{u_1, u_2\} \subseteq g_2$ and $\{u_1\}$, $\{u_2\}$ are already in G with maxSupport $= 1$, which is greater than $\sigma_2$, $\{u_1, u_2\}$ is inserted into G with updated associated integers.

Figure 3: CARMA implementation of lattice of group (node annotations: (maxMissed, firstTrans, count) and [minSupport, maxSupport])

VI. CONCLUSION
In this paper, we presented a frequent group detection algorithm based on CARMA. It is capable of handling the dynamism of the network as members join or leave a group. We carried out the implementation on synthetic data of user interaction.
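To make concrete what a "frequent group" means in the reduced example $P = \{g_1, g_2, g_3\}$ of Section V, the following minimal sketch counts the exact support of every user set by brute force. It is not CARMA: it only illustrates the quantity that CARMA bounds incrementally through count, firstTrans and maxMissed. The function names and the threshold of 2/3 are assumptions chosen for illustration.

```python
from itertools import combinations


def support(group, patterns):
    """Fraction of interaction patterns that contain every user in `group`."""
    group = frozenset(group)
    return sum(group <= p for p in patterns) / len(patterns)


def frequent_groups(patterns, min_support):
    """List every user set whose support is at least `min_support`.

    Exhaustive enumeration, for illustration only; CARMA avoids this by
    growing a lattice while it scans the patterns.
    """
    users = sorted(set().union(*patterns))
    result = {}
    for size in range(1, len(users) + 1):
        for combo in combinations(users, size):
            s = support(combo, patterns)
            if s >= min_support:
                result[frozenset(combo)] = s
    return result


# The reduced example P = {g1, g2, g3} from the text.
P = [
    frozenset({"u1", "u2", "u4"}),
    frozenset({"u1", "u2", "u3"}),
    frozenset({"u1", "u2"}),
]

# Hypothetical threshold: a group must appear in at least 2 of the 3 patterns.
freq = frequent_groups(P, min_support=2 / 3)
for g, s in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(g), round(s, 2))
```

Under this assumed threshold only {u1}, {u2} and {u1, u2} qualify, which matches the walk-through above, where {u1, u2} is the first pair to enter the lattice.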
VII. ACKNOWLEDGEMENT
We would like to thank Mr. Lokesh Kumar and Mr. Naveen, B. Tech students of RGIIT Campus, Amethi, for their efforts in implementing the algorithm. The Organizational Risk Analyzer (ORA) tool, available through the Center for Computational Analysis of Social and Organizational Systems (CASOS), http://www.casos.cmu.edu, was used to draw the graphs.

REFERENCES
[1] M. Kuramochi and G. Karypis. Finding frequent patterns in a large sparse graph. SDM, 2004.
[2] Newman, M. E. J. Detecting community structure in networks. European Physical Journal B 38: 321-330, 2004.
[3] Newman, M. E. J. Fast algorithm for detecting community structure in networks. Physical Review E 69: 066133, 2004.
[4] Luo, J. Social network analysis. Social Science Academic Press (in Chinese), 2004.
[5] Wasserman, S. and K. Faust. Social network analysis: Methods and applications. New York, Cambridge University Press, 1994.
[6] Girvan, M. and M. E. J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences of the United States of America 99: 7821–7826, 2002.
[7] Hanneman, Robert A. and Mark Riddle. Introduction to social network methods. Riverside, CA: University of California (published in digital form at http://faculty.ucr.edu/~hanneman/), 2005.
[8] H. Goto, Y. Hasegawa, and M. Tanaka, "Efficient Scheduling Focusing on the Duality of MPL Representatives," Proc. IEEE Symp. Computational Intelligence in Scheduling (SCIS 07), IEEE Press, Dec. 2007, pp. 57-64, doi:10.1109/SCIS.2007.357670.
Assorting Unlabeled Data in Machine Learning: Active Cluster Learning
Seema Mishra, G C Nandi
[email protected], [email protected]
Robotics and AI lab, Indian Institute of Information Technology, Allahabad

Abstract: Considerable research and practical work is being done on machine learning algorithms for training learning models. The basic resources required to train a model are processing time and memory for storing data. Another crucial resource is the amount of training data from which learning is attempted. A smaller amount of data does not necessarily lead to higher accuracy and precision in the model; moreover, a very large amount of training data increases complexity [6, 7]. Labeling such a large amount of data is computationally complex and expensive [3]. We therefore present a new approach that divides the training data set into distinguishable subsets of data with similar attributes, then selects data points and asks an oracle/expert for the labels to be associated with them.

Keywords: Machine learning, Active Learning, Semisupervised Learning, Active cluster learning.

I. INTRODUCTION
Learning is the cognitive process through which human beings develop the skills needed to deal with the many situations encountered in life. The learning phenomenon varies across contexts. In the same way, machine learning is the way to enable computers to emulate this process automatically. Efforts are mainly focused on inductive learning, which refers to observing phenomena and generalizing from them. Here, inductive learning is concerned with classification. Learning is shaped and constrained by the information that the learning algorithm can obtain. Information is represented as data points in n-dimensional Euclidean space. Most often, information is presented to the learner as examples, but it can also be provided in other forms such as relations, constraints, functions and even models.
Learning can be broadly divided into two categories: supervised and unsupervised learning. Supervised learning relies on access to labeled data, whereas unsupervised learning does not use labeled data [1]. We should therefore pay attention to the time and space efficiency of such algorithms. We should also pay close attention to another resource: the optimum size of the training data required by the learning algorithm to approximately achieve the expected accuracy. This is also discussed in the book Machine Learning by Tom M. Mitchell, from the perspective of issues in machine learning [2].

The source of learning is past data or experience together with its associated information, broadly speaking the training data. In real-life applications unlabeled data is easily accessible, but associating unlabeled data with its corresponding information is expensive, time consuming and, most importantly, requires human expertise. In speech recognition, labeling speech utterances requires a human expert to analyze every signal and divide it into segments of phonemes. This takes approximately ten times longer than the original audio [3].

It is well understood that learning with a large amount of data considerably increases computational complexity and, as a result, the training time of the learner. Based on the inductive learning hypothesis, a great deal of studies provide evidence that it is not appropriate to introduce a very large training set. Analysis of learning curves argues that overfitting may occur if the machine is given very large data to learn from [6, 7].
Hence, an extensive body of literature suggests that a very small amount of training data can be used intelligently to achieve the desired level of performance of the learning algorithm at a lower monetary cost of labeling unlabeled data. This issue is the prime direction of this paper and motivates a data selection technique that reduces training time complexity and training data complexity. The selection technique is expected to be scalable, so that it performs well, and inexpensively, even when given a large data set.

II. RELATED WORKS
A great deal of literature is available on reducing the cost of labeling training data and on using unlabeled data efficiently. Labeling the data is an iterative process that maintains three subsets. Let D be the entire data set; at each iteration i:
1. $D_{L,i}$: the data points associated with their corresponding labels, i.e. $D_{L,i} \subseteq D$.
2. $D_{U,i}$: the data points with no label, i.e. $D_{U,i} = D - D_{L,i}$.
3. $D_{C,i}$: the interesting/informative data points preferred to be labeled; more formally, $D_{C,i} \subseteq D_{U,i}$.

Two different approaches are used to optimize the size and labeling of training data.

A. Semi Supervised Learning
Semi-supervised learning is widely known for its ability to make use of both unlabeled data and expensive labeled data [8]. The systematic plan of action for semi-supervised learning is given in Figure 1.

B. Active Learning
Active learning is a strategy that gives the learner an interactive choice, so that it can query for the labels of the most informative unlabeled data. The primary idea is to seek the information that is sufficient for the learner to learn the model rather than to use all available data, which would increase computational complexity. In this way, a smaller, high-quality training set gives approximately the same accuracy as would be attained with a large amount of randomly selected data. The systematic plan of action for active learning is given in Figure 2.

III. OUR PROPOSED DATA SELECTION APPROACH
The method we suggest is grounded on the presumption that data points residing in the same cluster have the same label. The task involves two steps: first the given large training data set is partitioned into distinguishable clusters/subsets; then the most informative data are selected within a domain that has become narrower and easier to search.

We assume that the data is distributed in n-dimensional Euclidean space, but for convenience we show a two-dimensional data distribution in Figure 3. The idea of clustering the data into homogeneous groups is illustrated in Figure 4; we call this method Active cluster learning. A more specific idea is given in Figure 5: the original data is partitioned into homogeneous groups with almost similar features (only non-overlapping clusters are considered).

Algorithm: The general algorithm may be expressed as follows:
Input: unlabeled data points $D = \{x_1, x_2, \dots, x_n\}$
Output: reduced set of data to be labeled, i.e. k points
Step 1 // Initialize the size of the training data TD
    $|TD| = 0$
Step 2 // Partition the data set into distinguishable subsets
    $D = C_1 \cup C_2 \cup \dots \cup C_k$ such that $C_i \cap C_j = \emptyset$ for $i \ne j$
Step 3 // Randomly select an element of each distinct set to be labeled
    $x_{i,j} \in C_i$, where i = 1 to k and j indexes the elements within each cluster
Step 4 // Ask for the label of the data selected from the cluster
    $\{x_{i,j}, ?\}$
Step 5 // Add the newly labeled data to the training data set
    $TD \leftarrow TD \cup \{x_{i,j}\}$
    $|TD| \leftarrow |TD| + 1$

Remarks:
Computational complexity: The proposed approach has low computational complexity. With the naive method, in which data points are chosen at random, O(n) training data are required; the suggested method reduces this to some extent. Suppose k clusters are generated. Then it takes O(k) time to pick one element from each cluster to be labeled.

VI. CONCLUSION
In this paper, we give a purely theoretical overview of the possibility of grouping a large amount of unlabeled data into distinct groups with similar features, so that the search effort is reduced because elements belonging to the same cluster are likely to have the same label. This concept can be used in data mining processes, which are a part of machine learning, where a large amount of data is available.

References
[1] Nathalie Japkowicz and Mohak Shah. Evaluating Learning Algorithms: A Classification Perspective. Cambridge University Press, March 2011.
[2] Mitchell, T. M. Machine Learning. McGraw Hill, 1997.
[3] Burr Settles. Active Learning Literature Survey. Computer Sciences Technical Report 1648, V
[4] Michalski, R. S., Bratko, I., & Kubat, M. (Eds.). Machine Learning and Data Mining: Methods and Applications. New York, NY: Wiley, 1998.
[5] Yi Wu, Rong Zhang, Alexander Rudnicky. Data Selection for Speech Recognition. Language Technologies Institute, Carnegie Mellon University. Proceedings of ASRU, Kyoto, Japan, December 2007.
[6] P. E. N. Lutu, A. P. Engelbrecht. A Comparative Study of Sample Selection Methods for Classification. Department of Informatics, University of Pretoria, South Africa. South African Computer Journal, Vol. 36, pp. 69-85, 2006.
[7] Dietterich, T. Overfitting and undercomputing in machine learning. ACM Computing Surveys, Vol. 27, No. 3, pp. 326-327, 1995.
[8] Seeger, M. Learning with labeled and unlabeled data (Technical Report). Institute for Adaptive and Neural Computation, University of Edinburgh, Edinburgh, United Kingdom, pp. 609-616, 2001.
[9] Ninan Sajeeth Philip. What is there in a Training sample? In Proceedings of IEEE conference, pages 1507-1511, March 2009.
Ines Rehbein, Josef Ruppenhofer. There's no Data like more Data? Revisiting the impact of Data Size on a Classification Task. Saarland University, Germany, October 2010.
[10] Buntine, W. Learning classification trees. Statistics and Computing, 1992.
[11] Hand, D. J. Construction and assessment of classification rules. Chichester, England: John Wiley & Sons, 1997.
[12] Shavlik, J. W. & Dietterich, T. G. (Eds.). Readings in machine learning. San Mateo, CA: Morgan Kaufmann, 1990.
[13] Spath, H. Cluster Analysis Algorithms for Data Reduction and Classification. Ellis Horwood, Chichester, UK, 1980.
[14] Blaž Novak. Use of Unlabeled Data for Supervised Machine Learning. Department of Knowledge Technologies, Jozef Stefan Institute.
[15] Vapnik, V. Statistical Learning Theory. John Wiley, New York, 1998.
[16] Olivier Bousquet, Stéphane Boucheron, and Gábor Lugosi. Introduction to Statistical Learning Theory.
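The Active cluster learning procedure of Section III (partition the data, pick one member per cluster, ask the oracle, add the answer to the training set) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the paper does not name a clustering method, so a small k-means with farthest-point initialization stands in for Step 2, and the function names, the toy data and the `oracle` callback are all hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def partition(points, k, iters=10):
    """Step 2 stand-in (assumption): plain k-means, farthest-point initialization.

    Returns k disjoint lists of indices into `points`.
    """
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for idx, p in enumerate(points):
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(idx)
        for c, members in enumerate(clusters):
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(
                    sum(points[m][d] for m in members) / len(members)
                    for d in range(len(points[0]))
                )
    return clusters


def active_cluster_learning(points, k, oracle, seed=0):
    """Query one label per cluster and spread it to the rest of the cluster."""
    rng = random.Random(seed)
    training = []                              # Step 1: TD is empty, |TD| = 0
    predicted = [None] * len(points)
    for members in partition(points, k):       # Step 2: disjoint clusters
        if not members:
            continue
        j = rng.choice(members)                # Step 3: random element of the cluster
        label = oracle(points[j])              # Step 4: ask the oracle/expert
        training.append((points[j], label))    # Step 5: TD grows by one
        for m in members:                      # presumption: same cluster, same label
            predicted[m] = label
    return training, predicted


# Toy data: two well-separated groups in the plane.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
queries = []


def oracle(p):
    """Hypothetical expert: records each query and answers by position."""
    queries.append(p)
    return "A" if p[0] < 5 else "B"


training, predicted = active_cluster_learning(points, k=2, oracle=oracle)
print(len(queries), predicted)
```

With k clusters the oracle is consulted k times rather than once per data point, which is the saving the Remarks describe; the accuracy of the propagated labels depends entirely on how well the clusters match the true classes.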
2012 International Conference on Communication, Information & Computing Technology (ICCICT), Oct. 19-20, Mumbai, India

CVPD: A Tool Based on a Social Network Analysis to Combating Viruses Propagation
Seema Mishra, Udit Agrawal and G C Nandi
Robotics & Artificial Intelligence Lab, Indian Institute of Information Technology, Allahabad
[email protected], [email protected] and [email protected]

Abstract: Social network analysis is gaining applicability in several areas such as business, marketing, biology, disease modeling, and anti-terrorism. In this paper, we discuss its practical application in the domain of computer networks to identify the distribution of computer viruses flowing through a network. To the best of our knowledge this is a novel idea; it is based on the gSpan (Graph based substructure Pattern Mining) algorithm for identifying frequent patterns of viruses flowing in a particular region of connected nodes. This enables an analyst to deal with the problem and deploy more efficient antivirus software in that region of nodes.

Keywords: CVPD algorithm; Social network analysis; Computer virus propagation.

I. INTRODUCTION
Social network analysis has been in existence for a long time and is experiencing a surge in popularity in the arena of information technology. To visualize and represent a social network, graph theory plays a significant role and provides substantial insight into the data. Formally, a social network can be represented as a graph G = <V, E>, where V is a set of nodes representing entities such as people or organizations and E represents the relationships between these entities [1, 2]. Social network analysis aims to examine the pattern of social links and to bring out hidden knowledge about the interconnections among entities. Hence social network analysis is the study of social networks to understand their structure and behavior.
Mark Newman described that real-world networks can take any form: social networks, biological networks, technological networks and information networks [9].

II. TYPES OF SOCIAL NETWORK
A social network can be modeled with one type of entity, such as all people in a group or all organizations in a group. It can also have two types of entities, such as people and the organizations they belong to [1]. After the work of Wellman (2001), however, the concept is no longer restricted to actor-level entities: computer networks can also be treated as social networks that link people, organizations and knowledge [5].

III. RELATED WORK & MOTIVATION
Modeling and analyzing networks has been a major and essential issue in various areas such as biological science, document retrieval and many more. By connecting nodes we can reason about the relationships between the components of a complex system. Community detection has since become a classical problem in social network analysis [1, 2, 8]. In graph-theoretic terms, community detection refers to discovering frequent subgraphs occurring in the entire graph. Frequent subgraphs reveal the prominent relationships between connected points and lead to significant information and eventually to knowledge of the complete network. Entities were long characterized as actors or organizations, but after the work of Wellman computer nodes connected in a network were also considered actors. Finding frequent groups in a graph database can be modeled in (a) a graph transaction setting or (b) a single graph setting. The graph transaction setting takes as input relatively small graphs of user interaction, whereas the single graph setting deals with a large graph of the users' interactions [M. Kuramochi and G. Karypis, 2004]. This motivates the new idea of detecting frequent virus propagation patterns in the relevant region of computer nodes in a network.
The detected pattern can help an analyst combat the damage caused by viruses in the entire network, and more efficacious antivirus software can be installed on the affected nodes.

IV. PROBLEM DEFINITION
In our problem context, there is a data set of graphs G; each transaction T ∈ G is labeled or colored, and its edges and vertices are also labeled or colored. Given a minimum support σ, we need to detect all frequently occurring connected subgraphs. These graphs show the flow of viruses whose harmful effects the antivirus software installed on the computers is not able to resolve. The notation used in this paper is described in Table 1.

Formal Model: For the purpose of this paper, we assume the input is a labeled graph in the form of a partition of several computer networks, together with the anti-viruses in increasing hierarchy. Let G = (T_1, T_2, …, T_m) and AV = (AV_1, AV_2, …, AV_n); a group is defined to be a subset T ∈ G. Here, T = (V, E, L, l), where V is a set of computer nodes CN, E ⊆ V × V is the set of links between nodes, L is a set of labels, i.e. viruses V, and l: V ∪ E → L is the function assigning labels to the vertices and edges. Here a label l denotes a virus. We denote by k FSG ⊆ G the k frequent subsets of the original graph. The working is illustrated in Figure 1.

978-1-4577-2078-9/12/$26.00©2011 IEEE

Table 1: Notation
Notation    Meaning
G           A set of graph transactions
T           Transaction of graph in G
Av          Set of antivirus installed
V           Set of viruses
CN          Computer nodes

V. OUR APPROACH
The algorithm is described in Figure 2.

Algorithm 1: CVPD
Input: g
Output: fsg

    #include <cstdlib>   // system()
    // Assumed: the project header declaring MainWindow and its widgets
    // (exitB, l1, input_filenameL) is included above.

    MainWindow::MainWindow(QWidget *parent) : QMainWindow(parent)
    {
        setupUi(this);
        // Quit the application when the exit button is clicked.
        connect(exitB, SIGNAL(clicked()), qApp, SLOT(quit()));
    }

    // Runs gSpan on the chosen input file, converts its output,
    // and opens the folders holding the input and output graphs.
    void MainWindow::method()
    {
        QString min_sup = l1->text();            // minimum support entered by the user
        QString fn = input_filenameL->text();    // name of the input file
        QString cmd_algo = "gSpan_algo.exe " + min_sup + " input/" + fn + " output.gspan";
        // The source passes an undefined cmd_in_grph here; cmd_algo is assumed to be intended.
        QByteArray ba1 = cmd_algo.toLocal8Bit();
        system(ba1.data());

        QString cmd_out_grph = "convert_output.exe output.gspan";
        QByteArray ba2 = cmd_out_grph.toLocal8Bit();
        system(ba2.data());

        system("start output_graphs");
        system("start input_graphs");
    }

    // Slot for the "Go" button.
    void MainWindow::on_goB_clicked()
    {
        method();
    }

VI. EXPERIMENTAL ANALYSIS AND RESULTS
The number of graphs in the experimental analysis is more than 100, so due to space constraints we show only a few of them. The input is a set of graphs describing the distribution of viruses. In this input, t marks a graph and gives its index n, and v represents a node in the graph, which is a computer with some antivirus program installed; its first value is the node id and its second value is the node label. Similarly, e represents an edge in the graph; its first two values are the node ids of the two nodes involved and the third value is the edge label.

Antivirus Rating Order: The antiviruses are arranged in hierarchical order, which also allows us to compare the various antivirus programs involved.

The input graphs can also be represented in graphical form. Each antivirus is represented by a node. Different colors are used for different viruses, and each edge represents the spread of a virus from one node to another. The edge label represents the type of virus. Figure 2 shows a screenshot of the tool. Figure 3 depicts the distribution of viruses among nodes.
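The t / v / e records described above can be read with a few lines of code. The sketch below assumes the usual gSpan text layout (`t # <index>`, `v <id> <label>`, `e <id1> <id2> <label>`), which is consistent with the description but is not shown in the paper; the sample data, the labels AV1 to AV3, B and D, and both function names are hypothetical. It also counts in how many transactions each labeled edge occurs, which corresponds to the frequent one-edge patterns that gSpan starts from, not to the full subgraph search.

```python
from collections import Counter


def parse_graph_transactions(text):
    """Parse 't # <index>', 'v <id> <label>' and 'e <id1> <id2> <label>' records."""
    graphs = []
    for raw in text.splitlines():
        parts = raw.split()
        if not parts:
            continue
        if parts[0] == "t":
            graphs.append({"index": int(parts[-1]), "nodes": {}, "edges": []})
        elif parts[0] == "v":
            graphs[-1]["nodes"][int(parts[1])] = parts[2]
        elif parts[0] == "e":
            graphs[-1]["edges"].append((int(parts[1]), int(parts[2]), parts[3]))
    return graphs


def frequent_edges(graphs, min_support):
    """Labeled edges (antivirus, virus, antivirus) seen in >= min_support graphs."""
    support = Counter()
    for g in graphs:
        seen = set()
        for u, v, virus in g["edges"]:
            # Undirected edge: order the two endpoint labels canonically.
            a, b = sorted((g["nodes"][u], g["nodes"][v]))
            seen.add((a, virus, b))
        support.update(seen)  # each transaction counts a pattern at most once
    return {p: c for p, c in support.items() if c >= min_support}


# Hypothetical input: node labels are antivirus products, edge labels are viruses.
sample = """
t # 0
v 0 AV1
v 1 AV2
v 2 AV1
e 0 1 D
e 1 2 B
t # 1
v 0 AV1
v 1 AV2
e 0 1 D
t # 2
v 0 AV2
v 1 AV3
e 0 1 B
"""

graphs = parse_graph_transactions(sample)
frequent = frequent_edges(graphs, min_support=2)
print(frequent)
```

In this made-up sample, virus D travelling between an AV1 node and an AV2 node is the only edge pattern present in at least two transactions.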
Figure 2: Screenshot of the tool
Figure 3: Distribution of viruses among nodes

Output: Frequent Sub Graphs: These subgraphs represent the frequent propagation of computer viruses. Figure 4 depicts a subgraph of the computer virus distribution. For example, virus D flows more frequently in the system.

CONCLUSION
In this paper we have developed a tool inspired by the gSpan algorithm and applied it to identify frequently occurring subgraphs that show the flow of computer viruses. The main aim of preferring this algorithm is to overcome the complexity of computing isomorphic graphs. Experimental analysis shows that the tool is adequate to handle large graphs and produces significant results. To the best of our knowledge, a gSpan-based approach has not yet been used to identify computer virus distribution through a network.

REFERENCES
[1] Wasserman, S. and K. Faust. Social network analysis: Methods and applications. New York, Cambridge University Press, 1994.
[2] Hanneman, Robert A. and Mark Riddle. Introduction to social network methods. Riverside, CA: University of California (published in digital form at http://faculty.ucr.edu/~hanneman/), 2005.
[3] M. Kuramochi and G. Karypis. Finding frequent patterns in a large sparse graph. SDM, 2004.
[4] Yan X, Han J. gSpan: Graph-based substructure pattern mining. In: Proceedings of ICDM'02, pp 721-724, 2002.
[5] Wellman, B. Computer networks as social networks. Science Magazine, 293, 2031-2034, 2001.
[6] Mason A. Porter, Jukka Pekka Onnela and Peter J. Mucha. Communities in Networks. Notices of the American Mathematical Society, Vol. 56, No. 9, 2009.
[7] Santo Fortunato and Claudio Castellano. Community Structure in Graphs. Chapter of Springer's Encyclopedia of Complexity and System Science (2008).
[8] Luo, J. Social network analysis. Social Science Academic Press (in Chinese), 2004.
[9] M. E. J. Newman. The structure and function of complex networks. SIAM Review, 45, 167-256 (2003).
Hierarchal Structure of Community and Link Analysis
Seema Mishra and G.C Nandi
Indian Institute of Information Technology, Allahabad, India
{seema.mishra.phd,gcnandi}@gmail.com

Abstract. Discovering the hierarchy of the organizational structure in a dynamic social network can unveil significant patterns that help in network analysis. In this paper, we formulate a methodology to establish the most influential person in a temporal communication network from the perspective of the frequency of interactions, working on the hierarchal structure. From the frequency of interactions, we calculate an individual score for each person using the PageRank algorithm. Subsequently, a graph is generated that shows the influence of each individual in the network. We performed rigorous experiments on the Enron data set to establish that the proposed methodology correctly identifies the influential persons over the temporal network. The Enron email data set describes how the employees of the company interacted with each other. We could verify the results of our methodology against the facts in the company's dataset because, after the bankruptcy, the interactions and behaviors of the individuals in the network are well known. Our results show that the proposed methodology is generic and can be applied to other communication data sets to identify influential persons at particular instances.

Keywords: Dynamic social network analysis, social network analysis, hierarchal structure.

1 Introduction
A network structure provides a formal way of representing data that emphasizes the associations between entities. This representation is important because it gives insight into the data. Many entities today are interconnected in order to get work done, and the behavior of an individual entity reflects, to a large extent, the functioning of the whole system. An entity could be a person, an organization [15], or a computer node [16].
Networks are primarily studied in the mathematical framework of graphs [1]. Social network analysis is a proliferating area of research; it has been in existence for quite some time and is experiencing a surge in popularity as a way to understand the behavior of users at the individual and group level [Wasserman & Faust, 1994; Wellman, 1996]. Understanding the behavior of individuals through social network methods helps analysts reveal hidden patterns in social communication. To model a social network mathematically, graphs are typically used: the nodes depict individuals, groups of persons, events, organizations, etc., and each link/edge represents a connection/relationship between two individuals. Social network analysis attempts to understand the network and its components: nodes (social entities, commonly known as actors or events) and connections (interconnections, ties, and links). Its main focus is on individuals and the interdependent relationships among them, rather than on individuals and their attributes as in conventional data structures. The target of this paper is to propose a hierarchical structure of a social network that changes with time.

A. Agrawal et al. (Eds.): IITM 2013, CCIS 276, pp. 252–260, 2013. © Springer-Verlag Berlin Heidelberg 2013

Types of social network
Sociometric: It involves the entire population and focuses on the global structural pattern of the social network.
Egocentric: It focuses on individual interaction patterns for analyzing the social network. A limitation of this kind of analysis is that it is very difficult to collect case-by-case data.

2 Dynamic Social Network Analysis
The versatile power of social networks is being applied to mining patterns of social interaction in wide-ranging applications including disease modeling [4], information transmission and behavior analysis [2, 3, 16], and business management.
Network analysis came into the picture through its practical applications in intelligence and surveillance [5], and it has become a popular paradigm for uncovering anti-social networks such as criminal, terrorist and fraud networks, particularly after the tragic events of September 11th, 2001, which shook the whole world. Traditional social network analysis is combined with computational techniques such as artificial intelligence, machine learning and data mining to develop empirical research on human, group and organizational behavior and the links among them under varying levels of uncertainty.

There are two levels of analysis in dynamic network analysis (DNA). The first focuses on relational data, i.e. data about the links between groups of people, events, locations and organizations. Identifying associations between these entities is a crucial part of unveiling different types of activities in order to discover knowledge about the network. The second focuses on the dynamism of the relations, i.e. how these relations are likely to change in the future and what the interesting consequences of these changes are for the system.

Social interaction can take any form, depending on the type of data available [6]. It might be verbal or written communication (cell phones, emails, blogs, chat), scientific collaboration (co-authorship networks, citation networks), browsed websites, or groups of animals. The mathematical network model is very successful in the analysis of social networks, but its major drawback is that it may miss the temporal aspect of interaction, because social interaction is inherently dynamic. A static model of interaction can give inaccurate information, and decisions based merely on this information might lead the analyst in a faulty direction.

Several shortcomings arise when dealing with a static model of a social network, which can prevent the analyst from recognizing causal relationships in the patterns of social interaction [6]. A static model cannot answer questions such as:
• What is the rate at which a disease spreads, and who is the central person who should be vaccinated to control its spread among a group of persons?
• What are the causes and consequences of the evolution of the social structure?

Dynamic social network analysis is an emerging research area that plays a crucial role in filling the gap between traditional social network analysis and the time domain. The dynamic study of networks includes classical network analysis, link analysis, and multi-agent systems. Dynamic network analysis facilitates the analysis of multiple types of nodes (multi-mode) and multiple types of links (multi-plex) simultaneously. In contrast, static network analysis can handle at most two-mode data and analyzes one type of link at a time. Dynamic networks have several characteristics:
• The nature of the nodes is dynamic: their properties change with time.
• They deal with meta-networks.
• Network evolution is a consequence of agent-based modeling.

Furthermore, network analysis exists at four levels [15]: attribute-oriented analysis, position-oriented analysis, structure-oriented analysis, and dynamic network analysis. Fig. 1 shows the categorization. Attribute analysis captures the properties of vertices and edges and finds their causal relation with the structure of the network.

Fig. 1. Level of network analysis (Dynamic Social Network Analysis over t1, t2, ..., tn: structure oriented analysis, position oriented analysis, attribute oriented analysis)

Position-oriented analysis investigates the micro level of the network. It looks into every single entity, i.e. node or organization, and its characteristics in the network.
Structure analysis is a macro-level analysis of the network and investigates the average metrics of the social network. Dynamic analysis can utilize all the measures of the aforementioned analyses in order to identify the progressive behavior of the network and of individuals.

ROLE OF DATA MINING IN SOCIAL NETWORK

Social computing can make use of data mining techniques in the following analyses:

• Community Detection
• Classification
• Link Prediction
• Network Modeling

COMMUNITY IN SOCIAL NETWORK

A community is a sub-part of the whole network within which intra-community interaction is relatively more frequent and stronger than inter-community interaction. It can take any form, for example a group, subgroup, or cluster. Examples are: a) in a citation network, related papers on a single topic; b) web pages on related topics. Community detection is a classical problem in social network analysis. Commonly, it can be stated as the problem of identifying sub-graphs of the original graph, called vertex sparsifiers [12]. These small networks retain the relevant information of the original group. Four levels of analysis are conducted in community identification, as shown in Fig. 1.

ANALYSIS OF PREVIOUS RESEARCH

Hierarchical methods for community detection fall into two categories: agglomerative and divisive. In the former case, each node is initially assumed to be a community, and communities are repeatedly grouped together. In the latter case, the whole network is initially considered a single community and is subsequently divided into smaller ones. Most methods are based on graph clustering and partitioning. Distance-based structural equivalence [7] uses distance metrics to identify similar entities. For graph partitioning, several algorithms have been proposed [8, 9]. The Newman-Girvan method and spectral clustering methods [10, 11] use a notion of modularity and utilize the edge betweenness metric to divide the network into groups.
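To make the notion of modularity used by these methods [10, 11] concrete, the following C++ sketch evaluates $Q = \frac{1}{2m}\sum_{i,j}\left(A_{ij} - \frac{k_i k_j}{2m}\right)\delta(c_i, c_j)$ for a given partition of an undirected weighted graph. It is an illustration only and not part of the proposed approach: the function name `modularity`, the dense symmetric matrix `a`, and the label vector `community` are assumptions introduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Modularity Q of a given partition of an undirected, weighted graph.
// a         : symmetric adjacency (socio) matrix, a[i][j] = weight of edge i-j
// community : community[i] is the label of the community that node i belongs to
double modularity(const std::vector<std::vector<double>>& a,
                  const std::vector<int>& community) {
    const std::size_t n = a.size();
    assert(community.size() == n);

    // k[i] is the weighted degree of node i; twoM is the sum of all degrees (2m).
    std::vector<double> k(n, 0.0);
    double twoM = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        assert(a[i].size() == n);
        for (std::size_t j = 0; j < n; ++j) {
            k[i] += a[i][j];
            twoM += a[i][j];
        }
    }
    if (twoM == 0.0) return 0.0;  // a graph without edges has no community structure

    // Sum (A_ij - k_i * k_j / 2m) over all pairs placed in the same community.
    double q = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            if (community[i] == community[j]) {
                q += a[i][j] - k[i] * k[j] / twoM;
            }
        }
    }
    return q / twoM;
}
```

For two triangles joined by a single edge, splitting the six nodes into the two triangles gives Q = 5/14 (about 0.357), whereas placing all nodes in one community gives Q = 0, so the split is the better partition under this measure.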
In analyzing dynamic patterns, many methods use temporal snapshots of the interactions over time [13, 14].

3 Discovering Hierarchy of Group

Before analyzing community hierarchy, we define some basic terminology. The hierarchy of a community expresses the power of each individual in a group. If this chain is known, the leader of the group can be found. For this purpose we used the well-known PageRank algorithm, which calculates an individual score, I_score, for each person to represent the importance of that person. The higher the score of a person, the more powerful the person is.

Definition: Let $P = \{p_1, p_2, \ldots, p_n\}$ be the collection of persons involved in communication. For any members $p_i$ and $p_j$, if I_score($p_i$) ≥ I_score($p_j$), then $p_i$ can be the leader of $p_j$.

3.1 Proposed Model

Figure 2 shows the architecture of the proposed system.

Fig. 2. Model for extracting graph over months (blocks in the figure: Enron Email data; Preprocessing; Extracting Names of Distinct user; Socio matrix; Socio matrix for the twelve months, January to December; Graph of twelve months)

First, the individual score of each member is calculated with the help of the PageRank algorithm described below.

Input: Social network G.
Output: Individual score of each person.
Steps:
    d = 0.85
    epi = 0.01
    del = 0
    del_prev = 0
    iter = 0
    do
        del_prev = del
        iter++
        sum1 = 0, sum2 = 0
        for i = 1 to 150
            R[1][i] = ((1 - d) / N) + d * fun(i)
        for j = 1 to 150
            sum2 += R[1][j]
            sum1 += R[0][j]
        e = sum1 - sum2
        for j = 1 to 150
            R[1][j] = R[1][j] + e * L[j] / sum_all
        del = 0
        for j = 1 to 150
            del += |R[1][j] - R[0][j]|
        for j = 1 to 150
            R[0][j] = R[1][j]
    while (del > epi && del_prev != del)
    for i = 1 to 150
        R[0][i] = 10000.0 * R[0][i]

In I_score(), d is the damping factor, generally assumed to be around 0.85. R is the vector that stores the individual score of each person (R[0] holds the previous iterate and R[1] the current one).
N denotes the total number of persons in the network. sum_all is the sum of the weights of all edges. Fun(i) denotes

$$\mathrm{Fun}(i) = \sum_{p_j \in M(p_i)} \frac{W(p_{ij})\, P_k(j)}{L(p_j)},$$

where $L(p_j)$ is the sum of the weights of all edges linked with $p_j$ and $W(p_{ij})$ is the weight of the edge linking $p_i$ and $p_j$.

4 Practical Implementation and Analysis

In this section, we evaluate the capability of the proposed approach to discover organizational structure and to explore the evolution of organizational structure and of the links between individuals in a dynamic social network. We performed the experiments on the Enron email data set. Email communication data has become a practical source for research in network analysis, such as social network analysis. Most experiments are carried out on artificial data because real-life communication data is rarely available. The Enron email data set [17] has become a benchmark for this research domain. It was made public and posted on the web by the Federal Energy Regulatory Commission during its investigation of fraud at the company, and it serves as a test bed for validating and testing the efficacy of methodologies developed for counter-terrorism, fraud detection, and link analysis. The data covers the communication of about 150 users, mostly senior managers, organized into folders. The set still has many issues, such as integrity problems and duplicate messages. For preprocessing, the names of distinct users were first extracted and duplicated ids were discarded.

Fig. 3. Number of communities over months (x-axis: Months, Jan to Dec; y-axis: Number of communities, 0 to 20)

Fig. 4. Evolution of community from months Jan to Dec

The proposed approach is implemented in Dev-C++.
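The I_score computation of Section 3.1 could be written in C++ along the following lines. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `individualScores`, the dense symmetric socio matrix `w`, the threshold `eps`, and the cap `maxIter` are hypothetical. In addition, the lost score mass is normalized by the sum of all L[j] so that the scores keep summing to 1, the convergence test uses the absolute change of the scores, and the final scaling by 10000 is omitted.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of a weighted PageRank-style individual score (I_score).
// Assumption: w is a symmetric socio matrix, where w[i][j] is the weight of
// the link between persons i and j (e.g. the number of e-mails exchanged).
std::vector<double> individualScores(const std::vector<std::vector<double>>& w,
                                     double d = 0.85,       // damping factor
                                     double eps = 1e-9,     // convergence threshold (assumed)
                                     int maxIter = 1000) {  // iteration cap (assumed)
    const std::size_t n = w.size();
    assert(n > 0);

    // L[j]: sum of the weights of all edges linked with person j.
    std::vector<double> L(n, 0.0);
    double totalL = 0.0;
    for (std::size_t j = 0; j < n; ++j) {
        assert(w[j].size() == n);
        for (std::size_t k = 0; k < n; ++k) L[j] += w[j][k];
        totalL += L[j];
    }

    std::vector<double> prev(n, 1.0 / n);  // scores start uniform and sum to 1
    std::vector<double> next(n, 0.0);
    for (int iter = 0; iter < maxIter; ++iter) {
        double sumNext = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            // fun(i) = sum over linked persons j of W(p_ij) * score(j) / L(p_j)
            double fun = 0.0;
            for (std::size_t j = 0; j < n; ++j) {
                if (w[i][j] > 0.0 && L[j] > 0.0) fun += w[i][j] * prev[j] / L[j];
            }
            next[i] = (1.0 - d) / n + d * fun;
            sumNext += next[i];
        }
        // Score mass lost through persons without links is handed back in
        // proportion to L[j] (uniformly if the network has no links at all).
        const double lost = 1.0 - sumNext;
        double delta = 0.0;
        for (std::size_t j = 0; j < n; ++j) {
            next[j] += lost * (totalL > 0.0 ? L[j] / totalL : 1.0 / n);
            delta += std::fabs(next[j] - prev[j]);
        }
        prev = next;
        if (delta < eps) break;  // scores no longer change noticeably
    }
    return prev;
}
```

Under the definition in Section 3, the person with the largest returned score is the candidate leader of the group. For a chain of three persons in which only the middle person is linked to both others, the middle person receives the highest score and the two end persons receive equal scores.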
The experiments were conducted on a 2.1 GHz PC with a Core(TM)2 Dual-Core Pentium 4 processor and 2 GB of RAM. Examining the results in Figure 3, we found that the number of groups was highest in the month of May. Figure 4 shows that Jim was the person who headed the group from Jan to Feb. Monique was the leader throughout the months from Jan to April.

5 Conclusion and Future Scope

In this paper we introduced the concept of a hierarchy of positions in a group, using temporal interaction data of twelve months in an organization. It shows how the positions and links of group members change as people join and leave the group. This hierarchical view of group interaction is significant because it facilitates link analysis of individuals over time. In the future, we plan to address the integrity issues of the preprocessed Enron data and thereby improve the experimental results.

References

[1] Hanneman, R.A., Riddle, M.: Introduction to social network methods. University of California, Riverside (2005), published in digital form, http://faculty.ucr.edu/~hanneman/
[2] Baumes, J., Goldberg, M., Magdon-Ismail, M., Wallace, W.: Discovering hidden Inform. (2004)
[3] Tyler, J., Wilkinson, D., Huberman, B.: Email as spectroscopy: Automated discovery of community structure within organizations. In: Proc. 1st Intl. Conf. on Comm. and Tech. (2003)
[4] Kretzschmar, M., Morris, M.: Measures of concurrency in networks and the spread of infectious disease. Math. Biosci. 133, 165–195 (1996)
[5] Baumes, J., Goldberg, M., Magdon-Ismail, M., Wallace, W.: Discovering hidden groups in communication networks. In: Proc. 2nd NSF/NIJ Symp. on Intel. and Security Inform. (2004)
[6] Berger-Wolf, T.Y., Saia, J.: A framework for analysis of dynamic social networks. In: Proc. KDD 2006, pp. 523–528 (2006)
[7] Fortunato, S., Castellano, C.: Community Structure in Graphs. Springer's Encyclopedia of Complexity and System Science (2008)
[8] Chekuri, C., Goldberg, A., Karger, D., Levin, M., Stein, C.: Experimental study of minimum cut algorithms. In: Proc. 8th SIAM Symposium on Discrete Algorithms, pp. 324–333 (1997)
[9] Wu, A.Y., et al.: Mining scale-free networks using geodesic clustering. In: Proc. KDD 2004, pp. 719–724 (2004)
[10] Newman, M.E.J., Girvan, M.: Finding and evaluating community structure in networks. Phys. Rev. E 69, 026113 (2004)
[11] Newman, M.E.J.: Modularity and community structure in networks. PNAS 103(23), 8577–8582 (2006)
[12] Moitra, A.: Approximation algorithms for multi commodity type problems with guarantees independent of the graph size. FOCS, 3–12 (2009)
[13] Zhou, D., Councill, I., Zha, H., Lee Giles, C.: Discovering Temporal Communities from Social Network Documents. In: Proc. of ICDM 2007, pp. 745–750 (2007)
[14] Tantipathananandh, C., Berger-Wolf, T., Kempe, D.: A Framework For Community Identification in Dynamic Social Networks. In: Proc. of KDD 2007, pp. 717–726 (2007)
[15] Wasserman, S., Faust, K.: Social network analysis: Methods and applications. Cambridge University Press, New York (1994)
[16] Wellman, B.: Computer networks as social networks. Science Magazine 293, 2031–2034 (2001)
[17] The original dataset can be downloaded from William Cohen's web page, http://www-2.cs.cmu.edu/~enron/