Link Mining Using Strength of Frequent Pattern of Interaction
Seema Mishra (1), G C Nandi (2)
[email protected], [email protected]
(1) Research Scholar, (2) Professor
Robotics and AI Lab, CC-1 Building
Indian Institute of Information Technology, Allahabad
Allahabad, India
Abstract: This work addresses the problem of discovering and analyzing social networks, and the links between people who frequently appear together, from surveillance video, where large amounts of video data are collected routinely. A computer vision approach makes it possible to solve this problem at the low level with video data obtained from a fixed camera. The camera system should be capable of acquiring high-resolution face images of people under challenging conditions. We perform "opportunistic" face recognition on the captured images. We present a novel approach based on frequent pattern mining to solve this frequent-association problem in social networks. Our approach is illustrated with promising results from a fully integrated camera system.
Keywords: Social network analysis, Link analysis, Link mining, Law enforcement, Homeland security, Closed world environment, Data mining, Frequent pattern mining, Knowledge Discovery, Computer Vision.
1. Introduction
There are several modern means of communication, such as email, chat rooms, weblogs, and telephone [12]. Here, we discuss computer vision as a new, emerging means of studying communication and interaction. Computer vision techniques are widely used in many areas of surveillance, including face recognition [1, 18] and the detection, recognition, and analysis of activities [9, 10] and events.
More specifically, we attempt to discover a higher-level understanding of group behavior and of frequent relations among group members in the context of social interaction. This discovered knowledge can be useful in law enforcement, homeland security, organizations, and closed-world environments.
To obtain knowledge about social interaction, we consider two critical aspects of the problem. The first is the frequent relation among people who appear together. The second is the dynamism of a group, since its membership changes over time. In this research we propose a framework for automated frequent link analysis in terms of social network patterns. The framework includes tasks at two levels, low and high. The low-level task is the face recognition module, and the high-level task is pattern mining for frequent groups.
Here, the face recognition module should:
1. Persistently track video sequences and split them into key frames.
2. Detect and recognize faces unambiguously under challenging conditions. For this purpose, only high-resolution images of faces are captured.
The approach we adopt for discovering persons who are frequently grouped together is based on pattern mining [17, 20].
The paper is organized as follows. In Section 2, we describe the social network and its creation. In Section 3, we discuss the proposed approach in the context of social network data. Section 4 presents the implementation of the algorithm and discusses the data sets and the results yielded by the fully integrated system. The paper ends with a conclusion in Section 5.
2. Social Network
Social network analysis is used to understand the patterns of interaction caused by social dynamics and events [2, 3, 4, 5, 6, 7].
A social network graph corresponds to $G = (V, E)$, as in the mathematical literature, where $V$ is the set of nodes and $E$ is the set of edges. Each node is an individual in society and possesses a name and some identifying information. This information could be face images and signatures.
Suppose the social connection strength between two persons $i$ and $j$ is $S_{ij}$. To establish the strength of the tie between individuals $i$ and $j$, the following assumptions are required.
1. The persons involved in an interaction, $i_1, i_2, \dots, i_m$, where $m$ is variable, should be identified and recognized positively.
2. The interaction is measured in a group.
3. The frequency with which groups of people are seen in proximity is observed appropriately over a period of time.
The low-level face recognition module computes probabilistic values $P = [p_1, p_2, \dots, p_N]$, where each value $p_i$ is the probability that the recognized face corresponds to individual $i$ in the original database of faces, in which the signatures of all individuals are stored. If a new face is detected, it should be stored in the database of faces with the next index. The index of a face in the original database becomes the signature of the recognized face. Suppose three persons P, Q, and R are recognized by the system; then we have $i' = \arg\max_i p_i$, $j' = \arg\max_j q_j$, and $k' = \arg\max_k r_k$.
Figure 1. Grouping of people: groups are obtained by capturing images of persons from fixed camera views. Here 1 indicates that a particular person is involved in a particular group.
Figure 2: System for calculating frequent persons and link mining.
Detected faces appear in groups, and as a consequence we obtain a set of group transactions (see Figure 1) $G = \{g_1, \dots, g_m\}$, where $g_1 \in G$ and $p_1, p_2, \dots, p_n$ are persons.
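The arg max step and the grouping of Figure 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the 4-person watch list, and the probability values are all hypothetical, and each key frame is assumed to arrive as a list of per-face probability vectors.

```python
def recognize(probabilities):
    """Return the watch-list index with the highest match probability (arg max)."""
    return max(range(len(probabilities)), key=lambda i: probabilities[i])


def build_transactions(frames):
    """Turn each key frame (a list of per-face probability vectors) into a group.

    A group is the set of recognized person indices seen together in that frame.
    """
    return [frozenset(recognize(p) for p in faces) for faces in frames]


def incidence_matrix(groups, num_persons):
    """Binary group-by-person table: 1 means the person is involved in the group."""
    return [[1 if person in g else 0 for person in range(num_persons)] for g in groups]


# Hypothetical recognition scores over a 4-person watch list.
frames = [
    [[0.7, 0.1, 0.1, 0.1], [0.2, 0.6, 0.1, 0.1]],                          # persons 0 and 1
    [[0.1, 0.1, 0.7, 0.1], [0.8, 0.1, 0.05, 0.05], [0.1, 0.7, 0.1, 0.1]],  # persons 2, 0 and 1
]
groups = build_transactions(frames)
matrix = incidence_matrix(groups, num_persons=4)
```

Each row of `matrix` corresponds to one group and each column to one person, which mirrors the 0/1 layout described in the caption of Figure 1.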
3. Frequent persons and their link mining
The proposed model of frequent links is shown in Figure 2.
The proposed system is conceptualized in Algorithm 1.
We have divided the model into several modules: key frame extraction, face recognition, and frequent pattern mining over the recognized faces.
3.1 Key frame extraction
Our problem involves capturing video and splitting it into frames. This results in a very large database of images, and much of the captured data may be redundant. Hence, an efficient method is required to extract from the video only those frames that carry the desired information, called representative or key frames.
Here, the key frames to be extracted depend on the relevant objects in the visual content, which should consist of a group of people.
Let a video sequence $V_t$ starting from time $t$ and comprising $F$ frames be
$$V_t = \{ f(t+n) \}, \quad n = 0, 1, \dots, F-1,$$
and let the set of key frames $kf_t$ have $KF \le F$ frames, given by
$$kf_t = \{ f_{KF}(t+n_1), f_{KF}(t+n_2), \dots, f_{KF}(t+n_{KF}) \}, \quad 0 \le n_i \le F-1.$$
Let $I(x, y)$ be the intensity value of the pixel at location $(x, y)$ in an $m \times n$ image.
Our key frame selection process is based on reading the image from left to right and calculating the gradient points where a rapid change in intensity occurs. The method is described in Algorithm 2.
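The left-to-right gradient scan can be sketched as below. This is only an illustration under stated assumptions: images are taken to be grayscale lists of rows, and the decision rule (keep a frame when it has at least `min_points` strong gradient points) together with the parameters `threshold` and `min_points` are hypothetical choices, not the selection criterion of Algorithm 2.

```python
def strong_gradient_points(image, threshold):
    """Scan each row left to right and collect (x, y) where intensity jumps sharply."""
    points = []
    for y, row in enumerate(image):
        for x in range(len(row) - 1):
            # Horizontal gradient approximated by the difference of neighbouring pixels.
            if abs(row[x + 1] - row[x]) >= threshold:
                points.append((x, y))
    return points


def select_key_frames(frames, threshold=50, min_points=2):
    """Return indices of frames with at least min_points strong gradient points.

    Both threshold and min_points are hypothetical tuning parameters.
    """
    return [n for n, frame in enumerate(frames)
            if len(strong_gradient_points(frame, threshold)) >= min_points]


# Toy 2x4 grayscale frames: a flat background and a frame containing sharp edges.
flat = [[10, 10, 10, 10], [10, 10, 10, 10]]
edges = [[10, 200, 200, 10], [10, 10, 200, 200]]
key_frames = select_key_frames([flat, edges, flat])
```

In this toy input only the middle frame contains rapid intensity changes, so it is the only index kept.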
3.2 Face Recognition
The face recognition module computes probabilistic values for the recognized faces and returns the index of the most likely match from the watch list. As a consequence, we obtain a list of indices of faces stored in the watch list, which is then used to compute the faces that frequently appear together. Evidently, face detection is a prerequisite for any automated system that addresses this problem [18].
3.3 Pattern Mining with Social Network
Frequent social networks are those entities that appear in a group more frequently. The aim is to analyze a person's link behavior on the premise of how frequently that person appears with other persons. The association rule problem was introduced by R. Agrawal et al. [14, 15, 16]. The problem of discovering all association rules can be decomposed into two subproblems [AGR 94]: first, find all sets of items, called frequent item sets, whose transaction support is above the minimum support; second, use the frequent item sets to generate the desired rules.
Let $D$ be the face database (the set of transactions) and let $F = \{f_1, f_2, \dots, f_m\}$ be the set of all face images. For the sake of computation, we enumerate the faces from 0 to $m-1$. Let $ms$ be the minimum support threshold, with $ms \in (0, 1]$. If $X$ is a nonempty subset of $F$, then $X$ is a set of frequent persons appearing together if and only if
$$\frac{|\{T \in D : X \subseteq T\}|}{|D|} \ge ms.$$
4. Experimentation
To describe the experimental results of the proposed system, we captured video sequences of groups of people, divided them into frames, and generated key frames from this data. The whole process of key frame generation, face detection, face recognition, and pattern mining is carried out offline to analyze the video data. The remainder of this section presents results for the various parts of the system.
Key frame extraction
First, we demonstrate the performance of key frame extraction on video captured at the times when students, staff, and research scholars enter the laboratory. The subjects enter and leave constantly, forming groups with different persons at different times. Groups formed in this way are to be discovered as frequent groups.
Face Detection and Recognition.
After key frame extraction, this module detects faces, recognizes them against the watch list of faces, and returns the index of each recognized face. The detected faces from each group are shown in Figure 3. Only a few groups are shown here for the sake of understanding.
[Figure 3: detected faces for Group1 through Group6]
Pattern Mining with Social Network
The final frequent persons are given in Figure 4. For verification, we confirmed that these three persons were strongly linked because they were doing their master's project together.
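The support criterion of Section 3.3 can be illustrated with a small level-wise search in the style of Apriori. This is a hedged sketch rather than the system used in the experiments: the function names and the five toy transactions are hypothetical, and each transaction is assumed to be the set of face indices recognized in one group.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every person in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_groups(transactions, min_support):
    """Level-wise (Apriori-style) search for person sets with support >= min_support."""
    persons = sorted(set().union(*transactions))
    level = [frozenset([p]) for p in persons
             if support(frozenset([p]), transactions) >= min_support]
    result = {}
    while level:
        for s in level:
            result[s] = support(s, transactions)
        # Join step: merge frequent k-sets into (k+1)-set candidates.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == len(a) + 1}
        # Prune step: every k-subset must already be frequent, then check support.
        level = [c for c in candidates
                 if all(frozenset(sub) in result for sub in combinations(c, len(c) - 1))
                 and support(c, transactions) >= min_support]
    return result


# Hypothetical transactions: each set holds the face indices recognized in one group.
transactions = [frozenset(t) for t in ({0, 1, 2}, {0, 1, 2, 3}, {0, 1}, {2, 3}, {0, 1, 2})]
frequent = frequent_groups(transactions, min_support=0.6)
```

With this toy input, persons 0, 1, and 2 form a frequent triple because they appear together in three of the five groups, while person 3 falls below the threshold.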
5. Conclusion and future work
This paper addressed the emerging problem of discovering and mining frequent links in social networks. A computer vision and pattern mining approach is used to compute frequently linked people, which may have many applications in practical surveillance systems. On this premise, we have presented an algorithm that extracts key frames from video sequences in order to identify individuals and analyze their interactions using face recognition. To implement the system, data were collected by video in a closed-world environment. Several challenges were encountered, such as modeling groups and detecting and recognizing faces in a cluttered environment. In future, a more sophisticated and expressive strategy could be used to overcome these problems, for example by modeling people who are close in space with some orientation. Fuzzy concepts could be employed to cluster people and identify the degree of membership, that is, how strongly a person is associated with another cluster.
References
1. H. Schneiderman and T. Kanade. A statistical method for 3D object detection applied to faces and cars. In IEEE Computer Vision and Pattern Recognition, Hilton Head, SC, volume 1, pages 746–751, 2000.
2. M. E. J. Newman. Detecting community structure in networks. European Physical Journal B 38: 321–330, 2004.
3. T. Yu, S. Lim, K. Patwardhan, and N. Krahnstoever. Monitoring, recognizing and discovering social networks. In CVPR, 2009.
4. M. E. Newman, "The structure and function of complex networks," SIAM Review, 45(2): 167–256, 2003.
5. T. Berger-Wolf and J. Saia, "A framework for analysis of dynamic social networks," DIMACS Technical Report, vol. 28, 2005.
6. J. Sinai, "Combating terrorism insurgency resolution software," Proc. IEEE Int. Conf. Intelligence and Security Informatics (ISI-2006), pp. 401–406.
7. L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan, "Group formation in large social networks: membership, growth, and evolution," Proc. 12th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2006, pp. 44–54.
8. M. Goldberg, M. Hayvanovych, A. Hoonlor, S. Kelley, M. Magdon-Ismail, K. Mertsalov, B. Szymanski, and W. Wallace, "Discovery, analysis and monitoring of hidden social networks and their evolution," Technologies for Homeland Security, 2008 IEEE Conference on, pp. 1-6, 12-13 May 2008.
9. N. Vaswani, A. K. R. Chowdhury, and R. Chellappa, "Activity recognition using the dynamics of the configuration of interacting objects," In CVPR (2), pages 633–642, 2003.
10. P. Turaga, R. Chellappa, V. S. Subrahmanian, and O. Udrea, "Machine recognition of human activities: A survey," IEEE Transactions on Circuits and Systems for Video Technology 18, 11 (Nov), 1473-1488, 2008.
11. M. A. Turk and A. P. Pentland, "Face recognition using eigenfaces," Computer Vision and Pattern Recognition, 1991. Proceedings CVPR '91. IEEE Computer Society Conference on, pp. 586–591, 3-6 Jun 1991.
12. Seema Mishra, Udit Agrawal, and G C Nandi, "CVPD: A tool based on a social network analysis to combating viruses' propagation," Communication, Information & Computing Technology (ICCICT), 2012 International Conference on, pp. 1-5, 19-20 Oct. 2012.
13. R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," In J.B. Bocca, M. Jarke, and C. Zaniolo, editors, Proceedings 20th International Conference on Very Large Data Bases, pages 487–499. Morgan Kaufmann, 1994.
14. R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," IBM Research Report RJ9839, IBM Almaden Research Center, San Jose, California, June 1994.
15. Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques," Morgan Kaufmann Publishers, 2000, pp. 225-278.
16. R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases," In: Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington, D.C., 1993, pp. 207-216.
17. M. Hegland, "The apriori algorithm - a tutorial," WSPC/Lecture Notes Series, 9(7), March 2005. http://www2.ims.nus.edu.sg/preprints/2005- 29.pdf.
18. Ming-Hsuan Yang, D. J. Kriegman, and N. Ahuja, "Detecting faces in images: a survey," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 24, no. 1, pp. 34-58, Jan 2002.
19. H. C. V. Lakshmi and S. PatilKulakarni, "Segmentation Algorithm for Multiple Face Detection for Color Images with Skin Tone Regions," Signal Acquisition and Processing, 2010. ICSAP '10. International Conference on, pp. 162-166, 9-10 Feb. 2010.
20. J. Han, H. Cheng, D. Xin, and X. Yan, "Frequent pattern mining: current status and future directions," Data Mining and Knowledge Discovery, 10th Anniversary Issue, 2007, pp. 55–86.
21. Guozhu Liu and Junming Zhao, "Key Frame Extraction from MPEG Video Stream," Proceedings of the Second Symposium International Computer Science and Computational Technology (ISCSCT '09), Huangshan, P. R. China, 26-28 Dec. 2009, pp. 007-011.
22. G. Ciocca and R. Schettini, "An innovative algorithm for key frame extraction in video summarization," J. Real-Time Image Process., vol. 1, no. 1, pp. 69–88, 2006.
23. R. R. Schulz and R. L. Stevenson, "Extraction of high-resolution frames from video sequences," IEEE Trans. Image Processing, vol. 5, pp. 996-1011, June 1996.
International Journal of Enhanced Research in Science Technology & Engineering, ISSN: 2319-7463
Vol. 2 Issue 6, June-2013, pp: (71-75), Available online at: www.erpublications.com
Discovery of frequent subgroup using data mining
Seema Mishra
Robotics and AI Lab, Indian Institute of Information Technology, Allahabad, India
Abstract: Social network analysis is an emerging technology that focuses on patterns of interaction and relation among people and organizations. Frequent subgroup detection is a subtask of social network analysis that reveals hidden knowledge and reflects the behavior of the entire social network. These subgroups are collections of nodes that share common characteristics and are densely connected with each other. In this paper, a novel approach is presented to discover frequent subgroups, inspired by a well-known algorithm in the domain of association rule mining known as the Continuous Association Rule Mining Algorithm (CARMA).
Keywords: Subgroup detection, Social network analysis, Dynamic network analysis.
I. INTRODUCTION
Social network analysis has existed for quite some time and is experiencing a surge in popularity as a way to understand the behavior of users, represented as nodes in a network. To model a social network, the most popular data structure is the graph, where a node depicts an individual, a group of persons, an event, or an organization, and each link/edge represents a connection or relationship between two individuals [7, 8]. Social network analysis attempts to understand the network and its components, namely nodes (social entities commonly known as actors or events) and connections (interconnections, ties, and links). Its main focus is on individuals and the relationships among them, rather than on individuals and their attributes as in conventional data structures.
Social network analysis is now used extensively to analyze the structure of, and the connections between, the various actors within an organization.
The ability to detect community structure in a network could have practical applications. Communities in a network might
represent real social groupings oftentimes interacting, perhaps by interest or background; communities in a citation network
might represent related papers on a single topic; communities in a metabolic network might represent cycles and other
functional groupings; communities on the web might represent pages on related topics; hidden communities might represent
potential suspicious activity.
Being able to identify these communities could help us understand and exploit these networks more effectively. Communities of practice are the collaboration groups that naturally grow and coalesce within any kind of network. Any institution that provides opportunities for communication or interaction among its members is eventually threaded by communities whose members have similar goals and a shared understanding of their activities. These communities have been the subject of much research as a way to uncover the structure and interaction patterns within a network, in order to understand the collective behavior of the network from the individuals that constitute it. Recent research on these networks has focused on using a social network perspective to analyze them.
A social network consists of a set of actors, who may be arbitrary entities such as persons or organizations, and one or more types of relations between them, such as information exchange or economic relationship. Subgroup detection aims at clustering the nodes of a graph into subgroups that share common characteristics. To some extent, subgraph discovery does the same job by finding interesting or common patterns in a graph. One of the most common interests of social network analysis is the substructures that may be present in the network. Subgroups are subsets of actors among whom there are relatively strong, direct, intense, frequent, or positive ties. From the idea of subgroups within a network, we can understand the social structure and the embeddedness of individuals. Finding frequent groups in a graph database can be modeled in (a) a graph transaction setting or (b) a single graph setting. The graph transaction setting takes as input a set of relatively small graphs of user interaction, whereas the single graph setting deals with one large graph of user interactions [M. Kuramochi and G. Karypis, 2004].
The approach we adopt for discovering frequent subgroups is based on the continuous association rule mining algorithm [15]. The main reason for adopting this approach is that a group of people is not static: it changes over time as members join and leave the group.
To model the interaction of users, we use an undirected graph in which each vertex represents a user and each edge represents a relation between two users. With such a graph representation, the problem of finding frequent patterns becomes that of discovering subgroups that occur frequently enough over the entire set of graphs. The overall group of people is represented in Figure 1.
Figure 1: Group of people
For example, two subgroups are represented in Figure 2.
Figure 2: Two subgroups
II. OVERVIEW OF THE PAPER
The paper is organized as follows. Section III reviews related work on subgroup detection in social networks. Section IV discusses the CARMA algorithm in the context of social network data, specifically in terms of association/interaction among users. Section V presents the implementation of the algorithm and discusses the data sets and the results it yields. The paper ends with a conclusion in Section VI.
III. RELATED WORKS
A network is structurally composed of nodes and edges, where edges indicate relationships. One special kind of network is the social network, which has been studied for a long time [2, 3, 4, 5]. Modeling complex datasets as graphs has been recognized as a powerful tool in various research domains such as chemistry [9, 10], computer vision [11], image and object retrieval [12], and machine learning [13]. In particular, Dehaspe et al. [10] applied Inductive Logic Programming (ILP) to obtain frequent patterns in the toxicology evaluation problem [14].
IV. OVERVIEW OF ASSOCIATION RULE MINING IN THE CONTEXT OF CARMA
To search for groups of people who communicate frequently, we follow the graph transaction setting. As in CARMA, the algorithm has two phases. Phase 1 constructs a lattice of large (that is, frequent) groups of people interacting or communicating with each other. For each group set $g \in G$, three variables are maintained:
count(g): the number of occurrences of group g in the communication patterns since g was inserted in the lattice.
firstTrans(g): the index of the communication pattern at which g was inserted in the lattice.
maxMissed(g): an upper bound on the number of occurrences of g before g was inserted in the lattice.
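How these three variables evolve during Phase 1 can be sketched as follows. This is a simplified illustration and not the full algorithm: pruning and Phase 2 are omitted, the `maxMissed` value assigned on insertion is a conservative bound derived only from the immediate subsets, and the function names, the toy groups, and the support sequence are chosen for illustration. The support interval uses the usual CARMA bounds, count/i from below and (maxMissed + count)/i from above.

```python
from itertools import combinations


def min_support(entry, i):
    """Lower bound on support after i transactions: count / i."""
    max_missed, first_trans, count = entry
    return count / i


def max_support(entry, i):
    """Upper bound on support after i transactions: (maxMissed + count) / i."""
    max_missed, first_trans, count = entry
    return (max_missed + count) / i


def carma_phase_one(transactions, support_sequence):
    """Simplified Phase 1: lattice maps group -> [maxMissed, firstTrans, count]."""
    lattice = {frozenset(): [0, 0, 0]}  # the lattice starts with the empty set only
    for i, (group, sigma) in enumerate(zip(transactions, support_sequence), start=1):
        # Increment: count every lattice member contained in the current transaction.
        for g, entry in lattice.items():
            if g <= group:
                entry[2] += 1
        # Insert: a subset enters only if all its immediate subsets were already
        # in the lattice before this step and may still reach the threshold.
        known = set(lattice)
        for size in range(1, len(group) + 1):
            for combo in combinations(sorted(group), size):
                g = frozenset(combo)
                if g in known:
                    continue
                parents = [frozenset(s) for s in combinations(combo, size - 1)]
                if all(w in known and max_support(lattice[w], i) >= sigma for w in parents):
                    # g cannot have occurred earlier more often than any of its subsets.
                    bound = min(lattice[w][0] + lattice[w][2] - 1 for w in parents)
                    lattice[g] = [bound, i, 1]
    return lattice


# Toy interaction pattern and support sequence (illustrative values).
transactions = [frozenset({"u1", "u2", "u4"}), frozenset({"u1", "u2", "u3"}), frozenset({"u1", "u2"})]
lattice = carma_phase_one(transactions, support_sequence=[0.3, 0.9, 0.7])
pair = lattice[frozenset({"u1", "u2"})]
```

After the three transactions, `pair` holds the triple for {u1, u2}, and `min_support(pair, 3)` and `max_support(pair, 3)` give the interval that is guaranteed to contain its true support.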
V. DATA SETS AND IMPLEMENTATION
This section illustrates the approach. We considered a synthetic dataset of user groups in which each user has the following attributes:
age
location
sex
education
marital status
interests
The group interactions were prepared in a format suited to the experimental analysis.
Pattern of interaction:
g1: (user_1, user_2, user_3, user_4, user_10, user_11, user_12, user_13)
g2: (user_22, user_23, user_24, user_25, user_17, user_18, user_19, user_9)
g3: (user_22, user_23, user_24, user_25, user_17, user_18, user_19, user_9)
g4: (user_1, user_2, user_3, user_4, user_22, user_23, user_24, user_25)
g5: (user_1, user_2, user_3, user_4, user_13, user_14, user_24, user_25)
g6: (user_1, user_2, user_3, user_4, user_13, user_14, user_24, user_25)
g7: (user_1, user_2, user_3, user_4, user_22, user_23, user_24, user_25)
g8: (user_3, user_4, user_5, user_22, user_23, user_24, user_25, user_10)
g9: (user_13, user_14, user_15, user_22, user_23, user_24, user_25, user_20)
g10: (user_12, user_13, user_14, user_5, user_6, user_7, user_8, user_9)
g11: (user_12, user_13, user_14, user_15, user_24, user_25, user_1, user_2)
g12: (user_22, user_23, user_24, user_25, user_8, user_12, user_13, user_14)
g13: (user_1, user_2, user_3, user_11, user_12, user_13, user_24, user_25)
g14: (user_1, user_2, user_20, user_21, user_22, user_23, user_24, user_25)
g15: (user_8, user_9, user_10, user_11, user_12, user_13, user_23, user_24)
g16: (user_10, user_11, user_12, user_13, user_22, user_23, user_24, user_25)
g17: (user_10, user_11, user_12, user_13, user_22, user_23, user_24, user_25)
g18: (user_22, user_23, user_24, user_25, user_16, user_17, user_18, user_19)
g19: (user_22, user_23, user_24, user_25, user_7, user_18, user_21, user_11)
g20: (user_22, user_23, user_24, user_25, user_17, user_18, user_19, user_9)
g21: (user_22, user_23, user_24, user_25, user_5, user_6, user_7, user_8)
g22: (user_1, user_2, user_3, user_4, user_22, user_23, user_24, user_25)
g23: (user_1, user_2, user_3, user_4, user_13, user_14, user_24, user_25)
g24: (user_1, user_2, user_3, user_4, user_13, user_14, user_24, user_25)
g25: (user_1, user_2, user_3, user_4, user_22, user_23, user_24, user_25)
The support sequence is supplied as
[0 1 2 2 3 3 3 3 3 4 4 4 4 5 5 5 5 5 5 6 6 6]
The group of people found by the algorithm to communicate frequently is
(user_1, user_2, user_3, user_4, user_10, user_11, user_12, user_13)
Each user is linked with communities of their own interest, on the basis of which we formed groups of users having specific values for each attribute. The interaction of groups is defined as a pattern of communication in different groups $g_1, g_2, g_3, \dots, g_n$, and the support lattice of user groups is $G$.
For convenience and ease of explanation of how the algorithm works, we take a small part of the user interaction pattern, $P = \{g_1, g_2, g_3\}$, where the people involved in these groups are:
g1 = {u1, u2, u4}
g2 = {u1, u2, u3}
g3 = {u1, u2}
Throughout the implementation (shown in Figure 3), $G$ is initialized to $\{\emptyset\}$ with corresponding integers (0, 0, 0), and for ease of calculation the support sequence is $(\sigma_1, \sigma_2, \sigma_3) = (0.3, 0.9, 0.7)$.
Since maxSupport($\emptyset$) = 1 $\ge \sigma_1$ (0.3), all singleton users are added and their associated integers are set to (0, 1, 1). User sets are not pruned in the first scan. Next, g2 is scanned. Since {u1, u2} $\subseteq$ g2 and {u1}, {u2} are already in $G$ with maxSupport = 1, which is greater than $\sigma_2$, {u1, u2} is inserted into $G$ with updated associated integers.
Figure 3: CARMA implementation of lattice of group. Each node is annotated with (maxMissed, firstTrans, count) and [minSupport, maxSupport].
VI. CONCLUSION
In this paper, we presented a frequent group detection algorithm based on CARMA. It is capable of handling the dynamism of a network as members join or leave a group. We carried out the implementation on synthetic user interaction data.
VII. ACKNOWLEDGEMENT
We would like to thank Mr. Lokesh Kumar and Mr. Naveen, B.Tech students of RGIIT Campus, Amethi, for their efforts in implementing the algorithm.
The Organizational Risk Analyzer (ORA) tool, available through the Center for Computational Analysis of Social and Organizational Systems (CASOS), http://www.casos.cmu.edu, was used to draw the graphs.
REFERENCES
[1] M. Kuramochi and G. Karypis. Finding frequent patterns in a large sparse graph. SDM, 2004.
[2] Newman, M. E. J. Detecting community structure in networks. European Physical Journal B 38: 321-330, 2004.
[3] Newman, M. E. J. Fast algorithm for detecting community structure in networks. Physical Review E 69: 066133, 2004.
[4] Luo, J. Social network analysis. Social Science Academic Press (in Chinese), 2004.
[5] Wasserman, S. and K. Faust. Social network analysis: Methods and applications. New York, Cambridge University Press, 1994.
[6] Girvan, M. and M. E. J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences of the United States of America 99: 7821–7826, 2002.
[7] Hanneman, Robert A. and Mark Riddle. Introduction to social network methods. Riverside, CA: University of California (published in digital form at http://faculty.ucr.edu/~hanneman/), 2005.
[8] H. Goto, Y. Hasegawa, and M. Tanaka, "Efficient Scheduling Focusing on the Duality of MPL Representatives," Proc. IEEE Symp. Computational Intelligence in Scheduling (SCIS 07), IEEE Press, Dec. 2007, pp. 57-64, doi:10.1109/SCIS.2007.357670.
Assorting Unlabeled Data in Machine Learning: Active Cluster Learning
Seema Mishra, G C Nandi
[email protected],
[email protected]
Robotics and AI lab, Indian Institute of Information Technology, Allahabad
Abstract: Much research and practical work is being done on machine learning algorithms for training learning models. The basic resources required to train a model are processing time and memory for storing data. Another crucial resource is the amount of learning/training data from which learning is attempted. A smaller amount of data would not necessarily lead to more accuracy and precision in the model. Moreover, the huge amount of data needed for learning causes complexity [6, 7]. Labeling such a large amount of data leads to higher computational complexity and is expensive [3]. We therefore present a new approach that divides the training data set into distinguishable subsets of data with similar attributes, then selects data points and asks an oracle/expert for the labels to be associated with them.
Keywords: Machine learning, Active Learning, Semisupervised Learning, Active cluster learning.
I. INTRODUCTION
Learning is the cognitive process through which human beings develop the skills necessary to deal with the many situations encountered in life. The learning phenomenon varies across contexts. In the same way, machine learning enables computers to imitate this process automatically. Efforts are mainly focused on inductive learning, which refers to observing phenomena and generalizing from them. Here, inductive learning is concerned with classification. Learning is conceptualized and constrained by the information that the learning algorithm can obtain. Information is represented as data points in n-dimensional Euclidean space. Most often, information is presented to the learner as examples, but it can also be provided in other forms such as relations, constraints, functions, and even models. Learning can be broadly divided into two categories: supervised and unsupervised learning. Supervised learning relies on access to labeled data, whereas unsupervised learning does not use labeled data [1].
Hence, we should design such algorithms with care for computational efficiency in time and space. We should also pay close attention to another crucial resource: the optimum size of the training data required by a learning algorithm to achieve approximately the expected accuracy. This has also been discussed in the book Machine Learning by Tom M. Mitchell from the perspective of issues in machine learning [2]. The source of learning is past data/experience together with its associated information, broadly speaking the training data. In real-life applications, unlabeled data is easily accessible, but associating unlabeled data with its corresponding information is expensive, time consuming, and, most importantly, requires human expertise. In speech recognition, labeling speech utterances requires a human expert to analyze every signal and divide it into segments of phonemes. This task takes approximately ten times longer than the original audio [3].
It is well understood that learning with a large amount of data considerably increases computational complexity and, as a result, the training time of the learner/algorithm. From the principle of the inductive machine learning hypothesis, a great many studies provide evidence that it is not appropriate to introduce a very large training set. Analysis of learning curves has argued that overfitting may occur if the machine is given a very large amount of data to learn from [6, 7]. Hence, an extensive body of literature suggests that a very small amount of training data could be used intelligently to achieve the desired level of performance of the learning algorithm, as well as a lower monetary cost of labeling unlabeled data. This issue is the prime direction of this paper and motivates a data selection technique that reduces training time complexity and training data complexity. The selection technique should be scalable, so that it performs well, and inexpensively, even when given a large data set.
II. RELATED WORKS
A great deal of literature is already available on reducing the
cost of labeling training data and on using unlabeled data
efficiently. Labeling the data is an iterative process that
involves the following three subsets. Let D be the entire data
set; at each iteration i:
1. $D_{L,i}$: data points associated with their corresponding
labels, i.e. $D_{L,i} \subseteq D$.
2. $D_{U,i}$: data points with no label, i.e. $D_{U,i} = D - D_{L,i}$.
3. $D_{C,i}$: interesting/informative data points preferred to be
labeled. More formally, $D_{C,i} \subseteq D_{U,i}$.
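The bookkeeping of the three subsets can be illustrated with a minimal Python sketch. The function names `split_pools` and `labeling_loop`, the `oracle` callback and the random choice of candidates are assumptions made for illustration only; random choice merely stands in for whatever informativeness criterion a real method would use.

```python
import random

def split_pools(data, labels, n_candidates, seed=0):
    """Return (D_L, D_U, D_C) for one labeling iteration."""
    labeled = [x for x in data if x in labels]        # D_{L,i}: points that already have a label
    unlabeled = [x for x in data if x not in labels]  # D_{U,i} = D - D_{L,i}
    rng = random.Random(seed)
    size = min(n_candidates, len(unlabeled))
    # Placeholder selection: candidates are drawn at random from the unlabeled pool.
    candidates = rng.sample(unlabeled, size)          # D_{C,i}, a subset of D_{U,i}
    return labeled, unlabeled, candidates

def labeling_loop(data, oracle, n_candidates, iterations):
    """Repeatedly pick candidates and ask the (hypothetical) oracle for their labels."""
    labels = {}
    for i in range(iterations):
        _, unlabeled, candidates = split_pools(data, labels, n_candidates, seed=i)
        if not unlabeled:
            break
        for x in candidates:
            labels[x] = oracle(x)
    return labels
```

Calling `labeling_loop(list(range(10)), lambda x: x % 2, 3, 2)` labels three new points in each of two iterations, so only six of the ten points are ever sent to the oracle.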
Two different approaches are used to optimize the size and
labeling of training data.
A. Semi Supervised Learning
Semi-supervised machine learning is widely known for its
potential to make use of both unlabeled data and expensive
labeled data [8]. The systematic plan of action for
semi-supervised learning is given in figure 1.
B. Active Learning
Active learning is a strategy that gives the learner an
interactive choice, so that it can query for the labels of the
most informative unlabeled data. The primary idea is to seek
the information that is sufficient for the machine learner to
learn the model rather than using all available data, which
would surely increase computational complexity. In this way, a
smaller but higher-quality training set gives approximately the
same accuracy as would have been attained with a large amount
of randomly selected data. The systematic plan of action for
active learning is given in figure 2.
III. OUR PROPOSED DATA SELECTION APPROACH
The method we suggest is grounded on the presumption that data
points residing in the same cluster share the same label. The
task involves two steps: first, the given large training data set
is partitioned into distinguishable clusters/subsets; second,
the most informative data are selected from within these
clusters, so that the selection domain becomes narrower and
easier. We assume that the data are distributed in n-dimensional
Euclidean space, but for convenience figure 3 shows a data
distribution in two-dimensional space.
The idea of clustering data into homogeneous groups is
illustrated in figure 4, and we call this method Active
cluster learning.
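The cluster-then-query idea can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the paper does not name a clustering algorithm, so a small k-means with deterministic farthest-first initialization is swapped in here, and `oracle` is a hypothetical callback standing for the human annotator.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first_centers(points, k):
    """Deterministic initialization: start at points[0], then add the farthest point."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, iterations=20):
    """Assumed clustering step (k-means); returns a cluster index for each point."""
    centers = farthest_first_centers(points, k)
    assignment = [0] * len(points)
    for _ in range(iterations):
        assignment = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment

def active_cluster_learning(points, k, oracle, seed=0):
    """Query one randomly chosen point per cluster; return the labeled training set TD."""
    rng = random.Random(seed)
    assignment = kmeans(points, k)
    training = []
    for c in range(k):
        members = [p for p, a in zip(points, assignment) if a == c]
        if members:
            x = rng.choice(members)          # random element of cluster C_c
            training.append((x, oracle(x)))  # ask for its label and add it to TD
    return training
```

With two well-separated groups of points and k = 2, the sketch issues only two label queries, one per cluster, instead of one per data point.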
A more specific idea is given in figure 5. The original data are
partitioned into homogeneous groups with almost similar
features (only non-overlapping clusters are considered).
Algorithm:
The general algorithm may be expressed as follows:
Input: unlabeled data points D
Output: reduced number of data points to be labeled, i.e. k
Step 1 // Initialize the size of the training data TD
$|TD| = 0$
Step 2 // Partition the data set into distinguishable subsets
$D = C_1 \cup C_2 \cup \dots \cup C_k$ such that $C_i \cap C_j = \emptyset$ for $i \neq j$
Step 3 // Randomly select an element of each distinct set to be labeled
$x_{i,j} \in C_i$, where i = 1 to k and j ranges over the elements of cluster $C_i$
Step 4 // Ask for the label of the selected data point from the cluster
$\{x_{i,j}, ?\}$
Step 5 // Add the newly labeled data point to the training data set
$TD \leftarrow TD \cup \{x_{i,j}\}$
$|TD| \leftarrow |TD| + 1$
Remarks:
Computational complexity: The proposed approach has low
computational complexity. Under the naive approach, in which
data points are chosen at random, O(n) training data are
required, whereas the suggested method reduces this to some
extent. Suppose the number of clusters generated is k. Then it
takes O(k) time to select one element from each cluster to be
labeled.
VI. CONCLUSION
In this paper, we propose a purely theoretical overview of the
possibility of grouping a large amount of unlabeled data into
distinct groups having similar features, such that the search
effort is reduced because elements belonging to the same
cluster are likely to share the same label. This concept can be
used in data mining processes that are part of machine
learning, where a large amount of data is available to deal
with.
References
[1] Nathalie Japkowicz and Mohak Shah. Evaluating Learning Algorithms: A Classification Perspective. Cambridge University Press, March 2011.
[2] Mitchell, T. M. Machine Learning. McGraw Hill, 1997.
[3] Burr Settles. Active Learning Literature Survey. Computer Sciences Technical Report 1648,V
[4] Michalski, R. S., Bratko, I., & Kubat, M. (Eds.). Machine Learning and Data Mining: Methods and Applications. New York, NY: Wiley, 1998.
[5] Yi Wu, Rong Zhang, Alexander Rudnicky. Data Selection for Speech Recognition. Language Technologies Institute, Carnegie Mellon University. Proceedings of ASRU, Kyoto, Japan, December 2007.
[6] P. E. N. Lutu, A. P. Engelbrecht. A Comparative Study of Sample Selection Methods for Classification. Department of Informatics, University of Pretoria, South Africa. South African Computer Journal, Vol. 36, pp. 69-85, 2006.
[7] Dietterich, T. Overfitting and undercomputing in machine learning. ACM Computing Surveys, Vol. 27, No. 3, pp. 326-327, 1995.
[8] Seeger. Learning with labeled and unlabeled data (Technical Report). Institute for Adaptive and Neural Computation, University of Edinburgh, Edinburgh, United Kingdom, pp. 609-616, 2001.
[9] Ninan Sajeeth Philip. What is there in Training sample? In Proceedings of IEEE conference, pages 1507-1511, March 2009.
Ines Rehbein, Josef Ruppenhofer. There’s no Data like more Data? Revisiting the impact of Data Size on Classification Task. Saarland University, Germany, October 2010.
[10] Buntine, W. Learning classification trees. Statistics and Computing, 1992.
[11] Hand, D. J. Construction and assessment of classification rules. Chichester, England: John Wiley & Sons, 1997.
[12] Shavlik, J. W. & Dietterich, T. G. (Eds.). Readings in machine learning. San Mateo, CA: Morgan Kaufmann, 1990.
[13] Spath, H. Cluster Analysis Algorithms for Data Reduction and Classification. Ellis Horwood, Chichester, UK, 1980.
[14] Blaž Novak. Use of Unlabeled Data for Supervised Machine Learning. Department of Knowledge Technologies, Jozef Stefan Institute.
[15] Vapnik, V. Statistical Learning Theory. John Wiley, New York, 1998.
[16] Olivier Bousquet, Stéphane Boucheron, and Gábor Lugosi. Introduction to Statistical Learning Theory.
Hierarchal Structure of Community and Link Analysis
Seema Mishra and G.C Nandi
Indian Institute of Information Technology, Allahabad, India
{seema.mishra.phd,gcnandi}@gmail.com
Abstract. Discovering the hierarchy of organizational structure in a dynamic
social network can unveil significant patterns that help in network analysis.
In this paper, we formulate a methodology to establish the most influential
person in a temporal communication network from the perspective of frequency
of interactions, working on a hierarchical structure. Using the frequency of
interactions, we calculate the individual score of each person with the
PageRank algorithm. Subsequently, a graph is generated that shows the influence
of each individual in the network. We performed experiments on the Enron
data set to establish that the proposed methodology correctly identifies the
influential persons over the temporal network. The Enron email data set
describes how employees of the company interacted with each other. The findings
of our methodology can be verified against facts about the company, since after
its bankruptcy the interactions and behaviors of the individuals in the network
became well known. Our results indicate that the proposed methodology is generic
and can be applied to other communication data sets to identify influential
persons at particular instants.
Keywords: Dynamic social network analysis, social network analysis,
hierarchal structure.
1 Introduction
A network structure provides a formal way of representing data that emphasizes
the associations between entities. This representation is important because it
gives insight into the data: many entities are interconnected these days, and
the behavior of an individual entity reflects the function of the whole system
to a large extent. The entities could be people, organizations [15], or
computer nodes [16]. Networks are primarily studied in the mathematical
framework of graphs [1].
In the modern era, social network analysis is a proliferating area of research;
it has been in existence for quite some time and is experiencing a surge in
popularity as a way to understand the behavior of users at the individual and
group level [Wasserman & Faust, 1994, Wellman, 1996]. Understanding individual
behavior with social networking methods helps analysts reveal hidden patterns
in social communication. To model a social network mathematically, the most
popular data structure, the graph, is used: nodes depict individuals, groups of
persons, events or organizations, and each link/edge represents a
connection/relationship between two individuals. Social network analysis
attempts to understand the network and its components, namely nodes (social
entities, commonly known as actors or events) and connections
(interconnections, ties, and links). Its main focus is on individuals and the
interdependent relationships among them, rather than on individuals and their
attributes as in conventional data structures.
A. Agrawal et al. (Eds.): IITM 2013, CCIS 276, pp. 252–260, 2013.
© Springer-Verlag Berlin Heidelberg 2013
The target of this paper is to propose a hierarchical structure of a social
network that changes with time.
Types of social network
Sociometric: It involves the entire population and focuses on the global
structural pattern of the social network.
Egocentric: It focuses on individual interaction patterns for analyzing the
social network. A limitation of this kind of analysis is that case-by-case data
are very difficult to collect.
2 Dynamic Social Network Analysis
The versatile power of social networks is being applied to mining patterns of
social interaction in wide-ranging applications, including disease modeling [4],
information transmission and behavior analysis [2, 3, 16], and business
management. Network analysis came into the picture through its practical
applications in intelligence and surveillance [5], and it has become a popular
paradigm for uncovering anti-social networks such as criminal, terrorist and
fraud networks, particularly after the tragic events of September 11th, 2001.
Traditional social network analysis is combined with several computational
techniques, such as artificial intelligence, machine learning and data mining,
to develop empirical research on human, group and organizational behavior and
on the links among them under varying levels of uncertainty.
There are two levels of analysis in dynamic network analysis (DNA). The first
focuses on relational data, i.e., data about links between groups of people,
events, locations and organizations. Identifying associations between these
entities is a crucial part of unveiling different types of activities in order
to discover knowledge about the network. The second focuses on the dynamism of
the relations, i.e., how these relations are likely to change in the future and
what the interesting consequences of these changes are for the system.
Social interaction can take any form, depending on the type of data available
[6]. It might be verbal or written communication (cell phones, emails, blogs,
chat), scientific collaboration (co-authorship networks, citation networks),
browsed websites, or groups of animals. The mathematical network model is very
successful in the analysis of social networks, but its major drawback is that
it may miss the temporal aspect of interaction, because social interaction is
inherently dynamic. A static model of interaction can give inaccurate
information, and a decision based merely on this information might lead the
analyst in a faulty direction.
Several shortcomings can be highlighted when dealing with a static model of a
social network, which can prevent recognition of the causal relationships in
patterns of social interaction [6]. For example, a static model cannot readily
answer questions such as:
• What is the rate at which a disease spreads, and who is the central person
who should be vaccinated to control its spread among a group of persons?
• What are the causes and consequences of social structure evolution?
Dynamic social network analysis is an emerging research area that plays a
crucial role in filling the gap between traditional social network analysis and
the time domain. The dynamic study of networks includes classical network
analysis, link analysis, and multi-agent systems.
Dynamic network analysis facilitates the analysis of multiple types of nodes
(multi-mode) and multiple types of links (multi-plex) simultaneously. In
contrast, static network analysis can handle at most two-mode data and analyzes
one type of link at a time. A dynamic network has several characteristics:
• The nodes are dynamic; their properties change with respect to time.
• It deals with meta-networks.
• Network evolution is a consequence of agent-based modeling.
Furthermore, network analysis exists at four levels [15]: attribute oriented
analysis, position oriented analysis, structure oriented analysis, and dynamic
network analysis. Fig. 1 shows the categorization. Attribute analysis captures
the properties of vertices and edges and finds their causal relation with the
structure of the network.
[Figure: attribute oriented analysis, position oriented analysis, structure oriented analysis (t1, t2 ... tn), and dynamic social network analysis]
Fig. 1. Level of network analysis
Position oriented analysis investigates the micro level of the network. It
looks into every single entity, i.e. a node or organization, and its
characteristics in the network. Structure analysis is a macro-level analysis of
the network and investigates the average metrics of the social network. Dynamic
analysis can utilize all the measures of the aforementioned analyses in order
to identify the evolving behavior of the network and of individuals.
ROLE OF DATA MINING IN SOCIAL NETWORK
Social computing can make use of data mining techniques in the following
analyses:
• Community Detection
• Classification
• Link Prediction
• Network Modeling
COMMUNITY IN SOCIAL NETWORK
A community is a sub-part of the whole network within which interaction
(intra-community) is relatively more frequent and stronger than interaction
with the rest of the network (inter-community). It can take any form, for
example a group, subgroup, or cluster. Examples include: a) a citation network
that represents related papers on a single topic, and b) web pages on related
topics.
Community detection is a classical problem in social network analysis.
Commonly, it can be posed as the problem of identifying sub-graphs of the
original graph, called vertex sparsifiers [12]. These small networks retain the
relevant information of the original graph. Four levels of analysis are
conducted in community identification, as shown in Fig. 1.
ANALYSIS OF PREVIOUS RESEARCH
Hierarchical methods for community detection fall into two categories:
agglomerative and divisive. In the former, each node is initially assumed to be
a community, and communities are repeatedly grouped together. In the latter,
the whole network is initially considered one community and is subsequently
divided into smaller ones. Most methods are based on graph clustering and
partitioning. Distance-based structural equivalence [7] uses distance metrics
to identify similar entities. Several algorithms have been proposed for graph
partitioning [8, 9]. The Newman-Girvan method and spectral clustering methods
[10, 11] use a notion of modularity and the edge betweenness metric to divide
the network into groups.
In analyzing dynamic patterns, many methods use temporal snapshots of the
interactions over time [13, 14].
3 Discovering Hierarchy of Group
Before analyzing the community hierarchy we define some basic terminology.
The hierarchy of a community reflects the power of each individual in a group.
If this chain is known, the leader of the group can be found. For this purpose
we used the well-known PageRank algorithm, which calculates the individual
score I_score of each person to represent the importance of that person. The
higher the score of a person, the more powerful the person is.
Definition: Let $P = \{p_1, p_2, \dots, p_n\}$ be the collection of persons
involved in communication. For any members $p_i$ and $p_j$, if
$\mathrm{I\_score}(p_i) \geq \mathrm{I\_score}(p_j)$ then $p_i$ can be the
leader of $p_j$.
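The definition induces an ordering of the group by score, which can be sketched in Python as below. The dictionary of scores and the person identifiers are hypothetical values chosen for illustration, not results from the paper.

```python
def rank_hierarchy(scores):
    """Order persons from highest to lowest I_score (ties broken by name)."""
    return sorted(scores, key=lambda person: (-scores[person], person))

def can_lead(scores, a, b):
    """True if person a can be the leader of person b under the definition."""
    return scores[a] >= scores[b]

# Hypothetical scores for three persons.
example_scores = {"p1": 0.2, "p2": 0.5, "p3": 0.3}
chain = rank_hierarchy(example_scores)  # first element is the candidate group leader
```

The first element of `chain` is the person with the highest score, i.e. the candidate leader of the whole group.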
3.1 Proposed Model
Figure 2 shows the architecture of the proposed system.
[Figure: blocks labeled Enron email data; Preprocessing; Extracting names of distinct users; Socio matrix for the twelve months (January, February, March, ..., December); Graph of twelve months]
Fig. 2. Model for extracting graph over months
First, the individual score of each member is calculated with the PageRank
algorithm described below.
Input: Social Network G.
Output: Individual score of each person
Steps:
1. d = 0.85
2. epi = 0.01
3. del = 0, delta[150]
4. del_prev = 0
5. float sum1 = 0
6. float sum2 = 0
7. float e = 0
8. do loop:
9.     del_prev = del
10.    iter++
11.    for i = 1 to 150
12.        R[1][i] = ((1 - d) / N) + d*fun(i)
13.    for j = 0 to 150
14.        sum2 += R[1][j]
15.        sum1 += R[0][j]
16.    e = sum1 - sum2
17.    for j = 1 to 150
18.        R[1][j] = R[1][j] + e*L[j]/sum_all
19.    for j = 1 to 150
20.        sum2 += R[1][j] - R[0][j]
21.    del = sum2
22.    for j = 1 to 150
23.        R[0][j] = R[1][j]
24. while (epi > del && del_prev != del)
25. for i = 0 to 150
26.     R[0][i] = 10000.0*R[0][i]
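As a hedged illustration of the kind of iteration listed above, the following Python sketch computes weighted PageRank scores on a symmetric socio matrix. It is not the authors' DEV C++ implementation: the function name `weighted_pagerank`, the tolerance, the absolute-difference stopping test and the use of total node strength for redistributing lost score are all assumptions made so that the sketch is self-contained and the scores sum to one.

```python
def weighted_pagerank(weights, d=0.85, tol=1e-10, max_iter=500):
    """weights[i][j] is the interaction weight between persons i and j (symmetric)."""
    n = len(weights)
    strength = [sum(row) for row in weights]  # sum of edge weights at each person
    total = sum(strength)
    rank = [1.0 / n] * n
    for _ in range(max_iter):
        new_rank = []
        for i in range(n):
            # Score flowing into i from each neighbor j, shared in proportion to edge weight.
            incoming = sum(
                weights[i][j] * rank[j] / strength[j]
                for j in range(n) if strength[j] > 0
            )
            new_rank.append((1 - d) / n + d * incoming)
        # Score lost through isolated persons is redistributed (assumed rule).
        lost = sum(rank) - sum(new_rank)
        if total > 0:
            new_rank = [r + lost * s / total for r, s in zip(new_rank, strength)]
        else:
            new_rank = [r + lost / n for r in new_rank]
        delta = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if delta < tol:  # stop once the scores no longer change noticeably
            break
    return rank
```

For a three-person network in which person 0 exchanges many messages with person 1 and a few with person 2, person 0 receives the highest score, followed by person 1 and then person 2.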
In I_score(), d is the damping factor, generally assumed to be around 0.85. R
is the vector that stores the individual score of each person. N denotes the
total number of persons in the network. sum_all is the sum of the weights of
all edges. fun() is defined as
$$\mathrm{fun}(i) = \sum_{p_j \in M(p_i)} \frac{W(p_{ij})\, P_k(j)}{L(p_j)}$$
where $L(p_j)$ is the sum of the weights of all edges linked with $p_j$ and
$W(p_{ij})$ is the weight of the edge linking $p_i$ and $p_j$.
4 Practical Implementation and Analysis
In this section, we evaluate the capability of the proposed approach to
discover organizational structure and to explore the evolution of
organizational structure, and of the links between individuals, in a dynamic
social network. We performed the experiments on the Enron email data set. Email
communication data have become a practical source for research in network
analysis, including social network analysis. Experiments are mostly carried out
on artificial data because real-life communication data are rarely available.
The Enron email data set [17] has become a benchmark for this research domain.
The data set was made public and posted on the web by the Federal Energy
Regulatory Commission during its investigation of fraud at the company, and it
has since served as a test bed for validating the efficacy of methodologies
developed for counter-terrorism, fraud detection and link analysis. The data
cover the communication of about 150 users, mostly senior managers, organized
into folders. The set still has many issues, however, such as integrity
problems and duplicate messages. For preprocessing, the names of distinct users
were first extracted and duplicated ids were discarded.
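The preprocessing step can be sketched in Python as follows. The record layout `(month, sender, recipient)` and the function name `monthly_socio_matrices` are assumptions for illustration; the actual Enron folders would first have to be parsed into such records.

```python
def monthly_socio_matrices(emails):
    """Build one symmetric socio matrix of interaction counts per month.

    emails: iterable of (month, sender, recipient) records (assumed layout).
    Returns (users, matrices), where matrices[month][i][j] counts the messages
    exchanged between users[i] and users[j] in that month.
    """
    emails = list(emails)
    # Distinct user names; using a set discards duplicated ids.
    users = sorted({u for _, sender, recipient in emails for u in (sender, recipient)})
    index = {u: i for i, u in enumerate(users)}
    n = len(users)
    matrices = {}
    for month, sender, recipient in emails:
        if sender == recipient:
            continue  # a message to oneself carries no interaction
        matrix = matrices.setdefault(month, [[0] * n for _ in range(n)])
        i, j = index[sender], index[recipient]
        matrix[i][j] += 1
        matrix[j][i] += 1
    return users, matrices
```

Each monthly matrix can then be treated as the weighted graph for that month, so that scores and communities can be compared from one month to the next.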
[Figure: bar chart of the number of communities (y-axis, 0 to 20) for each month from Jan to Dec (x-axis)]
Fig. 3. Number of communities over months
Fig. 4. Evolution of community from months Jan to Dec
The proposed approach is implemented in DEV C++. The experiments were
conducted on a 2.1 GHz PC with a Core(TM)2 Dual-Core Pentium 4 processor and
2 GB RAM.
On examining the results in figure 3, we found that the number of groups was
highest in the month of May. Figure 2 shows that Jim was the person who headed
the group from Jan to Feb. Monique was the leader throughout the months from
Jan to April.
5 Conclusion and Future Scope
In this paper we introduced the concept of a hierarchy of positions in a group,
using temporal interaction data covering twelve months in an organization, to
show how the positions and links of group members change as people join and
leave the group. This hierarchical group interaction is significant for
facilitating link analysis of individuals over time. In the future we plan to
address the integrity issues of the preprocessed Enron data and thereby improve
the experimental results.
References
[1] Hanneman, R.A., Riddle, M.: Introduction to social network methods. University of California, Riverside (2005), published in digital form, http://faculty.ucr.edu/~hanneman/
[2] Baumes, J., Goldberg, M., Magdon-Ismail, M., Wallace, W.: Discovering hidden Inform. (2004)
[3] Tyler, J., Wilkinson, D., Huberman, B.: Email as spectroscopy: Automated discovery of community structure within organizations. In: Proc. 1st Intl. Conf. on Comm. and Tech. (2003)
[4] Kretzschmar, M., Morris, M.: Measures of concurrency in networks and the spread of infectious disease. Math. Biosci. 133, 165–195 (1996)
[5] Baumes, J., Goldberg, M., Magdon-Ismail, M., Wallace, W.: Discovering hidden groups in communication networks. In: Proc. 2nd NSF/NIJ Symp. on Intel. and Security Inform. (2004)
[6] Berger-Wolf, T.Y., Saia, J.: A framework for analysis of dynamic social networks. In: Proc. KDD 2006, pp. 523–528 (2006)
[7] Fortunato, S., Castellano, C.: Springer’s Encyclopedia of Complexity and System Science. Community Structure in Graphs (2008)
[8] Chekuri, C., Goldberg, A., Karger, D., Levin, M., Stein, C.: Experimental study of minimum cut algorithms. In: Proc. 8th SIAM Symposium on Discrete Algorithms, pp. 324–333 (1997)
[9] Wu, A.Y., et al.: Mining scale-free networks using geodesic clustering. In: Proc. KDD 2004, pp. 719–724 (2004)
[10] Newman, M.E.J., Girvan, M.: Finding and evaluating community structure in networks. Phys. Rev. E 69, 026113 (2004)
[11] Newman, M.E.J.: Modularity and community structure in networks. PNAS 103(23), 8577–8582 (2006)
[12] Moitra, A.: Approximation algorithms for multi commodity type problems with guarantees independent of the graph size. FOCS, 3–12 (2009)
[13] Zhou, D., Councill, I., Zha, H., Lee Giles, C.: Discovering Temporal Communities from Social Network Documents. In: Proc. of ICDM 2007, pp. 745–750 (2007)
[14] Tantipathananandh, C., Berger-Wolf, T., Kempe, D.: A Framework For Community Identification in Dynamic Social Networks. In: Proc. of KDD 2007, pp. 717–726 (2007)
[15] Wasserman, S., Faust, K.: Social network analysis: Methods and applications. Cambridge University Press, New York (1994)
[16] Wellman, B.: Computer networks as social networks. Science Magazine 293, 2031–2034 (2001)
[17] The original dataset can be downloaded from William Cohen’s web page, http://www-2.cs.cmu.edu/~enron/