Homogeneous temporal activity patterns in a large online communication space

Andreas Kaltenbrunner


2007

The many-to-many social communication activity on the popular technology-news website Slashdot has been studied. We have concentrated on the dynamics of message production without considering semantic relations and have found regular temporal patterns in the reaction time of the community to a news-post as well as in single user behavior. The statistics of these activities follow log-normal distributions. Daily and weekly oscillatory cycles, which cause slight variations of this simple behavior, are identified. A superposition of two log-normal distributions can account for these variations. The findings are remarkable since the distribution of the number of comments per user, which is also analyzed, indicates a great amount of heterogeneity in the community. The reader may find it surprising that only a few parameters allow a detailed description, or even prediction, of social many-to-many information exchange in this kind of popular public space.

arXiv:0708.1579v1 [cs.NI] 12 Aug 2007

Andreas Kaltenbrunner, Vicenç Gómez, Ayman Moghnieh, Rodrigo Meza, Josep Blat, Vicente López

Departament de les Tecnologies de la Informació i les comunicacions, Universitat Pompeu Fabra, Passeig de Circumval·lació 8, 08003 Barcelona, Spain
Barcelona Media Centre d’Innovació, Ocata 1, 08003 Barcelona, Spain

Keywords: Social interaction, information diffusion, log-normal activity, heavy tails, Slashdot

1. Introduction

Nowadays, an important part of human activity leaves electronic traces in the form of server logs, e-mails, loan registers, credit card transactions, blogs, etc. This huge amount of generated data makes it possible to observe human behavior and communication patterns at nearly no cost, on a scale and dimension which would have been impossible some decades ago.
A considerable number of studies have emerged in recent years using some part of these data to investigate the time patterns of human activity. The studied temporal events are rather diverse and range from directory listings and file transfers (FTP requests) (Paxson and Floyd, 1995), job submissions on a supercomputer (Kleban and Clearwater, 2003) and arrival times of consecutive printing-job submissions (Harder and Paczuski, 2006), through trades in bond (Mainardi et al., 2000) or currency futures (Masoliver et al., 2003), to messages in Internet chat systems (Dewes et al., 2003), online games (Henderson and Bhatti, 2001), page downloads on a news site (Dezsö et al., 2006) and e-mails (Johansen, 2004). A common characteristic of these studies is that the observed probability distributions for the waiting or inter-event times are heavy tailed. In other words, if the response time ever exceeds a large value, then it is likely to exceed any larger value as well (Sigman, 1999). A recent study (Barabási, 2005) tries to explain this behavior under the assumption that these heavy tailed distributions can be well approximated by a power-law or at least by a power-law with an exponential cut-off (Newman, 2005). The cited study presents a model which seems to explain the distribution of e-mail response times and has been used later to account for the inter-event times of web-browsing, library loans, trade transactions and correspondence patterns of letters (Vázquez et al., 2006). However, the hypothesis of a power-law distribution is not generally accepted, at least in the case of e-mail response times. Stouffer et al. (2006) claim that the data can be fitted much better with either a log-normal (LN) distribution (Limpert et al., 2001) or the superposition of two LN distributions. This debate has been repeated across many areas of science for decades, as noted by Mitzenmacher (2003).
To the authors’ knowledge no study of this type has been performed on systems where social interaction occurs in a more complex manner than just person-to-person (one-to-one) communication. We think it is valuable to analyze the temporal patterns of the many-to-many social interaction on a technology-related news website which supports user participation. We have chosen Slashdot (see footnote 1), a popular website for people interested in reading about and discussing technology and its ramifications. It gave its name to the “Slashdot effect” (Adler, 1999), a huge influx of traffic to a hosted link during a short period of time, causing it to slow down or even to collapse temporarily. Slashdot was created at the end of 1997 and has since metamorphosed into a website that hosts a large interactive community capable of influencing public perceptions and awareness of the topics addressed. Its role can be metaphorically compared to that of commercial malls in developed markets, or hubs in intricate large networks. The site’s interaction consists of short-story posts that often carry fresh news and links to sources of information with more details. These posts incite many readers to comment on them and provoke discussions that may trail on for hours or even days. Most of the commentators register and comment under their nicknames, although a considerable number participate anonymously. Although Slashdot allows users to express their opinion freely, moderation and meta-moderation mechanisms are employed to judge comments and enable readers to filter them by quality. The moderation system was analyzed by Lampe and Resnick (2004), who concluded that it upholds the quality of discussions by discouraging spam and offensive comments, marking a difference between Slashdot and regular discussion forums. This high-quality social interaction has prompted several socio-analytical studies of Slashdot.
Poor (2005) and Baoill (2000) have both conducted independent inquiries into the extent to which the site represents an online public sphere as defined by Habermas (1989). Given that a great number of users with different interests and motivations participate in discussions about very different topics, one would expect to observe a high degree of heterogeneity on a site like Slashdot. However, what if the posts and comments were analyzed just as imprints of an occurring information exchange, with no regard to semantic aspects? Is there a homogeneous behavior pattern underlying heterogeneity?

To answer these and related questions we collected and studied one year’s worth of interchanged messages along with the associated meta-data from Slashdot. We show here that the temporal patterns of the comments provoked by a post are very similar, indicating that homogeneity is the rule, not the exception. The temporal patterns of the social activity are accurately fitted by log-normal distributions, thus giving empirical evidence for our hypothesis and establishing a link with previous studies where social interaction occurs in a simpler way. Finally, our analysis allows more insight into questions such as: is there a time-scale common to all discussions, or are they scale-free? What incites a user to write a comment: is it the relevance of the topic, or maybe just the hour of the day? Can we predict the amount of activity a post will trigger only a few minutes after it has been written? Which types of applications can we devise on the basis of these conclusions?

The rest of the article is organized as follows: In section 2 we briefly explain the process of data acquisition. We then present the results in section 3, providing first an overview of the global activity and then explaining our analysis in detail.

Footnote 1: http://www.slashdot.org
We finish the paper with section 4, where we discuss the results.

2. Methods

In this section we explain the methods used to crawl and analyze Slashdot. The crawled data (see footnote 2) correspond to posts and comments published between August 26th, 2005 and August 31st, 2006. We divided the crawling process into two stages. The first stage included crawling the main HTML (posts) and first-level comments, and the second stage covered all additional comment pages. Crawling all the data took 4.5 days and produced approximately 4.54 GB of data. Post-processing was necessary to remove duplicated comments (caused by an error of representation on the website). Although a large amount of information was extracted from the raw HTML (sub-domains, title, topics, hierarchical relations between comments), we concentrated only on a minimal amount of information: type of contribution (either post or comment), its identifier, the author’s identifier, and the timestamp or date of publishing. The selected information was extracted to XML files and imported into Matlab, where the statistical analysis was performed. Table 1 shows the main quantities of the crawling and the extracted data.

Table 1: Main quantities of crawling and retrieved data.

Period covered              26-8-05 to 31-8-06
Time needed for crawling    4.5 days
Amount of data mined        4.54 GB
Posts                       10016
Comments                    2075085
Commentators                93636
Anonymous comments          18.6%

Footnote 2: Software used: wget, Perl scripts, and Tidy on a GNU/Linux, Ubuntu 6.0.6 OS.

Figure 1: (a) Weekly and (b) daily activity cycles. Both panels plot normalized activity (mean and standard deviation of posts and of comments); panel (a) by day of the week, panel (b) by hour in EDT (Eastern Daylight Time).
The timestamps of posts and comments can be obtained from Slashdot with minute precision and correspond to the EDT time zone (= GMT−4 hours). They allow us to calculate the following two quantities:

The Post-Comment-Interval (PCI) stands for the difference between the timestamps of a comment and its corresponding post.

The Inter-Comment-Interval (ICI) refers to the difference between the timestamps of two consecutive comments of the same user (no matter what post he/she comments on).

3. Results

In this section we first give an overview of the global activity, looking at the data on different temporal scales and analyzing some relations between variables of interest. We then focus on the activity provoked by single posts and analyze the behavior of single users, concentrating on the most active ones.

3.1 Global cyclic activity

As previously explained, comments can be considered as reactions triggered by the publishing of posts. This difference in nature between both types of contributions justifies a separate analysis of their dynamics. Figure 1 shows (normalized) mean activity and standard deviations of both posts and comments. It illustrates patterns in agreement with the social activity outside the public sphere. Figure 1a shows regular, steady activity during working days which slows down during weekends. This weekly cycle is interleaved with daily oscillations, illustrated in Figure 1b. The daily activity cycle reaches its maximum at approximately 1pm and its minimum during the night between 3am and 4am. Although Slashdot is open to public access around the world, we see that its activity profile is clearly biased towards the American time-schedule. Interestingly, although post activity shows more fluctuations and higher standard deviations than comment activity, there is little discrepancy between their mean temporal profiles.
This difference in the deviations is not surprising given the greater number of comments (see Table 1). We notice that the standard deviations of the daily post- and commenting activities also show similar cyclic behavior (Figure 1b).

Figure 2: Histogram of the number of comments per post (bin-width = 10; inset shows the corresponding cdf). Plot annotations: median = 160, mean = 194.6231.

3.2 Post-induced activity

In this section we analyze the activity (comments) a post induces on the site. The histogram of Figure 2 gives an idea of the number of comments the posts receive. Note that half of the posts provoke more than 160 comments and some of them even trigger more than 1000. To analyze the time-distribution of these comments we study their post-comment intervals (PCIs).

3.2.1 Analysis of the activity generated by a single post

We are especially interested in the resulting probability distribution of all the PCIs of a certain post. This distribution gives the probability for a post to receive a comment t minutes after it has been published. Figures 3a and 3b show this distribution for a post which provoked 1341 comments. Although there are some important fluctuations, the characteristic shape of the probability density function (pdf) resembles an LN-distribution. This becomes even clearer if the cumulative probability distribution (cdf) is observed, since there the fluctuations of the pdf are averaged out. Figures 3c and 3d show a good fit of the PCI-cdf of the data with the cdf of the LN-distribution. To quantify the quality of the fit we have used a normalized error measure ε based on the ℓ1-norm (see Appendix B). For the post shown in Figure 3 we obtain ε = 0.007, meaning that the average error is below 1%.
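The PCI computation and the log-normal fit described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Matlab analysis used in the study: the parameters are estimated as the mean and standard deviation of the log-intervals, zero-minute intervals are skipped, and `mean_abs_cdf_error` is an assumed stand-in for the ℓ1-based measure ε, whose exact normalization is defined in Appendix B. The comment timestamps are hypothetical.

```python
import math
from datetime import datetime

def post_comment_intervals(post_time, comment_times):
    """PCIs in minutes: each comment timestamp minus the post timestamp."""
    return [(c - post_time).total_seconds() / 60.0 for c in comment_times]

def fit_lognormal(intervals):
    """Estimate (mu, sigma) as mean and standard deviation of the log-intervals.
    Zero-minute intervals are skipped because log(0) is undefined (assumption)."""
    logs = [math.log(t) for t in intervals if t > 0]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in logs) / len(logs))
    return mu, sigma

def lognormal_cdf(t, mu, sigma):
    """Probability that an interval is at most t minutes."""
    if t <= 0:
        return 0.0
    z = (math.log(t) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))

def mean_abs_cdf_error(intervals, mu, sigma):
    """Mean absolute gap between empirical and fitted cdf on a one-minute grid.
    Assumed stand-in for the l1-based error measure of Appendix B."""
    data = sorted(t for t in intervals if t > 0)
    n = len(data)
    last_minute = max(1, math.ceil(data[-1]))
    seen, total = 0, 0.0
    for minute in range(1, last_minute + 1):
        while seen < n and data[seen] <= minute:
            seen += 1
        total += abs(seen / n - lognormal_cdf(minute, mu, sigma))
    return total / last_minute

# Hypothetical comment times for a post published 2006-01-10 13:49.
post = datetime(2006, 1, 10, 13, 49)
comments = [datetime(2006, 1, 10, 13, 59), datetime(2006, 1, 10, 15, 49),
            datetime(2006, 1, 10, 16, 10), datetime(2006, 1, 11, 9, 0)]
pci = post_comment_intervals(post, comments)
mu, sigma = fit_lognormal(pci)
eps = mean_abs_cdf_error(pci, mu, sigma)
```

Comparing cdfs rather than pdfs mirrors the reasoning in the text: the cumulative curve averages out the minute-to-minute fluctuations of the density.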
The PCI-cdf of three more posts can be observed in Figure 4. The top two sub-figures show good fits, indicating that the PCI is well approximated even for a small number of comments. However, the fit is not that accurate for all posts. For example, the comments of the post shown in Figure 4 (bottom) start to show considerably different behavior from the expected LN-approximation about 3 hours after its publication. The activity is lower than predicted, but starts to increase again at about 6am the next day. At around 8:30pm it increases further to recover the lost activity during the night.

Figure 3: LN-approximation (dashed lines) of the PCI-distribution (solid lines and bars) of a post which received 1341 comments. (a) Comments per minute (bin-width = 2 for better visualization) for the first 200 minutes after the post has been published. (b) Same as (a) in logarithmic scale. (c) The cumulative distribution of the data shown in (a). Inset shows a zoom on the first 2000 minutes. (d) Same as (c) in logarithmic scale. Plot annotations: post-id 1829252, published 2006-01-10 13:49, median = 141 min, ε = 0.007.
More such oscillations of activity can be observed during the following days. The time-spans of variations in activity coincide quite exactly with the average daily activity cycle shown in Figure 1b. We analyze this coincidence further in the next section.

Figure 4: LN-approximation of the PCI-distribution of 3 different posts (number of comments to post versus time since the post was published, logarithmic time axis). Plot annotations: top left, post-id 1216245, published 2006-05-11 06:55, median = 150 min, ε = 0.010; top right, post-id 1547251, published 2006-04-01 12:02, median = 92 min, ε = 0.014; bottom, post-id 0152240, published 2005-10-15 22:35, median = 498 min, ε = 0.031, with time marks at 2005-10-16 01:30, 06:00 and 08:30 and at 2005-10-17 06:00 and 21:30.

3.2.2 Approximation quality

With the LN shape of the PCI-distribution identified, we focus on the quality of this approximation in general. We therefore calculate the error measure ε of the fit for all posts which received comments. The resulting distribution of ε can be seen in Figure 5a. For 87% of the posts the approximation error ε is lower than 0.05, and for 29% of them lower than 0.02.

If we take a closer look at the data, we notice a dependence of ε on the publishing-hour of a post (Figure 5b). The best fit is reached when the post is published between 6am and 11am. Then the mean error increases successively until 11pm to stay high during the night and recover again in the early morning. This behavior can be understood looking at the daily activity cycle (Figure 1b).
The less time the community has to comment on a post during the time-window of high activity, the greater is the need to comment on it the next time the high activity phase is reached, and hence the expected LN behavior is altered. Figure 4 (bottom) gives an example of such a late post (published at 10:35pm).

Figure 5: (a) Errors ε of the LN-approximation of the PCI-cdf (bin-width = 10^-3). Inset shows the corresponding cdf. (b) Dependence of mean and median of the approximation error ε on the hour the post is published.

3.2.3 Approximation with double log-normal distributions

We also approximate the data with a double log-normal distribution (DLN), i.e. a superposition of two LN-distributions (see Appendix A). To find their parameters and especially their mixing coefficient, we use maximum likelihood estimation (Stouffer et al., 2006; DeGroot and Schervish, 2002). The DLN should lead to better results in general and reduce the dependency on the circadian rhythm since it represents two waves of activity: one starting when the post is published and another being caused by the next increase of activity in the circadian cycle. An example of this behavior is shown in Figure 6, where we compare LN and DLN-approximation of the same post as used in Figure 4 (bottom). The red and blue lines indicate the two log-normals whose superposition results in a DLN (gray, dashed-dotted), which clearly outperforms the previous LN (black, dashed) approach. The error ε decreases from 0.031 to 0.009 and the approximation is much closer to the cdf of the data (black continuous line in Figures 6c and 6d).
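A double log-normal fit of this kind can be sketched as a two-component mixture on the logarithm of the intervals. The sketch below is an illustration under assumptions, not the procedure of Appendix A: it assumes the DLN has the form c·LN(µ1, σ1) + (1 − c)·LN(µ2, σ2), and it searches for the maximum likelihood estimate with a plain expectation-maximization (EM) loop, a standard technique swapped in here because the optimizer used in the study is not described in this section. The function names and the starting values are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def dln_cdf(t, c, mu1, sigma1, mu2, sigma2):
    """cdf of the assumed mixture c*LN(mu1, sigma1) + (1 - c)*LN(mu2, sigma2)."""
    if t <= 0:
        return 0.0
    x = math.log(t)
    return c * normal_cdf(x, mu1, sigma1) + (1.0 - c) * normal_cdf(x, mu2, sigma2)

def weighted_stats(xs, ws):
    """Weighted mean and standard deviation (floored to avoid a collapsing component)."""
    total = sum(ws)
    mean = sum(w * x for x, w in zip(xs, ws)) / total
    var = sum(w * (x - mean) ** 2 for x, w in zip(xs, ws)) / total
    return mean, max(math.sqrt(var), 1e-3)

def fit_dln(intervals, iterations=200):
    """EM search for the mixture parameters; returns (c, mu1, sigma1, mu2, sigma2)."""
    logs = sorted(math.log(t) for t in intervals if t > 0)
    n, half = len(logs), len(logs) // 2
    # Crude start (assumption): lower and upper half of the sorted log-intervals.
    mu1 = sum(logs[:half]) / half
    mu2 = sum(logs[half:]) / (n - half)
    sigma1 = sigma2 = max((logs[-1] - logs[0]) / 4.0, 1e-3)
    c = 0.5
    for _ in range(iterations):
        # E-step: probability that each interval belongs to the first wave.
        resp = []
        for x in logs:
            a = c * normal_pdf(x, mu1, sigma1)
            b = (1.0 - c) * normal_pdf(x, mu2, sigma2)
            resp.append(a / (a + b) if a + b > 0.0 else 0.5)
        w1 = sum(resp)
        if w1 < 1e-9 or n - w1 < 1e-9:
            break  # one component vanished: a single LN already describes the data
        # M-step: re-estimate both components and the mixing coefficient.
        mu1, sigma1 = weighted_stats(logs, resp)
        mu2, sigma2 = weighted_stats(logs, [1.0 - r for r in resp])
        c = w1 / n
    if c < 0.5:  # order the components so that c >= 0.5
        c, mu1, sigma1, mu2, sigma2 = 1.0 - c, mu2, sigma2, mu1, sigma1
    return c, mu1, sigma1, mu2, sigma2
```

EM only guarantees a local maximum of the likelihood, so in practice one would restart it from several initial values and keep the best result.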
We notice that the first 10 hours of activity are well approximated by a single LN-distribution (red line). Then the activity increases due to the high phase of the circadian cycle (compare also with the labels of Figure 4 bottom). The second LN-distribution (blue line) accounts for this increase, and therefore the DLN-approximation reflects the first bump in the PCI-cdf and fits the data well.

Figure 6: Comparison of LN and DLN-approximations (dashed-dotted lines) of the PCI-distribution (solid lines and bars) of a post which received 1567 comments. The DLN-distribution is a superposition of LN1 and LN2, which in the above figure are rescaled according to the coefficient c of the DLN. Rest of legend as in Figure 3. Plot annotations: post-id 0152240, published 2005-10-15 22:35, median = 498 min, ε LN = 0.031, ε DLN = 0.009; LN: µ = 5.89, σ = 1.55; DLN: µ1 = 5.64, σ1 = 1.68, c = 0.77, µ2 = 6.83, σ2 = 0.47.

To quantify the overall performance of a DLN-fit we apply it to all posts and plot the distribution of its approximation error ε in Figure 7a. The inset compares the error-cdfs of the DLN (continuous) and LN (dashed-dotted) approaches. We notice a significant improvement of the approximation quality. For example, the error of the DLN-fits is below 0.02 for more than 80% of the posts, compared to only 29% in the case of LN-approximations. Figure 7b shows only a minor dependency of the quality of the DLN-fits on the publishing hour of the post (compare with Figure 5b), which allows us to conclude that the DLN-distribution accounts for the major part of the aberration of the log-normal behavior caused by the circadian cycle.

Figure 7: (a) Errors ε of the DLN-approximation of the PCI-cdf (bin-width = 10^-3). Inset shows the corresponding cdf. (b) Dependence of mean and median of the approximation error ε on the hour the post is published.

3.2.4 Approximation parameters

For the cases where an LN-distribution leads to good results we can describe the activity triggered by a post with only two parameters: the median (see footnote 3) and the geometric standard deviation σg of the PCI-pdf, commonly used to compare log-normally distributed quantities (Limpert et al., 2001). The median and σg relate to the parameters of the LN-distribution in the following way:

    median = exp(µ),    σg = exp(σ).    (1)

Footnote 3: Note that the median coincides with the geometric mean for a log-normally distributed random variable.

Footnote 4: Instead of calculating σg directly from the data as in a previous version of this study (Kaltenbrunner et al., 2007b) we used equation (1) and the estimates of σ, which led to different results. Compare also with Limpert et al. (2001).

Figure 8: (a) Histograms of the estimates of medians (bin-width = 10) and geometric standard deviations (inset, bin-width = 0.1) of the PCI-distributions. (b) Parameters of LN and DLN-approximations. Bin-width = 0.1 for µ and σ, 0.01 for c (inset).

Figure 8a shows the distribution of these quantities for all posts (see footnote 4). The inset shows the distribution of σg, which is centered around 4.5 and has a standard deviation of 0.91. The median of the post-induced activity, on the other hand, shows more variation, but is rather short (for 50% of the posts it is below 2.5 hours, for 90% below 6 hours) compared to the maximum PCI (approx. 12 days). We can thus conclude that although the total activity a post generates covers a large time interval, the major part of the activity happens within the first few hours after the post’s publication. If we use a DLN-distribution to approximate the data we need five parameters.
Their distributions together with those of the parameters σ and µ of the LN-approximation are displayed in Figure 8b. For better visualization we choose a stair plot instead of a bar-graph. Clearly the regions of µ1 (continuous line with circles) and σ1 (continuous line) are very similar to those of the parameters of LN-approximations (dashed-dotted lines), indicating that the first one of the two log-normal distributions used to generate the DLN is similar to the LN-approximations. The parameters µ2 and σ2, on the other hand, show an interesting bimodal behavior. One of the two peaks of the distribution falls within the regions of µ1 or σ1 respectively. Those cases correspond to posts for which the two superposed log-normal distributions are very similar and the data are already fitted well by a single LN-distribution. The second peak in the µ2-distribution represents those posts which provoke a second wave of activity due to the circadian cycle. In those cases the parameter σ2 is usually smaller than σ1. The inset of Figure 8b shows the mixing parameter c, which is nearly uniformly distributed, although values in [0.7, 1] are slightly more likely than lower ones. We sorted the parameters to ensure a value of c ≥ 0.5.

Figure 9: (a) Histogram of the number of comments per user and (b) its corresponding cdf. Legend: power-law (MLE fit γ = −1.5); truncated log-normal (µ = 0.8, σ = 2.0); data.

3.3 User dynamics

In this section we analyze the activity on Slashdot taking the authorship of the comments into account. We first study the distribution of activity among all the users participating in the debates and then focus on the temporal activity patterns of single users.
3.3.1 Global user activity

The activity of all users is best illustrated by the distribution of the number of comments per user. It is shown in double-logarithmic scale in Figure 9a. The obtained distribution follows a straight line quite closely, suggesting a power-law probability distribution governing this relation. We note that 53% of the users write 3 or fewer comments whereas only 93 users (0.1%) write more than 1000 comments. Indeed, after applying linear regression as in other studies (Faloutsos et al., 1999; Albert et al., 1999) we obtain a quite large correlation coefficient R2 = −0.97 for an exponent of γ = −1.79.

However, if we apply rigorous statistical analysis as proposed by Goldstein et al. (2004) the picture changes. First, we estimate the power-law exponent by computing the less biased maximum likelihood estimator (MLE). The resulting exponent γ = −1.5 differs significantly from the previous one and is illustrated in Figure 9 (dashed line). Although Figure 9a tempts one to accept the power-law hypothesis, the cdf shown in Figure 9b rules it out. It is thus not surprising that the Kolmogorov-Smirnov test forces us to reject the power-law hypothesis with statistical significance at the 0.1% level.

As an alternative hypothesis to describe the data we propose a truncated LN probability distribution, shown in Figure 9 as a solid grey line. Its parameters are found using the MLE. Clearly, the fit is better using this hypothesis. We remark that in many studies some data points (considered outliers) are discarded to improve the power-law fit. Here, in contrast, the truncated LN-approximation can characterize the entire data-set.

3.3.2 Single user dynamics

After characterizing the user activity at a general level, we investigate the temporal behavior patterns of single users. The analysis concentrates on the two most active users (to protect their privacy we call them user1 and user2).
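The power-law check described in section 3.3.1 can be illustrated with a short sketch. It assumes a discrete power law p(k) = k^(−alpha) / ζ(alpha) for k = 1, 2, ..., where alpha is the positive exponent (the text reports the slope, γ = −alpha). The helper names, the ternary search, and the truncated zeta approximation are choices made for this illustration and not necessarily the procedure of Goldstein et al. (2004). The sketch returns only the Kolmogorov-Smirnov distance; turning it into a significance level requires the tables or simulations discussed in that reference.

```python
import math
from collections import Counter

def zeta(s, terms=1000):
    """Riemann zeta for s > 1: partial sum plus an Euler-Maclaurin tail correction."""
    head = sum(k ** -s for k in range(1, terms))
    return head + terms ** (1.0 - s) / (s - 1.0) + 0.5 * terms ** -s

def fit_power_law(counts, lo=1.01, hi=5.0, iterations=60):
    """MLE of alpha for p(k) = k**-alpha / zeta(alpha), found by ternary search.
    The log-likelihood is concave in alpha, so the search converges to its maximum."""
    n = len(counts)
    sum_log = sum(math.log(k) for k in counts)

    def log_likelihood(alpha):
        return -n * math.log(zeta(alpha)) - alpha * sum_log

    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_likelihood(m1) < log_likelihood(m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)

def ks_distance(counts, alpha):
    """Largest gap between the empirical cdf and the fitted power-law cdf."""
    freq = Counter(counts)
    n = len(counts)
    norm = zeta(alpha)
    empirical = model = worst = 0.0
    for k in range(1, max(counts) + 1):
        empirical += freq.get(k, 0) / n
        model += k ** -alpha / norm
        worst = max(worst, abs(empirical - model))
    return worst
```

Here `counts` would hold one entry per user, the number of comments that user wrote. A large distance relative to the sample size is what leads to rejecting the power-law hypothesis, even when a straight line on a double-logarithmic plot looks convincing.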
Table 2 shows the number of commented posts and the total number of comments these two users published during the time-span covered by our data.

Table 2: Contributions of the two most active users.

                    user1    user2
commented posts     1189     1306
comments            3642     3350

We focus on the distribution of the PCIs of all of their comments as well as on their inter-comment-interval (ICI) distribution, i.e. the time-difference between two comments of the same user. We also approximate the PCI-cdf (gray lines in Figure 10a) with LN (dashed and dashed-dotted lines) and DLN-distributions (blue and red lines with box and circle markers). The quality of the LN-fit is worse than in the case of the post-induced comment activity, but the DLN-distribution is a good explanation of the data with a small approximation error ε. Again we notice a clear dependence of the quality of the fit on the activity cycle (shown in the insets of Figure 10a). The approximation is much better for user1, whose daily and especially weekly activity cycles are much more balanced than those of user2. The activity of the latter user concentrates almost exclusively on the working hours from Monday to Friday. Hence his PCI-distribution shows a clear decrease after 8 hours but increases again after 16 hours. This increase is less pronounced if only the first comment to a post is considered (data not shown), indicating that the user frequently rechecks the posts he commented on the day before to participate again in an ongoing discussion.

The same effect can be observed in their ICIs, which are illustrated in Figure 10b. There the cdf (inset of Figure 10b) of user2 shows an even more pronounced increase around an ICI of 16 hours. We further observe that the ICI-pdf peaks at 3 minutes for both users as well as for the whole population. This is probably caused by an anti-troll filter (Malda, 2002), which should prevent a user from commenting more than once within 120 seconds.
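The inter-comment intervals used above follow directly from the definition in section 2. A minimal sketch, assuming the comments are available as (user identifier, timestamp) pairs; the function names and the example timestamps are hypothetical:

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

def inter_comment_intervals(timestamps):
    """ICIs in minutes between consecutive comments of one user,
    regardless of which post each comment belongs to."""
    ordered = sorted(timestamps)
    return [(b - a).total_seconds() / 60.0 for a, b in zip(ordered, ordered[1:])]

def intervals_by_user(comments):
    """comments: iterable of (user_id, timestamp) pairs -> {user_id: [ICI, ...]}."""
    by_user = defaultdict(list)
    for user, timestamp in comments:
        by_user[user].append(timestamp)
    return {user: inter_comment_intervals(ts) for user, ts in by_user.items()}

# Hypothetical comment times of a single user.
times = [datetime(2006, 1, 10, 9, 0), datetime(2006, 1, 10, 9, 3),
         datetime(2006, 1, 10, 9, 10), datetime(2006, 1, 11, 1, 10)]
ici = inter_comment_intervals(times)
typical = median(ici)
```

In this toy example the last interval spans 16 hours, the kind of overnight gap that produces the second rise in the ICI-cdf of a user who comments mainly during working hours.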
The medians of the ICI-distributions of user1 and user2 are rather short (11 and 7 minutes respectively) compared to the median of the whole population (about 17 hours), indicating that the two users engage in discussions frequently during their activity phase.

Figure 10: Activity patterns of the two most active users: (a) PCI-distributions; insets show daily and weekly activity cycles. (b) Distribution of the inter-comment intervals (ICI) compared with the whole population (dashed line). Plot annotations: (a) user1: median = 317 min, LN ε = 0.058, DLN ε = 0.006; user2: median = 179 min, LN ε = 0.076, DLN ε = 0.010; (b) user1: median = 11 min; user2: median = 7 min; all users: median = 1039 min.

4. Discussion

The special architecture of the technology-related news website Slashdot allowed us to analyze the temporal communication patterns of an online society without considering semantic aspects. The site activity is driven by news-posts which provoke communication activity in the form of comments. Despite the great number of users participating in the discussions, close to 10^5 in the data we have studied, and the diversity of themes (games, politics, science, books, etc.), some simple patterns can be identified, which repeat themselves over and over again. One of these patterns appears in the shape of the distribution of time differences between a post and its comments (the PCIs).
It can be well approximated by a log-normal distribution (Figures 3 and 4) for most of the posts. The only remarkable deviations from these approximations are caused by oscillatory daily and weekly activity patterns (Figure 1), which become less noticeable if a post is published early in the morning (Figure 5a).

A significant improvement of the approximation can be achieved using a superposition of two log-normal distributions. Such a double log-normal accounts for the first oscillation caused by the circadian cycle. It can be interpreted as two independent waves of activity, one starting directly after a post has been published, and the second at the next increase of activity due to the circadian rhythm. Although more such oscillations may occur during the life-time of a post, their amplitude is low compared to the first one, suggesting that a combination of more than two LN-distributions would only increase the complexity of parameter-finding (via MLE) without significantly improving the approximation quality. Nevertheless, a combination of a DLN-distribution with an oscillatory function emulating the circadian cycle leads to slightly better results (Kaltenbrunner et al., 2007a), without affecting the complexity of MLE.

In single user behavior a similar pattern appears in the PCI-distribution of all of the comments a user writes to several posts (Figure 10a). Again deviations are caused by the circadian cycle. Another interesting pattern can be observed when analyzing the ICI of single users, i.e. the time-span between two consecutive comments of a certain user. In the case of the two most active users (Figure 10b) the ICI-distributions are very similar, which further supports our hypothesis of the existence of homogeneous temporal patterns on Slashdot.

We would expect that the time-spans between publishing and reading of a post also follow log-normal patterns.
This could be easily verified by checking the server logs of Slashdot or the access-times of an external homepage linked by a Slashdot post. Such a study has been performed to show the Slashdot effect (Adler, 1999), but the scale of the data presented does not allow significant conclusions to be drawn. Further investigation is needed to verify this claim.

Log-normal temporal patterns similar to those described above were found in person-to-person communication by Stouffer et al. (2006), who investigated the waiting and inter-event times of an e-mail activity dataset. A second coincidence between their study and our findings is that the number of comments (or e-mails in their case) can be well approximated by the same distribution (a truncated log-normal in this case). The temporal patterns of the e-mail data were previously claimed to show power-law behavior, which would be explained by a queuing model (Barabási, 2005). Although this model might allow insight into other types of human activity (Vázquez et al., 2006), it is not able to account for the observed log-normal behavior patterns. We therefore hope to encourage further research towards a theoretical understanding of the underlying phenomena responsible for this apparently quite general human behavior pattern.

The medians (Figure 8) of the PCI-distributions are very small compared to the overall duration of the activity provoked by a post. Although the posts might be available for commenting for more than 10 days, the first few hours decide whether they will become highly debated or just receive some sporadic comments. We would therefore expect that the simplicity of the approximation together with the high initial activity should make an accurate prediction of the expected user behavior feasible at an early phase after a post has been put online. The accuracy of such forecasting methods is the subject of current research (Kaltenbrunner et al., 2007a).
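As a minimal sketch of why so few parameters suffice, assume the PCIs of a post are available as a list of positive times in minutes. For a single log-normal, the maximum-likelihood estimates of µ and σ are the mean and the (1/n) standard deviation of the log-times, and the median of the fitted distribution is exp(µ). The names `fit_lognormal`, `lognormal_median` and `pcis_min` and the sample values are hypothetical; the DLN fit discussed in the text has no closed-form estimate and is not shown here.

```python
import math


def fit_lognormal(times_min):
    """Maximum-likelihood estimate of (mu, sigma) for a single log-normal.

    All times must be strictly positive.
    """
    logs = [math.log(t) for t in times_min]
    mu = sum(logs) / len(logs)
    # The MLE of the variance divides by n, not by n - 1.
    sigma = math.sqrt(sum((x - mu) ** 2 for x in logs) / len(logs))
    return mu, sigma


def lognormal_median(mu):
    """Median of a log-normal distribution with location parameter mu."""
    return math.exp(mu)


# Hypothetical PCIs: minutes elapsed between a post and each of its comments.
pcis_min = [5, 12, 30, 45, 90, 240, 600]
mu, sigma = fit_lognormal(pcis_min)
print(f"mu={mu:.3f}, sigma={sigma:.3f}, median={lognormal_median(mu):.1f} min")
```

In an early-prediction setting one would apply such a fit only to the comments observed so far; how accurate that is remains an open question in the text above.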
An early characterization of the activity triggered by a post could be applied, for instance, to dynamic pricing or placement of online advertisements or to the improvement of online marketing. The success of a campaign might be predicted after only a short time-period, thus allowing an early adaptation of the strategy of information diffusion. In this context the viral marketing concept (Leskovec et al., 2006), which relies on personal communication, might be the most promising field.

In our opinion, the regular communication activity patterns described in this work may be relevant in two aspects. The first, simpler one is related to applications where a better understanding of information trade on the web translates easily into a better description, and even quantification, of Internet audience. The second, more complex aspect is related to the human “communicative” behavior that present-day Internet-based communication capabilities are uncovering. We face a new, large-scale, all-to-all public space in which a novel kind of social behavior arises, a scenario that we do not yet fully understand. However, we should not forget that the new activity is being largely recorded and the data can be made available for research. The work presented in this contribution is a good example of how those data can be collected and analyzed to give, at least, a quantitative description of the behavior. This is a first step towards a more ambitious target: to develop “ab initio” models for the population dynamics of message interchange, which is also the goal of our current research.

Acknowledgments

This work has been partially funded by Càtedra Telefónica de Producció Multimèdia de la Universitat Pompeu Fabra.

Appendix A.
Log-normal and double log-normal distributions

The following two probability distributions have been used in this article. The first is a log-normal (LN) distribution, which has the following probability density function (pdf):

\[ f_{LN}(t; \mu, \sigma) = \frac{1}{t \sigma \sqrt{2\pi}} \exp\left( \frac{-(\ln(t) - \mu)^2}{2\sigma^2} \right) \tag{2} \]

Its cumulative distribution function (cdf) is given by:

\[ F_{LN}(t; \mu, \sigma) = \frac{1}{2} + \frac{1}{2} \operatorname{erf}\left( \frac{\ln(t) - \mu}{\sqrt{2}\,\sigma} \right), \tag{3} \]

where erf(x) is the Gauss error function, defined as

\[ \operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x \exp(-u^2)\, du. \tag{4} \]

The second is a double log-normal (DLN) distribution, which is a superposition of two independent LN-distributions and has the following pdf:

\[ f_{DLN}(t; \theta) = c\, f_{LN}(t; \mu_1, \sigma_1) + (1 - c)\, f_{LN}(t; \mu_2, \sigma_2) \tag{5} \]

where \( \theta = (\mu_1, \sigma_1, c, \mu_2, \sigma_2) \). The corresponding cdf can be easily derived from equations (3) and (5).

Appendix B. Error Measure ε

We use the following distance measure to calculate the error of the approximations. The distance between approximation and data is only calculated for the time-bins (i.e. minutes) where a post actually receives a comment, to avoid a distortion of the error measure by the periods with low comment activity.

Definition 1. Let \( \mathcal{T} \) be the set of time-bins where a post receives at least one comment and \( T \) its cardinality. We then define the approximation error ε of a function f(t) approximating g(t) (both defined for all \( t \in \mathcal{T} \)) as the normalized ℓ1-norm of f(t) − g(t):

\[ \epsilon = \frac{1}{T} \sum_{t \in \mathcal{T}} |f(t) - g(t)|. \tag{6} \]

If f(t) and g(t) are cumulative distribution functions (i.e. 0 ≤ f(t) ≤ 1 and 0 ≤ g(t) ≤ 1), it follows that 0 ≤ ε ≤ 1.

References

Adler, S. (1999). The Slashdot effect, an analysis of three Internet publications. http://ssadler.phy.bnl.gov/adler/SDE/SlashDotEffect.html.
Albert, R., Jeong, H., and Barabási, A.-L. (1999). Internet: Diameter of the World-Wide Web. Nature, 401:130, arXiv:cond-mat/9907038.
Baoill, A. Ó. (2000). Slashdot and the public sphere. First Monday, 5(9).
Barabási, A.-L. (2005).
The origin of bursts and heavy tails in human dynamics. Nature, 435:207–211, arXiv:cond-mat/0505371.
DeGroot, M. H. and Schervish, M. J. (2002). Probability and Statistics. Addison-Wesley, New York, 3rd edition.
Dewes, C., Wichmann, A., and Feldmann, A. (2003). An analysis of Internet chat systems. In IMC ’03: Proceedings of the 3rd ACM SIGCOMM conference on Internet measurement, pages 51–64, New York, NY, USA. ACM Press.
Dezsö, Z., Almaas, E., Lukács, A., Rácz, B., Szakadát, I., and Barabási, A.-L. (2006). Dynamics of information access on the web. Physical Review E, 73(6):066132, arXiv:physics/0505087.
Faloutsos, M., Faloutsos, P., and Faloutsos, C. (1999). On power-law relationships of the Internet topology. In SIGCOMM ’99: Proceedings of the conference on Applications, technologies, architectures, and protocols for computer communication, pages 251–262, New York, NY, USA. ACM Press.
Goldstein, M. L., Morris, S. A., and Yen, G. G. (2004). Problems with fitting to the power-law distribution. The European Physical Journal B, 41(2):255–258, arXiv:cond-mat/0402322.
Habermas, J. (1962/1989). The Structural Transformation of the Public Sphere: Inquiry into a Category of Bourgeois Society. Cambridge, MA: MIT Press.
Harder, U. and Paczuski, M. (2006). Correlated dynamics in human printing behavior. Physica A: Statistical Mechanics and its Applications, 361(1):329–336.
Henderson, T. and Bhatti, S. (2001). Modelling user behaviour in networked games. In MULTIMEDIA ’01: Proceedings of the 9th ACM International Conference on Multimedia, pages 212–220, New York, NY, USA. ACM Press.
Johansen, A. (2004). Probing human response times. Physica A: Statistical Mechanics and its Applications, 338(1–2):286–291, arXiv:cond-mat/0305079.
Kaltenbrunner, A., Gómez, V., and López, V. (2007a). Description and prediction of Slashdot activity. In Proceedings of the 5th Latin American Web Congress (LA-WEB 2007), Santiago de Chile. IEEE Computer Society.
Kaltenbrunner, A., Gómez, V., Moghnieh, A., Meza, R., Blat, J., and López, V. (2007b). Homogeneous temporal activity patterns in a large online communication space. In Proceedings of the BIS 2007 Workshop on Social Aspects of the Web (SAW 2007). Poznan, Poland.
Kleban, S. D. and Clearwater, S. H. (2003). Hierarchical dynamics, interarrival times, and performance. In SC ’03: Proceedings of the 2003 ACM/IEEE conference on Supercomputing, page 28, Washington, DC, USA. IEEE Computer Society.
Lampe, C. and Resnick, P. (2004). Slash(dot) and burn: Distributed moderation in a large online conversation space. In CHI ’04: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 543–550, New York, NY, USA. ACM Press.
Leskovec, J., Adamic, L. A., and Huberman, B. A. (2006). The dynamics of viral marketing. In EC ’06: Proceedings of the 7th ACM conference on Electronic commerce, pages 228–237, New York, NY, USA. ACM Press, arXiv:physics/0509039.
Limpert, E., Stahel, W. A., and Abbt, M. (2001). Log-normal distributions across the sciences: Keys and clues. Bioscience, 51(5):341–352(12).
Mainardi, F., Raberto, M., Gorenflo, R., and Scalas, E. (2000). Fractional calculus and continuous-time finance II: the waiting-time distribution. Physica A: Statistical Mechanics and its Applications, 287(3):468–481, arXiv:cond-mat/0006454.
Malda, R. (2002). Slashdot FAQ: Comments and Moderation. http://slashdot.org/faq/commod.shtml.
Masoliver, J., Montero, M., and Weiss, G. H. (2003). Continuous-time random-walk model for financial distributions. Physical Review E, 67(2):021112, arXiv:cond-mat/0210513.
Mitzenmacher, M. (2003). A brief history of generative models for power law and lognormal distributions. Internet Mathematics, 1(2):226–251.
Newman, M. E. J. (2005). Power laws, Pareto distributions and Zipf’s law.
Contemporary Physics, 46:323–351, arXiv:cond-mat/0412004.
Paxson, V. and Floyd, S. (1995). Wide area traffic: The failure of Poisson modeling. IEEE/ACM Transactions on Networking, 3(3):226–244.
Poor, N. (2005). Mechanisms of an online public sphere: The web site Slashdot. Journal of Computer-Mediated Communication, 10(2).
Sigman, K. (1999). Appendix: A primer on heavy-tailed distributions. Queueing Systems: Theory and Applications, 33(1–3):261–275.
Stouffer, D. B., Malmgren, R. D., and Amaral, L. A. N. (2006). Log-normal statistics in e-mail communication patterns, arXiv:physics/0605027.
Vázquez, A., Oliveira, J. G., Dezso, Z., Goh, K. I., Kondor, I., and Barabási, A.-L. (2006). Modeling bursts and heavy tails in human dynamics. Physical Review E, 73:036127, arXiv:physics/0510117.
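As a supplementary sketch to Appendices A and B, the following code evaluates the LN cdf of equation (3), the DLN cdf that follows from equations (3) and (5) by linearity, and the error measure ε of equation (6). The function names, the parameter vector `theta` and the comment times are hypothetical illustrations, and the empirical cdf is only one plausible choice for g(t); none of this is the authors' implementation.

```python
import math


def ln_cdf(t, mu, sigma):
    """Log-normal cdf as in equation (3); t must be strictly positive."""
    return 0.5 + 0.5 * math.erf((math.log(t) - mu) / (math.sqrt(2) * sigma))


def dln_cdf(t, mu1, sigma1, c, mu2, sigma2):
    """DLN cdf: the same weighted sum as equation (5), applied to the cdfs."""
    return c * ln_cdf(t, mu1, sigma1) + (1 - c) * ln_cdf(t, mu2, sigma2)


def approximation_error(f, g, bins):
    """Equation (6): mean absolute difference of f and g over the given bins."""
    bins = list(bins)
    return sum(abs(f(t) - g(t)) for t in bins) / len(bins)


# Hypothetical minutes at which one post received comments.
comment_minutes = [2, 5, 9, 30, 120]


def empirical_cdf(t):
    """Fraction of comments received up to and including minute t."""
    return sum(1 for m in comment_minutes if m <= t) / len(comment_minutes)


# Hypothetical DLN parameters in the order (mu1, sigma1, c, mu2, sigma2).
theta = (2.0, 1.0, 0.7, 5.0, 0.8)
eps = approximation_error(
    lambda t: dln_cdf(t, *theta), empirical_cdf, set(comment_minutes)
)
print(f"approximation error: {eps:.3f}")
```

Only the minutes that actually contain a comment are passed as bins, which mirrors the restriction to the set of commented time-bins in Definition 1.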
