1. Introduction
The Internet has become a major platform for the formation and diffusion of public opinion [1]. In the Web 2.0 era, Bulletin Board Systems (BBS), micro-blogs, WeChat, etc. are the main sources of public information dissemination and have become the core areas of public opinion monitoring [2]. The numbers of posts, blogs or micro-blogs represent the hotness and trend of public opinion, so time series analysis of these numbers, such as predicting the number of new posts, blogs or micro-blogs in the next time interval, can serve as an important signal for government agencies, enterprises and website operators when making decisions. Based on time series analysis of the numbers of posts, blogs or micro-blogs, convincing results have been obtained for election result prediction [3], crisis management [4] and stock market forecasting [5]. Therefore, time series analysis of these numbers enables government agencies, enterprises and website operators to monitor the tendency of public opinion, and further supports them in planning and acting rationally for public opinion management and guidance [6].
For government agencies and enterprises, the prediction of the numbers of posts, blogs or micro-blogs can support decision making for at least three reasons [7,8]. First, it provides a measure of the trend, scope and duration of public opinion on topics related to them. Second, it is useful for estimating the effort involved in public opinion management and guidance. Events or topics with different trend types require different operation methods and levels of effort. Generally, the effort involved in public opinion management and guidance is proportional to the number of posts, blogs or micro-blogs. Third, the numbers of posts, blogs or micro-blogs make it possible to evaluate the effectiveness of guidance strategies. According to the positive or negative feedback from public opinion, the guidance strategies can be improved immediately. For website operators, the prediction of the numbers of posts, blogs or micro-blogs can help them allocate resources or adapt strategies for hot topics [9]. Without sufficient resources allocated to hot events, their systems may slow down or crash. Conversely, allocating too many resources increases their operational cost.
Hence, for effective management and guidance of public opinion, the prediction of the numbers of posts, blogs or micro-blogs is a critical issue. Various approaches have been proposed to solve this problem, and they can be divided into two kinds: diffusion models and time series models. In diffusion models, classic mathematical models of diffusion are adopted to describe public opinion, such as the Logistic distribution [10], the epidemic model [11] and the Michaelis–Menten model [12]. The information diffusion process of public opinion is modeled through the classic diffusion model. Based on the identified model, the trend, peak and duration at different stages of public opinion are predicted. In time series models, the diffusion mechanism of public opinion is ignored, and the model of public opinion is constructed based only on time series data. Auto-regressive integrated moving average (ARIMA) [13], artificial neural networks [14] and support vector machines [15,16] are the most frequently used models for time series data. Because time series models are effective for public opinion prediction, a time series model is applied in this study to predict the numbers of posts, blogs or micro-blogs.
In the Chinese online community, BBS serves as an important social medium; a wide range of interests and contents is a distinct characteristic of Chinese BBS sites [17]. As an emerging medium and electronic information center, Chinese BBS sites fulfill the needs of netizens to be informed and to exchange opinions [18]. With the appearance of blogs, Twitter, WeChat, etc., the influence of BBS is declining, but, due to the Internet regulations in China, Chinese BBS sites offer another channel to express information and spread opinions, and are still an important platform in China. Tianya Club (http://bbs.tianya.cn/), one of the most influential Chinese BBS sites, provides BBS, blog and other services for netizens and consists of many sub-boards for discussing different contents and topics, such as Tianya Zatan and Baixing Shengyin [19]. Tianya Zatan is an important board within Tianya Club, and its content covers the daily news of current society and personal life. Nearly 1000 new posts are published on the Tianya Zatan board every day, and netizens generate millions of clicks and replies [20]. The daily new posts data of Tianya Zatan were selected as the data source for public opinion monitoring.
Takens’ [21] embedding theorem and Sauer et al.’s [22] embedology theorem provide a theoretical foundation for reconstructing a nonlinear dynamical system from the time series it generates. Consequently, for the modeling and forecasting of BBS posts time series, there are two main problems: (1) Determination of the embedding dimension. A time series can be represented in the so-called “phase space” by a set of delay vectors (DVs), and the embedding dimension defines the size of the DVs. (2) Selection of a model that fits BBS post number prediction. Approximate entropy is an effective method for the embedding dimension analysis of time series; sample entropy (SampEn) is a modification of approximate entropy [23,24]. With multiple layers and more neurons, deep neural networks (DNN) can detect the features of data, and are more effective for time series prediction [25].
Therefore, by combining SampEn and DNN, an approach called SampEn-DNN is proposed to predict the BBS new post number time series. SampEn-DNN applies sample entropy to measure the predictability of DVs with different dimensions, selects the dimension of DVs with the smallest complexity, and feeds the DVs as the input of the DNN to improve the predictive performance for the time series. However, in some cases, single-scale sample entropy cannot fully reflect the complexity of a time series; to avoid this issue, multi-scale sample entropy is adopted in this paper. The skipping parameter and the dimension of DVs are tuned by multi-scale sample entropy. The predictive performance of SampEn-DNN on the Tianya Zatan new posts time series was investigated. The combination of sample entropy and DNN for time series modeling and forecasting, as well as its application to this area, is attempted here for the first time. To illustrate the improvements of SampEn-DNN, its performance was compared with that of ARIMA, seasonal ARIMA, polynomial regression and artificial neural network (ANN) on the Tianya Zatan new posts time series.
The rest of the paper is organized as follows. Section 2 presents the related methods for time series analysis. The proposed approach SampEn-DNN is shown in Section 3. Section 4 presents the experimental results of SampEn-DNN and the state-of-the-art approaches for time series analysis of Tianya Zatan new posts. Finally, concluding remarks and future work are given in Section 5.
2. Methodology
A time series $\{X_t\}$ is an ordered set of observations of a variable over successive periods of time. Time series modeling and forecasting is of fundamental importance in various practical domains [26]. Stock exchange data, wind speed, global temperature, etc. are typical examples of time series. The natural temporal ordering distinguishes time series analysis from other data analysis problems, in which the observations have no natural ordering. Studies of time series data can be divided into two parts: one is to extract and understand meaningful statistics and other characteristics of the data, and the other is to predict future values based on previously observed values.
Parametric approaches are frequently used for time series modeling and forecasting [26]. The state-of-the-art parametric approaches include the ARIMA model, the seasonal ARIMA model, polynomial regression and ANN. The parametric approaches of time series analysis assume that the underlying process is stationary. Generally, a time series $\{X_t\}$ is stationary if $E[X_t^2] < \infty$ for all $t$, $E[X_t] = \mu$ for all $t$, and $\mathrm{Cov}(X_t, X_{t+r}) = \gamma(r)$ for all $t$ and $r$, where $\gamma$ is the auto-covariance function. In other words, for a stationary time series, the variance is finite, the expected value at every time point equals the same value $\mu$, and the auto-covariance depends only on the time lag $r$ and not on the time points $t$ or $t + r$. For example, the simplest stationary time series is white noise. For time series modeling and forecasting, the first issue is to determine whether a time series is stationary or non-stationary. The common method for testing stationarity is the augmented Dickey–Fuller (ADF) test [27]. The related approaches for time series modeling and forecasting are presented as follows.
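The stationarity conditions above can be illustrated with a small sketch. It is not the ADF test; it only estimates the sample mean and sample auto-covariance and compares them across the two halves of a series. The function names, the half-split heuristic and the `tolerance` value are assumptions made for illustration.

```python
def autocovariance(series, lag):
    """Sample auto-covariance of `series` at a non-negative `lag`."""
    n = len(series)
    mean = sum(series) / n
    return sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    ) / n


def looks_stationary(series, lag=1, tolerance=0.25):
    """Crude heuristic (an assumption, not a statistical test): the mean and
    the lag-`lag` auto-covariance should be similar in both halves."""
    half = len(series) // 2
    first, second = series[:half], series[half:]
    mean_gap = abs(sum(first) / len(first) - sum(second) / len(second))
    cov_gap = abs(autocovariance(first, lag) - autocovariance(second, lag))
    # Normalize by the overall standard deviation so the check is scale-free.
    scale = max(autocovariance(series, 0) ** 0.5, 1e-12)
    return mean_gap / scale < tolerance and cov_gap / scale ** 2 < tolerance
```

A series with a linear trend fails the check because its mean drifts between the two halves, whereas a regularly alternating series passes it. In practice a formal test such as ADF would be used instead.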
2.1. ARIMA Model
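The ARIMA model generalizes the ARMA model by applying differencing to the time series data, and the ARMA model consists of two parts: the autoregressive (AR) model and the moving average (MA) model. These models are fitted to time series data and aim to describe the autocorrelations in the series.
For a time series $\{X_t\}$ with $p$ autoregressive terms, the AR model is abbreviated as AR($p$) and expressed as
$$X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + \cdots + \phi_p X_{t-p} + \varepsilon_t, \tag{1}$$
where $X_t$ is stationary, $\phi_1, \phi_2, \ldots, \phi_p$ are constants and $\phi_p \neq 0$. $\varepsilon_t$ is assumed to be Gaussian white noise with variance $\sigma^2$ and zero mean.
With moving average order $q$, the MA model is abbreviated as MA($q$) and expressed as
$$X_t = \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q}, \tag{2}$$
where the model parameters are $\theta_1, \theta_2, \ldots, \theta_q$, and $q$ lags are in the moving average.
According to Equations (1) and (2), the ARMA model with autoregressive order $p$ and moving average order $q$ is abbreviated as ARMA($p$, $q$) and expressed as
$$X_t = \sum_{i=1}^{p} \phi_i X_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}. \tag{3}$$
If the mean $\mu$ of $X_t$ is non-zero, then set $\alpha = \mu (1 - \phi_1 - \cdots - \phi_p)$, and the ARMA model can be rewritten as
$$X_t = \alpha + \sum_{i=1}^{p} \phi_i X_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}. \tag{4}$$
The ARIMA model adds differencing, which is used to transform a non-stationary time series into a stationary one. If $L$ is the lag operator ($L X_t = X_{t-1}$), so that $(1 - L)^d$ is the differencing operator of order $d$, and $W_t = (1 - L)^d X_t$ conforms to the process ARMA($p$, $q$), the model is denoted as ARIMA($p$, $d$, $q$). The general expression of ARIMA($p$, $d$, $q$) is given as Equation (5),
$$\left(1 - \sum_{i=1}^{p} \phi_i L^i\right) (1 - L)^d X_t = \left(1 + \sum_{j=1}^{q} \theta_j L^j\right) \varepsilon_t, \tag{5}$$
where, through differencing of order $d$, the original time series $X_t$ is converted from a non-stationary series into the stationary time series $W_t$.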
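As a minimal sketch of the differencing step in ARIMA($p$, $d$, $q$), the following function applies first-order differencing $d$ times to a list of observations. The function name `difference` is a hypothetical helper, and the sketch covers only the differencing, not the fitting of the AR and MA coefficients.

```python
def difference(series, d=1):
    """Apply the differencing operator d times: each pass replaces the
    series by the differences of consecutive observations."""
    result = list(series)
    for _ in range(d):
        result = [result[t] - result[t - 1] for t in range(1, len(result))]
    return result


# A quadratic trend becomes constant after differencing twice.
squares = [1, 4, 9, 16, 25]
print(difference(squares, d=1))  # [3, 5, 7, 9]
print(difference(squares, d=2))  # [2, 2, 2]
```

Each pass shortens the series by one observation, which is why the differenced series has $d$ fewer points than the original.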
2.2. Seasonal ARIMA Model
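In practice, most time series show seasonal variation. In a seasonal time series, the observations follow a similar trend during the same period (e.g., daily, monthly or yearly). Additionally, the observations across successive periods may also exhibit another seasonal trend.
To address seasonality and a potential seasonal unit root, an extension of the ARIMA model called the Seasonal ARIMA model was proposed [27]. Assuming that the periodicity of the time series is $s$, the Seasonal ARIMA model is given by
$$\Phi_P(L^s)\, \phi_p(L)\, (1 - L)^d (1 - L^s)^D X_t = \Theta_Q(L^s)\, \theta_q(L)\, \varepsilon_t, \tag{6}$$
where $L$ is the lag operator ($L^k X_t = X_{t-k}$), $\phi_p(L)$ and $\theta_q(L)$ are the non-seasonal AR and MA polynomials, $\Phi_P(L^s)$ and $\Theta_Q(L^s)$ are the seasonal AR and MA polynomials of orders $P$ and $Q$, and $(1 - L^s)^D$ is the seasonal differencing operator, which accounts for non-stationarity in observations made in the same period of successive cycles; $(1 - L^s) X_t = X_t - X_{t-s}$ for $D = 1$.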
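The seasonal differencing operator can be sketched in the same spirit. The helper below is hypothetical and assumes the periodicity $s$ is known in advance; it subtracts from each observation the value observed $s$ steps earlier.

```python
def seasonal_difference(series, s, D=1):
    """Seasonal differencing with period s, applied D times:
    each observation minus the one observed s steps earlier."""
    result = list(series)
    for _ in range(D):
        result = [result[t] - result[t - s] for t in range(s, len(result))]
    return result


# Three cycles of period 3 with a slow upward drift of 1 per cycle.
observations = [10, 20, 30, 11, 21, 31, 12, 22, 32]
print(seasonal_difference(observations, s=3))  # [1, 1, 1, 1, 1, 1]
```

The within-cycle pattern (10, 20, 30) disappears after one seasonal difference, leaving only the drift between cycles.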
2.3. Polynomial Regression
For a given dataset $\{(x_i, y_i)\}_{i=1}^{N}$, $x$ is the independent variable and $y$ is the dependent variable. Polynomial regression is a form of linear regression that models the relationship between $x$ and $y$ as an $n$th order polynomial. In general, a polynomial regression fits data to a model of the following form,
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_n x^n + \varepsilon. \tag{7}$$
The parameters $\beta_0, \beta_1, \ldots, \beta_n$ of the polynomial regression are identified by the method of least squares. According to Taylor’s theorem [28], a polynomial can be viewed as a truncated Taylor series expansion, so polynomial regression can be used to approximate continuous functions for curve fitting and trend analysis.
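The least squares identification of the polynomial coefficients can be sketched without external libraries by solving the normal equations. The function names `fit_polynomial` and `predict` are hypothetical, and the normal-equation route is one possible choice that is adequate only for low orders and small, well-scaled inputs.

```python
def fit_polynomial(xs, ys, order):
    """Least-squares coefficients [b0, b1, ..., b_order] via the normal equations."""
    size = order + 1
    # Normal equations (V^T V) beta = V^T y, with V[i][k] = xs[i] ** k.
    a = [[sum(x ** (row + col) for x in xs) for col in range(size)]
         for row in range(size)]
    b = [sum(y * x ** row for x, y in zip(xs, ys)) for row in range(size)]
    # Forward elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, size):
            factor = a[row][col] / a[col][col]
            for k in range(col, size):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    beta = [0.0] * size
    for row in range(size - 1, -1, -1):
        tail = sum(a[row][k] * beta[k] for k in range(row + 1, size))
        beta[row] = (b[row] - tail) / a[row][row]
    return beta


def predict(beta, x):
    """Evaluate the fitted polynomial at x."""
    return sum(coef * x ** k for k, coef in enumerate(beta))
```

For example, fitting a second-order polynomial to points generated from $1 + 2x + 3x^2$ should recover coefficients close to 1, 2 and 3, up to floating-point error.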
2.4. Artificial Neural Networks
ANN is a machine learning framework inspired by biological neural networks. One of the most widely applied models is the back propagation neural network (BPNN) [29]. BPNN is a feed-forward network whose connection weights are trained by the error back propagation algorithm. For a given dataset $\{(x_i, y_i)\}_{i=1}^{N}$, the training of BPNN includes two parts: forward propagation and back propagation. Forward propagation: the input sample $x_i$ is propagated from the input layer, via the hidden layer, to the output layer. The connection weights of BPNN are kept constant during forward propagation. Back propagation: the difference (error) between the real value $y_i$ and the predicted output $\hat{y}_i$ of BPNN is propagated from the output layer back to the input layer, and the connection weights are updated by this error feedback. The objective of BPNN training is to find a set of network weights that minimizes the difference between the real value $y_i$ and the predicted output $\hat{y}_i$.
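The two training phases can be illustrated with a deliberately small network. The sketch below assumes one hidden layer with sigmoid units, a linear output neuron, squared error and per-sample gradient updates; the layer size, learning rate, number of epochs and the name `train_bpnn` are illustrative assumptions, not the configuration used in this paper.

```python
import math
import random


def train_bpnn(samples, hidden=4, epochs=2000, rate=0.1, seed=0):
    """Train a one-hidden-layer network on (input_vector, target) pairs
    and return a prediction function."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0

    def forward(x):
        # Forward propagation: input layer -> hidden layer -> output layer.
        h = [1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) + b)))
             for row, b in zip(w1, b1)]
        return h, sum(w * hj for w, hj in zip(w2, h)) + b2

    for _ in range(epochs):
        for x, y in samples:
            h, out = forward(x)      # weights stay fixed during this pass
            err = out - y            # error at the output layer
            for j in range(hidden):  # back propagation of the error
                grad_h = err * w2[j] * h[j] * (1.0 - h[j])
                w2[j] -= rate * err * h[j]
                for i in range(n_in):
                    w1[j][i] -= rate * grad_h * x[i]
                b1[j] -= rate * grad_h
            b2 -= rate * err
    return lambda x: forward(x)[1]
```

Calling `train_bpnn(samples, epochs=0)` returns the untrained network, which gives a simple baseline for checking that training actually reduces the error on the training samples.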
3. SampEn-DNN
By combining sample entropy (SampEn) and deep neural networks (DNN), a novel time series modeling and forecasting method, SampEn-DNN, is proposed to predict the daily number of BBS new posts.
3.1. Sample Entropy
For a time series $X = \{x_1, x_2, \ldots, x_N\}$, assume its constant time interval is $\tau$. The constant time interval of the BBS new posts time series in this study is one day. The SampEn of the time series $X$ can be computed as follows.
First, define the dimension of the embedding vector as $m$ and the tolerance as $r$, such that the embedding vector is given as $X_m(i) = \{x_i, x_{i+1}, \ldots, x_{i+m-1}\}$.
Second, the Chebyshev distance $d[X_m(i), X_m(j)]$ is used as the distance function [30],
$$d[X_m(i), X_m(j)] = \max_{0 \le k \le m-1} |x_{i+k} - x_{j+k}|, \quad i \neq j. \tag{8}$$
Third, the number of $X_m(j)$ whose distance to $X_m(i)$ does not exceed the tolerance $r$ ($d[X_m(i), X_m(j)] \le r$) is counted and denoted as $n_i(m)$, and then the proportion $c_i(m) = n_i(m) / (N - m - 1)$ that any $X_m(j)$ is close to $X_m(i)$ is computed.
Fourth, by averaging over all possible $X_m(i)$, the proportion $c(m) = \frac{1}{N - m} \sum_{i=1}^{N-m} c_i(m)$ is estimated.
Fifth, the Chebyshev distance and $c(m + 1)$ for embedding vector dimension $m + 1$ are computed in a similar way.
Sixth, the SampEn of $X$ is defined as
$$SampEn(m, r) = -\ln \frac{c(m + 1)}{c(m)}. \tag{9}$$
From the definition, $c(m + 1)$ is not bigger than $c(m)$, so the $SampEn(m, r)$ value is either zero or positive. For a time series dataset, a smaller value of $SampEn(m, r)$ means more self-similarity (predictability) or less noise. To overcome the shortcomings of single-scale SampEn in some special cases, multi-scale SampEn is adopted. In multi-scale SampEn, the skipping parameter $\delta$ specifies a fixed interval between consecutive elements of the input vector. Hence, the input vector is modified as $X_m^{\delta}(i) = \{x_i, x_{i+\delta}, \ldots, x_{i+(m-1)\delta}\}$, $c(m)$ is expressed as $c(m, \delta)$, and then SampEn can be given as $SampEn(m, r, \delta) = -\ln [c(m + 1, \delta) / c(m, \delta)]$. In this study, the value of the tolerance $r$ is set as $0.02\,std$, where $std$ stands for the standard deviation of the time series $X$ [24].
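The six steps and the multi-scale extension can be sketched as follows. This is a direct quadratic-time illustration rather than the procedure of Figure 1. It assumes the common convention of using the same number of vectors for dimensions m and m + 1, so that the normalizing constants cancel and only the match counts are needed, and it returns infinity when SampEn is undefined. The function names are hypothetical.

```python
import math


def delay_vectors(series, m, delta, count):
    """First `count` vectors (x_i, x_{i+delta}, ..., x_{i+(m-1)*delta})."""
    return [[series[i + k * delta] for k in range(m)] for i in range(count)]


def count_matches(vectors, r):
    """Number of ordered pairs i != j whose Chebyshev distance is <= r."""
    total = 0
    for i, a in enumerate(vectors):
        for j, b in enumerate(vectors):
            if i != j and max(abs(p - q) for p, q in zip(a, b)) <= r:
                total += 1
    return total


def sample_entropy(series, m, r, delta=1):
    """SampEn(m, r, delta) = -ln(c(m+1) / c(m)); math.inf when undefined."""
    count = len(series) - m * delta  # vectors available for dimension m + 1
    if count < 2:
        return math.inf
    c_m = count_matches(delay_vectors(series, m, delta, count), r)
    c_m1 = count_matches(delay_vectors(series, m + 1, delta, count), r)
    if c_m == 0 or c_m1 == 0:
        return math.inf
    return -math.log(c_m1 / c_m)
```

For a strictly periodic series every match at dimension m is also a match at dimension m + 1, so the ratio is one and the entropy is zero; when no vectors match within the tolerance, the value is reported as infinite.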
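The procedure for the calculation of $SampEn(m, r, \delta)$ is presented in Figure 1. For given starting positions $i$ and $j$ in the time series $X$, Lines 4–9 in Figure 1 decide whether $d[X_m^{\delta}(i), X_m^{\delta}(j)] \le r$ or not, and Lines 11–13 decide whether $d[X_{m+1}^{\delta}(i), X_{m+1}^{\delta}(j)] \le r$ or not. In Figure 1, for given $i$ and $j$, according to the definition of the Chebyshev distance, if $d[X_m^{\delta}(i), X_m^{\delta}(j)] \le r$ and $|x_{i+m\delta} - x_{j+m\delta}| \le r$, then $d[X_{m+1}^{\delta}(i), X_{m+1}^{\delta}(j)] \le r$ holds.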
3.2. Deep Neural Network
The DNN model consists of a deep belief network (DBN) [31] and a feedforward neural network (FNN). The DBN is built by stacking multiple Restricted Boltzmann Machines (RBM) [32]. The structure of the DNN is shown in Figure 2. The aim of the DBN is to extract high-level features from the input data through the stacked RBMs: the features produced by the hidden layer of one RBM serve as the input to the next higher-level RBM. The high-level feature representation learned by the DBN is fed as the input of the FNN. BPNN, one of the most widely used FNN models, is adopted as the FNN in the DNN model.
Hence, RBM is the main component of the DNN. RBM is an energy-based model for unsupervised learning, and consists of two kinds of layers: the visible layer and the hidden layer. The visible layer represents the input data, and the hidden layer represents a probability distribution over the input data. The neurons in the visible layer are connected only to the neurons in the hidden layer.
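For RBM, assume the numbers of neurons in the visible layer and the hidden layer are $m$ and $n$, with the layers denoted as $v = (v_1, v_2, \ldots, v_m)$ and $h = (h_1, h_2, \ldots, h_n)$, respectively. Meanwhile, assume the bias vectors and the weight matrix of RBM are $a$, $b$ and $W$. In an energy-based model, an energy function is applied to define the log-likelihood of the input data distribution over the parameters $a$, $b$, $W$, $v$ and $h$. The energy function for RBM is given by Equations (10) and (11),
$$E(v, h) = -\sum_{i=1}^{m} a_i v_i - \sum_{j=1}^{n} b_j h_j - \sum_{i=1}^{m} \sum_{j=1}^{n} v_i W_{ij} h_j, \tag{10}$$
$$E(v, h) = -a^{T} v - b^{T} h - v^{T} W h. \tag{11}$$
For each pair of a visible vector and a hidden vector, the joint probability distribution is defined as
$$P(v, h) = \frac{1}{Z} e^{-E(v, h)}, \quad Z = \sum_{v, h} e^{-E(v, h)}. \tag{12}$$
The probability that the network assigns to a visible vector is the sum of the joint probabilities over all hidden vectors, expressed as
$$P(v) = \frac{1}{Z} \sum_{h} e^{-E(v, h)}. \tag{13}$$
Since the neurons in the visible layer connect only to the neurons in the hidden layer, there is no connection between neurons in the same layer. The joint probability can therefore be factorized into conditional probabilities,
$$P(h \mid v) = \prod_{j=1}^{n} P(h_j \mid v), \tag{14}$$
$$P(v \mid h) = \prod_{i=1}^{m} P(v_i \mid h). \tag{15}$$
For binary data, Equations (14) and (15) can be expressed as
$$P(h_j = 1 \mid v) = sigm\left(b_j + \sum_{i=1}^{m} W_{ij} v_i\right), \tag{16}$$
$$P(v_i = 1 \mid h) = sigm\left(a_i + \sum_{j=1}^{n} W_{ij} h_j\right), \tag{17}$$
where $sigm(x) = 1 / (1 + e^{-x})$ is the sigmoid function.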
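A small numerical sketch can make the RBM quantities concrete. It assumes the standard binary RBM, with the weight matrix stored as nested lists indexed as `w[i][j]` for visible neuron i and hidden neuron j; the function names other than `sigm` are hypothetical, and no training (such as contrastive divergence) is shown.

```python
import math


def sigm(x):
    """Sigmoid function."""
    return 1.0 / (1.0 + math.exp(-x))


def energy(v, h, a, b, w):
    """E(v, h) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i W_ij h_j."""
    visible = sum(ai * vi for ai, vi in zip(a, v))
    hidden = sum(bj * hj for bj, hj in zip(b, h))
    pair = sum(v[i] * w[i][j] * h[j]
               for i in range(len(v)) for j in range(len(h)))
    return -visible - hidden - pair


def hidden_probabilities(v, b, w):
    """P(h_j = 1 | v) for every hidden neuron j."""
    return [sigm(b[j] + sum(v[i] * w[i][j] for i in range(len(v))))
            for j in range(len(b))]


def visible_probabilities(h, a, w):
    """P(v_i = 1 | h) for every visible neuron i."""
    return [sigm(a[i] + sum(w[i][j] * h[j] for j in range(len(h))))
            for i in range(len(a))]
```

With a single hidden neuron, the conditional probability returned by `hidden_probabilities` can be checked against the energies directly, since it equals $e^{-E(v, 1)} / (e^{-E(v, 0)} + e^{-E(v, 1)})$.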
3.3. SampEn-DNN Approach
When DNN is used as a regression model for time series analysis, the primary problem is to decide the form of its input vectors, which is the main factor affecting the predictive performance of the model. Generally, the input vectors are determined by two parameters: the dimension of input vector m and the skipping parameter δ. Based on the SampEn method, the dimension of input vector m and the skipping parameter δ are optimized. For DNN model training, the first m − 1 elements are applied to predict the last (m-th) element.
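The construction of training samples from the two parameters can be sketched as follows. The helper name `make_training_pairs` is hypothetical; it builds every delay vector of dimension m with skipping parameter δ and splits it into the first m − 1 elements (the input) and the last element (the target).

```python
def make_training_pairs(series, m, delta):
    """Return (inputs, target) pairs: the first m - 1 elements of each
    delay vector predict its last element."""
    span = (m - 1) * delta  # distance between first and last element
    pairs = []
    for i in range(len(series) - span):
        vector = [series[i + k * delta] for k in range(m)]
        pairs.append((vector[:-1], vector[-1]))
    return pairs


daily_posts = [1, 2, 3, 4, 5, 6, 7, 8]  # placeholder values, not real data
for inputs, target in make_training_pairs(daily_posts, m=3, delta=2):
    print(inputs, "->", target)
# [1, 3] -> 5
# [2, 4] -> 6
# [3, 5] -> 7
# [4, 6] -> 8
```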
To optimize the parameters m and δ of the input vector, the maximum values of the two parameters need to be decided first. For the dimension of the input vector, if m is larger than 13, the SampEn of the time series cannot be derived in most cases. This is because, at this stage, c(m) and c(m + 1) in Equation (9) are both zero: the differences between the elements of the m-length and the (m + 1)-length input vectors are all bigger than 0.02 std. For the skipping parameter, if the value of δ is too big, the intervals between data points become large, which means that the data points of the time series are less similar to each other and the input vectors are less predictable. For this reason, the largest skipping parameter δ is constrained to 12 in this study [16].
The SampEn results of BBS new posts time series based on different dimensions of input vectors
m and different skipping parameters
δ are presented in
Table 1.
As shown in Table 1, when the skipping parameter δ is kept constant and the dimension of input vectors m varies from 2 to 13, the SampEn results of the BBS post time series decrease at first and increase after a critical threshold. For example, for one fixed value of δ in Table 1, as m varies from 2 to 11, the SampEn results of the time series decrease from 1.19 to 0.54; however, as m varies from 12 to 13, the SampEn results increase to 0.54. Formally, this phenomenon is known as a phase transition. In
Table 1, it can also be found that, when
m increases, the SampEn results decrease along with the range of
and
. Similarly, the value of
for a given
q is smaller than
when
is larger than
. When the phase transition appears, the complexity of the input vector increases, which means the unpredictability of the time series increases.
Therefore, according to the above analysis, the procedures for determining the optimal size of input vector m and the optimal skipping parameter δ are shown in Figure 3 and Figure 4. First, to determine the optimal size of input vectors m, the SampEn results of the different skipping parameters δ under the same m are averaged, and the m with the smallest average SampEn is selected as the optimal parameter. After that, the skipping parameter δ with the minimum SampEn under this m is selected as the optimal skipping parameter.
The procedure for determining the optimal size of input vector m is presented in Figure 3. seMatrix is derived from the SampEn results of different m and δ in Table 1. According to Lines 1–8, the SampEn results of all δ under the same size m are summed. Line 9 computes the average of the SampEn results under the same size parameter m. Based on the SampEn values in Table 1, the optimal dimension of input vector is obtained as 11. Note that, for different dimensions of input vector m, the sets of skipping parameters δ whose SampEn results are not equal to infinity differ from each other.
Therefore, the procedure for determining the optimal skipping parameter δ is shown in Figure 4. According to Lines 2–3, the skipping parameter δ is increased under the optimal dimension of input vector m until the optimal skipping parameter δ is found. Based on the SampEn values in Table 1, the optimal skipping parameter δ is derived as 5.
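The two-stage selection described above can be sketched as follows. This is not the code of Figure 3 or Figure 4: the nested-dictionary layout of `se_matrix`, the function name `select_parameters`, and the choice to exclude infinite SampEn values from the average are assumptions made for illustration, and the numbers in the usage example are invented rather than taken from Table 1.

```python
import math


def select_parameters(se_matrix):
    """se_matrix[m][delta] holds SampEn(m, r, delta), math.inf if undefined.
    Returns (optimal m, optimal delta)."""
    def average(row):
        # Assumption: only finite SampEn values enter the average.
        finite = [value for value in row.values() if math.isfinite(value)]
        return sum(finite) / len(finite) if finite else math.inf

    # Stage 1: the dimension m with the smallest average SampEn.
    best_m = min(se_matrix, key=lambda m: average(se_matrix[m]))
    # Stage 2: under that m, the delta with the minimum SampEn.
    best_delta = min(se_matrix[best_m], key=lambda d: se_matrix[best_m][d])
    return best_m, best_delta


# Invented values for illustration only.
example = {
    2: {1: 1.2, 2: 1.0},
    3: {1: 0.6, 2: 0.4, 3: math.inf},
    4: {1: 0.9, 2: math.inf},
}
print(select_parameters(example))  # (3, 2)
```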
5. Concluding Remarks
In this paper, based on SampEn and DNN, a novel approach called SampEn-DNN is proposed for modeling and forecasting the BBS new post number time series. Multi-scale sample entropy is adopted to optimize the skipping parameter δ and the dimension of input vector m, and DNN is applied for time series modeling and forecasting based on the optimal parameters. The daily new post number of the Tianya Zatan board was selected as the data source, and extensive experiments with SampEn-DNN and the state-of-the-art approaches were carried out. The experimental results show that, owing to the parameter optimization by multi-scale sample entropy, DNN easily learns the micro-level patterns of the BBS new posts time series, and SampEn-DNN performs better than ARIMA, Seasonal ARIMA, polynomial regression and ANN.
In the future, the SampEn-DNN approach will be applied to different time series modeling and forecasting tasks. For public opinion monitoring, SampEn-DNN will be extended to predict the daily numbers of posts or micro-blogs for more BBSs or micro-blogging websites. Meanwhile, the SampEn-DNN approach will be applied to other areas, such as weather forecasting and control engineering, to test its effectiveness.