1. Introduction
By analyzing social big data such as blogs, twitter, etc., many statistical properties of human behaviors are uncovered scientifically. Typical examples are power law growth and relaxation of social interests around social events such as Christmas [
1,
2], spreading properties of false rumors and fake news [
3,
4], relationships between stock prices and social data [
5,
6], predicting consumer behavior with Web search [
7] and forecasting macroeconomic index with Google Trends [
8]. It is a big challenge for mathematical scientists to establish data analysis methods and propose mathematical models to describe and predict the phenomena. Currently, much attention is being paid to the topic of finding significant correlations between phenomena in the real world and in cyberspace.
In this paper, we examine the relation between a composite index (CI) and big blog data. The CI that characterizes the Japanese economy is announced by the Japanese government every month. Although many people want to know the latest economic condition quickly, it takes a few months to calculate CI values based on the observed quantities in many fields such as production, inventory, investment, employment, consumption, business management, finance, prices and services. In this paper, we show that the value of CI can be roughly estimated by the appearance of carefully chosen words in cyberspace. As the analysis using data in the cyberspace can be conducted in real time at lowers costs, it is possible that CI could be approximated by applying this new method for quick announcement.
We expect that online textual data such as blogs and Twitter are related to economic indices. Recently, many diverse people have begun to use social media and the data posted on such sites immediately reflect social events and human activities. For example, earthquakes and typhoons can be detected quickly and accurately by using Twitter data [
9]. It detected 96% of earthquakes of Japan Meteorological Agency (JMA) seismic intensity scale 3 or higher and announced more quickly than the announcements that were broadcast by the JMA. In addition, power law growth and relaxation were observed in blog frequency with words such as “April Fool” or “Christmas” around these social events. The future frequency of such words could be predicted by the properties [
1].
Nowcasting and forecasting based on social big data are effective because social data in cyberspace reflect public opinion and behavior in physical space. However, there are difficulties because of the large volume of data. For example, in the case of a huge number of candidates, such as 10,000 variables, if we assign the threshold of p-value as 0.01, then approximately 100 variables would remain, even if these candidates followed independent randomness.
We also have to pay attention to the following points in conducting correlation and regression analyses. First, the Pearson correlation coefficient is defined as follows:
where
represents an average of
Z and
has a value between +1 and −1. If
, the correlation is positive; conversely, if
, the correlation is negative.
and
are
Z-scores defined as:
which are normalized to an average 0 and standard deviation 1. The Pearson correlation coefficient is a useful tool in describing the relationship between two variables; however, when we apply it, we should be aware of the following assumptions.
Namely, if two variables do not follow a linear relation, or if variables (x, y) include outliers that significantly affect the correlation coefficient, we cannot trust the coefficient. In addition, in cases where the distributions of x or y follow power law distributions, , the standard deviation tends to diverge when the number of samples increases. Therefore, we need to consider that the Pearson correlation coefficient is meaningless in this context.
Second, we focus on the possibility that a correlation may be spurious. For instance, we assume three variables, e.g., temperature, and the sales of ice cream and soft drinks. Calculating correlations among these values, we probably find positive correlations. We consider when the temperature rises, people tend to buy more ice cream and soft drinks to provide relief from the heat; however, it is not reasonable to assume that the sales of soft drinks are increasing because the sales of ice cream are rising. In this case, we find the positive correlation between sales of ice cream and soft drinks. However, this could happen even if there is no relation among these variables, in which case we call the correlation a spurious correlation. When we detect a strong correlation, we are inclined to consider the existence of a linkage that includes a causal relationship, which often leads to an incorrect result. Therefore, when we deal with multiple variables, we need to introduce a process which excludes spurious correlations.
Third, when we apply regression analysis, we must pay attention to the following points:
Overfitting
Spurious regression
Multicollinearity
Overfitting is the case where we take many variables to reproduce details of training data, however, the model does not have the same performance for the test data, which are not used in the training. If we run regression analysis to non-stationary time series, we often have a large coefficient of determination, which is called a spurious regression [
10,
11]. Moreover, if more than two explanatory variables are a strongly correlated, the relationships between the explanatory variables are not independent. Therefore, optimal regression coefficients are not uniquely obtained, which is called multicollinearity. In this case, the reliability of the regression coefficients is remarkably reduced.
In cases where we handle multivariate variables with non-normal distribution and non-stationary dynamics, it is important to carefully check their appropriateness for the Pearson correlation coefficient and regression analysis. In the next section, we introduce a method which removes these difficulties and estimates the economic index using big blog data.
2. Development of a New Economic Index Based on a Comprehensive Search
In this section, we introduce a new method that systematically finds words that are correlated with respect to an economic index, and it reconstructs the index by using a regression formulation with the frequency of selected words.
The method, which resolves all the problems mentioned in the Introduction, consists of the following four steps, as shown in
Figure 1: data pre-processing, variable selection, regression analysis and smoothing.
In this analysis, we used three types of data: blogs, dictionary of candidate words and the time series of an economic index. As for blogs, we used “Kuchikomi@kakaricho” (
http://kakaricho.jp) to obtain the daily frequencies of words, which covers about 1.8 billion blogs from 1 January 2007 to 31 October 2015. We adopted 1352 words listed in the economics and industry section of the Nikkei thesaurus (
http://t21.nikkei.co.jp/public/help/contract/price/20/thesaurus/field_B.html) to illustrate the economic index. In this paper, we focus on the index of business conditions, especially Composite Index (CI) coincidence, which is announced monthly by the Japanese government. The CI provides a quantitative description of the Japanese economy.
2.1. Data Pre-Processing
We analyzed the daily frequency of each of the 1352 words from 1 January 2007, which is described by
where
i is an index of words and
is daily time. Here,
is defined as follows:
where
is the number of blog entries including the
ith key word, and
is the total number of blogs on
. Namely,
is a value by normalizing
with
to remove the effects such as trend of the entire blog and the day of the week effect because people tend to write blogs on the weekend. We also calculate monthly average frequency as,
to match with the economic index intervals which is announced monthly. Here,
m is the number of days in one month: 28, 29, 30 or 31. We excluded the words with 0 in frequency over 10 months to remove very low frequency words. Thus, 1127 candidate words remained.
2.2. Variable Selection
2.2.1. Extraction of Words by One-Body Correlation
First, we introduced an algorithm that chooses the words with similar patterns and trends in the time series of frequency. Specifically, as shown in
Figure 2, we gave a “+” sign if the value is greater than the median of the time series, while we used a “−” sign if the value is smaller. There were four patterns with respect to both values of CI and word frequency, which are greater or smaller than the median: (CI, word frequency) =
. If we find more frequency of
and
than
and
, CI and word frequency have positive correlation. Alternatively, if
and
appear more often than
and
, the two time series have negative correlation. Using the number of appearances of
as
, we define
p as no correlation degree.
This definition is the same as that of probability in Fisher’s exact test. In this analysis, we imposed the restriction of dividing the median, and therefore the probability
p does not have exactly the same meaning as the original
p in Fisher’s exact test. In this analysis, we considered that, if
, the variables are regarded as sufficiently correlated. As a result, we found 416 words that have small
p-values that show strong correlations with CI. The Top 4 words with the smallest
p-values are shown in
Table 1.
2.2.2. Grouping
The words selected in the previous method are likely to have very similar patterns and trends. If we apply a regression analysis using all the words, we have the multicollinearity problem and overfitting problem mentioned in the Introduction. To solve these problems, we categorized the words into some groups which have similar patterns and trends.
First, we assigned a number (
) in ascending order of the no correlation index,
p (in descending order of having strong correlation). Next, we calculated the value
p defined in Equation (
5) for
ith and
jth words (
) as well as shown in the extraction of words by one-body correlation, and, if the value
p was smaller than the threshold value (
), the time series of frequencies of two words was judged as very similar and the
jth word was classified into the
ith word’s group. By applying this algorithm to pairs in descending order, i.e., 1st–2nd, 1st–3rd, ⋯, 2nd–3rd, 2nd–4th, etc., we formed a group of words.
Figure 3 shows an example of the grouping results. We can observe the time series of pairs: for example, “
” (recession) and “
” (economic recovery policy), and “
” (foreign banks) and “
” (capital market) are comparable. After the grouping process, we classified the 416 words into 43 groups whose representative words have the lowest
p in the group.
2.2.3. Round Robin (Detection of Spurious Correlation)
Based on Bayesian analysis for Dirichlet distribution [
12], we defined a probability in which both frequency of
ith words and CI have a larger value than the median,
where
and
, as shown in
Section 2.2.1, and
r is the probability that CI has a larger value than its median estimated from whole data:
where
and
. We also define corresponding standard deviation as
In the cases where the frequency of the
ith word has a lower probability than the median, we defined the following values as well as the frequency that has a larger value than the median:
To introduce conditional probabilities with respect to the
jth word, we defined the following quantities,
as the numbers that CI and the frequency of the
ith word take a greater value than each median under the condition where the
jth word has a bigger value compared to its median in the same month. The probability and standard deviation in the condition are defined below:
In addition to
, we defined
, which describe the numbers that both CI and the frequency of the
jth word have a larger value than each median in the condition where the
ith word has a bigger value compared to its median. We also defined
,
,
,
in the same way as Equations (
11)–(
14).
Using these values, we defined the effectiveness of the
ith word to illustrate CI,
, in a condition where the frequency of the
jth word is bigger than the median:
Similarly, we defined the following quantity with the condition that the frequency of the
jth word is smaller than its median:
If these values, which were conditioned by the
jth word, are small, the correlation between the
ith word and CI is likely to be spurious correlation caused by the
jth word. We introduced the following quantity, which characterizes the strength of the correlation between the
ith word and CI taking into account the condition of the
jth word.
Some round robin results of
are shown in
Table 2. If
is smaller than
, we considered that the
jth word has a spurious correlation caused by the
ith word, because the correlation with a condition by the
ith word against the
jth word is smaller than a condition by the
jth word against
ith word. To evaluate the asymmetry, we defined an asymmetrical index:
If
has a larger value than a given threshold (−0.6 in this analysis), we judged that the
ith word has a spurious correlation caused by the
jth word. For example, regarding the pair “
” (economic measures) and “
” (recession),
We regarded “ ” (economic measures) as having spurious correlation against “” (recession). Therefore, “ ” (economic measures) was classified into the group of words which is represented by the word, “” (recession).
After classifying the words, we computed the following sum of
In the regression analysis, we adopted, as an explanatory variable, the descending order of .
2.2.4. Visualization of Relationships Between Variables
To intuitively understand the results obtained in the previous steps, such as the extraction of words by the one-body correlation, grouping and round robin (detection of a spurious correlation), we show the relationships between words in
Figure 4. In this figure, the solid line with both sides arrow indicates relationships with a strong correlation between CI and word frequency. These words passed the judgments: grouping and detection of spurious correlation. There are 17 representative words and, in this figure, we show the Top 7 words. The solid line without an arrow describes the words which are grouped in the representative words, and the broken line with an arrow indicates a spurious correlation.
2.3. Regression Analysis
We designed a model that reconstructs CI by conducting a regression analysis with frequency of words such as “” (recession), “ ” (foreign banks), etc., which were selected in the previous analysis. We applied this model to estimate the daily CI based on the daily frequency of words.
We performed a linear regression analysis using CI as a dependent variable and the frequency of words as explanatory variables:
where
is a constant and
m means the number of representative words. We used top
m words with respect to the order of
in Equation (
20). We then obtained coefficients
with the minimum square error:
Figure 5 depicts the regression results using one, three and seven variables. In the case of using only one variable, “hukeiki” (recession), big jumps in 2008 and 2009 were reproduced, and, by adding other variables to our model, more details were reproduced. The regression coefficients in the case of 7 words are shown in
Table 3. Besides the results of in-sample data, our model yielded reasonable results for the out-sample data from January to October 2015.
In the above variable selection analysis, we obtained 17 representative words from 1352 words. We then introduced the method that allowed us to systematically decide the appropriate number of explanatory variables. As discussed in the Introduction, we applied the regression analysis to non-stationary time series CI. To check for the spurious regression problem, we proposed a hypothesis testing which compared the frequency of words and an artificial time series which follows the random walk model.
A random walk time series was generated by randomly sampled numbers from the same distribution with real time series. For example, an absolute value of differences (step size) of the first random walk was randomly chosen from the absolute value of the differences in “
” (recession) and we then gave positive or negative signs randomly. The
kth random walks were produced using the absolute value of differences of the top
kth word as well as the first one.
Figure 6 shows the results of fitting CI by seven random walks. In
Figure 6a, the fitting did not work well and
is low, where
is defined as follows:
Here,
N is the number of data. In
Figure 6b, the random walk regression reconstructed the properties of CI with
. These result indicate that random walk sometimes reproduce non-stationary time series.
Figure 7 shows the probability density function of
for 10,000 samples by the regression of seven random walks. The red line in this figure shows the result in
Figure 5c. The results in
Figure 5c had a greater
value compared to the random walks in most cases, and the probability of the random walks model having a higher
was approximately 3%.
Using the distribution of
by the random walk regression, we performed a statistical test. The null hypothesis is the following: “The random walks regression is more effective than the regression by the time series of selected words with respect to
”.
Figure 8 shows the relationships between the coefficient of determination (
), the number of explanatory variables in the proposed model, and the random walk model with the Top 5 and 10 percentiles. The values of the coefficient of determination increases with respect to the number of explanatory variables. In this case, the
of the proposed method was higher than the case of the Top 5 percentile of the random walk model until seven variables. This means that the null hypothesis was rejected with a significant level of 5% and proposed model was better than the random walk model statistically. However, in the case of more than eight variables, the null hypothesis with a significance level of 5% was not rejected. Therefore, we considered the proposed method with seven variables appropriate. The number of explanatory variables is determined as
where
(proposed method) shows the coefficient of determination of the proposed method with the top
k explanatory variables and
(RW 5%) is the coefficient of determination of the random walk model in the top 5% percentile.
2.4. Daily Index and Smoothing
The daily index
can be calculated by the regression coefficients and daily frequency of words (
),
, as shown in
Figure 9a, suddenly increased in response to the news related to the words. To reduce this effect, we applied the optimal moving average [
13]. The optimal moving average (
) of
is calculated as follows:
where
and the coefficients (
) are shown in
Figure 9b. It has a large weight in a small
k and it has a small peak at
. We considered it a weekly period, and it almost converged to 0 around
. Comparing
and
(
Figure 9a), we found that sudden peaks are reduced in the case of
, and it fluctuates around the monthly economic index (CI).