CN104063747A - Performance abnormality prediction method in distributed system and system - Google Patents

Info

Publication number
CN104063747A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410294472.2A
Other languages
Chinese (zh)
Inventor
曹健
杨定裕
仇沂
顾骅
沈琪骏
王烺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Landscapes

  • Testing And Monitoring For Control Systems (AREA)

Abstract

The invention relates to a performance anomaly prediction method and system for a distributed system. Historical and real-time performance data are collected through the monitoring system of the distributed environment; characteristic values are used to describe the data and to construct the patterns of a performance variable; a classification model is trained by naive Bayes classification; the current data pattern is compared with the historical data patterns to find the historical pattern most similar to it; and finally, whether the current data pattern is in an abnormal state is predicted with the naive Bayes prediction model. For performance anomaly prediction in a distributed system, the method and system consider the characteristics of a variable comprehensively and achieve high accuracy. A Bayesian model, a machine learning method, guides the prediction: performance anomalies are detected in real time, and each detected prediction is evaluated with the previously trained Bayesian model, which raises the confidence of the prediction, gives a high degree of automation, and improves the reliability and practicality of the prediction.

Description

Performance anomaly prediction method and system in a distributed system
Technical Field
The present invention relates to a performance anomaly detection and prediction method and system, and in particular, to a performance anomaly prediction method and system in a distributed system.
Background
In a distributed system, the computers are independent of each other, may be physically adjacent or geographically distributed, and are connected by a network or other means to form a whole. From the research point of view, distributed computing has the following characteristics: 1. resource sharing; 2. scalability; 3. fault tolerance; 4. concurrency.
Monitoring of a distributed computing environment becomes particularly important and critical in order to better embody the powerful ability of distributed computing to handle data computations. The system must coordinate the operation of these tasks, allocate resources reasonably so that the resources are fully utilized and improve the performance of the whole system. Typically, the system employs a scheduler to manage these tasks. The scheduler will gather information about the various resources in the system to determine whether the resources are available, and then the scheduling algorithm will prioritize and assign the tasks to their available resources based on the availability of the resources, the running time of the tasks, etc. However, as the task runs, the states of various resources, such as CPU load, remaining memory, and remaining space of the hard disk, change at any time, and if it is predicted that the resources will still be available at a future time before scheduling is performed, and the use of the resources in an abnormal period is reasonably avoided, the scheduling result of the system will be more ideal. Therefore, it is important to monitor the resources in the system in real time and detect a precursor to an anomaly before it occurs.
A system performance anomaly is the phenomenon in which the performance of a computer system gradually degrades to an intolerable degree during software operation, owing to the gradual exhaustion of resources or the gradual accumulation of operation errors. It typically appears as system state behavior (e.g., CPU load, memory usage) that can no longer sustain the running applications. Most existing anomaly prediction models are based only on regression techniques, whose specific limitations leave each model with its own defects: some suit only specific data, others have large prediction errors. Existing classification-based anomaly prediction models require labels to be assigned to the historical data manually, so the degree of automation is low, and they consider only the values of variables rather than their characteristics as a whole, so the prediction results carry a certain error.
Disclosure of Invention
The invention aims to provide a performance anomaly prediction method and system for a distributed system, solving the problems that performance prediction in a distributed environment has a low degree of automation and that the characteristics of variables cannot be considered comprehensively from the variable values alone.
In order to solve the above problems, the present invention relates to a performance anomaly prediction method in a distributed system, comprising the following steps:
S1: extracting a target data value from historical performance data obtained by a plurality of monitoring nodes in a monitoring system to serve as a training data source, and calculating characteristic values of historical data patterns in the data source;
S2: respectively obtaining prior probability distribution of each historical data mode in various states according to the characteristic value of each historical data mode, and counting the probability distribution of each state, thereby training Bayesian models of the states of each data mode;
S3: calculating a characteristic value of a current data mode according to real-time performance data acquired by a monitoring system;
S4: finding a data pattern most similar to a current data pattern from the historical data patterns;
S5: predicting through a Bayesian model trained in S2 according to the output result of S4 to respectively obtain probability distributions of the multiple states;
S6: a confidence factor and an abnormality threshold are set based on the result in S5, and an abnormal state is predicted if the confidence factor exceeds the abnormality threshold.
Preferably, the characteristic values include a performance value change amount, a performance value change rate, and a performance value.
Preferably, in S2, the standard deviations of each characteristic value over all historical data patterns are sorted by value and divided into a plurality of subspaces, and the prior probability of a specific state is calculated for the subspace corresponding to each standard deviation.
Preferably, in S2, a bayesian model of each historical data pattern is trained according to the feature values of each historical data pattern, and prior probabilities of multiple states of each pattern are obtained respectively.
Preferably, S4 further includes:
calculating the standard deviation of the characteristic values between the current data mode and each historical normal mode;
and obtaining the historical data pattern with the minimum sum of all standard deviations of the current data pattern as the most similar pattern of the current data pattern.
Preferably, the states are an abnormal state, a warning state and a normal state.
Preferably, S6 further includes setting an alarm threshold, and predicting to be an alarm state if the confidence factor is between the alarm threshold and the abnormal threshold, and predicting to be a normal state if the confidence factor is smaller than the alarm threshold.
In order to solve the above problem, the present invention further relates to a performance anomaly prediction system in a distributed system, connected to a monitoring system of the distributed system, including:
the historical characteristic value calculation module extracts a target data value from historical performance data obtained by a plurality of monitoring nodes in the monitoring system to serve as a training data source and calculates characteristic values of historical data patterns in the data source;
the prior probability module is connected with the output end of the historical characteristic value calculation module, respectively obtains prior probability distribution of each historical data mode in various states according to the characteristic value of each historical data mode, and counts the probability distribution of each state, thereby training the Bayesian model of each data mode;
the real-time characteristic value calculating module is used for calculating the characteristic value of the current data mode according to the real-time performance data acquired by a plurality of monitoring nodes in the monitoring system;
the similar mode module is connected with the output end of the historical characteristic value calculating module and the output end of the real-time characteristic calculating module, and finds a data mode which is most similar to the current data mode from the historical data modes;
the probability calculation module predicts through the Bayesian model trained in the prior probability module according to the output result of the similar mode module and respectively obtains the probability distribution of the multiple states; and
the abnormal alarm module is used for setting a confidence factor and an abnormal threshold according to the result in the probability calculation module, and predicting an abnormal state if the confidence factor exceeds the abnormal threshold.
Preferably, the characteristic values include a performance value change amount, a performance value change rate, and a performance value.
Preferably, the states include an abnormal state, a warning state and a normal state.
Due to the adoption of the technical scheme, compared with the prior art, the invention has the following advantages and positive effects:
1) for performance anomaly prediction in a distributed system, the invention analyzes the performance of the distributed nodes through characteristic values and divided data patterns, so that the characteristics of variables are considered comprehensively and the accuracy is higher;
2) according to the invention, the Bayes model of the machine learning method is adopted to guide prediction, the performance abnormal condition is detected in real time, and the detected prediction is evaluated and analyzed through the Bayes model obtained before, so that the prediction confidence is provided, the automation degree is high, and the reliability and the practicability of the prediction are improved;
3) the invention discretizes the standard deviations of the characteristic values of each historical data pattern into a plurality of subspaces, uses the subspaces as the parameters for training the Bayesian model, and calculates the prior probability of a specific state for each subspace, which further improves the accuracy of anomaly prediction.
Drawings
FIG. 1 is a flow chart of a performance anomaly prediction method in a distributed system in accordance with the present invention;
fig. 2 is a system block diagram of a performance anomaly prediction system in a distributed system according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings of the present invention, and it is obvious that what is described herein is only a part of the embodiments of the present invention, and not all of the embodiments, and all other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative efforts belong to the protection scope of the present invention.
For the convenience of understanding the embodiments of the present invention, the following description will be further explained by taking specific embodiments as examples with reference to the drawings, and the embodiments are not to be construed as limiting the embodiments of the present invention.
Example one
Referring to fig. 1, the present invention provides a performance anomaly prediction method in a distributed system, which mainly includes the following steps:
S1: extracting a target data value from historical performance data obtained by a plurality of monitoring nodes in a monitoring system to serve as a training data source, and calculating characteristic values of historical data patterns in the data source;
In this embodiment, each data point is described by three characteristic values: the change amount of the performance value (Change Value, CV), the change rate of the performance value (Change Rate, CR), and the performance value (Value, V). The performance value $V_{t_i}$ is the value of the performance metric at time $t_i$.
The change amount of the performance value is the difference between the performance metric at time $t_i$ and at the preceding time $t_{i-1}$:
$$CV(t_i) = V_{t_i} - V_{t_{i-1}}$$
where $V_{t_i}$ is the performance value at time $t_i$, $i = 0, 1, \ldots, n$;
$V_{t_{i-1}}$ is the performance value at time $t_{i-1}$, $i = 1, \ldots, n$.
The change rate of the performance value is equal to the change amount of the performance value divided by the performance value at the current time $t_i$:
$$CR(t_i) = \frac{V_{t_i} - V_{t_{i-1}}}{V_{t_i}}$$
where $V_{t_i}$ is the performance value at time $t_i$, $i = 0, 1, \ldots, n$;
$V_{t_{i-1}}$ is the performance value at time $t_{i-1}$, $i = 1, \ldots, n$.
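As a minimal illustrative sketch, and not part of the claimed method, the three characteristic values of S1 might be computed as follows. The function name `extract_features`, the sample series, and the choice to return a change rate of 0.0 when the current performance value is zero are assumptions introduced here for illustration only.

```python
from typing import List, Tuple


def extract_features(values: List[float]) -> List[Tuple[float, float, float]]:
    """Return (CV, CR, V) for every time index i >= 1 of a metric series."""
    features = []
    for i in range(1, len(values)):
        v = values[i]                    # performance value V at t_i
        cv = v - values[i - 1]           # change amount CV(t_i) = V(t_i) - V(t_{i-1})
        # Assumption: avoid division by zero by reporting CR = 0.0 when V(t_i) is 0.
        cr = cv / v if v != 0 else 0.0   # change rate CR(t_i) = CV(t_i) / V(t_i)
        features.append((cv, cr, v))
    return features


# Hypothetical CPU-load samples collected by one monitoring node.
cpu_load = [2.0, 4.0, 3.0]
for cv, cr, v in extract_features(cpu_load):
    print(f"CV={cv:.3f} CR={cr:.3f} V={v:.3f}")
```

The first sample has no predecessor, so a series of n+1 values yields n feature triples.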
S2: respectively obtaining prior probability distribution of each historical data mode in various states according to the characteristic value of each historical data mode, and counting the probability distribution of each state, thereby training Bayesian models of the states of each data mode;
According to the characteristic values obtained in S1, the historical data are divided into a plurality of patterns, and each pattern is labeled with one of three states: abnormal, warning, or normal. The prior probability distributions are then trained from the labeled patterns: the probability distribution of each pattern in each state is counted, and the Bayesian models of the patterns are trained. To further improve model accuracy, the features of the patterns are discretized into a plurality of subspaces, and the subspaces are used as the parameters for training the Bayesian model.
In S2, a bayesian model of each historical data pattern may be trained according to the feature values of each historical data pattern, and prior probabilities of multiple states of each pattern may be obtained. The plurality of states may be an abnormal state, a warning state, and a normal state.
A classification model is built using a naive Bayes classifier. Naive Bayes classification requires the parameters to be mutually independent; a pattern here is described by three mutually independent parameters, so this requirement is met.
Assume that the current time is $t_i$. The characteristic values of all data in the period from $t_{i-L}$ to $t_i$ then constitute the current data pattern, where L is the length of the current data pattern.
During training, each pattern in the training data is tagged with a label indicating its state, i.e., a pattern may be represented as $(V_{t_1}, V_{t_2}, \ldots, V_{t_n}, Status)$. Using the labeled training data set, the prior probability distribution of all patterns in the three states can be obtained:
$$P((SD_{CV}, SD_{CR}, SD_V) \mid status)$$
where status is the normal state (normal), the alarm state (alert), or the abnormal state (abnormal).
This is the probability that the three standard deviations of the most similar pattern are $SD_{CV}$, $SD_{CR}$ and $SD_V$, given that the state of the pattern is status. The distribution of each state, $P(status)$, can also be obtained from the training data.
From these prior probabilities, the probability of a specific state given the standard deviation values follows from Bayes' rule:
$$P(status \mid (SD_{CV}, SD_{CR}, SD_V)) = \frac{P((SD_{CV}, SD_{CR}, SD_V) \mid status)\, P(status)}{P((SD_{CV}, SD_{CR}, SD_V))}$$
As mentioned above, the three parameters are independent of each other, so this can be expressed as:
$$P(status \mid (SD_{CV}, SD_{CR}, SD_V)) = \frac{P(SD_{CV} \mid status)\, P(SD_{CR} \mid status)\, P(SD_V \mid status)\, P(status)}{P(SD_{CV})\, P(SD_{CR})\, P(SD_V)}$$
To further improve model accuracy, the standard deviations of each characteristic value over all historical data patterns can be sorted by value and divided into a plurality of subspaces, and the prior probability of a specific state is calculated for each subspace, where the specific state can be the abnormal state, the warning state, or the normal state.
The pattern space is divided into a plurality of subspaces, each covering all characteristic values within one continuous value range, which yields a set of discrete subspaces that serve as the parameters of the naive Bayes classification. For example, let the full value range of the standard deviation $SD_{CR}$ of the performance value change rate be $r = [a, b]$, where a is the minimum value and b the maximum value that this standard deviation takes. Dividing this range into m subspaces, the length of each subspace is:
$$\Delta r = \frac{b - a}{m}$$
Therefore, the subspaces can be represented as:
$$S_{SD_{CR},1} = [a, a + \Delta r],\quad S_{SD_{CR},2} = [a + \Delta r, a + 2\Delta r],\quad \ldots,\quad S_{SD_{CR},m} = [b - \Delta r, b]$$
Each standard deviation of the performance value change rate is simply placed into the appropriate subspace. The prior probability of a specific state therefore need not be calculated for every individual standard deviation, only for each subspace:
$$P(status \mid (SD_{CV}, SD_{CR}, SD_V)) = \frac{P(S_{SD_{CV},i} \mid status)\, P(S_{SD_{CR},j} \mid status)\, P(S_{SD_V,k} \mid status)\, P(status)}{P(S_{SD_{CV},i})\, P(S_{SD_{CR},j})\, P(S_{SD_V,k})}$$
where $S_{SD_{CV},i}$ is the subspace corresponding to the standard deviation $SD_{CV}$ of the performance value change amount;
$S_{SD_{CR},j}$ is the subspace corresponding to the standard deviation $SD_{CR}$ of the performance value change rate;
$S_{SD_V,k}$ is the subspace corresponding to the standard deviation $SD_V$ of the performance value;
status is a specific state: normal, alert, or abnormal.
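The subspace division and the naive Bayes calculation of S2 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of the invention: the names `bin_index` and `BinnedNaiveBayes`, the training rows, and the use of Laplace smoothing are all hypothetical additions, and the sketch normalizes the scores over the three states instead of dividing by the product of subspace probabilities, which leaves the ranking of the states unchanged.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Sequence, Tuple

STATES = ("normal", "alert", "abnormal")
Row = Tuple[float, float, float]  # (SD_CV, SD_CR, SD_V)


def bin_index(x: float, a: float, b: float, m: int) -> int:
    """Map x to one of m equal-width subspaces of [a, b] (0-based index)."""
    if b == a:
        return 0
    idx = int((x - a) / ((b - a) / m))
    return min(max(idx, 0), m - 1)  # clamp so that x == b lands in the last subspace


class BinnedNaiveBayes:
    def __init__(self, m: int = 4, smoothing: float = 1.0) -> None:
        self.m = m
        self.smoothing = smoothing  # Laplace smoothing: an assumption, not in the source

    def fit(self, rows: Sequence[Row], labels: Sequence[str]) -> None:
        # Value range [a, b] of each of the three standard deviations.
        self.ranges = [(min(r[k] for r in rows), max(r[k] for r in rows)) for k in range(3)]
        self.state_counts = Counter(labels)
        self.total = len(labels)
        self.bin_counts = defaultdict(int)  # (feature, subspace, state) -> count
        for row, state in zip(rows, labels):
            for k in range(3):
                a, b = self.ranges[k]
                self.bin_counts[(k, bin_index(row[k], a, b, self.m), state)] += 1

    def posterior(self, row: Row) -> Dict[str, float]:
        log_scores = {}
        for state in STATES:
            n_s = self.state_counts.get(state, 0)
            # log P(status), smoothed
            lp = math.log((n_s + self.smoothing) / (self.total + self.smoothing * len(STATES)))
            for k in range(3):
                a, b = self.ranges[k]
                c = self.bin_counts.get((k, bin_index(row[k], a, b, self.m), state), 0)
                # log P(subspace | status), smoothed
                lp += math.log((c + self.smoothing) / (n_s + self.smoothing * self.m))
            log_scores[state] = lp
        top = max(log_scores.values())
        weights = {s: math.exp(v - top) for s, v in log_scores.items()}
        z = sum(weights.values())
        return {s: w / z for s, w in weights.items()}


# Hypothetical labeled training patterns.
rows = [(0.1, 0.1, 0.1), (0.2, 0.1, 0.2), (0.5, 0.5, 0.5),
        (0.6, 0.5, 0.6), (0.9, 0.9, 0.9), (1.0, 1.0, 1.0)]
labels = ["normal", "normal", "alert", "alert", "abnormal", "abnormal"]
model = BinnedNaiveBayes(m=3)
model.fit(rows, labels)
print(model.posterior((0.15, 0.1, 0.15)))
```

Working in log space avoids underflow when many small probabilities are multiplied; the smoothing keeps an empty subspace from forcing a state's probability to zero.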
S3: Calculate the characteristic values of the current data pattern from the real-time performance data acquired by the monitoring system. Assume that the current time is $t_i$; the characteristic values of the data in the period from $t_{i-L}$ to $t_i$ then constitute the current data pattern, where L is the length of the current data pattern.
S4: finding a data pattern which is most similar to the current data pattern from the historical data patterns;
Specifically, this step comprises:
S41: calculating the standard deviation of the characteristic values between the current data pattern and each historical normal pattern;
The data at each time $t_i$ have three features, namely $(CV(t_i), CR(t_i), V(t_i))$. Assume that the current time is $t_i$; the features of all data in the period from $t_{i-L}$ to $t_i$ then constitute the pattern of the current performance metric, where L is the length of the current data pattern.
As shown in fig. 2, the current pattern is compared with the historical normal patterns to find the one most similar to the current data pattern. The standard deviation of each feature between the current data pattern and each historical normal pattern is calculated. If a historical data pattern starts at time $t_{j-L}$ and ends at time $t_j$, the standard deviation of the performance value change amount between the current data pattern and that historical data pattern is denoted $SD_{CV}(t_j)$, the standard deviation of the performance value change rate is denoted $SD_{CR}(t_j)$, and the standard deviation of the performance value is denoted $SD_V(t_j)$. The current data pattern is compared with the historical data patterns one by one.
S42: if the sum of all standard deviations between the current data pattern and a historical data pattern is minimal, that historical data pattern is taken as the most similar pattern of the current data pattern.
That is, a pattern in the historical data ending at time $t_k$ is the most similar pattern of the current data pattern when it satisfies the following formula:
$$SD_{CV}(t_k) + SD_{CR}(t_k) + SD_V(t_k) = \min_j \{ SD_{CV}(t_j) + SD_{CR}(t_j) + SD_V(t_j) \}$$
where $\{ SD_{CV}(t_j) + SD_{CR}(t_j) + SD_V(t_j) \}$ is the set of sums of feature standard deviations between the current data pattern and all historical data patterns;
min denotes the minimum of the set.
Thus, for each current data pattern, the most similar pattern in the history can be found:
$$(SD_{CV}(t_k), SD_{CR}(t_k), SD_V(t_k))$$
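The search of S41 and S42 can be sketched as follows, purely for illustration. The description does not spell out how the standard deviation between two patterns is computed, so this sketch assumes the root-mean-square difference of the aligned feature sequences; the names `feature_sd` and `most_similar` and the sample patterns are likewise hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]   # (CV, CR, V) at one time
Pattern = Sequence[Point]            # the points from t_{j-L} to t_j


def feature_sd(current: Pattern, historical: Pattern, k: int) -> float:
    """Assumed definition: RMS difference of feature k between two aligned patterns."""
    if len(current) != len(historical) or not current:
        raise ValueError("patterns must be non-empty and of equal length")
    squared = [(c[k] - h[k]) ** 2 for c, h in zip(current, historical)]
    return math.sqrt(sum(squared) / len(squared))


def most_similar(current: Pattern, history: List[Pattern]) -> Tuple[int, Tuple[float, float, float]]:
    """Return the index of the historical pattern with the minimal sum
    SD_CV + SD_CR + SD_V, together with the three standard deviations."""
    best_index, best_sds = -1, None
    for index, hist in enumerate(history):
        sds = tuple(feature_sd(current, hist, k) for k in range(3))
        if best_sds is None or sum(sds) < sum(best_sds):
            best_index, best_sds = index, sds
    if best_sds is None:
        raise ValueError("history must contain at least one pattern")
    return best_index, best_sds


# Hypothetical patterns of length 2.
current = [(0.0, 0.0, 1.0), (1.0, 0.5, 2.0)]
history = [
    [(5.0, 5.0, 5.0), (5.0, 5.0, 5.0)],
    [(0.1, 0.0, 1.0), (1.0, 0.5, 2.1)],
]
index, sds = most_similar(current, history)
print(index, sds)
```

The returned triple plays the role of $(SD_{CV}(t_k), SD_{CR}(t_k), SD_V(t_k))$ and is what the Bayesian model receives as input in the next step.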
S5: predicting through the Bayesian model trained in S2 according to the output result of S4, and respectively obtaining the probability distribution of the multiple states;
In this embodiment, the most similar pattern $(SD_{CV}(t_k), SD_{CR}(t_k), SD_V(t_k))$ found in S4 is passed to the Bayesian model trained in S2, which yields the probability of each state of the pattern:
$$P(status \mid (SD_{CV}, SD_{CR}, SD_V)) = \frac{P((SD_{CV}, SD_{CR}, SD_V) \mid status)\, P(status)}{P((SD_{CV}, SD_{CR}, SD_V))}$$
Once the state probabilities are obtained, the state of the pattern can be determined. Judging the state accurately makes it possible to capture the precursor of an anomaly and thus to predict the anomaly.
S6: a confidence factor and an abnormality threshold are set based on the result in S5, and an abnormal state is predicted if the confidence factor exceeds the abnormality threshold.
The method further comprises setting an alarm threshold: if the confidence factor lies between the alarm threshold and the abnormality threshold, the alarm state is predicted, and if the confidence factor is smaller than the alarm threshold, the normal state is predicted. An alarm mechanism also needs to be preset, so that defensive measures are taken once an alarm is raised.
In the present embodiment, for the current pattern $(SD_{CV}, SD_{CR}, SD_V)$, the probabilities corresponding to the three states are obtained by the above method:
$$P(normal \mid (SD_{CV}, SD_{CR}, SD_V))$$
$$P(alert \mid (SD_{CV}, SD_{CR}, SD_V))$$
$$P(abnormal \mid (SD_{CV}, SD_{CR}, SD_V))$$
To determine which state the pattern is in, the probabilities of the three states are compared:
$$\delta_1 = \log P(alert \mid (SD_{CV}, SD_{CR}, SD_V)) - \log P(normal \mid (SD_{CV}, SD_{CR}, SD_V))$$
$$\delta_2 = \log P(alert \mid (SD_{CV}, SD_{CR}, SD_V)) - \log P(abnormal \mid (SD_{CV}, SD_{CR}, SD_V))$$
If the following condition is met, the current data pattern is judged to be in the alarm state, meaning that an anomaly may follow:
$$\delta_1 \ge 0 \ \text{and} \ \delta_2 \ge 0$$
$\delta_1$ indicates whether the current data pattern is more likely to be in the alarm state or in the normal state, and $\delta_2$ indicates whether it is more likely to be in the alarm state or in the abnormal state. If the condition above is satisfied, the current data pattern is more likely to be in the alarm state than in the normal or abnormal state, and it can be determined that an anomaly is likely to occur next.
When an alarm predicting an anomaly is issued, $\delta_1 \ge 0$, and the larger $\delta_1$ is, the more likely the pattern is in the alarm state rather than the normal state. Likewise, $\delta_2 \ge 0$, and the larger $\delta_2$ is, the more likely the pattern is in the alarm state rather than the abnormal state. In other words, the larger $|\delta_1|$ and $|\delta_2|$ are, the more reliable the prediction result, so $|\delta_1|$ and $|\delta_2|$ can serve as reference indices of the credibility of the anomaly prediction. Each anomaly prediction made is assigned a Confidence Factor (CF), which is calculated as follows:
CF=δ12
Clearly, the more likely the pattern is in the alert state, the larger the CF value, so CF is an effective measure of the confidence of the anomaly prediction. The CF value indicates how reliable a prediction is, and the alarm threshold is set accordingly: if the confidence factor lies between the alarm threshold and the abnormality threshold, the alarm state is predicted; if the confidence factor is smaller than the alarm threshold, the normal state is predicted. An alarm mechanism must also be preset, so that defensive measures are taken in the alarm state and the abnormal state to prevent the anomaly or to reduce the loss it causes.
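The thresholding of S6 can be sketched as follows, again only as an illustration. The operator that combines $\delta_1$ and $\delta_2$ in the confidence factor formula is not legible in the text above, so this sketch assumes a sum; the function name `classify`, the threshold values, and the small floor `eps` applied before taking logarithms are also assumptions.

```python
import math
from typing import Dict, Tuple


def classify(posterior: Dict[str, float],
             alarm_threshold: float,
             abnormal_threshold: float,
             eps: float = 1e-12) -> Tuple[str, float]:
    """Return the predicted state and the confidence factor CF."""
    # Floor the probabilities so that log(0) cannot occur (an assumption).
    log_p = {state: math.log(max(p, eps)) for state, p in posterior.items()}
    delta1 = log_p["alert"] - log_p["normal"]     # alert versus normal
    delta2 = log_p["alert"] - log_p["abnormal"]   # alert versus abnormal
    cf = delta1 + delta2                          # assumption: CF is the sum of the two
    if cf > abnormal_threshold:
        state = "abnormal"    # CF exceeds the abnormality threshold
    elif cf >= alarm_threshold:
        state = "alert"       # CF lies between the two thresholds
    else:
        state = "normal"      # CF is below the alarm threshold
    return state, cf


# Hypothetical posterior and thresholds.
posterior = {"normal": 0.2, "alert": 0.6, "abnormal": 0.2}
state, cf = classify(posterior, alarm_threshold=1.0, abnormal_threshold=3.0)
print(state, round(cf, 3))
```

In practice the two thresholds would be tuned on historical data; the values used here have no significance beyond the example.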
Example two
Referring to fig. 2, the present invention provides a performance anomaly prediction system in a distributed system, which is connected to a monitoring system of the distributed system, and mainly includes: the device comprises a historical characteristic value calculating module, a prior probability module, a real-time characteristic value calculating module, a similar mode module, a probability calculating module and an abnormal alarm module.
The historical characteristic value calculation module extracts a target data value from historical performance data obtained by a plurality of monitoring nodes in the monitoring system to serve as a training data source and calculates characteristic values of historical data patterns in the data source;
In this embodiment, each data point is described by three characteristic values: the change amount of the performance value (Change Value, CV), the change rate of the performance value (Change Rate, CR), and the performance value (Value, V). The performance value $V_{t_i}$ is the value of the performance metric at time $t_i$.
The change amount of the performance value is the difference between the performance metric at time $t_i$ and at the preceding time $t_{i-1}$:
$$CV(t_i) = V_{t_i} - V_{t_{i-1}}$$
where $V_{t_i}$ is the performance value at time $t_i$, $i = 0, 1, \ldots, n$;
$V_{t_{i-1}}$ is the performance value at time $t_{i-1}$, $i = 1, \ldots, n$.
The change rate of the performance value is equal to the change amount of the performance value divided by the performance value at the current time $t_i$:
$$CR(t_i) = \frac{V_{t_i} - V_{t_{i-1}}}{V_{t_i}}$$
where $V_{t_i}$ is the performance value at time $t_i$, $i = 0, 1, \ldots, n$;
$V_{t_{i-1}}$ is the performance value at time $t_{i-1}$, $i = 1, \ldots, n$.
The prior probability module is connected with the output end of the historical characteristic value calculation module, respectively obtains prior probability distribution of each historical data mode in various states according to the characteristic value of each historical data mode, and counts the probability distribution of each state, thereby training the Bayesian model of each data mode;
According to the characteristic values output by the historical characteristic value calculation module, the historical data are divided into a plurality of patterns, and each pattern is labeled with one of three states: abnormal, warning, or normal. The prior probability distributions are then trained from the labeled patterns: the probability distribution of each pattern in each state is counted, and the Bayesian models of the patterns are trained. To further improve model accuracy, the features of the patterns are discretized into a plurality of subspaces, and the subspaces are used as the parameters for training the Bayesian model.
In the prior probability module, a Bayesian model of each historical data mode can be trained according to the characteristic value of each historical data mode, and prior probabilities of various states of each mode are obtained respectively. The plurality of states may be an abnormal state, a warning state, and a normal state.
A classification model is built using a naive Bayes classifier. Naive Bayes classification requires the parameters to be mutually independent; a pattern here is described by three mutually independent parameters, so this requirement is met.
Assume that the current time is $t_i$. The characteristic values of all data in the period from $t_{i-L}$ to $t_i$ then constitute the current data pattern, where L is the length of the current data pattern.
During training, each pattern in the training data is tagged with a label indicating its state, i.e., a pattern may be represented as $(V_{t_1}, V_{t_2}, \ldots, V_{t_n}, Status)$. Using the labeled training data set, the prior probability distribution of all patterns in the three states can be obtained:
$$P((SD_{CV}, SD_{CR}, SD_V) \mid status)$$
where status is the normal state (normal), the alarm state (alert), or the abnormal state (abnormal).
This is the probability that the three standard deviations of the most similar pattern are $SD_{CV}$, $SD_{CR}$ and $SD_V$, given that the state of the pattern is status. The distribution of each state, $P(status)$, can also be obtained from the training data.
From these prior probabilities, the probability of a specific state given the standard deviation values follows from Bayes' rule:
$$P(status \mid (SD_{CV}, SD_{CR}, SD_V)) = \frac{P((SD_{CV}, SD_{CR}, SD_V) \mid status)\, P(status)}{P((SD_{CV}, SD_{CR}, SD_V))}$$
As mentioned above, the three parameters are independent of each other, so this can be expressed as:
$$P(status \mid (SD_{CV}, SD_{CR}, SD_V)) = \frac{P(SD_{CV} \mid status)\, P(SD_{CR} \mid status)\, P(SD_V \mid status)\, P(status)}{P(SD_{CV})\, P(SD_{CR})\, P(SD_V)}$$
in order to further improve the model correctness, the various eigenvalue variances of all historical data modes can be arranged according to the value size and divided into a plurality of subspaces, and the prior probability of the specific state of the eigenvalue variance corresponding to each subspace is calculated, wherein the specific state can be an abnormal state, a warning state or a normal state.
The pattern space is divided into a plurality of subspaces, each subspace comprises all specific characteristic values existing in a continuous value range, so that a plurality of discrete subspaces are obtained, and the subspaces are used as parameters of naive Bayes classification. For example, the variance SD of the rate of change of the performance valuesCRAll value ranges of (a) are r ═ a, b]Where a is the minimum value taken by the variance of the rate of change of the performance value and b is the maximum value taken by the variance of the rate of change of the performance value.Dividing the space into m subspaces, the length of each subspace is:
$$\Delta r = \frac{b - a}{m}$$
Therefore, each subspace can be represented as:
$$S_{SDCR_1} = [a, a + \Delta r],\quad S_{SDCR_2} = [a + \Delta r, a + 2\Delta r],\quad \ldots,\quad S_{SDCR_m} = [b - \Delta r, b]$$
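The equal-width division can be sketched as follows. The helper name `subspace_index` is hypothetical. The text lists closed intervals that share their end points, so the tie-break used here (a boundary value goes to the higher subspace, except that b stays in the last one) is an assumption of this sketch.

```python
def subspace_index(value, a, b, m):
    """Return the 1-based index of the equal-width subspace of [a, b] holding value.

    Assumption: boundary values go to the higher subspace, except b itself,
    which is kept in subspace m.
    """
    if m <= 0 or b <= a:
        raise ValueError("need m > 0 and a < b")
    if not a <= value <= b:
        raise ValueError("value lies outside [a, b]")
    delta_r = (b - a) / m                      # length of each subspace
    index = int((value - a) // delta_r) + 1    # 1-based subspace number
    return min(index, m)                       # keep value == b in the last subspace
```

For example, with a = 0, b = 10 and m = 5, each subspace has length 2, so a variance of 3.9 falls into the second subspace.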
Each variance of the performance value change rate is simply placed into the subspace that contains it. Consequently, the prior probability of a specific state need not be calculated for every individual variance value, only for each subspace:
$$P(\mathrm{status} \mid (SD_{CV}, SD_{CR}, SD_{V})) = \frac{P(S_{SDCV_i} \mid \mathrm{status})\, P(S_{SDCR_j} \mid \mathrm{status})\, P(S_{SDV_k} \mid \mathrm{status})\, P(\mathrm{status})}{P(S_{SDCV_i})\, P(S_{SDCR_j})\, P(S_{SDV_k})}$$
where $S_{SDCV_i}$ is the subspace containing the variance $SD_{CV}$ of the performance value change amount;
$S_{SDCR_j}$ is the subspace containing the variance $SD_{CR}$ of the performance value change rate;
$S_{SDV_k}$ is the subspace containing the variance $SD_{V}$ of the performance value;
status is a specific state: the normal state (normal), the alarm state (alert), or the abnormal state (abnormal).
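One way to obtain the per-subspace probabilities is simple counting over labelled historical patterns, sketched below. The names `train` and `state_distribution` and the sample layout are hypothetical. The sketch also normalises the numerators over the three states instead of dividing by the product of the subspace marginals; this is a substitution made here, and it leaves the ranking of the states unchanged because the denominator does not depend on the state. No smoothing is applied, so an unseen subspace yields probability zero.

```python
from collections import Counter

STATES = ("normal", "alert", "abnormal")


def train(samples):
    """Count-based estimates from labelled training samples.

    samples: iterable of ((i, j, k), status), where i, j, k are the subspace
    indices of SD_CV, SD_CR and SD_V for one historical pattern.
    """
    state_count = Counter()
    joint_count = [Counter(), Counter(), Counter()]  # (subspace, status) per feature
    for bins, status in samples:
        state_count[status] += 1
        for feature, subspace in enumerate(bins):
            joint_count[feature][(subspace, status)] += 1
    return state_count, joint_count


def state_distribution(model, bins):
    """Return {status: P(status | subspaces)}, normalised over STATES."""
    state_count, joint_count = model
    total = sum(state_count.values())
    scores = {}
    for status in STATES:
        n_status = state_count[status]
        score = n_status / total if total else 0.0  # P(status)
        for feature, subspace in enumerate(bins):
            if n_status:
                # P(subspace | status) estimated by relative frequency
                score *= joint_count[feature][(subspace, status)] / n_status
        scores[status] = score
    norm = sum(scores.values())
    return {s: (v / norm if norm else 0.0) for s, v in scores.items()}
```

In practice some form of smoothing would usually be added so that a single unseen subspace does not force a state's probability to zero.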
The real-time characteristic value calculating module calculates the characteristic values of the current data pattern from the real-time performance data acquired by a plurality of monitoring nodes in the monitoring system. Assume that the current time is $t_i$; the current data pattern then covers the period from $t_{i-L}$ to $t_i$, where $L$ is the length of the current data pattern.
The similar mode module is connected to the real-time characteristic value calculating module and the historical characteristic value calculation module, and finds the data pattern most similar to the current data pattern among the historical data patterns.
It specifically comprises the following modules:
The historical data comparison module is connected to the real-time characteristic value calculating module and the historical characteristic value calculation module, and calculates the standard deviation of the characteristic values between the current data pattern and each historical normal pattern.
The data at each time $t_i$ has three features, namely $(CV(t_i), CR(t_i), V(t_i))$. Assume that the current time is $t_i$; then the features of all data within the period from $t_{i-L}$ to $t_i$ constitute the pattern of the current performance metric, where $L$ is the length of the current data pattern.
As shown in fig. 2, the current pattern is compared with the historical normal patterns, and the pattern most similar to the current data pattern is found among them. The standard deviation of each feature between the current data pattern and each historical normal pattern is calculated. If a historical data pattern begins at time $t_{j-L}$ and ends at time $t_j$, the standard deviation of the performance value change amount between the current data pattern and that historical data pattern is denoted $SD_{CV}(t_j)$, the standard deviation of the performance value change rate is denoted $SD_{CR}(t_j)$, and the standard deviation of the performance value is denoted $SD_{V}(t_j)$. The current data pattern is compared with the previous historical data patterns one by one.
The minimum variance acquisition module is connected to the output end of the historical data comparison module. If the sum of all standard deviations between the current data pattern and a historical data pattern is minimal, that historical data pattern is set as the most similar pattern of the current data pattern.
When a pattern in the historical data satisfies the following formula:
$$SD_{CV}(t_k) + SD_{CR}(t_k) + SD_{V}(t_k) = \min_{j} \{ SD_{CV}(t_j) + SD_{CR}(t_j) + SD_{V}(t_j) \}$$
where $\{ SD_{CV}(t_j) + SD_{CR}(t_j) + SD_{V}(t_j) \}$ is the set of summed feature standard deviations between the current data pattern and all historical data patterns, and $\min$ denotes the minimum of that set.
That is, when the sum of all standard deviations between the current data pattern and this historical data pattern is minimal, the historical data pattern is said to be the most similar pattern of the current data pattern. Thus, for each current data pattern, the most similar pattern in the history can be found:
(SDCV(tk),SDCR(tk),SDV(tk)。
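The search for the most similar pattern can be sketched as a sliding-window scan over the history. The text does not spell out how the standard deviation between two patterns is computed, so this sketch assumes the root of the mean squared pointwise difference per feature. The function names and the dictionary layout are hypothetical as well.

```python
from math import sqrt

FEATURES = ("CV", "CR", "V")


def pattern_sd(current, historical):
    """Assumed deviation between two equally long windows of one feature:
    the root of the mean squared pointwise difference."""
    return sqrt(sum((c - h) ** 2 for c, h in zip(current, historical)) / len(current))


def most_similar(current, history, L):
    """Find the historical window that minimises SD_CV + SD_CR + SD_V.

    current: {feature: list of L + 1 values covering t_(i-L) .. t_i}
    history: {feature: full historical series}, all of equal length
    Returns (k, (SD_CV, SD_CR, SD_V)) for the best window ending at index k,
    or None if the history is shorter than one window.
    """
    best = None
    for j in range(L, len(history["V"])):
        sds = tuple(pattern_sd(current[f], history[f][j - L:j + 1]) for f in FEATURES)
        if best is None or sum(sds) < sum(best[1]):
            best = (j, sds)
    return best
```

The scan costs time proportional to the history length times the window length; a long history may call for an index or a sampled subset, which is outside this sketch.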
The probability calculation module takes the output of the similar mode module, predicts with the Bayesian model trained in the prior probability module, and obtains the probability distribution over the states.
In this embodiment, the most similar pattern $(SD_{CV}(t_k), SD_{CR}(t_k), SD_{V}(t_k))$ delivered by the minimum variance acquisition module is passed to the Bayesian model trained in the prior probability module, which gives the probability of each state for the pattern:
$$P(\mathrm{status} \mid (SD_{CV}, SD_{CR}, SD_{V})) = \frac{P((SD_{CV}, SD_{CR}, SD_{V}) \mid \mathrm{status})\, P(\mathrm{status})}{P((SD_{CV}, SD_{CR}, SD_{V}))}$$
Once these probabilities are obtained, the state of the pattern can be determined. Judging the pattern state accurately makes it possible to capture the precursors of an anomaly and thus to predict it.
The abnormality alarm module sets a confidence factor and an abnormality threshold according to the output of the probability calculation module, and predicts an abnormal state if the confidence factor exceeds the abnormality threshold.
In general, an alarm threshold is also set: if the confidence factor lies between the alarm threshold and the abnormality threshold, an alarm state is predicted, and if the confidence factor is smaller than the alarm threshold, a normal state is predicted. An alarm mechanism must also be configured in advance so that defensive measures can be taken once an alarm is raised.
In the present embodiment, for the current pattern $(SD_{CV}, SD_{CR}, SD_{V})$, the above method yields the probabilities of the three states:
$$P(\mathrm{normal} \mid (SD_{CV}, SD_{CR}, SD_{V}))$$
$$P(\mathrm{alert} \mid (SD_{CV}, SD_{CR}, SD_{V}))$$
$$P(\mathrm{abnormal} \mid (SD_{CV}, SD_{CR}, SD_{V}))$$
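To determine which state the pattern is in, the probabilities of the three states are compared:
$$\delta_1 = \log P(\mathrm{alert} \mid (SD_{CV}, SD_{CR}, SD_{V})) - \log P(\mathrm{normal} \mid (SD_{CV}, SD_{CR}, SD_{V}))$$
$$\delta_2 = \log P(\mathrm{alert} \mid (SD_{CV}, SD_{CR}, SD_{V})) - \log P(\mathrm{abnormal} \mid (SD_{CV}, SD_{CR}, SD_{V}))$$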
If the following condition is met, the current data pattern is judged to be in the alarm state, meaning that an anomaly may follow:
$$\delta_1 \ge 0 \quad \text{and} \quad \delta_2 \ge 0$$
$\delta_1$ indicates whether the current data pattern is more likely to be in the alarm state or in the normal state, and $\delta_2$ indicates whether it is more likely to be in the alarm state or in the abnormal state. If the condition $\delta_1 \ge 0$ and $\delta_2 \ge 0$ is satisfied, the current data pattern is at least as likely to be in the alarm state as in the normal or abnormal state, and it can be concluded that an anomaly is likely to occur next.
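The comparison can be sketched in a few lines. The function names are hypothetical, and the sketch assumes all three probabilities are strictly positive, since the logarithm of zero is undefined.

```python
from math import log


def alert_deltas(p_normal, p_alert, p_abnormal):
    """Return (delta_1, delta_2) as log-probability differences.

    Assumes all three probabilities are strictly positive.
    """
    delta_1 = log(p_alert) - log(p_normal)
    delta_2 = log(p_alert) - log(p_abnormal)
    return delta_1, delta_2


def is_alert(p_normal, p_alert, p_abnormal):
    """True when the alert state is at least as likely as both other states."""
    delta_1, delta_2 = alert_deltas(p_normal, p_alert, p_abnormal)
    return delta_1 >= 0 and delta_2 >= 0
```

With illustrative probabilities of 0.2 (normal), 0.5 (alert) and 0.3 (abnormal), both differences are positive and the pattern is judged to be in the alarm state.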
When an alarm predicting an anomaly is issued, if $\delta_1 \ge 0$, then the larger $\delta_1$ is, the more likely the pattern is in the alarm state rather than the normal state. Likewise, if $\delta_2 \ge 0$, then the larger $\delta_2$ is, the more likely the pattern is in the alarm state rather than the abnormal state. In other words, the larger $|\delta_1|$ and $|\delta_2|$ are, the more reliable the prediction, so $|\delta_1|$ and $|\delta_2|$ can serve as reference indices for the credibility of the anomaly prediction. Each anomaly prediction is assigned a Confidence Factor (CF), which is calculated as follows:
CF=δ12
Clearly, the more likely the pattern is in the alarm state, the larger the CF value, so CF is an effective measure of the confidence of the anomaly prediction. The CF value shows how reliable a prediction is, and the alarm threshold is chosen accordingly. If the confidence factor lies between the alarm threshold and the abnormality threshold, an alarm state is predicted; if it is smaller than the alarm threshold, a normal state is predicted. An alarm mechanism must also be configured in advance, so that in the alarm and abnormal states defensive measures can be taken to prevent the anomaly or reduce the loss it causes.
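The threshold rule can be sketched as below, taking an already computed CF value as input. The function name and the treatment of values that fall exactly on a threshold are assumptions of this sketch; the text only says "exceeds", "between" and "smaller than".

```python
def predict_state(cf, alarm_threshold, abnormal_threshold):
    """Map a confidence factor to a predicted state.

    Assumption: a CF equal to either threshold is treated as an alert.
    """
    if alarm_threshold > abnormal_threshold:
        raise ValueError("alarm threshold must not exceed abnormality threshold")
    if cf > abnormal_threshold:
        return "abnormal"
    if cf >= alarm_threshold:
        return "alert"
    return "normal"
```

For instance, with hypothetical thresholds of 1 (alarm) and 3 (abnormality), a CF of 2 yields an alert and a CF of 5 yields an abnormal prediction.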
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A performance anomaly prediction method in a distributed system is characterized by comprising the following steps:
S1: extracting a target data value from historical performance data obtained by a plurality of monitoring nodes in a monitoring system to serve as a training data source, and calculating characteristic values of historical data patterns in the data source;
S2: respectively obtaining prior probability distribution of each historical data mode in various states according to the characteristic value of each historical data mode, and counting the probability distribution of each state, thereby training Bayesian models of the states of each data mode;
S3: calculating a characteristic value of a current data mode according to real-time performance data acquired by the monitoring system;
S4: finding a data pattern most similar to the current data pattern from the historical data patterns;
S5: predicting through the Bayes model trained in S2 according to the output result of S4, and respectively obtaining probability distribution of multiple states;
S6: setting a confidence factor and an abnormality threshold based on the result in S5, and predicting an abnormal state if the confidence factor exceeds the abnormality threshold.
2. The method of claim 1, wherein the characteristic values include a performance value change amount, a performance value change rate, and a performance value.
3. The method of claim 1, wherein in S2, the variance of each eigenvalue of all historical data patterns is arranged according to the value size and divided into a plurality of subspaces, and the prior probability of a specific state of the variance of each eigenvalue corresponding to each subspace is calculated.
4. The method of claim 3, wherein in step S2, a Bayesian model of each historical data pattern is trained according to the feature values of each historical data pattern, and the prior probabilities of the multiple states of each pattern are obtained.
5. The method of claim 3, wherein the step S4 further comprises:
calculating the standard deviation of the characteristic values between the current data mode and each historical normal mode;
and obtaining the historical data pattern with the minimum sum of all standard deviations of the current data pattern as the most similar pattern of the current data pattern.
6. The method of claim 3, wherein the states are an abnormal state, a warning state and a normal state.
7. The method of claim 3, wherein the step S6 further comprises setting an alarm threshold, and predicting an alarm state if the confidence factor is between the alarm threshold and the abnormal threshold, and predicting a normal state if the confidence factor is less than the alarm threshold.
8. A performance anomaly prediction system in a distributed system, coupled to a monitoring system of the distributed system, comprising:
the historical characteristic value calculation module extracts a target data value from historical performance data obtained by a plurality of monitoring nodes in the monitoring system to serve as a training data source and calculates characteristic values of historical data patterns in the data source;
the prior probability module is connected with the output end of the historical characteristic value calculation module, respectively obtains prior probability distribution of each historical data mode in various states according to the characteristic value of each historical data mode, and counts the probability distribution of each state, thereby training the Bayesian model of each data mode;
the real-time characteristic value calculating module is used for calculating the characteristic value of the current data mode according to the real-time performance data acquired by a plurality of monitoring nodes in the monitoring system;
the similar mode module is connected with the output ends of the historical characteristic value calculation module and the real-time characteristic value calculating module, and finds a data mode which is most similar to the current data mode from the historical data modes;
the probability calculation module predicts through a Bayes model trained in the prior probability module according to the output result of the similar mode module and respectively obtains the probability distribution of the multiple states; and
the abnormal alarm module is used for setting a confidence factor and an abnormal threshold according to the result in the probability calculation module, and predicting an abnormal state if the confidence factor exceeds the abnormal threshold.
9. The system of claim 8, wherein the characteristic values include performance value variation, performance value variation rate, and performance value.
10. The system of claim 8, wherein the states include an abnormal state, an alert state, and a normal state.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410294472.2A CN104063747A (en) 2014-06-26 2014-06-26 Performance abnormality prediction method in distributed system and system

Publications (1)

Publication Number Publication Date
CN104063747A true CN104063747A (en) 2014-09-24

Family

ID=51551447

Country Status (1)

Country Link
CN (1) CN104063747A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103324155A (en) * 2012-03-19 2013-09-25 通用电气航空系统有限公司 System monitoring
KR20140056952A (en) * 2012-11-02 2014-05-12 주식회사 세이프티아 Method and system for evaluating abnormality detection

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
仇沂: "分布式环境中的性能异常预测监控", 《中国优秀硕士学位论文全书数据库 信息科技辑》 *

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105629947B (en) * 2015-11-30 2019-02-01 东莞酷派软件技术有限公司 Home equipment monitoring method, home equipment monitoring device and terminal
CN105629947A (en) * 2015-11-30 2016-06-01 东莞酷派软件技术有限公司 Household equipment monitoring method, household equipment monitoring device and terminal
US10748090B2 (en) 2016-01-21 2020-08-18 Alibaba Group Holding Limited Method and apparatus for machine-exception handling and learning rate adjustment
WO2017124953A1 (en) * 2016-01-21 2017-07-27 阿里巴巴集团控股有限公司 Method for processing machine abnormality, method for adjusting learning rate, and device
CN106991095A (en) * 2016-01-21 2017-07-28 阿里巴巴集团控股有限公司 Machine abnormal processing method, the method for adjustment of learning rate and device
CN105871879A (en) * 2016-05-06 2016-08-17 中国联合网络通信集团有限公司 Automatic network element abnormal behavior detection method and device
CN105871879B (en) * 2016-05-06 2019-03-05 中国联合网络通信集团有限公司 Network element abnormal behaviour automatic testing method and device
CN106095639A (en) * 2016-05-30 2016-11-09 中国农业银行股份有限公司 A kind of cluster subhealth state method for early warning and system
CN106125643A (en) * 2016-06-22 2016-11-16 华东师范大学 A kind of industry control safety protection method based on machine learning techniques
CN106293976A (en) * 2016-08-15 2017-01-04 东软集团股份有限公司 Application performance Risk Forecast Method, device and system
CN107943809A (en) * 2016-10-13 2018-04-20 阿里巴巴集团控股有限公司 Data quality monitoring method, device and big data calculating platform
CN107943809B (en) * 2016-10-13 2022-02-01 阿里巴巴集团控股有限公司 Data quality monitoring method and device and big data computing platform
CN108288161A (en) * 2017-01-10 2018-07-17 第四范式(北京)技术有限公司 The method and system of prediction result are provided based on machine learning
CN106897113A (en) * 2017-02-23 2017-06-27 郑州云海信息技术有限公司 The method and device of a kind of virtualized host operation conditions prediction
CN109297582A (en) * 2017-07-25 2019-02-01 台达电子电源(东莞)有限公司 The detection device and detection method of fan abnormal sound
CN109471783A (en) * 2017-09-08 2019-03-15 北京京东尚科信息技术有限公司 The method and apparatus for predicting task run parameter
CN107844406A (en) * 2017-10-25 2018-03-27 千寻位置网络有限公司 Method for detecting abnormality and system, service terminal, the memory of distributed system
CN109728923A (en) * 2017-10-27 2019-05-07 中移(苏州)软件技术有限公司 A kind of cloud platform running state monitoring method for early warning and device
CN109728923B (en) * 2017-10-27 2022-01-28 中移(苏州)软件技术有限公司 Cloud platform running state monitoring and early warning method and device
CN108089962A (en) * 2017-11-13 2018-05-29 北京奇艺世纪科技有限公司 A kind of method for detecting abnormality, device and electronic equipment
US11017340B2 (en) 2017-12-05 2021-05-25 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for cheat examination
CN108153591A (en) * 2017-12-05 2018-06-12 深圳竹信科技有限公司 Data flow real-time processing method, device and storage medium
CN109921955B (en) * 2017-12-12 2020-10-02 北京嘀嘀无限科技发展有限公司 Traffic monitoring method, system, computer device and storage medium
CN109921955A (en) * 2017-12-12 2019-06-21 北京嘀嘀无限科技发展有限公司 Portfolio monitoring method, system, computer equipment and storage medium
CN108039971A (en) * 2017-12-18 2018-05-15 北京搜狐新媒体信息技术有限公司 A kind of alarm method and device
CN108875815A (en) * 2018-06-04 2018-11-23 深圳市研信小额贷款有限公司 Feature Engineering variable determines method and device
WO2020078385A1 (en) * 2018-10-18 2020-04-23 杭州海康威视数字技术股份有限公司 Data collecting method and apparatus, and storage medium and system
CN110008979A (en) * 2018-12-13 2019-07-12 阿里巴巴集团控股有限公司 Abnormal data prediction technique, device, electronic equipment and computer storage medium
CN110209560A (en) * 2019-05-09 2019-09-06 北京百度网讯科技有限公司 Data exception detection method and detection device
CN110209560B (en) * 2019-05-09 2023-05-12 北京百度网讯科技有限公司 Data anomaly detection method and detection device
CN111159237A (en) * 2019-12-25 2020-05-15 中国平安财产保险股份有限公司 System data distribution method and device, storage medium and electronic equipment
CN111159237B (en) * 2019-12-25 2023-07-14 中国平安财产保险股份有限公司 System data distribution method and device, storage medium and electronic equipment
WO2021146996A1 (en) * 2020-01-22 2021-07-29 京东方科技集团股份有限公司 Training method for device metrics goodness level prediction model, and monitoring system and method
CN112650660A (en) * 2020-12-28 2021-04-13 北京中大科慧科技发展有限公司 Early warning method and device for power system of data center
CN112650660B (en) * 2020-12-28 2024-05-03 北京中大科慧科技发展有限公司 Early warning method and device for data center power system
CN113673916A (en) * 2021-10-25 2021-11-19 深圳市明源云科技有限公司 Risk data identification method, terminal device and computer-readable storage medium
CN113673916B (en) * 2021-10-25 2022-02-08 深圳市明源云科技有限公司 Risk data identification method, terminal device and computer-readable storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140924
