CN113886173A - Method, device and equipment for monitoring multi-node distributed cluster and readable medium - Google Patents



Publication number
CN113886173A
Authority
CN
China
Prior art keywords
response time
influxdb
database
cluster
node
Prior art date
Legal status
Withdrawn
Application number
CN202110999556.6A
Other languages
Chinese (zh)
Inventor
张书博
Current Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202110999556.6A
Publication of CN113886173A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention provides a method, a device, equipment and a readable medium for monitoring a multi-node distributed cluster. The method comprises the following steps: starting a telegraf component on each node in the cluster, deploying a plurality of influxdb databases in the cluster, establishing a mapping relation between the nodes and the influxdb databases, and storing the mapping relation in the configuration of the telegraf component of each node; training on the historically recorded response times of the influxdb databases to obtain an LSTM model; acquiring the response time of each influxdb database and predicting its future response time based on the LSTM model; comparing the predicted value with an alarm threshold; and, in response to the predicted value being larger than the alarm threshold, prompting that the influxdb database service needs capacity expansion. The scheme of the invention can reduce unnecessary waste of node resources, improve the security of use, and ensure the integrity, stability and usability of the functions of the artificial intelligence platform.

Description

Method, device and equipment for monitoring multi-node distributed cluster and readable medium
Technical Field
The present invention relates to the field of computers, and more particularly, to a method, an apparatus, a device, and a readable medium for monitoring of a multi-node distributed cluster.
Background
Private cloud platforms (such as the AIStation training platform) use technologies such as k8s to combine multiple physical nodes into a cluster. For the stability of the cluster and full utilization of its resources, the cluster is generally built as a distributed, highly available cluster. When a multi-node distributed high-availability cluster is used for deep learning model training and inference, monitoring the computing power of the underlying cluster resources, the node states and the like becomes particularly important: for example, whether the nodes in the cluster are healthy, whether the GPU (graphics processing unit) accelerator cards and their video memory are fully utilized, and whether the temperature of the GPU accelerator cards is normal. When these metrics are within the normal range, the data can be used for debugging a model or generating a report; when they are abnormal, the platform alarm system can be triggered to avoid unnecessary loss. A monitoring module is therefore necessary in a private cloud platform.
In the past, cluster monitoring on the platform could be implemented with the TIGK component combination (a cloud environment monitoring solution combining four components: Telegraf, InfluxDB, Grafana, and Kapacitor), which implements the resource monitoring steps of acquisition, storage, presentation, and alarm. Acquisition and storage are the two most critical steps: they are implemented by the telegraf component (a plugin-driven server agent, installed on all hosts in the cloud, that collects and reports metrics) and influxdb (a time series database built from the ground up to handle high write and query loads). When the number of nodes in the cluster is small, a single influxdb instance is enough to provide the data storage service. Every monitored node starts the telegraf service to monitor underlying resources, collects the node's monitoring item data according to a configuration file, and stores the data in influxdb at each collection interval. Finally, according to the platform functions, the platform service queries the influxdb database, integrates the data, and returns it for display. However, when the number of nodes in the cluster is large, this structure can lead to increased throughput, response timeouts, and even crashes.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method, an apparatus, a device, and a readable medium for monitoring a multi-node distributed cluster, which can keep the storage, reading and writing of monitoring data efficient, stable and load-balanced, make it easy for a user to view dynamic cluster resource information and compile reports, reduce unnecessary waste of node resources, improve the security of use, and ensure the integrity, stability, and availability of the functions of an artificial intelligence platform.
In view of the above, an aspect of the embodiments of the present invention provides a method for monitoring a multi-node distributed cluster, including the following steps:
starting a telegraf component on each node in the cluster, deploying a plurality of influxdb databases in the cluster, establishing a mapping relation between the nodes and the influxdb databases, and storing the mapping relation in the configuration of the telegraf component of each node;
training on the historically recorded response times of the influxdb databases to obtain an LSTM model;
acquiring the response time of each influxdb database and predicting its future response time based on the LSTM model;
comparing the predicted value with an alarm threshold;
and, in response to the predicted value being larger than the alarm threshold, prompting that the influxdb database service needs capacity expansion.
According to an embodiment of the present invention, starting a telegraf component on each node in the cluster, deploying a plurality of influxdb databases in the cluster, establishing a mapping relation between the nodes and the influxdb databases, and storing the mapping relation in the configuration of the telegraf component of each node includes:
installing telegraf components on all nodes needing to be monitored in a cluster, and starting the telegraf components after configuring monitoring acquisition items, acquisition time intervals and custom acquisition scripts of the telegraf components;
creating, in the expected number, containers in the cluster that run the influxdb database service, and configuring the corresponding IPs;
establishing a mapping relation between the monitored nodes managed by the telegraf components and the influxdb databases according to a consistent hash algorithm, and adding the mapping relation to the configuration of the telegraf components;
and collecting, by the telegraf component, node information and storing the collected information in the corresponding influxdb database based on the mapping relation.
According to an embodiment of the present invention, training on the historically recorded response times of the influxdb databases to obtain an LSTM model includes:
collecting the response times of the influxdb databases in historical data;
and training an LSTM neural network with the collected response times as the training set, using the alarm detection period as the unit of time, to obtain a prediction model.
According to one embodiment of the invention, acquiring the response time of each influxdb database and predicting its future response time based on the LSTM model comprises:
predicting, with the trained prediction model, the response time of the next alarm detection period from the response time of the current alarm detection period to obtain a predicted value;
feeding the predicted value and the real response time data obtained when the next alarm detection period arrives into Kalman filtering to obtain an optimal predicted value;
and passing the optimal predicted value as input to the LSTM model to predict the following alarm detection period, and obtaining more accurate response time prediction data through several iterations.
According to one embodiment of the invention, comparing the predicted value with the alarm threshold comprises:
comparing the predicted response time of each influxdb database with a first alarm threshold;
and comparing the difference between the maximum response time and the minimum response time of each influxdb database with a second alarm threshold.
In another aspect of the embodiments of the present invention, there is also provided an apparatus for monitoring a multi-node distributed cluster, including:
the creating module is configured to start a telegraf component on each node in the cluster, deploy a plurality of influxdb databases in the cluster, establish a mapping relation between the nodes and the influxdb databases, and store the mapping relation in the configuration of the telegraf component of each node;
the training module is configured to train on the historically recorded response times of the influxdb databases to obtain an LSTM model;
the prediction module is configured to acquire the response time of each influxdb database and predict its future response time based on the LSTM model;
the comparison module is configured to compare the predicted value with an alarm threshold;
and the alarm module is configured to prompt, in response to the predicted value being larger than the alarm threshold, that the influxdb database service needs capacity expansion or reduction.
According to one embodiment of the invention, the creation module is further configured to:
installing telegraf components on all nodes needing to be monitored in a cluster, and starting the telegraf components after configuring monitoring acquisition items, acquisition time intervals and custom acquisition scripts of the telegraf components;
creating, in the expected number, containers in the cluster that run the influxdb database service, and configuring the corresponding IPs;
establishing a mapping relation between the monitored nodes managed by the telegraf components and the influxdb databases according to a consistent hash algorithm, and adding the mapping relation to the configuration of the telegraf components;
and collecting, by the telegraf component, node information and storing the collected information in the corresponding influxdb database based on the mapping relation.
According to one embodiment of the invention, the training module is further configured to:
collecting the response times of the influxdb databases in historical data;
and training an LSTM neural network with the collected response times as the training set, using the alarm detection period as the unit of time, to obtain a prediction model.
In another aspect of an embodiment of the present invention, there is also provided a computer apparatus including:
at least one processor; and
a memory storing computer instructions executable on the processor, the instructions when executed by the processor implementing the steps of the method set forth in the above claims.
In another aspect of the embodiments of the present invention, a computer-readable storage medium is also provided, in which a computer program is stored, and the computer program is executed by a processor to implement the steps of the method set forth in the above technical solution.
The invention has the following beneficial technical effects. In the method for monitoring a multi-node distributed cluster provided by the embodiment of the invention, a telegraf component is started on each node in the cluster, a plurality of influxdb databases are deployed in the cluster, a mapping relation between the nodes and the influxdb databases is established, and the mapping relation is stored in the configuration of the telegraf component of each node; an LSTM model is obtained by training on the historically recorded response times of the influxdb databases; the response time of each influxdb database is acquired and its future response time is predicted based on the LSTM model; the predicted value is compared with an alarm threshold; and, in response to the predicted value being larger than the alarm threshold, a prompt is issued that the influxdb database service needs capacity expansion or reduction. This technical scheme keeps the storage, reading and writing of monitoring data efficient, stable and load-balanced, makes it easy for the user to view dynamic cluster resource information and compile reports, reduces unnecessary waste of node resources, improves the security of use, and ensures the integrity, stability and usability of the functions of the artificial intelligence platform.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other embodiments can be obtained by using the drawings without creative efforts.
FIG. 1 is a schematic flow chart diagram of a method of monitoring of a multi-node distributed cluster in accordance with one embodiment of the present invention;
FIG. 2 is a diagram illustrating the mapping relationship between nodes and influxdb databases, according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an apparatus for monitoring of a multi-node distributed cluster according to one embodiment of the present invention;
FIG. 4 is a schematic diagram of a computer device according to one embodiment of the present invention;
FIG. 5 is a schematic diagram of a computer-readable storage medium according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
In view of the above, a first aspect of the embodiments of the present invention proposes an embodiment of a method for monitoring a multi-node distributed cluster. Fig. 1 shows a schematic flow diagram of the method.
As shown in fig. 1, the method may include the steps of:
S1, a telegraf component is started on each node in the cluster, a plurality of influxdb databases are deployed in the cluster, a mapping relation between the nodes and the influxdb databases is established, and the mapping relation is stored in the configuration of the telegraf component of each node.
The telegraf component is installed on every node in the cluster that needs monitoring, and the telegraf service is started after configuration items such as the monitoring acquisition items, the acquisition time interval and the custom acquisition scripts have been set. Containers that run the influxdb database service are created in the cluster in the expected number, and the corresponding IPs are configured. The mapping relation between the monitored nodes managed by the telegraf components and the influxdb databases that store the monitoring data is initialized according to a consistent hash algorithm and added to the configuration of the telegraf components; for example, node1 and node2 are stored in influxdb1, and node3 is stored in influxdb2. The platform service is then started and queries the monitoring data in the corresponding influxdb database according to its business logic.
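As an illustration of how the mapping could end up in each node's telegraf configuration, the following Python sketch renders an `[[outputs.influxdb]]` section for every node. The IP addresses, the port 8086 and the database name `telegraf` are assumptions made for the example and are not specified by the text; only the grouping (node1 and node2 to influxdb1, node3 to influxdb2) comes from the description.

```python
# Hypothetical addresses for the two influxdb containers (not from the text).
INFLUXDB_IPS = {"influxdb1": "10.0.0.1", "influxdb2": "10.0.0.2"}

# Node-to-database grouping taken from the example in the description.
NODE_TO_DB = {"node1": "influxdb1", "node2": "influxdb1", "node3": "influxdb2"}


def render_output_section(influx_ip: str, port: int = 8086,
                          database: str = "telegraf") -> str:
    """Return a telegraf output fragment that points at one influxdb instance."""
    return "\n".join([
        "[[outputs.influxdb]]",
        f'  urls = ["http://{influx_ip}:{port}"]',
        f'  database = "{database}"',
    ])


def render_all(node_to_db: dict, db_ips: dict) -> dict:
    """Build the output fragment for every monitored node."""
    return {node: render_output_section(db_ips[db])
            for node, db in node_to_db.items()}


for node, fragment in render_all(NODE_TO_DB, INFLUXDB_IPS).items():
    print(f"# {node}\n{fragment}\n")
```

Each fragment would be merged into that node's telegraf configuration file, so a change in the mapping only requires regenerating the fragments of the affected nodes.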
S2, an LSTM model is obtained by training on the historically recorded response times of the influxdb databases.
The response times of the influxdb databases in the historical data are collected, and an LSTM neural network is trained with the collected response times as the training set, using the alarm detection period as the unit of time, to obtain an LSTM prediction model.
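One way to turn the collected response times into a training set is a sliding window, sketched below in Python. The assumptions here are not from the text: one response time value per alarm detection period, a hypothetical window length of 4 periods, and a single next-period target. The LSTM training itself is omitted because it depends on the chosen framework.

```python
def make_training_windows(response_times, window=4):
    """Pair each run of `window` consecutive periods with the period that follows.

    response_times: one value per alarm detection period, oldest first.
    Returns a list of (inputs, target) tuples for supervised training.
    """
    samples = []
    for start in range(len(response_times) - window):
        inputs = response_times[start:start + window]
        target = response_times[start + window]
        samples.append((inputs, target))
    return samples


# Hypothetical response times in milliseconds, one per detection period.
history_ms = [12.0, 13.5, 12.8, 15.1, 16.0, 18.2]
for inputs, target in make_training_windows(history_ms):
    print(inputs, "->", target)
```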
S3, the response time of each influxdb database is acquired and its future response time is predicted based on the LSTM model.
Using the trained prediction model, the response time of the next alarm detection period is predicted from the response time of the current alarm detection period to obtain a predicted value. The predicted value and the real response time data obtained when the next alarm detection period arrives are fed into Kalman filtering to obtain an optimal predicted value. The optimal predicted value is passed as input to the LSTM model to predict the following alarm detection period, and more accurate response time prediction data is obtained through several iterations.
S4, the predicted value is compared with the alarm threshold.
S5, in response to the predicted value being larger than the alarm threshold, a prompt is issued that the influxdb database service needs capacity expansion.
The platform initiates a scale-out or scale-in request. On scale-out, the newly added nodes are added to the mapping of the corresponding influxdb database according to the consistent hash algorithm; on scale-in, the nodes are removed directly. The mapping relation between the nodes in the cluster and the influxdb databases is then updated, and the monitoring data of the nodes whose mapping changed is migrated.
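The migration step can be pictured as a comparison of the mapping before and after the scaling request. The Python sketch below is a minimal illustration under assumed names (`migration_plan` and the sample mappings are hypothetical); it only lists which nodes' monitoring data would have to move and does not perform the data migration itself.

```python
def migration_plan(old_mapping: dict, new_mapping: dict) -> dict:
    """Return {node: (old_db, new_db)} for nodes whose database changed.

    Nodes that appear only in the new mapping have no stored data yet,
    so they are not part of the plan.
    """
    moves = {}
    for node, new_db in new_mapping.items():
        old_db = old_mapping.get(node)
        if old_db is not None and old_db != new_db:
            moves[node] = (old_db, new_db)
    return moves


# Hypothetical mappings before and after adding a third influxdb instance.
before = {"node1": "influxdb1", "node2": "influxdb1", "node3": "influxdb2"}
after = {"node1": "influxdb1", "node2": "influxdb3", "node3": "influxdb2",
         "node4": "influxdb3"}
print(migration_plan(before, after))
```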
This technical scheme keeps the storage, reading and writing of monitoring data efficient, stable and load-balanced, makes it easy for a user to view dynamic cluster resource information and compile reports, reduces unnecessary waste of node resources, improves the security of use, and ensures the integrity, stability and usability of the functions of the artificial intelligence platform.
In a preferred embodiment of the present invention, starting a telegraf component on each node in the cluster, deploying a plurality of influxdb databases in the cluster, establishing a mapping relation between the nodes and the influxdb databases, and storing the mapping relation in the configuration of the telegraf component of each node includes:
installing telegraf components on all nodes needing to be monitored in a cluster, and starting the telegraf components after configuring monitoring acquisition items, acquisition time intervals and custom acquisition scripts of the telegraf components;
creating, in the expected number, containers in the cluster that run the influxdb database service, and configuring the corresponding IPs;
establishing a mapping relation between the monitored nodes managed by the telegraf components and the influxdb databases according to a consistent hash algorithm, and adding the mapping relation to the configuration of the telegraf components;
and collecting, by the telegraf component, node information and storing the collected information in the corresponding influxdb database based on the mapping relation.
The consistent hash algorithm works modulo 2^32: the 2^32 possible values are imagined as points on a circle, called the hash ring. The IP address of each server is hashed, the hash result is taken modulo 2^32, and this gives the position of each service or node on the ring. Once the ring is formed, the IP of each node on the ring is mapped to the IP of the nearest influxdb database in the clockwise direction. To avoid skew on the hash ring, virtual nodes can be introduced for the influxdb databases; the more virtual nodes there are, the more points lie on the hash ring and the more likely the nodes are to be evenly distributed. When an influxdb database node is added or removed, only the affected nodes move to their new clockwise-nearest influxdb database, and the mappings of all other nodes are unchanged, which keeps the change small. As shown in the hash ring of FIG. 2, nodes 1, 3, and 6 map to influxdb A, node 4 maps to influxdb B, and nodes 2 and 5 map to influxdb C.
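The ring described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the use of MD5 as the hash function, the `db#i` naming of virtual nodes, the default of 100 virtual nodes per database, and the class name `HashRing` are all assumptions introduced here.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32  # the modulus named in the text


def ring_position(key: str) -> int:
    """Hash a string and reduce it modulo 2^32 to get a position on the ring."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % RING_SIZE


class HashRing:
    """Maps monitored nodes to the clockwise-nearest influxdb database."""

    def __init__(self, databases, virtual_nodes=100):
        self._virtual_nodes = virtual_nodes
        self._points = []  # sorted ring positions of all virtual nodes
        self._owner = {}   # ring position -> database it belongs to
        for db in databases:
            self.add(db)

    def add(self, db: str) -> None:
        """Place the database's virtual nodes on the ring."""
        for i in range(self._virtual_nodes):
            pos = ring_position(f"{db}#{i}")
            if pos in self._owner:  # position already taken, skip it
                continue
            self._owner[pos] = db
            bisect.insort(self._points, pos)

    def remove(self, db: str) -> None:
        """Take all of the database's virtual nodes off the ring."""
        self._points = [p for p in self._points if self._owner[p] != db]
        self._owner = {p: d for p, d in self._owner.items() if d != db}

    def lookup(self, node_ip: str) -> str:
        """Return the database at the first ring point clockwise from the node."""
        if not self._points:
            raise ValueError("the ring has no databases")
        idx = bisect.bisect_left(self._points, ring_position(node_ip))
        return self._owner[self._points[idx % len(self._points)]]


ring = HashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
for ip in ("192.168.1.11", "192.168.1.12", "192.168.1.13"):
    print(ip, "->", ring.lookup(ip))
```

Adding a database only pulls nodes toward the new database, and removing it restores the earlier mapping, which is the "small change" property the text refers to.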
In a preferred embodiment of the present invention, training the LSTM model according to historical influxdb database response times comprises:
collecting the response time of an influxdb database in historical data;
and training an LSTM neural network with the collected response times as the training set, using the alarm detection period as the unit of time, to obtain a prediction model.
In a preferred embodiment of the present invention, acquiring the response time of each influxdb database and predicting its future response time based on the LSTM model comprises:
predicting, with the trained prediction model, the response time of the next alarm detection period from the response time of the current alarm detection period to obtain a predicted value;
feeding the predicted value and the real response time data obtained when the next alarm detection period arrives into Kalman filtering to obtain an optimal predicted value;
and passing the optimal predicted value as input to the LSTM model to predict the following alarm detection period, and obtaining more accurate response time prediction data through several iterations.
LSTM (Long Short-Term Memory network) is a recurrent neural network suited to processing and predicting important events separated by relatively long intervals and delays in a time series. Through its chain structure, an LSTM carries previously seen information forward, and gate structures (a forget gate, an input gate and an output gate) let information pass selectively so that it can be added or removed, which addresses the vanishing gradient and exploding gradient problems. Kalman filtering combines prediction and measurement: the prediction part comes from an empirical model, obtained by a person modeling the system, and the other part is measurement correction, which corrects the model. Put simply, it filters the prediction error without discarding the information in the measured process values, and the predicted value is continuously corrected by the measured value to obtain a dynamic optimal predicted value.
The interface response time of the influxdb database is collected, and an LSTM neural network is trained with the collected response times as the training set, using the alarm detection period as the unit of time, to obtain a prediction model. The trained model is then put into use: from the response time of the current alarm detection period it predicts the response time of the next period. With the alarm detection period as the time axis, the model's predicted value and the real response time obtained when the next period arrives are fed into Kalman filtering to obtain the optimal predicted value for that period (the Kalman gain can later be reused on other data sets). Kalman filtering is chosen because it weights the predicted value and the true value of the period, reducing the error and noise of the prediction result as far as possible, so that the predicted value is closer to the true value and serves better as the input to the LSTM network for the next period's prediction. Kalman filtering can be understood simply as: final value = p × observed value + (1 − p) × predicted value, where the observed value is the actual value obtained, the predicted value is the output of the LSTM model, and p is the Kalman gain. Because p is continuously adjusted, the final value computed from the observed value and the predicted value comes closer to reality. The resulting optimal predicted value is passed as input to the LSTM model to predict the next period, and by iteratively feeding in corrected predicted values and real monitored values, more accurate response time prediction data is obtained.
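The predict-then-correct loop can be sketched as follows in Python. This is a minimal scalar Kalman update under assumptions that are not in the text: the variance values `process_var` and `measurement_var` are hypothetical, and `predict_next` is a stand-in callable for the trained LSTM model, which is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class ScalarKalman:
    process_var: float = 1.0      # assumed uncertainty added by each prediction
    measurement_var: float = 4.0  # assumed noise in the measured response time
    error_var: float = 1.0        # variance of the current estimate

    def correct(self, predicted: float, observed: float):
        """Blend prediction and observation; return (final value, gain p)."""
        prior_var = self.error_var + self.process_var
        gain = prior_var / (prior_var + self.measurement_var)
        self.error_var = (1.0 - gain) * prior_var
        final = gain * observed + (1.0 - gain) * predicted
        return final, gain


def run_cycles(predict_next, first_input: float, observations):
    """Predict one period ahead, correct with the real value, feed it back."""
    kalman = ScalarKalman()
    model_input = first_input
    corrected = []
    for observed in observations:
        predicted = predict_next(model_input)  # stand-in for the LSTM model
        final, _gain = kalman.correct(predicted, observed)
        corrected.append(final)
        model_input = final  # the corrected value is the next period's input
    return corrected


# A persistence predictor (next = current) stands in for the trained model.
print(run_cycles(lambda x: x, 100.0, [110.0, 120.0, 130.0]))
```

The gain is recomputed at every period from the variances, which corresponds to the continuously adjusted parameter p in the simplified formula above.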
In a preferred embodiment of the present invention, comparing the predicted value with the alarm threshold value includes:
respectively comparing the predicted data of the response time of each influxdb database with a first alarm threshold;
and comparing the difference between the maximum response time and the minimum response time of each influxdb database with a second alarm threshold.
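The two comparisons can be sketched in Python as below. The text does not state whether the maximum and minimum are taken across the databases or within one database's history; this sketch assumes the former, treating the spread as a sign of load imbalance. The function name, the threshold values and the millisecond unit are hypothetical.

```python
def check_alarms(predicted_ms: dict, first_threshold_ms: float,
                 second_threshold_ms: float) -> dict:
    """Apply both threshold comparisons to the predicted response times.

    predicted_ms: {database name: predicted response time in milliseconds}
    """
    overloaded = sorted(db for db, value in predicted_ms.items()
                        if value > first_threshold_ms)
    spread = max(predicted_ms.values()) - min(predicted_ms.values())
    return {
        "overloaded": overloaded,                    # first threshold exceeded
        "spread_ms": spread,                         # max minus min across databases
        "imbalanced": spread > second_threshold_ms,  # second threshold exceeded
    }


# Hypothetical predictions for three influxdb instances.
predictions = {"influxdb A": 180.0, "influxdb B": 95.0, "influxdb C": 240.0}
print(check_alarms(predictions, first_threshold_ms=200.0, second_threshold_ms=100.0))
```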
This technical scheme keeps the storage, reading and writing of monitoring data efficient, stable and load-balanced, makes it easy for a user to view dynamic cluster resource information and compile reports, reduces unnecessary waste of node resources, improves the security of use, and ensures the integrity, stability and usability of the functions of the artificial intelligence platform.
It should be noted that, as will be understood by those skilled in the art, all or part of the processes in the methods of the above embodiments may be implemented by instructing relevant hardware through a computer program, and the above programs may be stored in a computer-readable storage medium, and when executed, the programs may include the processes of the embodiments of the methods as described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like. The embodiments of the computer program may achieve the same or similar effects as any of the above-described method embodiments.
Furthermore, the method disclosed according to an embodiment of the present invention may also be implemented as a computer program executed by a CPU, and the computer program may be stored in a computer-readable storage medium. The computer program, when executed by the CPU, performs the above-described functions defined in the method disclosed in the embodiments of the present invention.
In view of the above object, according to a second aspect of the embodiments of the present invention, there is provided an apparatus for monitoring a multi-node distributed cluster. As shown in FIG. 3, the apparatus 200 includes:
the creating module 201, configured to start a telegraf component on each node in the cluster, deploy a plurality of influxdb databases in the cluster, establish a mapping relation between the nodes and the influxdb databases, and store the mapping relation in the configuration of the telegraf component of each node;
the training module 202, configured to train on the historically recorded response times of the influxdb databases to obtain an LSTM model;
the prediction module 203, configured to acquire the response time of each influxdb database and predict its future response time based on the LSTM model;
the comparison module 204, configured to compare the predicted value with an alarm threshold;
and the alarm module 205, configured to prompt, in response to the predicted value being larger than the alarm threshold, that the influxdb database service needs capacity expansion.
In a preferred embodiment of the present invention, the creation module 201 is further configured to:
installing telegraf components on all nodes needing to be monitored in a cluster, and starting the telegraf components after configuring monitoring acquisition items, acquisition time intervals and custom acquisition scripts of the telegraf components;
creating containers for starting the inflixdb database service in the cluster according to the expected number, and configuring corresponding IP;
establishing a mapping relation between a monitoring node managed by the telgraf component and an inflixdb database according to a consistent hash algorithm, and adding the mapping relation into the configuration of the telgraf component;
and the telegraff component collects node information and stores the collected information into a corresponding influxdb database based on the mapping relation.
In a preferred embodiment of the present invention, the training module 202 is further configured to perform the following:
collecting the response times of the influxdb databases from historical data;
and training an LSTM neural network with the collected response times as the training set, in units of one alarm detection period, to obtain a prediction model.
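One common way to prepare such a history for sequence training is to cut it into sliding windows, each paired with the value that follows it. The sketch below shows only this data preparation step, not the LSTM itself; the window length, the function name `make_windows` and the sample values are assumptions, with each sample taken to represent one alarm detection period.

```python
def make_windows(series, window):
    """Split a series into (input window, next value) training pairs."""
    if window < 1 or len(series) <= window:
        raise ValueError("need more samples than the window length")
    inputs, targets = [], []
    for start in range(len(series) - window):
        inputs.append(series[start:start + window])   # `window` consecutive periods
        targets.append(series[start + window])        # the period right after them
    return inputs, targets


# Hypothetical response times in milliseconds, one per alarm detection period.
history_ms = [12.0, 13.5, 12.8, 15.1, 16.0, 18.2, 17.5]
X, y = make_windows(history_ms, window=3)
```

Each row of `X` would be fed to the network as one input sequence and the matching entry of `y` used as its training target.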
In view of the above object, a third aspect of the embodiments of the present invention provides a computer device. Fig. 4 is a schematic diagram of an embodiment of a computer device provided by the present invention. As shown in Fig. 4, the embodiment includes: at least one processor S21; and a memory S22 storing computer instructions S23 executable on the processor, the instructions, when executed by the processor, implementing the following method:
starting a telegraf component on each node in the cluster, deploying a plurality of influxdb databases in the cluster, establishing a mapping relationship between the nodes and the influxdb databases, and storing the mapping relationship in the configuration of the telegraf component of each node;
training on the historical response times of the influxdb databases to obtain an LSTM model;
acquiring the response time of each influxdb database and predicting the future response time based on the LSTM model;
comparing the predicted value with an alarm threshold;
and in response to the predicted value being greater than the alarm threshold, prompting that the influxdb database service needs capacity expansion.
In a preferred embodiment of the present invention, starting a telegraf component on each node in the cluster, deploying a plurality of influxdb databases in the cluster, establishing a mapping relationship between the nodes and the influxdb databases, and storing the mapping relationship in the configuration of the telegraf component of each node includes:
installing telegraf components on all nodes in the cluster that need to be monitored, and starting the telegraf components after configuring their monitoring acquisition items, acquisition time intervals and custom acquisition scripts;
creating the expected number of containers in the cluster to run the influxdb database service, and configuring a corresponding IP address for each container;
establishing a mapping relationship between the monitored nodes managed by the telegraf components and the influxdb databases according to a consistent hashing algorithm, and adding the mapping relationship to the configuration of the telegraf components;
and collecting node information with the telegraf components and storing the collected information in the corresponding influxdb database based on the mapping relationship.
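Storing the mapping in each node's telegraf configuration might look like the following sketch, which renders a configuration fragment per node. It assumes the InfluxDB 1.x output plugin section `[[outputs.influxdb]]` with its `urls` and `database` settings and the `[agent]` `interval` setting; the text does not say which plugin or version is used, and the function name, addresses, database name and interval are hypothetical.

```python
def render_telegraf_output(influx_url: str, database: str = "telegraf",
                           interval: str = "10s") -> str:
    """Render a telegraf config fragment that writes to one influxdb instance."""
    return (
        "[agent]\n"
        f'  interval = "{interval}"\n'      # acquisition time interval
        "\n"
        "[[outputs.influxdb]]\n"
        f'  urls = ["{influx_url}"]\n'      # database assigned by the mapping
        f'  database = "{database}"\n'
    )


# Hypothetical result of the node-to-database mapping step.
mapping = {
    "node-01": "http://10.0.0.11:8086",
    "node-02": "http://10.0.0.12:8086",
}
configs = {node: render_telegraf_output(url) for node, url in mapping.items()}
```

Each rendered fragment would then be merged into the configuration file of the telegraf component on the corresponding node.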
In a preferred embodiment of the present invention, training the LSTM model according to historical influxdb database response times comprises:
collecting the response time of an influxdb database in historical data;
and training an LSTM neural network with the collected response times as the training set, in units of one alarm detection period, to obtain a prediction model.
In a preferred embodiment of the present invention, acquiring the response time of each influxdb database and predicting the future response time based on the LSTM model includes:
using the trained prediction model to predict, from the response time of the current alarm detection period, the response time of the next alarm detection period, thereby obtaining a predicted value;
when the next alarm detection period arrives, feeding the predicted value and the actual measured response time into a Kalman filter to obtain an optimal predicted value;
and feeding the optimal predicted value into the LSTM model as an input value to predict the following alarm detection period, so that more accurate response time prediction data are obtained over several iterations.
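The correction step can be illustrated with the scalar form of the Kalman update, which blends a prediction and a measurement in proportion to their uncertainties. This is a minimal sketch: the text gives no filter parameters, so the variances, the function name `kalman_update` and the sample numbers are assumptions.

```python
def kalman_update(predicted, predicted_var, measured, measurement_var):
    """Fuse a prediction with a measurement; return (estimate, estimate variance)."""
    # The gain is the share of trust given to the measurement.
    gain = predicted_var / (predicted_var + measurement_var)
    estimate = predicted + gain * (measured - predicted)
    estimate_var = (1.0 - gain) * predicted_var
    return estimate, estimate_var


# Hypothetical values: model predicted 20 ms, 24 ms was actually observed,
# and both are assumed equally uncertain.
estimate, variance = kalman_update(
    predicted=20.0, predicted_var=4.0, measured=24.0, measurement_var=4.0
)
```

With equal variances the gain is 0.5, so the estimate lands midway between prediction and measurement; a noisier measurement would pull the estimate toward the model's prediction instead.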
In a preferred embodiment of the present invention, comparing the predicted value with the alarm threshold value includes:
respectively comparing the predicted data of the response time of each influxdb database with a first alarm threshold;
and comparing the difference between the maximum response time and the minimum response time of each influxdb database with a second alarm threshold.
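The two comparisons can be sketched as below. One point is an interpretation rather than something the text states: the maximum and minimum are taken here across the predicted values of all databases, so the second threshold acts as a load-imbalance check. The function name, database names and threshold values are hypothetical.

```python
def check_alarms(predictions, first_threshold, second_threshold):
    """Apply both alarm checks to a {database: predicted response time} dict."""
    # First check: any single database predicted to exceed the limit.
    over_limit = sorted(
        db for db, value in predictions.items() if value > first_threshold
    )
    # Second check (assumed reading): spread between slowest and fastest database.
    spread = max(predictions.values()) - min(predictions.values())
    return {
        "over_limit": over_limit,
        "spread": spread,
        "imbalanced": spread > second_threshold,
    }


# Hypothetical predicted response times in milliseconds.
predicted_ms = {"influxdb-a": 35.0, "influxdb-b": 80.0, "influxdb-c": 120.0}
result = check_alarms(predicted_ms, first_threshold=100.0, second_threshold=50.0)
```

In this example only `influxdb-c` exceeds the first threshold, and the 85 ms spread exceeds the second, so either condition could be used to raise the capacity prompt.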
In view of the above object, a fourth aspect of the embodiments of the present invention provides a computer-readable storage medium. Fig. 5 is a schematic diagram illustrating an embodiment of a computer-readable storage medium provided by the present invention. As shown in Fig. 5, the computer-readable storage medium S31 stores a computer program S32 that, when executed by a processor, performs the following method:
starting a telegraf component on each node in the cluster, deploying a plurality of influxdb databases in the cluster, establishing a mapping relationship between the nodes and the influxdb databases, and storing the mapping relationship in the configuration of the telegraf component of each node;
training on the historical response times of the influxdb databases to obtain an LSTM model;
acquiring the response time of each influxdb database and predicting the future response time based on the LSTM model;
comparing the predicted value with an alarm threshold;
and in response to the predicted value being greater than the alarm threshold, prompting that the influxdb database service needs capacity expansion.
In a preferred embodiment of the present invention, starting a telegraf component on each node in the cluster, deploying a plurality of influxdb databases in the cluster, establishing a mapping relationship between the nodes and the influxdb databases, and storing the mapping relationship in the configuration of the telegraf component of each node includes:
installing telegraf components on all nodes in the cluster that need to be monitored, and starting the telegraf components after configuring their monitoring acquisition items, acquisition time intervals and custom acquisition scripts;
creating the expected number of containers in the cluster to run the influxdb database service, and configuring a corresponding IP address for each container;
establishing a mapping relationship between the monitored nodes managed by the telegraf components and the influxdb databases according to a consistent hashing algorithm, and adding the mapping relationship to the configuration of the telegraf components;
and collecting node information with the telegraf components and storing the collected information in the corresponding influxdb database based on the mapping relationship.
In a preferred embodiment of the present invention, training the LSTM model according to historical influxdb database response times comprises:
collecting the response time of an influxdb database in historical data;
and training an LSTM neural network with the collected response times as the training set, in units of one alarm detection period, to obtain a prediction model.
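For reference, the widely used formulation of an LSTM cell is given below. The text does not say which LSTM variant or output layer is used, so this is the standard textbook form with an assumed linear readout; here $x_t$ denotes the response time observed in alarm detection period $t$, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the logistic sigmoid and $\odot$ element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) && \text{(input gate)} \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) && \text{(output gate)} \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh\left(c_t\right) \\
\hat{x}_{t+1} &= w^{\top} h_t + b && \text{(assumed linear readout)}
\end{aligned}
```

Training adjusts the weights $W$, $U$, $w$ and biases $b$ so that $\hat{x}_{t+1}$ approximates the response time actually observed in period $t+1$.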
In a preferred embodiment of the present invention, acquiring the response time of each influxdb database and predicting the future response time based on the LSTM model includes:
using the trained prediction model to predict, from the response time of the current alarm detection period, the response time of the next alarm detection period, thereby obtaining a predicted value;
when the next alarm detection period arrives, feeding the predicted value and the actual measured response time into a Kalman filter to obtain an optimal predicted value;
and feeding the optimal predicted value into the LSTM model as an input value to predict the following alarm detection period, so that more accurate response time prediction data are obtained over several iterations.
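The iteration described in these steps can be sketched as a loop that alternates prediction and correction. Everything concrete here is an assumption for illustration: `naive_model` is only a stand-in for the trained LSTM, and the noise variances, function names and sample measurements are hypothetical.

```python
def run_cycles(predict, measurements, first_input,
               process_var=1.0, measurement_var=1.0):
    """Alternate model prediction and scalar Kalman correction, one step per period."""
    current, variance = first_input, 1.0
    corrected = []
    for measured in measurements:
        predicted = predict(current)           # forecast for the next period
        prior_var = variance + process_var     # uncertainty grows while predicting
        gain = prior_var / (prior_var + measurement_var)
        # Corrected ("optimal") value, which becomes the next model input.
        current = predicted + gain * (measured - predicted)
        variance = (1.0 - gain) * prior_var
        corrected.append(current)
    return corrected


def naive_model(value):
    """Stand-in predictor (not an LSTM): assume the next period repeats the last."""
    return value


# Hypothetical run: the input starts at 4 ms while 10 ms is observed each period.
values = run_cycles(naive_model, [10.0, 10.0, 10.0], first_input=4.0)
```

Because each corrected value is fed back as the next input, the estimates move steadily toward the observed level even with a poor predictor, which illustrates the refinement over several iterations.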
In a preferred embodiment of the present invention, comparing the predicted value with the alarm threshold value includes:
respectively comparing the predicted data of the response time of each influxdb database with a first alarm threshold;
and comparing the difference between the maximum response time and the minimum response time of each influxdb database with a second alarm threshold.
Furthermore, the methods disclosed according to embodiments of the present invention may also be implemented as a computer program executed by a processor, and the computer program may be stored in a computer-readable storage medium. When executed by the processor, the computer program performs the functions defined in the methods disclosed in the embodiments of the invention.
Further, the above method steps and system elements may also be implemented using a controller and a computer readable storage medium for storing a computer program for causing the controller to implement the functions of the above steps or elements.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments of the present invention.
In one or more exemplary designs, the functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is exemplary only and is not intended to imply that the scope of the disclosure of the embodiments of the invention, including the claims, is limited to these examples. Within the idea of the embodiments of the invention, technical features in the above embodiment or in different embodiments may also be combined, and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims (10)

1. A method of monitoring a multi-node distributed cluster, comprising the steps of:
starting a telegraf component on each node in the cluster, deploying a plurality of influxdb databases in the cluster, establishing a mapping relationship between the nodes and the influxdb databases, and storing the mapping relationship in the configuration of the telegraf component of each node;
training on the historical response times of the influxdb databases to obtain an LSTM model;
acquiring the response time of each influxdb database and predicting the future response time based on the LSTM model;
comparing the predicted value with an alarm threshold;
and in response to the predicted value being greater than the alarm threshold, prompting that the influxdb database service needs capacity expansion and contraction.
2. The method of claim 1, wherein starting a telegraf component on each node in the cluster, deploying a plurality of influxdb databases in the cluster, establishing a mapping relationship between the nodes and the influxdb databases, and storing the mapping relationship in the configuration of the telegraf component of each node comprises:
installing telegraf components on all nodes in the cluster that need to be monitored, and starting the telegraf components after configuring their monitoring acquisition items, acquisition time intervals and custom acquisition scripts;
creating the expected number of containers in the cluster to run the influxdb database service, and configuring a corresponding IP address for each container;
establishing a mapping relationship between the monitored nodes managed by the telegraf components and the influxdb databases according to a consistent hashing algorithm, and adding the mapping relationship to the configuration of the telegraf components;
and collecting node information with the telegraf components and storing the collected information in the corresponding influxdb database based on the mapping relationship.
3. The method of claim 1, wherein training the LSTM model based on historical influxdb database response times comprises:
collecting the response time of an influxdb database in historical data;
and training an LSTM neural network with the collected response times as the training set, in units of one alarm detection period, to obtain a prediction model.
4. The method of claim 3, wherein acquiring the response time of each influxdb database and predicting the future response time based on the LSTM model comprises:
using the trained prediction model to predict, from the response time of the current alarm detection period, the response time of the next alarm detection period, thereby obtaining a predicted value;
when the next alarm detection period arrives, feeding the predicted value and the actual measured response time into a Kalman filter to obtain an optimal predicted value;
and feeding the optimal predicted value into the LSTM model as an input value to predict the following alarm detection period, so that more accurate response time prediction data are obtained over several iterations.
5. The method of claim 1, wherein comparing the predicted value to the alarm threshold comprises:
respectively comparing the predicted data of the response time of each influxdb database with a first alarm threshold;
and comparing the difference between the maximum response time and the minimum response time of each influxdb database with a second alarm threshold.
6. An apparatus for monitoring a multi-node distributed cluster, the apparatus comprising:
a creating module configured to start a telegraf component on each node in the cluster, deploy a plurality of influxdb databases in the cluster, establish a mapping relationship between the nodes and the influxdb databases, and store the mapping relationship in the configuration of the telegraf component of each node;
a training module configured to train on the historical response times of the influxdb databases to obtain an LSTM model;
a prediction module configured to obtain the response time of each influxdb database and predict the future response time based on the LSTM model;
a comparison module configured to compare the predicted value with an alarm threshold;
and an alarm module configured to prompt, in response to the predicted value being greater than the alarm threshold, that the influxdb database service needs capacity expansion and contraction.
7. The apparatus of claim 6, wherein the creating module is further configured to perform the following:
installing telegraf components on all nodes needing to be monitored in a cluster, and starting the telegraf components after configuring monitoring acquisition items, acquisition time intervals and custom acquisition scripts of the telegraf components;
creating the expected number of containers in the cluster to run the influxdb database service, and configuring a corresponding IP address for each container;
establishing a mapping relationship between the monitored nodes managed by the telegraf components and the influxdb databases according to a consistent hashing algorithm, and adding the mapping relationship to the configuration of the telegraf components;
and collecting node information with the telegraf components and storing the collected information in the corresponding influxdb database based on the mapping relationship.
8. The apparatus of claim 6, wherein the training module is further configured to:
collecting the response time of an influxdb database in historical data;
and training an LSTM neural network with the collected response times as the training set, in units of one alarm detection period, to obtain a prediction model.
9. A computer device, comprising:
at least one processor; and
a memory storing computer instructions executable on the processor, the instructions when executed by the processor implementing the steps of the method of any one of claims 1 to 5.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 5.
CN202110999556.6A 2021-08-29 2021-08-29 Method, device and equipment for monitoring multi-node distributed cluster and readable medium Withdrawn CN113886173A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110999556.6A CN113886173A (en) 2021-08-29 2021-08-29 Method, device and equipment for monitoring multi-node distributed cluster and readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110999556.6A CN113886173A (en) 2021-08-29 2021-08-29 Method, device and equipment for monitoring multi-node distributed cluster and readable medium

Publications (1)

Publication Number Publication Date
CN113886173A true CN113886173A (en) 2022-01-04

Family

ID=79011373

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110999556.6A Withdrawn CN113886173A (en) 2021-08-29 2021-08-29 Method, device and equipment for monitoring multi-node distributed cluster and readable medium

Country Status (1)

Country Link
CN (1) CN113886173A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114546765A (en) * 2022-02-11 2022-05-27 苏州浪潮智能科技有限公司 Cluster monitoring method, system, device and medium
CN115022218A (en) * 2022-05-27 2022-09-06 中电信数智科技有限公司 Distributed Netconf protocol subscription alarm threshold setting method
CN115022218B (en) * 2022-05-27 2024-01-19 中电信数智科技有限公司 Distributed Netconf protocol subscription alarm threshold setting method
CN115062053A (en) * 2022-06-10 2022-09-16 苏州浪潮智能科技有限公司 Method and system for processing data based on artificial intelligence platform
CN118394607A (en) * 2024-06-27 2024-07-26 之江实验室 Method and device for alarming temperature of computing cluster, storage medium and electronic equipment
CN118394607B (en) * 2024-06-27 2024-09-03 之江实验室 Method and device for alarming temperature of computing cluster, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN113886173A (en) Method, device and equipment for monitoring multi-node distributed cluster and readable medium
CN108848184B (en) Block link point synchronization method and device based on trust mechanism
US10262019B1 (en) Distributed management optimization for IoT deployments
JP2022514508A (en) Machine learning model commentary Possibility-based adjustment
CN117909083B (en) Distributed cloud container resource scheduling method and system
CN114091610A (en) Intelligent decision method and device
EP3877861A1 (en) Managing computation load in a fog network
CN114546765A (en) Cluster monitoring method, system, device and medium
CN116047934B (en) Real-time simulation method and system for unmanned aerial vehicle cluster and electronic equipment
CN114048186A (en) Data migration method and system based on mass data
CN118193200A (en) Memory elastic expansion method, device, equipment, medium and product
CN111935005A (en) Data transmission method, device, processing equipment and medium
CN116826961A (en) Intelligent power grid dispatching and operation and maintenance system, method and storage medium
CN110879774B (en) Network element performance data alarming method and device
CN115034507A (en) Power load prediction method of charging pile and related components
CN115277249A (en) Network security situation perception method based on cooperation of multi-layer heterogeneous network
CN118445707B (en) Model early warning method and system for multiparty joint online modeling platform
CN113170592B (en) Thermal control optimization based on monitoring/control mechanism
CN113572633A (en) Root cause positioning method, system, equipment and storage medium
EP4035084A1 (en) Techniques for alerting metric baseline behavior change
US20240305453A1 (en) Machine learning to detect misclassified on-chain addresses
Alagar et al. Understanding and measuring risk due to uncertainties in IoT
CN118445707A (en) Model early warning method and system for multiparty joint online modeling platform
CN117041264B (en) Block chain resource management system and method based on data processing
CN114185896B (en) Data processing method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20220104)