CN113886173A

CN113886173A - Method, device and equipment for monitoring multi-node distributed cluster and readable medium

Info

Publication number: CN113886173A
Application number: CN202110999556.6A
Authority: CN
Inventors: 张书博
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2021-08-29
Filing date: 2021-08-29
Publication date: 2022-01-04

Abstract

The invention provides a method, a device, equipment and a readable medium for monitoring a multi-node distributed cluster, wherein the method comprises the following steps: starting a telegraf component on each node in the cluster, deploying a plurality of influxdb databases in the cluster, establishing a mapping relation between the nodes and the influxdb databases, and storing the mapping relation into the configuration of the telegraf component of each node; training an LSTM model on the historically recorded response times of the influxdb databases; acquiring the response time of each influxdb database and predicting its future response time based on the LSTM model; comparing the predicted value with an alarm threshold value; and, in response to the predicted value being larger than the alarm threshold value, prompting that the influxdb database service needs capacity expansion or reduction. By using the scheme of the invention, unnecessary waste of node resources can be reduced, usage safety is improved, and the integrity, stability and usability of the functions of the artificial intelligence platform can be ensured.

Description

Method, device and equipment for monitoring multi-node distributed cluster and readable medium

Technical Field

The present invention relates to the field of computers, and more particularly, to a method, an apparatus, a device, and a readable medium for monitoring of a multi-node distributed cluster.

Background

Private cloud platforms (such as the AIStation training platform) use technologies such as k8s to organize a plurality of physical nodes into a cluster, and, for cluster stability and full utilization of cluster resources, the cluster is generally built as a distributed high-availability cluster. When a multi-node distributed high-availability cluster is used for deep learning model training and inference, monitoring the computing power of the underlying cluster resources, the node states and the like becomes particularly important, for example, monitoring whether the nodes in the cluster are healthy, whether the GPU (graphics processing unit) accelerator cards and video memory are fully utilized, and whether the temperature of the GPU accelerator cards is normal. When these indexes are within the normal range, the data can be used for debugging a model or generating a report; when the indexes are abnormal, the platform alarm system can be triggered to avoid unnecessary loss. A monitoring module is therefore necessary in a private cloud platform.

In the past, cluster monitoring of the platform could be implemented with the TIGK component combination (a cloud environment monitoring solution combining four components, namely Telegraf, InfluxDB, Grafana, and Kapacitor), which implements the resource monitoring steps of acquisition, storage, presentation, and alarm. Acquisition and storage are the two most critical steps: they are implemented by the Telegraf component (a plugin-driven server agent, installed on all hosts in the cloud, for collecting and reporting indexes) and by InfluxDB (a time series database built from the ground up to handle high write and query loads). When the number of nodes in the cluster is small, only one influxdb instance needs to be started to provide the data storage service. All monitored nodes start the telegraf service to monitor the underlying resources; the monitoring item data of each node are collected according to a configuration file and stored in influxdb at the collection interval; finally, according to the platform functions, the platform service queries the influxdb database, integrates the data, and returns it for display. However, when the number of nodes in the cluster is large, this structure may lead to increased throughput, response timeouts, and even crashes.

Disclosure of Invention

In view of this, embodiments of the present invention provide a method, an apparatus, a device, and a readable medium for monitoring a multi-node distributed cluster, which can ensure efficient stability and load balance of storage and read/write of monitoring data, facilitate a user to visually check dynamic information of cluster resources and sort reports, reduce unnecessary waste of node resources, improve usage safety, and ensure integrity, stability, and availability of functions of an artificial intelligence platform.

In view of the above, an aspect of the embodiments of the present invention provides a method for monitoring a multi-node distributed cluster, including the following steps:

starting a telegraf component on each node in the cluster, deploying a plurality of influxdb databases in the cluster, establishing a mapping relation between the nodes and the influxdb databases, and storing the mapping relation into the configuration of the telegraf component of each node;

training according to the historically recorded response time of the influxdb database to obtain an LSTM model;

acquiring the response time of each influxdb database and predicting the future response time based on an LSTM model;

comparing the predicted value obtained by prediction with an alarm threshold value;

and, in response to the predicted value being larger than the alarm threshold value, prompting that the influxdb database service needs capacity expansion or reduction.

According to an embodiment of the present invention, starting a telegraf component on each node in a cluster, deploying a plurality of influxdb databases in the cluster, establishing a mapping relationship between the nodes and the influxdb databases, and storing the mapping relationship into the configuration of the telegraf component of each node includes:

installing telegraf components on all nodes needing to be monitored in a cluster, and starting the telegraf components after configuring monitoring acquisition items, acquisition time intervals and custom acquisition scripts of the telegraf components;

creating containers for starting the influxdb database service in the cluster according to the expected number, and configuring corresponding IP;

establishing a mapping relation between a monitoring node managed by the telegraf component and an influxdb database according to a consistent hash algorithm, and adding the mapping relation into the configuration of the telegraf component;

and the telegraf component collects node information and stores the collected information into a corresponding influxdb database based on the mapping relation.

According to an embodiment of the present invention, training the LSTM model according to the historical influxdb database response time includes:

collecting the response time of an influxdb database in historical data;

and using the LSTM neural network to take the collected response time as a training set and train according to a unit alarm detection period to obtain a prediction model.
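The text does not say how raw response-time samples are turned into one value per alarm detection period. As a minimal sketch, assuming that samples are simply averaged within fixed-length periods (the function name, sample layout, and 60-second period are hypothetical and not taken from the source), the aggregation could look like this:

```python
from statistics import mean


def aggregate_by_period(samples, period_seconds):
    """Group (timestamp_seconds, response_ms) samples into fixed-length
    periods and return the mean response time of each period in time order.

    Assumption: one averaged value per alarm detection period; the source
    does not specify the aggregation.
    """
    buckets = {}
    for timestamp, response_ms in samples:
        bucket = int(timestamp // period_seconds)
        buckets.setdefault(bucket, []).append(response_ms)
    return [mean(buckets[b]) for b in sorted(buckets)]


# Illustrative samples only, not measured data.
samples = [(0, 10.0), (30, 14.0), (60, 20.0), (90, 22.0), (125, 30.0)]
series = aggregate_by_period(samples, period_seconds=60)
print(series)
```

The resulting per-period series is the kind of sequence a time-series model would then be trained on.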

According to one embodiment of the invention, obtaining the response time of each influxdb database and predicting the future response time based on the LSTM model comprises:

predicting the response time of the next alarm detection period to obtain a predicted value by using a prediction model obtained by training according to the response time of the current alarm detection period;

substituting the predicted value and real response time data obtained when the next alarm detection period is reached into Kalman filtering to obtain an optimal predicted value;

and transmitting the optimal predicted value as an input value into an LSTM model, predicting the next alarm detection period, and obtaining more accurate response time prediction data through a plurality of iterations.

According to one embodiment of the invention, comparing the predicted value with the alarm threshold comprises:

respectively comparing the predicted data of the response time of each influxdb database with a first alarm threshold;

and comparing the difference between the maximum response time and the minimum response time of each influxdb database with a second alarm threshold.
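The two comparisons above can be sketched as follows. This is an illustration under assumptions: the second threshold is read here as a limit on the spread between the slowest and fastest database (a load-imbalance check), which is one possible reading of the text, and the function name, database names, and threshold values are hypothetical.

```python
def check_alarms(predictions, first_threshold, second_threshold):
    """Evaluate both alarm conditions.

    predictions: dict mapping database name -> predicted response time (ms).
    first_threshold: per-database limit on predicted response time.
    second_threshold: limit on (max - min) predicted response time across
        databases. Assumption: the spread is taken across databases.
    """
    overloaded = sorted(
        db for db, value in predictions.items() if value > first_threshold
    )
    spread = max(predictions.values()) - min(predictions.values())
    imbalanced = spread > second_threshold
    return {
        "overloaded": overloaded,
        "spread": spread,
        "imbalanced": imbalanced,
        "needs_scaling": bool(overloaded) or imbalanced,
    }


# Illustrative values only.
result = check_alarms(
    {"influxdb-a": 120.0, "influxdb-b": 480.0, "influxdb-c": 150.0},
    first_threshold=400.0,
    second_threshold=300.0,
)
print(result)
```

Keeping the two conditions separate in the result lets a caller distinguish an overloaded database from an uneven distribution of nodes.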

In another aspect of the embodiments of the present invention, there is also provided an apparatus for monitoring a multi-node distributed cluster, including:

the creating module is configured to start a telegraf component on each node in the cluster, deploy a plurality of influxdb databases in the cluster, establish a mapping relation between the nodes and the influxdb databases, and store the mapping relation into the configuration of the telegraf component of each node;

the training module is configured to train according to the response time of the influxdb database of the historical record to obtain an LSTM model;

the prediction module is configured to acquire the response time of each influxdb database and predict the future response time based on the LSTM model;

the comparison module is configured to compare a predicted value obtained by prediction with an alarm threshold value;

and the alarm module is configured to, in response to the predicted value being larger than the alarm threshold value, prompt that the influxdb database service needs capacity expansion or reduction.

According to one embodiment of the invention, the creation module is further configured to:

According to one embodiment of the invention, the training module is further configured to:

collecting the response time of an influxdb database in historical data;

In another aspect of an embodiment of the present invention, there is also provided a computer apparatus including:

at least one processor; and

a memory storing computer instructions executable on the processor, the instructions when executed by the processor implementing the steps of the method set forth in the above claims.

In another aspect of the embodiments of the present invention, a computer-readable storage medium is also provided, in which a computer program is stored, and the computer program is executed by a processor to implement the steps of the method set forth in the above technical solution.

The invention has the following beneficial technical effects: in the method for monitoring a multi-node distributed cluster provided by the embodiment of the invention, a telegraf component is started on each node in the cluster, a plurality of influxdb databases are deployed in the cluster, a mapping relation between the nodes and the influxdb databases is established, and the mapping relation is stored in the configuration of the telegraf component of each node; an LSTM model is trained on the historically recorded response times of the influxdb databases; the response time of each influxdb database is acquired and the future response time is predicted based on the LSTM model; the predicted value is compared with an alarm threshold value; and, in response to the predicted value being larger than the alarm threshold value, a prompt is issued that the influxdb database service needs capacity expansion or reduction. This technical scheme ensures efficient, stable, and load-balanced storage and read/write of monitoring data, makes it convenient for users to view the dynamic information of cluster resources and compile reports, reduces unnecessary waste of node resources, improves usage safety, and ensures the integrity, stability, and usability of the functions of the artificial intelligence platform.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other embodiments can be obtained by using the drawings without creative efforts.

FIG. 1 is a schematic flow chart diagram of a method of monitoring of a multi-node distributed cluster in accordance with one embodiment of the present invention;

FIG. 2 is a diagram illustrating the mapping relationship between nodes and influxdb databases, according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of an apparatus for monitoring of a multi-node distributed cluster according to one embodiment of the present invention;

FIG. 4 is a schematic diagram of a computer device according to one embodiment of the present invention;

FIG. 5 is a schematic diagram of a computer-readable storage medium according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.

In view of the above, a first aspect of the embodiments of the present invention proposes an embodiment of a method for monitoring a multi-node distributed cluster. Fig. 1 shows a schematic flow diagram of the method.

As shown in fig. 1, the method may include the steps of:

S1, starting a telegraf component on each node in the cluster, deploying a plurality of influxdb databases in the cluster, establishing a mapping relation between the nodes and the influxdb databases, and storing the mapping relation in the configuration of the telegraf component of each node.

Telegraf components are installed on all nodes to be monitored in the cluster, and the telegraf service is started after configuration items such as the monitoring acquisition items, acquisition time interval, and custom acquisition scripts have been configured. Containers for running the influxdb database service are created in the cluster according to the expected number, and the corresponding IPs are configured. The mapping relation between the monitored nodes managed by the telegraf components and the influxdb databases that store the monitoring data is initialized according to a consistent hash algorithm and added to the configuration of the telegraf components; for example, the data of node1 and node2 are stored in influxdb1, and the data of node3 are stored in influxdb2. The platform service is then started and queries the monitoring data in the corresponding influxdb database according to business logic.
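One way to picture "adding the mapping relation into the configuration of the telegraf component" is to render, for each node, the output section of its Telegraf configuration so that it points at the assigned database. The sketch below is an assumption about how this could be done, not the patent's implementation: the IP addresses, the database name `monitoring`, and the helper name are hypothetical, while `[[outputs.influxdb]]` with `urls` and `database` is the standard Telegraf output section for InfluxDB 1.x.

```python
def render_output_section(influx_ip, database="monitoring", port=8086):
    """Render the Telegraf output section that sends a node's metrics to
    the influxdb instance assigned to it (hypothetical helper)."""
    return (
        "[[outputs.influxdb]]\n"
        f'  urls = ["http://{influx_ip}:{port}"]\n'
        f'  database = "{database}"\n'
    )


# Mapping in the spirit of the example above: node1 and node2 share one
# database, node3 uses another. The addresses are placeholders.
mapping = {"node1": "10.0.0.11", "node2": "10.0.0.11", "node3": "10.0.0.12"}
configs = {node: render_output_section(ip) for node, ip in mapping.items()}
print(configs["node3"])
```

Each rendered section would be written into that node's Telegraf configuration file before the service is started or reloaded.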

S2, training according to the response time of the influxdb database of the historical record to obtain an LSTM model.

The response times of the influxdb databases in the historical data are collected, and an LSTM neural network is trained with the collected response times as the training set, per unit alarm detection period, to obtain an LSTM prediction model.
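Before a sequence model can be trained, the per-period response times are usually cut into (input window, next value) pairs. The sketch below shows that common preparation step; the window length of 3 and the function name are assumptions, since the source does not state how the training set is laid out.

```python
def make_windows(series, window):
    """Turn a list of per-period response times into supervised samples:
    each input is `window` consecutive values, each target is the value
    of the following period."""
    inputs, targets = [], []
    for start in range(len(series) - window):
        inputs.append(series[start:start + window])
        targets.append(series[start + window])
    return inputs, targets


# Illustrative per-period response times in milliseconds.
series = [12.0, 21.0, 30.0, 28.0, 35.0]
X, y = make_windows(series, window=3)
print(X)
print(y)
```

An LSTM implementation from any deep learning library could then be fitted on `X` and `y`; the choice of library is left open here.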

S3, acquiring the response time of each influxdb database and predicting the future response time based on the LSTM model.

Using the trained prediction model, the response time of the next alarm detection period is predicted from the response time of the current alarm detection period to obtain a predicted value. The predicted value and the real response time obtained when the next alarm detection period arrives are substituted into Kalman filtering to obtain an optimal predicted value. The optimal predicted value is passed as an input value into the LSTM model to predict the following alarm detection period, and more accurate response time prediction data are obtained through several iterations.

S4, comparing the predicted value with the alarm threshold value.

S5, in response to the predicted value being larger than the alarm threshold value, prompting that the influxdb database service needs capacity expansion or reduction.

The platform initiates a capacity expansion or reduction request. On expansion, the newly added nodes are added to the mapping of the corresponding influxdb database according to the consistent hash algorithm; on reduction, the nodes are removed directly. The mapping relation between the nodes in the cluster and the influxdb databases is then updated, and the monitoring data of the nodes whose mapping changed are migrated.
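The migration step can be illustrated by comparing the mapping before and after a scaling operation and listing only the nodes whose database changed. This is a sketch under assumptions: the function name and the example mappings are hypothetical, and how the data are actually copied between databases is not covered by the source.

```python
def migration_plan(old_mapping, new_mapping):
    """Return {node: (old_db, new_db)} for every node whose assigned
    database differs between the two mappings. Nodes that are new in
    `new_mapping` have no stored data yet and are therefore skipped."""
    moves = {}
    for node, new_db in new_mapping.items():
        old_db = old_mapping.get(node)
        if old_db is not None and old_db != new_db:
            moves[node] = (old_db, new_db)
    return moves


# Hypothetical mappings before and after adding influxdb3.
old = {"node1": "influxdb1", "node2": "influxdb1", "node3": "influxdb2"}
new = {"node1": "influxdb1", "node2": "influxdb3", "node3": "influxdb2"}
plan = migration_plan(old, new)
print(plan)
```

Only the entries in the plan require data to be moved, which is what keeps a scaling operation cheap when the mapping changes little.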

By the technical scheme, the high-efficiency stability and load balance of the storage and reading and writing of the monitoring data can be ensured, a user can conveniently and visually check the dynamic information of the cluster resources and arrange the report, the unnecessary waste of the node resources can be reduced, the use safety is improved, and the integrity, the stability and the usability of the functions of the artificial intelligent platform can be ensured.

In a preferred embodiment of the present invention, starting a telegraf component on each node in a cluster, deploying a plurality of influxdb databases in the cluster, establishing a mapping relationship between the nodes and the influxdb databases, and storing the mapping relationship in the configuration of the telegraf component of each node includes:

The consistent hash algorithm works modulo 2^32: the 2^32 points are imagined as a ring, called a hash ring. A hash is computed over the IP address of each server, the result is taken modulo 2^32, and this gives the position of each service or node on the ring. Once the ring is formed, the IP of each node on the ring is mapped to the IP of the influxdb database closest to it in the clockwise direction. To avoid skew of the hash ring, virtual nodes of the influxdb databases can be introduced: the more virtual nodes there are, the more points lie on the hash ring, and the greater the probability that the nodes are evenly distributed. When an influxdb database node is added or removed, only the affected nodes are transferred to the new clockwise-nearest influxdb database; the mappings of the other nodes are unchanged, which keeps the change small. As shown in the hash ring of FIG. 2, nodes 1, 3, and 6 map to influxdb A, node 4 maps to influxdb B, and nodes 2 and 5 map to influxdb C.
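The ring described above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: SHA-256 is chosen here as the hash function, and the class name, the `name#i` virtual-node labels, the count of 100 virtual nodes, and the IP addresses are all assumptions.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32  # positions on the hash ring, as described in the text


def ring_position(key):
    """Hash a string key to a position in [0, 2^32)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % RING_SIZE


class HashRing:
    def __init__(self, databases, virtual_nodes=100):
        # Each database is placed on the ring many times (virtual nodes)
        # to reduce skew.
        self._points = sorted(
            (ring_position(f"{db}#{i}"), db)
            for db in databases
            for i in range(virtual_nodes)
        )
        self._positions = [position for position, _ in self._points]

    def lookup(self, node_ip):
        """Return the database at the first ring point clockwise from the
        node's position, wrapping around at the end of the ring."""
        index = bisect.bisect_left(self._positions, ring_position(node_ip))
        return self._points[index % len(self._points)][1]


nodes = [f"10.0.0.{i}" for i in range(1, 7)]
ring = HashRing(["influxdb-a", "influxdb-b", "influxdb-c"])
before = {node: ring.lookup(node) for node in nodes}

# Adding a database only pulls nodes onto the new database.
bigger = HashRing(["influxdb-a", "influxdb-b", "influxdb-c", "influxdb-d"])
after = {node: bigger.lookup(node) for node in nodes}
moved = [node for node in nodes if before[node] != after[node]]
print(before)
print(moved)
```

Because existing databases keep their ring points when a new one is added, every node that changes assignment moves to the new database and all other assignments stay as they were.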

In a preferred embodiment of the present invention, training the LSTM model according to historical influxdb database response times comprises:

collecting the response time of an influxdb database in historical data;

In a preferred embodiment of the present invention, obtaining the response time of each influxdb database and predicting the future response time based on the LSTM model comprises:

LSTM (Long Short-Term Memory network) is a recurrent neural network suitable for processing and predicting important events separated by relatively long intervals and delays in a time series. Through its chain structure, an LSTM keeps previously transmitted information flowing forward, and lets information pass selectively through gate structures, namely a forget gate, an input gate, and an output gate, so that information is added or removed and the vanishing-gradient and exploding-gradient problems are alleviated. Kalman filtering combines prediction and measurement: the prediction comes from an empirical model, obtained by a person modeling the system, and the other part is the measurement correction, which corrects the model. Put simply, it filters the prediction error while retaining the information in the process measurements, and continuously corrects the predicted value with the measured value to obtain a dynamically optimal predicted value.
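To make the role of the three gates concrete, the sketch below runs one step of a single-unit LSTM cell using the standard textbook equations. It is purely illustrative: the weights are arbitrary placeholders rather than trained values, the inputs are assumed to be scaled response times, and a real model would use a deep learning library with vector-valued states.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, weights):
    """One step of a single-unit LSTM cell.

    weights maps a gate name to (input weight, recurrent weight, bias).
    Returns the new hidden state h and cell state c.
    """
    def gate(name, activation):
        w_x, w_h, bias = weights[name]
        return activation(w_x * x + w_h * h_prev + bias)

    forget = gate("forget", sigmoid)        # share of old cell state kept
    admit = gate("input", sigmoid)          # share of new content admitted
    candidate = gate("candidate", math.tanh)  # proposed new content
    expose = gate("output", sigmoid)        # share of cell state exposed

    c = forget * c_prev + admit * candidate
    h = expose * math.tanh(c)
    return h, c


# Placeholder weights, not trained values.
weights = {
    name: (0.5, 0.5, 0.0)
    for name in ("forget", "input", "candidate", "output")
}
h, c = 0.0, 0.0
for x in [0.12, 0.21, 0.30]:  # scaled response times, illustrative only
    h, c = lstm_step(x, h, c, weights)
print(h, c)
```

The cell state `c` is the long-lived memory carried along the chain, and the forget and input gates decide at each step how much of it is kept and how much is overwritten.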

The response times of the influxdb database interface are collected, and an LSTM neural network is trained with the collected response times as the training set, per unit alarm detection period, to obtain a prediction model. The trained model is then put into use: from the response time of the current alarm detection period, it predicts the response time of the next period to obtain a predicted value. With the alarm detection period as the time axis, the model's predicted value and the real response time obtained when the next period arrives are substituted into Kalman filtering to obtain the optimal predicted value for that period (the Kalman gain can later be carried over to other data sets for continued use). Kalman filtering is chosen to weight the predicted value and the true value of the period, so that the error and noise of the prediction result are reduced as far as possible, the predicted value is closer to the true value, and it serves better as an input parameter to the LSTM network for the prediction of the next period. Kalman filtering can be understood simply as: final value = p × observed value + (1 − p) × predicted value, where the observed value is the actual value obtained, the predicted value is the prediction of the LSTM model, and p is the Kalman gain, a parameter that can be continuously adjusted so that the final value is closer to reality given the observed and predicted values. The optimal predicted value obtained is passed as input into the LSTM model to predict the next period, and through iteration the corrected predicted values and real monitored values are continuously fed in to obtain more accurate response time prediction data.
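The correction loop described above can be sketched with a scalar Kalman-style update. This is a simplified illustration under assumptions: the gain update follows the usual one-dimensional Kalman filter, the noise variances are made-up values, and `predict_next` is a stand-in for the trained LSTM model (here it simply repeats its input), none of which is specified in the source.

```python
class ScalarKalman:
    """One-dimensional Kalman-style fusion of a prediction and an observation."""

    def __init__(self, process_var, measurement_var, initial_error=1.0):
        self.process_var = process_var          # uncertainty added per step
        self.measurement_var = measurement_var  # noise of the observation
        self.error = initial_error              # current estimate uncertainty

    def fuse(self, predicted, observed):
        prior_error = self.error + self.process_var
        gain = prior_error / (prior_error + self.measurement_var)
        # final value = p * observed value + (1 - p) * predicted value
        fused = gain * observed + (1.0 - gain) * predicted
        self.error = (1.0 - gain) * prior_error
        return fused, gain


def run_correction_loop(observations, predict_next, kalman):
    """Predict each period, fuse with the real value once it arrives, and
    feed the fused value back in as the input for the next prediction."""
    fused_values = []
    model_input = observations[0]
    for observed in observations[1:]:
        predicted = predict_next(model_input)
        fused, _ = kalman.fuse(predicted, observed)
        fused_values.append(fused)
        model_input = fused
    return fused_values


# Illustrative response times (ms) and a stand-in predictor.
observations = [100.0, 110.0, 130.0, 125.0]
kalman = ScalarKalman(process_var=0.1, measurement_var=1.0)
fused_values = run_correction_loop(observations, lambda last: last, kalman)
print(fused_values)
```

Each fused value lies between the model's prediction and the measurement, and the gain shifts that balance as the filter's own uncertainty changes from period to period.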

In a preferred embodiment of the present invention, comparing the predicted value with the alarm threshold value includes:

It should be noted that, as will be understood by those skilled in the art, all or part of the processes in the methods of the above embodiments may be implemented by instructing relevant hardware through a computer program, and the above programs may be stored in a computer-readable storage medium, and when executed, the programs may include the processes of the embodiments of the methods as described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like. The embodiments of the computer program may achieve the same or similar effects as any of the above-described method embodiments.

Furthermore, the method disclosed according to an embodiment of the present invention may also be implemented as a computer program executed by a CPU, and the computer program may be stored in a computer-readable storage medium. The computer program, when executed by the CPU, performs the above-described functions defined in the method disclosed in the embodiments of the present invention.

In view of the above object, according to a second aspect of the embodiments of the present invention, there is provided an apparatus for monitoring a multi-node distributed cluster. As shown in FIG. 3, the apparatus 200 includes:

the creating module 201, the creating module 201 is configured to start a telegraf component on each node in the cluster, deploy a plurality of influxdb databases in the cluster, establish a mapping relationship between the nodes and the influxdb databases, and store the mapping relationship into the configuration of the telegraf component of each node;

the training module 202, the training module 202 is configured to train according to the response time of the influxdb database of the history record to obtain an LSTM model;

the prediction module 203, the prediction module 203 is configured to obtain the response time of each influxdb database and predict the future response time based on the LSTM model;

a comparison module 204, wherein the comparison module 204 is configured to compare the predicted value obtained by prediction with an alarm threshold;

and the alarm module 205, wherein the alarm module 205 is configured to, in response to the predicted value being greater than the alarm threshold, prompt that the influxdb database service needs capacity expansion or reduction.

In a preferred embodiment of the present invention, the creation module 201 is further configured to:

In a preferred embodiment of the present invention, training module 202 is further configured to:

collecting the response time of an influxdb database in historical data;

In view of the above object, a third aspect of the embodiments of the present invention provides a computer device. Fig. 4 is a schematic diagram of an embodiment of a computer device provided by the present invention. As shown in fig. 4, an embodiment of the present invention includes the following means: at least one processor S21; and a memory S22, the memory S22 storing computer instructions S23 executable on the processor, the instructions when executed by the processor implementing the method of:

acquiring the response time of each influxdb database and predicting the corresponding future time based on an LSTM model;

collecting the response time of an influxdb database in historical data;

In view of the above object, a fourth aspect of the embodiments of the present invention proposes a computer-readable storage medium. FIG. 5 is a schematic diagram illustrating an embodiment of a computer-readable storage medium provided by the present invention. As shown in fig. 5, the computer readable storage medium S31 stores a computer program S32 that when executed by a processor performs the method of:

collecting the response time of an influxdb database in historical data;

Furthermore, the methods disclosed according to embodiments of the present invention may also be implemented as a computer program executed by a processor, which may be stored in a computer-readable storage medium. Which when executed by a processor performs the above-described functions defined in the methods disclosed in embodiments of the invention.

Further, the above method steps and system elements may also be implemented using a controller and a computer readable storage medium for storing a computer program for causing the controller to implement the functions of the above steps or elements.

Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments of the present invention.

In one or more exemplary designs, the functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, includes Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.

It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.

The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, of embodiments of the invention is limited to these examples; within the idea of an embodiment of the invention, also technical features in the above embodiment or in different embodiments may be combined and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims

1. A method of monitoring a multi-node distributed cluster, comprising the steps of:

acquiring the response time of each influxdb database and predicting the future response time based on the LSTM model;

and, in response to the predicted value being larger than the alarm threshold value, prompting that the influxdb database service needs capacity expansion or reduction.

2. The method of claim 1, wherein starting a telegraf component on each node in the cluster, deploying a plurality of influxdb databases in the cluster, establishing a mapping relationship between the nodes and the influxdb databases, and storing the mapping relationship into the configuration of the telegraf component of each node comprises:

3. The method of claim 1, wherein training the LSTM model based on historical influxdb database response times comprises:

collecting the response time of an influxdb database in historical data;

4. The method of claim 3, wherein obtaining the response time of each influxdb database and predicting the future response time based on the LSTM model comprises:

5. The method of claim 1, wherein comparing the predicted value to the alarm threshold comprises:

6. An apparatus for monitoring of a multi-node distributed cluster, the apparatus comprising:

a prediction module configured to obtain the response time of each influxdb database and predict the future response time based on the LSTM model;

a comparison module configured to compare a predicted value obtained by the prediction with an alarm threshold;

and the alarm module is configured to, in response to the predicted value being greater than the alarm threshold value, prompt that the influxdb database service needs capacity expansion or reduction.

7. The apparatus of claim 6, wherein the creation module is further configured to:

8. The apparatus of claim 6, wherein the training module is further configured to:

collecting the response time of an influxdb database in historical data;

9. A computer device, comprising:

at least one processor; and

a memory storing computer instructions executable on the processor, the instructions when executed by the processor implementing the steps of the method of any one of claims 1 to 5.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 5.