CN113381996B

CN113381996B - C&C communication attack detection method based on machine learning

Info

Publication number: CN113381996B
Application number: CN202110637965.1A
Authority: CN
Inventors: 黄丽荣; 陈耿生; 蔡悦贞; 戴宏鹏; 黄嘉诚
Original assignee: China Telecom Fufu Information Technology Co Ltd
Current assignee: China Telecom Fufu Information Technology Co Ltd
Priority date: 2021-06-08
Filing date: 2021-06-08
Publication date: 2023-04-28
Anticipated expiration: 2041-06-08
Also published as: CN113381996A

Abstract

The invention discloses a machine learning-based C&C communication attack detection method comprising the following steps: obtaining continuous downlink traffic packets and filtering them so that the distribution of packet lengths is normal; performing session aggregation on the traffic packets according to specified conditions; extracting session traffic features using random cluster sampling and the Apriori algorithm; and computing the similarity of the aggregated traffic context data using sequence similarity detection that combines the edit distance with the longest common subsequence (LCS). The invention can detect previously undiscovered malware communication without relying on a feature library, and when a large number of attack traffic samples are examined, the detection has low time complexity and takes less time.

Description

C&C communication attack detection method based on machine learning

Technical Field

The invention relates to the technical field of communication security, and in particular to a C&C communication attack detection method based on machine learning.

Background

At present, C&C communication detection takes three main forms: statistical feature detection based on traffic packets, signature detection based on traffic payloads, and supervised machine learning detection based on known malware.

The prior art has several shortcomings in detecting C&C communication attacks. First, existing methods are weak at detecting unpublished or undiscovered malware. Second, their detection effect depends heavily on a feature library, which is often not comprehensive. Finally, as the network scenarios of normal users diversify, the traffic attributes of normal users can easily resemble those of malicious traffic. For example, when traffic is judged by packet size and inter-arrival time, the communication of some existing chat software is likely to show characteristics similar to those of malware. The conventional methods are therefore limited in detection accuracy and detection effect for C&C communication.

Each of the three approaches has its own weakness. Statistical feature detection based on traffic packets has a high false alarm rate, because malware communication varies with network congestion and because the growing number of normal network application scenarios makes the statistical features of normal and malicious traffic easy to confuse. Signature detection based on the traffic payload works well against known malware, but fails when a malware variant changes its signature. Supervised machine learning detection learns mainly from the traffic characteristics of known malware, so its detection effect depends on the coverage of the training set and the soundness of the learning method.

Disclosure of Invention

The invention aims to provide a C&C communication attack detection method based on machine learning.

The technical scheme adopted by the invention is as follows:

the C&C communication attack detection method based on machine learning comprises the following steps:

step 1, traffic packet filtering: obtaining continuous downlink traffic packets and filtering them so that the distribution of packet lengths is normal;

step 2, traffic session aggregation: performing session aggregation on the traffic packets according to specified conditions;

step 3, extracting session traffic features using random cluster sampling and the Apriori algorithm;

step 4, computing the similarity of the aggregated traffic context data using sequence similarity detection that combines the edit distance with the longest common subsequence (LCS);

step 5, judging whether the C&C communication is abnormal according to whether the context similarity of the downlink traffic of the session exceeds a set value.
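Step 3 names the Apriori algorithm, but the text does not say what the items or transactions are. The following is a minimal sketch of Apriori frequent itemset mining under the assumption that each session has already been reduced to a set of feature tokens; the token strings, the function name `apriori`, and the `min_support` parameter are hypothetical and not taken from the source.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset whose support >= min_support.

    Assumption: each transaction is the set of feature tokens observed in one
    session. The source does not define the items, so this is illustrative only.
    """
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    if n == 0:
        return {}

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        single = frozenset([item])
        s = support(single)
        if s >= min_support:
            current[single] = s

    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
    return frequent
```

With four sessions described by tokens such as packet length bucket, port and inter-arrival gap, calling `apriori(sessions, 0.5)` returns only the token combinations present in at least half of the sessions.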

Further, as a preferred embodiment, step 1 sets a filtering threshold according to the normal distribution of the traffic packet lengths and filters out part of the irrelevant traffic.

Further, as a preferred embodiment, step 1 calculates a critical packet length for small traffic packets from a set packet filtering rate, and determines the final filtering packet length by combining the normal distribution estimate with the set threshold.

Further, as a preferred embodiment, in step 2, session aggregation is performed according to the source address, the source port, the destination address or the destination port.
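The aggregation in step 2 can be sketched as a grouping of packets by the chosen header fields. This is a minimal illustration, not the patented implementation: the `Packet` record, its field names and the `key_fields` parameter are assumptions made here, and the key is configurable because the text allows aggregation by any of the four fields.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    # Hypothetical packet record; the field names are assumptions.
    src_addr: str
    src_port: int
    dst_addr: str
    dst_port: int
    payload: bytes


def aggregate_sessions(packets,
                       key_fields=("src_addr", "src_port", "dst_addr", "dst_port")):
    """Group packet payloads into sessions keyed by the selected header fields.

    Payloads keep their arrival order within each session, so the later
    similarity step can compare consecutive downlink payloads.
    """
    sessions = defaultdict(list)
    for pkt in packets:
        key = tuple(getattr(pkt, field) for field in key_fields)
        sessions[key].append(pkt.payload)
    return dict(sessions)
```

Passing `key_fields=("dst_addr",)` would instead merge all traffic towards one host into a single session, which is one way to read the "or" in the aggregation condition.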

Further, as a preferred embodiment, when the amount of data to be processed in step 3 is too large, probability sampling is performed using a reservoir sampling algorithm.

Further, as a preferred embodiment, in step 4, the edit distance is first calculated for each sequence pair, sequence pairs with larger distance values are filtered out according to the result, and the LCS is then calculated for the remaining sequence pairs.

With the above technical scheme, the invention filters, samples and aggregates network traffic, and then detects the context similarity of the aggregated session traffic data to determine whether malware communication is present. The invention has the following advantages: 1. Undiscovered malware traffic can be detected without relying on a feature library. 2. The invention differs from existing supervised machine learning methods for malware, which learn mainly from the traffic characteristics of known malware and whose detection effect therefore depends heavily on the coverage of the training set and the soundness of the learning method. 3. In C&C communication detection, the similarity detection algorithm based on the downlink payload achieves higher accuracy and recall than traffic packet detection and payload signature detection, and also has an advantage in detection time: especially when a large number of attack traffic samples are examined, its time complexity is lower and its detection time is shorter.

Drawings

The invention is described in further detail below with reference to the drawing and the detailed description.

Fig. 1 is a flow chart of the machine learning-based C&C communication attack detection method of the invention.

Detailed Description

For the purposes, technical solutions and advantages of the embodiments of the present application, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.

As shown in Fig. 1, the invention discloses a C&C communication attack detection method based on machine learning, which comprises the following steps:

Step 1, traffic packet filtering: continuous downlink traffic packets are acquired. Traffic volumes in existing network environments keep growing, while the downlink traffic packets of malware are mostly small. To avoid wasting resources on meaningless analysis and detection of irrelevant traffic, the traffic packets are filtered so that the distribution of packet lengths is normal.

Further, as a preferred embodiment, step 1 sets a filtering threshold according to the normal distribution of the traffic packet lengths and filters out part of the irrelevant traffic. Specifically, a packet filtering rate is set to calculate the critical packet length for small traffic packets, and the final filtering packet length is determined by combining the normal distribution estimate with the set threshold.
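One possible reading of this filtering rule is sketched below. The text does not give a formula, so the following are assumptions made for illustration: a normal distribution is fitted to the observed packet lengths, the critical length is the quantile that discards the fraction `filter_rate` of the longest packets, and the "set threshold" is an optional fixed cap `max_length` combined by taking the smaller of the two values. All names are hypothetical.

```python
from statistics import NormalDist, fmean, pstdev


def length_cutoff(lengths, filter_rate, max_length=None):
    """Estimate the packet length above which packets are discarded.

    Assumptions (not stated in the source): lengths are modelled as normal,
    filter_rate is the fraction of the longest packets to drop, and
    max_length is an optional fixed threshold that caps the estimate.
    """
    if not 0.0 < filter_rate < 1.0:
        raise ValueError("filter_rate must be strictly between 0 and 1")
    mu = fmean(lengths)
    sigma = pstdev(lengths)
    if sigma == 0:
        cutoff = mu  # all packets have the same length
    else:
        # Quantile of the fitted normal that leaves filter_rate in the upper tail.
        cutoff = NormalDist(mu, sigma).inv_cdf(1.0 - filter_rate)
    if max_length is not None:
        cutoff = min(cutoff, max_length)
    return cutoff


def filter_packets(payloads, filter_rate, max_length=None):
    """Keep only payloads whose length does not exceed the estimated cutoff."""
    cutoff = length_cutoff([len(p) for p in payloads], filter_rate, max_length)
    return [p for p in payloads if len(p) <= cutoff]
```

Under these assumptions, a single large download among many short beacons is removed by the quantile rule alone, while the fixed cap gives an operator a hard upper bound regardless of the fitted distribution.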

Further, as a preferred embodiment, the sampling in step 3 refers to extracting, by means of a sampling algorithm, a sample that can represent the population. The invention predicts the features of the population from the features of the extracted sample in order to detect content similarity in the payloads of continuous downlink traffic. Because the same characteristics may appear consecutively during an actual attack, a random cluster sampling algorithm is adopted; if the amount of data to be processed is too large, a reservoir sampling algorithm can be used for probability sampling.
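Reservoir sampling draws a uniform sample of fixed size from a stream whose length is not known in advance, which is why it suits very large traffic volumes. A minimal sketch of the classic single-pass variant (Algorithm R) follows; the function name and the optional `rng` parameter are choices made here, and the sampled items could be whole sessions if cluster sampling is wanted.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from an iterable of unknown length.

    Uses one pass and O(k) memory. If the stream holds fewer than k items,
    all of them are returned.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

Passing a seeded `random.Random` makes the sample reproducible, which helps when comparing detection results across runs.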

Specifically, the sequence similarity detection of downlink traffic packets mainly combines two algorithms: solving the longest common subsequence (LCS) of two sequences and calculating their edit distance. The LCS approach measures the similarity of two sequences by the length of their longest common subsequence, which is typically found with a dynamic programming algorithm. The edit distance, also known as the Levenshtein distance, is the minimum number of edits required to convert one string into another, where an edit replaces one character with another, inserts a character, or deletes a character.
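For reference, the two measures are usually computed with the standard dynamic programming recurrences below. The notation is introduced here and does not appear in the source: a = a_1 ... a_m and b = b_1 ... b_n are the two sequences, L(i,j) is the LCS length of their first i and j elements, and D(i,j) is the edit distance between those prefixes.

```latex
\[
L(i,j) =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0,\\
L(i-1,\,j-1) + 1 & \text{if } a_i = b_j,\\
\max\bigl(L(i-1,\,j),\; L(i,\,j-1)\bigr) & \text{otherwise,}
\end{cases}
\]
\[
D(i,j) =
\begin{cases}
\max(i,\,j) & \text{if } \min(i,\,j) = 0,\\
\min\bigl(D(i-1,\,j) + 1,\; D(i,\,j-1) + 1,\; D(i-1,\,j-1) + [a_i \neq b_j]\bigr) & \text{otherwise,}
\end{cases}
\]
% [a_i \neq b_j] is 1 when the two elements differ and 0 when they are equal.
% The final answers are L(m,n) and D(m,n).
```

As a worked example, for the strings "kitten" and "sitting" the edit distance is D(6,7) = 3 (two substitutions and one insertion) and the LCS length is L(6,7) = 4 (the subsequence "ittn").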

Because the edit distance has low computational time complexity, it is used first to remove irrelevant sequence pairs; because the LCS gives a more accurate similarity measure, it is then applied to the remaining pairs, which makes the detection result more credible.
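The two-stage comparison and the final threshold decision can be sketched as follows. Several details are assumptions made here because the source leaves them open: pairs are formed from consecutive payloads of one session, the edit distance is normalized by the longer payload length, a filtered pair contributes a score of 0, and the context similarity of a session is the mean LCS ratio over its pairs. The names `context_similarity`, `is_anomalous`, `max_norm_distance` and `threshold` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def context_similarity(payloads, max_norm_distance=0.5):
    """Mean LCS ratio over consecutive payload pairs of one session.

    Pairs whose normalized edit distance exceeds max_norm_distance are
    filtered before the LCS stage and scored 0 (an assumption of this sketch).
    """
    scores = []
    for a, b in zip(payloads, payloads[1:]):
        longest = max(len(a), len(b))
        if longest == 0:
            continue  # two empty payloads carry no information
        if edit_distance(a, b) / longest > max_norm_distance:
            scores.append(0.0)
        else:
            scores.append(lcs_length(a, b) / longest)
    return sum(scores) / len(scores) if scores else 0.0


def is_anomalous(payloads, threshold=0.8):
    """Step 5: flag the session when its context similarity exceeds the set value."""
    return context_similarity(payloads) > threshold
```

A session of near-identical beacon payloads scores close to 1 and is flagged, while a session of unrelated payloads scores 0. Both helper functions here are plain O(mn) dynamic programs; the source does not say which edit distance variant it relies on for the lower cost.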

It will be apparent that the embodiments described are some, but not all, of the embodiments of the present application. Embodiments and features of embodiments in this application may be combined with each other without conflict. The components of the embodiments of the present application, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the detailed description of the embodiments of the present application is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.

Claims

1. A C&C communication attack detection method based on machine learning, characterized by comprising the following steps:

step 3, extracting session flow characteristics by using random cluster sampling and Apriori algorithm;

step 4, calculating the similarity of the aggregated traffic context data using sequence similarity detection that combines the edit distance with the longest common subsequence;

2. The machine learning-based C&C communication attack detection method of claim 1, wherein: in step 1, a filtering threshold is set according to the normal distribution of the traffic packet lengths, and part of the irrelevant traffic is filtered out; a critical packet length for small traffic packets is calculated from a set packet filtering rate, and the final filtering packet length is determined by combining the normal distribution estimate with the set threshold.

3. The machine learning-based C&C communication attack detection method of claim 1, wherein: in step 2, session aggregation is performed according to the source address, the source port, the destination address or the destination port.

4. The machine learning-based C&C communication attack detection method of claim 1, wherein: when the amount of data to be processed in step 3 is too large, probability sampling is performed using a reservoir sampling algorithm.