CN116192448A

CN116192448A - Malicious sample data packet analysis method and device and electronic equipment

Info

Publication number: CN116192448A
Application number: CN202211649321.5A
Authority: CN
Inventors: 毕光耀; 康学斌; 肖新光
Original assignee: Antiy Technology Group Co Ltd
Current assignee: Antiy Technology Group Co Ltd
Priority date: 2022-12-21
Filing date: 2022-12-21
Publication date: 2023-05-30

Abstract

The embodiments of the invention disclose a malicious sample data packet analysis method, a malicious sample data packet analysis device, and electronic equipment, and relate to the technical field of network security. The method comprises the following steps: acquiring a plurality of malicious sample data packets; screening the plurality of malicious sample data packets to obtain a plurality of target malicious sample data packets; classifying each target malicious sample data packet according to malicious sample feature information; for each class of target malicious sample data packets, performing message grouping on the target malicious sample data packets to obtain at least a first group of target malicious sample data packets, wherein the first group comprises at least two similar target malicious sample data packets; and extracting the common features in each group of malicious sample data packets. The method makes it easier to extract the common features of malicious samples that use unknown protocols, and can thereby improve the ability to identify and discover malicious code threats that use unknown protocols.

Description

Malicious sample data packet analysis method and device and electronic equipment

Technical Field

The present invention relates to the field of network security technologies, and in particular, to a method and an apparatus for analyzing a malicious sample data packet, and an electronic device.

Background

With the development of network information technology, services and applications built around networks and data are growing explosively. The richer the application scenarios become, the more network security risks and problems are exposed, with wide and profound effects. The complexity, variability, and vulnerability of information systems in a network environment mean that network security threats persist objectively.

Some organizations are at risk of intrusion through attack means such as social engineering and exploitation of public-facing network services. Because such organizations are generally of a certain scale, malicious code that spreads laterally inside them can infect a large area, so that internal core data and confidential data may be stolen and leaked. If the threat posed by malicious code (also referred to as a malicious sample) that uses a potentially unknown protocol cannot be discovered, and its features identified, in time, significant losses may result.

Disclosure of Invention

In view of the above, embodiments of the present invention provide a method, an apparatus, and an electronic device for analyzing a malicious sample packet, which are convenient for extracting public features in a malicious sample of an unknown protocol, so as to improve the ability of identifying and discovering a malicious code threat of the unknown protocol.

In a first aspect, a malicious sample data packet analysis method provided by an embodiment of the present invention includes the steps of: acquiring a plurality of malicious sample data packets; screening the plurality of malicious sample data packets to obtain a plurality of target malicious sample data packets; classifying each target malicious sample data packet according to the malicious sample characteristic information; aiming at each type of target malicious sample data packet, carrying out message grouping on each target malicious sample data packet to at least obtain a first group of target malicious sample data packets, wherein the first group of target malicious sample data packets at least comprise two similar target malicious sample data packets; common features in each set of malicious sample data packets are extracted.

Optionally, the common features include: invariant field domains, variable field domains, and/or constant domains.

Optionally, the malicious sample feature information includes: a virus family to which the malicious sample belongs;

classifying each target malicious sample data packet according to the malicious sample characteristic information, including: dividing each target malicious sample data packet into a plurality of corresponding categories according to virus families to which malicious samples belong, wherein each category comprises a plurality of target malicious sample data;

Before grouping each target malicious sample data packet, the method further includes: aiming at each type of target malicious sample data packet, carrying out message classification on each target malicious sample data packet, and determining the message type of the target malicious sample data packet;

the step of classifying the message of each target malicious sample data packet and determining the message type of the target malicious sample data packet includes: aiming at each virus family category, acquiring a plurality of malicious sample online data packets contained in the category; the target malicious sample data packet includes: malicious sample online data packets; extracting byte types of each malicious sample online data packet; the byte type includes: character type and/or binary type; judging whether the byte number of the character type exceeds a preset threshold value or not; if yes, determining the malicious sample online data packet as a text message; if not, determining the malicious sample online data packet as a binary class message.

Optionally, the classifying the message of each target malicious sample data packet to determine the message type of the target malicious sample data packet further includes: aiming at each virus family category, acquiring a plurality of malicious sample instruction data packets contained in the category; the target malicious sample data packet includes: malicious sample instruction packets; extracting byte types of each malicious sample instruction data packet; the byte type includes: character type and/or binary type; judging whether the byte number of the character type exceeds a preset threshold value or not; if yes, determining the malicious sample instruction data packet as a text message; if not, determining the malicious sample instruction data packet as a binary class message.

Optionally, the malicious sample feature information further includes: the communication data packet length and the communication data packet transmission direction; the step of performing message grouping on each target malicious sample data packet to obtain at least a first group of target malicious sample data packets comprises: after determining the message type of the target malicious sample data packet, obtaining a preliminary grouping of the target malicious sample data packets;

for each target malicious sample data packet, determining the target malicious sample data packet as a first target malicious sample data packet;

based on the length of a communication data packet and the transmission direction of the communication data packet, comparing a first target malicious sample data packet with target malicious sample data packets of different message types through a local sequence comparison algorithm, and determining whether a message classification matched with the first target malicious sample data packet exists or not; if so, adding the first target malicious sample data packet into a preliminary packet corresponding to the matched message classification to form a first message packet; if not, a new message packet is created, and the first target malicious sample data packet is added to the new message packet to form a second message packet.

Optionally, for a binary class message, when determining whether there is a message classification matching the first target malicious sample data packet, the constraints on whether the first target malicious sample data packet can be added to that message classification include: length restriction: the length difference between the two messages does not exceed a specified threshold; content restriction: the edit distance between the two messages does not exceed a specified threshold; format restriction: two messages of the same type have the same number and order of text fields, binary fields, and Unicode fields.

Optionally, for a text class message, when determining whether there is a message classification matching the first target malicious sample data packet, the constraints on whether the first target malicious sample data packet can be added to that message classification include: fragment restriction: each message is divided into a sequence of fragments using the line break as the separator, and the difference in the number of fragments between the two messages does not exceed a specified threshold; length restriction: the length difference between corresponding fragments of the two messages does not exceed a specified threshold; content restriction: the edit distance between corresponding fragments of two messages of the same type does not exceed a specified threshold; string restriction: for a preset fragment, the characters are divided into strings using special symbols as boundaries, and the number of common strings shared by the corresponding fragments of two messages of the same type satisfies a preset threshold condition.

Optionally, the grouping the packets of each target malicious sample data packet to obtain at least a first group of target malicious sample data packets further includes: when two target malicious sample data meet the limitation conditions of binary class message classification or text type message classification, judging that the two target malicious sample data have similar fragments; dividing the two target malicious sample data into the same message group;

the extracting the common features in each set of malicious sample data packets comprises: extracting similar fragments of the set of two target malicious sample data, and determining the similar fragments as public features; or,

when more than two target malicious sample data meet the limiting conditions of binary class message classification or text type message classification, judging that the more than two target malicious sample data have similar fragments; and extracting similar fragments in the more than two target malicious sample data, and determining fragments with smaller editing distance between the similar fragments as common features.

In a second aspect, a malicious sample data packet analysis device provided by an embodiment of the present invention includes: an acquisition program module for acquiring a plurality of malicious sample data packets; the screening program module is used for screening the plurality of malicious sample data packets to obtain a plurality of target malicious sample data packets; the primary classification program module is used for classifying each target malicious sample data packet according to the malicious sample characteristic information; the grouping program module is used for grouping each type of target malicious sample data packet according to the message, at least obtaining a first group of target malicious sample data packets, wherein the first group of target malicious sample data packets at least comprise two similar target malicious sample data packets; and the feature extraction program module is used for extracting the public features in each group of malicious sample data packets.

In a third aspect, an electronic device provided by an embodiment of the present invention includes: the device comprises a shell, a processor, a memory, a circuit board and a power circuit, wherein the circuit board is arranged in a space surrounded by the shell, and the processor and the memory are arranged on the circuit board; a power supply circuit for supplying power to each circuit or device of the electronic apparatus; the memory is used for storing executable program codes; the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, for performing any one of the malicious sample data packet analysis methods of the first aspect.

According to the malicious sample data packet analysis method, the malicious sample data packet analysis device and the electronic equipment, the collected large number of malicious samples of unknown network protocols are subjected to reverse message analysis operation, the malicious samples of various virus families are subjected to message grouping, and public features in each group of malicious sample data packets are extracted from the malicious samples, so that public features in malicious samples of unknown protocols are conveniently extracted, further malicious attributes of malicious samples of a certain virus family are identified, feature libraries of the malicious samples of various virus families can be enriched through the extracted public features, and therefore the capability of identifying malicious code threats of the unknown protocols can be improved.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flowchart illustrating a malicious sample packet analysis method according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating a malicious sample packet analysis method according to another embodiment of the present invention;

FIG. 3 is a schematic partial sequence alignment based on the Smith-Waterman algorithm of the present invention;

FIG. 4 is a block diagram illustrating an exemplary embodiment of a network asset fingerprint feature identification apparatus according to the present invention;

FIG. 5 is a schematic structural diagram of an embodiment of the electronic device of the present invention.

Detailed Description

Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

It should be understood that the described embodiments are merely some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In the malicious sample data packet analysis method provided by this embodiment, a large number of threat samples (referred to in most of this document as malicious samples) are collected and preliminarily classified by virus family, and are then further divided by message grouping. An unknown network protocol reverse analysis technique is used to extract the message structure and common features from the communication data packets of the threat samples, so that the threat sample feature library can be enriched and the ability to identify the various malicious code programs using unknown protocols that lurk in network data traffic can be effectively improved.

Example 1

FIG. 1 is a flowchart of an embodiment of the malicious sample data packet analysis method of the present invention. Referring to FIG. 1, the method can be applied to scenarios such as completing a malicious sample feature library and detecting and discovering network threats, where it is used to analyze malicious sample data packets of unknown network protocols so as to extract the common features that characterize their malicious attributes.

It should be noted that the method may be solidified in a certain manufactured physical product in the form of software, and when a user needs to analyze a malicious sample data packet, the method flow of the present application may be triggered and reproduced.

Referring to fig. 1, the malicious sample data packet analysis method may include the steps of:

s110, acquiring a plurality of malicious sample data packets.

In this embodiment, Pcap data packets may be collected with a Pcap packet capture tool such as Wireshark, or obtained from the many active threat sample sites and malicious sample libraries, for example VirusShare.

Pcap is a packet capture library that many programs, such as the aforementioned Wireshark, use to capture packets. A Pcap file generally stores the captured packets in its own data format, which differs from the original data stream format.

S120, screening the plurality of malicious sample data packets to obtain a plurality of target malicious sample data packets.

After a large number of Pcap data packets are obtained, the Pcap data packets are processed, for example, data cleaning and screening are performed, a large number of online data packets, heartbeat data packets, instruction data packets and the like are extracted from the Pcap data packets, and the online data packets, the heartbeat data packets and the instruction data packets can be used as the target malicious sample data packets according to requirements.

It should be appreciated that the malicious samples in a malicious sample library are in many cases static, so most of the malicious sample data packets obtained from such a library contain no heartbeat data packets.

S130, classifying each target malicious sample data packet according to the malicious sample characteristic information.

The malicious sample feature information generally includes the virus family to which the malicious sample belongs, the communication data packet length, the communication data packet transmission direction, and the like. After the target malicious sample data packets are obtained, all of them are given a preliminary, simple classification according to virus family. The communication data packet length, the communication data packet transmission direction, and the like can serve as the basis for the subsequent message grouping.
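As a non-limiting illustration, the preliminary classification by virus family in step S130 can be sketched as a simple grouping by label. The record layout (a dictionary with `family` and `payload` keys), the family names, and the payloads below are hypothetical and are not taken from the embodiment.

```python
from collections import defaultdict


def classify_by_family(packets):
    """Group packet records by the virus family label attached to each one."""
    categories = defaultdict(list)
    for packet in packets:
        categories[packet["family"]].append(packet)
    return dict(categories)


# Hypothetical records: family names and payloads are made up for illustration.
samples = [
    {"family": "family_a", "payload": b"LOGIN user1\r\n"},
    {"family": "family_b", "payload": b"\x01\x02\x00\x10"},
    {"family": "family_a", "payload": b"LOGIN user2\r\n"},
]
categories = classify_by_family(samples)
```

Each resulting category can then be handed to message classification and message grouping independently of the others.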

S140, aiming at each type of target malicious sample data packet, grouping the target malicious sample data packets to at least obtain a first group of target malicious sample data packets, wherein the first group of target malicious sample data packets at least comprise two similar target malicious sample data packets.

And S150, extracting common features in each group of malicious sample data packets.

It should be understood that the common feature may be used to characterize common malicious attribute features of a specific type of malicious sample, and when a malicious sample of an unknown network protocol is encountered, by extracting the malicious attribute features carried by the malicious sample, the malicious sample of the unknown network protocol may be quickly detected and identified, so as to improve the detection capability of the latent malicious sample.

In order to identify malicious samples of an unknown network protocol, characteristics or malicious attributes of similar samples or identical samples must be obtained, so in this embodiment, by grouping each target malicious sample data packet into a unified packet, and then performing reverse message parsing operation, public characteristics of the target malicious sample data packet are extracted from the unified packet, and the public characteristics are taken as malicious attribute characteristics of a malicious sample of a specific type, and further a rich malicious attribute characteristic library is constructed, so that detection of the malicious sample of a specific unknown network protocol can be achieved, and thus detection and identification capabilities of various latent malicious code programs (i.e., malicious samples) of unknown protocols in network traffic are improved.

It should be understood that detecting and identifying malicious samples presupposes reverse analysis of the network traffic data of a large number of malicious samples, in order to mine the malicious attributes common to a given type of malicious sample. Classifying the network traffic data by virus family and then grouping it during this reverse analysis is therefore an effective means of analyzing the malicious attribute features of malicious samples that use unknown network protocols.

Fig. 2 is a flowchart illustrating a malicious sample packet analysis method according to another embodiment of the present application. In some embodiments, the malicious sample feature information comprises: a virus family to which the malicious sample belongs; wherein each virus of the virus families has specific transmission characteristics, although viruses of different virus families may have the same transmission characteristics. For example, in one study it was found that: most samples of a botnet virus family have the behavior characteristics of starting a new process, and the new process has abnormal behaviors, including setting a delay starting thread, and trying to remotely connect with another control end for communication.

The classifying each target malicious sample data packet according to the malicious sample feature information (step S130) includes: dividing each target malicious sample data packet into a plurality of corresponding categories according to virus families to which malicious samples belong, wherein each category comprises a plurality of target malicious sample data;

In step S140, before message grouping is performed on each target malicious sample data packet, the method further includes: S135, for each class of target malicious sample data packets, performing message classification on each target malicious sample data packet (in the communications field a data packet is generally also called a message), and determining the message type of the target malicious sample data packet.

Message classification (also referred to as preliminary grouping) is a very important step, which is the basis for message format parsing and protocol state machine inference. The specific method comprises the following steps: for the acquired message, a preliminary byte type extraction processing procedure is performed first, and preliminary grouping of the message is performed according to the processing result.

Specifically, referring to fig. 2, the classifying the message for each target malicious sample data packet to determine the message type of the target malicious sample data packet includes: aiming at each virus family category, acquiring a plurality of malicious sample online data packets contained in the category; the target malicious sample data packet includes: malicious sample online data packets; extracting byte types of each malicious sample online data packet; the byte type includes: character type and/or binary type; judging whether the byte number of the character type exceeds a preset threshold value or not; if yes, determining the malicious sample online data packet as a text message; if not, determining the malicious sample online data packet as a binary class message.

Since the target malicious sample data packets also include instruction data packets and/or heartbeat packets, these are classified in the same way as the online data packets. Specifically, performing message classification on each target malicious sample data packet and determining its message type (step S135) further includes: for each virus family category, acquiring a plurality of malicious sample instruction data packets contained in the category, the target malicious sample data packets including the malicious sample instruction data packets;

in step S140, for each malicious sample instruction packet, extracting a byte type of the malicious sample instruction packet; the byte type includes: character type and/or binary type; judging whether the byte number of the character type exceeds a preset threshold value or not; if yes, determining the malicious sample instruction data packet as a text message; if not, determining the malicious sample instruction data packet as a binary class message.

According to the above-mentioned message classification schemes of the online data packet and the instruction data packet in the present embodiment, in the process of message classification, the message type needs to be judged according to the type of the bytes in the message, and the following two principles are mainly based:

a. Each byte in each message is analyzed: if it is a printable character, or part of the typical line break sequence 0x0D0A (carriage return followed by line feed), the byte is determined to be of character type; otherwise it is determined to be a binary byte.

b. After judging each byte type of each message, analyzing the whole message, when the character type bytes appearing in the message are found to exceed a preset threshold value, judging that the message is a text type message, otherwise, judging that the message is a binary type message.

For rule b, the predetermined threshold may be set, for example, so that a message is judged to be a text message only when all of its bytes are character-type bytes, or so that it is judged to be a text message when the number of character-type bytes exceeds a predetermined threshold; otherwise the message is judged to be a binary message.
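Rules a and b above can be sketched as follows. This is a minimal sketch under stated assumptions: "printable" is taken to mean printable ASCII (0x20 to 0x7E), and the threshold is expressed as a share of the message length with a hypothetical default of 0.9, whereas the embodiment leaves the exact threshold open. The function names are illustrative only.

```python
def is_character_byte(value: int) -> bool:
    """Rule a: printable ASCII, or CR/LF from the 0x0D0A line break."""
    return 0x20 <= value <= 0x7E or value in (0x0D, 0x0A)


def classify_message(payload: bytes, threshold: float = 0.9) -> str:
    """Rule b: 'text' if the share of character-type bytes exceeds the threshold.

    The ratio-based threshold and its default value are assumptions.
    """
    if not payload:
        return "binary"
    char_count = sum(1 for value in payload if is_character_byte(value))
    return "text" if char_count / len(payload) > threshold else "binary"


text_kind = classify_message(b"GET /index HTTP/1.1\r\n")
binary_kind = classify_message(b"\x00\x01\x02\xffAB")
```

Setting `threshold` just below 1.0 approximates the stricter variant in which every byte must be of character type.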

In this embodiment, by classifying the target malicious sample data packets of different types according to virus families and based on the characteristics of the messages, the message packets are primarily determined, and one or more malicious samples are primarily labeled with a specific class label, so that accurate extraction of common characteristics of a certain subsequent malicious sample is facilitated.

After each message has been preliminarily processed and classified, the next task is message grouping (a more specific notion than message classification). As shown in FIG. 2, in some embodiments, the malicious sample feature information further includes: the communication data packet length and the communication data packet transmission direction. In step S140, performing message grouping on each target malicious sample data packet to obtain at least a first group of target malicious sample data packets includes: after determining the message type of the target malicious sample data packet, obtaining a preliminary grouping of the target malicious sample data packets; for each target malicious sample data packet, determining the target malicious sample data packet as a first target malicious sample data packet; based on the communication data packet length and the communication data packet transmission direction, comparing the first target malicious sample data packet with the target malicious sample data packets of the different message types through a local sequence alignment algorithm, and determining whether a message classification matching the first target malicious sample data packet exists; if so, adding the first target malicious sample data packet to the preliminary group corresponding to the matching message classification to form a first message group; if not, creating a new message group and adding the first target malicious sample data packet to it to form a second message group.

The local sequence alignment algorithm is used to find highly similar fragments in two message sequences, so as to determine whether a message classification matching the first target malicious sample data packet exists and thereby decide the specific group to which the target malicious sample data packet belongs.

In some embodiments, the local sequence alignment algorithm is the Smith-Waterman local alignment algorithm, in which the traceback starts from a cell of the scoring matrix and proceeds back toward the upper-left corner.

In this embodiment, the specific grouping strategy may be as follows: after the byte type of a message A is obtained, message A is compared with the message groups already formed (at the beginning, these are the preliminary groups produced by the preceding message classification) through the Smith-Waterman local sequence alignment algorithm, to see whether a message classification matching the message exists. If so, the message is added directly to the corresponding message group; otherwise, a new message group is created. The number of iterations of this comparison loop may be set in order to achieve a more accurate grouping.

Specifically, a local sequence alignment diagram for a plurality of messages is obtained through the Smith-Waterman local sequence alignment algorithm. Whether two messages share the same fragment can be judged from this diagram, and if they do, the two messages can be placed in the same message group.

A binary class message can only be placed in the same group as other binary class messages, and a text class message can only be placed in the same group as other text class messages. Similarly, messages with the same transmission direction may be stored in the same message group, while messages with different transmission directions cannot.
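For illustration, a compact form of Smith-Waterman local alignment over two message payloads is sketched below. The scoring values (`match=2`, `mismatch=-1`, `gap=-1`) and the example payloads are assumptions chosen for readability; the embodiment does not specify them. The function returns the best local score and the aligned segment of each message.

```python
def smith_waterman(a: bytes, b: bytes, match: int = 2, mismatch: int = -1, gap: int = -1):
    """Return (best_score, segment_of_a, segment_of_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, end_i, end_j = 0, 0, 0

    def substitution(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the scoring matrix; scores never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + substitution(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, end_i, end_j = score[i][j], i, j

    # Trace back from the best cell toward the upper-left until a zero is reached.
    i, j = end_i, end_j
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, a[i:end_i], b[j:end_j]


# Hypothetical payloads sharing the fragment b"LOGIN:".
best_score, segment_a, segment_b = smith_waterman(
    b"\x01\x02LOGIN:alice\x00", b"\xff\xfeLOGIN:bob\x00\x00"
)
```

In this example the two payloads share only the fragment `LOGIN:`, which is what the traceback recovers from both messages.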
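The grouping strategy described above (add a message to an existing group if one matches, otherwise open a new group, never mixing message types or transmission directions) can be sketched as follows. The `Message` and `MessageGroup` types, the direction labels, and the `same_prefix` predicate are hypothetical; in particular, `same_prefix` is only a stand-in for a similarity test based on local sequence alignment, and the sketch starts from an empty set of groups rather than from preliminary groups.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Message:
    payload: bytes
    kind: str        # "text" or "binary", as decided by message classification
    direction: str   # hypothetical labels, e.g. "out" or "in"


@dataclass
class MessageGroup:
    kind: str
    direction: str
    members: List[Message] = field(default_factory=list)


def assign_to_groups(messages: List[Message],
                     is_similar: Callable[[bytes, bytes], bool]) -> List[MessageGroup]:
    """Add each message to the first compatible, matching group, else open a new one."""
    groups: List[MessageGroup] = []
    for msg in messages:
        for group in groups:
            compatible = group.kind == msg.kind and group.direction == msg.direction
            if compatible and any(is_similar(msg.payload, m.payload) for m in group.members):
                group.members.append(msg)
                break
        else:
            groups.append(MessageGroup(msg.kind, msg.direction, [msg]))
    return groups


def same_prefix(a: bytes, b: bytes) -> bool:
    """Stand-in similarity test (an assumption, not the embodiment's criterion)."""
    return a[:4] == b[:4]


groups = assign_to_groups(
    [
        Message(b"HELO alpha", "text", "out"),
        Message(b"HELO beta", "text", "out"),
        Message(b"HELO gamma", "text", "in"),
        Message(b"\x01\x02\x03\x04", "binary", "out"),
    ],
    same_prefix,
)
group_sizes = [len(g.members) for g in groups]
```

The third message resembles the first two but travels in the other direction, so it opens its own group.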

Specifically, for a binary class message, when determining whether there is a message classification matching the first target malicious sample data packet, the constraints on whether the first target malicious sample data packet can be added to that message classification include: 1. length restriction: the length difference between the two messages does not exceed a specified threshold; 2. content restriction: the edit distance between the two messages does not exceed a specified threshold; 3. format restriction: two messages of the same type have the same number and order of text fields, binary fields, and Unicode fields.

For a text class message, when determining whether there is a message classification matching the first target malicious sample data packet, the constraints on whether the first target malicious sample data packet can be added to that message classification include: 1. fragment restriction: each message is divided into a sequence of fragments using the line break 0x0D0A as the separator, and the difference in the number of fragments between the two messages does not exceed a specified threshold; 2. length restriction: the length difference between corresponding fragments of the two messages does not exceed a specified threshold; 3. content restriction: the edit distance between corresponding fragments of two messages of the same type does not exceed a specified threshold; 4. string restriction: for a preset fragment, the characters are divided into strings using special symbols as boundaries, and the number of common strings shared by the corresponding fragments of two messages of the same type satisfies a preset threshold condition.
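The three constraints for binary class messages can be sketched as a single predicate. The thresholds (`max_len_diff`, `max_edit`) are hypothetical defaults. The format check is a simplification: each maximal run of printable or non-printable bytes is treated as one field, and Unicode fields are not detected separately, which is an assumption rather than the embodiment's field parser.

```python
from itertools import groupby


def edit_distance(a: bytes, b: bytes) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def field_layout(payload: bytes):
    """Field types in order, treating each maximal run of same-type bytes as a field."""
    def byte_kind(value: int) -> str:
        return "text" if 0x20 <= value <= 0x7E else "binary"
    return [kind for kind, _ in groupby(payload, key=byte_kind)]


def binary_messages_match(a: bytes, b: bytes, max_len_diff: int = 4, max_edit: int = 6) -> bool:
    """Apply the length, content, and format restrictions in turn."""
    return (
        abs(len(a) - len(b)) <= max_len_diff       # 1. length restriction
        and edit_distance(a, b) <= max_edit        # 2. content restriction
        and field_layout(a) == field_layout(b)     # 3. format restriction
    )


# Hypothetical payloads for illustration.
msg_a = b"\x01\x00\x10bot-17\x00\xff"
msg_b = b"\x01\x00\x12bot-203\x00\xfe"
msg_c = b"\x01\x00\x10\x00\xff"
```

Here `msg_a` and `msg_b` satisfy all three constraints, while `msg_c` fails both the length and the format restriction against `msg_a`.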
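The four constraints for text class messages can be sketched in the same way. All thresholds are hypothetical defaults. Two further assumptions are made: the string restriction is applied to every non-empty fragment rather than to a preset one, and any non-alphanumeric byte is treated as a special symbol when splitting a fragment into strings.

```python
import re


def edit_distance(a: bytes, b: bytes) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def tokens(fragment: bytes) -> set:
    """Split a fragment into strings at non-alphanumeric bytes (an assumption)."""
    return {t for t in re.split(rb"[^A-Za-z0-9]+", fragment) if t}


def text_messages_match(a: bytes, b: bytes, max_fragment_diff: int = 1,
                        max_len_diff: int = 8, max_edit: int = 8,
                        min_common: int = 1) -> bool:
    frags_a, frags_b = a.split(b"\r\n"), b.split(b"\r\n")
    # 1. fragment restriction: similar number of line-delimited fragments
    if abs(len(frags_a) - len(frags_b)) > max_fragment_diff:
        return False
    for fa, fb in zip(frags_a, frags_b):
        # 2. length restriction on corresponding fragments
        if abs(len(fa) - len(fb)) > max_len_diff:
            return False
        # 3. content restriction on corresponding fragments
        if edit_distance(fa, fb) > max_edit:
            return False
        # 4. string restriction: enough common strings between the fragments
        ta, tb = tokens(fa), tokens(fb)
        if (ta or tb) and len(ta & tb) < min_common:
            return False
    return True


# Hypothetical text messages for illustration.
req_a = b"POST /gate.php HTTP/1.1\r\nHost: c2.example\r\nid=1001\r\n"
req_b = b"POST /gate.php HTTP/1.1\r\nHost: c2.example\r\nid=2093\r\n"
req_c = b"PING\r\n"
```

`req_a` and `req_b` differ only in the value after `id=`, so every corresponding fragment passes all four checks; `req_c` has too few fragments and is rejected at the first check.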

The step of grouping the packets of each target malicious sample data packet to obtain at least a first group of target malicious sample data packets further includes: when two target malicious sample data meet the limitation conditions of binary class message classification or text type message classification, judging that the two target malicious sample data have similar fragments; dividing the two target malicious sample data into the same packet.

The extracting the common features in each set of malicious sample data packets comprises: similar segments of the set of two target malicious sample data are extracted and determined to be common features.

Or in other embodiments, when more than two target malicious sample data meet the limitation condition of binary class message classification or the limitation condition of text type message, judging that the more than two target malicious sample data have similar fragments; and extracting similar fragments in the more than two target malicious sample data, and determining fragments with smaller editing distance between the similar fragments as common features.

The edit distance, also called the Levenshtein distance, is a quantitative measure of the degree of difference between two character strings (e.g., strings of English characters). It is defined as the minimum number of single-character edit operations (insertion, deletion, or substitution) needed to change one string into the other; the smaller the edit distance, the smaller the difference between the two strings.
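As an illustration of the definition above, the following sketch computes the edit distance with the standard dynamic programming table. The function name is hypothetical, and the patent does not specify which implementation is used.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning s into t."""
    # d[i][j] holds the edit distance between s[:i] and t[:j].
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = i  # delete all i characters of s[:i]
    for j in range(len(t) + 1):
        d[0][j] = j  # insert all j characters of t[:j]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[-1][-1]


print(levenshtein("kitten", "sitting"))  # 3
```

The classic pair "kitten" and "sitting" differs by two substitutions and one insertion, giving a distance of 3.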

In this embodiment, for a given type of message, once the corresponding constraints above are met, the corresponding fragments of the two messages are determined to be similar, and the two messages are therefore judged likely to belong to the same message group. If a plurality of messages are judged to be similar, the fragment with the minimum edit distance is selected as the best matching feature of the message, the message is added to the corresponding message type group, and the best matching feature is used as an invariant field in that message type group.

With continued reference to fig. 2, in some embodiments, after the variable field and the invariant field of the plurality of communication data packets are obtained based on the local sequence alignment diagram, the method further includes: filtering the common characters in each communication data packet out of the variable field and the invariant field to obtain the final constant field, variable field, and invariant field.

The common characters are characters commonly found in normal samples, such as those of the ASCII character set and encoding, the GB2312 character set and encoding, the Unicode character set and encoding, the UTF-8 encoding, and the like. The invariant field is an unchanging object carried in the communication data packet, and the variable field is a changing object carried in the communication data packet.

In this embodiment, after a local sequence alignment is performed on the communication data packets using the Smith-Waterman local sequence alignment algorithm, the variable and invariant fields of the plurality of communication data packets in a group can be determined from the resulting local sequence alignment diagram. The message structure and common features of a given type of malicious sample can then be obtained from these fields and used to enrich a malicious sample feature library for unknown network protocols.
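The local alignment step can be sketched as follows. This is a minimal Smith-Waterman implementation with a linear gap penalty; the scoring values (match 2, mismatch -1, gap -1), the function names, and the use of "-" as a gap marker are assumptions for illustration, and the helper that reads invariant fragments off an alignment is a hypothetical simplification of the alignment diagram described above.

```python
from typing import List, Tuple


def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1,
                   gap: int = -1) -> Tuple[int, str, str]:
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a: List[str] = []
    out_b: List[str] = []
    while i > 0 and j > 0 and h[i][j] > 0:
        diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if h[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


def invariant_fragments(aligned_a: str, aligned_b: str) -> List[str]:
    """Runs of identical aligned characters; assumes inputs contain no literal '-'."""
    frags: List[str] = []
    cur: List[str] = []
    for x, y in zip(aligned_a, aligned_b):
        if x == y and x != "-":
            cur.append(x)
        elif cur:
            frags.append("".join(cur))
            cur = []
    if cur:
        frags.append("".join(cur))
    return frags


print(smith_waterman("xxGTTyy", "zzGTTww"))     # (6, 'GTT', 'GTT')
print(invariant_fragments("GTT-AC", "GTTGAC"))  # ['GTT', 'AC']
```

Positions where the aligned sequences agree correspond to invariant fields, and positions with mismatches or gaps correspond to variable fields. Real payloads would be aligned as byte sequences with a gap marker that cannot collide with payload bytes.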

To aid understanding of the technical solution and technical effects provided by the embodiments of the present invention, one embodiment is described in detail below with a specific example, with reference to fig. 3:

In order to improve the detection rate of network attack events that use unknown network protocols, a network security department needs to collect a large number of malicious samples (threat samples) and reverse-analyze them, so that the malicious attribute features of a given type of malicious sample can be extracted to build a detection feature library. The specific reverse analysis process is as follows:

Step S2: data collection. A large number of threat activity samples are acquired from a database holding a large number of malicious samples, and their pcap data packets are collected;

Also in step S2: the pcap data packets are processed; specifically, a large number of data packets of specified categories, such as online, heartbeat, and instruction packets, are extracted from the pcap data packets;

Step S1 may be included before step S2, in which a database containing a large number of malicious sample pcap data packets is provided.
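As a rough illustration of the pcap processing in step S2, the sketch below reads raw records from a classic pcap capture using only the standard library. The function name is hypothetical; the sketch assumes the classic microsecond-resolution pcap layout (24-byte global header, 16-byte record headers), does not handle pcapng, and leaves out the link-layer and transport-layer parsing that would be needed to reach the payload and to select online, heartbeat, or instruction packets.

```python
import io
import struct
from typing import BinaryIO, Iterator, Tuple


def read_pcap_records(stream: BinaryIO) -> Iterator[Tuple[float, bytes]]:
    """Yield (timestamp, captured_bytes) for each record of a classic pcap stream."""
    header = stream.read(24)
    if len(header) < 24:
        raise ValueError("truncated pcap global header")
    magic = struct.unpack("<I", header[:4])[0]
    if magic == 0xA1B2C3D4:
        endian = "<"   # file written in little-endian byte order
    elif magic == 0xD4C3B2A1:
        endian = ">"   # file written in big-endian byte order
    else:
        raise ValueError("not a classic microsecond-resolution pcap file")
    while True:
        rec = stream.read(16)
        if len(rec) < 16:
            return     # end of file (or truncated record header)
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack(endian + "IIII", rec)
        data = stream.read(incl_len)
        if len(data) < incl_len:
            return     # truncated record body
        yield ts_sec + ts_usec / 1_000_000, data


# Usage on an in-memory capture holding a single 3-byte record.
capture = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
capture += struct.pack("<IIII", 1, 500000, 3, 3) + b"abc"
records = list(read_pcap_records(io.BytesIO(capture)))
```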

Step S4: the data packets are initially classified, and all communication packets are initially and simply classified according to virus families and the like; of course, the partial reclassification may be further performed based on the information such as the packet length and the transmission direction.

Before step S4, step S3 may be included, in which basic information of each data packet is calculated, for example sample feature information such as the packet length and the transmission direction, for use as a basis for the subsequent grouping. Of course, this step may also be performed during the subsequent grouping.
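Steps S3 and S4 can be sketched as a simple keyed grouping. The `Packet` record, its field names, the direction labels, and the length bucket size are all hypothetical; the text above only states that packets are first classified by virus family and may be further subdivided by packet length and transmission direction.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Packet:
    family: str      # virus family label of the originating sample (assumed field)
    direction: str   # e.g. "c2s" or "s2c" (assumed labels)
    payload: bytes


def preliminary_classify(packets: Iterable[Packet],
                         length_bucket: int = 16) -> Dict[Tuple[str, str, int], List[Packet]]:
    """Group packets by (family, direction, coarse length bucket)."""
    classes: Dict[Tuple[str, str, int], List[Packet]] = defaultdict(list)
    for p in packets:
        key = (p.family, p.direction, len(p.payload) // length_bucket)
        classes[key].append(p)
    return dict(classes)


pkts = [
    Packet("famA", "c2s", b"abcd"),
    Packet("famA", "c2s", b"abcdefghij"),
    Packet("famA", "s2c", b"abcd"),
]
classes = preliminary_classify(pkts)
```

Here the two client-to-server packets fall into the same class, while the server-to-client packet forms its own class.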

Step S5: message classification, comprising steps S5a and S5b; depending on the type of the target malicious sample data obtained, only one of the two steps may be performed. The preliminary classification of messages is a very important step, since it is the basis for message format parsing and protocol state machine inference. The specific method is as follows: for each acquired message, a preliminary byte type extraction is performed first, and the messages are preliminarily grouped according to the result. During byte type extraction, the first step is to determine the message type from the character types in the message, mainly according to the following two principles:

a. Each byte in each message is analyzed: if the byte is a printable character, it is judged to be of character type; otherwise, it is judged to be a binary byte;

b. After every byte of a message has been judged, the message is analyzed as a whole: when all bytes in the message are character bytes (the sequence 0x0D0A included), the message is judged to be a text message; otherwise, it is judged to be a binary class message.
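Principles a and b can be sketched as below. The printable range (0x20 to 0x7E plus carriage return and line feed) and the handling of an empty message as binary are assumptions; the text only says "printable character" and that 0x0D0A counts as character bytes.

```python
def byte_type(value: int) -> str:
    """Principle a: classify a single byte as 'char' or 'binary'."""
    if 0x20 <= value <= 0x7E or value in (0x0D, 0x0A):
        return "char"
    return "binary"


def classify_message(msg: bytes) -> str:
    """Principle b: a message is 'text' only if every byte is a character byte."""
    if msg and all(byte_type(b) == "char" for b in msg):
        return "text"
    return "binary"


print(classify_message(b"HELLO 1234\r\n"))   # text
print(classify_message(b"\x01\x02AB\x00"))   # binary
```

A variant that counts character bytes and compares the count against a preset threshold, as recited in the claims, would only change the condition in `classify_message`.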

Steps S6 to S8: message grouping and feature extraction for each communication data packet. After each message has been preliminarily processed, the next task is to group the messages (see fig. 2). The specific grouping strategy is: after the byte type of a message is obtained, the message is compared with the previously formed message groups using the Smith-Waterman local sequence alignment algorithm to determine whether a message classification matching the message exists; if so, the message is added directly to the corresponding message group. A schematic local sequence alignment diagram is shown in fig. 3. From this diagram, the identical fragments in two communication data packets can be obtained, for example GTT and AC, which serve as invariant fields. The variable fields may be the differing fragments in the diagram. The constant field may be the ASCII codes that were filtered out.
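The grouping strategy can be sketched as an incremental loop: each message joins the first existing group it matches, and otherwise starts a new group (the latter case follows the claims). The `is_similar` predicate is a placeholder for the Smith-Waterman comparison combined with the binary or text constraints, and comparing against the first member of each group as its representative is an assumption.

```python
from typing import Callable, Iterable, List


def group_messages(messages: Iterable[bytes],
                   is_similar: Callable[[bytes, bytes], bool]) -> List[List[bytes]]:
    """Assign each message to the first matching group, or open a new group."""
    groups: List[List[bytes]] = []
    for msg in messages:
        for group in groups:
            if is_similar(msg, group[0]):  # compare with the group's representative
                group.append(msg)
                break
        else:
            groups.append([msg])           # no match found: create a new group
    return groups


# Toy predicate for demonstration only: messages match if they share a 3-byte prefix.
same_prefix = lambda a, b: a[:3] == b[:3]
groups = group_messages([b"GET /a", b"PUT /b", b"GET /c"], same_prefix)
```

With the toy predicate, the two GET messages end up in one group and the PUT message in another.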

Step S9: final extraction of common features. Through the above steps, the invariant field, the variable field, and the constant field of the instruction features under each family name can be obtained, and the message structure and common features of a specific malicious sample message group can be derived from the invariant field, the variable field, and the constant field.

After the message structure and the public features are obtained, a malicious attribute feature library can be enriched and used for detecting and identifying a specific type of malicious sample.

Therefore, according to the malicious sample data packet analysis method provided by the embodiment of the invention, the collected large number of malicious samples of unknown network protocols are subjected to reverse message analysis operation, the malicious samples of various virus families are subjected to message grouping, and the public features in each group of malicious sample data packets are extracted from the malicious samples, so that the public features in the malicious samples of an unknown protocol are conveniently extracted, further, the malicious properties of the malicious samples of a certain virus family are identified, and the feature library of the malicious samples of various virus families can be enriched through the extracted public features, so that the capability of identifying and finding the malicious code threat of the unknown protocol can be improved.

Example two

Fig. 4 is a block diagram of an embodiment of a malicious sample data packet analysis apparatus according to the present invention. Referring to fig. 4, the malicious sample data packet analysis apparatus of the present embodiment includes: an acquiring program module 210, configured to acquire a plurality of malicious sample data packets; a screening program module 220, configured to screen the plurality of malicious sample data packets to obtain a plurality of target malicious sample data packets; a primary classification program module 230, configured to classify each target malicious sample data packet according to the malicious sample feature information; a grouping program module 240, configured to perform message grouping on the target malicious sample data packets of each type to obtain at least a first group of target malicious sample data packets, where the first group of target malicious sample data packets includes at least two similar target malicious sample data packets; and a feature extraction program module 260, configured to extract common features in each group of malicious sample data packets.

The device of this embodiment may be used to implement the technical solution of the method embodiment shown in fig. 1, and its implementation principle and technical effects are similar, and are not described here again.

In addition, the apparatus of this embodiment may also be used for performing other embodiments of the method for analyzing a malicious sample packet according to the first embodiment, where details are not mentioned, reference may be made to each other, and details are not repeated here.

Example III

Fig. 5 is a schematic structural diagram of an embodiment of an electronic device according to the present invention. Based on the method provided in the first embodiment and the apparatus provided in the second embodiment, and as shown in fig. 5, an embodiment of the present invention further provides an electronic device that can implement the step flow of any one of the embodiments of the present invention. The electronic device may include: a shell 41, a processor 42, a memory 43, a circuit board 44, and a power circuit 45, wherein the circuit board 44 is arranged in a space enclosed by the shell 41, and the processor 42 and the memory 43 are arranged on the circuit board 44; the power circuit 45 is used for supplying power to each circuit or device of the electronic device; the memory 43 is used for storing executable program code; and the processor 42 runs a program corresponding to the executable program code by reading the executable program code stored in the memory 43, so as to perform the malicious sample data packet analysis method according to any one of the foregoing embodiments.

For the specific process by which the processor 42 performs the above steps, and the further steps performed by the processor 42 by running the executable program code, reference may be made to the description of the first embodiment of the present invention, which is not repeated here.

Still further embodiments of the present invention provide a computer readable storage medium storing the encrypted data according to any one of the first embodiment, wherein the encrypted data includes an executable decryption program executable by one or more processors to implement the malicious sample packet analysis method according to any one of the first embodiment.

In summary, compared with the existing asset data identification scheme based on feature matching, the malicious sample data packet analysis method and device provided by the embodiments of the present invention perform a reverse message analysis on a large number of collected malicious samples of unknown network protocols, group the messages of the malicious samples of each virus family, and extract the common features in each group of malicious sample data packets. This makes it convenient to extract the common features of malicious samples of an unknown protocol and, in turn, to identify the malicious attributes of the malicious samples of a given virus family. The extracted common features can enrich the feature library of malicious samples of various virus families, thereby improving the capability of identifying and discovering malicious code threats that use unknown protocols.

Further, in the process of reverse analysis, a series of communication data messages are input, and the messages are classified. Further grouping the messages, deducing the format structure information of each message class, obtaining the format model of the protocol message, extracting the public features from the format model, and being capable of being used for accurately identifying a malicious sample of a specific type.

Such electronic devices exist in a variety of forms including, but not limited to:

(1) Mobile communication devices: such devices are characterized by mobile communication capabilities and are primarily aimed at providing voice and data communication. Such terminals include: smart phones (e.g., iPhone), multimedia phones, feature phones, low-end phones, and the like.

(2) Ultra mobile personal computer device: such devices are in the category of personal computers, having computing and processing functions, and generally also having mobile internet access characteristics. Such terminals include: PDA, MID, and UMPC devices, etc., such as iPad.

(3) Portable entertainment devices: such devices can display and play multimedia content. Such devices include: audio and video players (e.g., iPod), handheld game consoles, e-book readers, smart toys, and portable car navigation devices.

(4) Servers: a server comprises a processor, a hard disk, a memory, a system bus, and the like. Its architecture is similar to that of a general-purpose computer, but because it must provide highly reliable services, it has higher requirements in terms of processing capacity, stability, reliability, security, scalability, manageability, and the like.

(5) Other electronic devices with data interaction functions.

It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

In this specification, each embodiment is described in a related manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments.

In particular, for the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments in part.

For convenience of description, the above apparatus is described as being functionally divided into various units/modules, respectively. Of course, the functions of the various elements/modules may be implemented in the same piece or pieces of software and/or hardware when implementing the present invention.

Those skilled in the art will appreciate that implementing all or part of the above-described methods in accordance with the embodiments may be accomplished by way of a computer program stored on a computer readable storage medium, which when executed may comprise the steps of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), or the like.

The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims

1. A method for analyzing malicious sample data packets, comprising the steps of:

acquiring a plurality of malicious sample data packets;

screening the plurality of malicious sample data packets to obtain a plurality of target malicious sample data packets;

classifying each target malicious sample data packet according to the malicious sample characteristic information;

aiming at each type of target malicious sample data packet, carrying out message grouping on each target malicious sample data packet to at least obtain a first group of target malicious sample data packets, wherein the first group of target malicious sample data packets at least comprise two similar target malicious sample data packets;

common features in each set of malicious sample data packets are extracted.

2. The malicious sample data packet analysis method of claim 1, wherein the common characteristics comprise: constant field fields, variable field fields, and/or constant fields.

3. The malicious sample data packet analysis method according to claim 1, wherein the malicious sample feature information comprises: a virus family to which the malicious sample belongs;

the step of classifying the message of each target malicious sample data packet and determining the message type of the target malicious sample data packet includes: aiming at each virus family category, acquiring a plurality of malicious sample online data packets contained in the category; the target malicious sample data packet includes: malicious sample online data packets;

extracting byte types of each malicious sample online data packet; the byte type includes: character type and/or binary type;

Judging whether the byte number of the character type exceeds a preset threshold value or not;

if yes, determining the malicious sample online data packet as a text message;

if not, determining the malicious sample online data packet as a binary class message.

4. The method for analyzing malicious sample data packets according to claim 3, wherein said classifying each target malicious sample data packet to determine a packet type of the target malicious sample data packet further comprises: aiming at each virus family category, acquiring a plurality of malicious sample instruction data packets contained in the category; the target malicious sample data packet includes: malicious sample instruction packets;

extracting byte types of each malicious sample instruction data packet; the byte type includes: character type and/or binary type;

if yes, determining the malicious sample instruction data packet as a text message;

if not, determining the malicious sample instruction data packet as a binary class message.

5. The malicious sample data packet analysis method according to claim 3 or 4, wherein the malicious sample feature information further comprises: the length of the communication data packet and the transmission direction of the communication data packet; and the step of performing message grouping on each target malicious sample data packet to at least obtain a first group of target malicious sample data packets comprises: after determining the message type of the target malicious sample data packet, obtaining a preliminary group of the target malicious sample data packet;

based on the length of a communication data packet and the transmission direction of the communication data packet, comparing a first target malicious sample data packet with target malicious sample data packets of different message types through a local sequence comparison algorithm, and determining whether a message classification matched with the first target malicious sample data packet exists or not;

if so, adding the first target malicious sample data packet into a preliminary packet corresponding to the matched message classification to form a first message packet;

if not, a new message packet is created, and the first target malicious sample data packet is added to the new message packet to form a second message packet.

6. The method of claim 5, wherein for binary class packets, in determining whether there is a packet classification matching the first target malicious sample data packet, the constraint on whether the first target malicious sample data can be added to the class of packet classification comprises:

length restriction: the length difference between the two messages does not exceed a specified threshold;

content restriction: the edit distance between the two messages does not exceed a specified threshold;

format restriction: two messages of the same type have the same number and order of text fields, binary fields, and Unicode fields.

7. The method of claim 5, wherein for a text type message, in determining whether there is a message classification matching the first target malicious sample data packet, the constraint on whether the first target malicious sample data can be added to the type message classification comprises:

fragment restriction: each message is divided into a sequence of fragments using the line-feed symbol as the separator, and the difference in the number of fragments between the two messages does not exceed a specified threshold;

length limitation: the length difference of the two corresponding fragments between the messages does not exceed a specified threshold value;

content restriction: the editing distance of two fragments corresponding to two messages of the same type does not exceed a specified threshold;

string restriction: and (3) dividing a character set by taking a special symbol as a boundary for a preset segment, wherein the number of common character strings existing between corresponding segments between two messages of the same type meets a preset threshold condition.

8. The method for analyzing malicious sample data packets according to claim 1, wherein said grouping each of the target malicious sample data packets into packets, at least obtaining the first group of target malicious sample data packets, further comprises: when two target malicious sample data meet the limitation conditions of binary class message classification or text type message classification, judging that the two target malicious sample data have similar fragments;

dividing the two target malicious sample data into the same message group;

the extracting the common features in each set of malicious sample data packets comprises:

extracting similar fragments of the set of two target malicious sample data, and determining the similar fragments as public features; or,

when more than two target malicious sample data meet the limiting conditions of binary class message classification or text type message classification, judging that the more than two target malicious sample data have similar fragments;

and extracting similar fragments in the more than two target malicious sample data, and determining fragments with smaller editing distance between the similar fragments as common features.

9. A malicious sample packet analysis apparatus, comprising:

An acquisition program module for acquiring a plurality of malicious sample data packets;

the screening program module is used for screening the plurality of malicious sample data packets to obtain a plurality of target malicious sample data packets;

the primary classification program module is used for classifying each target malicious sample data packet according to the malicious sample characteristic information;

the grouping program module is used for grouping each type of target malicious sample data packet according to the message, at least obtaining a first group of target malicious sample data packets, wherein the first group of target malicious sample data packets at least comprise two similar target malicious sample data packets;

and the feature extraction program module is used for extracting the public features in each group of malicious sample data packets.

10. An electronic device, the electronic device comprising: the device comprises a shell, a processor, a memory, a circuit board and a power circuit, wherein the circuit board is arranged in a space surrounded by the shell, and the processor and the memory are arranged on the circuit board; a power supply circuit for supplying power to each circuit or device of the electronic apparatus; the memory is used for storing executable program codes; a processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory for performing the malicious sample packet analysis method according to any one of the preceding claims 1 to 8.