KR20230076389A

KR20230076389A - Method and apparatus for generating artificial intelligence-based reconnaissance false positive identification model and method and apparatus for artificial intelligence-based reconnaissance false positive identification

Info

Publication number: KR20230076389A
Application number: KR1020210163254A
Authority: KR
Inventors: 최병환
Original assignee: 주식회사 윈스
Priority date: 2021-11-24
Filing date: 2021-11-24
Publication date: 2023-05-31

Abstract

According to one embodiment of the present invention, an artificial intelligence-based true/false positive identification method comprises: (A) a step in which an artificial intelligence-based true/false positive identification apparatus generates a plurality of artificial intelligence-based data models based on a plurality of attack packets that serve as training data and have been determined to be true positives or false positives; and (B) a step of identifying, based on the plurality of artificial intelligence-based data models, whether each of a plurality of attack packets that serve as test data, and that may be true positives or false positives, is a true positive or a false positive. Step (A) includes: (A-1) a step of generating a plurality of types of features from each of the plurality of attack packets that serve as training data; (A-2) a step of generating a plurality of clusters by clustering the features of one type among the plurality of types of features; (A-3) a step of selecting one feature from each of the plurality of clusters; and (A-4) a step of generating, for each type of feature, an artificial intelligence-based data model for identifying whether an attack packet is a true positive or a false positive, based on the selected features of the one type and the features of the remaining types corresponding to each of the selected features.

Description

Method and apparatus for generating an artificial intelligence-based true/false positive identification model, and method and apparatus for artificial intelligence-based true/false positive identification

The present invention relates to generating an artificial intelligence-based true/false positive identification model, and to artificial intelligence-based true/false positive identification, capable of identifying whether an attack packet that may be a true positive or a false positive is in fact a true positive or a false positive. In particular, it relates to a method and apparatus for generating an artificial intelligence-based true/false positive identification model, and a method and apparatus for artificial intelligence-based true/false positive identification, which collect attack packets in a network traffic environment, learn from the packet data to generate an artificial intelligence (AI)/machine learning (ML) model, and use that model to detect attacks.

The present invention was supported by the following national research and development project.

Ministry: Ministry of Science and ICT

Project management agency: Korea Internet & Security Agency (KISA)

Research program: 2021 AI-based security product and service development support program

Research project title: Development of a next-generation SIEM solution applying artificial intelligence continual learning and distributed mining techniques

Contribution rate: 1/1

Project performing organization: Wins Co., Ltd.

Research period: 2021.06.09 to 2021.11.30 (6 months)

In a network traffic environment, attack packets must be analyzed to identify whether a detected attack is a true positive or a false positive, and there is a limit to having a person analyze attack packets every time.

In addition, generating an artificial intelligence learning model for identifying whether attack packets are true or false positives requires a plurality of attack packets determined to be true positives and a plurality of attack packets determined to be false positives. Because generating an optimal learning model requires training on a large volume of attack packets of both kinds, generating the learning model takes a long time.

In addition, in a network traffic environment, the rapid increase in attack packets means that considerable time is required to identify whether attack packets are true or false positives, and variant attack packets are also rapidly increasing.

Therefore, a technology is needed that can identify true and false positives among attack packets and detect attacks using an artificial intelligence (AI) learning model in place of a person, that can shorten the time required to generate an optimal learning model, that can shorten the time required to identify whether attack packets are true or false positives, and that can detect variant attack packets.

KR 10-2271449 B1

A problem to be solved by the present invention is to provide a method and apparatus for generating an artificial intelligence-based true/false positive identification model, and a method and apparatus for artificial intelligence-based true/false positive identification, that identify the characteristics of attack packets in a network traffic environment, analyze the attack packets automatically through artificial intelligence (AI)/machine learning (ML) rather than by a person, and thereby identify whether attack packets are true or false positives and detect attacks.

Another problem to be solved by the present invention is to provide a method and apparatus for generating an artificial intelligence-based true/false positive identification model, and a method and apparatus for artificial intelligence-based true/false positive identification, that can detect variant attacks by identifying the characteristics of attack packets in a network traffic environment.

Yet another problem to be solved by the present invention is to provide a method and apparatus for generating an artificial intelligence-based true/false positive identification model, and a method and apparatus for artificial intelligence-based true/false positive identification, that can shorten the time required to generate an optimal learning model.

Yet another problem to be solved by the present invention is to provide a method and apparatus for generating an artificial intelligence-based true/false positive identification model, and a method and apparatus for artificial intelligence-based true/false positive identification, that can shorten the time required to identify whether a large volume of attack packets are true or false positives.

To solve the above problems, a method for generating an artificial intelligence-based true/false positive identification model according to an embodiment of the present invention includes:

(A) generating, by an apparatus for generating an artificial intelligence-based true/false positive identification model, a plurality of types of features from each of a plurality of attack packets that serve as training data and have been determined to be true positives or false positives;

(B) generating a plurality of clusters by clustering the features of one type among the plurality of types of features;

(C) selecting one feature from each of the plurality of clusters; and

(D) generating, for each type of feature, an artificial intelligence-based data model for identifying whether an attack packet is a true positive or a false positive, based on the selected features of the one type and the features of the remaining types corresponding to each of the selected features.

In the method for generating an artificial intelligence-based true/false positive identification model according to an embodiment of the present invention, the plurality of types of features may include data index information, data content count vector information, and data token information, respectively.

In addition, in step (B) of the method, the features of the one type may include the data content count vector information.

In addition, step (B) of the method may generate the plurality of clusters by clustering the plurality of pieces of data content count vector information based on the similarity between them.
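The clustering and per-cluster selection of steps (B) and (C) can be sketched as follows. This is only an illustration under stated assumptions: the text does not name a similarity measure or a clustering algorithm, so cosine similarity, a greedy single-pass grouping, the `threshold` value, and all function names here are hypothetical choices.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_by_similarity(vectors, threshold=0.9):
    """Greedy grouping (an assumption, not the patent's algorithm):
    a vector joins the first cluster whose first member is at least
    `threshold` similar to it; otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list of indices into `vectors`
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine_similarity(vectors[members[0]], vec) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def select_representatives(clusters):
    """Pick one feature per cluster (here simply the first member)."""
    return [members[0] for members in clusters]


# Toy count vectors standing in for data content count vector information.
vectors = [
    [3, 0, 1, 0],
    [6, 0, 2, 0],  # same direction as the first vector
    [0, 4, 0, 1],
    [0, 5, 0, 1],  # close to the third vector
    [1, 1, 1, 1],
]
clusters = cluster_by_similarity(vectors, threshold=0.9)
representatives = select_representatives(clusters)
```

Keeping only one representative per cluster is what reduces the amount of near-duplicate training data; the index of each representative can then be used to look up the corresponding features of the other types for the same packet.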

In addition, in the method for generating an artificial intelligence-based true/false positive identification model according to an embodiment of the present invention, step (D) may include:

(D-1) generating, for each type of feature, a plurality of artificial intelligence-based data models for identifying whether the attack packet is a true positive or a false positive, based on the selected features of the one type and the features of the remaining types corresponding to each of the selected features; and

(D-2) finally selecting, from among the plurality of artificial intelligence-based data models for each type of feature, the data model with the highest accuracy as the artificial intelligence-based data model for that type of feature.

In addition, in the method, the plurality of artificial intelligence-based data models of step (D-1) may include

data models based on an artificial neural network (ANN), a deep neural network (DNN), a 1D convolutional neural network (CNN1D), a 2D convolutional neural network (CNN2D), a random forest (RF), a support vector machine (SVM), and XGBoost (Extreme Gradient Boosting).
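The selection in step (D-2) amounts to scoring every candidate model on labeled data and keeping the most accurate one. The sketch below illustrates only that selection logic; the candidate "models" are toy prediction functions standing in for trained ANN, DNN, CNN, RF, SVM, or XGBoost models, and the names `accuracy`, `select_best_model`, and the sample data are hypothetical.

```python
def accuracy(predict, samples, labels):
    """Fraction of samples for which the model predicts the given label."""
    correct = sum(1 for x, y in zip(samples, labels) if predict(x) == y)
    return correct / len(labels)


def select_best_model(candidates, samples, labels):
    """Score each candidate and return the name of the most accurate one
    together with all scores."""
    scores = {name: accuracy(fn, samples, labels)
              for name, fn in candidates.items()}
    best_name = max(scores, key=scores.get)
    return best_name, scores


# Toy stand-ins for trained models (assumption: 1 = true positive,
# 0 = false positive; inputs are small binary vectors).
candidates = {
    "always_true_positive": lambda x: 1,
    "sum_threshold": lambda x: 1 if sum(x) >= 2 else 0,
    "first_bit": lambda x: x[0],
}
samples = [[1, 1, 0], [0, 0, 1], [1, 0, 1], [0, 0, 0], [0, 1, 1]]
labels = [1, 0, 1, 0, 1]

best_name, scores = select_best_model(candidates, samples, labels)
```

In practice the scoring data would be held out from training, and this selection would be repeated once per feature type so that each type ends up with its own best model.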

In addition, in the method for generating an artificial intelligence-based true/false positive identification model according to an embodiment of the present invention, the data index information includes index data generated by converting the hexadecimal values of the decoded packet data of the attack packet into the integer values of their ASCII codes,

the data content count vector information includes a vector of a predetermined size generated by applying a chunking algorithm to the decoded data to form one block per predetermined number of characters, generating a hash value for each block, and then cumulatively counting duplicate hash values, and

the data token information may include token values generated by URL-decoding the decoded data to form a sentence, removing duplicate characters from the sentence, replacing whitespace and special characters in the sentence with an identifiable character of a predetermined number of bytes, and then splitting the sentence.
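The three feature types described above can be sketched as follows. Several details are assumptions made for illustration and are not specified in the text: SHA-256 as the block hash, the block and vector sizes, reading "removing duplicate characters" as collapsing runs of a repeated character, and using `_` as the identifiable replacement character. All function names are hypothetical.

```python
import hashlib
import re
from urllib.parse import unquote


def data_index(hex_payload):
    """Hex string -> list of integer byte values (ASCII codes for text)."""
    return list(bytes.fromhex(hex_payload))


def content_count_vector(text, block_size=4, vector_size=16):
    """Chunk the text into fixed-size blocks, hash each block, and count
    how often each hash bucket occurs in a fixed-size vector."""
    counts = [0] * vector_size
    for start in range(0, len(text), block_size):
        block = text[start:start + block_size]
        digest = hashlib.sha256(block.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % vector_size
        counts[bucket] += 1  # identical blocks accumulate in one bucket
    return counts


def data_tokens(text, placeholder="_"):
    """URL-decode, collapse repeated characters, replace whitespace and
    special characters with a placeholder, then split into tokens."""
    decoded = unquote(text)
    collapsed = re.sub(r"(.)\1+", r"\1", decoded)
    marked = re.sub(r"[^0-9A-Za-z]", placeholder, collapsed)
    return [token for token in marked.split(placeholder) if token]


index_data = data_index("474554202f3f69643d3127")  # hex of "GET /?id=1'"
# → [71, 69, 84, 32, 47, 63, 105, 100, 61, 49, 39]

count_vector = content_count_vector("AAAABBBBAAAA")  # blocks: AAAA, BBBB, AAAA

tokens = data_tokens("GET /a?id=1%27%20%20OR%201=1")
# → ['GET', 'a', 'id', '1', 'OR', '1', '1']
```

Because the count vector has a fixed size regardless of payload length, it is a convenient input for the similarity-based clustering step, while the index data and tokens keep byte-level and word-level views of the same packet.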

In addition, in the method for generating an artificial intelligence-based true/false positive identification model according to an embodiment of the present invention, each of the selected features of the one type, and each of the features of the remaining types corresponding to the selected features, may be divided into 3-gram, 4-gram, and 5-gram units and hashed, and the binarized value produced by concatenating the hashed 3-gram, 4-gram, and 5-gram data may be used as the input vector for generating the artificial intelligence-based data model for each type of feature.
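One way to read the n-gram embedding above is as feature hashing: each n-gram is hashed into a fixed number of bit positions, and the 3-gram, 4-gram, and 5-gram bit vectors are concatenated. The sketch below follows that reading; character-level n-grams, SHA-256, the 32 bits per n-gram order, and the function names are all assumptions for illustration.

```python
import hashlib


def ngrams(text, n):
    """All contiguous character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def hashed_binary_vector(text, n, bits=32):
    """Set one bit per n-gram, chosen by hashing the n-gram."""
    vector = [0] * bits
    for gram in ngrams(text, n):
        digest = hashlib.sha256(gram.encode("utf-8")).digest()
        vector[int.from_bytes(digest[:4], "big") % bits] = 1
    return vector


def input_vector(text, orders=(3, 4, 5), bits=32):
    """Concatenate the binary vectors for each n-gram order."""
    vector = []
    for n in orders:
        vector.extend(hashed_binary_vector(text, n, bits))
    return vector


vec = input_vector("select * from users")  # 3 orders x 32 bits = 96 entries
```

The resulting vector has the same length for every input, which is what allows features of different lengths to be fed to the same model; inputs shorter than an n-gram order simply leave that segment all zeros.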

To achieve the above object, an apparatus for generating an artificial intelligence-based true/false positive identification model includes:

a processed data generation unit that generates a plurality of types of features from each of a plurality of attack packets that serve as training data and have been determined to be true positives or false positives;

a data cluster generation unit that generates a plurality of clusters by clustering the features of one type among the plurality of types of features;

a data selection unit that selects one feature from each of the plurality of clusters; and

a data model generation unit that generates, for each type of feature, an artificial intelligence-based data model for identifying whether an attack packet is a true positive or a false positive, based on the selected features of the one type and the features of the remaining types corresponding to each of the selected features.

To achieve the above object, an artificial intelligence-based true/false positive identification method according to an embodiment of the present invention includes:

(A) generating, by an artificial intelligence-based true/false positive identification apparatus, a plurality of artificial intelligence-based data models based on a plurality of attack packets that serve as training data and have been determined to be true positives or false positives; and

(B) identifying, based on the plurality of artificial intelligence-based data models, whether each of a plurality of attack packets that serve as test data, and that may be true positives or false positives, is a true positive or a false positive,

wherein step (A) includes:

(A-1) generating a plurality of types of features from each of the plurality of attack packets that serve as the training data;

(A-2) generating a plurality of clusters by clustering the features of one type among the plurality of types of features;

(A-3) selecting one feature from each of the plurality of clusters; and

(A-4) generating, for each type of feature, an artificial intelligence-based data model for identifying whether an attack packet is a true positive or a false positive, based on the selected features of the one type and the features of the remaining types corresponding to each of the selected features.

In the artificial intelligence-based true/false positive identification method according to an embodiment of the present invention, step (B) may include:

(B-1) generating a plurality of types of features from each of the plurality of attack packets that serve as the test data and may be true positives or false positives;

(B-2) generating a plurality of clusters by clustering the features of one type among the plurality of types of features;

(B-3) selecting one feature from each of the plurality of clusters; and

(B-4) finally determining, using the artificial intelligence-based data model for each type of feature and based on the selected features of the one type and the features of the remaining types corresponding to each of the selected features, whether the attack packet corresponding to the one feature selected from each of the plurality of clusters is a true positive or a false positive.

In addition, in the artificial intelligence-based true/false positive identification method according to an embodiment of the present invention, the plurality of types of features may include data index information, data content count vector information, and data token information, respectively.

In addition, in steps (A-2) and (B-2) of the method, the features of the one type may include the data content count vector information.

In addition, each of steps (A-2) and (B-2) of the method may generate the plurality of clusters by clustering the plurality of pieces of data content count vector information based on the similarity between them.

In addition, in the artificial intelligence-based true/false positive identification method according to an embodiment of the present invention, step (A-4) may include:

(A-4-1) generating, for each type of feature, a plurality of artificial intelligence-based data models for identifying whether the attack packet is a true positive or a false positive, based on the selected features of the one type and the features of the remaining types corresponding to each of the selected features; and

(A-4-2) finally selecting, from among the plurality of artificial intelligence-based data models for each type of feature, the data model with the highest accuracy as the artificial intelligence-based data model for that type of feature.

In addition, in the artificial intelligence-based true/false positive identification method according to an embodiment of the present invention, the plurality of artificial intelligence-based data models of step (A-4-1),

In addition, in the artificial intelligence-based true/false positive identification method according to an embodiment of the present invention, the data index information includes index data generated by converting the hexadecimal values of the decoded packet data of the attack packet into the integer values of their ASCII codes, and

In addition, in the artificial intelligence-based true/false positive identification method according to an embodiment of the present invention, each of the selected features of the one type, and each of the features of the remaining types corresponding to the selected features, may be divided into 3-gram, 4-gram, and 5-gram units and hashed, and the binarized value produced by concatenating the hashed 3-gram, 4-gram, and 5-gram data may be used as the input vector of the artificial intelligence-based data model for each type of feature.

In addition, in the artificial intelligence-based true/false positive identification method according to an embodiment of the present invention, step (B-4) may include:

identifying, using the artificial intelligence-based data model for each type of feature and based on the selected features of the one type and the features of the remaining types corresponding to each of the selected features, whether the attack packet corresponding to the one feature selected from each of the plurality of clusters in step (B-3) is a true positive or a false positive; and

aggregating the identification results of the artificial intelligence-based data models for the respective types of features and finally determining, by majority vote, whether the attack packet corresponding to the one feature selected from each of the plurality of clusters is a true positive or a false positive.
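The final majority-vote decision can be sketched as below. The label strings and the dictionary keys naming the three feature types are hypothetical; with three models and two possible labels a tie cannot occur, which is the case this sketch assumes.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by most per-feature-type models.

    `predictions` maps a feature-type name to that model's label.
    """
    counts = Counter(predictions.values())
    label, _ = counts.most_common(1)[0]
    return label


# One prediction per feature-type model for a single attack packet.
predictions = {
    "data_index": "true_positive",
    "content_count_vector": "false_positive",
    "data_token": "true_positive",
}
final_label = majority_vote(predictions)  # → "true_positive"
```

Since the packet being classified is the representative of its cluster, the same final label could then be applied to the other packets in that cluster, which is one plausible way the clustering step reduces identification time.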

To achieve the above object, an artificial intelligence-based true/false positive identification apparatus includes:

a true/false positive identification model generation unit that generates a plurality of artificial intelligence-based data models based on a plurality of attack packets that serve as training data and have been determined to be true positives or false positives; and

a true/false positive identification unit that identifies, based on the plurality of artificial intelligence-based data models, whether each of a plurality of attack packets that serve as test data, and that may be true positives or false positives, is a true positive or a false positive,

wherein the true/false positive identification model generation unit may include:

a processed data generation unit that generates a plurality of types of features from each of the plurality of attack packets that serve as the training data;

a data model generation unit that generates, for each type of feature, an artificial intelligence-based data model for identifying whether an attack packet is a true positive or a false positive, based on the selected features of the one type and the features of the remaining types corresponding to each of the selected features.

According to the method and apparatus for generating an artificial intelligence-based true/false positive identification model, and the method and apparatus for artificial intelligence-based true/false positive identification, according to an embodiment of the present invention, the characteristics of attack packets in a network traffic environment are identified and the attack packets are analyzed automatically through artificial intelligence (AI)/machine learning (ML) rather than by a person, so that true and false positives among attack packets can be identified and attacks can be detected.

In addition, according to these methods and apparatuses, variant attacks can be detected by identifying the characteristics of attack packets in a network traffic environment.

In addition, according to these methods and apparatuses, the time required to generate an optimal learning model can be shortened.

In addition, according to these methods and apparatuses, the time required to identify whether a large volume of attack packets are true or false positives can be shortened.

FIG. 1 is a diagram showing standard field items for normalizing the log entries of event logs generated by network security equipment.
FIG. 2 is a diagram showing normalized log sample data.
FIG. 3 is an overall flowchart of an artificial intelligence-based true/false positive identification method according to an embodiment of the present invention.
FIG. 4 is a flowchart showing the data learning stage for generating an artificial intelligence-based true/false positive identification model.
FIG. 5 is a flowchart showing the data model application stage for artificial intelligence-based true/false positive identification.
FIG. 6 is a diagram showing the detailed processing of the data embedding process.
FIG. 7 is a schematic diagram of the data modeling stage for generating data models.
FIG. 8 is a schematic diagram of the data model application stage for true/false positive identification.
FIG. 9 is a diagram showing an artificial intelligence-based true/false positive identification apparatus according to an embodiment of the present invention.
FIG. 10 is a diagram showing an artificial intelligence-based true/false positive identification apparatus according to another embodiment of the present invention.

본 발명의 목적, 특정한 장점들 및 신규한 특징들은 첨부된 도면들과 연관되어지는 이하의 상세한 설명과 바람직한 실시예들로부터 더욱 명백해질 것이다.Objects, specific advantages and novel features of the present invention will become more apparent from the following detailed description and preferred embodiments taken in conjunction with the accompanying drawings.

이에 앞서 본 명세서 및 청구범위에 사용된 용어나 단어는 통상적이고 사전적인 의미로 해석되어서는 아니되며, 발명자가 그 자신의 발명을 가장 최선의 방법으로 설명하기 위해 용어의 개념을 적절하게 정의할 수 있는 원칙에 입각하여 본 발명의 기술적 사상에 부합되는 의미와 개념으로 해석되어야 한다.Prior to this, the terms or words used in this specification and claims should not be interpreted in a conventional and dictionary sense, and the inventor may appropriately define the concept of the term in order to explain his or her invention in the best way. It should be interpreted as a meaning and concept consistent with the technical spirit of the present invention based on the principles.

본 명세서에서 각 도면의 구성요소들에 참조번호를 부가함에 있어서, 동일한 구성 요소들에 한해서는 비록 다른 도면상에 표시되더라도 가능한 한 동일한 번호를 가지도록 하고 있음에 유의하여야 한다.In adding reference numerals to components of each drawing in this specification, it should be noted that the same components have the same numbers as much as possible, even if they are displayed on different drawings.

또한, "제1", "제2", "일면", "타면" 등의 용어는, 하나의 구성요소를 다른 구성요소로부터 구별하기 위해 사용되는 것으로, 구성요소가 상기 용어들에 의해 제한되는 것은 아니다.In addition, terms such as “first”, “second”, “one side”, and “other side” are used to distinguish one component from another component, and the components are limited by the terms. It is not.

이하, 본 발명을 설명함에 있어, 본 발명의 요지를 불필요하게 흐릴 수 있는 관련된 공지 기술에 대한 상세한 설명은 생략한다.Hereinafter, in describing the present invention, detailed descriptions of related known technologies that may unnecessarily obscure the subject matter of the present invention will be omitted.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

The artificial intelligence-based true/false positive identification method and apparatus according to an embodiment of the present invention extract features from the data of true-positive attack packets and false-positive attack packets on the basis of data index information, data content count vector information, and data token information, and generate a data learning model by grouping, vectorizing, and learning these features. The generated data learning model classifies attack packets, which may be true positives or false positives, as true positives or false positives, and attacks are detected on that basis.

In addition, the artificial intelligence-based true/false positive identification method and apparatus according to an embodiment of the present invention can generate multiple vectors by extracting features from the data of attack packets in a network traffic environment on the basis of data index information, data content count vector information, and data token information.

The present invention clusters the multi-vector information extracted from attack packets in a network traffic environment to generate and classify a plurality of clusters, and selects one feature from each cluster so that duplicate features are removed. In this way, a unique feature vector can be generated for each cluster.

The present invention enables artificial intelligence (AI) learning using the unique feature vector of each cluster in a network traffic environment, and the learned data model can be used to identify true positives and false positives among attack packets.

FIG. 3 is an overall flowchart of an artificial intelligence-based true/false positive identification method according to an embodiment of the present invention.

Referring to FIG. 3, the artificial intelligence-based true/false positive identification method according to an embodiment of the present invention includes a step (step S300) of generating a plurality of artificial intelligence-based data models 310 based on a plurality of attack packets that serve as learning data and have been determined to be true positives or false positives, and a step (step S302) of identifying, based on the plurality of artificial intelligence-based data models 310, whether each of a plurality of attack packets that serve as test data and may be true positives or false positives is a true positive or a false positive.

Reference numeral S304 denotes the data modeling step, and reference numeral S306 denotes the data model application step. These steps are described in detail later.

FIG. 9 is a diagram illustrating an artificial intelligence-based true/false positive identification apparatus according to an embodiment of the present invention, and FIG. 10 is a diagram illustrating an artificial intelligence-based true/false positive identification apparatus according to another embodiment of the present invention.

Referring first to FIG. 9, the artificial intelligence-based true/false positive identification apparatus 900 according to an embodiment of the present invention includes an artificial intelligence-based true/false positive identification model generation unit 903 and an artificial intelligence-based true/false positive identification unit 905.

Referring to FIG. 10, the artificial intelligence-based true/false positive identification apparatus 1000 according to another embodiment of the present invention includes a processor 1002 for controlling the operation of the apparatus 1000, a memory 1004 connected to the processor 1002 through a memory control unit 1006, and an interface unit 1008.

The processor 1002 may execute various software programs and sets of instructions stored in the memory 1004 to perform various functions and process data.

The memory 1004 may include high-speed random access memory, one or more magnetic disk storage devices, non-volatile memory such as flash memory devices, and the like. The memory 1004 may further include a storage device located remotely from the processor 1002, or a network-attached storage device accessed through a communication network such as the Internet.

The memory 1004 includes one or more modules configured to be executed by the processor 1002. The one or more modules include an operating system 1010 containing instructions for controlling the overall operation of the artificial intelligence-based true/false positive identification apparatus 1000, an artificial intelligence-based true/false positive identification model generation module 1012, and an artificial intelligence-based true/false positive identification module 1014.

The interface unit 1008 may collect learning data and test data through a communication network (not shown) and output the final true/false positive result, and may connect input/output peripheral devices to the processor 1002 or the memory 1004. The memory control unit 1006 may control memory access when the processor 1002 or the interface unit 1008 accesses the memory 1004. Depending on the embodiment, the processor 1002, the memory 1004, the memory control unit 1006, and the interface unit 1008 may be implemented on a single chip or as separate chips.

The artificial intelligence-based true/false positive identification method according to an embodiment of the present invention shown in FIG. 3 may be performed by the artificial intelligence-based true/false positive identification apparatus 900 according to an embodiment of the present invention shown in FIG. 9, or by the artificial intelligence-based true/false positive identification apparatus 1000 according to another embodiment of the present invention shown in FIG. 10.

In addition, the data modeling step (step S304) shown in FIG. 3 may be performed by the artificial intelligence-based true/false positive identification model generation unit 903, or by the processor 1002 shown in FIG. 10 executing the artificial intelligence-based true/false positive identification model generation module 1012 stored in the memory 1004.

Likewise, the data model application step (step S306) shown in FIG. 3 may be performed by the artificial intelligence-based true/false positive identification unit 905, or by the processor 1002 shown in FIG. 10 executing the artificial intelligence-based true/false positive identification module 1014 stored in the memory 1004.

The data modeling step (step S304) shown in FIG. 3 is now described in detail.

FIG. 4 is a detailed flowchart of the data modeling step (step S304) shown in FIG. 3. It illustrates the data learning step for generating the artificial intelligence-based true/false positive identification model in the artificial intelligence-based true/false positive identification method according to an embodiment of the present invention.

In step S400, the first data normalization unit 901 collects, as true-positive and false-positive learning data, event logs related to attack packets determined to be true positives and attack packets determined to be false positives, and normalizes the log data of the collected event logs.

Event logs generated by network security equipment (IPS, IDS, DDX, FW, server, system, etc.) are normalized according to the standard field items for log item normalization shown in FIG. 1, which removes duplicate fields and yields an optimized set of feature candidates. FIG. 2 shows sample normalized log data generated by normalizing event logs.

In FIG. 2, reference numeral 200 denotes the packet data of an attack packet.

In step S402, the first data preprocessing unit 902 preprocesses the normalized log data.

The first data preprocessing unit 902 extracts the packet data 200 (B64) from the normalized log data shown in FIG. 2.

The first data preprocessing unit 902 classifies the packet data 200 as base64-encoded data, binary data, or string data, and decodes the data according to its type to generate decoded data.
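
The classify-then-decode step can be sketched as follows. This is only an illustration under stated assumptions: the description does not say how the three data types are told apart, so the function name `decode_packet_data` and both detection rules (an even-length run of hex digits is treated as binary data; text that passes strict base64 validation is treated as base64) are hypothetical.

```python
import base64
import binascii
import string


def decode_packet_data(raw: str) -> bytes:
    """Classify a payload field as binary (hex), base64, or plain string, then decode it."""
    text = raw.strip()
    # Hypothetical rule 1: an even-length run of hex digits is treated as binary data.
    if text and len(text) % 2 == 0 and all(c in string.hexdigits for c in text):
        return bytes.fromhex(text)
    # Hypothetical rule 2: text that passes strict base64 validation is treated as base64.
    try:
        return base64.b64decode(text, validate=True)
    except (binascii.Error, ValueError):
        pass
    # Anything else is kept as a plain string.
    return text.encode("utf-8")
```

Short alphanumeric strings can satisfy more than one rule, so a real implementation would probably rely on a type indicator in the normalized log rather than on content sniffing alone.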

In step S404, the first processed data generation unit 904 generates first to third types of processed data based on the decoded data.

The first processed data generation module 904_1 converts the hexadecimal values of the decoded data into the integer values of ASCII codes to generate data index information, which is the first type of processed data. Because data index information is obtained by converting the hexadecimal values of the decoded data into integer ASCII code values, it retains the characteristics of the original data.
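
As a minimal sketch of this conversion, assuming the decoded data is available as a string of two-digit hexadecimal byte values (the function name `hex_to_index_data` is hypothetical):

```python
def hex_to_index_data(hex_string: str) -> list[int]:
    """Convert a string of hexadecimal byte pairs into integer code values (0-255)."""
    cleaned = hex_string.strip()
    if len(cleaned) % 2 != 0:
        raise ValueError("expected an even number of hexadecimal digits")
    # Each pair of hex digits is one byte; its integer value is one index entry.
    return [int(cleaned[i:i + 2], 16) for i in range(0, len(cleaned), 2)]


print(hex_to_index_data("474554202f"))  # bytes of "GET /" -> [71, 69, 84, 32, 47]
```

The mapping is one-to-one with the original bytes, which is why this feature type preserves the characteristics of the original data.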

The second processed data generation module 904_2 applies a chunking algorithm to the decoded data with the window size set to 4, forming blocks of four characters each, generates a hash value for each block, and cumulatively counts repeated hash values to generate data content count vector information with a size of 512 bytes, which is the second type of processed data. Data content count vector information represents the original data transformed in 4-byte units and is intended for detecting modified attack packets, that is, variant attacks.
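
One possible reading of this step is sketched below. Several details are assumptions rather than anything the description specifies: the window slides one byte at a time (`stride=1`), CRC32 stands in for the unspecified hash function, and each of the 512 slots is a one-byte counter that saturates at 255. The function name `content_count_vector` is hypothetical.

```python
import zlib


def content_count_vector(decoded: bytes, window: int = 4,
                         size: int = 512, stride: int = 1) -> list[int]:
    """Count hashed 4-byte blocks into a fixed-size vector of one-byte counters."""
    vector = [0] * size
    for start in range(0, len(decoded) - window + 1, stride):
        block = decoded[start:start + window]
        # Assumption: CRC32 modulo the vector size picks the counter slot.
        slot = zlib.crc32(block) % size
        # Assumption: counters saturate at 255 so each fits in one byte.
        vector[slot] = min(vector[slot] + 1, 255)
    return vector
```

Because the vector counts blocks rather than recording their positions, two payloads that differ by a small insertion or substitution still share most of their counts, which fits the stated goal of catching variant attacks.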

The third processed data generation module 904_3 URL-decodes the decoded data to form a sentence, removes duplicate characters from the sentence, replaces spaces and special characters in the sentence with identifiable strings of at most 5 bytes, and then splits the sentence to generate data token information, which is the third type of processed data. Data token information carries the characteristics of the attack packet's packet data when that data is expressed as a sentence.
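
The tokenization can be illustrated with the sketch below. The replacement table `SPECIAL_TOKENS`, the fallback token `SYM`, and the function name `data_tokens` are all hypothetical; the description only requires that each replacement be at most 5 bytes. The sketch also assumes that "removing duplicate characters" means collapsing runs of the same repeated space or special character, which is one interpretation among several.

```python
import re
from urllib.parse import unquote

# Hypothetical replacement table; every token is at most 5 bytes long.
SPECIAL_TOKENS = {" ": "SP", "/": "SLASH", "=": "EQ", "'": "QUOTE",
                  "<": "LT", ">": "GT", "?": "QM", "&": "AMP"}


def data_tokens(decoded: bytes) -> list[str]:
    """URL-decode a payload, name its special characters, and split it into tokens."""
    sentence = unquote(decoded.decode("latin-1"))
    # Assumption: collapse runs of one repeated non-alphanumeric character.
    sentence = re.sub(r"([^0-9A-Za-z])\1+", r"\1", sentence)
    parts = []
    for ch in sentence:
        if ch.isalnum():
            parts.append(ch)
        else:
            # Surround each replacement with spaces so it becomes its own token.
            parts.append(f" {SPECIAL_TOKENS.get(ch, 'SYM')} ")
    return "".join(parts).split()


print(data_tokens(b"GET%20/a%3D1"))  # ['GET', 'SP', 'SLASH', 'a', 'EQ', '1']
```

Naming the special characters keeps them visible as features instead of discarding them as delimiters, which matters for payloads where quotes, angle brackets, or equals signs carry the attack.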

In step S406, the first data cluster generation unit 906 clusters a plurality of items of data content count vector information based on the similarity between them (for example, cosine similarity) to generate a plurality of data clusters.

For example, items of data content count vector information whose similarity is equal to or greater than a predetermined value are placed in the same cluster, so that the plurality of items of data content count vector information are grouped into a plurality of clusters.

In step S408, the first data selection unit 908 selects one item of data content count vector information from each of the plurality of clusters, and outputs the selected data content count vector information together with the data index information and the data token information that correspond to it.
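
Steps S406 and S408 can be sketched together as a greedy threshold clustering. This is an assumption-laden illustration: the description names cosine similarity only as an example and does not specify the clustering algorithm, the threshold value, or which member of a cluster is selected. Here each vector joins the first cluster whose first member is similar enough, and that first member is taken as the representative. The names `cosine_similarity`, `cluster_and_select`, and the default `threshold=0.9` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_and_select(vectors: list[list[float]], threshold: float = 0.9):
    """Group vectors by similarity to each cluster's first member; return clusters and representatives."""
    clusters: list[list[int]] = []  # each cluster is a list of indices into `vectors`
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine_similarity(vectors[cluster[0]], vec) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no similar cluster found, so start a new one
    representatives = [cluster[0] for cluster in clusters]
    return clusters, representatives
```

The returned representative indices identify the packets whose data index information and data token information are carried forward alongside the selected count vectors.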

According to an embodiment of the present invention, the data model is not generated using all of the large volume of attack packets determined to be true positives and the large volume of attack packets determined to be false positives as learning data. Instead, attack packets whose features of one type are highly similar are grouped into a single cluster, only one attack packet is selected from each cluster, and the features of the selected attack packet are used as learning data. This greatly reduces the time required to generate an optimal data learning model.

In step S410, as shown in FIG. 6, the first processed data embedding processing unit 910 processes each of the following: the data index information corresponding to the selected data content count vector information, the selected data content count vector information itself, and the data token information corresponding to the selected data content count vector information. For each, it divides the data into 3-gram, 4-gram, and 5-gram units, hashes them, and concatenates the hashed 3-gram, 4-gram, and 5-gram data to generate a binarized input vector value totaling 12288 bytes (4096×3).

The first processed data embedding processing module 910_1 divides the data index information corresponding to the selected data content count vector information into 3-gram, 4-gram, and 5-gram units, hashes them, and concatenates the hashed 3-gram, 4-gram, and 5-gram data to generate a binarized input vector value totaling 12288 bytes (4096×3).

The second processed data embedding processing module 910_2 divides the selected data content count vector information into 3-gram, 4-gram, and 5-gram units, hashes them, and concatenates the hashed 3-gram, 4-gram, and 5-gram data to generate a binarized input vector value totaling 12288 bytes (4096×3).

The third processed data embedding processing module 910_3 divides the data token information corresponding to the selected data content count vector information into 3-gram, 4-gram, and 5-gram units, hashes them, and concatenates the hashed 3-gram, 4-gram, and 5-gram data to generate a binarized input vector value totaling 12288 bytes (4096×3).
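
The n-gram hashing applied to each of the three feature types can be sketched as follows. It assumes that each n-gram size gets its own 4096-slot binary segment and that CRC32 stands in for the unspecified hash function; the function name `ngram_hash_embedding` and the separator byte are hypothetical.

```python
import zlib


def ngram_hash_embedding(items: list, ns: tuple = (3, 4, 5), slots: int = 4096) -> list[int]:
    """Hash 3-, 4-, and 5-grams into one binary segment each and concatenate the segments."""
    vector: list[int] = []
    for n in ns:
        segment = [0] * slots
        for start in range(len(items) - n + 1):
            gram = items[start:start + n]
            # Join with a separator so that [1, 23] and [12, 3] produce different keys.
            key = "\x1f".join(str(x) for x in gram).encode("utf-8")
            # Assumption: CRC32 modulo the segment size picks the bit to set.
            segment[zlib.crc32(key) % slots] = 1
        vector.extend(segment)
    return vector  # length is 4096 * 3 = 12288 with the default arguments
```

The same function accepts a list of integers (data index information or a count vector) or a list of string tokens (data token information), so all three feature types end up as inputs of the same fixed length.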

In step S412, as shown in FIG. 7, the data model generation unit 912 generates a plurality of data models by using the input vector values of the data index information, the data content count vector information, and the data token information as inputs to artificial intelligence and machine learning algorithms.

The first data model generation module 912_1 generates a plurality of artificial intelligence-based data models based on the input vector value of the data index information.

The second data model generation module 912_2 generates a plurality of artificial intelligence-based data models based on the input vector value of the selected data content count vector information.

The third data model generation module 912_3 generates a plurality of artificial intelligence-based data models based on the input vector value of the data token information.

The plurality of artificial intelligence-based data models may include data models based on an artificial neural network (ANN), a deep neural network (DNN), a 1D convolutional neural network (CNN1D), a 2D convolutional neural network (CNN2D), a random forest (RF), a support vector machine (SVM), and XGBoost (Extreme Gradient Boosting).

In step S414, the data model selection unit 914 selects, for each of the data index information, the data content count vector information, and the data token information, the data model with the highest accuracy among the plurality of data models generated for it.

The first data model selection module 914_1 selects the data model with the highest accuracy among the plurality of data models for the data index information.

The second data model selection module 914_2 selects the data model with the highest accuracy among the plurality of data models for the data content count vector information.

The third data model selection module 914_3 selects the data model with the highest accuracy among the plurality of data models for the data token information.
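
The per-feature-type selection can be sketched as below. To keep the example self-contained, trained models are represented as plain callables that map an input vector to a label; in practice they would be the ANN, DNN, CNN, RF, SVM, or XGBoost models listed above. The description does not say which data the accuracy is measured on, so evaluating on a labeled validation set is an assumption, and the names `accuracy` and `select_best_model` are hypothetical.

```python
from typing import Callable, Sequence

# A trained model is represented here as any callable from input vector to label.
Model = Callable[[Sequence[int]], int]


def accuracy(model: Model, samples: Sequence[Sequence[int]], labels: Sequence[int]) -> float:
    """Fraction of samples for which the model predicts the given label."""
    if not labels:
        return 0.0
    correct = sum(1 for x, y in zip(samples, labels) if model(x) == y)
    return correct / len(labels)


def select_best_model(candidates: dict[str, Model],
                      samples: Sequence[Sequence[int]],
                      labels: Sequence[int]) -> str:
    """Return the name of the candidate with the highest accuracy."""
    return max(candidates, key=lambda name: accuracy(candidates[name], samples, labels))
```

Running this once per feature type yields the three selected models that the data model application step uses.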

The data model application step (step S306) shown in FIG. 3 is now described in detail.

FIG. 5 is a detailed flowchart of the data model application step (step S306) shown in FIG. 3. It illustrates the data model application step of the artificial intelligence-based true/false positive identification method according to an embodiment of the present invention, in which the artificial intelligence-based true/false positive identification model is used to identify whether each of a plurality of attack packets serving as test data, which may be true positives or false positives, is a true positive or a false positive.

In the data model application step for artificial intelligence-based true/false positive identification shown in FIG. 5, the data model selected for the data index information, the data model selected for the data content count vector information, and the data model selected for the data token information are applied to make the final determination of whether an attack packet serving as test data, which may be a true positive or a false positive, is a true positive or a false positive.

In step S500, the second data normalization unit 916 collects, as true/false positive test data, event logs related to attack packets that may be true positives or false positives, and normalizes the log data of the collected event logs.

In step S502, the second data preprocessing unit 918 preprocesses the normalized log data.

The second data preprocessing unit 918 extracts the packet data 200 (B64) from the normalized log data shown in FIG. 2.

The second data preprocessing unit 918 classifies the packet data 200 as base64-encoded data, binary data, or string data, and decodes the data according to its type to generate decoded data.

In step S504, the second processed data generation unit 920 generates first to third types of processed data based on the decoded data.

The fourth processed data generation module 920_1 converts the hexadecimal values of the decoded data into the integer values of ASCII codes to generate data index information, which is the first type of processed data. Because data index information is obtained by converting the hexadecimal values of the decoded data into integer ASCII code values, it retains the characteristics of the original data.

The fifth processed data generation module 920_2 applies a chunking algorithm to the decoded data with the window size set to 4, forming blocks of four characters each, generates a hash value for each block, and cumulatively counts repeated hash values to generate data content count vector information with a size of 512 bytes, which is the second type of processed data. Data content count vector information represents the original data transformed in 4-byte units and is intended for detecting modified attack packets, that is, variant attacks.

The sixth processed data generation module 920_3 URL-decodes the decoded data to form a sentence, removes duplicate characters from the sentence, replaces spaces and special characters in the sentence with identifiable strings of at most 5 bytes, and then splits the sentence to generate data token information, which is the third type of processed data. Data token information carries the characteristics of the attack packet's packet data when that data is expressed as a sentence.

In step S506, the second data cluster generation unit 922 clusters a plurality of items of data content count vector information based on the similarity between them (for example, cosine similarity) to generate a plurality of data clusters.

In step S508, the second data selection unit 924 selects one item of data content count vector information from each of the plurality of clusters, and outputs the selected data content count vector information together with the data index information and the data token information that correspond to it.

According to an embodiment of the present invention, true/false positives are not identified using all of the large volume of attack packets that may be true positives or false positives as test data. Instead, attack packets whose features of one type are highly similar are grouped into a single cluster, only one attack packet is selected from each cluster, and the features of the selected attack packet are used as test data. This greatly reduces the time required to identify whether the large volume of attack packets serving as test data are true positives or false positives.

In step S510, as shown in FIG. 6, the second processed data embedding processing unit 926 processes each of the following: the data index information corresponding to the selected data content count vector information, the selected data content count vector information itself, and the data token information corresponding to the selected data content count vector information. For each, it divides the data into 3-gram, 4-gram, and 5-gram units, hashes them, and concatenates the hashed 3-gram, 4-gram, and 5-gram data to generate a binarized input vector value totaling 12288 bytes (4096×3).

The fourth processed data embedding processing module 926_1 divides the data index information corresponding to the selected data content count vector information into 3-gram, 4-gram, and 5-gram units, hashes them, and concatenates the hashed 3-gram, 4-gram, and 5-gram data to generate a binarized input vector value totaling 12288 bytes (4096×3).

The fifth processed data embedding processing module 926_2 divides the selected data content count vector information into 3-gram, 4-gram, and 5-gram units, hashes them, and concatenates the hashed 3-gram, 4-gram, and 5-gram data to generate a binarized input vector value totaling 12288 bytes (4096×3).

The sixth processed data embedding processing module 926_3 divides the data token information corresponding to the selected data content count vector information into 3-gram, 4-gram, and 5-gram units, hashes them, and concatenates the hashed 3-gram, 4-gram, and 5-gram data to generate a binarized input vector value totaling 12288 bytes (4096×3).

In step S512, as shown in FIG. 8, the true/false positive identification unit 928 identifies whether an attack packet is a true positive or a false positive based on the respective data models for the data index information, the data content count vector information, and the data token information (steps S800, S802, and S804 in FIG. 8).

The first true/false positive identification module 928_1 takes the input vector value of the data index information as input and identifies whether the attack packet is a true positive or a false positive based on the data model for the data index information.

The second true/false positive identification module 928_2 takes the input vector value of the data content count vector information as input and identifies whether the attack packet is a true positive or a false positive based on the data model for the data content count vector information.

The third true/false positive identification module 928_3 takes the input vector value of the data token information as input and identifies whether the attack packet is a true positive or a false positive based on the data model for the data token information.

The true/false positive determination unit 930 aggregates the identification results of the first true/false positive identification module 928_1, the second true/false positive identification module 928_2, and the third true/false positive identification module 928_3, determines by majority vote whether the attack packet is a true positive or a false positive, and outputs the final true/false positive result (step S806 in FIG. 8).
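By way of illustration only, the majority vote over the three identification results may be sketched as follows. The sketch assumes that each module reports one of two string labels; the labels `"true_positive"` and `"false_positive"` and the function name `decide_by_majority` are hypothetical and do not appear in the description.

```python
from collections import Counter


def decide_by_majority(results):
    """Return the label reported by most of the per-feature identification results."""
    # Assumption: two possible labels and an odd number of results, so no tie can occur.
    if len(results) % 2 == 0:
        raise ValueError("an odd number of results is needed to avoid ties")
    label, _count = Counter(results).most_common(1)[0]
    return label


# Hypothetical results of modules 928_1, 928_2, and 928_3 for one attack packet.
final_result = decide_by_majority(["true_positive", "false_positive", "true_positive"])
```

With three modules and two possible labels, at least two modules always agree, so the vote always yields a decision.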

Although the present invention has been described in detail through specific embodiments, these serve only to describe the invention concretely. The present invention is not limited to them, and it is clear that those of ordinary skill in the art may modify or improve it within the technical spirit of the present invention.

All simple modifications and changes of the present invention fall within its scope, and the specific scope of protection will be made clear by the appended claims.

200: packet data of an attack packet
900: artificial intelligence-based true/false positive identification apparatus according to an embodiment of the present invention
901: first data normalization unit
902: first data preprocessing unit
903: artificial intelligence-based true/false positive identification model generation unit
904: first processed data generation unit
904_1 to 904_3: first to third processed data generation modules
905: artificial intelligence-based true/false positive identification unit
906: first data cluster generation unit
908: first data selection unit
910: first processed data embedding processing unit
910_1 to 910_3: first to third processed data embedding processing modules
912: data model generation unit
912_1 to 912_3: first to third data model generation modules
914: data model selection unit
914_1 to 914_3: first to third data model selection modules
916: second data normalization unit
918: second data preprocessing unit
920: second processed data generation unit
920_1 to 920_3: fourth to sixth processed data generation modules
922: second data cluster generation unit
924: second data selection unit
926: second processed data embedding processing unit
926_1 to 926_3: fourth to sixth processed data embedding processing modules
928: true/false positive identification unit
928_1 to 928_3: first to third true/false positive identification modules
930: true/false positive determination unit

Claims

A method for generating an artificial intelligence-based true/false positive identification model, the method comprising:
(A) generating, by an artificial intelligence-based true/false positive identification model generating apparatus, a plurality of types of features from a plurality of attack packets that are training data and have been determined to be true positives or false positives;
(B) generating a plurality of clusters by clustering features of one type among the plurality of types of features;
(C) selecting one feature from each of the plurality of clusters; and
(D) generating, based on the selected features of the one type and the features of the remaining types corresponding to each of the selected features of the one type, an artificial intelligence-based data model for each type of feature for identifying whether an attack packet is a true positive or a false positive.

The method of claim 1,
wherein the plurality of types of features include data index information, data content count vector information, and data token information, respectively.

The method of claim 2,
wherein, in step (B), the features of the one type comprise the data content count vector information.

The method of claim 2,
wherein, in step (B), the plurality of clusters are generated by clustering a plurality of pieces of the data content count vector information based on the similarity between them.

The method of claim 1,
wherein step (D) comprises:
(D-1) generating, based on the selected features of the one type and the features of the remaining types corresponding to each of the selected features of the one type, a plurality of artificial intelligence-based data models for each type of feature for identifying whether the attack packet is a true positive or a false positive; and
(D-2) finally selecting, for each type of feature, the data model with the highest accuracy among the plurality of artificial intelligence-based data models as the artificial intelligence-based data model for that type of feature.

The method of claim 5,
wherein the plurality of artificial intelligence-based data models of step (D-1) include data models based on an artificial neural network (ANN), a deep neural network (DNN), a one-dimensional convolutional neural network (CNN1D), a two-dimensional convolutional neural network (CNN2D), a random forest (RF), a support vector machine (SVM), and extreme gradient boosting (XGBoost).

The method of claim 2,
wherein the data index information includes index data generated by converting hexadecimal values of decoded data of the packet data of the attack packet into integer values of ASCII codes,
the data content count vector information includes a vector value of a predetermined size generated by dividing the decoded data into blocks of a predetermined number of characters using a chunking algorithm, generating a hash value for each block, and cumulatively counting duplicated hash values, and
the data token information includes token values generated by URL-decoding the decoded data to form a sentence, removing redundant characters from the formed sentence, replacing blanks and special characters in the formed sentence with identifiable characters of a predetermined number of bytes, and splitting the sentence.

The method of claim 1,
wherein a binarized value, generated by dividing the data of each of the selected features of the one type and of the features of the remaining types corresponding to each of the selected features of the one type into 3-gram, 4-gram, and 5-gram units, hashing the units, and concatenating the hashed 3-gram, 4-gram, and 5-gram data, is used as an input vector value for generating the artificial intelligence-based data model for each type of feature.

An apparatus for generating an artificial intelligence-based true/false positive identification model, the apparatus comprising:
a processed data generation unit that generates a plurality of types of features from a plurality of attack packets that are training data and have been determined to be true positives or false positives;
a data cluster generation unit that generates a plurality of clusters by clustering features of one type among the plurality of types of features;
a data selection unit that selects one feature from each of the plurality of clusters; and
a data model generation unit that generates, based on the selected features of the one type and the features of the remaining types corresponding to each of the selected features of the one type, an artificial intelligence-based data model for each type of feature for identifying whether an attack packet is a true positive or a false positive.

A method for artificial intelligence-based true/false positive identification, the method comprising:
(A) generating, by an artificial intelligence-based true/false positive identification apparatus, a plurality of artificial intelligence-based data models based on a plurality of attack packets that are training data and have been determined to be true positives or false positives; and
(B) identifying, based on the plurality of artificial intelligence-based data models, whether each of a plurality of attack packets that are test data and may be true positives or false positives is a true positive or a false positive,
wherein step (A) comprises:
(A-1) generating a plurality of types of features from the plurality of attack packets that are the training data and have been determined to be true positives or false positives;
(A-2) generating a plurality of clusters by clustering features of one type among the plurality of types of features;
(A-3) selecting one feature from each of the plurality of clusters; and
(A-4) generating, based on the selected features of the one type and the features of the remaining types corresponding to each of the selected features of the one type, an artificial intelligence-based data model for each type of feature for identifying whether an attack packet is a true positive or a false positive.

The method of claim 10,
wherein step (B) comprises:
(B-1) generating a plurality of types of features from the plurality of attack packets that are the test data and may be true positives or false positives;
(B-2) generating a plurality of clusters by clustering features of one type among the plurality of types of features;
(B-3) selecting one feature from each of the plurality of clusters; and
(B-4) finally determining, using the artificial intelligence-based data model for each type of feature and based on the selected features of the one type and the features of the remaining types corresponding to each of the selected features of the one type, whether an attack packet corresponding to the one feature selected from each of the plurality of clusters is a true positive or a false positive.

The method of claim 10,
wherein the plurality of types of features include data index information, data content count vector information, and data token information, respectively.

The method of claim 12,
wherein, in steps (A-2) and (B-2), the features of the one type comprise the data content count vector information.

The method of claim 12,
wherein, in each of steps (A-2) and (B-2), the plurality of clusters are generated by clustering a plurality of pieces of the data content count vector information based on the degree of similarity between them.

The method of claim 10,
wherein step (A-4) comprises:
(A-4-1) generating, based on the selected features of the one type and the features of the remaining types corresponding to each of the selected features of the one type, a plurality of artificial intelligence-based data models for each type of feature for identifying whether the attack packet is a true positive or a false positive; and
(A-4-2) finally selecting, for each type of feature, the data model with the highest accuracy among the plurality of artificial intelligence-based data models as the artificial intelligence-based data model for that type of feature.

The method of claim 15,
wherein the plurality of artificial intelligence-based data models of step (A-4-1) include data models based on an artificial neural network (ANN), a deep neural network (DNN), a one-dimensional convolutional neural network (CNN1D), a two-dimensional convolutional neural network (CNN2D), a random forest (RF), a support vector machine (SVM), and extreme gradient boosting (XGBoost).

The method of claim 12,
wherein the data index information includes index data generated by converting hexadecimal values of decoded data of the packet data of the attack packet into integer values of ASCII codes,
the data content count vector information includes a vector value of a predetermined size generated by dividing the decoded data into blocks of a predetermined number of characters using a chunking algorithm, generating a hash value for each block, and cumulatively counting duplicated hash values, and
the data token information includes token values generated by URL-decoding the decoded data to form a sentence, removing redundant characters from the formed sentence, replacing blanks and special characters in the formed sentence with identifiable characters of a predetermined number of bytes, and splitting the sentence.

The method of claim 11,
wherein a binarized value, generated by dividing the data of each of the selected features of the one type and of the features of the remaining types corresponding to each of the selected features of the one type into 3-gram, 4-gram, and 5-gram units, hashing the units, and concatenating the hashed 3-gram, 4-gram, and 5-gram data, is used as an input vector value of the artificial intelligence-based data model for each type of feature.

The method of claim 11,
wherein step (B-4) comprises:
identifying, using the artificial intelligence-based data model for each type of feature and based on the selected features of the one type and the features of the remaining types corresponding to each of the selected features of the one type, whether an attack packet corresponding to the one feature selected from each of the plurality of clusters in step (B-3) is a true positive or a false positive; and
aggregating the true/false positive identification results of the artificial intelligence-based data models for the respective types of features, and finally determining by majority vote whether the attack packet corresponding to the one feature selected from each of the plurality of clusters is a true positive or a false positive.

An apparatus for artificial intelligence-based true/false positive identification, the apparatus comprising:
a true/false positive identification model generation unit that generates a plurality of artificial intelligence-based data models based on a plurality of attack packets that are training data and have been determined to be true positives or false positives; and
a true/false positive identification unit that identifies, based on the plurality of artificial intelligence-based data models, whether each of a plurality of attack packets that are test data and may be true positives or false positives is a true positive or a false positive,
wherein the true/false positive identification model generation unit comprises:
a processed data generation unit that generates a plurality of types of features from the plurality of attack packets that are the training data and have been determined to be true positives or false positives;
a data cluster generation unit that generates a plurality of clusters by clustering features of one type among the plurality of types of features;
a data selection unit that selects one feature from each of the plurality of clusters; and
a data model generation unit that generates, based on the selected features of the one type and the features of the remaining types corresponding to each of the selected features of the one type, an artificial intelligence-based data model for each type of feature for identifying whether an attack packet is a true positive or a false positive.