CN112632355A - Fragment content processing method and device for harmful information - Google Patents
Fragment content processing method and device for harmful information
- Publication number
- CN112632355A (application CN202011354462.5A)
- Authority
- CN
- China
- Prior art keywords
- harmful information
- content
- fragment
- fragmented
- page
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention provides a method and a device for processing fragment content of harmful information. The method comprises: crawling page content containing harmful information in a target website; and extracting fragment content of the harmful information from the page content according to preset content features, wherein the content features comprise keywords. By crawling the page content containing harmful information in the target website and matching that page content against the content features, the method and device obtain all fragment content of harmful information contained in the page content. Fragment content of harmful information can thereby be identified more accurately, the propagation of harmful information can be prevented more effectively, and network security risk can be reduced.
Description
Technical Field
The invention relates to the technical field of mobile internet, in particular to a fragment content processing method and device for harmful information.
Background
Harmful information present in the mobile internet refers to data that may damage or threaten existing legal order and other public order.
With the development of network technology, lawbreakers decompose harmful information into fragments and spread those fragments to widen the reach of the harmful information and thereby achieve illegal purposes.
In the prior art, harmful information is identified so that it can be filtered and managed. However, fragment content of harmful information is disordered and irregular, so the prior art has difficulty identifying it accurately and therefore cannot filter or manage it effectively.
Disclosure of Invention
The invention provides a fragment content processing method and device of harmful information, which are used for solving the defect that the fragment content of the harmful information is difficult to accurately identify in the prior art and realizing more accurate identification of the fragment content of the harmful information.
The invention provides a fragment content processing method of harmful information, which comprises the following steps:
crawling page content containing harmful information in a target website;
extracting fragment contents of harmful information in the page contents according to preset content characteristics;
wherein the content features comprise keywords.
According to the fragment content processing method of harmful information provided by the invention, after the fragment content of the harmful information in the page content is extracted, the method further comprises:
determining the type of the fragment content of the harmful information.
According to the fragment content processing method of harmful information provided by the invention, after the type of the fragment content of the harmful information is determined, the method further comprises:
storing the fragment content of the harmful information according to the type of the fragment content of the harmful information.
According to the fragment content processing method of harmful information provided by the invention, after the fragment content of the harmful information is stored according to its type, the method further comprises:
in response to a retrieval request, returning a retrieval result corresponding to the retrieval request according to the stored fragment content of the harmful information.
According to the fragment content processing method of harmful information provided by the invention, crawling page content containing harmful information in a target website specifically comprises:
after a crawling instruction is received, crawling the page content containing harmful information in the target website according to the crawling instruction.
According to the fragment content processing method of harmful information provided by the invention, before the fragment content of the harmful information in the page content is extracted according to the preset content features, the method further comprises:
receiving and storing the preset content features.
According to the fragment content processing method of harmful information provided by the invention, after the type of the fragment content of the harmful information is determined, the method further comprises:
analyzing the fragment content of each piece of harmful information according to the type of the fragment content of each piece of harmful information.
The present invention also provides a fragmented content processing apparatus for harmful information, including:
the crawling module is used for crawling page content containing harmful information in the target website;
the extraction module is used for extracting fragment contents of harmful information in the page contents according to preset content characteristics;
wherein the content features comprise keywords.
The present invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the fragmented content processing method for harmful information as described in any of the above when executing the program.
The present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the fragmented content processing method for harmful information as described in any of the above.
The method and device for processing fragment content of harmful information provided by the invention crawl the page content containing harmful information in the target website and match that page content against the content features to obtain all fragment content of harmful information contained in the page content. Fragment content of harmful information can thereby be identified more accurately, the propagation of harmful information can be prevented more effectively, and network security risk can be reduced.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart of a fragment content processing method for harmful information provided by the present invention;
FIG. 2 is a schematic diagram of a fragmented content processing apparatus for harmful information provided by the present invention;
FIG. 3 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "coupled," and "connected" are to be construed broadly. For example, a connection may be fixed, removable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intervening medium, or an internal communication between two elements. Those skilled in the art can understand the specific meanings of the above terms in the present invention according to the specific situation.
In order to overcome the problems in the prior art, the invention provides a method and a device for processing the fragment content of the harmful information.
FIG. 1 is a schematic flow chart of a fragment content processing method for harmful information provided by the present invention. The method is described below with reference to FIG. 1. As shown in FIG. 1, the method includes: S101, crawling page content containing harmful information in the target website.
Harmful information is information, present in a computer information system or its storage media, whose content endangers social order.
Because of the particular nature of network space, harmful information on a website is stored on the website's server and is not easily discovered until it is accessed.
In the embodiment of the invention, a web crawler based on a target data pattern is used to crawl the page content containing harmful information in the target website.
It should be noted that such a web crawler can selectively fetch page content that conforms to a given pattern, based on the data in the web page.
Specifically, sensitive words related to the harmful information can be set, and the page content in the target website that contains those sensitive words is crawled accordingly. For example, the sensitive words may be set to "bet", "earning of day", or "swipe". One or more sensitive words may be set.
It should be noted that, to improve the accuracy of identifying fragment content of harmful information, the page content crawled by the web crawler may be the entire content of each page in the target website that contains a sensitive word, for example, all content of every page in the target website that contains the word "swipe".
To improve the efficiency of identifying fragment content of harmful information, the crawled page content may instead be only part of such a page, for example, the 300 words before and after the word "gambling" in a page of the target website.
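The two crawling scopes described above (whole page for accuracy, a window around each sensitive word for efficiency) can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the function name `select_page_content`, the `window` parameter, and whitespace tokenization are assumptions made for the example, and fetching the page itself is left out.

```python
def select_page_content(page_text, sensitive_words, window=None):
    """Return the part of a fetched page to keep, or None if no sensitive word occurs.

    window=None keeps the whole page (accuracy-oriented scope).
    window=N keeps N words before and after each hit (efficiency-oriented scope).
    Words are split on whitespace, which is an assumption of this sketch.
    """
    words = page_text.split()
    # Indices of words that contain any of the sensitive words.
    hits = [i for i, w in enumerate(words)
            if any(s in w for s in sensitive_words)]
    if not hits:
        return None
    if window is None:
        return page_text
    keep = set()
    for i in hits:
        keep.update(range(max(0, i - window), min(len(words), i + window + 1)))
    return " ".join(words[i] for i in sorted(keep))


page = "a b c gambling d e f"
whole = select_page_content(page, ["gambling"])
partial = select_page_content(page, ["gambling"], window=1)
```

With `window=300` the second call would correspond to the 300-word example in the text.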
S102, extracting fragment content of harmful information from the page content according to preset content features.
The content features comprise keywords.
The fragment content of the harmful information may be harmful information mixed into normal information, a fragment produced by decomposing harmful information, or a URL that includes harmful information.
Because fragment content of harmful information is mixed with other content, harmful information can be disguised as normal information, which makes its propagation more covert and its accurate identification more difficult.
In the embodiment of the invention, fragment content of harmful information can be extracted, according to the preset content features, from the page content that the web crawler crawled from the target website.
Specifically, the page content is searched to obtain the content that matches the preset content features, and the matching content is extracted as fragment content of harmful information.
The preset content features may include keywords.
Keywords can match fragment content of harmful information. By searching the page content, every sentence or URL that contains a keyword can be found, and the sentence, the paragraph containing the sentence, or the URL containing the keyword is extracted as fragment content of harmful information.
The rule by which a keyword matches fragment content of harmful information may be that the component terms of the keyword all appear within a preset word count threshold. For example, if the keyword is "gambling website address" and its component terms "gambling" and "website address" both appear within any 20 words of the page content, the keyword can be considered to match fragment content of harmful information. Specifically, if a 20-word span of the page content includes the sentence "gambling website address" or "gambling a particular network website address", the sentence or paragraph containing that text can be extracted as fragment content of harmful information. The preset word count threshold may be determined according to actual conditions and is not specifically limited in the embodiment of the present invention.
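The word-count rule above can be illustrated with a small sliding-window check. This is a minimal sketch under stated assumptions: the function name `find_fragments`, whitespace tokenization, and returning the matching window (rather than the enclosing sentence or paragraph) are choices made for the example and are not specified by the text.

```python
def find_fragments(text, terms, max_words=20):
    """Return word windows (at most max_words long) in which every term appears.

    A window is only reported when it begins with one of the terms, so that
    overlapping windows around the same hit are not reported repeatedly.
    """
    words = text.split()
    lowered = [w.lower() for w in words]
    terms = [t.lower() for t in terms]
    fragments = []
    for start in range(len(words)):
        window = " ".join(lowered[start:start + max_words])
        if not any(window.startswith(t) for t in terms):
            continue
        if all(t in window for t in terms):
            fragments.append(" ".join(words[start:start + max_words]))
    return fragments


sample = "visit this gambling site at the special network website address today"
found = find_fragments(sample, ["gambling", "website address"], max_words=20)
too_narrow = find_fragments(sample, ["gambling", "website address"], max_words=3)
```

Here the component terms are six words apart, so the 20-word threshold matches and a 3-word threshold does not.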
It should be noted that the keywords may be determined according to actual situations, and are not specifically limited in the embodiment of the present invention.
It should be noted that the paragraph containing a keyword-bearing sentence may include other illegal information. To improve the accuracy of identifying fragment content of harmful information, either the sentence or the paragraph containing it may be extracted as fragment content of harmful information, depending on the actual situation.
The rule by which a keyword matches fragment content of harmful information may also be that the component terms of the keyword all appear in a URL corresponding to the page content.
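A URL-based version of the rule might look like the sketch below. The function name `url_matches` and the example URLs are hypothetical, and percent-decoding the URL before matching is an added assumption rather than something the text requires.

```python
from urllib.parse import unquote


def url_matches(url, terms):
    """Return True when every component term appears in the decoded URL."""
    decoded = unquote(url).lower()
    return all(t.lower() in decoded for t in terms)


hit = url_matches("http://example.com/gambling/website%2Daddress",
                  ["gambling", "website-address"])
miss = url_matches("http://example.com/news", ["gambling"])
```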
The rule by which a keyword matches fragment content of harmful information can be derived from the way the harmful information is decomposed. Ways of decomposing harmful information include, but are not limited to, inserting other characters or symbols into the harmful information, rearranging the character order of the harmful information, or a combination of the two.
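One possible way to derive a matching rule from these decomposition methods is sketched below. It is an assumption-laden illustration: `strip_fillers` and `matches_decomposed` are hypothetical names, only non-alphanumeric insertions are handled, and reordering is approximated by comparing character counts, which is deliberately loose and can produce false positives.

```python
from collections import Counter


def strip_fillers(text):
    """Drop inserted symbols and whitespace, keeping letters and digits only."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def matches_decomposed(candidate, keyword):
    """Return True if candidate equals keyword after undoing simple decomposition."""
    cleaned = strip_fillers(candidate)
    target = strip_fillers(keyword)
    if cleaned == target:
        return True  # identical once inserted symbols are removed
    # Same characters in a different order (also covers both methods combined).
    return Counter(cleaned) == Counter(target)


with_symbols = matches_decomposed("g*a*m-b.l_ing", "gambling")
reordered = matches_decomposed("gamlbing", "gambling")
unrelated = matches_decomposed("gaming", "gambling")
```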
The preset content features may also comprise harmful feature patterns. A harmful feature pattern may match key content of the harmful information.
By searching the page content, every picture that contains a harmful feature pattern can be found, and the picture, or the picture together with the text before and after it, is extracted as fragment content of harmful information.
When the page content is searched, keywords and harmful feature patterns may also be combined, so that all content containing both the keywords and the harmful feature patterns is obtained as fragment content of harmful information.
It should be noted that after the fragment content of the harmful information in the page content is extracted, the page information of the page where the fragment content of the harmful information is located may also be obtained, including: the address of the page, etc.
It should be noted that, fragmented content of harmful information in the page content may be extracted by using an open-source JavaScript technology.
It should be noted that the method for extracting fragmented content of harmful information provided in the embodiment of the present invention is suitable for extracting fragmented content of harmful information based on the internet, and is particularly suitable for extracting fragmented content of harmful information based on the mobile internet.
In the embodiment of the invention, the page content containing harmful information in the target website is crawled and matched against the content features to obtain all fragment content of harmful information contained in the page content. Fragment content of harmful information can thereby be identified more accurately, the propagation of harmful information can be prevented more effectively, and network security risk can be reduced.
Based on the content of the above embodiments, after extracting the fragment content of the harmful information in the page content, the method further includes: the type of fragmented content of the harmful information is determined.
The fragment content of the harmful information has the characteristics of disorder and irregularity, and the fragment content of the harmful information cannot be directly analyzed and managed.
After the fragment content of the harmful information in the page content is extracted, the fragment content of the harmful information can be classified.
Fragment content of harmful information may be classified according to different classification criteria. The classification criteria are not specifically limited in the embodiments of the present invention; several specific criteria are described below.
The classification may be performed according to the preset content features. Each keyword in the preset content features may correspond to one or more types. For example, the keyword "bet" may correspond to "illegal harmful information", and the keyword "swipe ticket" may correspond to both "fraudulent harmful information" and "illegal harmful information". The type of the fragment content matched by a keyword can be determined from the type of that keyword.
The classification may be based on the pages of the target website. After the fragment content of harmful information is extracted from the page content, it can be classified by treating each page of the target website as one type.
The classification may be based on the target websites. After fragment content of harmful information is extracted from multiple target websites, it can be classified by treating each target website as one type.
After the types are determined, each type can be further refined to determine subtypes of the fragment content of harmful information. For example, after fragment content is classified by target website, its subtype within each target website type can be determined according to the type of the keyword.
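The two-level classification in the last example (target website as type, keyword type as subtype) can be sketched as below. The keyword-to-type mapping repeats the examples from the text; the record layout, the function name `classify`, and the `"unclassified"` fallback are assumptions made for this sketch.

```python
from collections import defaultdict

# Keyword-to-type mapping taken from the examples in the text.
KEYWORD_TYPES = {
    "bet": ["illegal harmful information"],
    "swipe ticket": ["fraudulent harmful information",
                     "illegal harmful information"],
}


def classify(fragments):
    """Group fragments by target website, then by keyword type.

    fragments: iterable of dicts with the (assumed) keys 'site', 'keyword', 'text'.
    Returns {site: {type: [text, ...]}}.
    """
    tree = defaultdict(lambda: defaultdict(list))
    for frag in fragments:
        for frag_type in KEYWORD_TYPES.get(frag["keyword"], ["unclassified"]):
            tree[frag["site"]][frag_type].append(frag["text"])
    return {site: dict(types) for site, types in tree.items()}


classified = classify([
    {"site": "A", "keyword": "bet", "text": "f1"},
    {"site": "A", "keyword": "swipe ticket", "text": "f2"},
    {"site": "B", "keyword": "other", "text": "f3"},
])
```

A fragment whose keyword maps to several types appears under each of them, which matches the multi-type example for "swipe ticket".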
It should be noted that the open source JavaScript technology can be used to determine the type of the fragmented content of the harmful information.
The embodiment of the invention classifies the disordered, irregular fragment content of harmful information along different dimensions and determines its type, thereby bringing order to that content and providing a data basis for its analysis and management.
Based on the content of the above embodiments, after determining the type of the fragment content of the harmful information, the method further includes: storing the fragment content of the harmful information according to the type of the fragment content of the harmful information.
After the type of the fragment content of the harmful information is determined, the fragment content of the harmful information can be respectively stored in each type of data set according to the type of the fragment content of the harmful information.
When the fragment content of the harmful information is stored, page information of a page where the fragment content of the harmful information is located can be stored.
In addition, if the fragment contents of the same harmful information correspond to a plurality of types, the fragment contents of the harmful information may be stored in each of the data sets of the plurality of types.
It should be noted that a MySQL database or an HBase distributed system may be used to store the fragment content of harmful information.
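Type-based storage, including the case where one fragment belongs to several types, might be sketched as follows. Python's built-in `sqlite3` is used here purely as a stand-in for the MySQL or HBase store named in the text; the table layout and the function names `open_store` and `store_fragment` are assumptions of this sketch.

```python
import sqlite3


def open_store():
    """Create an in-memory store; sqlite3 stands in for MySQL/HBase here."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fragments (type TEXT, content TEXT, page_url TEXT)"
    )
    return conn


def store_fragment(conn, content, types, page_url):
    """Store one fragment under every type it belongs to, with its page address."""
    conn.executemany(
        "INSERT INTO fragments (type, content, page_url) VALUES (?, ?, ?)",
        [(t, content, page_url) for t in types],
    )
    conn.commit()


store = open_store()
store_fragment(
    store,
    "swipe ticket offer",
    ["fraudulent harmful information", "illegal harmful information"],
    "http://example.com/page1",
)
row_count = store.execute("SELECT COUNT(*) FROM fragments").fetchone()[0]
```

Storing one row per type keeps each type's data set complete on its own, at the cost of duplicating the fragment text.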
The embodiment of the invention stores the fragment content of harmful information according to its type. This establishes a classified data set of fragment content, brings order to the disordered, irregular fragment content, supports retrieval of the fragment content, and provides a data basis for its analysis and management.
Based on the content of the above embodiments, after storing the fragment content of the harmful information according to its type, the method further includes: in response to a retrieval request, returning a retrieval result corresponding to the retrieval request according to the stored fragment content of the harmful information.
It should be noted that the execution subject of the embodiment of the present invention is a server.
Specifically, the client may send a retrieval request to the server. After receiving the retrieval request sent by the client, the server can return information corresponding to the retrieval request according to the retrieval request.
The retrieval request may be a request carrying relevant retrieval conditions.
The retrieval condition may be a known type or known page information, used to retrieve the corresponding fragment content of harmful information.
The retrieval condition may also be known fragment content of harmful information, used to retrieve the corresponding type or page information.
The specific content of the retrieval request may be determined according to practical situations, and is not particularly limited in the embodiment of the present invention.
The retrieval result may be fragmented content of harmful information.
The retrieval result may also be a type of fragmented content of harmful information or page information of a page in which fragmented content of harmful information exists.
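The two retrieval directions described above (from type or page to fragment content, and from fragment content back to type or page) can be sketched over an in-memory list of stored records. The record keys and the function name `retrieve` are assumptions of this sketch; a real server would query its database instead.

```python
def retrieve(records, fragment_type=None, page_url=None, content=None):
    """Return the stored records that satisfy every condition that was supplied."""
    results = []
    for rec in records:
        if fragment_type is not None and rec["type"] != fragment_type:
            continue
        if page_url is not None and rec["page_url"] != page_url:
            continue
        if content is not None and content not in rec["content"]:
            continue
        results.append(rec)
    return results


stored = [
    {"type": "illegal harmful information", "content": "bet here",
     "page_url": "http://example.com/a"},
    {"type": "fraudulent harmful information", "content": "swipe ticket offer",
     "page_url": "http://example.com/b"},
]

# Known type -> fragment content.
by_type = [r["content"] for r in retrieve(stored, fragment_type="illegal harmful information")]
# Known fragment content -> type and page information.
by_content = [(r["type"], r["page_url"]) for r in retrieve(stored, content="swipe ticket")]
```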
It should be noted that, the information corresponding to the retrieval request may be returned according to the retrieval request by using the open-source JavaScript technology.
It should be noted that the client may be a PC. The operating system of the PC may be Windows XP, Windows 7, or Windows 8, and the PC may support browsers such as Firefox or Chrome.
The PC comprises a retrieval module, and a retrieval request can be sent through the retrieval module.
In particular, the retrieval module may send a retrieval request based on the browser.
A retrieval request may be entered in the retrieval module and sent to the server by the retrieval module.
It should be noted that the server may also receive a retrieval request input from a peripheral device.
In the embodiment of the invention, a retrieval request is received and the corresponding information is returned, so the required information can be obtained on demand, and the fragment content of harmful information can be analyzed using the retrieved information.
Based on the content of the above embodiments, crawling page content containing harmful information in the target website specifically includes: after receiving a crawling instruction, crawling the page content containing harmful information in the target website according to the crawling instruction.
Specifically, the client may send the crawling instruction to the server. After receiving the crawling instruction, the server crawls, according to the instruction, the page content containing harmful information in the target website through the web crawler based on a target data pattern.
The crawling instruction can carry related sensitive words of preset harmful information.
Other crawling conditions for executing crawling tasks can be carried in the crawling instruction, and the crawling conditions comprise: any number of the crawling start time, the crawling end time, the crawling cycle, the crawling range and the like.
For example, if the crawling instruction carries a crawling start time of 0:00 on the current day, a crawling cycle of 48 hours, a crawling range of target websites A, B, and C, and the sensitive word "gambling", a crawling task can be executed that crawls, in real time for 48 hours starting at 0:00 on the current day, page content in target websites A, B, and C that contains harmful information with the sensitive word "gambling". If the crawling instruction carries a crawling start time of 0:00 each day, a crawling end time of 6:00 each morning, a crawling cycle of 7 days, a crawling range of target website A, and the sensitive word "order swiping", a crawling task can be executed that crawls, from 0:00 to 6:00 on each of 7 consecutive days, page content in target website A that contains harmful information with the sensitive word "order swiping".
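The two example instructions can be modeled as a small data structure with a check for whether crawling should be running at a given moment. This is a sketch only: the class name `CrawlInstruction`, its field names, and the calendar date used for "the current day" are hypothetical and are not taken from the text.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta
from typing import List, Optional


@dataclass
class CrawlInstruction:
    sensitive_words: List[str]
    targets: List[str]                 # crawling range
    start: datetime                    # crawling start time
    cycle: timedelta                   # crawling cycle (overall duration)
    daily_end: Optional[time] = None   # crawling end time within each day, if any

    def is_active(self, now: datetime) -> bool:
        """Return True if a crawl should be running at the given moment."""
        if not (self.start <= now < self.start + self.cycle):
            return False
        if self.daily_end is None:
            return True  # continuous crawling for the whole cycle
        return self.start.time() <= now.time() < self.daily_end


day0 = datetime(2020, 11, 26)  # hypothetical "current day", 0:00

# First example: 48 hours of continuous crawling on websites A, B, and C.
continuous = CrawlInstruction(["gambling"], ["A", "B", "C"], day0,
                              timedelta(hours=48))
# Second example: 0:00 to 6:00 each day for 7 days on website A.
nightly = CrawlInstruction(["order swiping"], ["A"], day0,
                           timedelta(days=7), daily_end=time(6, 0))
```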
In the embodiment of the invention, page content containing harmful information in the target website is crawled according to the received crawling instruction. This yields the page content that may contain fragment content of harmful information, narrows the scope in which such fragment content must be identified, and provides a data basis for identifying it.
Based on the content of the above embodiments, before extracting the fragment content of the harmful information in the page content according to the preset content feature, the method further includes: receiving and storing the preset content characteristics.
Specifically, the retrieval module in the client may send the preset content characteristics to the server.
The retrieval module may send the preset content characteristics based on the browser.
The preset content characteristics can be input in the retrieval module, and are sent to the server through the retrieval module.
It should be noted that the preset content features can be obtained by self-defining according to actual requirements.
The server may receive the preset content features sent by the client and store them in a repository.
It should be noted that the server may also receive preset content features input from a peripheral device.
It should be noted that the server may use a MySQL database or an HBase distributed system to store the preset content features.
According to the embodiment of the invention, the server can acquire the corresponding fragment content of the harmful information according to the preset content characteristics by receiving and storing the preset content characteristics, and the fragment content of the harmful information can be more accurately identified.
Based on the content of the above embodiments, after determining the type of the fragment content of the harmful information, the method further includes: analyzing the fragment content of each piece of harmful information according to the type of the fragment content of each piece of harmful information.
Specifically, the fragment content of each type of harmful information may be counted according to the type of the fragment content of each type of harmful information. By counting the fragmented content of each type of harmful information, the number of fragmented content of harmful information in each type or each subtype in each type can be obtained.
According to the number of fragmented contents of each type of harmful information, the association between fragmented contents of harmful information can be analyzed. The following is a specific analysis method of the fragment content of several harmful information.
After the quantity of the fragment contents of the harmful information in different target websites is obtained, if the quantity of the fragment contents of the harmful information extracted from a certain target website is large, it is indicated that the target website has a greater security risk. Further, the target websites with greater security risks can be processed preferentially.
After the quantity of the fragment contents of the harmful information in different pages in the same target website is obtained, if the quantity of the fragment contents of the harmful information extracted from a certain page is large, it is indicated that the page has a greater security risk. Further, the pages with greater security risk can be processed preferentially.
After the number of the fragment contents of different types of harmful information in the same target website is obtained, if the number of the fragment contents of a certain type of harmful information extracted from the target website is large, it is indicated that this type is a type of security risk present in the target website. Further, corresponding processing may be performed according to the type of security risk present.
After the number of the fragmented contents of different types of harmful information in the same target website is obtained, if the numbers of the fragmented contents of multiple types of harmful information extracted from the target website are all large, it is indicated that there is a correlation between those types. Further, corresponding processing may be performed according to the association between the types.
It should be noted that, according to the number of fragmented contents of harmful information in each type, the analysis of the association between fragmented contents of harmful information may not be limited to the above-described exemplary case.
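The exemplary analyses above may, for instance, be sketched in Python as follows. This is only a minimal sketch: the record fields (`site`, `page`, `type`), the sample data, and the `min_count` threshold are assumptions, since the embodiment does not specify what number of fragments counts as large.

```python
from collections import Counter

# Hypothetical records: one entry per extracted fragment.
records = [
    {"site": "a.example", "page": "/index", "type": "gambling"},
    {"site": "a.example", "page": "/index", "type": "gambling"},
    {"site": "a.example", "page": "/promo", "type": "fraud"},
    {"site": "b.example", "page": "/home", "type": "gambling"},
]

def rank_by(records, key):
    """Return (value, count) pairs for the given field, largest count first."""
    return Counter(r[key] for r in records).most_common()

def prominent_types(records, site, min_count):
    """Types whose fragment count on one site reaches an assumed threshold."""
    counts = Counter(r["type"] for r in records if r["site"] == site)
    return sorted(t for t, n in counts.items() if n >= min_count)

# Websites ordered by number of fragments (candidates for priority handling).
site_ranking = rank_by(records, "site")
# Pages of one website ordered by number of fragments.
page_ranking = rank_by([r for r in records if r["site"] == "a.example"], "page")
```

When `prominent_types` returns more than one type for a website, those types may be treated as candidates for the association analysis described above.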
According to the embodiment of the invention, the fragment content of each piece of harmful information is analyzed according to its type, so that an analysis result can be obtained for fragment content that is otherwise disordered and irregular. Targeted processing can then be carried out according to the analysis result, so that the fragment content of the harmful information can be managed more effectively, the propagation of the harmful information can be prevented more effectively, and the security risk of the network can be reduced.
Fig. 2 is a schematic structural diagram of a fragmented content processing apparatus for harmful information provided by the present invention. The following describes the fragment content processing apparatus for harmful information provided by the present invention with reference to fig. 2, and the fragment content processing apparatus for harmful information described below and the fragment content processing method for harmful information described above may be referred to in correspondence with each other. As shown in fig. 2, the apparatus includes: a crawling module 201 and an extracting module 202, wherein:
The crawling module 201 is configured to crawl page content containing harmful information in the target website.
The extracting module 202 is configured to extract fragmented content of harmful information in the page content according to a preset content feature.
The content features comprise keywords.
Specifically, the crawling module 201 and the extraction module 202 are electrically connected.
The crawling module 201 can crawl page content containing harmful information in a target website through a web crawler based on a target data mode.
Specifically, sensitive words related to the harmful information can be set, and page content in the target website that contains those sensitive words is crawled accordingly. For example, the sensitive words may be set as "bet", "earning of day", or "swipe". One or more sensitive words may be set for the harmful information.
It should be noted that, in order to improve the accuracy of identifying the fragmented content of the harmful information, the page content containing the harmful information crawled by the web crawler may refer to all the content of every page in the target website that contains a sensitive word related to the harmful information, for example: all content of a page in the target website that contains the word "swipe".
In order to improve the efficiency of identifying fragmented content of harmful information, the page content containing harmful information crawled by a web crawler may also refer to only part of the content of a page in the target website that contains a sensitive word related to harmful information, for example: the page content within 300 words before and after the word "gambling" in the target website.
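As one possible illustration of keeping only the text surrounding a sensitive word, the following Python sketch returns a window of words around each occurrence. The function name `context_windows`, the whitespace-based definition of a word, and the default radius are assumptions for this example; for text without spaces between words, a character-based or segmenter-based window would be needed instead.

```python
def context_windows(text, sensitive_word, radius=300):
    """Return the words within `radius` words before and after each hit.

    Words are split on whitespace, which is an assumption of this sketch.
    """
    tokens = text.split()
    windows = []
    for i, token in enumerate(tokens):
        if sensitive_word in token:
            start = max(0, i - radius)
            windows.append(" ".join(tokens[start:i + radius + 1]))
    return windows

sample = "a b c gambling d e f"
nearby = context_windows(sample, "gambling", radius=1)
```

Overlapping windows are returned separately here; merging them is a design choice left open.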
The crawling module 201 may also be configured to receive a crawling instruction sent by the client.
The extraction module 202 may extract fragment content of harmful information in page content containing harmful information in a target website crawled by a web crawler according to preset content features after receiving the preset content features.
The preset content features may include keywords.
Keywords may be used to match fragmented content of harmful information. By searching the page content, all sentences or URL segments containing the keywords can be obtained, and the sentences, the paragraphs in which the sentences are located, or the URL segments containing the keywords are extracted to serve as fragment content of harmful information.
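The sentence-level extraction described above might be sketched in Python as follows. The sentence-splitting rule (terminal punctuation, including common full-width marks) and the plain substring test are simplifying assumptions of this example; URL segments are not handled here, since the splitting rule would break them at each dot.

```python
import re

# A sentence is a run of non-terminal characters plus an optional terminator.
_SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]?")

def extract_sentences(page_text, keywords):
    """Return the sentences of page_text that contain at least one keyword."""
    sentences = (s.strip() for s in _SENTENCE.findall(page_text))
    return [s for s in sentences if s and any(k in s for k in keywords)]

page = "Place a bet now. Read the news. Win big!"
hits = extract_sentences(page, ["bet", "Win"])
```

Because matching is by substring, a keyword such as "bet" would also match "better"; word-boundary matching could be substituted where that matters.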
The rule by which a keyword matches the fragmented content of the harmful information may be: the words constituting the keyword all appear together within a preset word-count threshold.
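A minimal Python sketch of such a co-occurrence rule is given below, assuming the text has already been split into words and that the threshold is expressed as a number of consecutive words. The function name `cooccur_within` and the sample sentence are hypothetical.

```python
def cooccur_within(tokens, parts, threshold):
    """True if all parts occur inside some run of `threshold` consecutive tokens."""
    for start in range(len(tokens)):
        window = tokens[start:start + threshold]
        if all(part in window for part in parts):
            return True
    return False

words = "earn money every day with no risk".split()
matched = cooccur_within(words, ["earn", "day"], threshold=4)
```

This checks every window from scratch, which is simple but quadratic in the worst case; a sliding count of the component words would be more efficient for long pages.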
It should be noted that the keywords may be determined according to actual situations, and are not specifically limited in the embodiment of the present invention.
It should be noted that the paragraphs in which sentences containing the keywords are located may include other illegal information. In order to improve the accuracy of identifying fragmented content of harmful information, either the sentences or the paragraphs in which the sentences are located may be extracted, according to actual situations, to serve as fragmented content of harmful information.
The rule by which a keyword matches the fragment content of the harmful information may also be: the words constituting the keyword all appear in a URL segment corresponding to the page content.
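For illustration only, the URL-based rule could be sketched in Python with the standard `urllib.parse` module. Treating the host, path, and query together as the searched segment, and matching case-insensitively, are assumptions of this sketch rather than requirements of the embodiment.

```python
from urllib.parse import unquote, urlsplit

def url_matches(url, parts):
    """True if every component word of the keyword appears in the URL."""
    split = urlsplit(url)
    # Decode percent-escapes so encoded words can still be matched.
    haystack = unquote(split.netloc + split.path + "?" + split.query).lower()
    return all(part.lower() in haystack for part in parts)

found = url_matches("http://site.example/daily/earn?id=1", ["daily", "earn"])
```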
The preset content features can also comprise harmful feature patterns.
By searching the page content, all pictures containing the harmful feature patterns can be obtained, and the pictures, or the pictures together with the text before and after them, are extracted to serve as fragment content of harmful information.
When the page content is searched, the keywords and the harmful feature patterns can also be combined, so that all content containing both the keywords and the harmful feature patterns is obtained and serves as fragment content of harmful information.
The extraction module 202 may also be used to determine the type of fragmented content of the harmful information.
It should be noted that the fragmented content processing apparatus for harmful information according to the embodiment of the present invention may further include a storage module.
The storage module can be used for storing the fragment contents of the harmful information according to the type of the fragment contents of the harmful information.
The storage module can also be used for storing preset content characteristics.
The storage module can also be used for responding to a retrieval request and returning a retrieval result corresponding to the retrieval request according to the stored fragment content of the harmful information.
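A minimal sketch of storing fragment content by type and answering a retrieval request is given below in Python. An in-memory SQLite database is used purely as a stand-in for the storage backend; the table layout, the function names, and the idea of a retrieval request consisting of a type plus an optional search term are all assumptions of this example.

```python
import sqlite3

def open_store():
    """Create an in-memory store indexed by fragment type."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE fragments (type TEXT, site TEXT, content TEXT)")
    conn.execute("CREATE INDEX idx_fragments_type ON fragments (type)")
    return conn

def store_fragment(conn, frag_type, site, content):
    """Store one fragment under its type."""
    conn.execute(
        "INSERT INTO fragments (type, site, content) VALUES (?, ?, ?)",
        (frag_type, site, content),
    )

def retrieve(conn, frag_type, term=""):
    """Return (site, content) rows of one type whose content contains term."""
    rows = conn.execute(
        "SELECT site, content FROM fragments "
        "WHERE type = ? AND content LIKE ? ORDER BY rowid",
        (frag_type, f"%{term}%"),
    )
    return rows.fetchall()

store = open_store()
store_fragment(store, "gambling", "a.example", "Place a bet now.")
store_fragment(store, "fraud", "a.example", "Send your password.")
store_fragment(store, "gambling", "b.example", "Win big today!")
results = retrieve(store, "gambling", "bet")
```

Note that `%` and `_` inside the search term are interpreted as SQL `LIKE` wildcards in this sketch; a production system would escape them or use a full-text index.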
According to the embodiment of the invention, after page content containing harmful information in the target website is crawled, the page content is searched to obtain all sentences containing the keywords, and the sentences or the paragraphs in which they are located are extracted to serve as fragment content of the harmful information. In this way, fragmented content of the harmful information can be identified more accurately, propagation of the harmful information can be prevented more effectively, and the security risk of the network can be reduced.
Fig. 3 illustrates a physical structure diagram of an electronic device. As shown in fig. 3, the electronic device may include: a processor 310, a communication interface 320, a memory 330 and a communication bus 340, wherein the processor 310, the communication interface 320 and the memory 330 communicate with each other via the communication bus 340. The processor 310 may invoke logic instructions in the memory 330 to perform a fragmented content processing method of harmful information, the method comprising: crawling page content containing harmful information in a target website; extracting fragment contents of harmful information in the page contents according to preset content characteristics; the content features comprise keywords.
In addition, the logic instructions in the memory 330 may be implemented in the form of software functional units and, when the software functional units are sold or used as independent products, stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and various other media capable of storing program code.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform a fragmented content processing method for harmful information provided by the above methods, the method comprising: crawling page content containing harmful information in a target website; extracting fragment contents of harmful information in the page contents according to preset content characteristics; the content features comprise keywords.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the fragmented content processing method for harmful information provided above, the method including: crawling page content containing harmful information in a target website; extracting fragment contents of harmful information in the page contents according to preset content characteristics; the content features comprise keywords.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A fragmented content processing method for harmful information, comprising:
crawling page content containing harmful information in a target website;
extracting fragment contents of harmful information in the page contents according to preset content characteristics;
wherein the content features comprise keywords.
2. The fragmented content processing method for harmful information according to claim 1, further comprising, after extracting fragmented content of harmful information in the page content:
determining a type of fragmented content for the harmful information.
3. The fragmented content processing method for harmful information according to claim 2, wherein after determining the type of the fragmented content of the harmful information, the method further comprises:
storing the fragment content of the harmful information according to the type of the fragment content of the harmful information.
4. The fragmented content processing method for harmful information according to claim 3, further comprising, after storing the fragmented content of the harmful information according to the type of the fragmented content of the harmful information:
responding to a retrieval request, and returning a retrieval result corresponding to the retrieval request according to the stored fragment content of the harmful information.
5. The harmful information fragment content processing method according to claim 1, wherein the crawling of the page content containing harmful information in the target website specifically comprises:
after a crawling instruction is received, crawling page content containing harmful information in the target website according to the crawling instruction.
6. The method for processing fragmented content of harmful information according to any one of claims 1 to 5, wherein before extracting fragmented content of harmful information in the page content according to the preset content feature, the method further comprises:
receiving and storing the preset content characteristics.
7. The fragmented content processing method for harmful information according to any one of claims 2 to 4, further comprising, after determining the type of the fragmented content of the harmful information:
analyzing the fragment content of each piece of harmful information according to the type of the fragment content of each piece of harmful information.
8. A fragmented content processing apparatus for harmful information, characterized by comprising:
a crawling module configured to crawl page content containing harmful information in a target website;
an extraction module configured to extract fragment contents of harmful information in the page contents according to preset content characteristics;
wherein the content features comprise keywords.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the fragmented content processing method of harmful information according to any one of claims 1 to 7 when executing the program.
10. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the steps of the fragmented content processing method for harmful information according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011354462.5A CN112632355A (en) | 2020-11-26 | 2020-11-26 | Fragment content processing method and device for harmful information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112632355A (en) | 2021-04-09 |
Family
ID=75306443
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011354462.5A Pending CN112632355A (en) | 2020-11-26 | 2020-11-26 | Fragment content processing method and device for harmful information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112632355A (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101281521A (en) * | 2007-04-05 | 2008-10-08 | 中国科学院自动化研究所 | Method and system for filtering sensitive web page based on multiple classifier amalgamation |
CN101867932A (en) * | 2010-05-21 | 2010-10-20 | 武汉虹旭信息技术有限责任公司 | Harmful information filtration system based on mobile Internet and method thereof |
CN102880613A (en) * | 2011-07-14 | 2013-01-16 | 腾讯科技(深圳)有限公司 | Identification method of porno pictures and equipment thereof |
US20140115699A1 (en) * | 2006-07-10 | 2014-04-24 | Websense, Inc. | System and method for analyzing web content |
CN104899324A (en) * | 2015-06-19 | 2015-09-09 | 成都国腾实业集团有限公司 | Sample training system based on IDC (internet data center) harmful information monitoring system |
CN104951539A (en) * | 2015-06-19 | 2015-09-30 | 成都艾尔普科技有限责任公司 | Internet data center harmful information monitoring system |
CN106096366A (en) * | 2016-06-08 | 2016-11-09 | 北京奇虎科技有限公司 | A kind of information processing method, device and equipment |
CN107341159A (en) * | 2016-04-29 | 2017-11-10 | 广州市动景计算机科技有限公司 | Page key words methods of exhibiting and device |
CN109710825A (en) * | 2018-11-02 | 2019-05-03 | 成都三零凯天通信实业有限公司 | Webpage harmful information identification method based on machine learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210409 |