CN118332174B - Data crawling method, system and computer readable storage medium - Google Patents


Info

Publication number: CN118332174B
Application number: CN202410760744.7A
Authority: CN (China)
Prior art keywords: crawling, data, server, task, data source
Legal status: Active
Original language: Chinese (zh)
Other versions: CN118332174A
Inventor: 杨政良
Original and current assignee: Honor Device Co Ltd

Classifications

    • G (Physics)
    • G06 (Computing; Calculating or Counting)
    • G06F (Electric Digital Data Processing)
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/972 Access to data in other repository systems, e.g. legacy data or dynamic Web page generation


Abstract

An embodiment of the application provides a data crawling method, a data crawling system and a computer readable storage medium. The method is applied to a data crawling system comprising a Web client and a server, and includes the following steps: the Web client displays a first page used for configuring a data crawling task; the Web client receives a first configuration operation for the data crawling task through the first page and sends the configuration content of the task to the server for storage, wherein the first configuration operation at least includes configuring a crawling data source and a crawling time interval for the task; when the Web client receives an operation of starting the data crawling task, it sends a task request to the server, requesting the server to execute the task; and the server periodically crawls web page data from the website corresponding to the crawling data source according to the crawling time interval. In this way, data crawling efficiency can be improved.

Description

Data crawling method, system and computer readable storage medium
Technical Field
The application relates to the technical field of data acquisition, in particular to a data crawling method, a data crawling system and a computer readable storage medium.
Background
With the rapid development of information network technology, the amount of information on the network is growing explosively. However, this huge volume of network information often contains redundant content or content the user does not care about, so the information the user actually needs must be extracted from it.
A web crawler (or crawler engine) is a program that automatically browses the web and crawls web page data, filtering and capturing the information a user needs according to a given search strategy. Because web page data is complex and its content irregular, often only a small part of it needs to be captured, which in turn requires developers to write a large amount of code to implement the crawler's capture process. Writing code in this way consumes considerable manpower, and data crawling efficiency is relatively low.
Disclosure of Invention
The application provides a data crawling method, a data crawling system and a computer readable storage medium, which can improve the data crawling efficiency.
In a first aspect, the present application provides a data crawling method, applied to a data crawling system, where the data crawling system includes a Web client and a server, the method includes:
the Web client displays a first page, wherein the first page is used for configuring a data crawling task;
the Web client receives a first configuration operation of the data crawling task through the first page, and sends configuration content of the data crawling task to the server for storage, wherein the first configuration operation at least comprises configuration of a crawling data source and a crawling time interval of the data crawling task;
Under the condition that the Web client receives the operation of starting the data crawling task, sending a task request to the server to request the server to execute the data crawling task;
and the server periodically crawls the webpage data from the website corresponding to the crawl data source according to the crawl time interval.
The Web client and the server can communicate with each other. A user (such as a developer) can log in to a Web management page on the Web client and configure a data crawling task there (that is, create the task). After the user triggers the start of the data crawling task on the Web client, the Web client sends a task request to the server, and the server executes the task in the background and stores the crawling results.
When a data crawling task is configured on the Web client, the Web client displays a first page (i.e. a data crawling task management page), on which the user can configure information such as the crawling data source and crawling time interval of the task. The configuration content is sent to the server for storage; for example, the server stores it in a first data table. While crawling, the server continuously crawls web page data at the configured crawling time interval.
In one implementation, if the user does not configure a crawling time interval for the data crawling task on the Web client, the server may apply a default crawling time interval and periodically crawl web page data according to that interval.
Thus, the data crawling method allows a user to dynamically configure data crawling tasks on the Web client, which reduces the labor cost of writing code; the server can continuously crawl data at the configured interval, improving both the efficiency and the flexibility of the data crawling process.
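As a minimal sketch of the configuration content described above (the field names, source codes and default interval are assumptions for illustration, not part of the patent), the task record sent by the Web client and the server-side fallback to a default crawling interval might look like this:

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_CRAWL_INTERVAL_S = 3600  # assumed server-side default interval (one hour)

@dataclass
class CrawlTaskConfig:
    """Configuration content of a data crawling task, as stored in the first data table."""
    task_id: str
    name: str
    data_sources: list                      # codes of pre-configured crawling (URL) data sources
    crawl_interval_s: Optional[int] = None  # None: the user configured no interval

    def effective_interval(self) -> int:
        # The server falls back to a default interval when none was configured.
        return self.crawl_interval_s or DEFAULT_CRAWL_INTERVAL_S

task = CrawlTaskConfig(task_id="t-1", name="hot-list", data_sources=["url-001"])
print(task.effective_interval())  # 3600
```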
With reference to the first aspect, in some implementations of the first aspect, the first page includes a first control, and the Web client receives, through the first page, a first configuration operation for the data crawling task, including:
the Web client responds to clicking operation of a user on the first control, and displays a first input control for configuring the data crawling task;
and receiving the crawling data source and the crawling time interval which are input by the user on the first input control, and completing the first configuration operation.
The first page may include a first control (i.e. a "new" control). After the user clicks it, the Web client displays a first input control for configuring the data crawling task, which can be used to enter content such as the name, code, description, crawling data source and crawling time interval of the task; once this content is filled in and saved, the first configuration operation is complete. In this way the user can configure a data crawling task through a convenient web page flow, improving data crawling efficiency.
With reference to the first aspect, in some implementations of the first aspect, after the completing the first configuration operation, the method further includes:
and the Web client displays the configuration content of the data crawling task and a second control on the first page, wherein the second control is used for triggering the starting and stopping of the data crawling task.
After the data crawling task is configured, a record of the task is displayed on the first page of the Web client, together with a corresponding second control (i.e. an operation control) used to trigger starting and stopping the task; it can also trigger editing the task, that is, changing some of its configuration content.
Correspondingly, the Web client receiving the operation of starting the data crawling task includes: the Web client receives the user's click on the second control, which triggers the start of the task.
Thus, a user can start a data crawling task with a single click on the Web client, reducing the labor cost of writing code and improving efficiency.
With reference to the first aspect, in some implementations of the first aspect, the method further includes:
The Web client displays a second page, wherein the second page is used for configuring the crawling data source;
And the Web client receives a second configuration operation on the crawling data source through the second page and sends configuration content of the crawling data source to the server for storage, wherein the second configuration operation at least comprises configuration on a website corresponding to the crawling data source.
Since the user can select a crawling data source for the data crawling task on the first page of the Web client, crawling data sources are configured in advance for later selection. When a crawling data source (i.e. a uniform resource locator (URL) data source) is configured on the Web client, the Web client displays a second page (i.e. a URL management page), on which the user can configure information such as the web address of the URL data source; the configuration content is sent to the server for storage, for example in a second data table.
Thus, the user can dynamically configure crawling data sources on the Web client, reducing the labor cost of writing code and improving data crawling efficiency.
With reference to the first aspect, in some implementations of the first aspect, the second page includes a third control, and the Web client receives, through the second page, a second configuration operation on the crawling data source, including:
The Web client responds to clicking operation of the third control by a user, and displays a second input control for configuring the crawling data source;
and receiving a website corresponding to the crawling data source input by the user on the second input control, and completing the second configuration operation.
The second page may include a third control (i.e. a "new" control). After the user clicks it, the Web client displays a second input control for configuring the crawling data source, which can be used to enter content such as the name, code, description, URL address and type of the data source; once this content is filled in and saved, the second configuration operation is complete. In this way the user can configure a crawling data source through a convenient web page flow, improving data crawling efficiency.
With reference to the first aspect, in some implementations of the first aspect, after the completing the second configuration operation, the method further includes:
and the Web client displays configuration content of the crawling data source and a fourth control on the second page, wherein the fourth control is used for triggering the configuration of the crawling rule of the crawling data source.
After the crawling data source is configured, a record of the data source is displayed on the second page of the Web client, together with a corresponding fourth control (i.e. an operation control) used to trigger operations on the data source such as editing, deletion and data extraction (including configuration of crawling rules).
With reference to the first aspect, in some implementations of the first aspect, after the Web client displays the configuration content of the crawling data source and the fourth control on the second page, the method further includes:
the Web client receives click operation of the user on the fourth control, and displays a third page for configuring the crawling rule;
receiving clicking operation of the user on a fifth control on the third page, and acquiring and displaying original message information of a website corresponding to the crawling data source;
And receiving a rule expression input by the user on a third input control on the third page, and performing crawling test on the original message information according to the rule expression.
If the user clicks the data-extraction option of the fourth control, the Web client displays a third page for configuring the crawling rule. A fifth control (i.e. a raw-message acquisition control) is displayed on this page; when the user clicks it, the Web client sends a request asking the server to query the raw message of the website corresponding to the crawling data source, and displays the raw message data. The third page also displays a third input control (i.e. a crawling-rule input box), in which the user can enter a rule expression, for example a regular expression or an XPath (XML Path Language) expression. After entering the rule expression, the user can click a crawl-test control to run the expression against the raw message and verify that it is accurate. In this way, the user can conveniently configure or update crawling rules on the Web client, reducing the labor cost of writing code.
In some implementations, after the crawling rules are configured on the Web client, they may also be sent to the server for saving; the server may store the crawling rules configured for the crawling data source in a third data table.
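As an illustration of the crawling test described above (the raw message and rule expression are hypothetical examples), a regular-expression rule can be checked against the fetched raw message before it is saved:

```python
import re

# Hypothetical raw message (an HTML fragment) returned for the data source's website.
raw_message = '<li><a href="/p/1">Title A</a><span class="heat">9527</span></li>'

# Rule expression entered in the third input control: here a regular expression
# with two capture groups for the title and the heat value.
rule = r'<a href="[^"]*">([^<]+)</a><span class="heat">(\d+)</span>'

match = re.search(rule, raw_message)
if match:
    title, heat = match.groups()
    print(title, heat)             # the crawling test succeeded; the rule is usable
else:
    print("rule matched nothing")  # the user should revise the rule expression
```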
With reference to the first aspect, in some implementations of the first aspect, the task request carries an identifier of the data crawling task, and after the Web client sends the task request to the server, the method further includes:
the server acquires a crawling data source corresponding to the data crawling task according to the identification of the data crawling task, and stores the crawling data source and crawling trigger time into a first queue, wherein the crawling trigger time is determined by the server according to the current system time and the crawling time interval.
After the data crawling task is configured, if the user triggers its start, the Web client sends a task request to the server. The task request may carry an identifier of the data crawling task, which the server uses to look up the corresponding crawling data source and crawling time interval in the first data table. The server can then compute the crawl trigger time from the current system time and the crawling time interval, and store the crawling data source together with the trigger time in the first queue (i.e. the queue to be crawled), so that data sources can later be read from this queue for crawling.
It can be understood that the first queue may store crawling data sources corresponding to several different data crawling tasks, and a single data crawling task may correspond to one or more crawling data sources.
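One way to realize the first queue (a sketch only; the heap-based layout, interval table and source codes are assumptions) is to store each crawling data source keyed by its crawl trigger time:

```python
import heapq

CRAWL_INTERVAL_S = {"url-001": 600, "url-002": 1800}  # per-source intervals (assumed)

first_queue = []  # the queue to be crawled: entries are (crawl_trigger_time, data_source)

def enqueue(source: str, now: float) -> None:
    """Store a crawling data source with its trigger time (current time + interval)."""
    heapq.heappush(first_queue, (now + CRAWL_INTERVAL_S[source], source))

# Starting a task enqueues each of its crawling data sources.
enqueue("url-001", now=0.0)
enqueue("url-002", now=0.0)
print(first_queue[0])  # (600.0, 'url-001'): the earliest trigger time is served first
```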
With reference to the first aspect, in some implementations of the first aspect, the step of the server periodically crawling web page data from a web site corresponding to the crawling data source according to the crawling time interval includes:
the server acquires the crawling data source from the first queue, and crawls the webpage data according to a website corresponding to the crawling data source under the condition that the current system time reaches the crawling trigger time;
Deleting information of the crawling data source from the first queue, and determining next crawling trigger time of the crawling data source according to the crawling time interval;
And re-storing the crawling data source and the next crawling trigger time into the first queue so as to crawl the webpage data at fixed time.
During data crawling, the server may first acquire a preset number of crawling data sources from the first queue (i.e. as the data sources to be crawled), including those corresponding to the configured data crawling task. If the current system time has reached the crawl trigger time of a crawling data source, the server crawls web page data from the website corresponding to that source. The server then deletes the crawled data source from the first queue, computes the next crawl trigger time, and re-stores the data source with the next trigger time in the first queue, thereby crawling web page data regularly and continuously and improving data crawling efficiency.
In some implementations, the server may save the crawled web page data to a storage medium for later reading when needed.
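The pop / crawl / re-enqueue cycle described above can be sketched as follows (the fetch function and queue layout are assumptions for illustration, not the patent's implementation):

```python
import heapq

def fetch(source: str) -> str:
    # Placeholder for the real page fetch plus rule-based extraction.
    return f"data from {source}"

def run_once(queue: list, interval_s: dict, now: float) -> list:
    """One scheduler pass: crawl every due source, delete it from the queue,
    then re-store it with its next crawl trigger time."""
    results = []
    while queue and queue[0][0] <= now:
        _trigger, source = heapq.heappop(queue)                    # delete from first queue
        results.append(fetch(source))                              # crawl the web page data
        heapq.heappush(queue, (now + interval_s[source], source))  # next trigger time
    return results

queue = [(0.0, "url-001")]
print(run_once(queue, {"url-001": 600}, now=0.0))  # ['data from url-001']
print(queue)                                       # [(600.0, 'url-001')]
```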
With reference to the first aspect, in some implementations of the first aspect, the crawling rule includes a filling-in setting for a field included in a crawling result, and after the obtaining the web page data, the method further includes:
and if the filling-necessary property corresponding to the first field in the webpage data is set to be filling-necessary, and the value of the first field is null, discarding the webpage data by the server.
That is, when the user configures the crawling rule on the Web client, fields included in the crawling result can be marked as required. If the first field (i.e. any field) of the crawled web page data is marked as required but its value is null, the field does not satisfy the requirement, and the server may discard that web page data. This ensures that the crawled web page data meets the user's requirements.
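A sketch of this required-field check (the field names are hypothetical): a crawled record is discarded when any field marked as required is null:

```python
REQUIRED_FIELDS = {"title", "author"}  # fields whose fill-in property is set to required

def keep_record(record: dict) -> bool:
    """Return False (discard the web page data) if any required field is missing or null."""
    return all(record.get(field) not in (None, "") for field in REQUIRED_FIELDS)

print(keep_record({"title": "A", "author": "B", "heat": None}))  # True: heat is optional
print(keep_record({"title": "A", "author": None}))               # False: discarded
```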
With reference to the first aspect, in some implementations of the first aspect, crawling the web page data according to the web address corresponding to the crawling data source includes:
the server acquires a crawling rule corresponding to the crawling data source;
and crawling contents in the websites corresponding to the crawling data sources according to the crawling rules to obtain the webpage data.
Since the server stores the crawling rule corresponding to each crawling data source in the third data table, it can obtain the rule for a given data source from that table and crawl data according to the rule, obtaining the web page data.
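As a sketch of this lookup (the table contents and rule expression are assumptions), the server fetches the rule configured for the data source and applies it to the fetched page content:

```python
import re

# Hypothetical third data table mapping each crawling data source to its rule expression.
RULE_TABLE = {"url-001": r"<title>([^<]+)</title>"}

def crawl_with_rule(source: str, page_content: str) -> list:
    rule = RULE_TABLE[source]              # crawling rule configured for this data source
    return re.findall(rule, page_content)  # apply the rule to obtain the web page data

print(crawl_with_rule("url-001", "<html><title>Example</title></html>"))  # ['Example']
```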
With reference to the first aspect, in some implementations of the first aspect, the method further includes:
under the condition that the current system time does not reach the crawling trigger time, the server calculates waiting time according to the crawling trigger time and the current system time;
Under the condition that the waiting time is longer than a first preset time length, the server executes the local lock for the first preset time length, and acquires the crawling data source from the first queue again after releasing the local lock;
And under the condition that the waiting time is not longer than a first preset time length, executing the waiting time of the local lock by the server, and acquiring the crawling data source from the first queue again after releasing the local lock.
When the server acquires a crawling data source from the first queue, the current system time may not yet have reached the crawl trigger time, in which case the server must wait until it does. However, a new data crawling task may be started during this waiting time, and the server cannot execute it while it is blocked waiting. The server therefore sets a first preset time length: if the computed waiting time is longer than the preset length, the server holds the local lock only for the preset length, avoiding the resource waste of holding the lock for a long time; if the waiting time is not longer than the preset length, the server holds the local lock for the waiting time, so that the next crawl can be executed as soon as possible.
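The lock-holding policy above amounts to capping the wait at the first preset time length; a sketch (the preset value is an assumption for illustration):

```python
FIRST_PRESET_S = 5.0  # the first preset time length (value assumed for illustration)

def lock_hold_time(trigger_time: float, now: float) -> float:
    """How long the server holds the local lock before re-reading the first queue."""
    waiting_time = trigger_time - now
    if waiting_time <= 0:
        return 0.0  # already due: crawl immediately
    # Cap the holding time so newly started tasks are not blocked for long.
    return min(waiting_time, FIRST_PRESET_S)

print(lock_hold_time(trigger_time=100.0, now=98.0))  # 2.0 (waiting time <= preset)
print(lock_hold_time(trigger_time=100.0, now=50.0))  # 5.0 (capped at the preset length)
```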
With reference to the first aspect, in some implementations of the first aspect, the data crawling system further includes an electronic device, and the method further includes:
the electronic equipment sends a data request to the server to request to acquire the crawled webpage data;
the electronic device receives and presents the web page data from the server.
If a business application corresponding to the data crawling task is installed on the user's electronic device, the application can display the data crawling results and respond to the user's search requests. The electronic device may request the stored crawling results (i.e. the above web page data) from the server and display them for the user, or it may receive a search request from the user, search the crawling results for matching content, and display it. In this way, the server can deliver the crawled web page data to the electronic device in a timely manner, improving the user experience.
In a second aspect, the present application provides a data crawling system, where the data crawling system includes a Web client and a server, and the data crawling system is configured to execute any one of the methods in the first aspect.
In a third aspect, the present application provides an apparatus included in a data crawling system, the apparatus having the functionality to implement the first aspect and its possible implementations. The functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules or units corresponding to the functions, such as a receiving module or unit, a processing module or unit, and so on.
In a fourth aspect, the present application provides a Web client, including: one or more processors, and memory;
the memory is coupled to the one or more processors, the memory is used for storing computer program code, the computer program code comprises computer instructions, and the one or more processors call the computer instructions to cause the Web client to execute the corresponding process in the technical solution of the first aspect.
In a fifth aspect, the present application provides a server comprising: one or more processors, and memory;
The memory is coupled with the one or more processors, the memory is used for storing computer program codes, the computer program codes comprise computer instructions, and the one or more processors call the computer instructions to cause a server to execute corresponding processes in the technical scheme of the first aspect.
In a sixth aspect, the present application provides a computer readable storage medium, where the computer readable storage medium includes instructions that, when executed on a data crawling system, cause the data crawling system to perform any one of the methods of the first aspect.
In a seventh aspect, the present application provides a computer program product comprising: computer program code which, when run on a data crawling system, causes the data crawling system to carry out any of the methods of the solutions of the first aspect.
Drawings
FIG. 1 is a schematic diagram of a system architecture of an exemplary data crawling method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an example of a Web management page on a Web client according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an example of a data crawling task management page on a Web client according to an embodiment of the present application;
FIG. 4 is a schematic diagram of another example of a task management page for data crawling on a Web client according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a data crawling task management page on a Web client according to another embodiment of the present application;
FIG. 6 is a schematic diagram of a data crawling task management page on a Web client according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a URL management page on a Web client according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a URL management page on another example Web client provided by an embodiment of the present application;
FIG. 9 is a schematic diagram of a URL management page on a Web client according to another embodiment of the present application;
FIG. 10 is a schematic diagram of a URL management page on a Web client according to another embodiment of the present application;
FIG. 11 is a flowchart illustrating an example of a data crawling method according to an embodiment of the present application;
FIG. 12 is a flowchart of another exemplary data crawling method according to an embodiment of the present application;
FIG. 13 is a schematic diagram of an example of a Web client according to an embodiment of the present application;
FIG. 14 is a schematic diagram of a server according to an embodiment of the present application;
Fig. 15 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below with reference to the accompanying drawings. In the description of the embodiments, unless otherwise indicated, "/" means "or"; for example, A/B may represent A or B. "And/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone. In addition, in the description of the embodiments, "plurality" means two or more.
The terms "first", "second", "third" and the like are used below for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features. Thus, a feature defined with "first", "second" or "third" may explicitly or implicitly include one or more such features.
In the current environment of explosive growth in network information, it is difficult for users to obtain the information they need from such a huge volume of data, and the concept of the search engine was proposed for this reason. A search engine is a portal through which users access the World Wide Web and which helps them find the information they need. However, since users in different fields and with different backgrounds often have different search purposes and requirements, the results returned by a search engine may also contain web content the user does not care about. To solve this problem, web crawlers that capture web page data in a targeted way were developed: they can automatically download web page data and then filter and capture the information the user needs according to a given search strategy, and they are widely used.
During data crawling, because web page data is complex and its content irregular, sometimes only a small part of the data needs to be captured. For example, when crawling trending-list content that users pay close attention to, only data such as the title, author and heat value of each entry needs to be captured, and the rest can be ignored. Developers therefore need to write code and tune the regular expressions used in data crawling, and the code must be changed whenever user requirements change. Clearly, writing code in this way consumes a great deal of manpower, and the release cycle after the code is written is long, so data crawling efficiency is low.
In view of this, the embodiment of the application provides a data crawling method, which can enable research personnel to dynamically configure a data crawling process on a Web management page, reduce the labor cost for writing codes, and improve the flexibility of data crawling in the dynamic configuration process.
In the following, we first introduce an application scenario of the data crawling method of the present application, and in some embodiments, the data crawling method may be applied to a system architecture shown in fig. 1, where the system architecture may include a Web client, a server, and an electronic device, where communication may be performed between the Web client and the server, and communication may also be performed between the electronic device and the server.
A developer can log in to and access a Web management page on the Web client. The Web management page provides a crawler management function, under which functions such as data crawling task management and uniform resource locator (URL) management are provided. The data crawling task management function may be used to create a data crawling task and to control the start and stop of the data crawling task, and a crawling time interval may be set so that data crawling is performed continuously and regularly. The URL management function may be used to create and configure URL data sources; when a data crawling task is created, the created URL data sources can be selected to specify which data sources (that is, which URL web addresses) are to be crawled. After the start of a data crawling task is triggered on the Web client, the Web client can send a task request to the server, and the server can execute the data crawling task in the background and store the data crawling results. In some implementations, the data crawling task may be performed continuously and the stored data crawling results updated continuously.
For the electronic device (for example, a mobile phone), if a service application corresponding to the data crawling task is installed on it, the service application may display the data crawling results and may respond to a user's search request. For example, the service application may be a smart search APP; the electronic device may then request the stored data crawling results from the server and display them for the user to view, or the electronic device may receive the user's search request, search the data crawling results for the corresponding content, and display it. In some implementations, the electronic device may request the latest data crawling results from the server when the user performs a refresh operation through the service application.
Therefore, on the basis of the system architecture of fig. 1, a developer can dynamically configure the data crawling process on the Web management page, which is more convenient and faster than the traditional approach of writing code.
In other scenarios, the system architecture described above may also be used for the timed acquisition of some interface data, for example, corresponding interface data may be automatically acquired through configuration of interface addresses and configuration of acquisition content.
Next, we introduce the procedure for configuring the data crawling process on the Web client. Taking a personal computer (PC) as an example, as shown in fig. 2, a developer may log in on the Web client to access a Web management page; for example, the Web management page may be a smart search operation cloud page. On this page, a crawler management control 21 is presented on the left taskbar; after the crawler management control 21 is clicked, sub-function controls including, but not limited to, a data crawling task management control 22 and a URL management control 23 are displayed below it, and after one of these controls is clicked, the right part of the page displays the corresponding configuration page.
Illustratively, when a developer clicks the data crawling task management control 22, a data crawling task management page as shown in fig. 3 is presented on the Web client, and functions including, but not limited to, query and addition are provided on the page. With the query function, a created data crawling task may be queried by the name and/or code of the data crawling task, where the name is the task name defined by the developer when the task was created (e.g., "novel") and the code is likewise defined by the developer at creation time (e.g., the code corresponding to "novel"). Taking fig. 3 as an example, after "novel" is entered in the name input box 31 and the query control 32 is clicked, the created data crawling task for novels is displayed on the current page. With the addition function, a new data crawling task can be created: taking fig. 4 as an example, clicking the new control 41 displays the content to be filled in for a new data crawling task on the current page; after the developer fills in the name, code, description, crawling data source, crawling time interval, and so on, clicking the save control 42 creates the data crawling task. The crawling time interval implements timed data crawling: after the first data crawl is completed, the second data crawl is performed after the set time interval, and so on, so that durable timed data crawling is performed.
As for filling in the crawling data source when a data crawling task is newly added, as shown in fig. 5, the position for filling in the crawling data source may be presented in the form of a selection control 51. When the selection control 51 is clicked, a list of the created URL data sources is displayed on the current page, and the developer can then select the URL data sources to crawl from the list. For example, in fig. 5, the URL data sources selected to be crawled are hot list A and hot list B.
After a data crawling task is successfully created, it is displayed on the data crawling task management page. Taking a crawling task for a social hot list as an example, as shown in fig. 6, the record of the data crawling task also corresponds to an operation control 61, which is used to trigger operations such as starting, stopping, and editing the task. If the developer clicks start, that is, triggers the data crawling task to start executing, the Web client sends a task request to the server so that the server executes the corresponding data crawling task. After the task starts, if the developer clicks stop, the Web client sends a stop request to the server to trigger the data crawling task to stop executing. If the developer clicks edit, the page of fig. 4 is redisplayed so that the developer can change the name, code, description, crawling data source, crawling time interval, and so on.
In some implementations, after a data crawling task is successfully created, the Web client may further generate a unique identifier corresponding to the data crawling task, and send the identifier, name, code, description, crawling data source, crawling time interval, and other information corresponding to the task to the server for storage. Optionally, the Web client may also use the name of the data crawling task as its unique identifier. Optionally, the server may store the received information corresponding to the data crawling task in a first data table.
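For illustration, the task record sent to the server for storage in the first data table might be shaped as follows. The field names here are hypothetical, chosen only to mirror the information listed above (identifier, name, code, description, crawling data source, crawling time interval), and are not taken from the actual implementation:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical layout of one record in the "first data table";
# all field names are illustrative.
@dataclass
class CrawlTaskRecord:
    task_id: str                  # unique identifier generated by the Web client
    name: str                     # e.g. "social hot list"
    code: str
    description: str
    data_source_ids: List[str] = field(default_factory=list)  # selected URL data sources
    crawl_interval_s: int = 120   # crawling time interval, in seconds

task = CrawlTaskRecord("t-001", "social hot list", "HOT", "crawl hot lists",
                       ["hotlist-A", "hotlist-B"], 120)
print(task.data_source_ids)
```

The identifiers in `data_source_ids` would later be used to look up the URL addresses stored for each URL data source.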
As can be seen from the description of fig. 5, since a crawling data source (i.e., a URL data source) needs to be selected when a data crawling task is newly added, crawling data sources need to be configured in advance for subsequent selection. The process of configuring a crawling data source may be implemented through the page corresponding to the URL management control 23 in fig. 2. For example, when the developer clicks the URL management control 23, a URL management page as shown in fig. 7 is displayed on the Web client, and functions including, but not limited to, query and addition are provided on the page. With the query function, a created URL data source may be queried by the name and/or URL address of the URL data source, where the name is the URL name defined by the developer when the URL data source was created (e.g., hot list A) and the URL address is the address entered by the developer at creation time (e.g., the URL address corresponding to hot list A is https://rebang.aaa). Taking fig. 7 as an example, clicking the query control 72 after entering "hot list A" in the name input box 71 displays the URL data source information of the created hot list A on the current page. With the addition function, a new URL data source can be created: taking fig. 8 as an example, clicking the new control 81 displays the content to be filled in for a new URL data source on the current page; after the developer fills in the name, code, description, URL address, type, and so on, clicking the save control 82 creates the URL data source.
In some implementations, after a URL data source is successfully created, the Web client may further generate a unique identifier corresponding to the URL data source, and send the identifier, name, code, description, URL address, type, and other information corresponding to the URL data source to the server for storage. Optionally, the Web client may also use the name of the URL data source as its unique identifier. Optionally, the server may store the received information corresponding to the URL data source in a second data table.
After a URL data source is successfully created, it is displayed on the URL management page. Taking the created URL data source of hot list B as an example, as shown in fig. 9, the record of the URL data source also corresponds to an operation control 91, which is used to trigger operations such as editing, deleting, data extraction, data crawling, and data querying. If the developer clicks edit, the Web client redisplays the page of fig. 8 so that the developer can change the name, code, description, URL address, type, and so on. If the developer clicks delete, the Web client is triggered to delete the URL data source. If the developer clicks data extraction, a crawling rule can be configured for the URL data source: when a data crawling task crawls the URL data source, only part of the content at the corresponding URL address needs to be crawled, so a crawling rule must be configured so that only the required content is crawled.
The process of configuring a crawling rule for a URL data source may be as shown in fig. 10: after the developer clicks data extraction in the operation control 91 in fig. 9, the Web client jumps to the page shown in fig. 10. On this page, the original source of the URL data source, the crawler protocol restrictions, and the corresponding URL address are displayed, along with an original message acquisition control 92. When the original message acquisition control 92 is clicked, the Web client sends a request to the server to query the original message of the web address corresponding to the URL data source, and displays the data of the original message in the message information display frame 93. It can be understood that the message information displayed here is the entire content of the web address corresponding to the URL data source; after the crawling rule is configured, the required content can be crawled from this message information.
After obtaining the original message of the web address corresponding to the URL data source, the developer may enter a rule expression in the crawling rule input box 94. In some implementations, the rule expression may be a regular expression (regex), which consists of combinations of ordinary characters (e.g., the letters a to z) and special characters (called "metacharacters"); it may also be an XML path language (XPath) expression, XPath being a language for locating parts of an XML document. The developer may select either kind of expression as desired. Illustratively, fig. 10 takes an entered regular expression as an example. After the rule expression is entered, the crawling test control 95 may be clicked; that is, the original message is crawled according to the entered rule expression to test whether the expression is accurate. The information captured by the crawling test may be displayed in the post-crawling message information display frame 96. For example, as shown in fig. 10, the crawled result includes a field name, a field value, and a required option. The field name (for example, title) may be filled in automatically by the Web client according to the rule expression, or filled in manually by the developer; the field value is the title information of each item in the crawled original message; and the required option indicates whether the field value must be filled in. If yes is selected, the field value must not be empty, and if the field value is empty in a subsequent data crawl, the corresponding data crawling result may be discarded.
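As a minimal sketch of how a regex rule expression pulls field values out of an original message, consider the following. The sample markup and the patterns are invented for illustration and are not the actual page format or the configured expressions:

```python
import re

# Invented sample of an "original message" (raw page content) with two
# hot-list entries; the real structure is site-specific.
raw = ('<li class="item"><a class="title">Story One</a>'
       '<span class="hot">1024</span></li>'
       '<li class="item"><a class="title">Story Two</a>'
       '<span class="hot">512</span></li>')

title_rule = r'<a class="title">(.*?)</a>'    # rule expression for the "title" field
hot_rule = r'<span class="hot">(\d+)</span>'  # rule expression for the "hotness" field

titles = re.findall(title_rule, raw)          # field values for "title"
hotness = [int(h) for h in re.findall(hot_rule, raw)]
print(titles)   # ['Story One', 'Story Two']
print(hotness)  # [1024, 512]
```

A crawling test as described above would run exactly such an extraction against the original message and display the captured field values for the developer to verify.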
If the crawling test result shows no abnormality, the developer can click the save control 97, and the Web client then sends the crawling rule (including the rule expression and the setting of the required option) and the crawling test result to the server for saving. It can be understood that the saved content has a correspondence with the identifier of the URL data source; that is, the corresponding crawling rule can be found through the identifier of the URL data source for subsequent data crawling. Optionally, the server may store the crawling rules configured for URL data sources in a third data table, i.e., the third data table stores the configured rule expressions and the settings of the required options. It can further be understood that, because the format and data of the original message at the web address corresponding to the URL data source may change, the rule expression can be adjusted periodically so that accurate message information is crawled.
Through the above configuration process, the configuration of the URL data source and the data crawling task can be completed, and if the developer clicks the operation control 61 in fig. 6 to start the task, the Web client sends a task request to the server to execute the corresponding data crawling task by the server.
Next, we will introduce a detailed process of the server performing the data crawling task. As shown in fig. 11, the process may include:
S11, the server receives a task request of data crawling sent by the Web client.
The task request may carry an identifier corresponding to the data crawling task, for example, may be a task ID, a name of the data crawling task, and the like.
S12, the server starts a data crawling service and starts a persistence thread.
Because the data crawling task is created by setting the crawling time interval, that is, the data crawling needs to be continuously performed, the server needs to start a persistence thread for continuously acquiring the data source to be crawled from the queue to be crawled.
In some implementations, the server may start the persistence thread using the @Component and @PostConstruct annotations, and keep the persistence thread looping through while(true) logic.
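The @Component/@PostConstruct startup described above can be sketched in plain Python as a background thread that keeps polling a queue of data sources to be crawled; a stop event stands in for while(true) so the sketch terminates, and all names are illustrative:

```python
import queue
import threading
import time

# Queue of data sources to be crawled and a record of what the thread handled.
to_crawl = queue.Queue()
handled = []
stop = threading.Event()

def persistence_loop():
    """Continuously pull data sources from the queue to be crawled."""
    while not stop.is_set():              # plays the role of while(true)
        try:
            source = to_crawl.get(timeout=0.05)
        except queue.Empty:
            continue                      # nothing queued yet; keep polling
        handled.append(source)            # real service would dispatch a crawl here

t = threading.Thread(target=persistence_loop, daemon=True)
t.start()                                 # plays the role of @PostConstruct startup
to_crawl.put("hotlist-A")
to_crawl.put("hotlist-B")
time.sleep(0.3)                           # give the thread time to drain the queue
stop.set()
t.join()
print(handled)
```

The key property, also true of the annotated Java version, is that the loop runs for the lifetime of the service rather than once per request.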
S13, the server locks the data crawling service, and if the locking is successful, S15 is executed.
Because multiple servers may be deployed in an actual scenario and different servers may execute different service processes, if the data crawling service in the current server is to execute the current data crawling task, the current data crawling service may be locked to prevent other servers or other services from repeatedly executing the current data crawling task.
In some implementations, the server may add a local lock and a distributed lock to the data crawling service, where the local lock is used to prevent other services from concurrently reading and writing the data corresponding to the data crawling task, which would produce inconsistent data, and the distributed lock is used to prevent multiple services from repeatedly executing the current data crawling task. Optionally, the server may use the setnx command to acquire the distributed lock and set the expiration time of the lock to N, for example 1 minute. The expiration time is set to prevent deadlock: if the current server fails while still holding the lock, other servers cannot execute the corresponding data crawling service, so the data crawling task cannot be executed; with an expiration time set, the lock is released after time N, so that the data crawling task can continue to execute.
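The role of the expiration time can be illustrated with a minimal in-memory stand-in for a setnx-style lock with a TTL. This is not a Redis client, only a sketch of the acquire/expire semantics the text relies on:

```python
import time

class FakeDistributedLock:
    """In-memory stand-in for a Redis-style setnx lock with expiration."""

    def __init__(self):
        self._store = {}  # lock key -> expiry timestamp

    def setnx(self, key, ttl_s):
        """Acquire the lock only if no unexpired holder exists."""
        now = time.monotonic()
        expiry = self._store.get(key)
        if expiry is not None and expiry > now:
            return False                 # another service holds a live lock
        self._store[key] = now + ttl_s   # acquire and set the expiration time
        return True

    def delete(self, key):
        self._store.pop(key, None)       # explicit unlock

lock = FakeDistributedLock()
assert lock.setnx("crawl-task", ttl_s=0.1)      # first service acquires the lock
assert not lock.setnx("crawl-task", ttl_s=0.1)  # a second service is rejected
time.sleep(0.15)                                # holder "crashes"; the TTL passes
assert lock.setnx("crawl-task", ttl_s=0.1)      # lock is re-acquirable: no deadlock
```

Without the TTL, the third acquisition would fail forever after the holder crashed, which is exactly the deadlock the expiration time prevents.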
S14, the server acquires information of the data source to be crawled according to the identification corresponding to the data crawling task in the task request, and stores the information into the queue to be crawled.
It should be noted that, the process of acquiring the data source information to be crawled in S14 may be performed synchronously with the process of starting the data crawling service in S12-S13, that is, the server may perform multiple preparation tasks of the data crawling task at the same time.
In S14, because the task request received by the server carries the identifier corresponding to the data crawling task, and because the Web client sent the identifier, name, code, description, crawling data source, crawling time interval, and other information corresponding to the data crawling task to the server for storage after the task was successfully created, the server can find the corresponding crawling data source (i.e., the data source to be crawled) according to the identifier corresponding to the data crawling task. Likewise, after a URL data source was successfully created, the Web client sent the identifier, name, code, description, URL address, type, and other information corresponding to the URL data source to the server for storage, so the server can find the identifier, URL address, and other information of the crawling data source corresponding to the data crawling task. Specifically, the server may find the corresponding crawling data source (i.e., URL data source) in the first data table according to the identifier of the data crawling task, and then find the corresponding URL address in the second data table according to the identifier of the URL data source, so that the data crawling task, the URL data source, the URL address, and other information correspond to one another. In this way, the server can acquire the information of the data source to be crawled according to the identifier carried in the task request, where this information includes, but is not limited to, the identifier of the data source, the URL address, and the crawling trigger time, and the server stores it into the queue to be crawled. For example, if the data crawling task is the crawling task for the social hot list in fig. 6, the data sources to be crawled corresponding to it are hot list A and hot list B, and the server may store the information corresponding to hot list A and hot list B into the queue to be crawled.
The crawling trigger time here may be the time determined by the system time plus the crawling time interval. Illustratively, if the system time when the server receives the task request is 10:00:00 and the crawling time interval is 2 minutes, the crawling trigger time of the corresponding data crawling task is 10:02:00, and the server can perform data crawling according to this crawling trigger time.
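The computation above can be written as a one-line sketch using the 10:00:00 / 2-minute example from the text (the date is an arbitrary placeholder):

```python
from datetime import datetime, timedelta

# Crawling trigger time = system time + crawling time interval.
def crawl_trigger_time(system_time: datetime, interval: timedelta) -> datetime:
    return system_time + interval

received = datetime(2024, 1, 1, 10, 0, 0)   # time the task request is received
trigger = crawl_trigger_time(received, timedelta(minutes=2))
print(trigger.strftime("%H:%M:%S"))  # 10:02:00
```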
In some implementations, the queue to be crawled may be a Redis zset queue, which is an ordered queue in the Redis database. When storing the information of a data source to be crawled into the zset queue, the server may use the identifier of the data source as the value and the crawling trigger time as the score, so that data sources are stored in order of their crawling trigger times. For example, the information of the data sources to be crawled may be stored in the zset queue in order of crawling trigger time from earliest to latest.
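The ordering behavior of the zset queue can be imitated with a small in-memory stand-in: members are data-source identifiers, scores are crawling trigger times, and entries stay sorted by score. This is only a sketch of the semantics, not a Redis client:

```python
import bisect

class ScoredQueue:
    """Toy stand-in for a Redis zset: entries kept sorted by score."""

    def __init__(self):
        self._entries = []  # list of (score, member), kept sorted by score

    def zadd(self, member, score):
        bisect.insort(self._entries, (score, member))

    def zrange(self, n):
        """Return the first n members, earliest trigger time first."""
        return [m for _, m in self._entries[:n]]

q = ScoredQueue()
q.zadd("D", score=600)   # trigger in 10 minutes
q.zadd("A", score=0)     # trigger now
q.zadd("B", score=300)   # trigger in 5 minutes
print(q.zrange(2))       # ['A', 'B']
```

Reading the first n members here corresponds to the server later taking the preset number of earliest-due data sources from the queue to be crawled.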
It can be understood that, because developers can trigger and start multiple different data crawling tasks at different times on the Web client, the Web client needs to send multiple task requests to the server to request the start of the different data crawling tasks. Information about the data sources to be crawled corresponding to the different data crawling tasks is therefore all stored in the queue to be crawled, and one data crawling task may also correspond to multiple data sources to be crawled. Illustratively, the Web client starts 3 data crawling tasks: the 1st corresponds to 1 data source to be crawled (A) with a crawling trigger time of 10:00:00, the 2nd corresponds to 2 data sources to be crawled (B and C) with a crawling trigger time of 10:05:00, and the 3rd corresponds to 2 data sources to be crawled (D and E) with a crawling trigger time of 10:10:00. The information of data sources A, B, C, D, and E to be crawled is all stored in the queue to be crawled.
S15, the server acquires a preset number of data sources to be crawled from the queue to be crawled.
Because the persistence thread started by the server continuously performs data crawling, more and more data sources to be crawled are stored in the queue to be crawled. If all of them were crawled at the same time, the processing load of the server would become too high, so the server acquires a preset number of data sources to be crawled from the queue each time. For example, the preset number may be 20.
In some implementations, when the queue to be crawled is a zset queue, the server may acquire the first 20 data sources to be crawled in the zset queue each time, that is, the first 20 data sources to be crawled when sorted by crawling trigger time from earliest to latest.
S16, the server determines whether the current system time reaches the crawling trigger time of the first data source to be crawled, if so, S17 is executed, and if not, S21 is executed.
The first data source to be crawled is any one of the preset number of data sources to be crawled.
As can be seen from the above description, each data source to be crawled corresponds to a crawling trigger time, and when the corresponding crawling trigger time needs to be reached, the server starts crawling the corresponding data source, so that the server can determine whether the current system time reaches the crawling trigger time of the first data source to be crawled.
In some implementations, the server may subtract the crawling trigger time of the first data source to be crawled from the current system time to obtain a time difference. If the time difference is greater than or equal to 0, the current system time has reached the crawling trigger time of the first data source to be crawled, and the server needs to start executing the data crawling process; if the time difference is less than 0, the current system time has not yet reached the crawling trigger time, and the server may continue waiting.
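The S16 decision reduces to a sign check on the time difference, which can be sketched as follows (the timestamps are illustrative):

```python
from datetime import datetime

# Trigger time is reached when (current system time - crawling trigger time) >= 0.
def trigger_reached(now: datetime, trigger: datetime) -> bool:
    return (now - trigger).total_seconds() >= 0

trigger = datetime(2024, 1, 1, 10, 2, 0)
print(trigger_reached(datetime(2024, 1, 1, 10, 2, 30), trigger))  # True: crawl now
print(trigger_reached(datetime(2024, 1, 1, 10, 1, 0), trigger))   # False: keep waiting
```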
It can be understood that, because the server obtains the preset number of data sources to be crawled, it is required to determine whether the crawling trigger time of the data sources to be crawled is reached for each data source to be crawled, step S17 may be performed for a plurality of data sources reaching the crawling trigger time, and step S21 may be performed for a plurality of data sources not reaching the crawling trigger time.
S17, the server calculates first waiting time of the first data source to be crawled.
The first waiting time waittime may be obtained as the crawling trigger time minus the current system time, and the calculated first waiting time is used subsequently to control the duration of the local lock.
It will be appreciated that since the current system time has reached the crawling trigger time for the first data source to be crawled, the first waiting time waittime may be less than or equal to 0, and the absolute value of waittime may be used subsequently to control the time of the local lock.
S18, the server stores the URL address corresponding to the first data source to be crawled into a data crawling queue for the crawler engine to crawl the data.
That is, for each first data source to be crawled that reaches the crawling trigger time, the server may store URL addresses corresponding to the data sources into the data crawling queue, so that the crawler engine sequentially reads URL addresses from the data crawling queue to crawl data.
The process of crawling data by the crawler engine is described in detail in fig. 12 below, and is not described in detail herein.
S19, the server deletes the information of the first data source to be crawled from the queue to be crawled, and determines the next crawling trigger time of the first data source to be crawled.
S20, the server stores the updated information of the first data source to be crawled into the queue to be crawled again, and then S22 is executed.
Because the server has stored the URL address corresponding to the first data source to be crawled into the data crawling queue, the subsequent crawler engine can execute the data crawling task. Therefore, the server can delete the information of the first data source to be crawled and, after determining the next crawling trigger time, store the updated information of the first data source to be crawled into the queue to be crawled again for the next round of crawling, thereby realizing a continuous, automatic, timed data crawling process. It will be appreciated that the updated information of the first data source to be crawled includes, but is not limited to, the identifier of the data source, the URL address, and the next crawling trigger time.
In some implementations, the manner in which the server determines the next crawling trigger time of the first data source to be crawled may include: acquiring the crawling time interval of the data crawling task corresponding to the first data source to be crawled, and then adding that crawling time interval to the current system time to obtain the next crawling trigger time. Thus, the continuous data crawling process described above may also be understood as a timed data crawling process, with the period being the crawling time interval.
It will also be appreciated that the server will perform the steps S18-S20 described above for each first data source to be crawled.
In some implementations, if the crawling time interval is not configured on the Web client for the data crawling task, the server may also set a default crawling time interval, so that the data crawling process is continuously performed at the set crawling time interval.
S21, the server calculates second waiting time of the first data source to be crawled.
The first data source to be crawled here is a data source to be crawled that has not reached its crawling trigger time, and the second waiting time waittime may be obtained as the crawling trigger time minus the current system time; the calculated second waiting time is used subsequently to control the duration of the local lock.
S22, the server deletes the distributed lock.
The purpose of the server deleting the distributed lock is to prevent deadlock caused by the current service always occupying it: for example, while a data crawling task that has not reached its crawling trigger time is waiting to be executed, the corresponding service occupies the distributed lock, and if a new data crawling task comes in, it may not be executed. Therefore, the server may delete the distributed lock first and re-acquire it in the next round of data crawling.
S23, the server judges whether the first waiting time or the second waiting time is larger than a first preset duration, if so, S24 is executed, and if not, S25 is executed.
S24, the server executes the local lock for a first preset time period, and then releases the lock and returns to S13.
S25, the server executes the first waiting time or the second waiting time of the local lock, and then the lock is released and returns to S13.
That is, while waiting for the next data crawl, the server holds the local lock according to the first waiting time or the second waiting time. To avoid the next execution occurring later than the crawling trigger time, the server determines whether the waiting time is longer than a first preset duration (for example, 1 minute): if so, it locks locally for the first preset duration, avoiding the resource waste caused by long-term locking; if not, it locks locally for the first waiting time or the second waiting time, so that the next crawling task is executed as soon as possible. After holding the local lock and then releasing it, the server returns to step S13 to automatically start the next round of the crawling task.
Illustratively, assume that the crawling trigger time of data crawling task A is 10:35:00 and the current system time is 10:30:30, i.e., the current system time has not reached the crawling trigger time, and the second waiting time is calculated to be 4 minutes 30 seconds. With a first preset duration of 1 minute, the second waiting time is longer than 1 minute, so the server locks locally for 1 minute, then releases the lock and returns to S13, i.e., starts the next round of the data crawling process. Because data crawling task A has not yet been executed, the next round again evaluates task A; after the 1-minute local lock the current system time is 10:31:30, which still has not reached the crawling trigger time, so the second waiting time is calculated to be 3 minutes 30 seconds, another 1-minute local lock is needed, and so on. When the current system time reaches 10:34:30, the second waiting time is calculated to be 30 seconds, which is less than 1 minute, so the server locks locally for 30 seconds, then releases the lock and returns to S13. The current system time then reaches 10:35:00, the crawling trigger time, and the server can store the URL address corresponding to data crawling task A into the data crawling queue so that the crawler engine performs data crawling.
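The capped waiting behavior in this example can be sketched as a small function that splits a long wait into rounds of at most the first preset duration (60 seconds here), matching the 4-minute-30-second walkthrough above:

```python
CAP_S = 60  # first preset duration, in seconds

def lock_durations(wait_s: float, cap_s: float = CAP_S):
    """Sketch of S23-S25: the sequence of local-lock durations until the
    trigger time, each round capped at the first preset duration."""
    rounds = []
    while wait_s > 0:
        d = min(wait_s, cap_s)  # lock for the cap, or the remaining wait if shorter
        rounds.append(d)
        wait_s -= d             # each round re-checks the (shrinking) waiting time
    return rounds

# 4 min 30 s until the trigger time -> four 1-minute locks, then a 30 s lock.
print(lock_durations(270))  # [60, 60, 60, 60, 30]
```

Each element corresponds to one pass through S13 in the loop described above; the final short round ends exactly at the crawling trigger time.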
In some implementations, the above steps performed by the server may be performed by a timed task engine running within the server.
After the server stores the URL address corresponding to the first data source to be crawled into the data crawling queue in S18, the crawler engine (running in the server) recognizes that the data crawling queue contains data, and data crawling can then begin. As shown in fig. 12, taking the server as the execution body, the data crawling process may include:
S31, the server starts a persistence thread.
It should be noted that the persistence thread started here is not the same thread as the persistence thread in S12: the persistence thread in S12 is used to continuously obtain data sources to be crawled from the queue to be crawled, so as to store the URL addresses of those that have reached their crawling trigger time into the data crawling queue, while the persistence thread in S31 is used to continuously crawl data according to the URL addresses.
S32, the server initializes a web page downloader and a download thread pool.
The web page downloader is a data downloader used to download the corresponding web page data according to a URL address; the server may initialize the web page downloader according to the specific data crawling service, or use a default web page downloader. The download thread pool is used to download web page data concurrently, i.e. web page data corresponding to different URL addresses can be downloaded in parallel; the server may initialize the download thread pool according to the specific data crawling service, or create a default download thread pool.
S33, the server judges whether the state of the download thread pool is running, if so, S34 is executed, and if not, the flow is ended.
That is, when the state of the download thread pool is normal (running), the server may perform the subsequent data crawling process; if the state of the download thread pool is not running (an abnormality may have occurred, or it has stopped running), the current flow ends.
S34, the server acquires the URL address corresponding to the first data source to be crawled from the data crawling queue.
S35, judging whether the acquired URL address is empty, executing S36 if the URL address is empty, and executing S37 if the URL address is not empty.
Because the server has stored the URL address corresponding to the first data source to be crawled into the data crawling queue, that URL address can now be obtained from the data crawling queue. Considering some abnormal factors, such as read/write exceptions, the URL address acquired by the server may be empty. The server may therefore determine whether the acquired URL address is empty: if it is not empty, the data crawling process proceeds normally; if it is empty, the current round of data crawling is skipped and the next round is performed. For example, the URL address obtained by the server may be the address of the above-mentioned hot list A.
S36, the server executes the local lock for a second preset time period, and then releases the lock to return to S33.
When the acquired URL address is empty, the server may hold the local lock for a second preset duration, to avoid the resource waste caused by immediately starting the next round of the loop. In some implementations, the second preset duration may be 1 second.
It will be appreciated that when the server releases the lock back to S33 and then retrieves the URL address from the data crawling queue, the next new URL address is retrieved.
S37, the server submits a downloading task through the downloading thread pool and downloads the webpage data corresponding to the URL.
When the acquired URL address is not empty, the server can trigger downloading of the web page data corresponding to the URL address; specifically, a download task can be submitted through the download thread pool to download the web page data corresponding to the URL address.
In some implementations, the maximum number of threads in the download thread pool may be 2C+1 and the minimum number of threads may also be 2C+1, where C is the number of CPU cores in the server; setting the maximum and minimum numbers of threads to the same value makes maximum use of CPU resources. The maximum length of the download task queue in the download thread pool may be M (e.g., M is 1000), to ensure as far as possible that all download tasks are submitted successfully. If a download task fails to be submitted, the server may record the corresponding download task, for example in a data table, and may later reprocess the download tasks in that data table.
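One possible sketch of such a pool is shown below. The 2C+1 sizing and the bounded task queue of length M come from the text; the class name, the failure list standing in for the data table, and the worker-loop details are illustrative assumptions.

```python
import os
import queue
import threading

C = os.cpu_count() or 1
POOL_SIZE = 2 * C + 1          # max threads == min threads == 2C+1
MAX_QUEUE_LEN = 1000           # M: bound on the download task queue

class DownloadThreadPool:
    """Fixed-size pool: 2C+1 worker threads draining a bounded task queue."""

    def __init__(self, workers=POOL_SIZE, max_queue=MAX_QUEUE_LEN):
        self.tasks = queue.Queue(maxsize=max_queue)
        self.failed = []  # tasks whose submission failed, kept for reprocessing
        self.workers = [threading.Thread(target=self._run, daemon=True)
                        for _ in range(workers)]
        for w in self.workers:
            w.start()

    def submit(self, fn, *args):
        """Submit a download task; returns False if the bounded queue is full."""
        try:
            self.tasks.put_nowait((fn, args))
            return True
        except queue.Full:
            # Record the failed submission (e.g. into a data table) for retry.
            self.failed.append((fn, args))
            return False

    def _run(self):
        while True:
            fn, args = self.tasks.get()
            try:
                fn(*args)
            finally:
                self.tasks.task_done()

    def join(self):
        self.tasks.join()
```

A caller submits download functions through `submit`; anything rejected by the full queue ends up in `failed` rather than being silently dropped, mirroring the record-and-reprocess behavior described above.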
In some implementations, the process of downloading the web page data corresponding to the URL address may be performed by a web page downloader, and the specific process may include:
A: the web page downloader obtains the download linker.
The download linker is a persistent (long-lived) connection associated with the web page's domain name, through which the data to be acquired can be downloaded; for example, the download linker can be implemented using an HTTP connection pool.
B: an http request message is set.
The HTTP request message may carry information such as a request header, a request body, a socket timeout (socketTimeout), and a connection timeout (connectTimeout). The request header describes metadata of the HTTP request message, including information such as the request method and request protocol, and the request body is used to pass the parameters or content required by the HTTP request to the web server.
C: and requesting webpage data from a webpage server according to the http request message.
D: acquiring a web page data stream and converting the data stream into a text character string.
That is, after the HTTP request message is set, the web page downloader may request the corresponding web page data from the web server. In general, the web page data initially acquired is in a data stream format, and the data stream can then be converted into a text string (String).
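Steps A–D can be sketched with Python's standard library as follows. The patent describes the download linker as an HTTP connection pool; the use of `urllib.request`, the header value, and the function names here are assumptions for illustration, not the described implementation.

```python
import io
import urllib.request

def build_request(url):
    """Step B: set up the HTTP request message (the header here is illustrative;
    real headers would describe the request's metadata as configured)."""
    req = urllib.request.Request(url)
    req.add_header("User-Agent", "crawler-sketch/0.1")
    return req

def stream_to_text(stream, encoding="utf-8"):
    """Step D: read the web page data stream and convert it to a text string."""
    return stream.read().decode(encoding, errors="replace")

def download(url, timeout=5.0):
    """Steps A and C: open a connection, send the request, return the page text."""
    req = build_request(url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return stream_to_text(resp)
```

The stream-to-string conversion can be exercised without any network access by feeding it an in-memory byte stream.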
S38, the server acquires a crawling rule corresponding to the first data source to be crawled.
Since the crawling rules of the URL data sources are configured in fig. 10, and the first data source to be crawled is one of the configured URL data sources, the server may acquire the crawling rule corresponding to the first data source to be crawled. In some implementations, if the server stores the information, such as the crawling rules of the URL data sources sent by the Web client, in the third data table, the server may obtain the crawling rule corresponding to the first data source to be crawled from the third data table.
S39, judging whether the crawling rule is empty, ending the flow if the crawling rule is empty, and executing S40 if the crawling rule is not empty.
Considering situations such as read/write exceptions or the crawling rule of a URL data source not being fully configured, if the crawling rule acquired by the server is empty, the flow may end; if it is not empty, the subsequent rule matching process may be executed.
And S40, the server matches the webpage data according to the crawling rules.
Wherein the server may match the acquired web page data according to a rule expression in the crawling rule (e.g., the regular expression input in fig. 10 described above).
S41, judging whether the crawling rule is matched or not, ending the flow if the crawling rule is not matched, and executing S42 if the crawling rule is matched.
That is, it is determined whether the crawling rule used in S40 matches the data; if so, subsequent processing is performed on the crawled data.
S42, the server circularly acquires matching data and acquires data corresponding to the necessary filling options according to the crawling rules.
Because the amount of web page data corresponding to one URL address is relatively large, the server may acquire the matching data in a loop, to reduce the probability of missing data. In some implementations, after the server obtains the matching data, the matching data may be spliced in the order of the expressions in the rule expression, for example in the order of title followed by source. The server may then further obtain the set of required (must-fill) fields from the crawling rule configured in fig. 10, and obtain the data corresponding to those required fields from the matching data.
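The loop of S42, together with the required-field checks of S43/S44, can be sketched as below. The use of regular-expression named groups, and the `title`/`source` field names, are illustrative assumptions; real rule expressions come from the configuration page of fig. 10.

```python
import re

def crawl_matches(page_text, rule_pattern, required_fields):
    """Apply a crawling rule (a regular expression with named groups) to page
    text in a loop, keeping only records whose required fields are non-empty."""
    rule = re.compile(rule_pattern)
    records = []
    for m in rule.finditer(page_text):      # loop over all matches (S42)
        record = m.groupdict()
        # Discard the record if any required field is missing or empty (S43/S44).
        if any(not record.get(f) for f in required_fields):
            continue
        records.append(record)
    return records
```

Given a page fragment with one complete entry and one entry whose source is empty, only the complete entry survives, matching the delete-record behavior described below.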
S43, judging whether the matching data is empty, if so, ending the flow, and if not, executing S44.
S44, judging whether the data corresponding to the necessary filling item is empty or not, ending the flow if the data is empty, and executing S45 if the data is not empty.
That is, the server may determine whether the obtained matching data and the data corresponding to the required fields are empty; if either is empty, the server may delete the record and end the current flow. If the matching data and the data corresponding to the required fields are both not empty, the following steps are executed.
S45, the server performs format processing on the matched data and stores the matched data in a storage medium.
For example, in a scenario where the matching data is a text string, the server may convert it into JSON format for storage in the storage medium. In some implementations, the server may further fill and splice the matching data into a complete sentence and store it; for example, for the data crawling of a social leaderboard, information such as the title and source may be filled and spliced into a complete sentence.
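The JSON conversion and sentence splicing of S45 can be sketched as follows; the sentence template and the field names are assumptions for illustration, not a format prescribed by the text.

```python
import json

def format_record(record, template="{title} (source: {source})"):
    """Convert one matched record into a JSON string for storage, and splice
    its fields into a complete sentence using the given template."""
    as_json = json.dumps(record, ensure_ascii=False, sort_keys=True)
    sentence = template.format(**record)
    return as_json, sentence
```

In a real implementation the JSON string would be written to the storage medium, and the spliced sentence stored alongside it or in its place, per the configured service.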
In addition, in some implementations, after the server crawls the web page data corresponding to the current URL address, it may further determine whether there is child node information under the URL address, i.e. whether there is a child page. If there is a child page, the server may further obtain the URL address of the child page, continue crawling the data of the child page, and store the crawled data in the storage medium.
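One way to continue into child pages is sketched below. The breadth-first traversal, the `seen` set, and the three caller-supplied callbacks are illustrative assumptions; the text only states that child-page URLs are obtained and crawled in turn.

```python
def crawl_with_children(url, fetch_page, extract_child_urls, store):
    """Crawl a URL, then crawl any child pages found under it.

    fetch_page(url) -> page text; extract_child_urls(text) -> list of child
    URLs; store(url, text) persists the crawled data to the storage medium.
    The seen-set guards against revisiting the same page."""
    seen = set()
    pending = [url]
    while pending:
        current = pending.pop(0)
        if current in seen:
            continue
        seen.add(current)
        text = fetch_page(current)
        store(current, text)
        pending.extend(extract_child_urls(text))
```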
In some implementations, the process of analyzing the downloaded web page data in S38-S45 described above may be performed by a web page parser in a server.
It will be appreciated that after completing the data crawling process of the present round, the server may return to S33 to continue the data crawling process of obtaining URL addresses from the data crawling queue for persistence.
It can be further understood that, according to the system architecture shown in fig. 1, after the server finishes data crawling, if a service application corresponding to the data crawling task is installed on an electronic device used by the user, the service application may display a data crawling result and may respond to a search request of the user.
According to the above data crawling method, the data crawling task can be dynamically configured on the Web client, reducing the labor cost of writing code; at the same time, the server can continuously perform the data crawling process according to the crawling time interval, which improves both the efficiency and the flexibility of the data crawling process.
In some scenarios, the above-mentioned process of dynamically configuring the data crawling task may be performed on a server, in addition to the Web client, that is, the developer directly opens a Web management page on the server to configure. In other scenarios, the process of crawling data described above may also be performed on a Web client with some computational processing power.
The above-mentioned Web client may be a notebook computer, a PC, or the like. Fig. 13 is an exemplary schematic structural diagram of the Web client according to an embodiment of the present application. The Web client may include a processor 210, a memory 220, and a communication module 230, among others.
Processor 210 may include, among other things, one or more processing units, memory 220 for storing program codes and data. In an embodiment of the application, processor 210 may execute computer-executable instructions stored in memory 220.
The communication module 230 may be used for communication between internal modules of the Web client, for communication between the Web client and other devices, and the like. By way of example, if the Web client communicates with other devices by way of a wired connection, the communication module 230 may include an interface such as a universal serial bus (universal serial bus, USB) interface, which may be an interface conforming to the USB standard specification, specifically a Mini USB interface, a Micro USB interface, a USB Type-C interface, or the like. The USB interface can be used to connect a charger to charge the Web client, to transfer data between the Web client and other devices, or to connect a headset and play audio through the headset.
Alternatively, the communication module 230 may include an audio device, a radio frequency circuit, a Bluetooth chip, a wireless fidelity (wireless fidelity, Wi-Fi) chip, a near field communication (NFC) module, etc., so that interaction between the Web client and other devices may be implemented in a variety of different manners.
In addition, the Web client may further include a display screen 240, and the display screen 240 may display images or videos in the human-computer interaction interface. For example, the Web management page may be displayed for a developer to configure the data crawling task.
Optionally, the Web client may also include a peripheral device 250, such as a mouse, keyboard, speaker, microphone, etc.
It should be understood that the structure of the Web client is not particularly limited in the embodiment of the present application, except for the various components or modules listed in fig. 13. In other embodiments of the application, the Web client may also include more or fewer components than shown, or may combine certain components, or split certain components, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The above server may be a single server or a server cluster. Fig. 14 is a schematic structural diagram of an example of a server according to an embodiment of the present application. The server includes a processor, a memory, and a network interface connected by a system bus. The processor of the server is configured to provide computing and control capabilities. The memory of the server includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium. The database of the server may be used to store configuration data for the data crawling tasks described above, configuration data for URL data sources, and the like. The network interface of the server may be used to communicate with external devices via a network connection.
For the electronic device, which may be a mobile phone, a tablet computer, etc. used by a user, fig. 15 is a schematic structural diagram of an example of the electronic device according to the embodiment of the present application. Taking the example of the electronic device being a cell phone, the electronic device may include a processor 110, an external memory interface 120, an internal memory 121, a USB interface 130, a charge management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, keys 190, a motor 191, an indicator 192, a camera 193, a display 194, and a subscriber identity module (subscriber identity module, SIM) card interface 195, etc. The sensor module 180 may include a pressure sensor 180A, a gyro sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
The processor 110 may include one or more processing units, such as: the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processing unit (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), a controller, a memory, a video codec, a digital signal processor (digital signal processor, DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. The different processing units may be separate devices or may be integrated in one or more processors.
The controller can be a neural center and a command center of the electronic device. The controller can generate operation control signals according to the instruction operation codes and the time sequence signals to finish the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that the processor 110 has just used or recycled. If the processor 110 needs to reuse the instruction or data, it may be called directly from memory. Repeated accesses are avoided and the latency of the processor 110 is reduced, thereby improving the efficiency of the system.
The mobile communication module 150 may provide a solution for wireless communication including 2G/3G/4G/5G, etc. applied on an electronic device. The mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (low noise amplifier, LNA), etc. The mobile communication module 150 may receive electromagnetic waves from the antenna 1, perform processes such as filtering, amplifying, and the like on the received electromagnetic waves, and transmit the processed electromagnetic waves to the modem processor for demodulation. The mobile communication module 150 can amplify the signal modulated by the modem processor, and convert the signal into electromagnetic waves through the antenna 1 to radiate. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be disposed in the processor 110. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be provided in the same device as at least some of the modules of the processor 110.
The wireless communication module 160 may provide solutions for wireless communication applied to the electronic device, including wireless local area networks (wireless local area networks, WLAN) (e.g., a wireless fidelity (wireless fidelity, Wi-Fi) network), Bluetooth (BT), global navigation satellite system (global navigation satellite system, GNSS), frequency modulation (frequency modulation, FM), near field communication (near field communication, NFC), infrared (IR), etc. The wireless communication module 160 may be one or more devices that integrate at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, modulates and filters the electromagnetic wave signals, and transmits the processed signals to the processor 110. The wireless communication module 160 may also receive a signal to be transmitted from the processor 110, frequency modulate it, amplify it, and convert it to electromagnetic waves for radiation via the antenna 2.
The electronic device implements display functions via a GPU, a display screen 194, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 194 is used to display images, videos, and the like. The display 194 includes a display panel. The display panel may employ a liquid crystal display (liquid crystal display, LCD), an organic light-emitting diode (organic light-emitting diode, OLED), an active-matrix organic light-emitting diode (active-matrix organic light-emitting diode, AMOLED), a flexible light-emitting diode (flexible light-emitting diode, FLED), a mini-LED, a micro-LED, a micro-OLED, a quantum dot light-emitting diode (quantum dot light-emitting diode, QLED), or the like. In some embodiments, the electronic device may include 1 or N display screens 194, N being a positive integer greater than 1.
The internal memory 121 may be used to store computer-executable program code that includes instructions. The processor 110 implements various functional applications and data processing of the electronic device by executing instructions stored in the internal memory 121. The internal memory 121 may include a program storage area and a data storage area. The program storage area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like. The data storage area may store data created during use of the electronic device (e.g., audio data, phonebook, etc.), and so forth. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a non-volatile memory such as at least one magnetic disk storage device, a flash memory device, universal flash storage (universal flash storage, UFS), and the like.
It should be understood that the structure illustrated in the embodiments of the present application does not constitute a specific limitation on the electronic device. In other embodiments of the application, the electronic device may include more or less components than illustrated, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The above describes in detail an example of the data crawling method provided by the embodiment of the present application. It will be appreciated that the Web client, server and electronic device, in order to achieve the above-described functionality, comprise corresponding hardware and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Those skilled in the art may implement the described functionality using different approaches for each particular application in conjunction with the embodiments, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The embodiment of the application may divide the Web client, the server, and the electronic device into functional modules according to the above method examples; for example, each functional module may correspond to one function (e.g., a detection unit, a processing unit, a display unit, etc.), or two or more functions may be integrated in one module. The integrated modules may be implemented in hardware or as software functional modules. It should be noted that, in the embodiment of the present application, the division of the modules is schematic and merely a logical function division; other division manners may be used in actual implementation.
It should be noted that all relevant contents of the steps in the above method embodiments may be incorporated by reference into the functional descriptions of the corresponding functional modules, and are not described again herein.
In case an integrated unit is employed, the electronic device may further comprise a processing module, a storage module and a communication module. The processing module can be used for controlling and managing the actions of the electronic equipment. The memory module may be used to support the electronic device to execute stored program code, data, etc. And the communication module can be used for supporting the communication between the electronic device and other devices.
Wherein the processing module may be a processor or a controller. Which may implement or perform the various exemplary logic blocks, modules and circuits described in connection with this disclosure. A processor may also be a combination that performs a computational function, such as a combination comprising one or more microprocessors, a combination of a digital signal processor and a microprocessor, and so forth. The memory module may be a memory. The communication module can be a radio frequency circuit, a Bluetooth chip, a Wi-Fi chip and other equipment which interact with other electronic equipment.
The embodiment of the application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to execute the data crawling method of any of the above embodiments. The storage medium may include: a USB flash drive, a removable hard disk, a read-only memory (read-only memory, ROM), a random access memory (random access memory, RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
The embodiment of the application also provides a computer program product, which when run on a computer, causes the computer to execute the relevant steps to realize the data crawling method in the embodiment.
In addition, embodiments of the present application also provide an apparatus, which may be embodied as a chip, component or module, which may include a processor and a memory coupled to each other; the memory is used for storing computer-executed instructions, and when the device is operated, the processor can execute the computer-executed instructions stored in the memory, so that the chip executes the data crawling method in the method embodiments.
The Web client, the server, the electronic device, the computer readable storage medium, the computer program product, or the chip provided in this embodiment are all configured to execute the corresponding method provided above, so that the beneficial effects achieved by the method can refer to the beneficial effects in the corresponding method provided above, and are not repeated herein.
It will be appreciated by those skilled in the art that, for convenience and brevity of description, only the above division of functional modules is illustrated. In practical application, the above functions may be allocated to different functional modules as needed, i.e. the internal structure of the apparatus may be divided into different functional modules to perform all or part of the functions described above.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims (13)

1. A data crawling method, wherein the method is applied to a data crawling system, the data crawling system comprising a Web client and a server, the method comprising:
The Web client displays a second page, wherein the second page is used for configuring a crawling data source;
The Web client receives a second configuration operation on the crawling data source through the second page, and sends configuration content of the crawling data source to the server for storage, wherein the second configuration operation at least comprises configuration on a website corresponding to the crawling data source;
the Web client displays a first page, wherein the first page is used for configuring a data crawling task;
The Web client receives a first configuration operation on the data crawling task through the first page, and sends configuration content of the data crawling task to the server for storage, wherein the first configuration operation at least comprises configuration on a crawling data source and a crawling time interval of the data crawling task, and the configuration on the crawling data source of the data crawling task comprises the following steps: selecting at least one crawling data source from the crawling data sources configured by the second configuration operation;
Under the condition that the Web client receives the operation of starting the data crawling task, a task request is sent to the server to request the server to execute the data crawling task, and the task request carries the identification of the data crawling task;
The server acquires a crawling data source corresponding to the data crawling task according to the identification of the data crawling task, and stores the crawling data source and crawling trigger time into a first queue, wherein the crawling trigger time is determined by the server according to the current system time and the crawling time interval;
the server acquires the crawling data source from the first queue, and crawls webpage data according to a website corresponding to the crawling data source under the condition that the current system time reaches the crawling trigger time;
Deleting information of the crawling data source from the first queue, and determining next crawling trigger time of the crawling data source according to the crawling time interval;
Re-storing the crawling data source and the next crawling trigger time into the first queue so as to crawl the webpage data at fixed time;
under the condition that the current system time does not reach the crawling trigger time, the server calculates waiting time according to the crawling trigger time and the current system time;
Under the condition that the waiting time is longer than a first preset time length, the server executes the local lock for the first preset time length, and acquires the crawling data source from the first queue again after releasing the local lock;
And under the condition that the waiting time is not longer than a first preset time length, executing the waiting time of the local lock by the server, and acquiring the crawling data source from the first queue again after releasing the local lock.
2. The method of claim 1, wherein the first page includes a first control, wherein the Web client receives a first configuration operation for the data crawling task via the first page, comprising:
the Web client responds to clicking operation of a user on the first control, and displays a first input control for configuring the data crawling task;
and receiving the crawling data source and the crawling time interval which are input by the user on the first input control, and completing the first configuration operation.
3. The method of claim 2, wherein after the completion of the first configuration operation, the method further comprises:
and the Web client displays the configuration content of the data crawling task and a second control on the first page, wherein the second control is used for triggering the starting and stopping of the data crawling task.
4. The method of claim 3, wherein the Web client receiving an operation to initiate the data crawling task comprises:
and the Web client receives the click operation of the user on the second control, and triggers the starting of the data crawling task.
5. The method of claim 1, wherein the second page includes a third control, and the Web client receiving a second configuration operation for the crawling data source via the second page comprises:
in response to a click operation by a user on the third control, the Web client displays a second input control for configuring the crawling data source;
and receives the web address corresponding to the crawling data source entered by the user in the second input control, completing the second configuration operation.
6. The method of claim 5, wherein after completing the second configuration operation, the method further comprises:
the Web client displays the configuration content of the crawling data source and a fourth control on the second page, wherein the fourth control is used to trigger configuration of the crawling rule of the crawling data source.
7. The method of claim 6, wherein after the Web client displays the configuration content of the crawling data source and a fourth control on the second page, the method further comprises:
the Web client receives the user's click operation on the fourth control, and displays a third page for configuring the crawling rule;
receives the user's click operation on a fifth control on the third page, and acquires and displays the original message information of the web address corresponding to the crawling data source;
and receives a rule expression entered by the user in a third input control on the third page, and performs a crawling test on the original message information according to the rule expression.
8. The method according to any one of claims 1 to 7, wherein crawling the web page data according to the web address corresponding to the crawling data source comprises:
the server acquires the crawling rule corresponding to the crawling data source;
and crawls the content at the web address corresponding to the crawling data source according to the crawling rule to obtain the web page data.
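Claims 7 and 8 leave the rule format unspecified; the sketch below assumes each rule expression is a regular expression with one capture group applied to the page's original message text, and that the crawling test of claim 7 simply runs the expressions and shows the extracted fields. All names and the sample HTML are hypothetical.

```python
import re

def apply_rules(raw_message, rules):
    """Apply each named rule expression (assumed to be a regex with one
    capture group) to the raw message text of the page; fields whose
    expression does not match are set to None."""
    page_data = {}
    for field, pattern in rules.items():
        m = re.search(pattern, raw_message, re.S)
        page_data[field] = m.group(1) if m else None
    return page_data

# A crawling test as in claim 7: run the rule expressions against the
# original message information and inspect the extracted result.
raw = "<html><h1>Launch news</h1><span class='date'>2024-06-13</span></html>"
rules = {"title": r"<h1>(.*?)</h1>", "date": r"class='date'>(.*?)</span>"}
print(apply_rules(raw, rules))
# → {'title': 'Launch news', 'date': '2024-06-13'}
```

Separating fetching from extraction this way lets the same rule set be exercised in the configuration-page test and in the server's scheduled crawl.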
9. The method of claim 8, wherein the crawling rules include a required-field setting for the fields contained in a crawling result, and the method further comprises, after obtaining the web page data:
if the required-field property of a first field in the web page data is set to required and the value of the first field is null, the server discards the web page data.
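A minimal sketch of the required-field check in claim 9, assuming the required-field settings are a simple field-to-boolean map; the function and field names are illustrative, not from the patent.

```python
def validate_required(page_data, required_fields):
    """Return None (i.e. discard the crawl result) if any field whose
    required-field setting is True has a null or empty value; otherwise
    return the web page data unchanged."""
    for field, required in required_fields.items():
        if required and page_data.get(field) in (None, ""):
            return None  # the server discards the web page data
    return page_data
```

For example, `validate_required({"title": None}, {"title": True})` yields `None`, so the incomplete record never reaches storage.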
10. The method of any one of claims 1 to 7, wherein after crawling the web page data from the web address corresponding to the crawling data source, the method further comprises:
the server stores the web page data in a storage medium.
11. The method of any one of claims 1 to 7, wherein the data crawling system further comprises an electronic device, and the method further comprises:
the electronic device sends a data request to the server to request the crawled web page data;
and the electronic device receives and presents the web page data from the server.
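The data request in claim 11 could look like the sketch below; the endpoint path `/api/crawled-data`, the `source` query parameter, and the JSON payload format are assumptions, since the claim does not specify a request format.

```python
import json
import urllib.request

def fetch_crawled_data(server_base, source_id):
    """Send a data request to the server for the web page data crawled
    from the given data source, and return the decoded JSON payload.
    Endpoint path and parameter name are hypothetical."""
    url = f"{server_base}/api/crawled-data?source={source_id}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

The electronic device would then render the returned payload; any pagination or authentication would sit on top of this basic request.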
12. A data crawling system, characterized in that it comprises a Web client and a server, said data crawling system being adapted to perform the method of any of claims 1 to 11.
13. A computer-readable storage medium comprising instructions that, when run on a data crawling system, cause the data crawling system to perform the method of any of claims 1 to 11.
CN202410760744.7A 2024-06-13 2024-06-13 Data crawling method, system and computer readable storage medium Active CN118332174B (en)

Publications (2)

Publication Number Publication Date
CN118332174A CN118332174A (en) 2024-07-12
CN118332174B true CN118332174B (en) 2024-10-29

Family

ID=91772895



Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109213824A (en) * 2017-06-29 2019-01-15 北京京东尚科信息技术有限公司 Data grabber system, method and apparatus
CN113934912A (en) * 2021-11-11 2022-01-14 北京搜房科技发展有限公司 Data crawling method and device, storage medium and electronic equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7403945B2 (en) * 2004-11-01 2008-07-22 Sybase, Inc. Distributed database system providing data and space management methodology
US7917520B2 (en) * 2006-12-06 2011-03-29 Yahoo! Inc. Pre-cognitive delivery of in-context related information
US7672938B2 (en) * 2007-10-05 2010-03-02 Microsoft Corporation Creating search enabled web pages
CN115185673B (en) * 2022-05-17 2023-10-31 贝壳找房(北京)科技有限公司 Distributed timing task scheduling method, system, storage medium and program product




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant