CN113743082B - Data processing method, system, storage medium and electronic equipment - Google Patents

Data processing method, system, storage medium and electronic equipment

Info

Publication number
CN113743082B
Authority
CN
China
Prior art keywords
data
type
feature
result
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111087141.8A
Other languages
Chinese (zh)
Other versions
CN113743082A (en)
Inventor
吴东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong Technology Information Technology Co Ltd
Original Assignee
Jingdong Technology Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdong Technology Information Technology Co Ltd
Priority to CN202111087141.8A
Publication of CN113743082A
Application granted
Publication of CN113743082B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data processing method, a system, a storage medium and electronic equipment. Text data to be processed is acquired; data type recognition is performed on the text data to be processed to obtain a data type result; a corresponding feature configuration list is determined based on the data type result; a corresponding extraction rule is acquired according to the data type result; feature data is extracted from the feature configuration list based on the extraction rule; and, when the feature data meets a preset condition, text abstract data is generated based on a preset abstract rule and the feature data. With this scheme, even for complex data structures that contain non-special characters, feature extraction is performed separately for each data type to obtain the corresponding feature data. This meets the requirements of automatic type identification, automatic feature extraction and automatic text abstract generation under complex data structures, and improves the accuracy of the obtained text abstract data. In addition, running the similarity algorithm on the text abstract data improves the accuracy of the similarity calculation result.

Description

Data processing method, system, storage medium and electronic equipment
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data processing method, a system, a storage medium, and an electronic device.
Background
In natural language processing tasks, whether two documents are similar is judged by calculating their degree of similarity with a similarity algorithm. For example, when microblog hot topics are discovered with a clustering algorithm, the similarity of the content of each text must be measured, and microblogs with sufficiently similar content are then grouped into one cluster; when text is preprocessed, repeated texts are selected and deleted based on text similarity.
Before a similarity algorithm is run, the data is usually preprocessed. The commonly used preprocessing method is to remove special characters (such as punctuation marks, brackets and tags) from the data. For complex data objects (such as structured data), however, non-special characters (letters, digits and Chinese characters) interfere with the processing, so repeated texts cannot be selected and deleted from the complex structured data. The generated text abstract data then contains repeated texts, which reduces its accuracy.
Thus, text abstract data generated by existing approaches has low accuracy.
Disclosure of Invention
In view of this, the application discloses a data processing method, a system, a storage medium and an electronic device, which aim to meet the requirements of automatic type identification, automatic feature extraction, automatic text abstract generation and the like under a complex data structure, and improve the accuracy of acquiring text abstract data.
In order to achieve the above purpose, the technical scheme disclosed by the method is as follows:
A first aspect of the application discloses a data processing method, which comprises the following steps:
acquiring text data to be processed; the text data to be processed is acquired according to the user demand;
performing data type identification on the text data to be processed to obtain a data type result, and determining a corresponding feature configuration list based on the data type result;
acquiring a corresponding extraction rule according to the data type result, and extracting feature data from the feature configuration list based on the extraction rule;
when the feature data meets the preset condition, generating text abstract data based on a preset abstract rule and the feature data that meets the preset condition; the preset abstract rule is determined by an abstract rule field of the feature configuration list.
Preferably, the performing data type identification on the text data to be processed to obtain a data type result, and determining a corresponding feature configuration list based on the data type result, includes:
performing data type identification on the text data to be processed;
when the data type of the text data to be processed is a String type, generating a String type result, and parsing the text data to be processed based on the String type result to obtain a String type feature configuration list;
when the data type of the text data to be processed is an XML type, generating an XML type result, and parsing the text data to be processed based on the XML type result to obtain an XML type feature configuration list;
when the data type of the text data to be processed is a JSON type, generating a JSON type result, and parsing the text data to be processed based on the JSON type result to obtain a JSON type feature configuration list.
Preferably, the obtaining a corresponding extraction rule according to the data type result, and extracting feature data from the feature configuration list based on the extraction rule, includes:
judging the data type result;
when the data type result is the String type result, carrying out regular matching on the information of the String type feature configuration list through a preset regular expression and a preset feature field, and extracting first feature data conforming to the regular matching;
when the data type result is the XML type result, calculating information of the XML type feature configuration list through a preset attribute expression to obtain a first calculation result, and determining second feature data based on the first calculation result and the preset feature field;
and when the data type result is the JSON type result, calculating the information of the JSON type feature configuration list through the preset attribute expression to obtain a second calculation result, and determining third feature data based on the second calculation result and the preset feature field.
Preferably, the generating, when the feature data meets a preset condition, text abstract data based on a preset abstract rule and the feature data that meets the preset condition includes:
when the first feature data is not null, acquiring a first abstract rule field from the String type feature configuration list, and generating first text abstract data based on the first abstract rule field and the first feature data;
when the second feature data is not null, acquiring a second abstract rule field from the XML type feature configuration list, and generating second text abstract data based on the second abstract rule field and the second feature data;
and when the third feature data is not null, acquiring a third abstract rule field from the JSON type feature configuration list, and generating third text abstract data based on the third abstract rule field and the third feature data.
Preferably, the method further comprises:
if the feature data is null, returning to the step of acquiring the text data to be processed.
A second aspect of the present application discloses a data processing system, the system comprising:
the acquisition unit is used for acquiring text data to be processed; the text data to be processed is acquired according to the user demand;
the determining unit is used for carrying out data type identification on the text data to be processed to obtain a data type result, and determining a corresponding feature configuration list based on the data type result;
the extraction unit is used for acquiring corresponding extraction rules according to the data type result and extracting feature data from the feature configuration list based on the extraction rules;
the generating unit is used for generating text abstract data based on a preset abstract rule and the feature data that meets a preset condition when the feature data meets the preset condition; the preset abstract rule is determined by an abstract rule field of the feature configuration list.
Preferably, the determining unit includes:
the identification module is used for carrying out data type identification on the text data to be processed;
the first acquisition module is used for generating a String type result when the data type of the text data to be processed is a String type, and parsing the text data to be processed based on the String type result to obtain a String type feature configuration list;
the second acquisition module is used for generating an XML type result when the data type of the text data to be processed is an XML type, and parsing the text data to be processed based on the XML type result to obtain an XML type feature configuration list;
and the third acquisition module is used for generating a JSON type result when the data type of the text data to be processed is a JSON type, and parsing the text data to be processed based on the JSON type result to obtain a JSON type feature configuration list.
Preferably, the extraction unit includes:
the judging module is used for judging the data type result;
the first extraction module is used for carrying out regular matching on the information of the String type feature configuration list through a preset regular expression when the data type result is the String type result, and extracting first feature data conforming to the regular matching;
the second extraction module is used for calculating the information of the XML type feature configuration list through a preset attribute expression to obtain second feature data when the data type result is the XML type result;
and the third extraction module is used for calculating the information of the JSON type feature configuration list through the preset attribute expression when the data type result is the JSON type result, so as to obtain third feature data.
A third aspect of the present application discloses a storage medium, wherein the storage medium includes stored instructions, where the instructions, when executed, control a device in which the storage medium is located to perform a data processing method according to any one of the first aspects.
A fourth aspect of the application discloses an electronic device comprising a memory, and one or more instructions, wherein the one or more instructions are stored in the memory and configured to be executed by one or more processors to perform the data processing method according to any of the first aspects.
According to the above technical scheme, the application discloses a data processing method, a system, a storage medium and electronic equipment. Text data to be processed is acquired according to the user demand; data type recognition is performed on the text data to be processed to obtain a data type result; a corresponding feature configuration list is determined based on the data type result; a corresponding extraction rule is acquired according to the data type result; feature data is extracted from the feature configuration list based on the extraction rule; and, when the feature data meets a preset condition, text abstract data is generated based on a preset abstract rule and the feature data. The preset abstract rule is determined by the feature configuration list and a pre-acquired feature rule field, and the pre-acquired feature rule field is obtained based on the feature configuration list. With this scheme, even for complex data structures that contain non-special characters, feature extraction is performed separately for each data type to obtain the corresponding feature data. This meets the requirements of automatic type identification, automatic feature extraction and automatic text abstract generation under complex data structures, and improves the accuracy of the obtained text abstract data. In addition, running the similarity algorithm on the text abstract data improves the accuracy of the similarity calculation result.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings may be obtained according to the provided drawings without inventive effort to a person skilled in the art.
FIG. 1 is a schematic flow chart of a data processing method disclosed in an embodiment of the present application;
FIG. 2 is a schematic flow chart of determining a corresponding feature configuration list based on a data type result obtained by data type recognition according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of extracting feature data from a feature configuration list based on an extraction rule according to a data type result, as disclosed in an embodiment of the present application;
FIG. 4 is a schematic flow chart of generating text abstract data based on a preset abstract rule and feature data when the feature data meets a preset condition according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a data processing system according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
In this application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
As described in the background, when a complex data object (such as structured data) is preprocessed before a similarity algorithm is run, non-special characters (letters, digits and Chinese characters) interfere with the processing, so repeated texts cannot be selected and deleted from the complex structured data. The generated text abstract data then contains repeated texts, which reduces its accuracy.
To solve this problem, the embodiments of the application disclose a data processing method, a system, a storage medium and electronic equipment. Even for complex data structures that contain non-special characters, feature extraction is performed separately for each data type to obtain the corresponding feature data. This meets the requirements of automatic type identification, automatic feature extraction and automatic text abstract generation under complex data structures, and improves the accuracy of the obtained text abstract data. In addition, running the similarity algorithm on the text abstract data improves the accuracy of the similarity calculation result. The specific implementation is illustrated by the following embodiments.
Referring to fig. 1, a flow chart of a data processing method disclosed in an embodiment of the present application is shown, where the data processing method mainly includes the following steps:
S101: acquiring text data to be processed, where the text data to be processed is acquired according to the user demand.
In S101, the text data to be processed is text data input by a user or acquired according to a user requirement.
S102: performing data type identification on the text data to be processed to obtain a data type result, and determining a corresponding feature configuration list based on the data type result.
In S102, the data type results include a String type result, a JavaScript Object Notation (JSON) type result, and an Extensible Markup Language (XML) type result, where the String type result is a plain text type result.
Whether the data type is the XML type or the JSON type is judged using the XML-related and JSON-related application programming interfaces (APIs) of the object-oriented programming language Java; if the data type is neither the XML type nor the JSON type, the data is processed as the String type. The data type identification flow is shown in A1-A5.
A1: judging whether the data type is the JSON type.
If the data type is the JSON type, A2 is executed; if the data type is not the JSON type, A3 is executed.
A2: returning the JSON type as the data type, and ending the data type identification.
A3: judging whether the data type is the XML type.
If the data type is the XML type, A4 is executed; if the data type is not the XML type, A5 is executed.
A4: returning the XML type as the data type, and ending the data type identification.
A5: returning the String type as the data type, and ending the data type identification.
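The A1-A5 flow can be sketched as follows. This is only an illustration under stated assumptions: the text above refers to Java XML and JSON APIs, whereas this sketch uses Python's standard `json` and `xml.etree.ElementTree` modules, and the function name `identify_data_type` is hypothetical. It also assumes that only JSON objects and arrays count as the JSON type, so that a bare number or quoted word is still treated as plain text.

```python
import json
import xml.etree.ElementTree as ET


def identify_data_type(text: str) -> str:
    """Return "JSON", "XML" or "String", following steps A1-A5."""
    # A1/A2: try JSON first. Assumption: only objects and arrays count as
    # JSON, so a bare scalar such as "123" falls through to the later steps.
    try:
        if isinstance(json.loads(text), (dict, list)):
            return "JSON"
    except ValueError:
        pass
    # A3/A4: otherwise try to parse the text as XML.
    try:
        ET.fromstring(text)
        return "XML"
    except ET.ParseError:
        pass
    # A5: neither parser accepted the input, so treat it as plain text.
    return "String"
```

Trying a real parser rather than inspecting the first character keeps the check consistent with whatever the later parsing step will accept.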
Specifically, the process of performing data type identification on the text data to be processed to obtain a data type result, and determining the corresponding feature configuration list based on the data type result, is shown in B1-B4.
B1: performing data type identification on the text data to be processed.
B2: when the data type of the text data to be processed is the String type, generating a String type result, and parsing the text data to be processed based on the String type result to obtain a String type feature configuration list.
When the data type of the text data to be processed is the String type, the text data to be processed is parsed into a String type feature configuration list in key-value form.
B3: when the data type of the text data to be processed is the XML type, generating an XML type result, and parsing the text data to be processed based on the XML type result to obtain an XML type feature configuration list.
When the data type of the text data to be processed is the XML type, the text data to be processed is parsed into an XML type feature configuration list in key-value form.
B4: when the data type of the text data to be processed is the JSON type, generating a JSON type result, and parsing the text data to be processed based on the JSON type result to obtain a JSON type feature configuration list.
When the data type of the text data to be processed is the JSON type, the text data to be processed is parsed into a JSON type feature configuration list in key-value form.
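As a minimal sketch of steps B2-B4, the function below turns each data type into a flat key-value mapping. The text does not specify the parsing details, so everything here is an assumption made for illustration: the function name `to_feature_list` is hypothetical, the JSON input is assumed to be a single flat object, each direct child element of the XML root is assumed to be one feature, and plain text is assumed to consist of `key=value` pairs separated by semicolons.

```python
import json
import xml.etree.ElementTree as ET


def to_feature_list(text: str, data_type: str) -> dict:
    """Parse text into a flat key-value feature configuration list."""
    if data_type == "JSON":
        # Assumption: the JSON document is a single flat object.
        return {str(key): value for key, value in json.loads(text).items()}
    if data_type == "XML":
        # Assumption: each direct child of the root element is one feature.
        root = ET.fromstring(text)
        return {child.tag: (child.text or "").strip() for child in root}
    # Assumption: plain text holds "key=value" pairs separated by ";".
    pairs = (item.split("=", 1) for item in text.split(";") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}
```

Nested JSON objects or XML attributes would need a flattening convention of their own, which the text leaves open.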
S103: acquiring a corresponding extraction rule according to the data type result, and extracting feature data from the feature configuration list based on the extraction rule.
In S103, if the data type result is a String type result, the extraction rule is a preset regular expression; if the data type result is an XML type result or a JSON type result, the extraction rule is a preset attribute expression.
The preset attribute expression supports comparison symbols such as >, < and =, together with logical AND (&) and OR (|) symbols, and feature names are wrapped in %. For example, %A% > 100 & %B% = 123 means that the value of feature A is greater than 100 and the value of feature B is equal to 123; data satisfying these conditions is extracted, that is, the extraction rule is A > 100 and B = 123. As another example, when several features are to be extracted, each feature name is wrapped in %, such as %A% and %B%; the extraction rule is then to extract the values of data features A and B.
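One way to evaluate such an attribute expression is sketched below. It is an illustration only, not the patented implementation: the function names are hypothetical, only the symbols >, < and = are handled, & is read as AND and | as OR with & binding more tightly, and values are compared numerically when both sides parse as numbers and as text otherwise.

```python
import operator
import re

# Comparison symbols assumed for this sketch.
_OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}
_TERM = re.compile(r"^\s*%(\w+)%\s*(>|<|=)\s*(.+?)\s*$")


def _coerce(value):
    """Return a float when the value is numeric, otherwise its text form."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return str(value)


def _term_holds(term: str, features: dict) -> bool:
    match = _TERM.match(term)
    if match is None:
        raise ValueError(f"unsupported term: {term!r}")
    name, symbol, expected = match.groups()
    if name not in features:
        return False
    left, right = _coerce(features[name]), _coerce(expected)
    if type(left) is not type(right):
        # Mixed number/text operands: fall back to comparing as text.
        left, right = str(features[name]), expected
    return _OPS[symbol](left, right)


def satisfies(expression: str, features: dict) -> bool:
    """Evaluate terms joined by | (OR) and & (AND); & binds tighter."""
    return any(
        all(_term_holds(term, features) for term in group.split("&"))
        for group in expression.split("|")
    )
```

For instance, `satisfies("%A% > 100 & %B% = 123", {"A": 150, "B": 123})` evaluates both terms against the feature values and combines them with AND.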
The process of acquiring the corresponding extraction rule according to the data type result and extracting the feature data from the feature configuration list based on the extraction rule is shown in C1-C4.
C1: judging the data type result.
C2: when the data type result is a String type result, performing regular matching on the information of the String type feature configuration list through a preset regular expression and a preset feature field, and extracting first feature data conforming to the regular matching.
When the data type result is a String type result, both the extraction condition and the preset feature field are regular expressions. The information of the String type feature configuration list is matched against the preset regular expression and the preset feature field; if the matching succeeds, the first feature data is extracted, and if the matching fails, no first feature data is extracted.
The preset feature field may be a name field, a property field, etc.; the specific preset feature field is set by a technician according to the actual situation and is not specifically limited in this application.
The first feature data indicates the feature data obtained by regular matching of the information of the String type feature configuration list.
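Step C2 might look like the following sketch, assuming the String type feature configuration list is a key-value mapping. The function name and both parameters are hypothetical: `field_pattern` stands in for the preset feature field (itself a regular expression, matched against keys) and `value_pattern` for the preset regular expression (matched against values).

```python
import re


def extract_first_feature_data(features: dict, field_pattern: str,
                               value_pattern: str) -> dict:
    """Keep entries whose key and value both match the given patterns."""
    field_re = re.compile(field_pattern)
    value_re = re.compile(value_pattern)
    return {
        key: value
        for key, value in features.items()
        # The key must match the field pattern in full; the value only
        # needs to contain a match of the value pattern.
        if field_re.fullmatch(key) and value_re.search(str(value))
    }
```

An empty result corresponds to the case in which matching is unsuccessful and no first feature data is extracted.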
C3: when the data type result is an XML type result, calculating the information of the XML type feature configuration list through a preset attribute expression to obtain a first calculation result, and obtaining second feature data based on the first calculation result and a preset feature field.
When the data type result is an XML type result, all feature data is extracted from the information of the XML type feature configuration list, the part of the feature data named by the preset feature field is selected from it and substituted into the attribute expression for calculation, and, when the attribute expression is satisfied, the extracted feature data is configured as the second feature data according to the preset feature field.
The second feature data indicates the feature data, in the information of the XML type feature configuration list, that satisfies the attribute expression.
For ease of understanding, the process of calculating the information of the XML type feature configuration list through the preset attribute expression to obtain a first calculation result, and obtaining second feature data based on the first calculation result and the preset feature field, is illustrated as follows:
For example, suppose the personal information is XML type data. The information of the XML type feature configuration list is first calculated through the preset attribute expression; where the preset attribute expression is satisfied, all feature data of the information of the XML type feature configuration list (name, gender, age and province) is extracted. If the preset feature fields are configured as %province% = Sichuan, name, gender, age, then the second feature data is the name, gender and age of every personal information record from Sichuan province.
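The Sichuan example can be sketched with Python's `xml.etree.ElementTree`. The sample records, element names and the function name `second_feature_data` are made up for illustration, and the attribute expression is hard-wired to an equality test on `province` to keep the sketch short.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the element names are assumptions.
SAMPLE_XML = """
<people>
  <person><name>Zhang</name><gender>F</gender><age>31</age><province>Sichuan</province></person>
  <person><name>Wang</name><gender>M</gender><age>45</age><province>Yunnan</province></person>
</people>
"""


def second_feature_data(xml_text: str, province: str, fields: list) -> list:
    """Return the preset fields of every record from the given province."""
    results = []
    for person in ET.fromstring(xml_text.strip()):
        # Extract all feature data of the record as key-value pairs.
        record = {child.tag: (child.text or "").strip() for child in person}
        # Attribute expression: %province% = <province>
        if record.get("province") == province:
            results.append({f: record[f] for f in fields if f in record})
    return results
```

Calling `second_feature_data(SAMPLE_XML, "Sichuan", ["name", "gender", "age"])` keeps only the record whose province is Sichuan and drops the province field from the output.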
C4: when the data type result is a JSON type result, calculating the information of the JSON type feature configuration list through a preset attribute expression to obtain a second calculation result, and determining third feature data based on the second calculation result and the preset feature field.
When the data type result is a JSON type result, the feature data is extracted from the information of the JSON type feature configuration list and substituted into the attribute expression for calculation, and, when the attribute expression is satisfied, the extracted feature data is configured as the third feature data according to the preset feature field.
The third feature data indicates the feature data, in the information of the JSON type feature configuration list, that satisfies the attribute expression.
For ease of understanding, the process of calculating the information of the JSON type feature configuration list through the preset attribute expression to obtain a second calculation result, and obtaining third feature data based on the second calculation result and the preset feature field, is illustrated as follows:
For example, suppose the personal information is JSON type data. The information of the JSON type feature configuration list is first calculated through the preset attribute expression; where the preset attribute expression is satisfied, all feature data of the information of the JSON type feature configuration list (name, gender, age and province) is extracted. If the preset feature fields are configured as %province% = Yunnan, name, gender, then the third feature data is the name and gender of every personal information record from Yunnan province.
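The Yunnan example for JSON data can be sketched in the same spirit with Python's `json` module. The sample records, key names and the function name `third_feature_data` are assumptions for illustration, and the input is assumed to be a JSON array of flat objects.

```python
import json

# Hypothetical sample data; the key names are assumptions.
SAMPLE_JSON = json.dumps([
    {"name": "Zhao", "gender": "M", "age": 28, "province": "Yunnan"},
    {"name": "Zhang", "gender": "F", "age": 31, "province": "Sichuan"},
])


def third_feature_data(json_text: str, province: str, fields: list) -> list:
    """Return the preset fields of every record from the given province."""
    return [
        {f: record[f] for f in fields if f in record}
        for record in json.loads(json_text)
        # Attribute expression: %province% = <province>
        if record.get("province") == province
    ]
```

Here `third_feature_data(SAMPLE_JSON, "Yunnan", ["name", "gender"])` selects the Yunnan record and projects it onto the name and gender fields.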
Optionally, one data type result may have multiple pieces of feature data.
For example, after one String type result in the String type feature configuration list has been processed, if the list still contains 2 String type results, the corresponding extraction rules are acquired according to the remaining 2 String type results, and feature data is extracted from the feature configuration list based on those extraction rules.
S104: when the feature data meets the preset condition, generating text abstract data based on a preset abstract rule and the feature data that meets the preset condition; the preset abstract rule is determined by the abstract rule field of the feature configuration list.
In S104, if the first feature data is not null, the first feature data is determined to meet the preset condition; if the second feature data is not null, the second feature data is determined to meet the preset condition; and if the third feature data is not null, the third feature data is determined to meet the preset condition.
The feature configuration list comprises a String type feature configuration list, an XML type feature configuration list and a JSON type feature configuration list.
When the data type result is a String type result, the abstract rule field is a regular expression.
When the data type result is an XML type result or a JSON type result, the abstract rule field is a preset attribute expression. The preset attribute expression supports arithmetic operations (+, -, /) and logical operations (AND, OR), and feature names are wrapped in %. For example, setting the abstract rule field to %A%+%B% indicates that the extracted value of feature A is added to the value of feature B.
The abstract rule field is set by a technician according to the actual situation and is not specifically limited in this application.
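An arithmetic abstract rule such as %A%+%B% could be evaluated as in the sketch below. It is a simplified illustration under assumptions: the function names are hypothetical, all referenced features are assumed to hold numeric values, only +, - and / are handled, and logical operations are left out. The expression is walked with Python's `ast` module instead of being passed to `eval`, so only plain arithmetic is accepted.

```python
import ast
import operator
import re

# Arithmetic symbols assumed for this sketch.
_BINARY = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Div: operator.truediv}


def _evaluate(node):
    """Evaluate a parsed expression made of numbers, +, - and / only."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_evaluate(node.operand)
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
        return _BINARY[type(node.op)](_evaluate(node.left),
                                      _evaluate(node.right))
    raise ValueError("unsupported expression")


def apply_abstract_rule(rule: str, features: dict) -> float:
    """Replace each %name% with its numeric value, then do the arithmetic."""
    filled = re.sub(r"%(\w+)%",
                    lambda m: repr(float(features[m.group(1)])), rule)
    return _evaluate(ast.parse(filled.strip(), mode="eval").body)
```

With features A = 2 and B = 3, the rule %A%+%B% is rewritten to `2.0+3.0` before evaluation, so the feature values are added as the text describes.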
Specifically, the process of generating text abstract data based on the preset abstract rule and the feature data that meets the preset condition, when the feature data meets the preset condition, is shown in D1-D4.
D1: determining whether the feature data (the first feature data, the second feature data, or the third feature data) is null.
D2 is executed when the feature data is the first feature data and the first feature data is not null; D3 is executed when the feature data is the second feature data and the second feature data is not null; and D4 is executed when the feature data is the third feature data and the third feature data is not null.
When the feature data is null, the flow returns to the source data, that is, to the step of acquiring the text data to be processed.
D2: Acquire a first abstract rule field from the String type feature configuration list, and generate first text abstract data based on the first abstract rule field and the first feature data.
The first abstract rule field may be a name field, an age field, etc. The specific first abstract rule field is set by a technician according to actual conditions, and the present application does not specifically limit it.
The first abstract rule field and the first feature data are spliced together, and the spliced result is calculated through a preset message digest algorithm to obtain the first text abstract data.
The preset message digest algorithm may be the MD5 algorithm or the SHA-256 algorithm. The specific preset message digest algorithm is set by a technician according to actual conditions, and the present application does not specifically limit it.
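The splice-then-digest step can be sketched in Python with the standard `hashlib` module. The function name `make_text_abstract`, plain string concatenation as the splicing method, and UTF-8 encoding are all assumptions made for illustration; the description does not specify them.

```python
import hashlib


def make_text_abstract(rule_field, feature_data, algorithm="md5"):
    """Splice the rule field and feature data, then hash the result (hypothetical helper)."""
    # Only the two algorithms named in the description are accepted here.
    if algorithm not in ("md5", "sha256"):
        raise ValueError("unsupported message digest algorithm")
    # Assumption: splicing is plain concatenation, encoded as UTF-8.
    spliced = f"{rule_field}{feature_data}"
    return hashlib.new(algorithm, spliced.encode("utf-8")).hexdigest()
```

For example, `make_text_abstract("name", "Tom")` returns a 32-character hexadecimal MD5 digest, while passing `algorithm="sha256"` returns a 64-character digest.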
D3: Acquire a second abstract rule field from the XML type feature configuration list, and generate second text abstract data based on the second abstract rule field and the second feature data.
The second abstract rule field may be a name field, an age field, etc. The specific second abstract rule field is set by a technician according to actual conditions, and the present application does not specifically limit it.
The second abstract rule field and the second feature data are spliced together, and the spliced result is calculated through the preset message digest algorithm to obtain the second text abstract data.
D4: Acquire a third abstract rule field from the JSON type feature configuration list, and generate third text abstract data based on the third abstract rule field and the third feature data.
The third abstract rule field may be a name field, an age field, etc. The specific third abstract rule field is set by a technician according to actual conditions, and the present application does not specifically limit it.
The third abstract rule field and the third feature data are spliced together, and the spliced result is calculated through the preset message digest algorithm to obtain the third text abstract data.
To make the process of generating text abstract data based on the preset abstract rule and the feature data meeting the preset condition easier to understand, an example follows:
For example, suppose the personal information is JSON type data. The information of the JSON type feature configuration list is first calculated through the preset attribute expression, and for the information that satisfies the preset attribute expression, features such as name, sex, age and province can be extracted. Here the extraction condition is %province% = Shandong and the extracted features are name, sex and age. The name, sex and age of the personal information for Shandong province are therefore extracted as the extraction result, and the name, sex and age features are spliced together and processed with the MD5 algorithm to generate the text abstract data.
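The Shandong example can be sketched in Python as follows. The input layout (a JSON array of objects with the keys `name`, `sex`, `age` and `province`), the function name `abstract_people`, and the splicing order are assumptions made for illustration only.

```python
import hashlib
import json


def abstract_people(json_text, province="Shandong"):
    """Return one MD5 digest per record whose province matches (hypothetical helper)."""
    digests = []
    for person in json.loads(json_text):
        # Extraction condition from the example: %province% = Shandong.
        if person.get("province") != province:
            continue
        # Extracted features: name, sex and age, spliced in that order.
        spliced = "".join(str(person[key]) for key in ("name", "sex", "age"))
        digests.append(hashlib.md5(spliced.encode("utf-8")).hexdigest())
    return digests
```

Records from other provinces are skipped, so only the personal information that satisfies the extraction condition contributes a digest.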
In the embodiment of the application, under the condition of containing complex data structures such as non-special characters, the characteristic extraction processing is carried out on different data types to obtain the respective corresponding characteristic data, so that the requirements of automatic type identification, automatic characteristic extraction, automatic text abstract generation and the like under the complex data structures are met, and the accuracy of acquiring the text abstract data is improved. In addition, the text abstract data is identified by a similarity algorithm, so that the accuracy of a similarity calculation result is improved.
Referring to fig. 2, in the step S102, the process of performing data type recognition on the text data to be processed to obtain a data type result and determining a corresponding feature configuration list based on the data type result mainly includes the following steps:
S201: Perform data type identification on the text data to be processed.
S202: when the data type of the text data to be processed is String type, a String type result is generated, and the text data to be processed is analyzed based on the String type result, so that a String type feature configuration list is obtained.
S203: when the data type of the text data to be processed is XML type, an XML type result is generated, and the text data to be processed is analyzed based on the XML type result to obtain an XML type feature configuration list.
S204: when the data type of the text data to be processed is a JSON type, a JSON type result is generated, and the text data to be processed is analyzed based on the JSON type result to obtain a JSON type feature configuration list.
The execution principle of S201-S204 is identical to that of S102 described above, and reference is made thereto, and details thereof will not be repeated.
In the embodiment of the application, the data type identification is performed on the text data to be processed, when the data type of the text data to be processed is String type, XML type or JSON type, the corresponding type results are generated, and the text data to be processed is analyzed based on the type results, so that the purpose of obtaining the feature configuration list of each type is achieved.
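One way to picture the type identification in S201-S204 is to attempt to parse the text with each structured format and fall back to String when neither parser accepts it. This parse-and-fall-back strategy, the function name `identify_data_type`, and the returned labels are assumptions; the description does not say how the type is recognized.

```python
import json
import xml.etree.ElementTree as ET


def identify_data_type(text):
    """Return 'JSON', 'XML' or 'String' for the given text (hypothetical helper)."""
    stripped = text.strip()
    try:
        parsed = json.loads(stripped)
        # Assumption: bare scalars such as 42 are treated as plain strings,
        # so only JSON objects and arrays count as JSON type data.
        if isinstance(parsed, (dict, list)):
            return "JSON"
    except ValueError:
        pass
    try:
        ET.fromstring(stripped)
        return "XML"
    except ET.ParseError:
        pass
    return "String"
```

The returned label plays the role of the data type result, which then selects the String, XML or JSON type feature configuration list.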
Referring to fig. 3, in the step S103, a process of acquiring a corresponding extraction rule according to a data type result and extracting feature data from a feature configuration list based on the extraction rule mainly includes the following steps:
S301: Judge the data type result. When the data type result is a String type result, S302 is performed; when the data type result is an XML type result, S303 is performed; and when the data type result is a JSON type result, S304 is performed.
S302: Perform regular matching on the information of the String type feature configuration list through a preset regular expression and a preset feature field, and extract first feature data conforming to the regular matching.
S303: Calculate the information of the XML type feature configuration list through a preset attribute expression to obtain a first calculation result, and determine second feature data based on the first calculation result and a preset feature field.
S304: Calculate the information of the JSON type feature configuration list through a preset attribute expression to obtain a second calculation result, and determine third feature data based on the second calculation result and a preset feature field.
The execution principle of S301-S304 is identical to that of S103 described above, and reference is made thereto, and details thereof will not be repeated.
In the embodiment of the application, a data type result is judged, when the data type result is a String type result, information of a String type feature configuration list is subjected to regular matching through a preset regular expression, first feature data which accords with the regular matching is extracted, when the data type result is an XML type result, information of the XML type feature configuration list is calculated through a preset attribute expression, second feature data is obtained, and when the data type result is a JSON type result, information of the JSON type feature configuration list is calculated through a preset attribute expression, and third feature data is obtained. Therefore, the purpose of obtaining the characteristic data corresponding to different types of results according to the different types of results is achieved.
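The three extraction branches can be sketched as a single dispatch function. This is a simplified illustration under stated assumptions: the function name `extract_features` is hypothetical, the regular expression is assumed to use named groups that match the preset feature fields, XML data is assumed to be one flat level of child elements, and the attribute expression is stood in for by a plain Python predicate (`condition`).

```python
import json
import re
import xml.etree.ElementTree as ET


def extract_features(type_result, text, feature_fields, pattern=None, condition=None):
    """Extract the preset feature fields according to the data type result (hypothetical helper)."""
    if type_result == "String":
        if pattern is None:
            raise ValueError("a regular expression is required for String type data")
        match = re.search(pattern, text)
        if match is None:
            return None
        groups = match.groupdict()
        found = {f: groups[f] for f in feature_fields if groups.get(f) is not None}
        return found or None
    if type_result == "XML":
        # Assumption: features are direct child elements of the root.
        values = {child.tag: child.text for child in ET.fromstring(text)}
    elif type_result == "JSON":
        values = json.loads(text)
    else:
        raise ValueError("unknown data type result")
    # The predicate stands in for the preset attribute expression.
    if condition is not None and not condition(values):
        return None
    found = {f: values[f] for f in feature_fields if f in values}
    return found or None
```

Returning `None` when nothing matches corresponds to null feature data, which the later steps treat as not meeting the preset condition.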
Referring to fig. 4, in the step S104, the process of generating text abstract data based on a preset abstract rule and the feature data when the feature data meets a preset condition mainly includes the following steps:
S401: Judge the feature data. When the feature data is the first feature data and the first feature data is not null, S402 is performed; when the feature data is the second feature data and the second feature data is not null, S403 is performed; and when the feature data is the third feature data and the third feature data is not null, S404 is performed.
S402: Acquire a first abstract rule field from the String type feature configuration list, and generate first text abstract data based on the first abstract rule field and the first feature data.
S403: Acquire a second abstract rule field from the XML type feature configuration list, and generate second text abstract data based on the second abstract rule field and the second feature data.
S404: Acquire a third abstract rule field from the JSON type feature configuration list, and generate third text abstract data based on the third abstract rule field and the third feature data.
The execution principle of S401 to S404 is identical to that of S104 described above, and reference is made thereto, and details thereof will not be repeated.
In the embodiment of the application, when the feature data accords with the preset condition, the purpose of generating the text abstract data is achieved based on the preset abstract rule and the feature data.
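The branching in S401-S404 can be sketched as a null check followed by a lookup of the abstract rule field in the configuration list that matches the data type. The dictionary layout `config_lists[type]["abstract_rule_field"]`, the function name `generate_text_abstract`, and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib


def generate_text_abstract(type_result, feature_data, config_lists):
    """Return a digest for non-null feature data, or None when it is null (hypothetical helper)."""
    # Null feature data does not meet the preset condition; the caller is
    # expected to return to the step of acquiring the text data to be processed.
    if not feature_data:
        return None
    # Pick the abstract rule field from the String, XML or JSON type list.
    rule_field = config_lists[type_result]["abstract_rule_field"]
    spliced = rule_field + "".join(str(value) for value in feature_data.values())
    return hashlib.sha256(spliced.encode("utf-8")).hexdigest()
```

Because the three branches differ only in which configuration list supplies the rule field, one lookup keyed by the data type result covers S402, S403 and S404.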
Based on the data processing method disclosed in fig. 1 of the foregoing embodiment, the embodiment of the present application further correspondingly discloses a data processing system, as shown in fig. 5, where the data processing system mainly includes an obtaining unit 501, a determining unit 502, an extracting unit 503, and a generating unit 504.
The obtaining unit 501 is configured to obtain text data to be processed; the text data to be processed is acquired according to user demand.
The determining unit 502 is configured to perform data type identification on text data to be processed, obtain a data type result, and determine a corresponding feature configuration list based on the data type result.
The extracting unit 503 is configured to obtain a corresponding extraction rule according to the data type result, and extract feature data from the feature configuration list based on the extraction rule.
The generating unit 504 is configured to generate text abstract data based on a preset abstract rule and the feature data when the feature data meets a preset condition; the preset abstract rule is determined by the abstract rule field of the feature configuration list.
Further, the determining unit 502 includes an identifying module, a first acquiring module, a second acquiring module, and a third acquiring module.
The identification module is used for carrying out data type identification on the text data to be processed.
The first acquisition module is used for generating a String type result when the data type of the text data to be processed is String type, and analyzing the text data to be processed based on the String type result to obtain a String type feature configuration list.
The second acquisition module is used for generating an XML type result when the data type of the text data to be processed is an XML type, and analyzing the text data to be processed based on the XML type result to obtain an XML type feature configuration list.
The third acquisition module is used for generating a JSON type result when the data type of the text data to be processed is the JSON type, and analyzing the text data to be processed based on the JSON type result to obtain a JSON type feature configuration list.
Further, the extraction unit 503 includes a determination module, a first extraction module, a second extraction module, and a third extraction module.
The judging module is used for judging the data type result.
The first extraction module is used for carrying out regular matching on the information of the String type feature configuration list through a preset regular expression when the data type result is the String type result, and extracting first feature data which accords with the regular matching.
The second extraction module is used for calculating the information of the XML type feature configuration list through a preset attribute expression when the data type result is an XML type result, so as to obtain second feature data.
The third extraction module is used for calculating the information of the JSON type feature configuration list through a preset attribute expression when the data type result is the JSON type result, so as to obtain third feature data.
In the embodiment of the application, under the condition of containing complex data structures such as non-special characters, the characteristic extraction processing is carried out on different data types to obtain the respective corresponding characteristic data, so that the requirements of automatic type identification, automatic characteristic extraction, automatic text abstract generation and the like under the complex data structures are met, and the accuracy of acquiring the text abstract data is improved. In addition, the text abstract data is identified by a similarity algorithm, so that the accuracy of a similarity calculation result is improved.
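The cooperation of the four units can be pictured as a small pipeline object. The class below is only a structural sketch: the class name, the attribute names and the use of plain callables for units 501 to 504 are assumptions, and returning `None` is used here to signal that the caller should go back to acquiring text data.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DataProcessingSystem:
    """Wires together stand-ins for units 501-504 (hypothetical structure)."""

    obtain: Callable[[], str]                       # obtaining unit 501
    determine: Callable[[str], str]                 # determining unit 502
    extract: Callable[[str, str], Optional[dict]]   # extracting unit 503
    generate: Callable[[str, dict], str]            # generating unit 504

    def run(self) -> Optional[str]:
        text = self.obtain()
        type_result = self.determine(text)
        features = self.extract(type_result, text)
        # Null feature data: no text abstract data is generated.
        if not features:
            return None
        return self.generate(type_result, features)
```

Keeping each unit as an injected callable mirrors the description's point that the units may or may not be physically separate.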
The embodiment of the application also provides a storage medium, which comprises stored instructions, wherein the instructions, when run, control the device where the storage medium is located to execute the data processing method.
The embodiment of the present application further provides an electronic device, whose structural schematic diagram is shown in fig. 6, specifically including a memory 601 and one or more instructions 602, where the one or more instructions 602 are stored in the memory 601 and configured to be executed by one or more processors 603 to perform the above-mentioned data processing method.
The specific implementation and derivative manner of each embodiment are all within the protection scope of the application.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for a system or system embodiment, since it is substantially similar to a method embodiment, the description is relatively simple, with reference to the description of the method embodiment being made in part. The systems and system embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing is merely a preferred embodiment of the present application and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present application and are intended to be comprehended within the scope of the present application.

Claims (8)

1. A method of data processing, the method comprising:
acquiring text data to be processed; the text data to be processed is acquired according to the user demand;
performing data type identification on the text data to be processed to obtain a data type result, and determining a corresponding feature configuration list based on the data type result;
acquiring a corresponding extraction rule according to the data type result, and extracting feature data from the feature configuration list based on the extraction rule;
when the characteristic data accords with the preset condition, generating text abstract data based on a preset abstract rule and the characteristic data which accords with the preset condition; the preset abstract rule is determined by an abstract rule field of the feature configuration list;
the obtaining the corresponding extraction rule according to the data type result, and extracting the feature data from the feature configuration list based on the extraction rule, includes:
judging the data type result;
when the data type result is a String type result, carrying out regular matching on information of a String type feature configuration list through a preset regular expression and a preset feature field, and extracting first feature data conforming to the regular matching;
when the data type result is an XML type result, calculating information of an XML type feature configuration list through a preset attribute expression to obtain a first calculation result, and determining second feature data based on the first calculation result and the preset feature field;
and when the data type result is a JSON type result, calculating the information of the JSON type feature configuration list through the preset attribute expression to obtain a second calculation result, and determining third feature data based on the second calculation result and the preset feature field.
2. The method according to claim 1, wherein the performing data type recognition on the text data to be processed to obtain a data type result, and determining the corresponding feature configuration list based on the data type result, includes:
carrying out data type identification on the text data to be processed;
when the data type of the text data to be processed is a String type, a String type result is generated, and the text data to be processed is analyzed based on the String type result to obtain a String type feature configuration list;
when the data type of the text data to be processed is XML type, generating an XML type result, and analyzing the text data to be processed based on the XML type result to obtain an XML type feature configuration list;
when the data type of the text data to be processed is a JSON type, a JSON type result is generated, and the text data to be processed is analyzed based on the JSON type result to obtain a JSON type feature configuration list.
3. The method of claim 1, wherein generating text summary data based on a preset summary rule and feature data meeting a preset condition when the feature data meets a preset condition comprises:
when the first characteristic data is not null, a first abstract rule field is obtained from the String type characteristic configuration list, and first text abstract data is generated based on the first abstract rule field and the first characteristic data;
when the second feature data is not null, a second abstract rule field is obtained from the XML type feature configuration list, and second text abstract data is generated based on the second abstract rule field and the second feature data;
and when the third feature data is not null, acquiring a third abstract rule field from the JSON type feature configuration list, and generating third text abstract data based on the third abstract rule field and the third feature data.
4. The method as recited in claim 1, further comprising:
and if the characteristic data is null, returning to the step of acquiring the text data to be processed.
5. A data processing system, the system comprising:
the acquisition unit is used for acquiring text data to be processed; the text data to be processed is acquired according to the user demand;
the determining unit is used for carrying out data type identification on the text data to be processed to obtain a data type result, and determining a corresponding feature configuration list based on the data type result;
the extraction unit is used for acquiring corresponding extraction rules according to the data type result and extracting feature data from the feature configuration list based on the extraction rules;
the generating unit is used for generating text abstract data based on a preset abstract rule and feature data meeting preset conditions when the feature data meets the preset conditions; the preset abstract rule is determined by an abstract rule field of the feature configuration list;
the extraction unit includes:
the judging module is used for judging the data type result;
the first extraction module is used for carrying out regular matching on the information of the String type feature configuration list through a preset regular expression and a preset feature field when the data type result is the String type result, and extracting first feature data conforming to the regular matching;
the second extraction module is used for calculating information of an XML type feature configuration list through a preset attribute expression to obtain a first calculation result when the data type result is an XML type result, and determining second feature data based on the first calculation result and the preset feature field;
and the third extraction module is used for calculating the information of the JSON type feature configuration list through the preset attribute expression to obtain a second calculation result when the data type result is the JSON type result, and determining third feature data based on the second calculation result and the preset feature field.
6. The system according to claim 5, wherein the determining unit comprises:
the identification module is used for carrying out data type identification on the text data to be processed;
the first acquisition module is used for generating a String type result when the data type of the text data to be processed is String type, and analyzing the text data to be processed based on the String type result to obtain a String type feature configuration list;
the second acquisition module is used for generating an XML type result when the data type of the text data to be processed is an XML type, and analyzing the text data to be processed based on the XML type result to obtain an XML type feature configuration list;
and the third acquisition module is used for generating a JSON type result when the data type of the text data to be processed is the JSON type, and analyzing the text data to be processed based on the JSON type result to obtain a JSON type feature configuration list.
7. A storage medium comprising stored instructions, wherein the instructions, when executed, control a device in which the storage medium is located to perform the data processing method of any one of claims 1 to 4.
8. An electronic device comprising a memory and one or more instructions, wherein the one or more instructions are stored in the memory and configured to be executed by one or more processors to perform the data processing method according to any one of claims 1 to 4.
CN202111087141.8A 2021-09-16 2021-09-16 Data processing method, system, storage medium and electronic equipment Active CN113743082B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111087141.8A CN113743082B (en) 2021-09-16 2021-09-16 Data processing method, system, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111087141.8A CN113743082B (en) 2021-09-16 2021-09-16 Data processing method, system, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN113743082A CN113743082A (en) 2021-12-03
CN113743082B true CN113743082B (en) 2024-04-05

Family

ID=78739337

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111087141.8A Active CN113743082B (en) 2021-09-16 2021-09-16 Data processing method, system, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113743082B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114416092A (en) * 2022-03-29 2022-04-29 北京锐融天下科技股份有限公司 Data processing method and device, electronic equipment and storage medium
CN116842394A (en) * 2023-09-01 2023-10-03 苏州高视半导体技术有限公司 Algorithm parameter file generation method, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016060551A1 (en) * 2014-10-13 2016-04-21 Kim Seng Kee A method for mining electronic documents and system thereof
KR20180084580A (en) * 2017-01-17 2018-07-25 경북대학교 산학협력단 Device and method to generate abstractive summaries from large multi-paragraph texts, recording medium for performing the method
CN110287272A (en) * 2019-06-27 2019-09-27 南京冰鉴信息科技有限公司 A kind of configurable real-time feature extraction method, apparatus and system
CN112613293A (en) * 2020-12-29 2021-04-06 北京中科闻歌科技股份有限公司 Abstract generation method and device, electronic equipment and storage medium
CN112800194A (en) * 2021-01-15 2021-05-14 亿企赢网络科技有限公司 Interface change identification method, device, equipment and storage medium
CN112925749A (en) * 2021-02-20 2021-06-08 北京同邦卓益科技有限公司 Data processing method and device, electronic equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8892579B2 (en) * 2012-04-26 2014-11-18 Anu Pareek Method and system of data extraction from a portable document format file
US20140025650A1 (en) * 2012-07-18 2014-01-23 Microsoft Corporation Abstract relational model for transforming data into consumable content

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
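XML文本自动文摘研究综述 [A survey of automatic summarization of XML text]; 刘德喜; 吴世汉; 万常选; 计算机应用研究 [Application Research of Computers] (No. 11); full text *
基于网络爬虫的内容资源评价研究 [Research on content resource evaluation based on web crawlers]; 胡博; 中国优秀硕士学位论文全文数据库 信息科技辑 [China Master's Theses Full-text Database, Information Science and Technology] (No. 11); full text *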

Also Published As

Publication number Publication date
CN113743082A (en) 2021-12-03

Similar Documents

Publication Publication Date Title
US10620945B2 (en) API specification generation
US8595615B2 (en) System and method for automatic stylesheet inference
US20220121813A1 (en) Web Element Rediscovery System and Method
CN102662966B (en) Method and system for obtaining subject-oriented dynamic page content
CN113743082B (en) Data processing method, system, storage medium and electronic equipment
US10489493B2 (en) Metadata reuse for validation against decentralized schemas
US20080243905A1 (en) Attribute extraction using limited training data
CN105205080B (en) Redundant file method for cleaning, device and system
US8938668B2 (en) Validation based on decentralized schemas
US7904406B2 (en) Enabling validation of data stored on a server system
CN114817811B (en) Website analysis method and device
CN113392356A (en) File adaptation method, device and storage medium
US9390073B2 (en) Electronic file comparator
CN114398138A (en) Interface generation method and device, computer equipment and storage medium
CN109408577B (en) ORACLE database JSON analysis method, system, device and storable medium
CN114626337B (en) System, method, and medium for generating style sheets during runtime
CN109241501A (en) Document analysis method and apparatus
CN105677827B (en) A kind of acquisition methods and device of list
US10185706B2 (en) Generating web browser views for applications
CN115146070A (en) Key value generation method, knowledge graph generation method, device, equipment and medium
US20150324333A1 (en) Systems and methods for automatically generating hyperlinks
CN114372265A (en) Malicious program detection method and device, electronic equipment and storage medium
US20130311489A1 (en) Systems and Methods for Extracting Names From Documents
WO2024197728A1 (en) Method and apparatus for determining similarity between webpages, method and apparatus for identifying network assets, and device and medium
CN115150349B (en) Message processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant