CN118886008A

CN118886008A - Method, device, electronic equipment and medium for extracting sensitive information in application program package

Info

Publication number: CN118886008A
Application number: CN202410915441.8A
Authority: CN
Inventors: 向瑜强; 陈晖�; 刘海飞; 王凯明; 张雨; 王世和; 张虎; 林港; 黄锐峰; 翟泽雨; 郭淑香; 黄浩楠
Original assignee: Shanghai Honglian Network Technology Co ltd
Current assignee: Shanghai Honglian Network Technology Co ltd
Priority date: 2024-07-09
Filing date: 2024-07-09
Publication date: 2024-11-01

Abstract

The application provides a method, a device, electronic equipment and a medium for extracting sensitive information in an application program package. The method comprises the following steps: identifying multiple types of files in the target application package, including application code files and a manifest file; reading the package name and the application label of the application program from the manifest file; determining the range of character strings to be analyzed in the application code files according to the package name and the application label; and parsing the determined character strings to identify and extract the sensitive information therein. Key sensitive information can thus be extracted efficiently from the application package, which improves analysis efficiency and response speed. By comprehensively analyzing the file types and characteristics in the application package, the method provides accurate extraction results and reduces false positives and missed detections.

Description

Method, device, electronic equipment and medium for extracting sensitive information in application program package

Technical Field

The disclosure relates to the technical field of information network security, and in particular relates to a method, a device, electronic equipment and a medium for extracting sensitive information in an application program package.

Background

The widespread adoption of mobile devices and the explosive growth of applications greatly facilitate people's daily lives, but they have also brought a rise in new criminal activities. These activities are no longer limited to traditional offline modes; instead they exploit vulnerabilities in applications and leaked user information to carry out a range of technically sophisticated illegal acts, including, but not limited to, phishing and information theft. Because such crimes often involve complex technical means and encryption methods, they greatly increase the difficulty of investigation and evidence collection.

Currently, the investigation of such crimes faces significant challenges. On the one hand, the criminal activities are highly concealed: the criminals usually have a technical background and use means such as anonymous networks and encrypted communication to hide their identities and locations, so that tracking and evidence collection become extremely difficult. On the other hand, criminal programs are often designed with self-destruction mechanisms or become inaccessible after a short time, resulting in rapid loss of key evidence. Front-line police officers often lack the necessary specialist training and experience, and when faced with complex and varied forms of cybercrime they have difficulty identifying, analyzing and processing the relevant sensitive information within a short period of time. A large number of cases are therefore backlogged and concentrated in the hands of a few specialists, which increases their workload and reduces overall case-handling efficiency.

Disclosure of Invention

In view of the above-mentioned drawbacks of the prior art, an object of the present disclosure is to provide a method, an apparatus, an electronic device, and a medium for extracting sensitive information in an application package, which aim to improve the recognition speed and accuracy of the sensitive information in the application package, and help an analyst save time and effort.

A first aspect of the present disclosure provides a method for extracting sensitive information in an application package, including: identifying multiple types of files in the target application package, including application code files and manifest files; reading the package name and the application label of the application program from the manifest file; determining the range of character strings to be analyzed in the application code file according to the application package name and the application label; and analyzing the determined character string to be analyzed to identify and extract the sensitive information therein.

In an embodiment of the first aspect, the sensitive information includes one or more of the following: API keys, IP addresses, URL links, mailbox addresses, user credentials, encrypted data, system commands, and hash values.

In an embodiment of the first aspect, parsing the determined character string to be analyzed to identify and extract the sensitive information therein includes: reading the class name of the character string to be analyzed; judging, based on the class name, whether the character string to be analyzed is a suspicious character string; if yes, analyzing the specific instruction that defines the suspicious character string in the application code file to determine the purpose of the suspicious character string; judging, according to the purpose of the suspicious character string, whether it is a sensitive character string; and if yes, extracting the sensitive information in the sensitive character string.

In an embodiment of the first aspect, determining, according to the purpose of the suspicious character string, whether it is a sensitive character string includes: checking the format of the suspicious character string against the predefined sensitive information types; and determining whether the suspicious character string is a sensitive character string based on the result of the check.

In an embodiment of the first aspect, the determining, based on the verification result, whether the suspicious string is a sensitive string includes: if the format check is passed, judging that the suspicious character string is a sensitive character string; and recording the content of the sensitive character string and the corresponding sensitive information type.

In an embodiment of the first aspect, the determining, based on the verification result, whether the suspicious string is a sensitive string includes: if the format check is not passed, judging whether the suspicious character string is encrypted or not; if yes, decrypting the suspicious character string, and performing format verification again; if the decrypted format check passes, judging that the suspicious character string is a sensitive character string, and recording the content of the sensitive character string, the used encryption algorithm and the sensitive information type.

In an embodiment of the first aspect, extracting the sensitive information in the sensitive character string includes: determining the name of the variable to which the sensitive character string is assigned; and extracting the sensitive information in the sensitive character string based on the variable name.

In an embodiment of the first aspect, extracting the sensitive information in the sensitive character string based on the variable name includes: performing word segmentation on the variable name; and extracting the sensitive information in the sensitive character string based on the segmented variable name.

A second aspect of the present disclosure discloses an apparatus for extracting sensitive information in an application package, including: the identification module is used for identifying various types of files in the target application program package, including application program code files and manifest files; the reading module is used for reading the package name and the application label of the application program from the manifest file; the range determining module is used for determining the range of the character string to be analyzed in the application code file according to the application package name and the application label; and the extraction module is used for analyzing the determined character string to be analyzed so as to identify and extract the sensitive information in the character string.

A third aspect of the present disclosure discloses an electronic device, the electronic device comprising: a processor and a memory; wherein the memory is used for storing a computer program; the processor is configured to execute the computer program stored in the memory, so that the electronic device performs the method for extracting sensitive information in an application package according to any one of the first aspect.

A fourth aspect of the present disclosure discloses a computer-readable storage medium, on which a computer program is stored, which program, when executed by an electronic device, implements the method for extracting sensitive information in an application package according to any one of the first aspects.

As described above, the method, apparatus, electronic device and medium for extracting sensitive information in an application package provided by the present disclosure at least include the following technical effects:

(1) The key sensitive information such as the API key, the IP address, the URL link and the like can be efficiently extracted from the application program package, and the analysis efficiency and the response speed are greatly improved.

(2) By comprehensively analyzing various file types and characteristics in the application program package, accurate sensitive information extraction results can be provided, and misjudgment and missed judgment conditions are reduced.

(3) The method simplifies the process of extracting the sensitive information and lowers the technical skill and experience required of staff, so that non-specialists can get started quickly and carry out the extraction task effectively.

(4) Application programs containing sensitive information can be identified quickly, which helps an analyst to promptly detect and crack down on criminal activities carried out through mobile applications, and improves the efficiency and accuracy of combating crime.

Drawings

FIG. 1 shows a flow diagram of a method for extracting sensitive information in an application package in an embodiment of the present disclosure.

FIG. 2 shows a flow chart of parsing a character string and extracting sensitive information in an embodiment of the disclosure.

FIG. 3 shows a flow chart for determining a sensitive character string in an embodiment of the disclosure.

FIG. 4 shows a flow chart for determining a sensitive character string in another embodiment of the disclosure.

FIG. 5 shows a flow diagram of a method for extracting sensitive information in an application package in another embodiment of the present disclosure.

FIG. 6 shows a schematic diagram of a specific example of the present disclosure.

FIG. 7 shows a schematic diagram of an apparatus for extracting sensitive information in an application package in an embodiment of the disclosure.

FIG. 8 shows a schematic circuit structure of an electronic device in an embodiment of the disclosure.

Detailed Description

Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes the embodiments of the present invention with reference to specific examples. The invention may also be practiced or applied in other, different embodiments, and the details of the present description may be modified or varied without departing from the spirit and scope of the present invention. It should be noted that the following embodiments and the features in the embodiments may be combined with each other where no conflict arises.

It should be noted that the illustrations provided in the following embodiments merely show the basic concept of the present invention schematically. The drawings show only the components related to the present invention and are not drawn according to the number, shape and size of the components in an actual implementation; in an actual implementation the form, number and proportion of the components may be changed arbitrarily, and the layout of the components may be more complicated.

Today, mobile applications have become an important part of daily life, but they also present new security challenges. Criminals increasingly use these applications to conduct illegal activities such as phishing and data theft, which seriously threaten user privacy and property security. In the prior art, security measures focus mainly on after-the-fact analysis of data in an intelligent terminal: after suspicious behavior occurs, security personnel collect and analyze the data in the terminal to identify and track criminal suspects. This reactive strategy has significant drawbacks. Action can be taken only after a crime has occurred, so the response is delayed and the criminal consequences have already arisen. In addition, collecting and preserving evidence of the crime becomes more difficult and consumes a lot of resources, which is inefficient. More importantly, existing methods may violate user privacy and have difficulty dealing with criminals who use high-tech means.

The present disclosure aims to nip crime in the bud, i.e. to actively identify and crack down on applications used for crime by monitoring and analyzing the sensitive information in mobile applications in real time. The method can reduce the occurrence of criminal cases and improve the overall security of society. Compared with the prior art, the preventive measure of the present disclosure enables early discovery and rapid intervention, and effectively overcomes the shortcomings of after-the-fact handling.

The following describes the technical solution in the embodiment of the present invention in detail with reference to the drawings in the embodiment of the present invention.

FIG. 1 shows a flowchart of a method for extracting sensitive information from an application package according to a first embodiment of the present disclosure. The method includes steps S10-S40, wherein,

Step S10: multiple types of files in the target application package are identified, including application code files and manifest files.

Specifically, before sensitive information can be extracted from an application package (APK file), the APK file needs to be decompressed so that the files inside it can be accessed. The decompression may be accomplished with Java's ZipInputStream class, which can read an APK file and decompress its contents into a temporary directory. In the decompressed directory, all files may be traversed and their extensions or file headers checked to determine the type of each file. For example, different types of files may be identified by examining file extensions (e.g., .dex, .html, .js, .json) or file headers (e.g., specific magic numbers).
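The identification step above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: it uses Python's zipfile module in place of the Java ZipInputStream class named in the text, reads entries in memory instead of unpacking to a temporary directory, and the function name classify_entries and the extension-to-category map are assumptions made for the example.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Assumed mapping; the text only lists these extensions as examples.
EXTENSION_TYPES = {".dex": "code", ".html": "script", ".js": "script", ".json": "script"}
DEX_MAGIC = b"dex\n"  # first four bytes of a DEX file header


def classify_entries(apk_bytes):
    """Return {entry name: category} for every file in an APK (a ZIP archive)."""
    result = {}
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        for info in apk.infolist():
            if info.is_dir():
                continue
            name = info.filename
            with apk.open(info) as handle:
                head = handle.read(4)  # enough to compare against the DEX magic
            if name == "AndroidManifest.xml":
                category = "manifest"
            elif head == DEX_MAGIC:
                category = "code"  # header check applies even without a .dex suffix
            else:
                suffix = PurePosixPath(name).suffix.lower()
                category = EXTENSION_TYPES.get(suffix, "other")
            result[name] = category
    return result
```

Checking the header as well as the extension means a DEX payload stored under a misleading name is still classified as code.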

For application code files, i.e. files containing executable code such as DEX files (the bytecode of Android applications), a code restoration operation is required. Code restoration is the process of converting compiled code back to source code. This typically involves decompilation tools such as dex2jar and JD-GUI: the DEX file is first converted into a JAR file, which is then converted into Java source code using a decompiler. With the source code restored, the logic and functionality of the application can be understood more clearly, which makes it easier to identify and extract sensitive information.

For files containing scripts or structured data, such as HTML, JavaScript and JSON files, the content can be viewed directly to determine their structure and function. These files may contain information such as the user interface elements of the application, its interaction logic, and its data exchange formats. By analyzing these files, the interaction pattern and data processing flow of the application can be understood, so that sensitive information related to user data can be identified and extracted.

In addition to code and script files, APK files may also contain other types of resource files, such as images, audio, and video. These resource files are typically located under the res directory, and their location and use can be determined by analyzing the manifest file (AndroidManifest.xml). The manifest file is one of the most important files in the APK file and contains the basic information and permission declarations of the application. By parsing the manifest file, information such as the name, version number, and required permissions of the application can be obtained. In addition, the main components of the application (such as activities, services, and broadcast receivers) and their configuration information may be extracted from it. Such information helps in understanding the overall structure and functionality of the application, and thus in identifying and extracting sensitive information.

In some embodiments, the sensitive information includes one or more of the following: API keys, IP addresses, URL links, mailbox addresses, user credentials, encrypted data, system commands, and hash values.

In particular, in an application package, sensitive information refers to information that may be utilized by a malicious attacker to gain improper benefit or to compromise system security. The following are some common types of sensitive information and detailed descriptions thereof:

API key: an API key is a credential for accessing a particular service, typically provided by a service provider. They may be used to access cloud services, third party libraries, databases, etc. Revealing API keys may result in unauthorized data access, misuse of service resources, or execution of malicious operations. In an application package, the API key may be stored in plain text or encrypted form, requiring careful analysis to determine its security and potential risk.

IP address: the IP address is an identifier assigned to a network device for communication over the network. If the application package contains an IP address, the location of the server and network topology may be exposed, increasing the risk of attack. Leakage of IP addresses may lead to network attacks, data leakage, or other security issues.

URL link: URL links point to resources on a network, such as web pages, files, or API endpoints. If URL links are included in the application package, an attacker can use these links to access sensitive data or perform malicious operations. Leakage of URL links may lead to data leakage, malware propagation, or other security issues.

Mailbox address: mailbox addresses are one of the personal contact information of users, typically used to receive notifications, verify identities, etc. Revealing mailbox addresses may result in spam, phishing attacks, or other forms of fraud. Disclosure of mailbox addresses may lead to privacy disclosure, account hijacking, or other security issues.

User credentials: the user credentials include a user name, password, token, session ID, etc. for verifying the user identity and authorizing access to a particular resource. Revealing user credentials may result in unauthorized access, account hijacking, or other security issues. In application packages, user credentials may be stored in plain text or encrypted form, requiring careful analysis to determine their security and potential risk.

Encrypted data: encrypted data refers to data that is protected using an encryption algorithm. If the application package contains the key or the decryption logic for the encrypted data, an attacker may be able to decrypt the data and access the information in it. Exposure of encrypted data may therefore result in information leakage or other security issues.

System command: system commands are a set of functions provided by the operating system for performing a particular task. If system commands are included in the application package, an attacker may be able to use these commands to perform malicious operations, such as deleting files, modifying system settings, etc. Leakage of system commands may lead to system corruption, data loss, or other security issues.

Hash value: a hash value is the fixed-length output obtained by applying a specific algorithm to data of arbitrary length. The computation is irreversible, i.e. the original data cannot be directly restored from the hash value. However, if hash values are included in the application package and are generated from sensitive information (e.g., passwords or keys), an attacker may attempt to crack them by various means to recover the original data. Hash values are commonly used in applications to store user passwords, API keys, sensitive configuration information, etc., so that such data is not exposed directly. If an attacker obtains a hash value, they may try to crack it using a hash collision attack, a rainbow table attack, or a brute-force attack in order to obtain the original sensitive information.
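Several of the types listed above have a recognizable format, which suggests a simple format-based classifier. The sketch below is only an illustration under stated assumptions: the function name guess_type, the email regular expression, and the use of hexadecimal digest lengths (32, 40, 64 characters) as a hint for hash values are choices made for this example and are not taken from the disclosure.

```python
import ipaddress
import re
from urllib.parse import urlparse

EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")
HEX_RE = re.compile(r"^[0-9a-fA-F]+$")
HASH_HEX_LENGTHS = {32, 40, 64}  # hex lengths of MD5, SHA-1, SHA-256 digests


def guess_type(value):
    """Return a coarse sensitive-information type for a string, or None."""
    try:
        ipaddress.ip_address(value)  # accepts IPv4 and IPv6 literals
        return "ip_address"
    except ValueError:
        pass
    try:
        parsed = urlparse(value)
    except ValueError:
        parsed = None
    if parsed and parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    if EMAIL_RE.match(value):
        return "email"
    if HEX_RE.match(value) and len(value) in HASH_HEX_LENGTHS:
        return "hash"
    return None
```

A length-based hash check is only a heuristic: any 32-character hexadecimal identifier would match, so a result of "hash" should be treated as a candidate rather than a confirmed finding.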

Step S20: the package name and the application label of the application program are read from the manifest file.

Specifically, the manifest file (AndroidManifest.xml) is the core configuration file of an Android application and contains the metadata, permission declarations, component definitions, and behavior rules of the application. When parsing the manifest file, a Java XML parser may be used to read the nodes and attributes in the XML file and convert them into a data structure that can be operated on.

To parse an XML file, an XML parser available in Java, such as SAX (Simple API for XML), DOM (Document Object Model), or StAX (Streaming API for XML), may be used. These parsers can read the nodes and attributes in an XML file and convert them into Java objects or data structures for further manipulation and analysis.

When parsing the manifest file, specific nodes and attributes must be located in order to extract the package name and the application label. The package name is the unique identifier of the application and is typically specified by the developer when the application is created; a typical Android package name begins with a reverse-domain prefix such as "com.". The package name is normally stored in the package attribute of the root <manifest> node, in the form package="com. ...". The application label usually describes the function or purpose of the application and is stored in the android:label attribute of the <application> node. For example, an email application may have a label such as "Email", and a social media application may have a label such as "Social Networking".

The manifest file is parsed with an XML parser, the <manifest> and <application> nodes are located, the package name is extracted from the package attribute of the <manifest> node, and the application label is extracted from the android:label attribute of the <application> node.
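Reading these two attributes can be sketched as follows. The example uses Python's xml.etree.ElementTree instead of the Java SAX/DOM/StAX parsers named above, and it assumes the manifest has already been decoded to plain-text XML (inside an APK the manifest is stored in a compiled binary form). The function name read_package_and_label is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URI bound to the "android:" prefix in manifest files.
ANDROID_NS = "http://schemas.android.com/apk/res/android"


def read_package_and_label(manifest_xml):
    """Return (package name, application label) from decoded manifest text."""
    root = ET.fromstring(manifest_xml)
    if root.tag != "manifest":
        raise ValueError("root element is not <manifest>")
    package = root.get("package")
    application = root.find("application")
    label = None
    if application is not None:
        # ElementTree addresses namespaced attributes as "{uri}name".
        label = application.get("{%s}label" % ANDROID_NS)
    return package, label
```

In real manifests the label is often a resource reference such as "@string/app_name" rather than literal text, in which case it would additionally have to be resolved against the string resources of the package.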

According to the extracted package name and application label, the range of the character strings to be analyzed can be determined. For example, if the label of an application indicates that it is a social application, the code portions associated with social functions deserve closer attention.

This step may be combined with other analysis techniques, such as code restoration and script parsing, to comprehensively analyze the sensitive information in the application. For example, a decompilation tool may be used to convert a DEX file into a JAR file, which is then converted into Java source code using a decompiler, so that the logic and functionality of the application can be understood more clearly.

Step S30: the range of the character strings to be analyzed in the application code file is determined according to the package name and the application label.

In particular, a file system search is a critical part of extracting sensitive information from an application package. Its purpose is to locate all files associated with a given package name, which may include the source code files, configuration files, and resource files of the application. The search may be implemented with a command line tool (e.g., the find command) or with the file manipulation functions of a programming language.

After the relevant files have been found, the files to be analyzed can be further screened according to the application label. For example, if the application label is "Email", files containing the keyword "Email" may be selected for further analysis. The screening may be implemented with text search algorithms or regular expressions to improve the accuracy and efficiency of the search.
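A minimal sketch of this two-stage selection is given below. It assumes the code has already been restored to .java files laid out in directories that mirror the package name, and that matching the label case-insensitively against the file name and file text is an adequate screening rule; the function name select_files is hypothetical.

```python
import re
from pathlib import Path


def select_files(source_root, package_name, label):
    """Return .java files under the package directory that mention the label."""
    # "com.example.mail" maps to <source_root>/com/example/mail
    package_dir = Path(source_root).joinpath(*package_name.split("."))
    if not package_dir.is_dir():
        return []
    keyword = re.compile(re.escape(label), re.IGNORECASE)
    selected = []
    for path in sorted(package_dir.rglob("*.java")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if keyword.search(path.name) or keyword.search(text):
            selected.append(path)
    return selected
```

Restricting the walk to the package directory is what keeps bundled third-party libraries out of the candidate set before any text matching is done.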

Within the screened files, specific string ranges are located; these may include variable names, method names, comments, and the like. The specific string types that need attention are determined from the application label. For example, if the application label is "Email", attention may be focused on strings associated with email, such as email addresses and subjects.

The located string ranges are then analyzed in depth to identify and extract the sensitive information in them. For example, variable names may be checked for sensitive keywords such as "password" or "apiKey". Method names and comments may likewise be checked for sensitive indicators such as "getEmail" or "sendEmail".

In the above embodiment, by reading the application manifest file, the classes and resources that may be related to the sensitive information are quickly identified. The step utilizes the structure information of the application program, avoids checking the whole application program package one by one, and greatly improves the speed of initial screening.

Step S40: the determined character strings to be analyzed are parsed to identify and extract the sensitive information therein.

Specifically, after determining the ranges of strings, the system may perform an in-depth analysis on strings within those ranges. The system parses the strings in the application code file, including their assignment and use cases. By analyzing the assignment and usage of strings, it can be determined whether they contain sensitive information such as API keys, user credentials, etc. For example, if a string is assigned to a variable named "apiKey" and that variable is used during the login process of an application, then this string is likely to be an API key.
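The assignment part of this analysis can be illustrated with a small sketch. It assumes restored Java source text and only recognizes simple declarations of the form String name = "literal"; it does not track how the variable is later used. The keyword list and the function name find_sensitive_assignments are assumptions for the example.

```python
import re

# Matches simple Java declarations such as:  String apiKey = "abc";
ASSIGN_RE = re.compile(r'\bString\s+(\w+)\s*=\s*"((?:[^"\\]|\\.)*)"')
SENSITIVE_WORDS = ("password", "apikey", "secret", "token")  # assumed list


def find_sensitive_assignments(java_source):
    """Return (variable name, literal) pairs whose name contains a sensitive word."""
    hits = []
    for name, value in ASSIGN_RE.findall(java_source):
        lowered = name.lower()
        if any(word in lowered for word in SENSITIVE_WORDS):
            hits.append((name, value))
    return hits
```

A regular expression is enough for this narrow pattern, but string concatenation, constants built at runtime, or obfuscated names would need a real parser.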

By analyzing both the assignment and the usage of a string, the system can identify and extract sensitive information such as API keys and user credentials more accurately. For example, if a string is used as the URL of a web request or as a file name, that usage may further confirm its sensitivity.

In some embodiments, as shown in fig. 2, the step S40 includes steps S41-S45, wherein,

Step S41: the class name of the character string to be analyzed is read.

Specifically, in the application code file, each string is typically associated with a particular class. Therefore, it is first necessary to read the class name to which the character string to be analyzed belongs. This may be accomplished by parsing the structure of the code file, for example using a reflection mechanism in the programming language or a parser library to obtain the class name where the string is located.

In order to determine the class to which a string belongs, the system needs to parse the structure of the code file. A code file is typically written in a programming language and processed by its compiler or interpreter, and it contains elements such as classes, methods, and variables. The system may parse this structure using a reflection mechanism or a parser library provided for the programming language.

The reflection mechanism is a mechanism provided by the programming language that allows a program to check and modify its own structure at runtime. A parser library is a set of functions or classes for parsing a particular type of file or data format. By using a reflection mechanism or a parser library, the system can obtain the class name where the character string is located.

After parsing the structure of the code file, the system may read the class name of the string to be parsed. For example, if the string to be analyzed appears in a method of a class named "MyClass", the system may read the class name "MyClass" to which the string belongs.

Step S42: whether the character string to be analyzed is a suspicious character string is judged based on the class name.

In particular, a series of predefined patterns or rules may be used to examine a class name to determine if it contains potentially sensitive information. These patterns or rules may include specific keywords, regular expressions, or other matching logic. For example, a list containing sensitive words, such as "password", "secret", etc., may be created. The class name to be analyzed is then compared to each keyword in this list. If any sensitive vocabulary is included in the class name, it can be marked as a suspicious string.

Another approach is to define more complex patterns using regular expressions. For example, a regular expression may be written to find a string containing a combination of numbers and letters, which may suggest the presence of a password or key. By applying these patterns, suspicious strings can be more accurately identified.

In addition, other factors may be considered to determine whether the class name is suspicious. For example, the length, character distribution, or other characteristics of the class name may be checked to further confirm whether it has potentially sensitive information.
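The keyword and regular-expression checks described above might look like the following sketch. The word list, the 16-character threshold for a mixed letter-and-digit name, and the function name is_suspicious_class are assumptions chosen for illustration, not values given in the disclosure.

```python
import re

SENSITIVE_WORDS = ("password", "secret", "apikey", "token", "credential")  # assumed
# A long run made up only of letters and digits, containing at least one of each.
MIXED_RE = re.compile(r"^(?=.*[A-Za-z])(?=.*\d)[A-Za-z\d]{16,}$")


def is_suspicious_class(class_name):
    """Return True if the simple class name matches a keyword or the mixed pattern."""
    simple = class_name.rsplit(".", 1)[-1]  # drop the package prefix
    lowered = simple.lower()
    if any(word in lowered for word in SENSITIVE_WORDS):
        return True
    return bool(MIXED_RE.match(simple))
```

For instance, is_suspicious_class("com.example.PasswordManager") is flagged by the keyword list, while an ordinary name such as "com.example.MainActivity" passes both checks.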

Once it is determined that the class name is suspicious, step S43 may be performed.

Step S43: the specific instructions that define the suspicious character string in the application code file are analyzed to determine the purpose of the suspicious character string.

In particular, a parser library of a programming language may be used to analyze the code file and extract relevant information. By analyzing these instructions, the purpose and context of the suspicious string in the application can be understood.

First, a parser library appropriate for the target programming language is selected. These libraries typically provide lexical analysis and parsing of source code, so that the different elements in the code can be accurately identified and processed. For example, for the Java language, libraries such as ANTLR or JavaParser can be used; for the Python language, Python's built-in ast module can be used.

Once the appropriate parser library is selected, it may be applied to the code file to parse it. The parsing process will generate an Abstract Syntax Tree (AST) containing all elements in the code file and their relationships. By traversing the AST, each node can be accessed and information about the suspicious string extracted.

When traversing the AST, attention should be paid to instructions related to the suspicious string. These may include variable declarations, assignment statements, function calls, and the like. It may be checked whether these instructions reference the suspicious string directly or use it in context. Other factors also need to be considered, such as whether the suspicious string is passed as an argument to a function and whether it is associated with other sensitive information.
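
A minimal sketch of such a traversal is given below, using the Python ast module named above and assuming, purely for illustration, that the code under analysis is Python source (the code files of an application package would normally require a parser for their own language). The function name `find_string_uses` and the shape of the returned tuples are hypothetical; `ast.unparse` requires Python 3.9 or later.

```python
import ast


def find_string_uses(source, suspicious):
    """Return (kind, name, line) tuples for statements using the literal.

    kind is "assign" when the literal is assigned to a plain variable and
    "call" when it is passed as a positional argument to a function.
    """
    uses = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign):
            value = node.value
            if isinstance(value, ast.Constant) and value.value == suspicious:
                for target in node.targets:
                    if isinstance(target, ast.Name):
                        uses.append(("assign", target.id, node.lineno))
        elif isinstance(node, ast.Call):
            for arg in node.args:
                if isinstance(arg, ast.Constant) and arg.value == suspicious:
                    # ast.unparse renders the callee expression as text.
                    uses.append(("call", ast.unparse(node.func), node.lineno))
    return uses
```

The variable and function names recovered in this way are the context from which the purpose of the suspicious string can subsequently be inferred.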

By analyzing these instructions, the purpose and context of the suspicious string in the application can be understood. For example, if the suspicious string appears in a password verification function, it may be inferred that it is a variable for storing or verifying the user's password. If the suspicious string is used as part of an API key, it may be inferred that it is a credential for accessing an external service.

It should be noted that parsing a code file may involve complex syntactic and nested structures, and thus may require some programming knowledge and experience to properly parse and understand the code. In addition, different programming languages may have different grammar rules and characteristics, so that in practical applications it is necessary to select an appropriate parser library according to the particular programming language.

Step S44: Judge whether the suspicious character string is a sensitive character string according to the purpose of the suspicious character string. If yes, go to step S45.

Specifically, the purpose of the suspicious string may be compared to known patterns or rules of sensitive information.

First, a set of patterns or rules of sensitive information needs to be defined. These patterns may include specific keywords, regular expressions, or other matching logic for identifying strings that may contain sensitive information. For example, a list may be created that contains sensitive words such as "password", "secret", "api_key", and the like. The usage of the suspicious string is then compared to these patterns.

If the purpose of the suspicious string matches any pattern of sensitive information, it may be marked as a sensitive string. This may be accomplished by setting a boolean variable or using other identifiers. Once it is determined that the suspicious string is sensitive, step S45 may be performed, i.e. appropriate measures are taken to process the string.

It should be noted that the definition of sensitive information may vary from application to application and from context to context. Thus, in actual practice, further optimization and tuning of these patterns and methods may be required to ensure optimal detection results and security.

Illustratively, as shown in fig. 3, the step S44 includes steps S441-S442, wherein,

Step S441: Perform format verification on the suspicious character string against the predefined sensitive information types.

In particular, a set of format rules for the sensitive information types needs to be defined. These rules may include format patterns for common sensitive information such as email addresses, phone numbers, credit card numbers, etc. For example, for email addresses, regular expressions may be used to match common email formats; for telephone numbers, regular expressions may be used to match the formats of international or local telephone numbers.

The suspicious strings are then compared to these formatting rules. If the suspicious string conforms to the format of any one of the sensitive information types, it may be marked as a sensitive string. This may be accomplished by setting a boolean variable or using other identifiers.
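
The format rules and the comparison described above may, for example, be sketched as follows. The three regular expressions are deliberately simplified assumptions and do not cover every valid email address, telephone number, or card number; the names `FORMAT_RULES` and `check_format` are hypothetical.

```python
import re

# Simplified, hypothetical format rules; production rules would be stricter.
FORMAT_RULES = {
    "email": re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$"),
    "phone": re.compile(r"^\+?\d{7,15}$"),
    "credit_card": re.compile(r"^\d{13,19}$"),
}


def check_format(text):
    """Return the names of all sensitive information types whose format matches."""
    return [name for name, pattern in FORMAT_RULES.items() if pattern.match(text)]
```

Returning a list rather than a single type reflects that some formats overlap; for example, a bare 14-digit string matches both the phone and the credit card rule in this sketch, so further context is needed to decide between them.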

Step S442: Determine whether the suspicious character string is a sensitive character string based on the verification result.

In particular, if the suspicious string passes the format check, i.e. meets the format requirements of the predefined sensitive information type, it may be marked as a sensitive string. Otherwise, it may be considered as a non-sensitive string.

For example, if a regular expression is used to match the format of the email address in step S441 and the suspicious string passes this check, the string may be marked as a sensitive string. Likewise, if other text matching techniques are used to examine the formats of telephone numbers or credit card numbers, etc., and suspicious strings meet these format requirements, they may also be marked as sensitive strings.

It should be noted that even if the suspicious string passes the format verification, further analysis is still required to determine whether it actually contains sensitive information. For example, while a string may conform to the format of an email address, it may simply be a common public mailbox address, not an actual user mailbox address. Therefore, after the format verification is performed, the content and context of the suspicious string needs to be further analyzed to determine whether it actually contains sensitive information.

Further, as shown in fig. 4, the step S442 includes steps a) -d), wherein,

Step a): Judge whether the format check passes. If so, go to step b); if not, go to step c).

Specifically, when the format verification is performed, the matching result is based on the structural characteristics of the character string and a predetermined pattern, wherein the predetermined pattern comprises but is not limited to a regular expression, a length requirement, a character combination rule and the like.

Step b): Judge the suspicious character string to be a sensitive character string, and record the content of the sensitive character string and the corresponding sensitive information type.

Specifically, if the suspicious string meets the format requirements of a predefined sensitive information type (such as credit card number, social security number, etc.), the string is explicitly determined to be a sensitive string. Once the string is determined to be a sensitive string, its content and the type of sensitive information it corresponds to should be recorded. This step may be accomplished by storing the sensitive string and its type in a secure data management system. The recorded information should include the value of the sensitive character string, the type of sensitive information detected (e.g., "credit card number", "social security number", etc.), the time of detection, the name and location of the associated code file, etc. This information will be used for subsequent security audits, reports or remedial action to ensure that all sensitive information is used, stored and transmitted in compliance with relevant privacy regulations and security standards.
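
One possible shape for such a record is sketched below. The class name `SensitiveFinding` and its field names are hypothetical; the fields themselves correspond to the items listed above (value, sensitive information type, detection time, and the name and location of the associated code file).

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class SensitiveFinding:
    """Immutable record of one detected sensitive string (hypothetical layout)."""

    value: str        # the sensitive character string itself
    info_type: str    # e.g. "credit card number"
    source_file: str  # name of the associated code file
    line: int         # location inside that file
    # Detection time, stored in UTC so records from different hosts compare.
    detected_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
```

Making the record immutable is a design choice of this sketch, intended to keep audit data from being altered after detection.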

In the above embodiment, the potentially sensitive information is analyzed specifically according to predefined rules and patterns, such as specific keywords or formats. The method reduces the interference of irrelevant data and improves the pertinence and the efficiency of analysis.

Step c): Judge whether the suspicious character string is encrypted.

Specifically, for a suspicious string that fails the format verification, it is necessary to further determine whether it is encrypted. This may be determined by checking whether the string conforms to a known encryption format (e.g., a particular character distribution, encryption identifier, etc.) or by analyzing the code context (e.g., whether encryption/decryption function calls exist). If yes, go to step d).

Step d): Decrypt the suspicious character string, and return to step a) to perform the format verification again.

Specifically, if it is determined that the suspicious string is encrypted, it needs to be decrypted to obtain the original data. This typically involves using the same decryption algorithm used in the application, which may be a symmetric encryption algorithm, an asymmetric encryption algorithm, or any other suitable encryption method. The decryption process should be performed in a secure environment to ensure that sensitive data is not compromised. After decryption, a format check should be performed again to determine if it is a sensitive string.

If the format check passes after decryption, return to step b).

Specifically, if the decrypted string passes the format check, it should be determined as a sensitive string. At the same time, its content, the encryption algorithm used and the corresponding sensitive information type should be recorded.

In the above embodiment, the present application provides the steps of decrypting and re-verifying the encrypted sensitive information, ensuring that even the encrypted sensitive information can be accurately extracted and identified.
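
The control flow of steps a) to d) may be sketched as follows. This sketch makes two loud assumptions: Base64 decoding stands in for the decryption step (a real implementation would apply the decryption algorithm and key actually used by the application), and the format check is reduced to a single simplified email pattern. All function names and the `max_rounds` limit are hypothetical.

```python
import base64
import binascii
import re

EMAIL = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")
BASE64 = re.compile(r"^[A-Za-z0-9+/]+={0,2}$")


def passes_format_check(text):
    """Step a): simplified format check (email only in this sketch)."""
    return bool(EMAIL.match(text))


def looks_encoded(text):
    """Step c): stand-in for 'is it encrypted?', here 'is it Base64-shaped?'."""
    return len(text) % 4 == 0 and bool(BASE64.match(text))


def try_decode(text):
    """Step d): stand-in for decryption; returns None if decoding fails."""
    try:
        return base64.b64decode(text, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return None


def classify(text, max_rounds=3):
    """Return (is_sensitive, plaintext, decoding_rounds) following steps a)-d)."""
    current = text
    for rounds in range(max_rounds + 1):
        if passes_format_check(current):   # a) passes, so go to b)
            return True, current, rounds
        if not looks_encoded(current):     # c) not encrypted, so stop
            return False, text, rounds
        decoded = try_decode(current)      # d) decode, then back to a)
        if decoded is None:
            return False, text, rounds
        current = decoded
    return False, text, max_rounds
```

The `max_rounds` bound is an addition of this sketch that prevents the loop between steps a) and d) from running indefinitely on nested encodings.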

Step S45: Extract the sensitive information in the sensitive character string.

Specifically, after the character string is determined to be the sensitive character string, the next step is to extract the sensitive information therein. This may involve parsing specific sensitive data from the string, such as a credit card number, password, etc. The extraction process should ensure compliance with appropriate data protection and privacy standards.

In some embodiments, the step S45 includes: determining the assigned variable name of the sensitive character string; and extracting sensitive information in the sensitive character string based on the variable name.

Specifically, in the code review process, it is first necessary to determine to which variable the sensitive string is assigned. This typically involves analysis of the code to identify which variables accept the values of the sensitive string. The names of these variables may provide a direct clue as to the use of the sensitive data, helping to more accurately identify and categorize the sensitive information. With the determined variable name, sensitive information associated with the variable may be more accurately extracted from the code. For example, if a variable is named userPassword, it is apparent that it stores user password information, then any string assigned to that variable should be considered sensitive.

To further refine the extraction process of sensitive information, word segmentation may be performed on variable names. This means that the variable names are broken down into smaller semantic units, which helps to better understand the purpose of the variable. For example, the variable name userCredentials may be segmented into user and Credentials, which indicates more clearly that the variable is used to store credential information for the user.

The accuracy of extracting the sensitive information from the code can be further improved by using the variable names after word segmentation. By analyzing the semantic units after word segmentation, the purpose of the variable can be accurately judged, and relevant sensitive information can be extracted according to the purpose. This helps to reduce false positives and false negatives, ensuring the efficiency and accuracy of the sensitive information management system.
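
A minimal sketch of such word segmentation is given below, assuming that variable names follow camelCase, PascalCase, or underscore-separated conventions. The regular expression, the function names, and the small token set `SENSITIVE_TOKENS` are hypothetical and chosen only for illustration.

```python
import re

# Matches, in order: an acronym followed by a capitalized word ("HTTP" in
# "HTTPServer"), a capitalized or lowercase word, a trailing acronym, digits.
_TOKEN = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")

# Hypothetical set of semantic units treated as sensitive.
SENSITIVE_TOKENS = {"password", "credentials"}


def split_identifier(name):
    """Split a variable name into lowercase semantic units."""
    return [token.lower() for token in _TOKEN.findall(name)]


def is_sensitive_variable(name):
    """Return True if any semantic unit of the name is a sensitive token."""
    return any(token in SENSITIVE_TOKENS for token in split_identifier(name))
```

Matching on whole semantic units rather than substrings is what helps reduce false positives; for example, a hypothetical variable named `passwordlessMode` would segment into `passwordless` and `mode` and would not match the token `password`.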

Based on the foregoing embodiments, fig. 5 shows a complete flowchart of the method for extracting sensitive information from an application package in an embodiment of the disclosure.

To better illustrate the method for extracting sensitive information in an application package presented in this disclosure, a specific example is given below. Fig. 6 shows a schematic diagram of this example.

In this example, assume that a grassroots police officer receives a case in which a mobile application is suspected of being used for criminal activity. The criminals have built an application named "EASYCREDENTIALS" that can steal users' passwords, bank card numbers, API keys, and the like. To identify the criminals quickly, the officer uses the present technical solution to extract sensitive information from the EASYCREDENTIALS application.

Using the present solution, the officer first identifies the various types of files in the "EASYCREDENTIALS" application package, including the application code files and the manifest file. From the manifest file, the officer reads the package name and application label of the application, i.e., "EASYCREDENTIALS". According to the package name and the application label, the officer determines the range of character strings to be analyzed.

The determined character strings to be analyzed are then parsed to identify and extract the sensitive information therein. For example, the officer finds a string "password123", which may be a user's password.

The officer reads the class name of the character string to be analyzed and judges, based on the class name, whether it is a suspicious character string. For example, if the string appears in a class associated with network communications, it may be a suspicious string. If so, the specific instructions defining the string in the application code file are analyzed to determine its purpose. For example, the officer finds that the string is assigned to a variable named "userPassword".

According to the purpose of the suspicious character string, the officer judges whether it is a sensitive character string and extracts the sensitive information in it. For example, if the string is used for a login operation, it is likely to be sensitive information.

The officer performs a format check on the suspicious string against the predefined sensitive information types to determine whether it is a sensitive string. For example, the officer checks whether the character string conforms to a common password format.

If the format check does not pass, it is judged whether the character string is encrypted. For example, the officer finds that the string is encrypted. If so, the string is decrypted and the format check is performed again. For example, the officer decrypts the string and finds that its content is "password123".

If the format check passes after decryption, the character string is judged to be a sensitive character string, and its content, the encryption algorithm used, and the sensitive information type are recorded. For example, the officer records that the string is a user's password encrypted with the AES algorithm.

The officer determines the variable name to which the sensitive character string is assigned, and extracts the sensitive information in the sensitive character string based on that variable name. For example, the officer finds that the "userPassword" variable stores the user's password.

Word segmentation is performed on the variable name to improve the accuracy of sensitive word recognition. For example, the officer may segment "userPassword" into "user" and "Password". The sensitive information in the sensitive character string is then extracted based on the segmented variable name. For example, the officer confirms that "Password" indicates a user's password.

Through the above steps, the officer successfully extracts the sensitive information in the EASYCREDENTIALS application, including user passwords, bank card numbers, API keys, and the like, by using the present technical solution.

It should be particularly noted that the flows or methods represented in the flowcharts of the above embodiments of the present disclosure can be understood as representing modules, segments, or portions of code that include one or more sets of executable instructions configured to implement particular logical functions or steps of a process. The scope of the preferred embodiments of the present disclosure also includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially simultaneously or in reverse order, depending on the functions involved.

Fig. 7 shows the device 70 for extracting sensitive information in an application package of the present disclosure. The principle and technical implementation of the device can refer to the method embodiment (for example, fig. 1) for extracting sensitive information in an application package in the previous embodiments, so the description is not repeated in this embodiment.

Specifically, the device 70 for extracting the sensitive information in the application package includes: an identification module 71, a reading module 72, a range determination module 73, an extraction module 74, wherein,

The identifying module 71 is configured to identify a plurality of types of files in the target application package, including an application code file and a manifest file.

The reading module 72 is configured to read a package name and an application tag of an application from the manifest file.

The range determining module 73 is configured to determine a range of the character string to be analyzed in the application code file according to the application package name and the application tag.

The extraction module 74 is configured to parse the determined character string to be analyzed to identify and extract the sensitive information therein.
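
As a non-limiting sketch of how the reading module 72 and the range determination module 73 might be realized in software, the following assumes an Android-style manifest that has already been decoded to plain-text XML (manifests inside a packaged application are usually stored in a binary form and would need decoding first). It further assumes that each candidate string is available as a (class name, value) pair. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by Android manifests for attributes such as android:label.
ANDROID_NS = "http://schemas.android.com/apk/res/android"


def read_package_and_label(manifest_xml):
    """Reading module sketch: return (package name, application label)."""
    root = ET.fromstring(manifest_xml)
    application = root.find("application")
    label = None
    if application is not None:
        label = application.get("{%s}label" % ANDROID_NS)
    return root.get("package"), label


def strings_in_scope(strings, package_name):
    """Range determination sketch: keep strings defined in the app's own classes.

    `strings` is an iterable of (class_name, value) pairs; classes outside the
    application's package (e.g. bundled third-party libraries) are skipped.
    """
    prefix = package_name + "."
    return [(cls, value) for cls, value in strings if cls.startswith(prefix)]
```

Restricting the analysis to classes under the application's own package name is one way to obtain the narrowed range of character strings on which the extraction module 74 then operates.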

It should be noted that each functional module in the embodiment of fig. 7 may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the modules may be implemented in whole or in part in the form of a program instruction product. The program instruction product comprises one or a set of program instructions. When the program instructions are loaded and executed on a computer, the processes or functions in accordance with the present disclosure are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The program instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another.

Moreover, the apparatus disclosed in the embodiment of fig. 7 may be implemented with other module divisions. The above-described apparatus embodiments are merely illustrative. For example, the division of modules is merely a logical division of functionality, and other divisions are possible in actual implementation: multiple modules may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections between devices or modules through some interfaces, and may be electrical or in other forms.

In addition, the functional modules and sub-modules in the embodiment of fig. 7 may be integrated in one processing component, each module may exist alone physically, or two or more modules may be integrated in one component. The integrated components described above may be implemented in hardware or as software functional modules. If implemented in the form of software functional modules and sold or used as a stand-alone product, the integrated components may also be stored in a computer-readable storage medium. The storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.

Fig. 8 shows a schematic structural diagram of an electronic device in an embodiment of the disclosure.

The electronic device may execute the method of fig. 1 by running computer program instructions. The electronic device may be a server group/server, a desktop computer, a notebook computer, etc., or may be a cloud server/server group, a distributed computing node system, etc. that communicates remotely with a local terminal.

The electronic device 80 comprises a bus 81, a processor 82, and a memory 83. The processor 82 and the memory 83 may communicate with each other via the bus 81. The memory 83 may have program instructions stored therein. The processor 82 implements the method steps of the previous embodiments, such as the method of fig. 1, by executing the program instructions in the memory 83.

Bus 81 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The buses may be divided into address buses, data buses, control buses, etc. For ease of illustration, only one thick line is shown in the figures, but this does not mean that there is only one bus or only one type of bus.

In some embodiments, the processor 82 may be a central processing unit (Central Processing Unit, CPU), a microcontroller unit (MCU), a system on chip (System on Chip, SoC), or a field programmable gate array (FPGA), or the like. The memory 83 may include volatile memory for temporary use of data during running of the program, such as random access memory (Random Access Memory, RAM).

The memory 83 may also include non-volatile memory for data storage, such as read-only memory (Read-Only Memory, ROM), flash memory, a hard disk drive (Hard Disk Drive, HDD), or a solid-state disk (Solid-State Disk, SSD).

In some embodiments, the electronic device 80 may also include a communicator 84. The communicator 84 is used for communicating with the outside. In particular examples, the communicator 84 may comprise one or a set of wired and/or wireless communication circuit modules. For example, the communicator 84 may comprise one or more of a wired network card, a USB module, a serial interface module, etc. The wireless communication protocols followed by the wireless communication module include, for example, one or more of near field communication (Near Field Communication, NFC) technology, infrared (Infrared, IR) technology, global system for mobile communications (Global System for Mobile Communications, GSM), general packet radio service (General Packet Radio Service, GPRS), code division multiple access (Code Division Multiple Access, CDMA), wideband code division multiple access (Wideband Code Division Multiple Access, WCDMA), time-division synchronous code division multiple access (Time-Division Synchronous Code Division Multiple Access, TD-SCDMA), long term evolution (Long Term Evolution, LTE), Bluetooth (Bluetooth, BT), global navigation satellite system (Global Navigation Satellite System, GNSS), etc.

Embodiments of the present disclosure may also provide a computer readable storage medium storing program instructions that are executed to perform a method of extracting sensitive information in an application package, such as the embodiment of fig. 1.

That is, the steps of the method in the above-described embodiments may be implemented as software or computer code storable in a recording medium such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium, downloaded through a network, and stored in a local recording medium, so that the method described herein may be carried out by such software stored on a recording medium using a general-purpose computer, a special-purpose processor, or programmable or dedicated hardware (such as an ASIC or FPGA).

In summary, the method, device, electronic equipment and medium for extracting sensitive information in an application package provided by the present disclosure can quickly identify the classes and resources that may involve sensitive information by reading the application manifest file and using the structural information of the application, which avoids checking the whole application package item by item and greatly improves the initial screening speed. In addition, potentially sensitive information is analyzed in a targeted manner according to predefined rules and patterns, such as specific keywords or formats, which reduces interference from irrelevant data and improves the pertinence and efficiency of the analysis. For encrypted sensitive information, decryption and re-verification steps are provided, so that even encrypted sensitive information can be accurately extracted and identified.

The above embodiments are merely illustrative of the principles of the present disclosure and its efficacy, and are not intended to limit the disclosure. Modifications and variations may be made to the above-described embodiments by those of ordinary skill in the art without departing from the spirit and scope of the present disclosure. Accordingly, it is intended that all equivalent modifications and variations which a person having ordinary skill in the art would accomplish without departing from the spirit and technical spirit of the present disclosure be covered by the claims of the present disclosure.

Claims

1. A method for extracting sensitive information from an application package, comprising:

Identifying multiple types of files in the target application package, including application code files and manifest files;

Reading the package name and the application label of the application program from the manifest file;

determining the range of character strings to be analyzed in the application code file according to the application package name and the application label;

and analyzing the determined character string to be analyzed to identify and extract the sensitive information therein.

2. The method of claim 1, wherein the sensitive information comprises one or more of an API key, an IP address, a URL link, a mailbox address, a user credential, encrypted data, a system command, a hash value.

3. The method for extracting sensitive information from an application package according to claim 1, wherein the parsing the determined character string to be analyzed to identify and extract the sensitive information therein comprises:

reading the class name of the character string to be analyzed;

judging whether the character string to be analyzed is a suspicious character string or not based on the class name;

if yes, analyzing a specific instruction defining the suspicious character string in the application program code file to determine the purpose of the suspicious character string;

judging whether the suspicious character string is a sensitive character string according to the use of the suspicious character string;

and if yes, extracting the sensitive information in the sensitive character string.

4. A method for extracting sensitive information from an application package according to claim 3, wherein said determining whether the suspicious string is a sensitive string according to the use of the suspicious string comprises:

performing format verification on the suspicious character strings and predefined sensitive information types;

and determining whether the suspicious character string is a sensitive character string based on a verification result.

5. The method for extracting sensitive information from an application package as claimed in claim 4, wherein said determining whether the suspicious string is a sensitive string based on the verification result comprises:

if the format check is passed, judging that the suspicious character string is a sensitive character string;

and recording the content of the sensitive character string and the corresponding sensitive information type.

6. The method for extracting sensitive information from an application package as claimed in claim 4, wherein said determining whether the suspicious string is a sensitive string based on the verification result comprises:

If the format check is not passed, judging whether the suspicious character string is encrypted or not;

if yes, decrypting the suspicious character string, and performing format verification again;

If the decrypted format check passes, judging that the suspicious character string is a sensitive character string, and recording the content of the sensitive character string, the used encryption algorithm and the sensitive information type.

7. A method of extracting sensitive information in an application package according to claim 3, wherein said extracting sensitive information in said sensitive character string comprises:

Determining the assigned variable name of the sensitive character string;

and extracting sensitive information in the sensitive character string based on the variable name.

8. The method for extracting sensitive information from an application package of claim 7, wherein said extracting sensitive information from said sensitive string based on said variable name comprises:

Word segmentation processing is carried out on the variable names;

and extracting the sensitive information in the sensitive character string based on the variable names after word segmentation.

9. An apparatus for extracting sensitive information from an application package, comprising:

The identification module is used for identifying various types of files in the target application program package, including application program code files and manifest files;

The reading module is used for reading the package name and the application label of the application program from the manifest file;

The range determining module is used for determining the range of the character string to be analyzed in the application code file according to the application package name and the application label;

And the extraction module is used for analyzing the determined character string to be analyzed so as to identify and extract the sensitive information in the character string.

10. An electronic device, the electronic device comprising:

A processor and a memory;

Wherein the memory is used for storing a computer program;

The processor is configured to execute the computer program stored in the memory, so that the electronic device performs the method for extracting sensitive information in an application package according to any one of claims 1 to 8.

11. A computer readable storage medium having stored thereon a computer program, which when executed by an electronic device implements the method of extracting sensitive information in an application package according to any of claims 1 to 8.