1. Introduction
Malicious app developers constantly seek ways to evade security measures and detection techniques through stealthy attack vectors that lure mobile users into downloading malicious apps onto their devices and that prolong the malware’s lifetime once on the device. One way Android malware achieves stealth is by disguising its activities as legitimate. This not only enables the spread of malware on the official Google Play Store [1,2,3] but also allows the malware to evade detection on the victim’s device, resulting in devastating effects (such as the unauthorized transfer of funds from legitimate banking apps) whenever attacks are noticed only from the consequences of their successful execution [4,5,6,7,8,9]. Established attack vectors such as accessibility [10] and several others [11,12,13,14,15,16] allow malware to attain stealth by leveraging living-off-the-land tactics that offload critical attack steps to benign, legitimate app functionality. For instance, the benign functionality of sensitive app categories such as messaging and financial apps can be instrumental to stealthy attacks that hijack it to offload attack steps such as malware propagation through messaging or seemingly legitimate fund transfers.
Because they closely resemble benign functionality, attacks that hijack it render threat detection mechanisms ineffective. Furthermore, this level of stealth typically means that victims raise the alarm and initiate an investigation only when the consequences of the attack are evident (e.g., missing funds), which occurs long after the attack steps have been carried out (late detection). Incident responders and security operations center (SOC) analysts investigating such incidents must derive the covert nature of these stealthy attacks from their deliberately small footprint [17,18]. Regardless of the stealthiness of an attack, however, its execution must occur in memory [19,20]. Therefore, in stealthy attack scenarios, memory forensics becomes crucial to recover key artefacts in memory that may disclose stealthy attack steps and provide better context for investigators to reconstruct them. Specifically, for the class of stealthy attacks that hijack benign app functionality and are therefore detected late, previous research showed that the associated in-memory evidence is ephemeral and requires timely memory collection [21].
The standard enterprise threat solution is endpoint detection and response (EDR). EDR tools monitor and record events occurring on endpoints (devices, PCs, servers, etc.), providing security teams with the necessary visibility to investigate and mitigate threats through advanced threat detection, investigation, and response capabilities [22]. EDRs typically leverage known malicious tactics, techniques, and procedures (TTPs) or behavioral analytics to detect unusual attack-related behavior. However, this form of threat detection falls short when dealing with novel stealthy attacks whose tactics are unknown or cannot be distinguished from legitimate benign app interactions. Even if EDRs cannot detect stealthy attacks that leverage benign app functionality, they can provide a fallback through live forensics. EDRs can collect evidence from the device and applications using functionality the underlying OS exposes through Android APIs. However, EDRs must rely on third-party application logs, which may not contain the necessary evidence to disclose and reconstruct attack steps. While memory forensics could compensate for this limitation, it is challenging to apply, due to the restrictions on unrooted devices and the non-extensibility of stock Android kernels in mobile devices.
Just-in-time memory forensics (JIT-MF) [21] is an experimental technique that uses app repackaging as an alternative for dumping evidence from memory using stock Android devices. While avoiding the need for device rooting, JIT-MF still requires significant reverse-engineering effort for app repackaging, which is time-consuming and renders the technique invasive and infeasible when considering the large number of sensitive apps that could be hijacked. The customization needed for each hijack-targeted app poses another challenge for adoption. Furthermore, while previous work [21,23,24] demonstrated how JIT-MF could uniquely collect the activity of hijacked apps directly from volatile memory, its value in an investigation setting for attack step detection has not been shown.
In this paper, we present VEDRANDO (i.e., Volatile-memory-enhanced EDR for ANDrOid), an enhanced EDR for Android that allows the timely collection of challenging volatile memory artefacts and the detection of stealthy attacks that hijack benign applications. VEDRANDO has two main components: an events collector and an attack detector. The events collector component collects elusive evidence of stealthy attacks that is not found in other forensic sources, by combining a state-of-the-art Android EDR tool with experimental memory forensics (JIT-MF), thus improving the state of the art for Android EDR tools by allowing the timely collection of forensic sources from memory. This component addresses existing feasibility and implementation challenges by leveraging JIT-MF infrastructure-based drivers and app virtualization techniques to ease the burden of app-specific JIT-MF driver development, while avoiding device rooting and app repackaging. The attack detector component uses a detection algorithm that, given the additional evidence collected by the events collector component, can detect and expose the hidden, attack-related behavior of benign app hijack attacks using standard anomaly detection methods, resulting in complete and accurate attack step reconstruction.
Our evaluation extends previous work [25], showing that JIT-MF infrastructure-based drivers ensure the feasibility of our approach, as these drivers are reusable over 92.2% of the 550 most popular Android apps (ranked by all-time number of downloads on AppBrain [26] according to Google Play statistics) when leveraging SQLite libraries. We assessed the performance overheads of VEDRANDO by conducting a runtime evaluation of the solution using the UI Exerciser Monkey tool to simulate normal traffic on a set of apps from the most popular 100 apps in the Google Play Store (as listed on AppBrain), which were neither previously installed on the phone nor manufacturer-specific (33 apps in total). Our results showed that VEDRANDO worked with 84.8% of the apps, with negligible overheads (less than a 5% CPU usage increase), while being feasible and minimally invasive by avoiding device rooting, app repackaging, and additional app reverse-engineering overheads. Finally, we demonstrated the value of our solution through a series of investigation case study setups, involving stealth messaging hijack attacks targeting ten popular instant messaging (IM) apps. The results from the ten case studies showed that VEDRANDO can uniquely collect critical evidence from memory. Furthermore, VEDRANDO’s attack detector could fully reconstruct stealthy attack steps, including the malware entry point in all case studies, for a combination of anomaly detection methods and inputs. In summary, we make the following contributions:
- We introduce VEDRANDO, a novel Android EDR that addresses the challenge of the timely collection of volatile memory artefacts and the detection of stealth attacks that hijack benign applications. VEDRANDO leverages JIT-MF for memory forensics capabilities. It addresses existing feasibility challenges regarding JIT-MF drivers’ app-specific customization and installation through generic infrastructure-based JIT-MF drivers and app-level virtualization, resulting in a solution that avoids app reverse-engineering, device rooting and app repackaging;
- We conducted a runtime evaluation of VEDRANDO’s events collector across 33 apps, achieving an 84.8% success rate in running apps within VEDRANDO’s setup and introducing an average increment of up to 4.9 percentage points (pp) in CPU usage and 0.7 pp in consumed memory compared to the app’s typical performance;
- We demonstrated the value of VEDRANDO in the context of ten messaging hijack attack investigations of popular Android IM apps. Our results showed that VEDRANDO could disclose evidence of attack steps not collected by state-of-the-art EDRs for all case studies. Once collected, VEDRANDO could detect these events as anomalous using existing anomaly detection methods and reconstruct all attack steps.
4. Proposed Solution
We propose VEDRANDO, an enhanced EDR for Android that accomplishes the challenging timely collection of volatile memory artefacts, along with the detection of a class of stealthy attacks that hijack benign apps, and which meets the requirements listed below:
- R1
Timely collection of app artefacts from memory. The solution should include a special runtime to access app memory, thus being able to collect evidence of stealthy attacks that are not collected by state-of-the-art EDRs;
- R2
Extensible. The techniques and technology enablers must create a generalized solution that works across multiple apps and attack scenarios, which would render the solution feasible to deploy;
- R3
Minimally Invasive. The solution should be acceptable in an enterprise environment using stock Android devices and apps, thus not requiring device rooting or app repackaging and consequential reverse-engineering;
- R4
Detection of malware entry point and attack steps reconstruction. Given the timely evidence collected from the memory, the solution should be able to reconstruct all the attack steps of a stealthy benign app hijack attack and detect the malware entry point using standard anomaly detection and correlation techniques.
Figure 4 gives an overview of the VEDRANDO architecture, which consists of two main components: the events collector and the attack detector. The events collector addresses the feasibility and implementation challenges of the timely collection of elusive evidence from memory, which discloses the attack steps of stealthy benign app hijack attacks (R1–R3) by extending the standard Android EDR with JIT-MF and leveraging app-level virtualization. The attack detector detects and reconstructs the attack steps of stealthy benign app hijack attacks (R4) through a detection methodology that applies standard anomaly detection and correlation techniques.
4.1. Events Collector
The events collector component in Figure 4 illustrates a high-level view of our proposed memory forensics-enhanced EDR setup, comprising the following components: an EDR server, an EDR client (mobile app), and trusted app-level virtualization containers that each host a sensitive app that may be targeted by stealthy attacks seeking to hijack its functionality. While the makeup of each container is the same, different sensitive apps are hosted in different containers, to maintain the application sandbox protections that Android offers between apps out-of-the-box.
In the following sections, we describe how JIT-MF drivers were used to allow for the timely collection of artefacts from memory (R1) that can contribute to the stealth attack steps of a benign app hijack, while ensuring extensibility (R2) by moving away from app-specific JIT-MF drivers that render the technique infeasible at a large scale. We also describe how our solution leverages app-level virtualization, to remove the need for app repackaging, yet still functions on stock Android devices (R3). Finally, we illustrate and describe the complete setup of the enhanced Android EDR, along with implementation considerations.
4.1.1. Infrastructure-Centric JIT-MF Drivers
We recall previous work [25] that laid the groundwork for the feasibility of JIT-MF driver development by addressing the limitation that requires JIT-MF drivers to be specific to the targeted app and attack scenario at hand. This limitation meant that driver development required app reverse-engineering, which rendered it impractical.
The JIT-MF driver development process must be practical to ensure our solution is feasible (addressing R2). Infrastructure-centric JIT-MF drivers render the JIT-MF driver development process feasible by ensuring that a single JIT-MF driver can remain relevant across app versions and stay functional across different applications. The overarching idea of this type of JIT-MF driver calls for a modified driver development approach that leverages the common subset of the applications’ codebase that interacts with commonly used infrastructure, rather than an application-specific codebase, for evidence object and trigger point selection. As shown in Figure 5, the underlying infrastructure is generally more stable and widespread across different applications and versions, allowing infrastructure-based JIT-MF drivers to remain usable across different applications and versions (unlike application-specific drivers, which need to be developed from scratch for every application and version). A crucial step for infrastructure-centric JIT-MF driver development involves identifying the key application events that may be hijacked and the commonly used, readily available infrastructure that enables these events [25].
4.1.2. Android App-Level Virtualization
We extended the VirtualApp framework [40] to develop an enhanced container that collects artefacts in a timely manner from the memory of plugin apps that run inside it. The new VirtualApp container contains an additional library that serves as the JIT-MF driver runtime, implemented using Frida’s Gadget shared library (https://frida.re/docs/gadget/ accessed on 30 June 2023), and JIT-MF drivers are implemented as JavaScript code interpreted by that library. Sensitive stock Android apps that require monitoring and logging of artefacts from memory due to the potential for hijack are placed inside external storage and picked up by the VirtualApp container to be installed as plugin apps. The JIT-MF driver runtime is loaded when the plugin app starts, which enables the timely collection of artefacts from the plugin app memory (addressing R1) without requiring app repackaging, thus also addressing R3.
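As an illustration of how a Gadget-based runtime can be pointed at a driver script, the sketch below generates a Frida Gadget configuration that uses the documented `script` interaction type. Whether the prototype uses this interaction mode is not stated here, and the driver path in the usage note is a hypothetical example.

```python
import json


def gadget_config(driver_path):
    """Build a Frida Gadget config that loads one script at startup.

    Assumption: the Gadget runs in its documented "script" interaction
    mode; the source does not say which mode VEDRANDO uses.
    """
    config = {"interaction": {"type": "script", "path": driver_path}}
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    # Hypothetical driver location, for illustration only.
    print(gadget_config("/data/local/tmp/sqlite-jitmf-driver.js"))
```

The Gadget reads this configuration from a file placed next to its shared library, so no change to the plugin app itself is needed in this sketch.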
4.1.3. Working Prototype
We implemented the events collector component of VEDRANDO by extending ReLF [27], to the best of our knowledge the only open-source EDR tool available for mobile phones, with the ability to collect critical evidence of sensitive app events found in memory, as produced by JIT-MF drivers. ReLF extends GRR [51], an open-source, scalable system developed by Google for remote live forensics and incident response, and enables forensic investigations of Android devices by acquiring various forensic artefacts from devices (as many as any other such forensic tool; see Table 1). As with typical EDRs, the setup involves having ReLF clients on mobile phones, from which events are collected and sent to a ReLF server. The ReLF client may be built and deployed as a user or system app. The latter has access to more forensic sources (see sources marked with ∗ in Table 1) but requires root access. ReLF client apps built as user apps interact with the underlying system through Android APIs or the low-level ReLF native service using inter-process communication (IPC) [27]. JIT-MF drivers in different containers may be the same if the sensitive apps (1, 2, and 3 in Figure 4) use a common infrastructure (which the evaluation results demonstrate is very likely the case). In the specific case of our working prototype, the EDR client and server were the ReLF client app (built as a user app to comply with R3) and server, respectively.
Artefact Collection
While the app is in use, JIT-MF logs are populated continuously with evidence objects from memory, upon the invocation of the specified trigger points in the JIT-MF driver of the container. When an alarm is raised, the ReLF server can invoke artefact collection flows, instructing the ReLF client to collect any pending logs not yet collected through continuous monitoring, to be sent back to the server as part of evidence collection to aid the ongoing investigation. As shown in Figure 4, the ReLF client leverages the Android API to collect all Android forensic sources, including logs containing the in-memory evidence collected by the JIT-MF driver deployed within VirtualApp. For logs generated by JIT-MF drivers containing evidence from app memory, the client uses the Android API to search for files on the device with a *.jitmflog extension.
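The file search can be pictured with the following minimal sketch. It is a host-side Python stand-in, not the ReLF client code (which uses the Android API); the function name `find_jitmf_logs` is hypothetical.

```python
from pathlib import Path


def find_jitmf_logs(root):
    """Return every regular file under root ending in .jitmflog, sorted by path."""
    return sorted(p for p in Path(root).rglob("*.jitmflog") if p.is_file())


if __name__ == "__main__":
    # Searches the current directory; on a device the search root would be
    # wherever the JIT-MF logs are written.
    for log in find_jitmf_logs("."):
        print(log)
```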
Other Implementation Considerations
For the prototype described above, the JIT-MF driver and logs generated are placed in the temporary directory (/data/local/tmp) and external storage (/storage/emulated/0), respectively, to enable ease of automation. Furthermore, we assume the container can be trusted [52].
EDR tools (including the events collector component of VEDRANDO) deployed in enterprise settings must comply with standard security measures, for which implementation solutions already exist. Therefore, in a realistic environment, scoped storage would need to be used to appropriately store JIT-MF logs and drivers, thus ensuring secure access to these critical contents. If these were to fall prey to a malicious actor, then critical evidence might be lost or remain uncollected. Similarly, to verify JIT-MF drivers, digital signatures can be used and approved by the device owner, app developer, device manufacturer, or a combination thereof. This would significantly reduce the threat of deploying malicious JIT-MF drivers and safeguard against any privacy concerns of the device owner. All sensitive apps being monitored (plugin apps) should be automatically updated with the latest changes published by app vendors. In so doing, the events collector avoids the need for re-installation/sign-up. Furthermore, plugin apps should retain the security features provided by Android out-of-the-box, mainly that only authorized access to the app should be allowed.
4.2. Attack Detector
Anomaly detection of logs is commonly used to detect anomalous behavior. In the case of stealthy benign app attacks, existing log sources, such as third-party application logs, do not provide enough context to enable anomaly detectors to detect a specific app event as anomalous. The additional JIT-MF logs containing evidence from memory collected by VEDRANDO’s events collector provide the necessary additional context with which standard anomaly detection methods can observe a difference between normal app usage and benign app hijack, thus enabling the detection of anomalous events as hijacked benign app events, even in the case of stealth attacks. While attack steps from hijacked apps can be detected through the logs produced by the events collector component, stealth attacks may consist of several steps, whose footprints are dispersed across many separate logs on different victims’ devices.
The attack detector component of VEDRANDO comprises the detection algorithm outlined in Algorithm 1. The algorithm uses an existing, standard anomaly detection method to detect anomalies in the JIT-MF logs, then correlates anomalies with events collected from other logs found on the device, to reconstruct all the attack steps, including the malware entry point (addressing R4). The algorithm takes as input the logs produced by the events collector component, comprising JIT-MF logs J with evidence objects from app memory and other logs O found on the device (see Table 1), and a user-defined configuration c. The configuration variable c holds settings related to generating the anomaly detection model. Namely, it comprises: (i) the anomaly detection method (a); (ii) the associated features selected ([f1…fn]); (iii) the anomaly threshold value t, which is used to identify data points as anomalous; and (iv) a list of app-specific regex keywords ([p1…pn]) used during the correlation of events. The algorithm outputs a list of events e attributed to the complete attack steps.
4.2.1. Anomaly Detection
All the entries from different log sources are parsed (line 1 in Algorithm 1), so each entry has three main fields: (i) timestamp, (ii) log source, and (iii) activity. JIT-MF logs are filtered to remove duplicates. Furthermore, for both the third-party app and JIT-MF logs, we further filter the logs so that only sources of evidence related to the evidence object are considered. For instance, in a messaging hijack attack, where the evidence object is a message sent from the user’s phone, the evidence collected from the app is its database, comprising many tables and possibly also containing data unrelated to messaging, e.g., app themes, which may cloud the investigation. The function GETANOMALIES is then called, with the following parameters: (i) parsed JIT-MF logs (J); (ii) logs from other sources (O); and (iii) configuration settings (c).
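The parsing and filtering step can be sketched as follows. This is a minimal illustration under stated assumptions: the tab-separated line layout, the `LogEntry` type, and the keyword-based relevance filter are all hypothetical and do not reflect the actual JIT-MF log format.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LogEntry:
    timestamp: datetime
    source: str
    activity: str


def parse_line(line, source):
    """Parse '<ISO timestamp><TAB><activity>' (a hypothetical line layout)."""
    stamp, activity = line.rstrip("\n").split("\t", 1)
    return LogEntry(datetime.fromisoformat(stamp), source, activity)


def prepare_jitmf_logs(lines, relevant_keywords):
    """Parse JIT-MF lines, drop duplicates, keep evidence-related entries only."""
    seen, entries = set(), []
    for line in lines:
        entry = parse_line(line, "jitmf")
        if entry in seen:
            continue  # duplicate entry, already recorded
        seen.add(entry)
        if any(k in entry.activity for k in relevant_keywords):
            entries.append(entry)
    return entries
```

In a messaging case study, `relevant_keywords` might name the message-related tables, so that rows about themes or settings are dropped before anomaly detection.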
The function GETANOMALIES first generates a machine learning anomaly detection model based on the machine learning method and features defined in the user-inputted configuration (line 4). The model is then applied to the parsed and filtered set of JIT-MF logs J using the threshold defined in the configuration settings (line 5). The anomalous JIT-MF log entries revealed by the anomaly detection model are considered anomalous JIT-MF events. These are then correlated (line 6) with other log events (JIT-MF logs and logs from other sources) using the app-specific correlation regex keywords given as a parameter ([p1…pn]). The function returns the result of the Correlate function.
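To make the role of the threshold concrete, the following sketch flags anomalous time windows with a simple z-score over per-minute event counts. The z-score is a stand-in chosen for brevity, not one of the anomaly detection methods evaluated in this work, and the function name is hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def get_anomalous_minutes(timestamps, threshold):
    """Return the minutes whose event count has a z-score above threshold.

    Only minutes with at least one event are considered; empty minutes
    are not part of the baseline in this simplified sketch.
    """
    buckets = Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)
    counts = list(buckets.values())
    if not counts:
        return []
    mu, sigma = mean(counts), pstdev(counts)
    if sigma == 0:
        return []  # every minute looks the same, nothing stands out
    return sorted(m for m, n in buckets.items() if (n - mu) / sigma > threshold)
```

The returned values are timestamps rather than log entries, which corresponds to the time-based case in which the matching JIT-MF entries must be looked up before correlation.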
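Algorithm 1 Anomaly detection and correlation algorithm
Input: JITMF Logs J, Other Logs O, Config c = {Anomaly Detection Method a, Anomaly Detection Features [f1…fn], Anomaly Detection Threshold t, Correlation Keywords Regex [p1…pn]}
Output: Correlated_Events e = {Ø}
1: parse the entries of J and O
2: e ← GETANOMALIES(J, O, c)
3: function GETANOMALIES(JITMF Logs J, Other Logs O, Config c)
4: generate the anomaly detection model from a and [f1…fn]
5: Anomalies ← apply the model to J using threshold t
6: e ← Correlate(Anomalies, J, O, [p1…pn])
7: return e
8: end function
9:
10: function Correlate(Anomalies, JITMF Logs J, Other Logs O, Correlation Keywords Regex [p1…pn])
11: Events e = {Ø}
12: for each anomaly in Anomalies do
13: if anomaly is a timestamp then
14: retrieve the JIT-MF log entries in J at that time
15: else
16: use the anomalous JIT-MF log entry itself
17: end if
18: retrieve the evidence object of the anomalous entry
19: …
20: …
21: for each event in O do
22: if event contains a keyword matched in the evidence object then
23: add event to e
24: …
25: end if
26: end for
27: end for
28: …
29: …
30–33: add to e the events in O that occurred in the period covered by the correlated events (time-based correlation)
34: return e
35: end function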
4.2.2. Correlation
JIT-MF log events include a timestamp and metadata of the evidence object definition as described in the JIT-MF driver. Regardless of the driver implementation or the app, the contents of the evidence object can be parsed to derive relevant keywords used during the attack step. The app-specific correlation keyword regex retrieves the relevant keywords from anomalous JIT-MF log entries. The evidence object’s makeup is app-specific; therefore, the regex pattern used to retrieve this metadata or identifier from a log entry must also be app-specific. That said, there are cases where the keyword regex is the same across apps, due to formatting standards, e.g., the object ID may be in UUID format, which is a standard format.
The Correlate function accepts as input the anomalies detected, JIT-MF logs, logs from other sources, and the app-specific correlation regex keywords. If the anomaly is a timestamp (as is the case with time-based anomaly detection), JIT-MF logs at that time are retrieved (lines 12–17 in Algorithm 1). The evidence object of the anomalous JIT-MF log entry is retrieved (line 18) and used to perform correlation, as follows. The algorithm correlates anomalous JIT-MF log events with events in other log sources, based on two mechanisms: (i) feature-based correlation and (ii) time-based correlation. It is unlikely that normal events have identical keywords in their evidence objects. However, in the case of malware, especially during propagation, evidence objects containing matching keywords are expected. Therefore, events in other log sources that contain identical keywords to those found in anomalous JIT-MF log events (lines 21–26) are considered further attack steps, and are added to the list of correlated events. This is referred to as feature-based correlation. Any other attack steps performed in the attack are assumed to have happened in the period within which the correlated list of attack steps occurred. Therefore, time-based correlation is used to search for other events that occurred in other logs when the JIT-MF log anomalies were detected (lines 30–33). This ensures that any attack steps carried out outside the app functionality are also disclosed. Any log entries found through correlation are entered into a set of correlated events and returned to the analyst or investigator as the complete list of the attack steps carried out.
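The two correlation mechanisms can be sketched as below. This is a simplified illustration rather than the implementation of Algorithm 1: log events are assumed to be `(timestamp, source, activity)` tuples, the anomalies are assumed to be JIT-MF entries already, and the optional `padding` around the attack period is a hypothetical parameter.

```python
import re
from datetime import timedelta


def correlate(anomalies, other_logs, patterns, padding=timedelta(0)):
    """Correlate anomalous JIT-MF events with events from other logs."""
    if not anomalies:
        return []
    # Keywords are pulled from the anomalous entries with the regexes.
    keywords = {m.group(0)
                for _, _, activity in anomalies
                for pattern in patterns
                for m in re.finditer(pattern, activity)}
    events = list(anomalies)
    # Feature-based correlation: other events sharing a keyword.
    for event in other_logs:
        if event not in events and any(k in event[2] for k in keywords):
            events.append(event)
    # Time-based correlation: other events inside the correlated period.
    start = min(e[0] for e in events) - padding
    end = max(e[0] for e in events) + padding
    for event in other_logs:
        if event not in events and start <= event[0] <= end:
            events.append(event)
    return sorted(events)
```

A UUID pattern such as `[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}` is one example of a keyword regex that could be shared across apps.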
5. Experimental Evaluation and Results
We evaluated the feasibility of VEDRANDO’s events collector component based on the JIT-MF driver development effort required and the compatibility with app-level virtualization. Experiments supporting this evaluation involved (i) carrying out a coverage analysis of the most popular 550 apps (this is the maximum number of apps returned by AppBrain statistics) on the Google Play Store, to find the most commonly used underlying infrastructure libraries (using statistics obtained from AppBrain [26]) that can be leveraged for JIT-MF driver development (Section 5.1), and (ii) executing popular apps on the Google Play Store within the events collector setup, to evaluate their compatibility with VirtualApp containers equipped with infrastructure-based JIT-MF drivers and to determine the introduced runtime overhead (Section 5.2). The apps considered in these experiments spanned more than 39 categories, including the messaging and finance categories, which are the primary targets for damaging stealth attacks [53] and for which VEDRANDO could be a solution.
Finally, we evaluated the effectiveness of VEDRANDO’s attack detector component by simulating ten stealthy instant messaging (IM) hijack case studies, targeting ten of the most popular Android IM apps. Our results showed that, given timely captured evidence from memory collected by VEDRANDO’s events collector, anomaly detection and correlation techniques could be used to reconstruct all attack steps of the stealthy benign messaging hijack attacks targeting popular Android messaging apps.
5.1. JIT-MF Driver Setup
This analysis aimed to identify the infrastructure handling core app functionality (see Section 4.1.1) that is commonly used among popular apps in the general, messaging, and finance categories, enabling a more feasible, generic JIT-MF driver that can be used across apps and app versions.
We extended existing results [25] and considered the data provided by AppBrain, a service that provides statistics on the Android application ecosystem, including library adoption by different apps in different categories. AppBrain categorizes libraries used in Android applications using tags, depending on the functionality provided by the library. Out of the 41 possible categories, we identified the database (storage) and network libraries as critical infrastructures that typically handle data in sensitive events. Database functionality allows data to be stored and retrieved on the devices where the app is installed, and network functionality handles the data to be transferred over the network.
Coverage Analysis
Figure 6 shows the usage distribution of the database and network libraries by the 550 most popular apps. The percentage of apps covered suggests that database libraries are more widely adopted than network libraries across all app categories. The graphs show that, overall, usage of database libraries is common across 93.8% and 97.1% of messaging and finance apps, respectively, and 95.1% of all apps. Network libraries, by contrast, are much less prominent: the most popular network libraries are adopted by only 50.8% and 64.3% of messaging and finance apps, respectively. The figure is even lower, 44.2%, for popular apps in general.
Furthermore, the database library usage graph shows a much steeper incline, meaning that a large number of apps use the same small number of database libraries. Specifically, the most widely adopted database infrastructure was Android architecture components, with 93.1% and 97.1% adoption among the most popular messaging and finance apps, respectively, and 92.2% adoption among all of the 550 most popular apps. At its most native level (see Section 4.1.1), Android architecture components refers to storage management through an SQLite Database (https://developer.android.com/training/data-storage/sqlite accessed on 30 June 2023). While keeping in mind that these values were obtained via static analysis of apps, this still bodes well for the extensibility of JIT-MF drivers and the feasibility of JIT-MF driver development, since one SQLite-based JIT-MF driver could potentially be successful on an extensive range of apps.
5.2. Runtime Evaluation
We evaluated the feasibility of VEDRANDO’s events collector VirtualApp and JIT-MF driver setup, in terms of its compatibility with Android apps and the resulting performance overheads. To do this, we selected a set of apps from the 100 most popular apps in the Google Play Store in February 2022 (as listed on AppBrain), which had not previously been installed on the phone and were not manufacturer-specific, resulting in a total of 33 apps.
A stock (unrooted) Google Pixel 3a physical phone with eight processors and 4 GB RAM was used, running Android version 9 (as required by VirtualApp) on the arm64-v8a CPU architecture. The apps selected were downloaded from APKPure (https://apkpure.com/ accessed on 30 October 2022) using apkeep (https://github.com/EFForg/apkeep accessed on 30 October 2022) to ensure that the APKs downloaded complied with the architecture and Android version. We used the UI Exerciser Monkey tool (https://developer.android.com/studio/test/other-testing-tools/monkey accessed on 30 October 2022) to exercise each app’s functionality by injecting 20 random UI events with a throttle of 30 s. This delay allowed the virtual environment to spawn the app, but it could not be reset for the execution of the remaining events due to limitations of UI Exerciser Monkey. A seed value was used to ensure that the same app events could be repeated in the case of multiple runs.
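A Monkey invocation matching the parameters above could be assembled as in the following sketch. The `-p`, `--throttle` (in milliseconds), and `-s` options are documented Monkey flags; the package name and seed value are hypothetical, and the exact command used in the experiments is not given in the text.

```python
def monkey_command(package, events=20, throttle_ms=30_000, seed=42):
    """Build an adb argv list that injects pseudo-random UI events."""
    return [
        "adb", "shell", "monkey",
        "-p", package,                    # restrict events to one package
        "--throttle", str(throttle_ms),   # fixed delay between events (ms)
        "-s", str(seed),                  # same seed gives the same event sequence
        str(events),                      # number of events to inject
    ]


if __name__ == "__main__":
    # Hypothetical package name; prints the command instead of running it.
    print(" ".join(monkey_command("com.example.app")))
```

Passing the list to `subprocess.run` would execute it against a connected device; the sketch only prints the command so that it runs without one.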
We installed and executed the 33 apps directly on the device, to check typical resource usage. We then ran the apps in a standard VirtualApp container, to evaluate their compatibility with the virtual environment. In this step, we collected the overhead introduced by the Android virtualization regarding CPU and memory usage. At the end of this phase, we identified five apps that triggered an exception due to incompatibility with the virtual environment, and we discarded these apps from the rest of the experiments. The remaining 28 apps were executed in three settings: (i) directly on the device, (ii) inside a simple VirtualApp container, and (iii) inside a VirtualApp container equipped with an SQLite-based JIT-MF driver (as implemented in VEDRANDO’s events collector). The results were averaged over ten runs.
Results
Table 2 shows the minimum, average, and maximum overhead values expressed in percentage points (pp). In the first column, we compared the execution of apps in a plain VirtualApp with the traditional execution method (no virtualization). We computed the overhead for the VirtualApp container as implemented in VEDRANDO’s events collector component (i.e., second column) compared to the execution in a plain VirtualApp environment. Since the execution of an app under virtualization is composed of two processes (the container and plugin), the overall amount of CPU and memory is given by the sum of the overhead of these two processes.
The results from
Table 2 show that when introducing virtualization through VirtualApp, there was an average increase of 2.83 pp in CPU usage and 0.56 pp in memory usage. The results for the container as implemented in the events collector component, using an SQLite-based JIT-MF driver, show that the additional average overhead introduced was negligible, i.e., an increase of 2.06 pp for the CPU usage and 0.18 pp for the memory. We concluded that this increase was caused by the overhead required by JIT-MF drivers to capture memory dumps for trigger points performed through instrumenting methods. The overall additional CPU usage incurred by VEDRANDO’s
events collector
component when using SQLite-based JIT-MF drivers was on average 4.89 pp (2.83 pp + 2.06 pp), which makes it feasible in terms of runtime performance in a real-world scenario. This, however, may vary depending on the type of JIT-MF driver used in the VirtualApp container.
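The way these figures combine can be illustrated with a minimal sketch. The function names are hypothetical, and the only measured values used are the averages quoted in the text; the sketch assumes that the usage of a virtualized app is the sum of its container and plugin processes and that overhead is a difference of two percentages.

```python
def virtualized_usage(container_pct, plugin_pct):
    """Usage of a virtualized app: container and plugin processes summed."""
    return container_pct + plugin_pct


def overhead_pp(baseline_pct, measured_pct):
    """Overhead in percentage points (pp): a difference of two percentages."""
    return measured_pct - baseline_pct


# Average overheads quoted in the text, in pp.
VIRTUALAPP_VS_NATIVE = {"cpu": 2.83, "memory": 0.56}
JITMF_VS_VIRTUALAPP = {"cpu": 2.06, "memory": 0.18}

# Overall overhead of the events collector relative to native execution.
total_overhead = {
    resource: round(VIRTUALAPP_VS_NATIVE[resource] + JITMF_VS_VIRTUALAPP[resource], 2)
    for resource in VIRTUALAPP_VS_NATIVE
}
```

Under these assumptions, `total_overhead["cpu"]` reproduces the 4.89 pp figure reported for CPU usage.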
5.3. Attack Investigation Case Studies
We evaluated the effectiveness of VEDRANDO’s attack detector by measuring its ability to reveal attack steps related to stealthy benign app hijack attacks in a realistic scenario. Rather than assessing the performance of existing anomaly detection models on a large dataset, we aimed to demonstrate how the anomaly detection methods typically available to SOC analysts through their SIEM setup can be used to detect anomalous events related to benign app hijack attacks when provided with JIT-MF logs produced by VEDRANDO’s events collector component. To this end, similarly to other related works [
54,
55,
56], we presented a qualitative case study for a benign instant messaging (IM) hijack attack targeting Android’s ten most popular IM apps, as shown in
Table 3, following the threat model described in
Section 3.2.
In
Section 5.3.1 and
Section 5.3.2, we describe the case study setup and the settings used for the proposed anomaly detection and correlation algorithm (Algorithm 1), respectively.
Section 5.3.3 shows the summarized results concerning the reconstructed attack steps obtained for the ten case studies.
5.3.1. Case Study Setup
Figure 7 shows the experiment setup and flow, comprising an implementation of the working prototype for VEDRANDO, shown previously in
Figure 4, and the investigation flow indicated by arrows. A stock (unrooted) Google Pixel 3a physical phone was used, on which an implementation of VEDRANDO’s events collector component was deployed, using an SQLite-based JIT-MF driver (
https://gitlab.com/bellj/vedrando/-/tree/main/sqlite-jitmf-driver.js), given that the popularity of this infrastructure among messaging apps has already been established (see
Figure 6a). Normal traffic on each app consisted of loading and sending instant messages. This was simulated using
AndroidViewClient (
https://github.com/dtmilano/AndroidViewClient accessed on 30 October 2022), assuming that the user messages random contacts from their contact list, waiting a random number of seconds (between one and ten) before sending each message. We acknowledge that this simulation of normal traffic may be a threat to validity. However, we claim that the simulated traffic generated within the case study time window was a sufficiently realistic representation to provide a basis for our study.
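The traffic model described above can be sketched as follows. This is only an illustration of the random contact selection and the random delay; the UI automation itself is not shown, and `send_message` is a hypothetical callback standing in for whatever drives the app on the device.

```python
import random


def simulate_normal_traffic(contacts, send_message, n_messages, sleep, rng=None):
    """Send n_messages to randomly chosen contacts with random delays.

    send_message(contact, text) and sleep(seconds) are injected so that the
    sketch stays independent of any particular UI automation library.
    """
    if rng is None:
        rng = random.Random()
    sent = []
    for i in range(n_messages):
        contact = rng.choice(contacts)   # random contact from the list
        delay = rng.randint(1, 10)       # one to ten seconds, inclusive
        sleep(delay)                     # wait before sending the message
        send_message(contact, f"test message {i}")
        sent.append((contact, delay))
    return sent
```

In a real run, `sleep` would be `time.sleep`; passing a no-op instead makes it possible to check the generated schedule without waiting.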
Benign IM App Hijack Simulation
Table 4 shows the ground truth timeline of events carried out to simulate the attack scenario for this case study. A malicious message is received containing a link to a malicious app (
Table 4 ❷). Once the user clicks on the link, the app (a fake app called
demo.apk) is silently installed (
Table 4 ❸) and propagates to the user’s contacts via the default IM app installed (in this case, the apps in
Table 3). To attain stealth, the simulated attack deletes the sent messages from the victim’s phone (
Table 4 ❺) and removes the malicious app’s icon from the home screen, so that the app goes unnoticed by the victim for longer. ❻ in
Table 4 is a
Trigger Event; that is, it alerts the user that a suspicious event has possibly occurred, which initiates an investigation process. It is typical for realistic malware aiming to be stealthy to wait until it is the right time to execute [
57]. In this case, the malware waits until the hijacked app is not in use, so as not to alert the user to abnormal behavior. Due to these stealth measures and additional ones that the malware uses to hide its attack steps, the trigger event occurs long after the attack (in this case, almost an hour later), by which time the malware has hidden its tracks, leading to delayed detection.
Investigation Setup
We assumed the role of an SOC analyst in an enterprise and started an investigation process by invoking commands from the GRR ReLF server (step 2 in
Figure 7) to collect evidence artefacts from the victim’s phone, including JIT-MF logs produced by the JIT-MF driver (step 3). SOC analysts are typically equipped with SIEM services that provide access to out-of-the-box anomaly detection tools. The GRR ReLF server has bindings to Google BigQuery (
https://cloud.google.com/bigquery accessed on 15 April 2023), a service that enables scalable analysis over petabytes of data and provides machine learning capabilities including anomaly detection, which we used as a SIEM equivalent. During the investigation procedure followed in this evaluation, the artefacts collected by the ReLF client were sent back to the GRR ReLF server (step 4) and saved in Google BigQuery datasets (step 5). The detection and correlation algorithm of VEDRANDO’s attack detector used Google BigQuery’s machine learning API to detect anomalies in the collected JIT-MF logs. These anomalies were then correlated with events from other logs, to reconstruct the attack steps executed by the stealthy benign messaging app hijack attack (step 6).
5.3.2. Detection and Correlation Configuration
The investigation procedure outlined above was carried out after the hijack attack scenario was executed on each targeted messaging app shown in
Table 3. Once the logs for each case study had been retrieved, we implemented and executed the detection and correlation algorithm (Algorithm 1) using the configuration described below.
Anomaly Detection Models
Google BigQuery ML [
58] provides anomaly detection capabilities through four machine learning model types: ARIMA_PLUS, K-means, PCA, and Autoencoder. All these models are unsupervised and can therefore detect anomalies without needing labeled data. ARIMA_PLUS detects anomalies in time series data, while the others detect anomalies in independent and identically distributed random variables. For our evaluation, we used these models with selected applicable parameters and features as configuration input to our detection algorithm (Algorithm 1) to measure the algorithm’s effectiveness in detecting and reconstructing attack steps. All collected logs (including JIT-MF logs) were processed in BigQuery, and the preprocessed log content was used for building the different models. Hyperparameter tuning is commonly used to improve model performance, by searching for optimal hyperparameters. During our evaluation, we used the default and recommended Vertex AI Vizier algorithm to tune the hyperparameters
(https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-hyperparameter-tuning).
Dataset
The evaluation of machine learning algorithms typically involves using large, established datasets. However, this evaluation aimed to demonstrate the value that JIT-MF logs bring to the incident response process, by showing that evidence in these logs enables existing machine-learning anomaly detection models to detect anomalies in benign app activity related to an app hijack, which can help reconstruct attack steps. Therefore, the dataset used to train the anomaly detection models in our evaluation was similar to what an SOC would have available in such an incident. This comprised logs typically collected by EDRs (shown in
Table 1) and JIT-MF logs that were populated during the case study (which involved both the attack and normal traffic) and collected as part of the investigation process by VEDRANDO’s events collector component.
The VirtualApp container used by the events collector was built and deployed to the phone in debug mode, and therefore its app data could be retrieved. VirtualApp app data houses the data produced by plugin apps, and therefore relevant third-party app data could also be accessed and collected as forensic sources. When working with a VirtualApp container app that is not running in debug mode, forensic analysts can opt to use app features such as “backup” or collaborate with the device owner to collect the evidence that is present in the app. Once the sources were collected, relevant data related to the app’s main functionality (in this case, messaging) were extracted, converted into logs, and transferred to our Google BigQuery dataset.
Evidence collected from the app (both JIT-MF logs and app-specific logs) comprised its database, which consists of many tables, some of which contain data unrelated to messaging (e.g., app themes) that do not contribute to the main app functionality. Therefore, logs were filtered to include only evidence related to messaging activity. The timestamp, forensic source, and activity fields of the log entries for each source were identified, parsed, and used to build anomaly detection models.
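The normalization step described above can be sketched as follows. The raw record layout (`epoch_seconds`, `activity`) and the keyword filter are assumptions made purely for illustration; the actual extraction logic depends on each app’s database schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LogEntry:
    timestamp: datetime   # when the activity occurred
    source: str           # forensic source the entry came from
    activity: str         # free-form description of the activity


# Hypothetical terms used to keep only messaging-related evidence.
MESSAGING_KEYWORDS = ("message", "chat")


def normalize(raw_records, source):
    """Convert raw records from one source into filtered LogEntry objects."""
    entries = []
    for record in raw_records:
        activity = record["activity"]
        if not any(k in activity.lower() for k in MESSAGING_KEYWORDS):
            continue  # drop evidence unrelated to messaging (e.g., themes)
        ts = datetime.fromtimestamp(record["epoch_seconds"], tz=timezone.utc)
        entries.append(LogEntry(ts, source, activity))
    return entries
```

Reducing every source to the same three fields is what allows the later feature derivation to treat heterogeneous logs uniformly.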
Features
Table 5 shows the features used per anomaly detection method to generate the anomaly detection models during the execution of Algorithm 1. Features were selected based on the anomaly detection method and the knowledge that JIT-MF logs may contain evidence of offloaded attack steps that are not visible in other forensic sources. Log entries from multiple sources were parsed, so that each had a timestamp, forensic source, and activity. However, the format of the content inside the activity field differed across forensic sources and, in the case of app-specific logs and JIT-MF logs, even across apps. Rather than parsing each log type individually for each app and forensic source, we used derived features, in the form of log entry counts per feature grouped by time window.
Feature 1 represents the discrepancy in the log entry amount between that produced by all forensic sources (excluding JIT-MF logs) and the amount found in JIT-MF logs.
Feature 2 represents the total amount of log entries collected from all sources of the events collector component.
Feature 3 represents the log entry amount collected from JIT-MF logs.
Features 4–9 represent the log entry amounts collected from each distinct forensic source (excluding JIT-MF). While
Table 1 shows that a typical collection involves retrieving multiple forensic sources, only five were populated during the case study (e.g., no connectivity data were reported).
Features 10–14 represent the log entry amounts collected from JIT-MF logs with distinct SQL statements. Since the JIT-MF logs in these case studies were generated using an SQLite-based JIT-MF driver, log entries included SQL statements that process the message object (as shown in Listing 2). The SELECT, INSERT, REPLACE, UPDATE, and DELETE SQL statements can be considered to reflect app functionality related to the processing of the message object. We therefore treated the number of log entries containing each specific SQL statement as a feature.
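The derived features listed above can be sketched as per-window counts. This is an illustrative sketch rather than the implementation used in the evaluation: the input tuple layout, the source label `"jitmf"`, and the sign convention of the discrepancy (JIT-MF count minus the count from all other sources) are assumptions.

```python
from collections import Counter, defaultdict

SQL_VERBS = ("SELECT", "INSERT", "REPLACE", "UPDATE", "DELETE")


def window_features(entries, window_s=60, jitmf_source="jitmf"):
    """Count log entries per time window.

    entries: iterable of (epoch_seconds, source, activity) tuples.
    Returns {window_start: {feature_name: count}}.
    """
    per_window = defaultdict(Counter)
    for ts, source, activity in entries:
        window = int(ts // window_s) * window_s
        counts = per_window[window]
        counts["total"] += 1                      # Feature 2
        if source == jitmf_source:
            counts["jitmf"] += 1                  # Feature 3
            parts = activity.split()
            verb = parts[0].upper() if parts else ""
            if verb in SQL_VERBS:
                counts[f"sql_{verb}"] += 1        # Features 10-14
        else:
            counts[f"src_{source}"] += 1          # Features 4-9
    rows = {}
    for window, counts in per_window.items():
        other = counts["total"] - counts["jitmf"]
        # Feature 1 (assumed sign: positive when JIT-MF logged more).
        counts["discrepancy"] = counts["jitmf"] - other
        rows[window] = dict(counts)
    return rows
```

Because only counts are used, the activity field never has to be parsed beyond its leading SQL keyword, which is the point of using derived features.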
The models in
Table 6 were generated based on the features described. Two models were created for each combination of anomaly detection method and feature or feature set (feature sets in the case of K-means, PCA, and Autoencoder). In the case of Google BigQuery’s ARIMA_PLUS, log entries were automatically grouped in sixty-second time windows. For ARIMA_PLUS, we used two different feature normalization properties, Standard Scaler and Min Max Scaler (
https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-preprocessing-functions), whereas for K-means, PCA, and Autoencoder, we aggregated and counted events every thirty (30 s) and sixty (60 s) seconds. Each model in
Table 6 was created for every targeted app in the case study.
Anomalies were detected depending on the model used and the threshold set. ARIMA_PLUS is a univariate time-series model that uses a single feature to detect anomalous data points across historical data. In contrast, the K-means, PCA, and Autoencoder models use multiple features for clustering (K-means) and dimensionality reduction (PCA, Autoencoder), which results in the identification of anomalies based on outliers and reconstruction loss. Each model supports a custom threshold for anomaly detection in Google BigQuery ML (
https://cloud.google.com/blog/products/data-analytics/bigquery-ml-unsupervised-anomaly-detection). For ARIMA_PLUS models, anomalies are identified based on the confidence interval for each timestamp. If the probability that the data point at that timestamp occurs outside of the prediction interval exceeds a given probability threshold, the data point is identified as an anomaly. Furthermore, since Google BigQuery returns the feature value, our detection algorithm implementation also checked that, for the given anomaly found,
Feature 1 (the discrepancy between logs) was greater than 0. For the other models, anomalies were identified based on the value of each input data point’s normalized distance to its nearest cluster. The data point was identified as an anomaly if that distance exceeded a threshold determined by the given contamination value. The contamination value defined the proportion of anomalies in the training dataset. This value ranged from 0.1 to 0.5, where 0.1 and 0.5 mean that 10% and 50% of the training data used to create the input model, respectively, were anomalous. Whereas for the ARIMA_PLUS models, a lower threshold value made the data points more likely to be considered anomalous, for the other models, a larger contamination value (threshold) made the data points more likely to be considered anomalous.
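The opposite effect of the two threshold types can be illustrated with a simplified sketch. It does not reproduce how Google BigQuery ML derives its cut-off internally; it assumes that a contamination value flags that fraction of points with the largest anomaly scores, and that an ARIMA_PLUS anomaly is kept only if Feature 1 is positive. All function names are hypothetical.

```python
import math


def flag_by_contamination(scores, contamination):
    """Flag the `contamination` fraction of points with the largest scores."""
    if not 0 < contamination <= 0.5:
        raise ValueError("contamination must be in (0, 0.5]")
    # Small epsilon guards against floating-point error in the product.
    n_flagged = math.floor(len(scores) * contamination + 1e-9)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:n_flagged])
    return [i in flagged for i in range(len(scores))]


def arima_is_anomaly(anomaly_probability, prob_threshold, discrepancy):
    """Keep an ARIMA_PLUS anomaly only if Feature 1 (discrepancy) is > 0."""
    return anomaly_probability > prob_threshold and discrepancy > 0
```

With this sketch, raising the contamination value can only add flagged points, whereas lowering the probability threshold is what adds flagged points for the time-series rule.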
High-Level Event Reconstruction and Correlation
Before executing the attack detector component of VEDRANDO, we conducted a preliminary manual analysis of the logs generated by the SQLite JIT-MF drivers, to determine how low-level JIT-MF log entries could be combined to form more indicative high-level events. This analysis revealed that, in the case of SQLite-based JIT-MF drivers and the apps used in the case studies, a regex pattern for JIT-MF logs generated by each app could identify a log entry that reflected the actions of several JIT-MF low-level events generated after an action had occurred. For instance, when a message is sent (high-level event), multiple JIT-MF log entries (low-level events) are generated (related to updates made to several tables in the database). A single entry, however, is identified as explicitly updating and inserting content into the app’s specific messages table in the database. This manual process was also required to select the correlation regex keyword specific to each app. In these case studies, keyword regex aimed to extract the message content and identifier (ID). Therefore, we defined regex string patterns for these two keywords for each app, so that any message content or message ID found in the log entries could be correlated with related events.
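The keyword-based correlation described above can be sketched as follows. The two regex patterns and the log entry format are hypothetical and stand in for the app-specific patterns derived through manual analysis; the sketch performs a single correlation pass and does not reproduce Algorithm 1.

```python
import re

# Hypothetical per-app patterns for the two correlation keywords.
MESSAGE_ID_RE = re.compile(r"\bmsg_id=(\d+)")
MESSAGE_TEXT_RE = re.compile(r"\btext='([^']*)'")


def extract_keywords(activity):
    """Return the (kind, value) keywords found in one log entry."""
    keywords = set()
    match = MESSAGE_ID_RE.search(activity)
    if match:
        keywords.add(("id", match.group(1)))
    match = MESSAGE_TEXT_RE.search(activity)
    if match:
        keywords.add(("text", match.group(1)))
    return keywords


def correlate(anomalous_entries, all_entries):
    """Return entries sharing a message ID or content with any anomaly."""
    wanted = set()
    for entry in anomalous_entries:
        wanted |= extract_keywords(entry)
    return [e for e in all_entries if extract_keywords(e) & wanted]
```

For example, an anomalous deletion of message 8 would pull in the earlier insertion of the same message, because both entries carry the same message ID.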
5.3.3. Attack Investigation Results
For each targeted app, the ground truth attack steps of the simulated benign app hijack were recorded as shown in
Table 7 (a subset of the events shown in
Table 4). We demonstrated the value of timely evidence collected from the app memory for each attack by first showing that, upon manual inspection, evidence related to the attack steps was collected almost exclusively by VEDRANDO’s events collector. Furthermore, based on the detection methodology described in Algorithm 1, we showed in the realistic context of an ongoing investigation that the evidence in JIT-MF logs was critical for detecting and responding to anomalies.
Artefacts Recovered by JIT-MF Logs
Table 7 summarizes which critical attack steps executed on all targeted messaging apps were found in logs typically collected by an EDR and in those collected by VEDRANDO’s events collector component that included JIT-MF logs.
The results showed that VEDRANDO’s events collector, using app-level virtualization enhanced with JIT-MF drivers, collected JIT-MF logs comprising evidence from memory from all the apps in the case studies, without requiring app repackaging. In nine out of the ten case studies carried out (all except the Skype case study), critical attack steps (❹ and ❺) were only collected when considering JIT-MF forensic log sources. This evidence was located given knowledge of the ground truth. However, analysts investigating an attack scenario do not have this knowledge and require a detection methodology that points to these specific events as anomalous behavior. Specifically, events ❷ and ❸ were only considered anomalous after having been correlated with events ❹ and ❺, which were collected solely by JIT-MF (except in one case study).
Reconstruction of Attack Steps
Now that we have established that only JIT-MF uncovered these anomalies, we focus on the thresholds and model parameter selection that performed best with the attack detector to uncover these anomalies.
Table 8 and
Table 9 show the effectiveness of the detection and correlation algorithm in VEDRANDO’s attack detector component for reconstructing stealthy attack steps executed during the stealth IM hijack case studies, with the input parameters defined in
Section 5.3.2. For each model and threshold input combination, we calculated the average recall, precision, and F1-scores, to measure the overall accuracy of the reconstructed set of events returned by the attack detector when compared to the ground truth set of events executed by the attack. The F1-score combines precision and recall values. Therefore, the higher the F1-Score, the more accurate the list of attack steps returned by VEDRANDO’s attack detector.
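The three metrics can be computed over sets of events as in the sketch below. The event labels in the usage note are hypothetical and only illustrate the arithmetic; they are not results from the case studies.

```python
def precision_recall_f1(reconstructed, ground_truth):
    """Compare a reconstructed set of events against the ground truth set."""
    reconstructed = set(reconstructed)
    ground_truth = set(ground_truth)
    true_positives = len(reconstructed & ground_truth)
    precision = true_positives / len(reconstructed) if reconstructed else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For instance, a reconstruction that returns all four ground truth steps plus two unrelated events has a recall of 1, a precision of 4/6, and therefore an F1-score of 0.8.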
The tables above show the averaged results over all the attack case studies carried out during experimentation. The results demonstrate that, overall, threshold parameter values with greater allowance for anomalies (<0.9 for ARIMA_PLUS and >0.3 for K-means, PCA, and Autoencoder models) returned a more accurate reconstruction of the attack steps. Specifically, three models (PCA models M4 and M7, and M1, the ARIMA_PLUS model using a standard scaler) resulted in an F1-score of 80% and had a 100% recall value; that is, full attack step reconstruction.
Further analysis of the results obtained by these three models revealed that the average recall value across apps increased at a faster rate than the precision value decreased. This was because, for individual case studies (which varied depending on the model used), a more lenient threshold value was required to obtain the same recall value that the other apps obtained with less lenient threshold values.
Figure 8 shows this for the specific case of the model input parameter resulting in the best overall F1-score value (M7). In this case, 90% of the apps used in the case studies reached an average 100% recall value on the reconstructed set of attack steps when the threshold was set to 0.3. However, the WhatsApp Business attack steps were only detected as anomalies when the threshold was set to 0.5.
Given that the detection algorithm resulted in high F1-scores when using both PCA models, we concluded that, overall, PCA worked well with the features selected to detect anomalies in JIT-MF logs. Crucially, by using PCA (an existing anomaly detection algorithm available to SOC analysts) with a 0.5 threshold, and using the set of features described in
Table 5 and correlation settings defined in
Section 5.3.2, the detection algorithm used by VEDRANDO could fully reconstruct the attack steps of benign app hijack attacks with a relatively high precision across all case studies based on evidence collected from JIT-MF logs.
We also evaluated the sensitivity of our detection and correlation algorithm to the threshold value given as a parameter using a Wilcoxon signed rank test (
https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test). The results showed that the changes in F1-score between threshold values for K-means, PCA, and Autoencoder were statistically significant. Therefore, when using such models, the algorithm is considered sensitive to the threshold set. The set of ARIMA_PLUS values was smaller; therefore, Wilcoxon values could not be calculated for this model. However, upon inspection, the F1-scores remained the same for the two most lenient threshold values.
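A minimal pure-Python sketch of an exact two-sided Wilcoxon signed-rank test is given below, to make the paired comparison concrete. It is not the procedure used in the evaluation, where a library routine such as `scipy.stats.wilcoxon` would normally be preferred; it assumes paired samples, drops zero differences, and enumerates all sign assignments, so it is only practical for small samples.

```python
from itertools import product


def wilcoxon_exact(x, y):
    """Return (W, two-sided exact p-value) for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    rank_sum = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w = min(w_plus, rank_sum - w_plus)
    # Under the null hypothesis every sign assignment is equally likely.
    hits = 0
    for signs in product((0, 1), repeat=n):
        wp = sum(r for r, s in zip(ranks, signs) if s)
        if min(wp, rank_sum - wp) <= w:
            hits += 1
    return w, hits / 2 ** n
```

The sketch also shows why very small samples are problematic for this test: with n non-zero pairs, the smallest attainable two-sided p-value is 2/2^n, which is 0.0625 for five pairs and therefore cannot fall below a 0.05 significance level. This is one plausible reading of why the test was not applicable to the smaller set of ARIMA_PLUS values.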
8. Conclusions
In this paper, we proposed VEDRANDO, an enhanced EDR for Android that comprises the timely collection of volatile memory artefacts and the detection of a class of stealth attacks that hijack benign Android applications. VEDRANDO highlights the critical role of evidence from memory in detecting and reconstructing the attack steps of stealth app hijack attacks that are not collected by the current state-of-the-art tools. By leveraging experimental techniques for timely memory collection and app-level virtualization, VEDRANDO can collect evidence from memory without requiring app repackaging or device rooting, thus ensuring the feasibility of our solution. VEDRANDO also uses existing anomaly detection methods and correlation techniques, typically available to SOC teams and investigators, to detect evidence and reconstruct the attack steps of stealthy benign hijack attacks. Our results show that deploying VEDRANDO is feasible, as it incurs minimal performance overheads, and JIT-MF driver development efforts can be eased through infrastructure-based JIT-MF drivers. Furthermore, our evaluation showed that, given a set of anomaly detection methods and parameters, VEDRANDO can effectively and precisely reconstruct attack steps up to the malware entry point for the class of stealth attacks that hijack benign messaging app functionality.