4.1 Implementation of Basic Techniques
We stated a prerequisite in the prior section with respect to shadow stacks: “...under the assumption that the attacker cannot access or modify a portion of the memory.” This assumption does not have a straightforward justification in the context of embedded systems. As previously noted, low-end embedded systems simply do not have complex memory management units to support well-known features such as virtual memory, which is now common in higher-end processors, let alone special built-in mechanisms to support hiding shadow stacks from an attacker. Therefore, a successful CFI mechanism has to first wrangle the available hardware capabilities to support shadow stacks.
Zhou et al.’s Silhouette [
97] is an attempt to support shadow stacks on ARMv7-M [
49], the architecture underlying the ARM Cortex-M series of processors commonly found in embedded systems. It also supports forward-edge CFI checks. Silhouette is designed for bare-metal codebases that do not utilize an RTOS. It is, thus, an example of how a sophisticated CFI mechanism would look in the context of a resource-constrained embedded system with a bare-metal codebase.
The ARMv7-M architecture supports two privilege levels in hardware, privileged and unprivileged. The optional
memory protection unit (MPU) allows a system designer to decide access rights to an address. A limitation of the ARMv7-M architecture is that the MPU can be controlled by any privileged code. For example, most RTOSs, such as FreeRTOS [
10], by default, execute both the tasks and the operating system as privileged code to mitigate the overhead of switching privilege levels. This makes using the MPU to protect a shadow stack a moot point, simply because an attacker that has infiltrated the system could re-program the MPU since they would most likely already execute under the privileged execution context.
Silhouette works around this limitation to ensure that the MPU access rights are adhered to. At compile time, it replaces all store instructions, other than those meant to store directly to the shadow stack and those in the
hardware abstraction layer (HAL) code, with unprivileged store variants. These unprivileged stores adhere to the memory access policies defined in the MPU for the target address, regardless of the processor’s current execution privilege level. The shadow stack is implemented in a similar manner as the parallel shadow stack explained in Section
3.1. To ensure that the store instructions with higher privilege levels are not abused by an attacker, Silhouette implements forward-edge CFI checks. Silhouette utilizes a labeling mechanism (Section
3.2) to guarantee forward-edge CFI [
14].
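The parallel shadow stack layout that Silhouette builds on can be illustrated with a small C sketch. This is not Silhouette's actual code: the memory layout, function names, and slot sizes are illustrative, and the privileged/unprivileged store distinction is only noted in comments, since it is an instruction-encoding property that C cannot express directly.

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal sketch of a parallel shadow stack (Section 3.1 style). The shadow
 * copy of each return address lives at a fixed offset from the regular stack
 * slot, so no separate shadow stack pointer is needed. All names and sizes
 * here are illustrative, not taken from Silhouette's implementation. */

#define STACK_SLOTS    64
#define SHADOW_OFFSET  STACK_SLOTS     /* shadow region sits right after */

static uintptr_t mem[STACK_SLOTS * 2]; /* [0..63] stack, [64..127] shadow */
static size_t sp = 0;                  /* index of next free stack slot   */

/* Function prologue: save the return address twice. In Silhouette the
 * shadow save would use a privileged store, while store hardening turns
 * every other store in the program into an unprivileged variant. */
void prologue(uintptr_t ret_addr) {
    mem[sp] = ret_addr;                 /* normal save (attacker-writable) */
    mem[sp + SHADOW_OFFSET] = ret_addr; /* shadow save (MPU-protected)     */
    sp++;
}

/* Function epilogue: hand back the shadow copy and flag any mismatch. */
int epilogue(uintptr_t *ret_addr) {
    sp--;
    *ret_addr = mem[sp + SHADOW_OFFSET];
    return mem[sp] == mem[sp + SHADOW_OFFSET]; /* 1 = intact, 0 = tampered */
}
```

A buffer overflow that rewrites the normal slot is caught at the epilogue, because the MPU-protected shadow copy still holds the original address.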
On the performance front, Silhouette is benchmarked using well-known embedded system benchmark suites, namely CoreMark-Pro [
23] and BEEBS [
69]. We will see these same benchmarks used by other approaches in later sections, providing a common point of comparison. The maximum performance overhead reported for the two benchmark suites is 4.9% and 24.8%, respectively, with a code memory overhead of 8.9% and 2.3%, respectively. The geometric mean of the performance overheads for all the benchmarks in each test suite is 1.3% and 3.4%, respectively. The approach used by Silhouette, which the authors term
store hardening, essentially uses a memory protection technique to hide the shadow stack from the attacker.
Another mechanism that can be used to prevent access to the shadow stack is called
software fault isolation (SFI) [
64,
89]. SFI is a technique where the address space is partitioned into
fault domains. Any code within a fault domain has unrestricted access to code or data within the same fault domain, but the partitioning scheme prevents the code from accessing any memory outside the fault domain. This is achieved by instrumenting load/store instructions during compile time to trigger the fault handler if the memory access takes place outside the fault domain. A variant of Silhouette is proposed that utilizes this technique by instrumenting store instructions to restrict them from writing to the shadow stack unless the store instruction is part of the shadow stack manipulation code. The authors note a higher performance overhead, with the geometric mean results being 2.2% and 10.2%, respectively, for the two benchmarks, which leads the authors to conclude that the store hardening approach is superior in performance. However, it would be interesting to note how the performance would vary if the shadow stack was protected using an approach similar to Aweke and Austin’s [
9] lightweight SFI for IoT systems that shows an overhead of just 1% on the MiBench [
44] benchmarks. Their approach utilizes a small amount (150 lines) of trusted code that sets up the MPU to create the fault domains, trapping accesses outside the domain as memory access faults. Unfortunately, they do not present results using the CoreMark-Pro or BEEBS suites, making direct comparisons difficult.
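The core of compile-time SFI instrumentation can be shown in a few lines of C. The sketch below follows the classic address-masking style of fault domains (Aweke and Austin instead configure the MPU to trap out-of-domain accesses); the domain base, mask, and function name are illustrative constants, not values from any of the cited systems.

```c
#include <stdint.h>

/* Illustrative SFI address sandboxing. A fault domain is an aligned
 * 2^16-byte region; masking forces every instrumented store address into
 * the domain, so a corrupted pointer cannot reach memory (such as a
 * shadow stack) outside it. Constants are examples only. */

#define DOMAIN_BASE 0x20010000u   /* example data fault domain base    */
#define DOMAIN_MASK 0x0000FFFFu   /* low bits kept from the pointer    */

/* The compiler would emit this masking before every instrumented store. */
static inline uint32_t sfi_sandbox(uint32_t addr) {
    /* Clear the upper bits and pin them to the fault domain's base. */
    return (addr & DOMAIN_MASK) | DOMAIN_BASE;
}
```

In-domain addresses pass through unchanged, while an out-of-domain address is forcibly redirected back into the fault domain, so the store can never land on protected memory.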
While Silhouette and its variant demonstrate how the well-known techniques of shadow stacks and labels can be applied to a real low-end processor architecture, Kage [
32] extends Silhouette to provide an implementation of CFI for an RTOS environment on microcontrollers based on ARMv7-M. Kage modifies FreeRTOS and introduces the concept of a
trusted kernel and
untrusted tasks. Untrusted code is passed through the store hardening compiler pass introduced in Silhouette, transforming its stores into unprivileged variants. This prevents write access to the trusted portions of memory, which can only be accessed through privileged store instructions. The trusted code, such as the kernel and the code managing its associated data structures, retains privileged store instructions so that it may access any portion of privileged or unprivileged memory. Portions of the trusted kernel, such as common RTOS infrastructure expected by application tasks (locks, queues, etc.), are made available via a secure API designed to vet arguments from untrusted code so that they cannot overwrite control information within the trusted kernel. The authors show that the Kage kernel incurs an average performance overhead of 5.2% over the baseline FreeRTOS kernel when running a multitasking workload of one to three benchmarking tasks from the CoreMark test suite.
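The argument vetting at Kage's secure-API boundary can be sketched as a bounds check before the trusted kernel touches an untrusted buffer. The region bounds, function names, and return conventions below are illustrative assumptions, not Kage's actual API.

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of secure-API argument vetting in the style of Kage. The trusted
 * kernel rejects any buffer that is not wholly inside untrusted memory, so
 * an untrusted task cannot trick it into overwriting kernel control data.
 * Region bounds are example values. */

#define UNTRUSTED_BASE 0x20000000u
#define UNTRUSTED_END  0x20008000u    /* exclusive upper bound */

static int arg_is_untrusted(uint32_t buf, size_t len) {
    if (len == 0 || buf < UNTRUSTED_BASE || buf >= UNTRUSTED_END)
        return 0;
    if (len > UNTRUSTED_END - buf)    /* overflow-safe length check */
        return 0;
    return 1;
}

/* Example secure-API entry point: only copy into vetted memory. */
int secure_queue_receive(uint32_t dst, size_t len) {
    if (!arg_is_untrusted(dst, len))
        return -1;                    /* reject: would touch trusted memory */
    /* ...privileged copy into (dst, len) would happen here... */
    return 0;
}
```

Note the length check is written as a subtraction against the region end rather than `buf + len`, so a huge `len` cannot wrap around and slip past the bound.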
Silhouette and Kage provide a good overview of how well-known techniques of shadow stacks and labels can be applied to a real low-end processor architecture. However, there are avenues to improve the operation of such systems. We shall now look at some of them.
4.2 Beyond the Basics
While the techniques discussed in Section
3 consider forward-edge and backward-edge separately, some effort has been applied in recent years to develop more holistic mechanisms that apply to backward- and forward-edges at the same time.
An example of such a mechanism is the
Control-Flow Locking (CFL) technique [
12]. This is also an example of a
lazy CFI that trades off attack detection speed with performance overhead. While CFL is not explicitly targeted at resource-constrained embedded systems, the mechanism can be implemented with similar memory and performance overhead as any general label-based CFI for detecting forward-edge control-flow attacks. CFL uses locks, instead of shadow stacks, to determine if an attacker has diverted control flow to an arbitrary location. An overview of the CFL operation is given in Figure
3. The idea behind the CFL approach is simple. Similar to how labels are generated based on the valid control-flow graph,
key values are assigned to legitimate call/jump target locations. CFL targets indirect calls/jumps as well as return instructions (an x86 architecture-based processor was assumed). Once the unique key values, which essentially represent valid edges in the control-flow graph, are generated, the authors propose to then instrument the target binary with instructions to lock and unlock control-flow paths using these key values. Every legitimate control-flow redirection start point, which may be an indirect
call, jmp, or
ret instruction, is
preceded by a lock operation; i.e., the key value is stored into a buffer. The assumption here is that the buffer is stored in a memory location such that it can be modified only by the lock and unlock subroutines, and not by attacker-controlled code. Once program control is redirected to a valid destination (such as a function entry point), it is
immediately succeeded by an unlock operation where the key value is validated; i.e., it is checked against a list of key values that could end up at this target location. If the values match, the key is zeroed out (
unlocked) and execution continues as before. When the next control-flow redirection operation must take place, the key buffer is first checked to see if it contains a non-zero value. If it does, an attack is detected since no legitimate transfer would allow the key buffer to have a non-zero value due to the paired lock-unlock operations. Depending on the quality of the available CFG, this pairing of lock-unlock operations could be coarse or fine.
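The lock-unlock protocol above can be condensed into a small C sketch. The key buffer here is an ordinary global standing in for the protected memory location that only the lock/unlock subroutines may write; the key values and the single-expected-key check model a coarse pairing, and all names are illustrative.

```c
#include <stdint.h>

/* Sketch of Control-Flow Locking (CFL). A lazy check: an earlier hijack
 * is only detected when the *next* legitimate lock site runs. */

static uintptr_t key_buffer = 0;   /* protected; writable only by CFL code */
static int attack_detected = 0;

/* Instrumented just before every indirect call/jmp/ret. */
void cfl_lock(uintptr_t key) {
    if (key_buffer != 0)           /* previous transfer never unlocked:   */
        attack_detected = 1;       /* lazily flag the earlier hijack      */
    key_buffer = key;
}

/* Instrumented at every valid transfer target; 'expected' encodes which
 * key may legitimately arrive here (coarse-grained: one shared key). */
void cfl_unlock(uintptr_t expected) {
    if (key_buffer != expected)    /* arrived via an edge not in the CFG  */
        attack_detected = 1;
    key_buffer = 0;                /* zero out: the lock is consumed      */
}
```

A transfer that skips its unlock (e.g., a hijack to an arbitrary address) leaves the key buffer non-zero, which the next lock site observes.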
The overall mechanism is interesting due to its simplicity and the introduction of laziness. Not only does it prevent an illegitimate jump to a
valid control-flow transfer site, but it also automatically detects an illegitimate jump to an
invalid control-flow site in recent history
without requiring additional runtime memory such as using a shadow stack. Evaluations show that CFL can outperform fine-grained CFI mechanisms, with a maximum overhead of 21% vs. 31% overhead under Abadi et al.’s [
4] mechanism on the SPEC CPU2000 [
2] benchmarks. However, as discussed earlier, the mechanism is lazy. This laziness can introduce blind spots that an attacker can exploit. For example, the attacker can redirect control and remain undetected until caught at the next locking site. While laziness allows the mechanism to operate with time and memory overheads similar to a labeling scheme, it could have serious security repercussions, especially in the context of real-time embedded systems, many of which are used in industrial environments, controlling actuators in critical processes. If an attacker is able to send out control commands to these actuators before being detected, the attacker can still inflict catastrophic damage. However, laziness is not inherently flawed. There is therefore an avenue to leverage real-time requirements to enforce timing bounds on laziness.
While CFL is an example of a CFI technique that re-purposes control-flow labels to solve both forward and backward control-flow attack detection at the same time, it still uses a form of memory protection. All the techniques discussed up to this point are conservative: they work around hardware limitations to enforce memory protection, but they neither take full advantage of the processor architecture nor propose radical software/hardware changes to improve performance.
4.3 Register-based Shadow Stacks
We will now discuss two approaches that would require significant software modifications to allow them to work. We will first briefly look at Zipper Stack [
59], which is the more radical of the two since it proposes CPU architecture modifications to forego shadow stacks. The other is
\(\mu\)RAI [
6], which is built for COTS embedded systems. It takes a more moderate approach, requiring only that one of the CPU's registers be reserved, and can be implemented by recompiling the codebase with a modified compiler. Both implement backward-edge CFI.
Zipper Stack aims to solve the problem of securing shadow stacks by replacing them with a set of processor architecture modifications. Shadow stacks, as discussed in Section
3, are inherently simple but require additional support to secure them from attacker manipulation. For example, Silhouette in Section
4.1 requires additional code instrumentation to secure the shadow stack. Zipper Stack aims to solve this problem by replacing the shadow stack with a single value stored in a special-purpose register called the
top register. A separate register, the
key register, holds a secret key. At the start of a new process, the key register and top register are initialized with random values. Each time a function call takes place, the top register is pushed onto the main stack alongside the actual return address. A
message authentication code (MAC) algorithm, a cryptographic operation that is commonly used to authenticate messages from a known source, generates a new MAC from the top register value and the return address using the key in the key register. This newly created MAC is then stored in the top register. During a return sequence, the steps are reversed to authenticate the return address. First, the previous MAC value is popped from the stack and the MAC is recalculated using the return address and the popped MAC value. If the calculated MAC matches that currently in the top register, the return address is verified to be authentic. The processor replaces the top register with the popped value and continues execution at the return address. The purpose of the MAC-based design is to reduce the attack surface. By utilizing the top register and chaining the MAC values with each successive function call, an attacker can modify a return address and evade detection only by first modifying the value present in the top register (which is inaccessible to application code and is automatically updated by the hardware) before modifying the other MACs. Therefore, the rest of the MACs can be kept in non-secure memory that may be accessible to the attacker, reducing the overhead introduced by accessing the “zipper stack” of MAC values.
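The call/return protocol can be sketched in C as follows. The `mac()` function here is a toy bit mixer standing in for the real keyed MAC (a hardware primitive in the paper, not this function), and the register names merely mirror the paper's top and key registers; none of this is Zipper Stack's actual implementation.

```c
#include <stdint.h>

/* Sketch of the Zipper Stack call/return chaining protocol. */

static uint64_t key_reg;   /* secret key, randomized at process start */
static uint64_t top_reg;   /* MAC chained over the whole call stack   */

/* Toy stand-in for the keyed MAC: deterministic, NOT cryptographically
 * secure. In Zipper Stack this is a single-cycle hardware operation. */
static uint64_t mac(uint64_t key, uint64_t prev, uint64_t ret) {
    uint64_t x = key ^ (prev * 0x9E3779B97F4A7C15ull) ^ ret;
    x ^= x >> 33; x *= 0xFF51AFD7ED558CCDull; x ^= x >> 29;
    return x;
}

/* On call: spill the old top register next to the return address on the
 * (attacker-accessible) main stack, then chain a new MAC into top_reg. */
void zs_call(uint64_t ret, uint64_t *saved_top) {
    *saved_top = top_reg;
    top_reg = mac(key_reg, top_reg, ret);
}

/* On return: 1 if the (ret, saved_top) pair authenticates, else 0. */
int zs_return(uint64_t ret, uint64_t saved_top) {
    if (mac(key_reg, saved_top, ret) != top_reg)
        return 0;              /* tampered return address or spilled MAC */
    top_reg = saved_top;       /* unwind one link of the chain */
    return 1;
}
```

Because each link is recomputed from the spilled value at return time, forging any single return address requires forging every MAC above it in the chain, which in turn requires control of the inaccessible top register.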
The operation shows that Zipper Stack is heavily dependent on (a) the efficacy of the MAC algorithm to ensure
collisions (same MAC from different inputs) do not occur, (b) the speed of the algorithm since every function call would constitute running the algorithm at least twice, and (c) the attacker not being able to access the key register to forge MACs. For (a) the authors use a well-known MAC algorithm, for (b) the authors argue that a hardware implementation would allow MAC calculation in a single cycle, and for (c) the authors argue that even if the key is leaked, the top register can only be modified at a call or a return operation. Their custom implementation on an FPGA with a RISC-V CPU achieves a 1.86% overhead on the SPEC CINT 2000 [
2] benchmark.
While Zipper Stack presents a very radical approach that may never see wide-scale commercial adoption due to its hardware modifications, it is still interesting since custom architectures for specific applications, such as defense, are not uncommon in the embedded system world. In such cases, a custom architecture designed with optimized built-in defense mechanisms is not hard to envision. Interestingly, the use of MACs for authenticating return addresses may become possible very soon on commodity hardware. For example, PACStack [
60] re-purposes the ARM
pac instruction to create a MAC chain of return addresses, very similar to Zipper Stack. As part of the ARMv8.3-A PA extension, and soon to be available on SoCs based on ARMv8.3-A and later architecture revisions,
pac allows generating
pointer authentication codes (PACs), which are MACs generated over pointer values and stored alongside the pointer. Similar to Zipper Stack, the authors use a
chain register to store PAC values, each generated from the previous chain register value and the return address of a function call. When a return sequence takes place, the reverse operation verifies the chain, as in Zipper Stack. PACStack showed geometric mean performance overheads of 2.75% and 3.28% on SPECrate and SPECspeed (part of the SPEC CPU 2017 benchmark suite), respectively. PACStack provides a strong argument for MAC-based shadow stack replacement, especially since it depends only on architecture extensions that will soon be available in commodity hardware.
On the other hand, the authors of \(\mu\)RAI take a similar but more realistic approach, especially on current-generation hardware. \(\mu\)RAI is also concerned solely with the backward-edge, but instead of verifying the return address as is common with shadow stack approaches, \(\mu\)RAI enforces Return Address Integrity (RAI), where the return address simply cannot be modified by an attacker. Their approach, in essence, is to prevent write access to the return address. \(\mu\)RAI has the same set of requirements as many of the schemes we have discussed in previous sections, such as data execution prevention (DEP or \(W \oplus X\)) and an MPU. Similar to Zipper Stack, it requires that one of the processor registers be wholly dedicated to its operation and never spilled. This register is called the State Register (SR). \(\mu\)RAI’s operation requires that the attacker cannot modify this register.
\(\mu\)RAI works by instrumenting code before branches and at return points, similar to CFL. It works solely with direct branches, i.e., branches with encoded destinations, and converts all indirect branches into direct branches by matching all possible start and endpoints. Figure
4 provides a basic overview of how
\(\mu\)RAI instrumented code looks and operates. Every function
call site is assigned a unique
function key (FK). As is seen in the figure, Function A can have multiple call sites to another Function B.
\(\mu\)RAI instruments code such that before every such call site, the value in the SR register is XOR’ed with the FK for the call site. This value is also called the
Function ID (FID). The call goes through and Function B operates. At the point where Function B returns, it checks what the authors call the
Function Lookup Table (FLT). This table has all the FIDs that could call this function. Based on which FID matches the value in the SR, the function returns to the corresponding location. Finally, the SR is XOR’ed with the same FK used before the branch, returning it to the original value before the function call. The authors tested their approach on an ARM Cortex-M4-based board and report a maximum performance overhead of 8.1% on the CoreMark [
41] (a lighter variant of CoreMark-Pro) benchmark with an average of just 0.1%, making it comparable with shadow stack mechanisms discussed previously. However, it requires on average 34.6% extra flash memory for instrumentation and FLT.
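The FK/FID/FLT interplay can be condensed into a C sketch. The global variable models the reserved SR, the switch models one function's FLT, and the key values are made up; in \(\mu\)RAI the FLT is really a sequence of compare-and-direct-branch instructions in read-only code memory, not a data structure.

```c
#include <stdint.h>

/* Sketch of µRAI's return mechanism for one function B with two call
 * sites in function A. Names and key values are illustrative. */

static uint64_t SR = 0;     /* reserved state register, initial value 0 */

#define FK_SITE1 0x11u      /* function key for call site 1 in A */
#define FK_SITE2 0x22u      /* function key for call site 2 in A */

/* Function B's return path: its Function Lookup Table (FLT) maps the
 * current Function ID (the value in SR) to the matching return site. */
int flt_return_target(void) {
    switch (SR) {
    case FK_SITE1: return 1;  /* return to the site after call 1   */
    case FK_SITE2: return 2;  /* return to the site after call 2   */
    default:       return -1; /* no FID matches: halt or recover   */
    }
}

/* A direct call from site 2: XOR the key in, "call" B, XOR it back out. */
int call_b_from_site2(void) {
    SR ^= FK_SITE2;           /* encode this call site into SR (FID) */
    int site = flt_return_target();
    SR ^= FK_SITE2;           /* restore SR after the return         */
    return site;
}
```

Because XOR is its own inverse, nested calls compose naturally: each return pops its key back out of SR, and a corrupted SR simply fails to match any FLT entry.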
The reader may have noticed that the possible return addresses are encoded into the code memory under DEP restrictions that prevent an attacker from modifying the code memory. DEP is enforced using the MPU. \(\mu\)RAI, therefore, foregoes the return address that the processor may record in its stack, which is inherently writable memory, during a function call. Instead, it implements a function return mechanism that resides completely in code memory. This enforces \(\mu\)RAI’s goal of return address integrity. \(\mu\)RAI is also the first mechanism that we have discussed in this survey that explicitly considers interrupts. Since interrupts can occur at any time and can potentially interfere with shadow stack operations, they require explicit consideration. \(\mu\)RAI instruments interrupt handler code to first save the return address that has been automatically stored on the stack by the hardware before the handler code is executed. \(\mu\)RAI saves the return address to a safe memory region hidden behind the MPU. Here \(\mu\)RAI has to essentially create a shadow stack due to the limitation of the hardware. Supporting interrupts is a significant step to eventually supporting multi-threaded scheduling under a real-time operating system (RTOS). However, dedicating a register to \(\mu\)RAI operations requires compiler modifications and may be incompatible with embedded systems that have severely limited processing capacity, especially when the software requires a large number of registers for computational purposes.
Unfortunately, none of these techniques improve forward-edge CFI. For example, in the case of \(\mu\)RAI, the attacker could keep redirecting code execution using branch operations without ever allowing execution to reach an FLT check. Such CFI mechanisms are therefore helpful only from a performance or memory perspective over a regular shadow stack. That is, they do not provide any additional security guarantees, while requiring significant codebase changes or at least a modified compiler to support their operation.
4.4 CFI Using Processor Architecture Extensions
Before we finally move toward real-time aware CFI mechanisms, we will look at two mechanisms that depend on very modern processor architecture extensions such as ARM TrustZone [
70]. TrustZone allows a processor to support two execution domains,
secure and
non-secure, each with its own address space, with the secure domain having supervisory access to the non-secure domain. CFI designers have found creative ways to use it as part of their designs.
The first is Nyman et al.’s CFI CaRE [
67], which presents an alternative to Silhouette’s (Section 4.1) approach for securing the shadow stack. An overview of its operation is given in Figure
5. While Silhouette uses binary instrumentation to prevent a privileged attacker from modifying the MPU that hides the shadow stack, CFI CaRE hides the shadow stack behind the TrustZone in the secure domain. CFI CaRE assumes that the original binary is only allowed to execute under the non-secure domain. It replaces all function calls with a
supervisor call (SVC) that launches a special function called the
branch monitor. The branch monitor runs in a privileged context, and based on the parameter passed to the SVC that launches it, the branch monitor is able to identify if the source of the SVC is a branch or a return. It then calls secure domain code, passing the source identifier as a parameter that updates the shadow stack. While the SVC ensures that all branches and returns are effectively trapped in the branch monitor, the TrustZone boundary ensures that non-secure domain code cannot view or modify the shadow stack. The authors used the Dhrystone (precursor to CoreMark) benchmarks to evaluate their work on an ARM Cortex-M23 processor. Performance overhead ranged between 13% and 513% with an overall 14.5% increase in flash memory consumption.
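The branch monitor's dispatch logic can be sketched as follows. In CFI CaRE this logic is entered via an SVC exception and the shadow stack lives behind the TrustZone boundary; here a plain array stands in for the secure-domain stack, and the opcode values, names, and return conventions are illustrative.

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of a CFI CaRE-style branch monitor. The SVC immediate (modeled
 * by 'op') tells the monitor whether the trapped instruction was a
 * branch (call) or a return. */

enum { OP_CALL = 1, OP_RETURN = 2 };

#define SHADOW_SLOTS 32

static uint32_t secure_shadow[SHADOW_SLOTS]; /* behind TrustZone in CaRE */
static size_t shadow_top = 0;

/* Runs privileged on every trapped branch or return; returns 0 on a
 * valid transfer and -1 on any violation. */
int branch_monitor(int op, uint32_t addr) {
    switch (op) {
    case OP_CALL:                       /* push the call's return address */
        if (shadow_top == SHADOW_SLOTS)
            return -1;                  /* shadow stack exhausted         */
        secure_shadow[shadow_top++] = addr;
        return 0;
    case OP_RETURN:                     /* verify against the shadow copy */
        if (shadow_top == 0)
            return -1;                  /* return without matching call   */
        return secure_shadow[--shadow_top] == addr ? 0 : -1;
    default:
        return -1;                      /* unknown SVC parameter          */
    }
}
```

Because every call and return instruction is overwritten with an SVC of the same size, all transfers funnel through this monitor without perturbing the binary's original layout.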
While CFI CaRE may seem like just a different implementation from previous approaches, it proposes a mechanism to address a crucial flaw in previous approaches with respect to embedded systems. The previously discussed approaches instrument binaries with no regard to the original layout. While this may be a non-issue for systems whose source code is available, many real-time embedded systems use proprietary legacy software and access to the source code may be limited. Further, due to memory and processor restrictions, these binaries are painstakingly built with strict adherence to page limits, available flash memory, and so forth. Unchecked binary instrumentation may destroy compatibility with the hardware. CFI CaRE’s usage of SVC simply overwrites the branch or return instructions, keeping the original binary layout intact. However, it does require extra space for the branch monitor.
CFI CaRE also supports interrupts and uses
trampolines, which are short sequences of code at the start of interrupt handlers that call into the secure domain to store the return address in a shadow stack. However, it does not support nested interrupts. If an attacker-controlled higher-priority interrupt fires before the trampoline can store the return address in the shadow stack, the attacker-controlled interrupt code could rewrite the return address. When the lower-priority interrupt finally gets to run, its trampoline would store a modified return address. Furthermore, nested interrupts can occur on an RTOS-controlled system. For example, the timer tick could fire alongside interrupts from other peripherals. Kawada et al.’s [
54] TZmCFI fills this gap. They too propose using the TrustZone to hide the shadow stack. However, they also extend the shadow stack concept to what they term
exception shadow stacks that support nested interrupts. They modify the trampolines such that every trampoline will complete all pending shadow stack transactions of lower-priority interrupts before the interrupt body is allowed to execute. This ensures that if an attacker controls the interrupt body, it cannot affect the shadow stack copy of the interrupt return address. TZmCFI showed a performance overhead of up to 84% when supporting FreeRTOS as compared to FreeRTOS without CFI. For nested interrupts, the instrumented interrupts (with the trampolines) increased interrupt execution time from 30 cycles (un-instrumented) to 132 to 236 cycles, i.e., up to a 550% increase in execution time.
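The key trampoline invariant — finish the preempted lower-priority push before anything else — can be modeled with a small C sketch. This is a simplification of TZmCFI's exception shadow stacks: a single pending slot stands in for the preempted trampoline's unfinished work, and all names are illustrative.

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of TZmCFI-style trampolines for nested interrupts. pending_ret
 * models a lower-priority trampoline that was preempted after the
 * hardware stacked its return address but before the trampoline could
 * copy that address onto the (secure) shadow stack. */

static uint32_t shadow[16];         /* secure-domain shadow stack      */
static size_t top = 0;

static uint32_t pending_ret = 0;    /* 0 = no interrupted trampoline   */

/* Trampoline entry: first complete the preempted push, then our own,
 * so an attacker-controlled interrupt body can never race the copy. */
void trampoline_enter(uint32_t hw_stacked_ret) {
    if (pending_ret) {
        shadow[top++] = pending_ret;   /* finish lower-priority work   */
        pending_ret = 0;
    }
    shadow[top++] = hw_stacked_ret;    /* now protect our own address  */
}

/* Trampoline exit: 1 if the stacked return address is still genuine. */
int trampoline_exit(uint32_t ret_on_stack) {
    return shadow[--top] == ret_on_stack;
}
```

With the pending work flushed on entry, the high-priority interrupt body runs only after both return addresses are safely in the shadow stack.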
Other work that involves extending the architecture of the processing environment includes Intel’s
Control-Flow Enforcement (CET) [
53] architecture extensions in their recent Tiger Lake [
87] processors. The CET extensions provide hardware support for shadow stacks and forward-edge CFI. Due to their recent introduction in production hardware, there is a lack of prior CFI work that builds upon CET. Further, the Tiger Lake family comprises powerful desktop-grade processors, which fall outside the scope of this work’s focus on embedded systems (see definition in Section
1.2.1). Similar in concept to CET, the authors of HCFI [
21] suggest creating a new CFI-enabled
instruction set architecture (ISA) by modifying an existing one, demonstrated on the SPARC V8-based Leon3 core [
40]. They do so by adding new stages in the CPU pipeline to perform CFI operations such as shadow stack operations and show that performance overhead with respect to an unmodified Leon3 core is less than 1% on their FPGA implementation for the SpecInt2000 benchmarks. While optimum performance can be achieved by extending the processor architecture and/or designing custom processor cores, it remains to be seen if such extensive hardware modifications are feasible for the more resource-constrained processing environments of embedded systems. Until such a time, the TrustZone-based approaches discussed earlier are more realistic.
4.6 Section Summary
The techniques discussed in this section generally follow the basic techniques listed in Section
3. The proposed mechanisms either directly apply those basic techniques or have progressively complex hardware modifications, from special registers to reduce the cost of shadow stacks (Section
4.3) to novel ISA (Section
4.4). However, the techniques do not inherently change the underlying principles of CFI and can be considered
conventional in nature. That is, they all verify the source and target destination addresses without much variation. Another important observation is that each of the techniques presented is uniquely tied to the underlying hardware for both performance and enforcement of CFI, making it difficult to compare their individual overheads. However, on a qualitative note, it is clear that the most performant CFI requires radical hardware changes, such as integrating shadow stack operations into the pipeline of the processor [
21].
A common theme in the techniques discussed, however, is the lack of any discussion regarding the implications of the overhead they introduce on systems where timing is critical, e.g., real-time systems. Real-time systems have certain characteristics that could be utilized to aid CFI and/or reduce the impact of the overhead introduced. We will now discuss these characteristics:
(1)
In periodic real-time systems, work is performed in a temporally predictable manner. That is, tasks execute during defined periodic intervals. CFI could utilize this predictable periodic nature to determine if an application is misbehaving due to attacker control.
(2)
The system is usually underutilized due to safety requirements. Since real-time systems are, in many cases, deployed in critical environments such as medical, industrial, or automotive systems, such systems are designed to not perform work all the time to reduce or eliminate the possibility of missing deadlines. For example, the system is usually provisioned with enough computing resources such that tasks do not need to consume 100% of the computing resource at all times to complete by their deadlines. Therefore, the system may have large periods of idle times. CFI could utilize the idle time, thereby reducing localized spikes in computational load and reducing the possibility of missing deadlines. Note that although these systems may be underutilized, they are still considered to be resource-constrained. The underutilization is intentional due to safety concerns and any addition in the computational requirements must be done judiciously.
(3)
The total system utilization at any given point of time is usually well characterized, and there exist schedulability tests to determine if the system may be successfully scheduled without missing deadlines under a given scheduling algorithm. These tests may differ for different types of real-time task models (periodic tasks, aperiodic tasks, etc.). None of the techniques discusses its applicability to, or the changes that must be introduced to satisfy, these schedulability tests.
None of the techniques discussed in Section
4 considers timeliness. We now discuss CFI works that are specific to real-time embedded systems.