DOI: 10.1145/3704304.3704316
Research Article · Open Access

Transaction Architecture for CXL Memory

Published: 22 January 2025

Abstract

With the recent explosive growth in worldwide data processing demands, the need to support a large volume of transactions on shared data is increasing in both edge and datacenter processing. A recent innovation in server architectures is the use of disaggregated memory based on the Compute eXpress Link (CXL) interconnect protocol. Such memory architectures are increasingly popular in datacenters, as they allow dynamic demand-sensitive resizing of aggregated memory and support heterogeneous memory types. However, while this alleviates many concerns in datacenter architectures, ensuring data integrity when using memory-based transactions over CXL faces many challenges.
We describe a novel hardware-based scheme for providing ACID (Atomicity, Consistency, Isolation, Durability) transactions in a CXL-based disaggregated memory architecture. The architecture and protocol presented in this paper require only simple hardware extensions, and are applicable to both persistent memory durable transactions and in-memory volatile transactions.

1 Introduction

Applications in both scientific computing and IoT edge processing require both large amounts of data and fast data processing. To address the latter, applications employ concurrent multi-threaded implementations to achieve their throughput and latency goals. Threads within an application routinely share memory variables and often access shared persistent state stored on non-volatile media. Concurrency control mechanisms are necessary to prevent races between the concurrent threads that can lead to data inconsistencies or program aborts. A common approach to handle the complexity of concurrency control is to structure the critical sections as transactions.
Transaction execution is serialized using software-based solutions like Two-Phase Locking (2PL) and Software Transactional Memory (STM), or, in high-performance processors, hardware mechanisms like Hardware Transactional Memory (HTM). The different techniques face a variety of performance tradeoffs and challenges. For instance, 2PL allows fine-grained locking within a transaction, but places constraints on the order of lock acquisition and release, making it awkward to add new threads dynamically. STM is a software middleware layer to transparently manage concurrency, but results in performance overheads associated with frequent software intervention. Hardware-based transactional memory attempts to avoid the software overheads, but existing HTM implementations have several limitations, especially with regard to the size of the transactions that can be supported, as discussed in Section 2.
To address the rapid growth of in-memory application data requirements, a recent innovation in server architectures is the use of disaggregated memory [16] based on the Compute eXpress Link (CXL) interconnect protocol [5]. Memory pooling architectures have been proposed [6, 11, 12, 19] as a cost-effective approach to handle the growing demand for memory capacity and performance. In memory pooling, a pool of memory devices (possibly of heterogeneous memory types) is connected to servers over a CXL link. The setup allows a host to access different types of memory (including non-volatile persistent memory for durability) through a single CXL interface. In addition, multiple servers can connect to and share the pool of memory with different sharing modalities. The pool may be partitioned among the hosts based on their projected memory requirements or may be shared among the servers. Furthermore, the CXL memory controller can provide processing-in-memory capabilities by implementing a variety of useful functions like scatter/gather accesses, compression, encryption, and performance QoS.
In this paper we propose a novel transaction management framework that addresses many of the limitations of existing concurrency-control techniques by exploiting the CXL memory controller to work synergistically with the host server. The new protocol supports the ACID (Atomicity, Consistency, Isolation, and Durability) requirements of concurrent transactions by extending the atomic memory controller we proposed in [7, 21]. That controller supported Atomicity and Durability for persistent memory transactions on host-connected NVM and DRAM memory systems, while Consistency and Isolation were enforced in software using 2PL. In this paper we show how management of consistency and isolation can be incorporated into the hardware with only a small overhead, while addressing the major pain points of existing HTM implementations. Furthermore, considerable conceptual simplicity and performance benefits are afforded by offloading the backend processing of logs and the update of transaction memory to the CXL subsystem. Our extensions leverage the cache coherence protocols in multi-core processors, as is done in current hardware for transaction support. However, by adapting the architecture of [7, 21] with CXL memory processing, we address a major issue in existing HTM implementations that limits the size of a transaction to the size of the caches (usually the L1), and avoid the performance overheads of synchronous, host-managed memory updates by offloading them to the background CXL subsystem.
The remainder of the paper is organized as follows. Section 2 provides a brief overview of ACID transaction semantics and HTM implementation, and summarizes the contributions of the paper. Our solution for hardware transaction management in disaggregated CXL memory is presented in Section 3. Implementation details and operational examples of the algorithm are described in Section 4. Section 5 describes related work in HTM, and a summary of the paper is presented in Section 6.

2 Overview

In this section we provide an overview of ACID transactions and Hardware Transactional Memory, to place our work in context.
Figure 1: Concurrent threads T1 and T2 on a multicore. With only cache coherence, a and b can have any of the four values (0,0), (0,3), (5,0) and (5,3). Using transactions, only the values (0,3) and (5,0) are allowed.

2.1 Transactions

Consider two threads T1, T2 in Figure 1 executing concurrently on different cores. The threads access shared variables x and y with initial values of 0. On completion of execution of T1 and T2 on the two cores, the legal values of variables a and b depend on the concurrency control mechanisms enforced. Almost all multi-core processors support cache coherence, which consistently orders accesses made to a single variable across the cores. However, coherence does not enforce any orderings between accesses of different variables even in a single thread, so the accesses to x and y in thread T1 (and independently in T2) can execute out of order. Hence, with only cache-coherence semantics, the allowable values of a and b can be any of (0,0), (0,3), (5,0), or (5,3). The sequential consistency model requires memory accesses within a thread to be done in program order, but does not limit the interleaving of accesses across threads. Hence, output (0,0) would not be possible, but the other three are valid interleavings. However, few multiprocessors support sequential consistency due to the overheads of enforcing the access orderings.
A common way of organizing concurrent event-driven software is a set of cooperating transactions. A transaction is a program unit that has ACID semantics: Atomicity, Consistency, Isolation and Durability. The durability (D) criterion is needed when transaction state must be preserved following a reboot or power failure. A, C, I ensure that the result of concurrent thread execution matches that of some serial execution of the threads. In the above example, under transactional semantics the output (a, b) would be either (0, 3) (representing the virtual execution order T1 followed by T2) or (5, 0) (virtual execution of T2 followed by T1). Different software protocols like two-phase locking (2PL) and software transactional memory (STM) [13] are used to allow the transactions to execute concurrently while serializing conflicting accesses correctly. However, 2PL requires locking and unlocking variables in a predefined order to avoid deadlock, generally favoring its use in canned transactions in a structured database. STM is a middleware layer that sequences the transactions correctly, but has performance drawbacks caused by frequent intervention and software overheads.
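For concreteness, the following C++ sketch renders the two transactions of Figure 1. The statement bodies (x = 3 and y = 5) are taken from the execution traces in Section 4.1; tx_begin/tx_end are illustrative stand-ins implemented here with a global mutex, not an HTM interface.

```cpp
#include <mutex>
#include <thread>

// Shared variables from Figure 1, all initially 0.
int x = 0, y = 0, a = 0, b = 0;

// Stand-ins for hardware transaction delimiters: a single global
// mutex emulates the serialization a transactional system would
// guarantee. Purely illustrative, not an HTM API.
std::mutex tx;
void tx_begin() { tx.lock(); }
void tx_end()   { tx.unlock(); }

void T1() { tx_begin(); x = 3; a = y; tx_end(); }  // statements A, B
void T2() { tx_begin(); y = 5; b = x; tx_end(); }  // statements C, D

int main() {
    std::thread t1(T1), t2(T2);
    t1.join();
    t2.join();
    // Under ACI semantics only (a, b) = (0, 3) (T1 serialized before
    // T2) or (5, 0) (T2 before T1) can be observed.
    return 0;
}
```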

2.2 Hardware Transactional Memory (HTM)

HTM is a mechanism available in high-performance processors from Intel [14], ARM [1], and IBM [18] that provides support for in-memory transactions in hardware; i.e., code sections identified to the hardware as transactions can execute concurrently on different cores while satisfying the ACI semantics. The start and finish of a transactional code section are marked using special instructions such as begin_HTM and end_HTM, respectively. All memory accesses within the transactional section are monitored by the hardware. Concurrent accesses to the same variable are detected by the hardware using the cache coherence mechanism, and, if necessary, one of the conflicting transactions is aborted.
We describe HTM using the example of Intel's Transactional Synchronization Extensions (TSX) to the x86 Instruction Set Architecture (ISA) [14], specifically Restricted Transactional Memory (RTM). The L1 cache coherence mechanism is used to detect read/write conflicts. Variables within a transactional section of code are pinned to the L1 cache for the duration of the transaction (see footnote 1). A write by a transaction on core A to a variable pinned by a transaction on core B, or a read by A of a pinned variable written by B, will cause one of the transactions to abort. The major limitation of the HTM implementation is the atomicity requirement that limits the size of a transaction to the small L1 cache on a core. Hence, when there is a capacity overflow or contention within a cache set, the transaction is forced to abort.
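As an illustration of the RTM programming model (not part of the design proposed in this paper), the sketch below uses Intel's documented _xbegin()/_xend() intrinsics with a lock-based fallback path; it assumes a TSX-capable CPU and compilation with -mrtm.

```cpp
#include <immintrin.h>   // Intel RTM intrinsics: _xbegin(), _xend()
#include <mutex>

int x = 0, y = 0, a = 0;
std::mutex fallback;     // conventional lock used after an abort

void t1_body() {
    if (_xbegin() == _XBEGIN_STARTED) {
        // Transactional path: the accesses below are tracked in the
        // L1; a conflicting access by another core, or an L1 capacity
        // overflow, aborts and control returns to _xbegin().
        x = 3;
        a = y;
        _xend();         // commit: transactional bits are flash cleared
    } else {
        // Fallback path after an abort; the hardware has rolled back
        // all transactional writes. (A production fallback must also
        // test the lock inside the transaction so the two paths
        // cannot race; omitted here for brevity.)
        std::lock_guard<std::mutex> g(fallback);
        x = 3;
        a = y;
    }
}
```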

2.3 Relation to Previous Work

Three of the most relevant previous approaches to handling large ACID transactions within an HTM are discussed below [15, 17, 23]. A software approach for ACI transactions that overcomes the L1 cache size limitation was proposed in [23], by logging the current values of variables updated within a transaction. This allows values to spill from the L1 all the way to memory, since the original values of an aborted transaction can be restored from the undo log. The proposal in [17] includes durability using persistent memory, but adds considerable complexity to the implementation. It proposes a modification of the L1 cache controller to create redo log values in hardware. On a transaction commit, it forces the synchronous writeback of all entries modified by the transaction to persistent memory, in contrast to RTM, which simply flash clears the transaction status bits without copying the cache lines out of the cache. Furthermore, to handle spills out of the L1 cache, it proposes the use of special overflow lists to hold the spilled values, which are then copied back to persistent memory on a transaction commit. The work in [15] addresses the situation where a transaction may access both persistent variables in NVM and non-persistent transactional variables in DRAM, and proposes a hybrid combination of undo and redo logging towards this end. In contrast, this paper is based on extending our previously proposed concept of a victim cache (which supported atomicity in persistent memory) to support complete ACI or ACID transactions, while exploiting the advantages of remote CXL memory to perform non-synchronous updates to transactional variables by backend log processing at the CXL controller.

2.4 Contributions

The solution proposed here is an extension of our earlier victim cache design described in [7, 21]. As mentioned previously, that design provided atomicity and durability for transactions in a persistent memory system. It assumed that consistency and isolation were enforced using a fine-grained locking mechanism like 2PL. This work extends the solution in the following ways:
It extends hardware transactional support to provide Consistency and Isolation for unbounded transaction sizes, while maintaining Atomicity and Durability using a modified victim cache.
It has been designed to work on modern CXL-based memory architectures rather than monolithic servers. This allows independent processing of front-end (cache) operations by the CPU, and back-end (memory) operations by the CXL controller. The decoupling provides conceptual clarity to the design in addition to performance gains of independent, concurrent processing between front and back ends. Furthermore, this framework provides a natural extension to handle transactions distributed across multiple host servers.
It retains the desirable properties of existing HTM implementations, and avoids costly extensions that require changes to the front-end L1 controller, or long copy operations in the critical path. The extensions use only simple hardware resources and operations.
The design is suitable for both volatile memory transactions (ACI semantics) as well as durable memory transactions (full ACID semantics), by using DRAM or persistent memory at the backend CXL-managed subsystem.

3 Transaction Management System

In this section we present the overall design of our solution, CXL-HTM. An overview of the system architecture is presented in Figure 2. It consists of multiple cores with cache-coherent private L1 caches. The L1 caches are backed by both the normal (L2, L3,..., LLC) cache hierarchy (not shown), as well as a Transaction Victim Cache (TVC) for transactional cache lines. When a cache line that has been accessed in a transaction is evicted from an L1 cache, it is directed to the TVC. Non-transactional cache lines use the normal cache hierarchy and do not interfere with the TVC. To avoid ambiguity, our model assumes that threads consistently access a shared variable either within or outside a transaction, but not both.
We describe the design using a directory-based MSI (Modified-Shared-Invalid) protocol. As in the standard (non-transactional) implementation of MSI coherence, each block in an L1 cache is in one of three states: Invalid, Shared or Modified, signifying, respectively, that the block is not available, is a clean read-only copy, or is a dirty writable copy. In addition, a transaction flag T marks the block as a transactional cache line; the flag is set if the block was brought into the cache by a transactional memory access and is cleared when the transaction commits or aborts.
The TVC holds transactional cache lines evicted from an L1 cache. If the eviction is from an active transaction due to a capacity or conflict miss, the cache line data is saved in the TVC in a transactional state (either S or M state). Dirty cache lines are also written back from L1 to the TVC when a transaction commits, and are held in the committed (C) state. The metadata associated with an evicted cache line is copied from the directory to the TVC and deleted from the directory. Subsequent directory lookups of the cache line are serviced by the TVC.
Like the victim cache in [7, 21], the TVC does not write back its entries to CXL memory. This ensures a transaction’s atomicity by preventing any of a transaction’s writes from updating memory before the transaction is committed. Instead, the CXL controller copies the values from the transaction log to the CXL memory in the background after the transaction commits. Once the memory has been updated, the cached values in the TVC can be deleted and the space freed up.
Store instructions within a transaction perform two write operations: the normal cached store, as well as a non-cached write of a log record to a write-back log (indicated by the dotted lines in Figure 2) associated with the transaction. Details of the implementation are available in [7, 9]. When a transaction commits, its log is timestamped and closed. The log is transferred to the CXL controller to update the modified words in memory (or, if durable transactions are being supported, in persistent memory).
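As a rough illustration only (the precise record format is defined in [7, 9]), a write-back log could be shaped as in the sketch below; the field names are our assumptions for exposition.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical shape of a write-back (redo) log record; illustrative,
// not the published format of [7, 9].
struct LogRecord {
    uint64_t addr;   // address of the word modified by the store
    uint64_t data;   // new value (redo information)
};

struct TxLog {
    uint32_t tx_id;
    uint64_t commit_ts = 0;          // assigned when the log is closed
    bool     aborted   = false;      // backend discards the log if set
    std::vector<LogRecord> records;

    // Issued alongside every transactional store (the non-cached
    // write indicated by the dotted lines in Figure 2).
    void append(uint64_t addr, uint64_t data) {
        records.push_back({addr, data});
    }
};
// On commit the log is timestamped and shipped to the CXL controller,
// which replays the records into (persistent) memory in the background.
```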
Figure 2: System Architecture
We next describe the modifications to the normal directory MSI protocol to handle transactions. We then discuss the implementation of transactional MSI to show how transaction aborts and commits are handled in the L1 caches, the TVC, and in CXL memory.

3.1 Transactional Directory MSI Protocol

The actions of the directory controller depend on whether a request is made from within a transaction (called transaction mode events) or from outside a transaction (called normal mode events). The controller maintains a bit vector inTransaction that indicates whether or not each core is executing a transaction; a core's bit is set when it executes tx_begin and cleared when it executes tx_end or aborts. The core's L1 cache handler sends one of three coherence requests, ld_miss, sd_miss, or sd_hit, to the directory controller when it needs a new cache line or an upgrade to an existing one. It then enters an intermediate state waiting for the controller's response (see Figure 3).
Normal Mode Events: We first review the steps taken by the controller when the request is from outside a transaction. These are the standard operations in an MSI protocol. On a ld_miss or sd_miss, the controller checks whether the requested block is in the M state in another cache (the owner). If there is no owner, the requested block is retrieved from memory (via the normal L2 hierarchy) and returned to the requesting core along with an ACK signal. In addition, on a sd_miss, the controller must also send invalidate commands to all cores that hold the block in the S state; these cores will then downgrade the state of their cached copy to I. If there is an owner, the controller retrieves the cache line from it, and the owner downgrades its L1 state to S (on a ld_miss) or to I (on a sd_miss). The cache line is written to memory and forwarded to the requester along with an ACK. The implementation should handle potential races between the controller's request for a cache block from its owner and an overlapping eviction of that block. Finally, a core makes a sd_hit coherence request to obtain write permissions for a block it has cached in the S state. The controller must send an invalidate command to all other cores holding the block in the S state, and then an ACK to the requesting core.
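The following self-contained C++ sketch models these normal-mode steps; the DirEntry type and the datapath stubs are illustrative simplifications of the hardware, not its actual interface.

```cpp
#include <bitset>
#include <cstdio>

constexpr int kCores = 4;
enum class DirState { I, S, M };
enum class Req { LD_MISS, SD_MISS, SD_HIT };

// Directory entry for one tracked cache line (cf. Figure 4(b)).
struct DirEntry {
    DirState state = DirState::I;
    std::bitset<kCores> pi;                // Π: L1s holding a copy
    int owner() const {                    // meaningful only in state M
        for (int c = 0; c < kCores; ++c) if (pi[c]) return c;
        return -1;
    }
};

// Stub datapath hooks standing in for the on-chip network.
void invalidate_in(int c)    { std::printf("invalidate core %d\n", c); }
void fetch_from_owner(int c) { std::printf("fetch from core %d, write back\n", c); }
void fetch_from_memory()     { std::printf("fetch from memory\n"); }
void send_ack(int c)         { std::printf("ACK core %d\n", c); }

// Normal-mode MSI handling of the three coherence requests,
// following the steps described above.
void handle_normal(DirEntry& e, Req req, int requester) {
    if (e.state == DirState::M && req != Req::SD_HIT) {
        // Retrieve the dirty line from its owner; the owner
        // downgrades to S on a ld_miss, or to I on a sd_miss.
        int o = e.owner();
        fetch_from_owner(o);
        if (req == Req::SD_MISS) { invalidate_in(o); e.pi[o] = false; }
    } else if (req != Req::SD_HIT) {
        fetch_from_memory();               // no owner: read from memory
    }
    if (req != Req::LD_MISS) {
        // A writer needs exclusivity: invalidate all other sharers.
        for (int c = 0; c < kCores; ++c)
            if (c != requester && e.pi[c]) { invalidate_in(c); e.pi[c] = false; }
    }
    e.state = (req == Req::LD_MISS) ? DirState::S : DirState::M;
    e.pi[requester] = true;
    send_ack(requester);
}

int main() {
    DirEntry x;
    handle_normal(x, Req::SD_MISS, 0);   // core 0 writes: I -> M
    handle_normal(x, Req::LD_MISS, 1);   // core 1 reads: owner downgrades to S
}
```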
Figure 3: Transactional MSI Protocol
Transaction Mode Events: When a core makes a coherence request from within a transaction, the controller must check for a conflict with the accesses of other concurrent transactions. The conflict is detected by analyzing the metadata associated with the cache line, which is stored in either the directory controller or the TVC. A sd_miss or sd_hit (i.e., a write) for a cache line κ causes a conflict if κ is held in the S or the M state by another transaction, while a ld_miss causes a conflict if κ is owned by another transaction (i.e., is in the M state). A conflict will trigger the abort of the requesting transaction, and a Tx_ABORT signal will be returned to the core. On an abort, all transactional cache lines that have been accessed by the transaction, along with their associated metadata, must be deleted from the L1 cache, the directory, and the TVC (see Section 3.3). If there is no conflict, a miss causes the cache line to be fetched from the TVC or memory and returned to the requester along with an ACK, while a sd_hit only needs to return an ACK. The metadata in the directory or TVC is updated accordingly.
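Continuing the previous sketch, the conflict rules above can be summarized in a single predicate. The sketch assumes the entry tracks a transactional cache line, so any other core present in Π represents a concurrent transaction.

```cpp
// Transaction-mode conflict check, using the DirEntry/Req types of
// the earlier sketch. A write (sd_miss/sd_hit) conflicts if another
// transaction holds the line in S or M; a read (ld_miss) conflicts
// only with an M owner.
bool conflicts(const DirEntry& e, Req req, int requester) {
    bool other_holder = false;
    for (int c = 0; c < kCores; ++c)
        if (c != requester && e.pi[c]) other_holder = true;
    if (!other_holder || e.state == DirState::I) return false;
    if (req == Req::LD_MISS)
        return e.state == DirState::M;   // read vs. another's write
    return true;                         // write vs. any other access
}
// On a conflict the controller returns Tx_ABORT to the requester;
// otherwise the line is fetched (from the TVC or memory) and ACKed,
// and the directory/TVC metadata is updated.
```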

3.2 Transaction Manager

The transaction manager consists of two related components: a directory and the transaction victim cache (TVC). The former tracks the state of items in the L1 caches for coherence and transaction conflict detection. The TVC acts as the overflow storage for transaction cache lines that are evicted from the L1 caches, and as a buffer for dirty cache lines of committed transactions waiting for their updates to be reflected in CXL memory. The discussion below deals with transaction mode events.
Figure 4: Format of major components: (a) L1 cache line (b) Directory entry (c) TVC entry
Directory Controller
The format of an L1 cache line is shown in Figure 4(a): the metadata consists of a tag field, the state bits S, M, I, and a transaction bit T. Figure 4(b) shows the format of a directory entry: the tag identifies the memory address being tracked; state bits (M, S, I) indicate the state of the cache line in the L1 cache(s); and bit vector Π (presence vector) indicates which (if any) of the L1 caches holds a copy of the cache line. A directory entry in state M indicates that the corresponding cache line is dirty in its owner's cache, identified in Π. The S state indicates a clean cache line that may be present in one or more of the L1 caches indicated in Π. If a requested memory address is not found in the directory, the search continues in the TVC. The L1 caches are direct mapped, and the directory is a set-associative structure with one way per core and way size equal to an L1 cache.
When a dirty cache line (state M) is evicted from the cache, the directory entry is copied to the TVC along with a copy of the evicted data, and the directory entry is deleted. Subsequent accesses to the directory entry are served from the TVC, until it is brought back to the directory. The same steps are followed when a cache line in the S state is evicted, provided there is a directory entry for that cache line. If there is no directory entry, then the TVC is already tracking that cache line, and the eviction can occur silently in the L1 cache.
Figure 5: States of a block in the TVC. The figure shows the state diagram for up to three outstanding versions.
Transaction Victim Cache
Figure 4(c) shows the format of a TVC entry. The metadata fields are similar to a directory entry with some changes. The TAG field identifies the item; the STATE field has an additional value C (committed) beyond the I, S, M states of a directory entry. The C state indicates that the entry holds the value of the last committed transaction that modified the cache line. The bit vector Γ (transaction id) is similar to the presence vector Π and identifies the ongoing transactions that have accessed the block. Finally, a version field VER tracks the number of committed transactions that have written this cache line in the TVC, but whose updates have not yet reached CXL memory via the log writes. The field is incremented when a transaction evicts a dirty cache line, and decremented when the CXL location is updated. Figure 5 shows the state transition diagram for a TVC block.
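The following structs give a compact, illustrative rendering of the Figure 4 formats; the field widths and names are ours, not the actual hardware layout.

```cpp
#include <array>
#include <bitset>
#include <cstdint>

constexpr int kCores = 4;   // one transaction per core in this sketch

struct L1Line {                       // (a) L1 cache line
    uint64_t tag;
    enum State { I, S, M } state;
    bool T;                           // set by transactional accesses
    std::array<uint8_t, 64> data;
};

struct DirEntry {                     // (b) directory entry
    uint64_t tag;
    enum State { I, S, M } state;
    std::bitset<kCores> pi;           // Π: which L1s hold the line
};

struct TVCEntry {                     // (c) TVC entry
    uint64_t tag;
    enum State { I, S, M, C } state;  // C = last committed value
    std::bitset<kCores> gamma;        // Γ: transactions that accessed it
    uint8_t ver;                      // VER: commits not yet in CXL
    std::array<uint8_t, 64> data;
};
```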
The Transaction Victim Cache (TVC) holds transactional cache lines that have been evicted from their L1 caches. The evictions may have been caused by cache conflicts during transaction execution, or may have been explicitly written back when a transaction committed. The conflict-triggered evictions of dirty and clean cache lines are shown in Figure 5 as eD and eC respectively, while commit-induced writeback of dirty lines is denoted by wb.
Coherence Requests to the TVC: The TVC receives transaction coherence requests that have been forwarded by the directory controller. If there is no valid copy in the TVC (state I), the cache line is fetched from CXL memory and returned to the directory controller, without saving it in the TVC. If the cache line is in state C the data is returned without changing the TVC state or metadata. In both cases, the directory creates and adds an entry for the cache line and forwards it to the requester.
Transaction conflicts are handled as described earlier for Transaction Mode Events. A conflict will return a Tx_ABORT signal to the core. If there is no conflict, the TVC data is returned to the requester and an entry added to the directory. If the TVC state is either M or S, the entry is deleted or reverted to the previous committed state depending on its version (transition m).
Evictions to the TVC: The TVC also receives transaction cache lines evicted by a core due to conflicts or on a commit. The evicted block must be saved in the TVC and the metadata fields updated, as described in the cases below and sketched in code after them.
State I (No copy of the evicted cache line in the TVC): A fresh block is allocated in the TVC and the TAG and DATA fields updated with those for the evicted cache line; the TID and STATE fields are copied from the corresponding directory entry, and the directory entry is deleted. The version number VER is set to 0 or 1 depending on whether a clean or dirty cache line was evicted. If the writeback was due to a commit, the state is set to C. Note that only dirty cache lines are written back to the TVC on a commit.
State M (TVC holds a dirty copy of the cache line): This situation cannot occur. To see why, note that such an eviction could only be from the same transaction that initially evicted the cache line to the TVC. This would have required the transaction to have re-accessed the evicted cache line from the TVC, at which point it would have been deleted.
State S (TVC holds a clean copy of the cache line): This situation also cannot occur. A conflict-induced eviction of such a block would be a silent eviction, with no need to inform the TVC. Also, on commit, only dirty blocks are written to the TVC.
State C (TVC holds value of the last committed transaction):
For a commit-induced writeback (wb), the DATA and TID fields are updated with those of the committing transaction. The version number VER is incremented by 1.
For a conflict-triggered eviction (eD or eC), the DATA and TID fields are updated as done in the I state. However, the cache line currently in the TVC must first be saved in an overflow area in case the overwriting transaction aborts before committing, requiring the TVC entry to be restored (see footnote 2). If the evicted line is dirty (eD), the version number VER is incremented by 1.
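The case analysis above can be summarized in the following sketch, which continues the struct definitions of the previous listing; save_to_overflow() stands in for the overflow area of footnote 2.

```cpp
#include <cassert>

// Installing a line that arrives at the TVC: eD/eC are dirty/clean
// conflict evictions, wb is a commit writeback. `incoming` carries
// the evicted data plus the metadata handed over from the directory.
enum class Arrival { eD, eC, wb };

void save_to_overflow(const TVCEntry&) { /* stub: keeps one C version */ }

void tvc_install(TVCEntry& e, Arrival a, const TVCEntry& incoming) {
    switch (e.state) {
    case TVCEntry::I:                     // no copy: allocate fresh block
        e = incoming;                     // TAG/DATA plus directory metadata
        e.ver = (a == Arrival::eC) ? 0 : 1;
        e.state = (a == Arrival::wb) ? TVCEntry::C
                : (a == Arrival::eD) ? TVCEntry::M : TVCEntry::S;
        break;
    case TVCEntry::C:                     // holds last committed value
        if (a == Arrival::wb) {           // a newer commit overwrites it
            e.data = incoming.data;
            e.gamma = incoming.gamma;
            e.ver++;
        } else {                          // live transaction evicts over it
            save_to_overflow(e);          // restore point if that tx aborts
            e.data = incoming.data;
            e.gamma = incoming.gamma;
            if (a == Arrival::eD) e.ver++;
            e.state = (a == Arrival::eD) ? TVCEntry::M : TVCEntry::S;
        }
        break;
    case TVCEntry::M:                     // cannot occur (see text)
    case TVCEntry::S:
        assert(false);
        break;
    }
}
```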
Overflow Handling in the TVC: The TVC does not evict any blocks to CXL memory. Instead, CXL memory is updated independently by log writes when a transaction commits. This is done to ensure atomicity of the transaction; otherwise the CXL memory may be left in a partially updated state if the transaction aborts [7, 21].
A block in the TVC is invalidated (i.e., freed) when the log of the last committed transaction that wrote the cache line has been copied to CXL memory. Therefore, long-running transactions could exhaust the TVC with evicted cache lines. To handle this situation, the TVC is extended into the host DRAM memory when needed, so that a transaction does not need to abort even if the TVC overflows. The DRAM cache is managed by software. The performance overheads of frequent software intervention can be mitigated by appropriately sizing the TVC and migrating blocks of both committed and long-running transactions to DRAM, thereby allowing most small and normal-sized transactions to operate at cache speed.
Version Control in the TVC: Multiple versions of a cache line in the TVC are created when successive committed transactions write back the same cache line before the earlier versions have been deleted by CXL updates. The TVC only saves the data of the last committed transaction, but tracks the number of pending CXL updates in the version metadata field. When a pending update to CXL completes (transition CXL), the version field VER is decremented. If the block is in state C when VER becomes 0, the block is deleted by setting its STATE to I.
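A sketch of this backend retirement step, continuing the previous listing:

```cpp
// The CXL controller has finished applying one committed log that
// wrote this line, so one pending version retires (transition CXL
// in Figure 5).
void on_cxl_update_done(TVCEntry& e) {
    if (e.ver > 0) e.ver--;
    if (e.ver == 0 && e.state == TVCEntry::C)
        e.state = TVCEntry::I;   // block freed: no pending updates remain
}
```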

3.3 Handling Transaction Aborts

When a transaction aborts, all traces of its execution must be removed from its L1 cache, the directory, and the TVC. The L1 entries can be efficiently deleted by invalidating all the blocks in the aborting core's L1 cache that have their T bit set, and then clearing their T bits. Directory entries referencing the aborting core should delete the core from their Π vector and free up unused entries. The inTransaction bit for the transaction is cleared. TVC blocks referencing the aborted transaction (in Γ) may also need to be invalidated or rolled back. For a TVC block in state M, the transaction is removed from Γ and the version number decremented. If VER now equals 0, the block should be invalidated by setting its state to I. If the block is in state S, the transaction is removed from Γ, and then, if the aborted transaction was the only one present in Γ, the state is set to I or C depending on the version number (transition a*). Reverting to the C state may require restoring the temporarily saved state from the overflow area.
Independent of the front-end operations, the log record for that transaction is marked as aborted. When the backend CXL controller encounters the log corresponding to an aborted transaction, the controller simply discards the log. This prevents the partial updates of the aborted transaction from being written to CXL memory.
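The TVC portion of abort handling might look as follows, continuing the earlier listing. The VER > 0 rollback path for a block in state M, which restores the saved committed copy from the overflow area, is our reading of transition a*; the text spells out only the VER = 0 case there.

```cpp
// Abort handling for one TVC block touched by aborting transaction t.
void restore_from_overflow(TVCEntry&) { /* stub: counterpart of save */ }

void tvc_on_abort(TVCEntry& e, int t) {
    if (!e.gamma[t]) return;
    e.gamma[t] = false;
    if (e.state == TVCEntry::M) {
        if (e.ver > 0) e.ver--;          // discard the aborted dirty version
        if (e.ver == 0) e.state = TVCEntry::I;
        else { restore_from_overflow(e); e.state = TVCEntry::C; }
    } else if (e.state == TVCEntry::S && e.gamma.none()) {
        if (e.ver == 0) e.state = TVCEntry::I;
        else { restore_from_overflow(e); e.state = TVCEntry::C; }
    }
}
```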

3.4 Handling Transaction Commits

When a transaction successfully completes, it must commit. This involves updating the state of the cache lines in the L1, the directory, and the TVC. Dirty cache lines in the committing core's L1 that have their T bit set are written back to the TVC; all L1 cache entries with T asserted are then invalidated and their T bits cleared. The committing core is removed from the Π fields of directory entries, and the directory state is set to I if Π becomes empty. The inTransaction bit for the transaction is cleared. The TVC state of an entry that is updated by a writeback is set to C and its version number incremented. A TVC entry in the M state that is owned by the committing transaction changes its state to C. Finally, if the committing transaction belongs to a TVC entry in the S state, the transaction is removed from Γ; if Γ becomes empty the state is set to C or I.
Independently, a commit record is written to the log along with a transaction timestamp that orders committed transactions in commit order. The values in the committed logs are asynchronously copied back to CXL memory in timestamp order and then deleted from the TVC.
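One plausible rendering of the commit-time TVC updates, continuing the earlier listing; commit writebacks of dirty L1 lines arrive through the wb installation path shown before.

```cpp
// Commit handling for one TVC block touched by committing transaction t.
void tvc_on_commit(TVCEntry& e, int t) {
    if (!e.gamma[t]) return;
    if (e.state == TVCEntry::M) {
        // t's dirty eviction becomes the committed value; VER was
        // already counted when the line was evicted to the TVC.
        e.state = TVCEntry::C;
    } else if (e.state == TVCEntry::S) {
        e.gamma[t] = false;
        if (e.gamma.none())
            e.state = (e.ver > 0) ? TVCEntry::C : TVCEntry::I;
    }
}
```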

4 Implementation and Operational Examples

We summarize the main implementation features needed to support the proposed protocol. Cache lines in the L1 cache have an additional status bit T that indicates the line was accessed from within a transaction. It identifies L1 cache blocks that must be invalidated on a transaction abort or flushed to the TVC on a transaction commit. The bit also directs evicted cache lines to either the TVC or the normal L2.
The directory controller entry holds the following metadata for a cache line in an L1 cache. A presence bit vector (Π) indicates which cores hold a copy of the cache line. A STATE field indicates whether the cache line is clean in one or more of the L1 caches (state S), dirty in one L1 cache (state M), or not present in any of the L1 caches (state I). In addition, a bit vector inTransaction indicates whether or not each core is executing a transaction.
The TVC is implemented as a high-associativity cache. It holds transactional cache lines that have been evicted from an L1. After a transaction commits, these continue to be held in the TVC until the cache lines are updated in CXL memory from the log. To handle overflow of a cache set, the TVC is extended into host memory as needed. An overflow bit associated with a set is used to indicate that the associativity for that set has been increased to include storage in host DRAM memory. During TVC lookup, the memory overflow area needs to be searched (nominally in software) as directed by the overflow bit for that set. Such TVC overflows can cause performance slowdowns if frequently accessed data overflows to DRAM, but do not require transaction aborts. Optimizing the TVC size and associativity based on workload characteristics is an important consideration that is beyond the scope of this paper. The bit vector Γ identifies the transactions that have accessed the cache line. The other metadata fields have been described earlier.
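A lookup that honors the per-set overflow bit might be structured as follows, continuing the earlier listing; TVCSet and dram_overflow_lookup() are illustrative names, the latter standing in for the software-managed search of the host-DRAM area.

```cpp
#include <vector>

struct TVCSet {
    std::vector<TVCEntry> ways;     // on-chip associative ways
    bool overflow = false;          // set: DRAM extension in use
};

TVCEntry* dram_overflow_lookup(uint64_t) { return nullptr; }  // stub

TVCEntry* tvc_lookup(TVCSet& set, uint64_t tag) {
    for (auto& w : set.ways)
        if (w.state != TVCEntry::I && w.tag == tag)
            return &w;                        // on-chip hit
    if (set.overflow)
        return dram_overflow_lookup(tag);     // slow software path
    return nullptr;                           // miss: fetch from CXL memory
}
```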

4.1 Example Scenarios

We illustrate the implementation by tracing possible orderings of the transactions of Figure 1 and showing the step-by-step state changes. In the following, the two statements of T1 are denoted by A (the store x = 3) and B (the load a = y), and those of T2 by C (the store y = 5) and D (the load b = x). The examples assume that initially the L1 caches and the victim cache are empty.
Figure 6: Transactional MSI Protocol for Scenario 1.
Scenario 1 (T1 and T2 are non-conflicting): We assume that the instruction execution ordering is A, B, followed by C, D. Figure 6 shows a trace of the execution. On executing A, core 1 sends a sd_miss(x) request to the controller, which creates a directory entry for x with state M and presence vector Π = [1 0]. The cache line is read from memory into core 1's L1 cache in the M state, and T is set to 1. On executing B, core 1 sends a ld_miss(y) to the controller, which reads the cache line from memory into core 1's L1 cache in the S state and sets T. A directory entry for y is created in state S with Π = [1 0].
T1 then closes and commits. The modified cache lines (x and a) are copied to the TVC, and all entries in core 1's L1 cache with the T bit set are flash invalidated. The directory entries for x, y and a are also invalidated. The evicted dirty cache lines x and a are added to the TVC in the committed state (C) and their version numbers are set to 1. The execution of T2 mirrors that of T1, except that the ld_miss(x) of instruction D reads the cache line from the TVC, with the value of x equal to 3.
Figure 7: Transactional MSI Protocol for Scenario 2.
Scenario 2 (T1 and T2 executions are interleaved): We assume that the instruction execution ordering is A, C, D, B. Figure 7 shows a trace of the execution. As in scenario 1, the access by A brings x into core 1’s L1 cache and updates the directory entry. Similarly, access by C brings y into core 2’s L1 and adds the entry for y into the directory.
When T2 executes D, it sends ld_miss(x) to the controller. Since the directory entry for x has state M and its Π is [1 0], the controller recognizes the conflict for x and returns a Tx_ABORT to core 2. T2 must abort, invalidate y in its L1 cache, and clear its T bit. The directory entry for y must also be invalidated.
T1 then executes B and sends ld_miss(y) to the controller. Since y is not present in the caches, the controller reads y (value 0) from memory, changes the directory state of y to S, and updates Π to [1 0]. Core 1 sets the L1 cache state of y to S and the T bit to 1. T1 then commits, the transactional cache lines in core 1's L1 cache are flash invalidated, and the dirty lines are copied to the TVC in state C. The controller invalidates the directory entries for x, a and y.
Figure 8: Transactional MSI Protocol for Scenario 3.
Scenario 3 (Handling cache evictions during interleaved execution): We assume that the instruction execution ordering is A, C, E, B, D, where E is a new instruction that forces the eviction of y from core 2's L1 cache. Figure 8 shows a trace of the execution. Suppose that after A and C execute, T2 performs another instruction (e.g., E: z = 7) that causes y (value 5) to be evicted from core 2's L1 cache. Since y has its T bit set, the controller forwards it to the TVC, where it is stored in state M with version number 1.
When T1 executes B, the controller forwards the ld_miss(y) request to the TVC, which detects the conflict with the ongoing transaction on core 2. It therefore returns Tx_ABORT to core 1. When T2 commits, the state of y in the TVC is changed from M to C. If there had been another access to y (e.g., a ld_miss) by T2 before it committed, the cache line would have been copied to T2's L1 cache, an entry added to the directory, and the TVC entry deleted.

5 Related Work

The capacity of HTM transactions is increased in [8] by introducing a software layer that implements versioning using Snapshot Isolation. LogTM-SE [23] proposes decoupling HTM from caches using an undo log and signatures, allowing in-place updates to memory along with unbounded nesting, context switching, and other migrations. However, these works do not address persistence and durability of transactions on non-volatile media such as persistent memory.
Some work [4, 9, 10] utilizes unmodified HTM for concurrency control, decoupling persistence from the HTM. cc-HTM [9] introduces the concept of an adjustable lag whereby users can allow transaction execution to continue in fast cache with selectable PM durability guarantees on the back-end. However, it requires aliasing all read and write accesses while concurrently maintaining log ordering and replaying logs for retirement. NV-HTM [4] removes the need for aliasing in cc-HTM, but is limited to one pending durability transaction per thread and must wait for prior transactions to complete before making forward progress. Hardware Transactional Persistent Memory, or HTPM [10], utilizes HTM for concurrency control and isolation, with a back-end memory controller based on [7, 21]. HTPM requires no changes to current HTM semantics or additions to the cache or cache coherence policies, and is implemented in the back-end memory controller or can be pushed to DIMMs. However, HTPM is bound by current HTM limitations.
Other work [2, 3, 17, 20, 22] requires making significant changes to existing HTM semantics and implementations. For instance, PHTM [3] and PHyTM [2] propose a new instruction called TransparentFlush, which can be used to flush a cache line from within a transaction to persistent memory without causing any transaction to abort. They also propose a change to the xend instruction that ends an atomic HTM region, so that it atomically updates a bit in persistent memory as part of its execution. Similarly, for DUDETM [20] to use HTM, it requires that designated memory variables within a transaction be allowed to be updated globally and concurrently without causing an abort. Durable HTM (DHTM) [17] and Unbounded HTM (UHTM) [15] were discussed previously in Section 2.3. To log within a transaction, PTM [22] proposes changes to processor caches while adding an on-chip scoreboard and a global transaction id register to couple HTM with PM.

6 Conclusion

Modern cloud and edge servers need to handle large numbers of concurrent applications on shared data sets. To support these applications, disaggregated memory servers using CXL are becoming increasingly popular. However, standard HTM implementations have several drawbacks in enforcing data integrity for transactions on CXL memory.
In this paper, we described a hardware-based scheme for providing ACID transactions in a CXL-based memory architecture. In our work, we separate the concerns between the CXL-controller based memory domain and the processor caching domain, allowing for novel approaches to provide transaction support. The protocol presented in this paper leverages our earlier work on atomicity and durability in persistent memory [7, 21], but adds support for full ACID transaction semantics. The scheme is applicable to both persistent memory durable transactions and in-memory volatile transactions.

Footnotes

1. Non-modified variables can be evicted from the L1 cache, but leave a trace of their earlier presence using status bits.
2. Note that at most one version of a committed block ever needs to be saved in the overflow area.

References

[2]
Hillel Avni and Trevor Brown. 2016. PHyTM: Persistent Hybrid Transactional Memory. Proceedings of the VLDB Endowment 10, 4 (2016), 409–420.
[3]
Hillel Avni, Eliezer Levy, and Avi Mendelson. 2015. Hardware Transactions in Nonvolatile Memory. In Proceedings of the 29th International Symposium on Distributed Computing - Volume 9363 (Tokyo, Japan) (DISC 2015). Springer-Verlag New York, Inc., New York, NY, USA, 617–630.
[4]
Daniel Castro, Paolo Romano, and João Barreto. 2018. Hardware Transactional Memory Meets Memory Persistency. In 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 368–377.
[5]
CXL Consortium. [n. d.]. CXL Consortium, Compute Express Link 3.0 Specification. https://computeexpresslink.org/cxl-specification/
[6]
Debendra Das Sharma, Robert Blankenship, and Daniel Berger. 2024. An Introduction to the Compute Express Link (CXL) Interconnect. ACM Comput. Surv. 56, 11, Article 290 (jul 2024), 37 pages.
[7]
Kshitij A. Doshi, Ellis R. Giles, and Peter J. Varman. 2016. Atomic Persistence for SCM with a Non-intrusive Backend Controller. In The 22nd International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 77–89.
[8]
Ricardo Filipe, Shady Issa, Paolo Romano, and João Barreto. 2019. Stretching the capacity of hardware transactional memory in IBM POWER architectures. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming (PPoPP '19). ACM.
[9]
Ellis Giles, Kshitij Doshi, and Peter Varman. 2017. Continuous Checkpointing of HTM Transactions in NVM. In Proceedings of the 2017 ACM SIGPLAN International Symposium on Memory Management (Barcelona, Spain) (ISMM 2017). ACM, New York, NY, USA, 70–81.
[10]
Ellis Giles, Kshitij Doshi, and Peter Varman. 2018. Hardware transactional persistent memory. In Proceedings of the International Symposium on Memory Systems. 190–205.
[11]
Donghyun Gouk, Miryeong Kwon, Hanyeoreum Bae, Sangwon Lee, and Myoungsoo Jung. 2023. Memory pooling with CXL. IEEE Micro 43, 2 (2023), 48–57.
[12]
Donghyun Gouk, Sangwon Lee, Miryeong Kwon, and Myoungsoo Jung. 2022. Direct Access, High-Performance Memory Disaggregation with DirectCXL. In 2022 USENIX Annual Technical Conference (USENIX ATC 22). USENIX Association, Carlsbad, CA, 287–294. https://www.usenix.org/conference/atc22/presentation/gouk
[13]
Maurice Herlihy, Victor Luchangco, Mark Moir, and William N Scherer III. 2003. Software transactional memory for dynamic-sized data structures. In Proceedings of the twenty-second annual symposium on Principles of distributed computing. ACM, 92–101.
[14]
Intel Corporation. 2012. Intel Transactional Synchronization Extensions. In Intel Architecture Instruction Set Extensions Programming Reference. Chapter 8. http://software.intel.com/.
[15]
Jungi Jeong, Jaewan Hong, Seungryoul Maeng, Changhee Jung, and Youngjin Kwon. 2020. Unbounded hardware transactional memory for a hybrid DRAM/NVM memory system. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 525–538.
[16]
Changyeon Jo, Hyunik Kim, Hexiang Geng, and Bernhard Egger. 2020. RackMem: A Tailored Caching Layer for Rack Scale Computing. In Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques (Virtual Event, GA, USA) (PACT ’20). Association for Computing Machinery, New York, NY, USA, 467–480.
[17]
Arpit Joshi, Vijay Nagarajan, Marcelo Cintra, and Stratis Viglas. 2018. Dhtm: Durable hardware transactional memory. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 452–465.
[18]
H. Q. Le, G. L. Guthrie, D. E. Williams, M. M. Michael, B. G. Frey, W. J. Starke, C. May, R. Odaira, and T. Nakaike. 2015. Transactional memory support in the IBM POWER8 processor. IBM Journal of Research and Development 59, 1 (2015), 8:1–8:14.
[19]
Huaicheng Li, Daniel S Berger, Lisa Hsu, Daniel Ernst, Pantea Zardoshti, Stanko Novakovic, Monish Shah, Samir Rajadnya, Scott Lee, Ishwar Agarwal, et al. 2023. Pond: CXL-based memory pooling systems for cloud platforms. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 574–587.
[20]
Mengxing Liu, Mingxing Zhang, Kang Chen, Xuehai Qian, Yongwei Wu, Weimin Zheng, and Jinglei Ren. 2017. DudeTM: Building Durable Transactions with Decoupling for Persistent Memory. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (Xi’an, China) (ASPLOS ’17). ACM, New York, NY, USA, 329–343.
[21]
Libei Pu, Kshitij A. Doshi, Ellis R. Giles, and Peter J. Varman. 2016. Non-Intrusive Persistence with a Backend NVM Controller. IEEE Computer Architecture Letters 15, 1 (Jan 2016), 29–32.
[22]
Z. Wang, H. Yi, R. Liu, M. Dong, and H. Chen. 2015. Persistent Transactional Memory. IEEE Computer Architecture Letters 14, 1 (Jan 2015), 58–61.
[23]
Luke Yen, Jayaram Bobba, Michael R. Marty, Kevin E. Moore, Haris Volos, Mark D. Hill, Michael M. Swift, and David A. Wood. 2007. LogTM-SE: Decoupling Hardware Transactional Memory from Caches. In 2007 IEEE 13th International Symposium on High Performance Computer Architecture. 261–272.

Published in CCIOT '24: Proceedings of the 2024 9th International Conference on Cloud Computing and Internet of Things (November 2024). Association for Computing Machinery, New York, NY, United States. ISBN 9798400717161. DOI: 10.1145/3704304.

Author Tags: Datacenter; Disaggregated Memory; CXL; Transactions; Hardware Transactional Memory; Persistent Memory; ACID