1 Introduction
Applications in both scientific computing and IoT edge processing require both large amounts of data and fast data processing. To address the latter, applications employ concurrent multi-threaded implementations to achieve their throughput and latency goals. Threads within an application routinely share memory variables and often access shared persistent state stored on non-volatile media. Concurrency control mechanisms are necessary to prevent races between the concurrent threads that can lead to data inconsistencies or program aborts. A common approach to handle the complexity of concurrency control is to structure the critical sections as transactions.
Transaction execution is serialized using software-based solutions like Two-Phase Locking (2PL) and Software Transactional Memory (STM), or, in high-performance processors, hardware mechanisms like Hardware Transactional Memory (HTM). The different techniques face a variety of performance tradeoffs and challenges. For instance, 2PL allows fine-grained locking within a transaction, but places constraints on the order of lock acquisition and release, making it awkward to add new threads dynamically. STM is a software middleware layer that transparently manages concurrency, but incurs performance overheads associated with frequent software intervention. Hardware-based transactional memory avoids the software overheads, but existing HTM implementations have several limitations, especially with regard to the size of the transactions that can be supported, as discussed in Section 2.
To address the rapid growth of in-memory application data requirements, a recent innovation in server architectures is the use of disaggregated memory [16] based on the Compute eXpress Link (CXL) interconnect protocol [5]. Memory pooling architectures have been proposed [6, 11, 12, 19] as a cost-effective approach to handle the growing demand for memory capacity and performance. In memory pooling, a pool of memory devices (possibly heterogeneous types of memory) is connected to servers over a CXL link. The setup allows a host to access different types of memory (including non-volatile persistent memory for durability) with a single CXL interface. In addition, multiple servers can connect to and share the pool of memory with different sharing modalities. The pool may be partitioned among the hosts based on their projected memory requirements or may be shared among the servers. Furthermore, the CXL memory controller can provide processing-in-memory capabilities by implementing a variety of useful functions like scatter/gather accesses, compression, encryption, performance QoS, etc.
In this paper we propose a novel transaction management framework that addresses many of the limitations of existing concurrency-control techniques by exploiting the CXL memory controller to work synergistically with the host server. The new protocol supports the ACID (Atomicity, Consistency, Isolation, and Durability) requirements of concurrent transactions by extending the atomic memory controller we proposed in [7, 21]. That controller supported Atomicity and Durability for persistent memory transactions on host-connected NVM and DRAM memory systems, while Consistency and Isolation were handled in software using 2PL. In this paper we show how management of consistency and isolation can be incorporated into the hardware with only a small overhead, while addressing the major pain points of existing HTM implementations. Furthermore, considerable conceptual simplicity and performance benefits are afforded by offloading the backend processing of logs and the update of transaction memory to the CXL subsystem. Our extensions leverage the cache coherence protocols in multi-core processors, as is done in current hardware for transaction support. However, by adapting the architecture of [7, 21] with CXL memory processing, we address a major issue in existing HTM implementations that limits the size of a transaction to the size of the caches (usually the L1), and avoid the performance overheads of synchronous, host-managed memory updates by offloading them to the background CXL subsystem.
The remainder of the paper is organized as follows. Section 2 provides a brief overview of ACID transaction semantics and HTM implementation, and summarizes the contributions of the paper. Our solution for hardware transaction management in disaggregated CXL memory is presented in Section 3. Implementation details and operational examples of the algorithm are described in Section 4. Section 5 describes related work in HTM, and a summary of the paper is presented in Section 6.
3 Transaction Management System
In this section we present the overall design of our solution, CXL-HTM. An overview of the system architecture is presented in Figure 2. It consists of multiple cores with cache-coherent private L1 caches. The L1 caches are backed by both the normal (L2, L3, ..., LLC) cache hierarchy (not shown), as well as a Transaction Victim Cache (TVC) for transactional cache lines. When a cache line that has been accessed in a transaction is evicted from an L1 cache, it is directed to the TVC. Non-transactional cache lines use the normal cache hierarchy and do not interfere with the TVC. To avoid ambiguity, our model assumes that threads consistently access a shared variable either within or outside a transaction, but not both.
We describe the design using a directory-based MSI (Modified-Shared-Invalid) protocol. As in the standard (non-transactional) implementation of MSI coherence, each block in an L1 cache is in one of three states: Invalid, Shared or Modified, signifying, respectively, that the block is not available, is a clean read-only copy, or a dirty writable copy. In addition, a transaction flag T marks the block as a transactional cache line; the flag is set if the block was brought into the cache by a transactional memory access and is cleared when the transaction commits or aborts.
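The per-line L1 metadata just described can be captured in a small software model. This is our own illustrative sketch (not the paper's hardware), with the standard MSI state plus the transaction flag T:

```python
# Illustrative model of an L1 cache line: MSI state plus the T flag.
from dataclasses import dataclass
from enum import Enum

class MSI(Enum):
    I = "Invalid"    # block not available
    S = "Shared"     # clean, read-only copy
    M = "Modified"   # dirty, writable copy

@dataclass
class L1Line:
    tag: int
    state: MSI = MSI.I
    T: bool = False  # set by a transactional access; cleared on commit/abort

    def transactional_access(self, write: bool):
        # A line brought in by a transactional memory access sets T.
        self.state = MSI.M if write else MSI.S
        self.T = True

    def commit_or_abort(self):
        # The T flag is cleared when the transaction commits or aborts.
        self.T = False

line = L1Line(tag=0x40)
line.transactional_access(write=True)
assert line.state is MSI.M and line.T
line.commit_or_abort()
assert not line.T
```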
The TVC holds transactional cache lines evicted from an L1 cache. If the eviction is from an active transaction due to a capacity or conflict miss, the cache line data is saved in the TVC in a transactional state (either S or M state). Dirty cache lines are also written back from L1 to the TVC when a transaction commits, and are held in the committed (C) state. The metadata associated with an evicted cache line is copied from the directory to the TVC and deleted from the directory. Subsequent directory lookups of the cache line are serviced by the TVC.
Like the victim cache in [7, 21], the TVC does not write back its entries to CXL memory. This ensures a transaction’s atomicity by preventing any of a transaction’s writes from updating memory before the transaction is committed. Instead, the CXL controller copies the values from the transaction log to the CXL memory in the background after the transaction commits. Once the memory has been updated, the cached values in the TVC can be deleted and the space freed up.
Store instructions within a transaction perform two write operations: the normal cached store, as well as a non-cached write of a log record to a write-back log (indicated by the dotted lines) associated with the transaction. Details of the implementation are available in [7, 9]. When a transaction commits, its log is timestamped and closed. The log is transferred to the CXL controller to update the modified words in memory (or, if durable transactions are being supported, to persistent memory).
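The dual write performed by a transactional store, and the timestamp-and-close step at commit, can be sketched as follows. The names (`TxLog`, `Record`, `tx_store`) are our own illustrative choices, not the hardware interface:

```python
# Sketch: a transactional store performs a cached store plus a
# non-cached append to the transaction's write-back log.
from dataclasses import dataclass, field

@dataclass
class Record:
    addr: int
    value: int

@dataclass
class TxLog:
    records: list = field(default_factory=list)
    committed: bool = False
    timestamp: int = 0

def tx_store(cache: dict, log: TxLog, addr: int, value: int):
    cache[addr] = value                      # normal cached store (L1)
    log.records.append(Record(addr, value))  # non-cached log write

def tx_commit(log: TxLog, clock: int):
    # On commit the log is timestamped and closed; the CXL controller
    # later replays it into memory in the background.
    log.timestamp = clock
    log.committed = True

cache, log = {}, TxLog()
tx_store(cache, log, 0x100, 7)
tx_commit(log, clock=1)
assert cache[0x100] == 7 and log.committed and len(log.records) == 1
```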
We next describe the modifications to the normal directory MSI protocol to handle transactions. We then discuss the implementation of transactional MSI to show how transaction aborts and commits are handled in the L1 caches, the TVC, and in CXL memory.
3.1 Transactional Directory MSI Protocol
The actions of the directory controller depend on whether a request is made from within a transaction (called transaction mode events) or from outside a transaction (called normal mode events). The controller maintains a bit vector inTransaction that indicates whether or not a core is executing a transaction. It is set when a transaction executes tx_begin and is cleared when it executes tx_end or aborts. The core’s L1 cache handler sends one of three coherence requests, ld_miss, sd_miss, or sd_hit, to the directory controller when it needs a new cache line or an upgrade to an existing one. It then enters an intermediate state waiting for the controller’s response (see Figure 3).
Normal Mode Events: We first review the steps done by the controller when the request is from outside a transaction. These are the standard operations in an MSI protocol. On a ld_miss or sd_miss, the controller checks whether the requested block is in the M state in another cache (the owner). If there is no owner, the requested block is retrieved from memory (via the normal L2 hierarchy) and returned to the requesting core along with an ACK signal. In addition, on a sd_miss, the controller must also send invalidate commands to all cores that hold the block in the S state; these cores will then downgrade the state of their cached copy to I. If there is an owner, the controller retrieves the cache line from it, and the owner downgrades its L1 state to S (on a ld_miss) or to I (on a sd_miss). The cache line is written to memory and forwarded to the requester along with an ACK. The implementation should handle potential races between the controller’s request for a cache block from its owner that overlaps with its eviction. Finally, a core makes a sd_hit coherence request to obtain write permissions for a block it has cached in S state. The controller must send an invalidate command to all other cores holding the block in the S state, and then an ACK to the requesting core.
Transaction Mode Events: When a core makes a coherence request from within a transaction, the controller must check for a conflict with the accesses of other concurrent transactions. The conflict is detected by analyzing the metadata associated with the cache line, which is stored in either the directory controller or the TVC. A sd_miss or sd_hit (i.e., a write) for a cache line κ causes a conflict if κ is held in the S or the M state by another transaction, while a ld_miss causes a conflict if κ is owned by another transaction (i.e., is in the M state). A conflict triggers the abort of the requesting transaction, and a Tx_ABORT signal is returned to the core. On an abort, all transactional cache lines that have been accessed by the transaction, along with their associated metadata, must be deleted from the L1 cache, the directory, and the TVC (see Section 3.3). If there is no conflict, a miss causes the cache line to be fetched from the TVC or memory and returned to the requester along with an ACK, while a sd_hit only needs to return an ACK. The metadata in the directory or TVC is updated accordingly.
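The conflict rules above reduce to a small predicate. The following is our own formulation of those rules, not a hardware description:

```python
# Conflict check for a request from transaction `tx` against a cache
# line's metadata: `state` is "M", "S" or "I"; `holders` is the set of
# transactions currently holding the line.
def conflicts(req, state, holders, tx):
    others = holders - {tx}
    if not others or state == "I":
        return False
    if req in ("sd_miss", "sd_hit"):   # a write conflicts with any holder
        return True
    return state == "M"                # a read conflicts only with an owner

assert conflicts("sd_miss", "S", {2}, tx=1)      # write vs. another reader
assert conflicts("ld_miss", "M", {2}, tx=1)      # read vs. another writer
assert not conflicts("ld_miss", "S", {2}, tx=1)  # read vs. reader: allowed
assert not conflicts("sd_hit", "M", {1}, tx=1)   # own line: allowed
```

A conflict returns Tx_ABORT to the requesting core; otherwise the request is serviced and the metadata updated.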
3.2 Transaction Manager
The transaction manager consists of two related components: a directory and the transaction victim cache (TVC). The former tracks the state of items in the L1 caches for coherence and transaction conflict detection. The TVC acts as the overflow storage for transaction cache lines that are evicted from the L1 caches, and as a buffer for dirty cache lines of committed transactions waiting for their updates to be reflected in CXL memory. The discussion below deals with transaction mode events.
Directory Controller
The format of an L1 cache line is shown in Figure 4(a): the metadata consists of a tag field, the state bits S, M, I, and a transaction bit T. Figure 4(b) shows the format of a directory entry: the tag identifies the memory address being tracked; state bits (M, S, I) indicate the state of the cache line in the L1 cache(s); and bit vector Π (presence vector) indicates which (if any) of the L1 caches holds a copy of the cache line. A directory entry in state M indicates that the corresponding cache line is dirty in its owner’s cache identified in Π. The S state indicates a clean cache line that may be present in one or more of the L1 caches indicated in Π. If a requested memory address is not found in the directory, the search continues in the TVC. The L1 caches are direct mapped, and the directory is a set-associative structure with one way per core and way size equal to an L1 cache.
When a dirty cache line (state M) is evicted from the cache, the directory entry is copied to the TVC along with a copy of the evicted data, and the directory entry is deleted. Subsequent accesses to the directory entry are served from the TVC, until it is brought back to the directory. The same steps are followed when a cache line in the S state is evicted, provided there is a directory entry for that cache line. If there is no directory entry, then the TVC is already tracking that cache line, and the eviction can occur silently in the L1 cache.
Transaction Victim Cache
Figure 4(c) shows the format of a TVC entry. The metadata fields are similar to a directory entry with some changes: the TAG field identifies the item; the STATE field has an additional value C (committed), in addition to the I, S, M states of a directory entry. The C state indicates that the entry holds the value of the last committed transaction that modified the cache line. The bit vector Γ (transaction id) is similar to the presence vector Π and identifies the ongoing transactions that have accessed the block. Finally, a version field VER tracks the number of committed transactions that have written this cache line in the TVC, but have not yet been updated in CXL by the log writes. The field is incremented when a transaction evicts a dirty cache line, and decremented when the CXL location is updated. Figure 5 shows the state transition diagram for a TVC block.
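The TVC entry format can be summarized in a small data structure. The field names follow the description above; the Python layout itself is only illustrative:

```python
# Sketch of a TVC entry: TAG, STATE (I/S/M/C), Γ, VER and data.
from dataclasses import dataclass, field

@dataclass
class TVCEntry:
    tag: int
    state: str = "I"                          # I, S, M, or C (committed)
    gamma: set = field(default_factory=set)   # Γ: transactions that accessed it
    ver: int = 0                              # committed writes pending in CXL
    data: bytes = b""

e = TVCEntry(tag=0x80)
# A commit-induced writeback of a dirty line moves the entry to C
# and increments VER (see Figure 5).
e.state, e.ver = "C", e.ver + 1
assert e.state == "C" and e.ver == 1
```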
The Transaction Victim Cache (TVC) holds transactional cache lines that have been evicted from their L1 caches. The evictions may have been caused by cache conflicts during transaction execution, or may have been explicitly written back when a transaction committed. The conflict-triggered evictions of dirty and clean cache lines are shown in Figure 5 as eD and eC respectively, while the commit-induced writeback of dirty lines is denoted by wb.
• Coherence Requests to the TVC: The TVC receives transaction coherence requests that have been forwarded by the directory controller. If there is no valid copy in the TVC (state I), the cache line is fetched from CXL memory and returned to the directory controller, without saving it in the TVC. If the cache line is in state C, the data is returned without changing the TVC state or metadata. In both cases, the directory creates and adds an entry for the cache line and forwards it to the requester.
Transaction conflicts are handled as described earlier for Transaction Mode Events. A conflict will return a Tx_ABORT signal to the core. If there is no conflict, the TVC data is returned to the requester and an entry added to the directory. If the TVC state is either M or S, the entry is deleted or reverted to the previous committed state depending on its version (transition m).
• Evictions to the TVC: The TVC also receives transaction cache lines evicted by a core due to conflicts or on a commit. The evicted block must be saved in the TVC and the metadata fields updated, as described below.
• State I (no copy of the evicted cache line in the TVC): A fresh block is allocated in the TVC and the TAG and DATA fields are updated with those of the evicted cache line; the TID and STATE fields are copied from the corresponding directory entry, and the directory entry is deleted. The version number VER is set to 0 or 1 depending on whether a clean or dirty cache line was evicted. If the writeback was due to a commit, the state is set to C. Note that only dirty cache lines are written back to the TVC on a commit.
• State M (TVC holds a dirty copy of the cache line): This situation cannot occur. To see why, note that such an eviction could only be from the same transaction that initially evicted the cache line to the TVC. This would have required the transaction to have re-accessed the evicted cache line from the TVC, at which point it would have been deleted.
• State S (TVC holds a clean copy of the cache line): This situation also cannot occur. A conflict-induced eviction of such a block would be a silent eviction in the L1, without a need to inform the TVC. Also, on commit, only dirty blocks are written to the TVC.
• State C (TVC holds value of the last committed transaction):
  • For a commit-induced writeback (wb), the DATA and TID fields are updated with those of the committing transaction. The version number VER is incremented by 1.
  • For a conflict-triggered eviction (eD or eC), the DATA and TID fields are updated as done in the I state. However, the cache line currently in the TVC must first be saved in an overflow area in case the overwriting transaction aborts before committing, requiring the TVC entry to be restored. If the evicted line is dirty (eD), the version number VER is incremented by 1.
• Overflow Handling in the TVC: The TVC does not evict any blocks to CXL memory. Instead, CXL memory is updated independently by log writes when a transaction commits. This is done to ensure atomicity of the transaction; otherwise the CXL memory may be left in a partially updated state if the transaction aborts [7, 21].
A block in the TVC is invalidated (i.e., freed) when the log of the last committed transaction that wrote the cache line has been copied to CXL memory. Therefore, long-running transactions could exhaust the TVC with evicted cache lines. To handle this situation, the TVC is extended into the host DRAM memory when needed, so that a transaction does not need to abort even if the TVC overflows. The DRAM cache is managed by software. The performance overheads of frequent software intervention can be mitigated by appropriately sizing the TVC and migrating blocks of both committed and long-running transactions to DRAM, thereby allowing most small and normal-sized transactions to operate at cache speed.
• Version Control in the TVC: Multiple versions of a cache line in the TVC are created when successive committed transactions write back the same cache line before the earlier versions have been deleted by CXL updates. The TVC only saves the data of the last committed transaction, but tracks the number of pending CXL updates in the version metadata field. When a pending update to CXL completes (transition CXL), the version field VER is decremented. If the block is in state C when VER becomes 0, the block is deleted by setting its STATE to I.
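The VER bookkeeping reduces to two events on a TVC entry. This sketch uses a plain dict and our own function names purely for illustration:

```python
# VER tracks committed writebacks of a line not yet applied to CXL.
def commit_writeback(entry):
    # Another committed transaction wrote the line back before earlier
    # versions were applied to CXL: keep only the newest data, bump VER.
    entry["state"] = "C"
    entry["ver"] += 1

def cxl_update_done(entry):
    # A pending log write reached CXL memory: one version retires.
    entry["ver"] -= 1
    if entry["state"] == "C" and entry["ver"] == 0:
        entry["state"] = "I"   # last committed value now in memory: free it

e = {"state": "C", "ver": 1}
commit_writeback(e)   # second committed writeback: ver == 2
cxl_update_done(e)    # first update retires: ver == 1, still C
cxl_update_done(e)    # second retires: ver == 0, entry freed
assert e == {"state": "I", "ver": 0}
```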
3.3 Handling Transaction Aborts
When a transaction aborts, all traces of its execution must be removed from its L1 cache, the directory, and the TVC. The L1 entries can be efficiently deleted by invalidating all the blocks in the aborting core’s L1 cache that have their T bit set, and then clearing their T bits. Directory entries referencing the aborting core should delete the core from their Π vector and free up unused entries. The inTransaction bit for the transaction is cleared. TVC blocks referencing the aborted transaction (in Γ) may also need to be invalidated or rolled back. For a TVC block in state M, the transaction is removed from Γ and the version number decremented. If VER now equals 0, the block should be invalidated by setting its state to I. If the block is in STATE S, the transaction is removed from Γ and then if the aborted transaction is the only one present in Γ, the state is set to I or C depending on the version number (transition a*). Reverting to the C state may require restoring the temporarily saved state from the overflow area.
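The abort-time handling of TVC entries in states M and S can be sketched as follows. The data structures are our own simplification, and restoring a saved committed value from the overflow area is elided:

```python
# Abort cleanup for one TVC entry; `tx` is the aborting transaction.
def abort_tvc(entry, tx):
    entry["gamma"].discard(tx)                 # remove tx from Γ
    if entry["state"] == "M":
        entry["ver"] -= 1                      # undo the dirty eviction
        if entry["ver"] == 0:
            entry["state"] = "I"               # nothing left: invalidate
    elif entry["state"] == "S" and not entry["gamma"]:
        # Transition a*: revert to the committed value if one is still
        # pending (VER > 0), otherwise invalidate.
        entry["state"] = "C" if entry["ver"] > 0 else "I"

e = {"state": "M", "gamma": {3}, "ver": 1}
abort_tvc(e, tx=3)
assert e["state"] == "I"
s = {"state": "S", "gamma": {3}, "ver": 1}
abort_tvc(s, tx=3)
assert s["state"] == "C"
```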
Independent of the front-end operations, the log record for that transaction is marked as aborted. When the backend CXL controller encounters the log corresponding to an aborted transaction, the controller simply discards the log. This prevents the partial updates of the aborted transaction from being written to CXL memory.
3.4 Handling Transaction Commits
When a transaction successfully completes, it must commit the transaction. This involves updating the state of the cache lines in L1, the directory, and TVC. Dirty cache lines in the committing L1 that have their T bit set are written back to the TVC and all L1 cache entries with T asserted are invalidated, and the T bits cleared. The transaction is removed from Π fields of directory entries and the directory state set to I if there are no other entries in Π. The inTransaction bit for the transaction is cleared. The TVC state of an entry that is updated by a writeback is set to C and the version number incremented. A TVC entry in the M state that is owned by the committing transaction changes its state to C. Finally, if the committing transaction belongs to a TVC entry in the S state, the transaction is removed from Γ; if Γ is empty the state is set to C or I.
Independently, a commit record is written to the log along with a transaction timestamp that orders committed transactions in the order of their commits. The values in the committed logs are asynchronously copied back to the CXL memory in timestamp order, and deleted from the TVC.
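The commit-time sweep of the committing core's L1 can be sketched over a toy cache. This is our own model for illustration; the real hardware performs the invalidation as a flash operation:

```python
# Commit sweep: dirty transactional lines are written back to the TVC
# (in state C); every line with T set is then invalidated and T cleared.
def commit_l1(l1):
    """l1 maps addr -> {"state": "M"|"S"|"I", "T": bool}."""
    writebacks = []
    for addr, line in l1.items():
        if line["T"]:
            if line["state"] == "M":
                writebacks.append(addr)   # goes to the TVC in state C
            line["state"], line["T"] = "I", False
    return writebacks

l1 = {0x00: {"state": "M", "T": True},
      0x40: {"state": "S", "T": True},
      0x80: {"state": "M", "T": False}}   # non-transactional: untouched
assert commit_l1(l1) == [0x00]
assert l1[0x80]["state"] == "M"
```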
4 Implementation and Operational Examples
We summarize the main implementation features to support the proposed protocol. Cache lines in the L1 cache have an additional status bit T that indicates the line was accessed from within a transaction. It identifies L1 cache blocks that must be invalidated on a transaction abort or flushed to the TVC on a transaction commit. The bit also directs evicted cache lines to either the TVC or the normal L2.
The directory controller entry holds the following metadata for a cache line in an L1 cache. A presence bit vector (Π) indicates whether or not the core holds a copy of the cache line. A STATE field indicates whether the cache line is clean in one or more of the L1 caches (state S), is dirty in one L1 cache (state M), or not present in any of the L1 caches (state I). In addition, a bit vector inTransaction indicates whether or not a core is executing a transaction.
The TVC is implemented as a high-associativity cache. It holds transactional cache lines that have been evicted from an L1. After a transaction commits, these continue to be held in the TVC until the cache lines are updated in CXL memory from the log. To handle overflow of a cache set, the TVC is extended into host memory as needed. An overflow bit associated with a set is used to indicate that the associativity for that set has been increased to include storage in host DRAM memory. During TVC lookup, the memory overflow area needs to be searched (nominally in software) as directed by the overflow bit for that set. Such TVC overflows can cause performance slowdowns if frequently accessed data overflows to DRAM, but do not require transaction aborts. Optimizing the TVC size and associativity based on workload characteristics is an important consideration that is beyond the scope of this paper. The bit vector Γ identifies the transactions that have accessed the cache line. The other metadata fields have been described earlier.
4.1 Example Scenarios
We illustrate the implementation by tracing the possible orderings of the transactions of Figure 1, and showing the step-by-step state changes. In the following, the two statements of T1 are denoted by A and B, and those of T2 by C and D. The examples assume that initially the L1 caches and the victim cache are empty.
Scenario 1 (T1 and T2 are non-conflicting): We assume that the instruction execution ordering is A, B, followed by C, D. Figure 6 shows a trace of the execution. On executing A, core 1 sends a sd_miss(x) request to the controller, which creates a directory entry for x with state M and presence vector Π = [1 0]. The cache line is read from memory into core 1’s L1 cache in the M state and T is set to 1. On executing B, core 1 sends a ld_miss(y) to the controller, which reads the cache line from memory into core 1’s L1 cache in the S state and sets T. A directory entry for y is created in state S with Π = [1 0].
T1 then closes and commits. The modified cache lines (x and a) are copied to the TVC, and all entries in core 1’s L1 cache with the T bit set are flash invalidated. The directory entries for x, y and a are also invalidated. The evicted dirty cache lines x and a are added to the TVC in the committed state (C) and their version numbers are set to 1. The execution of T2 mirrors that of T1, except that the ld_miss(x) of instruction D reads the cache line from the TVC with value of x equal to 3.
Scenario 2 (T1 and T2 executions are interleaved): We assume that the instruction execution ordering is A, C, D, B. Figure 7 shows a trace of the execution. As in Scenario 1, the access by A brings x into core 1’s L1 cache and updates the directory entry. Similarly, the access by C brings y into core 2’s L1 and adds the entry for y into the directory.
When T2 executes D, it sends ld_miss(x) to the controller. Since the directory entry for x has state M and its Π is [1 0], the controller recognizes the conflict for x, and returns a Tx_ABORT to core 2. T2 must abort and invalidate y in its L1 cache and clear its T bit. The directory entry for y must also be invalidated.
T1 then executes B and sends ld_miss(y) to the controller. Since y is not present in the caches, the controller reads y (value 0) from memory, changes the directory state of y to S and updates Π to [1 0]. Core 1 sets the L1 cache state of y to S and the T bit to 1. T1 then commits, and the transactional cache lines in core 1’s L1 cache are flash invalidated, and the dirty lines copied to the TVC in state C. The controller invalidates the directory entries for x, a and y.
Scenario 3 (Handling cache evictions during interleaved execution): We assume that the instruction execution ordering is A, C, E, B, D, where E is a new instruction that forces the eviction of y from core 2’s L1 cache. Figure 8 shows a trace of the execution. Suppose that after A and C execute, T2 performs another instruction (e.g. E: z = 7) that causes y (value 5) to be evicted from core 2’s L1 cache. Since y has its T bit set, the controller forwards it to the TVC, where it is stored in state M with version number 1.
When T1 executes B, the controller forwards the ld_miss(y) request to the TVC, which detects the conflict with the ongoing transaction on core 2. It therefore returns Tx_ABORT to core 1. When T2 commits the state of y in the TVC is changed from M to C. If there had been another access to y (e.g. ld_miss) by T2 before it committed, it would be copied to T2’s L1 cache, an entry added to the directory, and deleted from the TVC.
5 Related Work
The capacity of HTM transactions is increased by introducing a software layer for version implementation using Snapshot Isolation [8]. LogTM-SE [23] proposes decoupling HTM from caches using an undo log and signatures, allowing for in-place updates to memory and unbounded nesting, context switching, and other migrations. However, these works do not address persistence and durability of transactions onto non-volatile media such as persistent memory.
Some work [4, 9, 10] utilizes unmodified HTM for concurrency control decoupled from persistence. cc-HTM [9] introduces the concept of an adjustable lag whereby users can allow transaction execution to continue in fast cache with selectable PM durability guarantees on the back-end. However, it requires aliasing all read and write accesses while concurrently maintaining log ordering and replaying logs for retirement. NV-HTM [4] removes the need for aliasing in cc-HTM, but is limited to one pending durability transaction per thread and must wait for prior transactions to complete before making forward progress. Hardware Transactional Persistent Memory, or HTPM [10], utilizes HTM for concurrency control and isolation, with a back-end memory controller based on [7, 21]. HTPM requires no changes to current HTM semantics or additions to the cache or cache coherence policies, and is implemented in the back-end memory controller or can be pushed to DIMMs. However, HTPM is bound by current HTM limitations.
Other work [2, 3, 17, 20, 22] requires making significant changes to existing HTM semantics and implementations. For instance, PHTM [3] and PHyTM [2] propose a new instruction called TransparentFlush which can be used to flush a cache line from within a transaction to persistent memory without causing any transaction to abort. They also propose a change to the xend instruction that ends an atomic HTM region, so that it atomically updates a bit in persistent memory as part of its execution. Similarly, for DUDETM [20] to use HTM, it requires that designated memory variables within a transaction be allowed to be updated globally and concurrently without causing an abort. Durable HTM (DHTM) [17] and Unbounded HTM (UHTM) [15] were discussed previously in Section 2.3. To log within a transaction, PTM [22] proposes changes to processor caches while adding an on-chip scoreboard and a global transaction id register to couple HTM with PM.