- abstract, September 2012
Hardware prefetchers for emerging parallel applications
PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques, Pages 485–486, https://doi.org/10.1145/2370816.2370909
Hardware prefetching has been studied in the past for multiprogrammed workloads as well as GPUs. Efficient hardware prefetchers like stream-based or GHB-based ones work well for multiprogrammed workloads because different programs get mapped to ...
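As a concrete illustration of the kind of hardware prefetcher this entry contrasts with stream- and GHB-based designs, here is a minimal sketch of a per-PC stride prefetcher. It is not the mechanism proposed in the paper; the table organization, training rule, and prefetch degree are illustrative assumptions only.

```cpp
// Hedged sketch of a per-PC stride prefetcher (not the paper's proposal).
// Each load PC gets a table entry tracking its last address and stride;
// once the same stride is observed on consecutive accesses, the prefetcher
// issues `degree` prefetches further down the stream.
#include <cstdint>
#include <unordered_map>
#include <vector>

class StridePrefetcher {
    struct Entry { uint64_t last_addr = 0; int64_t stride = 0; bool trained = false; };
    std::unordered_map<uint64_t, Entry> table_;  // indexed by load PC
    int degree_;                                 // how many blocks to prefetch ahead

public:
    explicit StridePrefetcher(int degree = 2) : degree_(degree) {}

    // Called on every demand access; returns the addresses to prefetch.
    std::vector<uint64_t> access(uint64_t pc, uint64_t addr) {
        std::vector<uint64_t> prefetches;
        Entry& e = table_[pc];
        int64_t stride = static_cast<int64_t>(addr) - static_cast<int64_t>(e.last_addr);
        if (e.trained && stride == e.stride && stride != 0) {
            // Stride confirmed on consecutive accesses: prefetch ahead.
            for (int i = 1; i <= degree_; ++i)
                prefetches.push_back(
                    static_cast<uint64_t>(static_cast<int64_t>(addr) + i * e.stride));
        }
        e.trained = (stride == e.stride);
        e.stride = stride;
        e.last_addr = addr;
        return prefetches;
    }
};
```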
- abstract, September 2012
SkipCache: miss-rate aware cache management
PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques, Pages 481–482, https://doi.org/10.1145/2370816.2370907
It is common for computers to have multi-level caches. This work revolves around one question: are all levels needed by all applications during all phases of their execution? This question is especially pertinent in the multiprogrammed scenario where giving the entire ...
- abstract, September 2012
Design of a storage processing unit
PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques, Pages 479–480, https://doi.org/10.1145/2370816.2370906
Researchers showed that performing computation directly on storage devices improves system performance in terms of energy consumption and processing time. For example, Riedel et al. [2] proposed an active disk which performs computation using the ...
- poster, September 2012
Energy-efficient cache partitioning for future CMPs
PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques, Pages 465–466, https://doi.org/10.1145/2370816.2370898
The demand for high-performance computing systems requires processor vendors to increase the number of cores per chip multiprocessor (CMP). However, as their number grows, the core-to-way ratio in the last-level cache (LLC) increases, presenting ...
- poster, September 2012
PS-Dir: a scalable two-level directory cache
PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques, Pages 451–452, https://doi.org/10.1145/2370816.2370891
As the number of cores increases in both incoming and future chip multiprocessors, coherence protocols must address novel hardware structures in order to scale in terms of performance, power, and area. It is well known that most blocks accessed by ...
- poster, September 2012
Off-chip access localization for NoC-based multicores
PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques, Pages 447–448, https://doi.org/10.1145/2370816.2370889
In a network-on-chip based multicore, an off-chip data access needs to travel through the on-chip network, spending a considerable amount of time within the chip (in addition to the memory access itself). Further, it also causes additional delays for on-...
- research-article, September 2012
Base-delta-immediate compression: practical data compression for on-chip caches
PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques, Pages 377–388, https://doi.org/10.1145/2370816.2370870
Cache compression is a promising technique to increase on-chip cache capacity and to decrease on-chip and off-chip bandwidth usage. Unfortunately, directly applying well-known compression algorithms (usually implemented in software) leads to high ...
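The entry above names base-delta-immediate compression; the sketch below illustrates the base-plus-narrow-delta idea in a simplified form. It views a 64-byte line as sixteen 32-bit words with a single base and signed 8-bit deltas, which is only one encoding in the family the technique covers (the real scheme also tries a zero "immediate" base and several base/delta widths), and it is not the paper's exact algorithm.

```cpp
// Hedged sketch of a base+delta compressibility check in the spirit of
// base-delta-immediate (BDI) cache compression. Illustrative only: it tests
// whether every 32-bit word of a line can be expressed as the first word
// plus a signed 8-bit delta.
#include <cstdint>
#include <optional>
#include <vector>

struct CompressedLine {
    uint32_t base;               // one full-width base value
    std::vector<int8_t> deltas;  // one narrow delta per word
};

std::optional<CompressedLine> try_base_delta(const uint32_t (&line)[16]) {
    CompressedLine out;
    out.base = line[0];
    for (uint32_t w : line) {
        int64_t delta = static_cast<int64_t>(w) - static_cast<int64_t>(out.base);
        if (delta < INT8_MIN || delta > INT8_MAX)
            return std::nullopt;  // line is not compressible in this format
        out.deltas.push_back(static_cast<int8_t>(delta));
    }
    // Compressed size: 4-byte base + 16 one-byte deltas = 20 bytes,
    // versus the 64-byte uncompressed line.
    return out;
}
```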
- research-article, September 2012
A software memory partition approach for eliminating bank-level interference in multicore systems
PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques, Pages 367–376, https://doi.org/10.1145/2370816.2370869
The main memory system is a shared resource in modern multicore machines, resulting in serious interference, which causes performance degradation in terms of throughput slowdown and unfairness. Numerous new memory scheduling algorithms have been proposed to ...
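A common way to realize software memory partitioning of the kind this entry describes is OS-level page coloring on the DRAM bank bits, so that each application is served only from its own banks. The sketch below shows that general idea only; the bank-bit positions, bank count, and allocator interface are assumptions for illustration, not the paper's design.

```cpp
// Hedged sketch of bank-aware page coloring: the OS allocator hands an
// application only physical frames whose DRAM bank falls inside that
// application's private bank partition, so applications do not contend on
// the same banks. Bank-index bit positions below are made up; real ones
// depend on the memory controller's address mapping.
#include <cstdint>

constexpr unsigned kBankShift = 13;  // assumed: bank bits start at physical-address bit 13
constexpr unsigned kBankBits  = 3;   // assumed: 8 banks

unsigned bank_of_frame(uint64_t phys_addr) {
    return (phys_addr >> kBankShift) & ((1u << kBankBits) - 1);
}

// Allocator filter: accept a candidate free frame only if its bank belongs
// to the partition assigned to this application (a bitmask of allowed banks).
bool frame_allowed(uint64_t phys_addr, uint32_t allowed_bank_mask) {
    return (allowed_bank_mask >> bank_of_frame(phys_addr)) & 1u;
}
```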
- research-article, September 2012
The evicted-address filter: a unified mechanism to address both cache pollution and thrashing
PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques, Pages 355–366, https://doi.org/10.1145/2370816.2370868
Off-chip main memory has long been a bottleneck for system performance. With increasing memory pressure due to multiple on-chip cores, effective cache utilization is important. In a system with limited cache space, we would ideally like to prevent 1) ...
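The sketch below illustrates the insertion-priority idea behind an evicted-address filter: a missing block that was evicted recently (and so has demonstrated reuse) is inserted with high priority, while other missing blocks are inserted with low priority to limit pollution. The paper uses a Bloom filter with periodic reset; a plain hash set with an arbitrary clearing threshold stands in for it here.

```cpp
// Hedged sketch of an evicted-address filter (EAF) used to pick a cache
// insertion priority. Not the paper's exact structure: a hash set replaces
// the Bloom filter, and the clearing threshold is arbitrary.
#include <cstddef>
#include <cstdint>
#include <unordered_set>

class EvictedAddressFilter {
    std::unordered_set<uint64_t> evicted_;
    std::size_t capacity_;

public:
    explicit EvictedAddressFilter(std::size_t capacity) : capacity_(capacity) {}

    // Call when the cache evicts a block.
    void on_eviction(uint64_t block_addr) {
        evicted_.insert(block_addr);
        if (evicted_.size() > capacity_)  // crude stand-in for periodic Bloom-filter reset
            evicted_.clear();
    }

    // Call on a cache miss to pick an insertion priority.
    // true  -> insert near MRU (block showed reuse after a premature eviction)
    // false -> insert near LRU (likely pollution or streaming block)
    bool insert_with_high_priority(uint64_t block_addr) {
        auto it = evicted_.find(block_addr);
        if (it != evicted_.end()) {
            evicted_.erase(it);
            return true;
        }
        return false;
    }
};
```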
- research-article, September 2012
Optimal bypass monitor for high performance last-level caches
PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques, Pages 315–324, https://doi.org/10.1145/2370816.2370862
In the last-level cache, a large number of blocks have reuse distances greater than the available cache capacity. Cache performance and efficiency can be improved if some subset of these distant reuse blocks can reside in the cache longer. The bypass ...
- research-article, September 2012
Introducing hierarchy-awareness in replacement and bypass algorithms for last-level caches
PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques, Pages 293–304, https://doi.org/10.1145/2370816.2370860
The replacement policies for the last-level caches (LLCs) are usually designed based on the access information available locally at the LLC. These policies are inherently sub-optimal due to the lack of information about the activities in the inner levels of ...
- research-article, September 2012
Shared memory multiplexing: a novel way to improve GPGPU throughput
PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques, Pages 283–292, https://doi.org/10.1145/2370816.2370858
On-chip shared memory (a.k.a. local data share) is a critical resource to many GPGPU applications. In current GPUs, the shared memory is allocated when a thread block (also called a workgroup) is dispatched to a streaming multiprocessor (SM) and is ...
- research-article, September 2012
Complexity-effective multicore coherence
PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques, Pages 241–252, https://doi.org/10.1145/2370816.2370853
Much of the complexity and overhead (directory, state bits, invalidations) of a typical directory coherence implementation stems from the effort to make it "invisible" even to the strongest memory consistency model. In this paper, we show that a much ...
- research-article, September 2012
RISE: improving the streaming processors' reliability against soft errors in GPGPUs
PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques, Pages 191–200, https://doi.org/10.1145/2370816.2370846
With hundreds of cores integrated into a single chip, general-purpose computing on graphics processing units (GPGPUs) provides high computing power to accelerate parallel applications. However, GPGPUs exhibit high soft-error vulnerability ...
- research-article, September 2012
Making data prefetch smarter: adaptive prefetching on POWER7
- Victor Jiménez,
- Roberto Gioiosa,
- Francisco J. Cazorla,
- Alper Buyuktosunoglu,
- Pradip Bose,
- Francis P. O'Connell
PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques, Pages 137–146, https://doi.org/10.1145/2370816.2370837
Hardware data prefetch engines are integral parts of many general-purpose server-class microprocessors in the field today. Some prefetch engines allow the user to change some of their parameters. The prefetcher, however, is usually enabled in a default ...
- research-article, September 2012
Optimizing datacenter power with memory system levers for guaranteed quality-of-service
PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques, Pages 117–126, https://doi.org/10.1145/2370816.2370834
Co-location of applications is a proven technique to improve hardware utilization. Recent advances in virtualization have made co-location of independent applications on shared hardware a common scenario in datacenters. Co-location, while maintaining ...
- research-article, September 2012
XPoint cache: scaling existing bus-based coherence protocols for 2D and 3D many-core systems
- Ronald G. Dreslinski,
- Thomas Manville,
- Korey Sewell,
- Reetuparna Das,
- Nathaniel Pinckney,
- Sudhir Satpathy,
- David Blaauw,
- Dennis Sylvester,
- Trevor Mudge
PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques, Pages 75–86, https://doi.org/10.1145/2370816.2370829
With multi-core processors now mainstream, the shift to many-core processors poses a new set of design challenges. In particular, the scalability of coherence protocols remains a significant challenge. While complex Network-on-Chip interconnect fabrics ...