Compiler-controlled memory
Optimizations aimed at reducing the impact of memory operations on execution speed have long concentrated on improving cache performance. These efforts achieve a reasonable level of success. The primary limit on the compiler's ability to improve memory ...
Segregating heap objects by reference behavior and lifetime
Dynamic storage allocation has become increasingly important in many applications, in part due to the use of the object-oriented paradigm. At the same time, processor speeds are increasing faster than memory speeds and programs are increasing in size ...
Schedule-independent storage mapping for loops
This paper studies the relationship between storage requirements and performance. Storage-related dependences inhibit optimizations for locality and parallelism. Techniques such as renaming and array expansion can eliminate all storage-related ...
An empirical analysis of instruction repetition
We study the phenomenon of instruction repetition, where the inputs and outputs of multiple dynamic instances of a static instruction are repeated. We observe that over 80% of the dynamic instructions executed in several programs are repeated and most ...
Space-time scheduling of instruction-level parallelism on a raw machine
- Walter Lee,
- Rajeev Barua,
- Matthew Frank,
- Devabhaktuni Srikrishna,
- Jonathan Babb,
- Vivek Sarkar,
- Saman Amarasinghe
Increasing demand for both greater parallelism and faster clocks dictate that future generation architectures will need to decentralize their resources and eliminate primitives that require single cycle global communication. A Raw microprocessor ...
Data speculation support for a chip multiprocessor
Thread-level speculation is a technique that enables parallel execution of sequential applications on a multiprocessor. This paper describes the complete implementation of the support for thread-level speculation on the Hydra chip multiprocessor (CMP). ...
VISA: Netstation's virtual Internet SCSI adapter
In this paper we describe the implementation of VISA, our Virtual Internet SCSI Adapter. VISA was built to evaluate the performance impact on the host operating system of using IP to communicate with peripherals, especially storage devices. We have ...
Active disks: programming model, algorithms and evaluation
Several application and technology trends indicate that it might be both profitable and feasible to move computation closer to the data that it processes. In this paper, we evaluate Active Disk architectures which integrate significant processing power ...
A cost-effective, high-bandwidth storage architecture
- Garth A. Gibson,
- David F. Nagle,
- Khalil Amiri,
- Jeff Butler,
- Fay W. Chang,
- Howard Gobioff,
- Charles Hardin,
- Erik Riedel,
- David Rochberg,
- Jim Zelenka
This paper describes the Network-Attached Secure Disk (NASD) storage architecture, prototype implementations of NASD drives, array management for our architecture, and three filesystems built on our prototype. NASD provides scalable storage bandwidth ...
Hardware-software trade-offs in a direct Rambus implementation of the RAMpage memory hierarchy
The RAMpage memory hierarchy is an alternative to the traditional division between cache and main memory: main memory is moved up a level and DRAM is used as a paging device. The idea behind RAMpage is to reduce hardware complexity, if at the cost of ...
Dependence based prefetching for linked data structures
We introduce a dynamic scheme that captures the access patterns of linked data structures and can be used to predict future accesses with high accuracy. Our technique exploits the dependence relationships that exist between loads that produce addresses ...
Performance counters and state sharing annotations: a unified approach to thread locality
This paper describes a combined approach for improving thread locality that uses the hardware performance monitors of modern processors and program-centric code annotations to guide thread scheduling on SMPs. The approach relies on a shared state cache ...
Cache-conscious data placement
As the gap between memory and processor speeds continues to widen, cache efficiency is an increasingly important component of processor performance. Compiler techniques have been used to improve instruction cache performance by mapping code with temporal ...
An out-of-order execution technique for runtime binary translators
A dynamic translator emulates an instruction set architecture by translating source instructions to native code during execution. On statically-scheduled hardware, higher performance can potentially be achieved by reordering the translated instructions; ...
Overlapping execution with transfer using non-strict execution for mobile programs
In order to execute a program on a remote computer, it must first be transferred over a network. This transmission incurs the overhead of network latency before execution can begin. This latency can vary greatly depending upon the size of the program, ...
Variable length path branch prediction
Accurate branch prediction is required to achieve high performance in deeply pipelined, wide-issue processors. Recent studies have shown that conditional and indirect (or computed) branch targets can be accurately predicted by recording the path, which ...
Performance isolation: sharing and isolation in shared-memory multiprocessors
Shared-memory multiprocessors (SMPs) are being extensively used as general-purpose servers. The tight coupling of multiple processors, memory, and I/O provides enormous computing power in a single system, and enables the efficient sharing of these ...
UTLB: a mechanism for address translation on network interfaces
An important aspect of a high-speed network system is the ability to transfer data directly between the network interface and application buffers. Such a direct data path requires the network interface to "know" the virtual-to-physical address ...
Locality-aware request distribution in cluster-based network servers
We consider cluster-based network servers in which a front-end directs incoming requests to one of a number of back-ends. Specifically, we consider content-based request distribution: the front-end uses the content requested, in addition to information ...
Investigating optimal local memory performance
Recent work has demonstrated that cache space is often poorly utilized. However, no previous work has yet demonstrated upper bounds on what a cache or local memory could achieve when exploiting both spatial and temporal locality. Belady's MIN algorithm ...
Precise miss analysis for program transformations with caches of arbitrary associativity
Analyzing and optimizing program memory performance is a pressing problem in high-performance computer architectures. Currently, software solutions addressing the processor-memory performance gap include compiler- or programmer-applied optimizations like ...
Capturing dynamic memory reference behavior with adaptive cache topology
Memory references exhibit locality and are therefore not uniformly distributed across the sets of a cache. This skew reduces the effectiveness of a cache because it results in the caching of a considerable number of less-recently-used lines which are ...
Accelerating multi-media processing by implementing memoing in multiplication and division units
This paper proposes a technique that enables performing multi-cycle (multiplication, division, square-root …) computations in a single cycle. The technique is based on the notion of memoing: saving the input and output of previous calculations ...
Value speculation scheduling for high performance processors
Recent research in value prediction shows a surprising amount of predictability for the values produced by register-writing instructions. Several hardware based value predictor designs have been proposed to exploit this predictability by eliminating ...
An empirical study of decentralized ILP execution models
Recent fascination for dynamic scheduling as a means for exploiting instruction-level parallelism has introduced significant interest in the scalability aspects of dynamic scheduling hardware. In order to overcome the scalability problems of centralized ...
Fast out-of-order processor simulation using memoization
Our new out-of-order processor simulator, FastSim, uses two innovations to speed up simulation 8 to 15 times (vs. Wisconsin SimpleScalar) with no loss in simulation accuracy. First, FastSim uses speculative direct-execution to accelerate the functional ...
A look at several memory management units, TLB-refill mechanisms, and page table organizations
Virtual memory is a staple in modern systems, though there is little agreement on how its functionality is to be implemented on either the hardware or software side of the interface. The myriad of design choices and incompatible hardware mechanisms ...
Performance of database workloads on shared-memory systems with out-of-order processors
Database applications such as online transaction processing (OLTP) and decision support systems (DSS) constitute the largest and fastest-growing segment of the market for multiprocessor servers. However, most current system designs have been optimized ...