skip to main content
Reflects downloads up to 06 Nov 2024Bibliometrics

Newsletter Downloads

Skip Table Of Content Section
article
Free
Compiler-controlled memory

Optimizations aimed at reducing the impact of memory operations on execution speed have long concentrated on improving cache performance. These efforts achieve a. reasonable level of success. The primary limit on the compiler's ability to improve memory ...

article
Free
Segregating heap objects by reference behavior and lifetime

Dynamic storage allocation has become increasingly important in many applications, in part due to the use of the object-oriented paradigm. At the same time, processor speeds are increasing faster than memory speeds and programs are increasing in size ...

article
Free
Schedule-independent storage mapping for loops

This paper studies the relationship between storage requirements and performance. Storage-related dependences inhibit optimizations for locality and parallelism. Techniques such as renaming and array expansion can eliminate all storage-related ...

article
Free
An empirical analysis of instruction repetition

We study the phenomenon of instruction repetition, where the inputs and outputs of multiple dynamic instances of a static instruction are repeated. We observe that over 80% of the dynamic instructions executed in several programs are repeated and most ...

article
Free
Space-time scheduling of instruction-level parallelism on a raw machine

Increasing demand for both greater parallelism and faster clocks dictate that future generation architectures will need to decentralize their resources and eliminate primitives that require single cycle global communication. A Raw microprocessor ...

article
Free
Data speculation support for a chip multiprocessor

Thread-level speculation is a technique that enables parallel execution of sequential applications on a multiprocessor. This paper describes the complete implementation of the support for threadlevel speculation on the Hydra chip multiprocessor (CMP). ...

article
Free
VISA: Netstation's virtual Internet SCSI adapter

In this paper we describe the implementation of VISA, our Virtual Internet SCSI Adapter. VISA was built to evaluate the performance impact on the host operating system of using IP to communicate with peripherals, especially storage devices. We have ...

article
Free
Active disks: programming model, algorithms and evaluation

Several application and technology trends indicate that it might be both profitable and feasible to move computation closer to the data that it processes. In this paper, we evaluate Active Disk architectures which integrate significant processing power ...

article
Free
A cost-effective, high-bandwidth storage architecture

This paper describes the Network-Attached Secure Disk (NASD) storage architecture, prototype implementations oj NASD drives, array management for our architecture, and three, filesystems built on our prototype. NASD provides scalable storage bandwidth ...

article
Free
Hardware-software trade-offs in a direct Rambus implementation of the RAMpage memory hierarchy

The RAMpage memory hierarchy is an alternative to the traditional division between cache and main memory: main memory is moved up a level and DRAM is used as a paging device. The idea behind RAMpage is to reduce hardware complexity, if at the cost of ...

article
Free
Dependence based prefetching for linked data structures

We introduce a dynamic scheme that captures the accesspat-terns of linked data structures and can be used to predict future accesses with high accuracy. Our technique exploits the dependence relationships that exist between loads that produce addresses ...

article
Free
Performance counters and state sharing annotations: a unified approach to thread locality

This paper describes a combined approach for improving thread locality that uses the bardware performance monitors of modem processors and program-centric code annotations to guide thread scheduling on SMPs. The approach relies on a shared state cache ...

article
Free
Cache-conscious data placement

As the gap between memory and processor speeds continues to widen, cache eficiency is an increasingly important component of processor performance. Compiler techniques have been used to improve instruction cache pet$ormance by mapping code with temporal ...

article
Free
An out-of-order execution technique for runtime binary translators

A dynamic translator emulates an instruction set architccturc by translating source instructions to native code during execution. On statically-scheduled hardware, higher performance can potentially be achieved by reordering the translated instructions; ...

article
Free
Overlapping execution with transfer using non-strict execution for mobile programs

In order to execute a program on a remote computer, it mustfirst be transferred over a network. This transmission incurs the over-head of network latency before execution can begin. This latency can vary greatly depending upon the size of the program., ...

article
Free
Variable length path branch prediction

Accurate branch prediction is required to achieve high performance in deeply pipelined, wide-issue processors. Recent studies have shown that conditional and indirect (or computed) branch targets can be accuratelypredicted by recording the path, which ...

article
Free
Performance isolation: sharing and isolation in shared-memory multiprocessors

Shared-memory multiprocessors (SMPs) are being extensively used as general-purpose servers. The tight coupling of multiple processors, memory, and I/O provides enormous computing power in a single system, and enables the efficient sharing of these ...

article
Free
UTLB: a mechanism for address translation on network interfaces

An important aspect of a high-speed network system is the ability to transfer data directly between the network interface and application buffers. Such a direct data path requires the network interface to "know" the virtual-to-physical address ...

article
Free
Locality-aware request distribution in cluster-based network servers

We consider cluster-based network servers in which a front-end directs incoming requests to one of a number of back-ends. Specifically, we consider content-based request distribution: the front-end uses the content requested, in addition to information ...

article
Free
Investigating optimal local memory performance

Recent work has demonstrated that, cache space is often poorly utilized. However, no previous work has yet demonstrated upper bounds on what a cache or local memory could achieve when exploiting both spatial and temporal locality. Belady's MIN algorithm ...

article
Free
Precise miss analysis for program transformations with caches of arbitrary associativity

Analyzing and optimizing program memory performance is a pressing problem in high-performance computer architectures. Currently, software solutions addressing the processor-memory performance gap include compiler-or programmer-applied optimizations like ...

article
Free
Capturing dynamic memory reference behavior with adaptive cache topology

Memory references exhibit locality and are therefore not uniformly distributed across the sets of a cache. This skew reduces the effectiveness of a cache because it results in the caching of a considerable number of less-recently-used lines which are ...

article
Free
Accelerating multi-media processing by implementing memoing in multiplication and division units

This paper proposes a technique that enables performing multi-cycle (multiplication, division, square-root …) computations in a single cycle. The technique is based on the notion of memoing: saving the input and output of previous calculations ...

article
Free
Value speculation scheduling for high performance processors

Recent research in value prediction shows a surprising amount of predictability for the values produced by register-writing instructions. Several hardware based value predictor designs have been proposed to exploit this predictability by eliminating ...

article
Free
An empirical study of decentralized ILP execution models

Recent fascination for dynamic scheduling as a means for exploiting instruction-level parallelism has introduced significant interest in the scalability aspects of dynamic scheduling hardware. In order to overcome the scalability problems of centralized ...

article
Free
Fast out-of-order processor simulation using memoization

Our new out-of-order processor simulatol; FastSim, uses two innovations to speed up simulation 8--15 times (vs. Wisconsin SimpleScalar) with no loss in simulation accuracy. First, FastSim uses speculative direct-execution to accelerate the functional ...

article
Free
A look at several memory management units, TLB-refill mechanisms, and page table organizations

Virtual memory is a staple in modem systems, though there is little agreement on how its functionality is to be implemented on either the hardware or software side of the interface. The myriad of design choices and incompatible hardware mechanisms ...

article
Free
Performance of database workloads on shared-memory systems with out-of-order processors

Database applications such as online transaction processing (OLTP) and decision support systems (DSS) constitute the largest and fastest-growing segment of the market for multiprocessor servers. However, most current system designs have been optimized ...

Subjects

Comments