John Heinlein, Ph.D.

San Jose, California, United States
4K followers · 500+ connections

About

My background and experience offer a balance of technology and business expertise…

Articles by John

  • Even the failures are success!

    Thomas Edison was famous for having made 1,000 unsuccessful attempts at inventing the light bulb. When pressed by a…

Experience

  • Sonatus

    Sunnyvale, California, United States

  • -

    San Jose, California, United States

  • -

    San Jose, California, United States

  • -

    Shin Yokohama, Japan

  • -

    San Jose, California, United States

  • -

    San Jose, CA

  • -

    San Jose, CA

  • -

    San Jose, CA

  • -

    San Jose, CA

  • -

    San Jose, California, United States

  • -

    San Jose, California, United States


Education

  • Stanford University

    -

    Activities and Societies: Specialization in computer architecture. Core member of the Stanford FLASH Multiprocessor project team (www-flash.stanford.edu). Dissertation focused on embedded programmability to optimize high-performance data transfer and synchronization in large-scale multiprocessors. Dissertation: http://i.stanford.edu/pub/cstr/reports/csl/tr/98/759/CSL-TR-98-759.pdf

    Honors: Air Force Laboratory Graduate Fellowship (1991-1994), Intel Foundation Fellowship (1995-1996), National Science Foundation Fellowship (funding declined)

  • -

  • -

    Activities and Societies: Eta Kappa Nu (EE Honor Society, Treasurer), Tau Beta Pi (Engineering Honor Society), Carnegie Mellon Computer Club (President)

    Honors: University Honors, ECE Department Everard M. Williams Award, National Engineering Consortium William L. Everitt Student Award of Excellence

Volunteer Experience

  • Chairman, Tech Challenge Executive Committee

    The Tech Interactive

    - Present 10 years 10 months

    Education

    2019-present: I serve as Chairman of the Executive Committee for The Tech Challenge, a STEM-oriented science and technology competition for students in grades 4-12 and one of the signature programs of The Tech Interactive. In prior years I was a committee member. On the committee, we support fundraising, marketing, and overall steering of the challenge. Arm is a corporate sponsor, and I am one of two people representing our sponsorship on the committee.

  • ECE Advisory Council for the Department of Electrical and Computer Engineering

    Carnegie Mellon University

    - Present 5 years 6 months

    Education

    The ECE Alumni Advisory Council exists to advise on strategic goals for the department in the areas of undergraduate programs, graduate programs, research, and corporate relations.

  • Parent Development Council

    The Harker School

    - Present 7 years 4 months

    Education

    I am one of the parent volunteers who drive fundraising and development for The Harker School, supporting annual giving and capital campaign programs.

  • Alumni Admissions Council

    Carnegie Mellon University

    - 6 years 10 months

    Education

    I supported Carnegie Mellon University through its Carnegie Mellon Alumni Council (CMAC) program, which provides local outreach to prospective students, conducts applicant interviews, and helps raise awareness of CMU nationwide.

Publications

  • Optimized Communication and Synchronization Using a Programmable Protocol Engine (Doctoral Dissertation)

    Stanford University

    My doctoral dissertation examined different ways to use communication across large-scale multiprocessor designs. Using an infrastructure originally conceived for implementing cache coherence in software, I studied using that same infrastructure for synchronization and high-speed block transfer, demonstrating that specially optimized protocols can vastly outperform implementations built on shared memory alone. (A minimal sketch of the shared-memory synchronization baseline this work improves on appears after this publications list.)

    For full abstract, please see: http://infolab.stanford.edu/TR/CSL-TR-98-759.html

    Excerpted abstract:

    [...]
    Our study focuses in detail on two classes of communication that are important for large scale multiprocessors: block transfer and synchronization using locks and barriers. In particular, we attempt to improve the performance of these classes of communication as compared to implementations using only software on top of shared memory.
    [...]
    We find that embedding advanced communication and synchronization features in a programmable controller has a number of advantages. For example, the block transfer protocol improves transfer performance in some cases, enables the processor to perform other work in parallel, and reduces processor cache pollution caused by the transfer. The synchronization protocols reduce overhead and eliminate bottlenecks associated with synchronization primitives implemented using software on top of shared memory. Simulations of scientific applications running on FLASH show that, in many cases, synchronization support improves performance and increases the range of machine sizes over which the applications scale. Our study shows that embedded programmability is a convenient approach for supporting block transfer and synchronization, and that the FLASH system design effectively supports this approach.

  • Coherent Block Transfer in the FLASH Multiprocessor

    Proceedings of the 11th International Parallel Processing Symposium (IPPS97)

    Abstract:
    A key goal of the Stanford FLASH project is to explore the integration of multiple communication protocols in a single multiprocessor architecture. To achieve this goal, FLASH includes a programmable node controller called MAGIC, which contains an embedded protocol processor capable of implementing multiple protocols. In this paper we present a specialized protocol for block data transfer integrated with a conventional cache coherence protocol. Block transfer forms the basis for message passing implementations on top of shared memory, occurs in important workloads such as databases, and is frequently used by the operating system. We discuss the issues that arise in designing a fully integrated protocol and its interactions with cache coherence. Using microbenchmarks, MPI communication primitives, and an application running on the operating system, we compare our protocol with standard bcopy and bcopy augmented with prefetches. Our results show that integrated block transfer can accelerate communication between nodes while off-loading the task from the main processor, utilizing the network more efficiently, and reducing the associated cache pollution. Given the aggressive support for prefetching in FLASH, prefetched bcopy is able to achieve competitive performance in many cases but lacks the other three advantages of our protocol.

    (A sketch of the prefetched-bcopy baseline referenced here appears after this publications list.)

  • Integration of Message Passing and Shared Memory in the Stanford FLASH Multiprocessor

    Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI)

    Abstract: The advantages of using message passing over shared memory for certain types of communication and synchronization have provided an incentive to integrate both models within a single architecture. A key goal of the FLASH (FLexible Architecture for SHared memory) project at Stanford is to achieve this integration while maintaining a simple and efficient design. This paper presents the hardware and software mechanisms in FLASH to support various message passing protocols. We achieve low overhead message passing by delegating protocol functionality to the programmable node controllers in FLASH and by providing direct user-level access to this messaging subsystem. In contrast to most earlier work, we provide an integrated solution that handles the interaction of the messaging protocols with virtual memory, protected multiprogramming, and cache coherence. Detailed simulation studies indicate that this system can sustain message-transfer rates of several hundred megabytes per second, effectively utilizing projected network bandwidths for next generation multiprocessors.

  • The Performance Impact of Flexibility in the Stanford FLASH Multiprocessor

    Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI)

    Abstract: A flexible communication mechanism is a desirable feature in multiprocessors because it allows support for multiple communication protocols, expands performance monitoring capabilities, and leads to a simpler design and debug process. In the Stanford FLASH multiprocessor, flexibility is obtained by requiring all transactions in a node to pass through a programmable node controller, called MAGIC. In this paper, we evaluate the performance costs of flexibility by comparing the performance of FLASH to that of an idealized hardwired machine on representative parallel applications and a multiprogramming workload. To measure the performance of FLASH, we use a detailed simulator of the FLASH and MAGIC designs, together with the code sequences that implement the cache-coherence protocol. We find that for a range of optimized parallel applications the performance differences between the idealized machine and FLASH are small. For these programs, either the miss rates are small or the latency of the programmable protocol can be hidden behind the memory access time. For applications that incur a large number of remote misses or exhibit substantial hot-spotting, performance is poor for both machines, though the increased remote access latencies or the occupancy of MAGIC lead to lower performance for the flexible design. In most cases, however, FLASH is only 2%-12% slower than the idealized machine.

  • Mable: A Technique for Efficient Machine Simulation

    Stanford University Computer Systems Laboratory

    Abstract: We present a framework for an efficient instruction-level machine simulator which can be used with existing software tools to develop and analyze programs for a proposed processor architecture. The simulator exploits similarities between the instruction sets of the emulated machine and the host machine to provide fast simulation. Furthermore, existing program development tools on the host machine such as debuggers and profilers can be used without modification on the emulated program running under the simulator. The simulator can therefore be used to debug and tune application code for the new processor without building a whole new set of program development tools. The technique has applicability to a diverse set of simulation problems. We show how the framework has been used to build simulators for a shared-memory multiprocessor, a superscalar processor with support for speculative execution, and a dual-issue embedded processor.

    Other authors
    • Philippe Lacroute
    • Peter Davies
  • The Stanford FLASH Multiprocessor

    Proceedings of the 21st International Symposium on Computer Architecture (ISCA21)

    Abstract:
    The FLASH multiprocessor efficiently integrates support for cache-coherent shared memory and high-performance message passing, while minimizing both hardware and software overhead. Each node in FLASH contains a microprocessor, a portion of the machine’s global memory, a port to the interconnection network, an I/O interface, and a custom node controller called MAGIC. The MAGIC chip handles all communication both within the node and among nodes, using hardwired data paths for efficient data movement and a programmable processor optimized for executing protocol operations. The use of the protocol processor makes FLASH very flexible — it can support a variety of different communication mechanisms — and simplifies the design and implementation.
    This paper presents the architecture of FLASH and MAGIC, and discusses the base cache-coherence and message-passing protocols. Latency and occupancy numbers, which are derived from our system-level simulator and our Verilog code, are given for several common protocol operations. The paper also describes our software strategy and FLASH’s current status.

  • Instruction Level Profiling and Evaluation of the IBM RS/6000

    ISCA'91: Proceedings of the 18th Annual International Symposium on Computer Architecture

    Abstract: This paper reports preliminary results from using goblin, a new instruction level profiling system, to evaluate the IBM RISC System/6000 architecture. The evaluation presented is based on the SPEC benchmark suite. Each SPEC program (except gcc) is processed by goblin to produce an instrumented version. During execution of the instrumented program, profiling routines are invoked which trace the execution of the program. These routines also collect statistics on dynamic instruction mix, branching behavior, and resource utilization. Based on these statistics, the actual performance and the architectural efficiency of the RS/6000 are evaluated. In order to provide a context for this evaluation, a comparison to the DECStation 3100 is also presented. The entire profiling and evaluation experiment on nine of the ten SPEC programs involves tracing and analyzing over 32 billion instructions on the RS/6000. The evaluation indicates that for the SPEC benchmark suite the architecture of the RS/6000 is well balanced and exhibits impressive performance, especially on the floating-point intensive applications.

    (A minimal instrumentation sketch illustrating this style of profiling hook appears after this publications list.)

    Other authors
    • Chriss Stephens
    • Bryce Cogswell
    • Gregory Palmer
    • John P Shen
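
The dissertation and FLASH synchronization papers above repeatedly compare protocol-engine support against "synchronization primitives implemented using software on top of shared memory." As a point of reference, here is a minimal sketch of that baseline: a sense-reversing barrier built only from shared variables and C11 atomics. It is a generic textbook-style illustration, not code from the publications, and every type and function name here is invented for the example.

/*
 * Minimal sense-reversing barrier using only shared memory and C11
 * atomics -- a generic sketch of the "software on top of shared memory"
 * synchronization baseline the dissertation and FLASH papers compare
 * against. Illustrative names only; not code from the publications.
 */
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int  count;    /* arrivals still outstanding in this episode */
    atomic_bool sense;    /* global sense, flipped by the last arrival  */
    int         nthreads; /* number of participating threads            */
} sw_barrier_t;

void sw_barrier_init(sw_barrier_t *b, int nthreads)
{
    atomic_init(&b->count, nthreads);
    atomic_init(&b->sense, false);
    b->nthreads = nthreads;
}

/* Each thread passes its own local_sense flag, initialized to false,
 * and keeps it between calls (e.g. as a local variable). */
void sw_barrier_wait(sw_barrier_t *b, bool *local_sense)
{
    *local_sense = !*local_sense;

    if (atomic_fetch_sub(&b->count, 1) == 1) {
        /* Last arrival: reset the counter, then release all waiters
         * by flipping the shared sense flag. */
        atomic_store(&b->count, b->nthreads);
        atomic_store(&b->sense, *local_sense);
    } else {
        /* Spin on the shared flag. On a large machine every waiter
         * repeatedly re-fetches this cache line -- the hot-spotting
         * and overhead the protocol-engine approach offloads. */
        while (atomic_load(&b->sense) != *local_sense)
            ;
    }
}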
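
The block-transfer paper above measures its MAGIC-based protocol against "standard bcopy and bcopy augmented with prefetches." The sketch below suggests roughly what a prefetch-augmented copy loop looks like. The cache-line size, prefetch distance, and the GCC/Clang __builtin_prefetch intrinsic are assumptions of this illustration, not details taken from the paper, and unlike a real bcopy it does not handle overlapping buffers.

/*
 * Sketch of a "bcopy augmented with prefetches" style copy loop, the
 * software baseline the block-transfer paper measures against. The
 * cache-line size, prefetch distance, and __builtin_prefetch intrinsic
 * are assumptions of this illustration, not details from the paper.
 */
#include <stddef.h>
#include <string.h>

#define CACHE_LINE  64                   /* assumed line size           */
#define PF_DISTANCE (8 * CACHE_LINE)     /* assumed prefetch look-ahead */

void prefetched_bcopy(void *dst, const void *src, size_t len)
{
    const char *s = src;
    char *d = dst;
    size_t off;

    for (off = 0; off + CACHE_LINE <= len; off += CACHE_LINE) {
        if (off + PF_DISTANCE < len) {
            /* Issue prefetches ahead of the copy so cache misses
             * overlap with the copying work. The CPU still executes
             * every load and store and pollutes its cache -- the cost
             * an offloaded block-transfer protocol avoids. */
            __builtin_prefetch(s + off + PF_DISTANCE, 0 /* read  */);
            __builtin_prefetch(d + off + PF_DISTANCE, 1 /* write */);
        }
        memcpy(d + off, s + off, CACHE_LINE);
    }
    memcpy(d + off, s + off, len - off);  /* copy any sub-line tail */
}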
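
The RS/6000 paper above relies on goblin to instrument each benchmark so that profiling routines run alongside it and tally dynamic statistics. As a loose and much coarser analogue of that idea, the hooks below use GCC/Clang's -finstrument-functions feature to count dynamic function entries. This is only an illustration of compiler-inserted profiling callbacks, not goblin itself; it is single-threaded, and the file names in the build comment are hypothetical.

/*
 * Rough illustration of compiler-inserted profiling callbacks, in the
 * spirit of (but far coarser than) goblin's instruction-level
 * instrumentation: building with -finstrument-functions inserts calls
 * to the hooks below around every function in the instrumented program.
 * Single-threaded sketch; file names in the build line are hypothetical.
 *
 *   cc -O2 -finstrument-functions app.c profile_hooks.c -o app
 */
#include <stdint.h>
#include <stdio.h>

static uint64_t calls;  /* dynamic count of instrumented function entries */

__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *this_fn, void *call_site)
{
    (void)this_fn; (void)call_site;
    calls++;   /* a real profiler would bucket statistics by this_fn */
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *this_fn, void *call_site)
{
    (void)this_fn; (void)call_site;
}

/* Print the tally when the program exits. */
__attribute__((no_instrument_function, destructor))
static void report(void)
{
    fprintf(stderr, "dynamic function entries: %llu\n",
            (unsigned long long)calls);
}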

Honors & Awards

  • 2022 Outstanding Volunteer Fundraiser

    Association for Fundraising Professionals, Silicon Valley Chapter

    Recognition for achievement in volunteer fundraising on behalf of The Tech Challenge, nominated by The Tech Interactive

Languages

  • French

    Professional working proficiency

  • Spanish

    Limited working proficiency

  • Japanese

    Limited working proficiency

  • English

    Native or bilingual proficiency

Recommendations received

16 people have recommended John
