John Heinlein, Ph.D.
San Jose, California, United States
4K followers
500+ connections
About
My background and experience offer a balance of both technology and business expertise…
Articles by John
Activity
-
What a blast to be able to ask Richard Nass the questions for a change! Thanks for joining us!
Shared by John Heinlein, Ph.D.
-
Big thank you to John Heinlein, Ph.D. and Sonatus for having me as a guest on their Garage Podcast. I have to admit, it was a little weird being in…
Liked by John Heinlein, Ph.D.
Experience
Education
-
Stanford University
-
Activities and Societies: Specialization in computer architecture. Core member of Stanford FLASH Multiprocessor project team (www-flash.stanford.edu). Dissertation focused on embedded programmability to optimize high-performance data transfer and synchronization in large-scale multiprocessors. Dissertation: http://i.stanford.edu/pub/cstr/reports/csl/tr/98/759/CSL-TR-98-759.pdf
Honors: Air Force Laboratory Graduate Fellowship (1991-1994), Intel Foundation Fellowship (1995-1996), National Science Foundation Fellowship (funding declined)
-
-
-
-
Activities and Societies: Eta Kappa Nu (EE Honor Society, Treasurer), Tau Beta Pi (Engineering Honor Society), Carnegie Mellon Computer Club (President)
Honors: University Honors, ECE Department Everard M. Williams Award, National Engineering Consortium William L. Everitt Student Award of Excellence
Volunteer Experience
-
Chairman, Tech Challenge Executive Committee
The Tech Interactive
- Present 10 years 10 months
Education
2019-present: I serve as Chairman of the Executive Committee for the Tech Challenge, one of the signature programs of the Tech Interactive, which is a STEM-oriented science and technology competition for students in grades 4-12. In prior years I was a committee member. The committee supports fundraising, marketing, and other steering of the challenge. Arm is a corporate sponsor and I am one of two people representing our sponsorship to the committee.
-
ECE Advisory Council for the Department of Electrical and Computer Engineering
Carnegie Mellon University
- Present 5 years 6 months
Education
The ECE Alumni Advisory Council exists to advise on strategic goals for the department in the areas of undergraduate programs, graduate programs, research, and corporate relations.
-
Parent Development Council
The Harker School
- Present 7 years 4 months
Education
I am one of the parent volunteers who drive fundraising and development for the Harker School and support its annual giving and capital campaign programs.
-
Alumni Admissions Council
Carnegie Mellon University
- 6 years 10 months
Education
I supported Carnegie Mellon University through their Carnegie Mellon Alumni Council (CMAC) program, which provides local outreach to potential students, conducts applicant interviews, and helps raise awareness of CMU nationwide.
Publications
-
Optimized Communication and Synchronization Using a Programmable Protocol Engine (Doctoral Dissertation)
Stanford University
My doctoral dissertation focused on the topic of different ways to use communication across large scale multiprocessor designs. Using an infrastructure conceived for coherency using software, I studied uses of that same infrastructure instead for synchronization and high speed block transfer, demonstrating that specially optimized protocols can vastly outperform those based on shared memory alone.
For full abstract, please see: http://infolab.stanford.edu/TR/CSL-TR-98-759.html
Excerpted abstract:
[...]
Our study focuses in detail on two classes of communication that are important for large scale multiprocessors: block transfer and synchronization using locks and barriers. In particular, we attempt to improve the performance of these classes of communication as compared to implementations using only software on top of shared memory.
[...]
We find that embedding advanced communication and synchronization features in a programmable controller has a number of advantages. For example, the block transfer protocol improves transfer performance in some cases, enables the processor to perform other work in parallel, and reduces processor cache pollution caused by the transfer. The synchronization protocols reduce overhead and eliminate bottlenecks associated with synchronization primitives implemented using software on top of shared memory. Simulations of scientific applications running on FLASH show that, in many cases, synchronization support improves performance and increases the range of machine sizes over which the applications scale. Our study shows that embedded programmability is a convenient approach for supporting block transfer and synchronization, and that the FLASH system design effectively supports this approach.
-
Coherent Block Transfer in the FLASH Multiprocessor
Proceedings of the 11th International Parallel Processing Symposium (IPPS97)
Abstract:
A key goal of the Stanford FLASH project is to explore the integration of multiple communication protocols in a single multiprocessor architecture. To achieve this goal, FLASH includes a programmable node controller called MAGIC, which contains an embedded protocol processor capable of implementing multiple protocols. In this paper we present a specialized protocol for block data transfer integrated with a conventional cache coherence protocol. Block transfer forms the basis for message passing implementations on top of shared memory, occurs in important workloads such as databases, and is frequently used by the operating system. We discuss the issues that arise in designing a fully integrated protocol and its interactions with cache coherence. Using microbenchmarks, MPI communication primitives, and an application running on the operating system, we compare our protocol with standard bcopy and bcopy augmented with prefetches. Our results show that integrated block transfer can accelerate communication between nodes while off-loading the task from the main processor, utilizing the network more efficiently, and reducing the associated cache pollution. Given the aggressive support for prefetching in FLASH, prefetched bcopy is able to achieve competitive performance in many cases but lacks the other three advantages of our protocol.
-
Integration of Message Passing and Shared Memory in the Stanford FLASH Multiprocessor
Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI)
Abstract: The advantages of using message passing over shared memory for certain types of communication and synchronization have provided an incentive to integrate both models within a single architecture. A key goal of the FLASH (FLexible Architecture for SHared memory) project at Stanford is to achieve this integration while maintaining a simple and efficient design. This paper presents the hardware and software mechanisms in FLASH to support various message passing protocols. We achieve low overhead message passing by delegating protocol functionality to the programmable node controllers in FLASH and by providing direct user-level access to this messaging subsystem. In contrast to most earlier work, we provide an integrated solution that handles the interaction of the messaging protocols with virtual memory, protected multiprogramming, and cache coherence. Detailed simulation studies indicate that this system can sustain message-transfer rates of several hundred megabytes per second, effectively utilizing projected network bandwidths for next generation multiprocessors.
-
The Performance Impact of Flexibility in the Stanford FLASH Multiprocessor
Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI)
Abstract: A flexible communication mechanism is a desirable feature in multiprocessors because it allows support for multiple communication protocols, expands performance monitoring capabilities, and leads to a simpler design and debug process. In the Stanford FLASH multiprocessor, flexibility is obtained by requiring all transactions in a node to pass through a programmable node controller, called MAGIC. In this paper, we evaluate the performance costs of flexibility by comparing the performance of FLASH to that of an idealized hardwired machine on representative parallel applications and a multiprogramming workload. To measure the performance of FLASH, we use a detailed simulator of the FLASH and MAGIC designs, together with the code sequences that implement the cache-coherence protocol. We find that for a range of optimized parallel applications the performance differences between the idealized machine and FLASH are small. For these programs, either the miss rates are small or the latency of the programmable protocol can be hidden behind the memory access time. For applications that incur a large number of remote misses or exhibit substantial hot-spotting, performance is poor for both machines, though the increased remote access latencies or the occupancy of MAGIC lead to lower performance for the flexible design. In most cases, however, FLASH is only 2%-12% slower than the idealized machine.
-
Mable: A Technique for Efficient Machine Simulation
Stanford University Computer Systems Laboratory
Abstract: We present a framework for an efficient instruction-level machine simulator which can be used with existing software tools to develop and analyze programs for a proposed processor architecture. The simulator exploits similarities between the instruction sets of the emulated machine and the host machine to provide fast simulation. Furthermore, existing program development tools on the host machine such as debuggers and profilers can be used without modification on the emulated program running under the simulator. The simulator can therefore be used to debug and tune application code for the new processor without building a whole new set of program development tools. The technique has applicability to a diverse set of simulation problems. We show how the framework has been used to build simulators for a shared-memory multiprocessor, a superscalar processor with support for speculative execution, and a dual-issue embedded processor.
-
The Stanford FLASH Multiprocessor
Proceedings of the 21st International Symposium on Computer Architecture (ISCA21)
Abstract:
The FLASH multiprocessor efficiently integrates support for cache-coherent shared memory and high-performance message passing, while minimizing both hardware and software overhead. Each node in FLASH contains a microprocessor, a portion of the machine’s global memory, a port to the interconnection network, an I/O interface, and a custom node controller called MAGIC. The MAGIC chip handles all communication both within the node and among nodes, using hardwired data paths for efficient data movement and a programmable processor optimized for executing protocol operations. The use of the protocol processor makes FLASH very flexible (it can support a variety of different communication mechanisms) and simplifies the design and implementation.
This paper presents the architecture of FLASH and MAGIC, and discusses the base cache-coherence and message-passing protocols. Latency and occupancy numbers, which are derived from our system-level simulator and our Verilog code, are given for several common protocol operations. The paper also describes our software strategy and FLASH’s current status.
-
Instruction Level Profiling and Evaluation of the IBM RS/6000
ISCA'91: Proceedings of the 18th Annual International Symposium on Computer Architecture
Abstract: This paper reports preliminary results from using goblin, a new instruction level profiling system, to evaluate the IBM RISC System/6000 architecture. The evaluation presented is based on the SPEC benchmark suite. Each SPEC program (except gcc) is processed by goblin to produce an instrumented version. During execution of the instrumented program, profiling routines are invoked which trace the execution of the program. These routines also collect statistics on dynamic instruction mix, branching behavior, and resource utilization. Based on these statistics, the actual performance and the architectural efficiency of the RS/6000 are evaluated. In order to provide a context for this evaluation, a comparison to the DECStation 3100 is also presented. The entire profiling and evaluation experiment on nine of the ten SPEC programs involves tracing and analyzing over 32 billion instructions on the RS/6000. The evaluation indicates that for the SPEC benchmark suite the architecture of the RS/6000 is well balanced and exhibits impressive performance, especially on the floating-point intensive applications.
Honors & Awards
-
2022 Outstanding Volunteer Fundraiser
Association for Fundraising Professionals, Silicon Valley Chapter
Recognition for achievement in volunteer fundraising on behalf of The Tech Challenge, nominated by The Tech Interactive
Languages
-
French
Professional working proficiency
-
Spanish
Limited working proficiency
-
Japanese
Limited working proficiency
-
English
Native or bilingual proficiency
Recommendations received
16 people have recommended John
More activity by John
-
Exciting to be on Sanjay Gangal's famous EDA Cafe podcast yet again, at least my second time! We were excited to share our perspective on SDV and…
Shared by John Heinlein, Ph.D.
-
"By leveraging software-defined vehicles, we're not just improving cars today; we're laying the foundation for tomorrow's innovations." John…
Liked by John Heinlein, Ph.D.
-
Our latest episode of #TheGarage #podcast with guest Richard Nass from Open Systems Media about his perspective from meeting leaders across the…
Shared by John Heinlein, Ph.D.
-
So pleased to be able to support our important partner NXP at their summit today in Detroit!
Shared by John Heinlein, Ph.D.
-
NXP Tech Days Detroit is finally here! 🏎🤓🎉 We are looking forward to our journey on the Road to Innovation over the next two days. NXP Tech…
Liked by John Heinlein, Ph.D.
-
Sonatus is proud to partner with NXP Semiconductors including through their CoreRide platform, which expanded this month to include new products for…
Shared by John Heinlein, Ph.D.
-
This week I was a guest on the EE Journal Fish Fry podcast, sharing insights about how Sonatus can deliver learning from software defined data centers to…
Shared by John Heinlein, Ph.D.
-
Software defined vehicles take center stage in my EE Journal Fish Fry podcast this week! John Heinlein, Ph.D. (Sonatus) joins me to chat about the…
Liked by John Heinlein, Ph.D.
-
I have been thoroughly enjoying the series of podcasts by John Heinlein, Ph.D. at Sonatus. This one with Rivian shares insight that aligns really…
Liked by John Heinlein, Ph.D.
-
Happy 1 year Anniversary Tsavorite Scalable Intelligence! Grateful for the exceptional amount of work accomplished by our team in the last 12 months…
Liked by John Heinlein, Ph.D.