The data centre industry is moving so fast that traditional sources of expertise, such as consultants, are in some ways lagging behind, according to Vertiv senior director of sales LuLu Shiraz. In her keynote at the Sydney Cloud & Datacenter Convention 2024, she pointed out that adding even one GPU-based server is now enough to completely change the infrastructure design requirements of a data centre.
“Design and deployment are no longer straightforward,” she said. “They are intensive and progressive. In the past, IT and facilities teams operated separately. IT handled the technology and facilities managed the power and cooling. That’s not going to work anymore.”
“Now with high-performance computing (HPC), power, cooling and IT are linked with this web of very critical systems, which must include the chip and the GPU,” she said. “The integration of components like cold plates, manifolds, chillers and coolant distribution units (CDUs) highlights the gap between IT and facilities management. Updating infrastructure requires seamless coordination and expertise to ensure everything works together efficiently.”
She highlighted the growing sensitivity of HPC workloads to even minor disruptions in cooling. Shiraz explained that modern HPC servers can no longer tolerate more than one second of cooling system failure without risking shutdowns, a stark contrast to the previous 60-second window.
“Waiting for a generator to start is no longer an option,” she said. This heightened sensitivity has made it critical to ensure robust and reliable uninterruptible power supplies (UPS) for CDUs and other infrastructure components.
The arrival of GPUs
She said there was an increasingly complex interplay between power, cooling and IT, particularly as GPUs become the driving force behind accelerated computing. Shiraz highlighted the immense power demands associated with GPUs, which she referred to as “the heartbeat of accelerated computing.”
Nvidia’s Blackwell B200 chips, for example, can draw up to 1 kilowatt of power per GPU, marking a significant increase from previous generations like the H100, which consumed 400 watts per GPU.
“This is not stopping there,” she cautioned, “as the Blackwell GB200 super chip will push these limits even further, potentially doubling the load compared to typical CPUs.”
Shiraz pointed out the escalating demands HPC places on infrastructure. “Racks that once managed moderate loads are now pushing up to 50 kilowatts, and data centres are evolving with configurations reaching 100 kilowatts,” she said.
The next wave of HPC infrastructure will need to cope with densities as high as 130-140 kilowatts, and in some cases even 220 kilowatts. “The game has changed,” she added, emphasising that traditional forecasts for power and cooling have become obsolete due to these increased demands.
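As a rough illustration of what these densities mean, the sketch below estimates how many GPUs fit within a rack power budget, using the per-GPU figures quoted above (400 W for the H100, 1 kW for the B200). The function name `gpus_per_rack` and the 30% share of rack power reserved for CPUs, networking and fans are assumptions made for illustration, not figures from the keynote.

```python
# Per-GPU power draw as quoted in this article, in watts.
GPU_WATTS = {"H100": 400, "B200": 1000}


def gpus_per_rack(rack_kw: int, gpu_watts: int, overhead_percent: int = 30) -> int:
    """Estimate how many GPUs fit in a rack power budget.

    overhead_percent is a hypothetical share of rack power set aside for
    everything that is not a GPU (CPUs, memory, networking, fans).
    Integer arithmetic keeps the result exact.
    """
    if not 0 <= overhead_percent < 100:
        raise ValueError("overhead_percent must be in the range 0-99")
    usable_watts = rack_kw * 1000 * (100 - overhead_percent) // 100
    return usable_watts // gpu_watts


if __name__ == "__main__":
    # Rack densities mentioned in the keynote, in kilowatts.
    for rack_kw in (50, 100, 140, 220):
        for name, watts in GPU_WATTS.items():
            print(f"{rack_kw} kW rack: about {gpus_per_rack(rack_kw, watts)} x {name}")
```

Under these assumptions, the same 50 kW rack that could power dozens of 400 W GPUs holds far fewer 1 kW parts, which is one way to see why rack budgets are climbing.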
She also highlighted the dynamic nature of AI and HPC workloads. “HPC workloads have dynamic load profiles that differ from static ones. AI applications can start at a 20% load and spike to 80% or even higher in a matter of seconds,” she explained. These rapid and unpredictable fluctuations can put extreme pressure on power systems, often leading to instability in generators and straining battery reserves.
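To put a number on such a swing, the following sketch computes the size of a load step and the average ramp rate the power system would have to follow. The 20% and 80% load levels come from the quote above; the 100 kW rack and the two-second ramp are hypothetical inputs chosen only to make the arithmetic concrete, and the function names are illustrative.

```python
def load_step_kw(rack_kw: float, start_percent: int, end_percent: int) -> float:
    """Change in power draw when the load moves between two utilisation levels."""
    return rack_kw * (end_percent - start_percent) / 100


def ramp_rate_kw_per_s(step_kw: float, ramp_seconds: float) -> float:
    """Average rate of change the power system must follow during the step."""
    if ramp_seconds <= 0:
        raise ValueError("ramp_seconds must be positive")
    return step_kw / ramp_seconds


if __name__ == "__main__":
    # Hypothetical 100 kW rack jumping from 20% to 80% load in 2 seconds.
    step = load_step_kw(100, 20, 80)
    rate = ramp_rate_kw_per_s(step, 2)
    print(f"Load step: {step} kW, average ramp: {rate} kW/s")
```

Multiplied across a hall of such racks, a step of this size arrives faster than a generator can respond, so it has to be absorbed by UPS and battery capacity in the meantime.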
Liquid integration
Liquid cooling is now a critical component for managing the thermal challenges posed by AI and HPC. Shiraz noted that liquid-fed cold plates, which directly cool CPUs and GPUs by capturing 70-80% of the heat, are becoming the standard in high-density environments. “However, air cooling isn’t going away,” she said, pointing out that a hybrid approach, combining both liquid and air cooling, is often the best solution. “Between 15 and 20 kilowatts per rack is the tipping point where a hybrid system becomes necessary,” she said.
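The case for a hybrid approach follows from simple arithmetic, sketched below. If cold plates capture 70-80% of the heat, as quoted above, the remaining 20-30% still has to be removed by air. The function name `heat_split_kw` and the 100 kW rack are assumptions for illustration, and the sketch treats all electrical power as heat, which is a simplification.

```python
def heat_split_kw(rack_kw: float, liquid_capture_percent: int) -> tuple[float, float]:
    """Split a rack's heat load into (liquid-cooled kW, air-cooled kW).

    Assumes all electrical power drawn by the rack ends up as heat.
    """
    if not 0 <= liquid_capture_percent <= 100:
        raise ValueError("liquid_capture_percent must be in the range 0-100")
    liquid_kw = rack_kw * liquid_capture_percent / 100
    return liquid_kw, rack_kw - liquid_kw


if __name__ == "__main__":
    # Hypothetical 100 kW rack at both ends of the quoted 70-80% capture range.
    for capture in (70, 80):
        liquid_kw, air_kw = heat_split_kw(100, capture)
        print(f"{capture}% capture: {liquid_kw} kW to liquid, {air_kw} kW to air")
```

On these numbers, the air-side load of a 100 kW liquid-cooled rack (20-30 kW) is by itself comparable to a fully loaded air-cooled rack at the 15-20 kW tipping point, which is consistent with the remark that air cooling is not going away.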
Shiraz was quick to acknowledge that, despite the rise of liquid cooling, other technologies like immersion cooling are still on the horizon. “Right now, liquid, direct-to-chip cooling is the solution being used in collaboration with Nvidia, AMD and Intel. That’s not to say immersion cooling is out, but we’re seeing liquid cooling take centre stage at this time,” she said.
Nvidia’s approach
On the theme of integrated approaches to data centre design, Nvidia earlier this year collaborated with Vertiv and several industry leaders from universities, startups and other vendors to design a cooling system that helped Nvidia and its partners secure a US$5 million grant from the United States Department of Energy’s COOLERCHIPS program. The proposed system will bring together direct-to-chip liquid cooling and immersion cooling.
First, chips will be cooled with cold plates whose coolant evaporates “like sweat on the foreheads of hard-working processors”, then cools to condense and re-form as liquid. Second, entire servers, with their lower power components, will be encased in hermetically sealed containers and immersed in coolant. They will use a liquid common in refrigerators and car air conditioners, but not yet used in data centres.
Nvidia and its partners believe this approach has the capacity to cool a containerised data centre with racks 25 times as dense as today’s server racks, operating in outside temperatures of up to 40°C while running up to 20% more efficiently. This is crucial given that data centres are expected to consume as much as eight percent of the world’s electricity by 2030 if no action is taken, according to a Buffalo University study.
Shiraz concluded that, for those managing or designing data centres, it is no longer enough to focus on isolated components like power or cooling. “The future demands a holistic, forward-thinking approach to ensure data centres can keep pace with the ever-increasing demands of AI and HPC,” she said.