
DeltaAI Unveiled: How NCSA Is Meeting the Demand for Next-Gen AI Research 

The National Center for Supercomputing Applications (NCSA) at the University of Illinois Urbana-Champaign has just launched its highly anticipated DeltaAI system.

DeltaAI is an advanced AI computing and data resource that will be a companion system to NCSA’s Delta, a 338-node, HPE Cray-based supercomputer installed in 2021. The new DeltaAI has been funded by the National Science Foundation with nearly $30 million in awards and will be accessible to researchers across the country through the NSF ACCESS program and the National Artificial Intelligence Research Resource (NAIRR) pilot. 

The system will accelerate complex AI, machine learning, and HPC applications that process terabytes of data, using advanced AI hardware, including Nvidia GH200 Grace Hopper Superchips, which pair a Grace CPU with an H100 Hopper GPU. 

This week, HPCwire caught up with NCSA Director Bill Gropp at SC24 in Atlanta to get the inside story on the new DeltaAI system, which became fully operational last Friday.

From Delta to DeltaAI: Meeting the Growing Demand for GPUs

Gropp says DeltaAI was inspired by the growing demand for GPUs that NCSA observed while conceiving and deploying the original Delta system. 

“The name Delta comes from the fact that we saw these advances in the computing architecture, particularly in GPUs and other interfaces. And some of the community had been adopting these, but not all of the community, and we really feel that that’s an important direction for people to take,” Gropp told HPCwire. 

“So, we proposed Delta to NSF and got that funded. I think it was the first, essentially, almost-all-GPU resource since Keeneland, which was a long, long time ago, and we had expected it to be a mix of modeling simulation, like molecular dynamics, fluid flows, and AI. But as we deployed [Delta], AI just took off, and there was more and more demand.” 

The original Delta system, with its Nvidia A100 GPUs and more modest amounts of GPU memory, was state of the art for its time, Gropp says, but after the emergence and proliferation of large language models and other forms of generative AI, the game changed. 

“We looked at what people needed, and we realized that there was enormous demand for GPU resources for AI research and that more GPU memory is going to be needed for these larger models,” he said. 

Scaling GPU Power to Demystify AI

The original Delta system at NCSA, the companion system to the new DeltaAI. (Source: NCSA)

The new DeltaAI system will provide approximately twice the performance of the original Delta, offering petaflops of double-precision (FP64) performance for tasks requiring high numerical accuracy, such as fluid dynamics or climate modeling, and a staggering 633 petaflops of half-precision (FP16) performance, optimized for machine learning and AI workloads.  
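
The FP16 figure matters for AI because training frameworks can run most of a model's arithmetic at half precision while keeping numerically sensitive steps in FP32. The sketch below shows that common pattern in PyTorch; the model, data, and settings are illustrative placeholders, not DeltaAI software or benchmarks.

```python
# Minimal mixed-precision training sketch in PyTorch. The model, data, and
# hyperparameters are illustrative placeholders, not DeltaAI-specific software.
import torch
import torch.nn as nn
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
use_fp16 = device == "cuda"

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=use_fp16)  # rescales gradients to avoid FP16 underflow

x = torch.randn(64, 1024, device=device)
y = torch.randint(0, 10, (64,), device=device)

for step in range(10):
    optimizer.zero_grad(set_to_none=True)
    # Run the forward pass in half precision where safe; PyTorch keeps
    # numerically sensitive ops in FP32 automatically.
    with torch.autocast(device_type=device, dtype=torch.float16, enabled=use_fp16):
        loss = F.cross_entropy(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```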

This extraordinary compute capability is driven by 320 Nvidia GH200 Grace Hopper Superchips, each with 96GB of GPU memory; at four superchips per node, that works out to 384GB of GPU memory per node. The nodes are further supported by 14 PB of storage delivering up to 1TB/sec and are interconnected with the same HPE Slingshot fabric used by Delta. 

Gropp says supplemental NSF funding for Delta and DeltaAI will allow NCSA to deploy additional nodes with more than a terabyte of GPU memory per node, which will support AI research, particularly studies dedicated to understanding training and inference with LLMs. Gropp hopes this aspect of DeltaAI’s research potential will be a boon for explainable AI, as these massive memory resources enable researchers to handle larger models, process more data simultaneously, and conduct deeper explorations into the mechanics of AI systems. 
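
To see why per-node GPU memory is the binding constraint for LLM studies, a back-of-the-envelope estimate of the memory needed just to hold a model's weights is instructive. The parameter counts and byte widths below are generic assumptions for illustration, not DeltaAI measurements.

```python
# Back-of-the-envelope estimate of GPU memory needed just to hold LLM weights.
# Parameter counts and byte widths are generic assumptions for illustration.
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """GB required for the weights alone; activations, gradients, optimizer
    state, and KV caches add substantially more during training and inference."""
    return num_params * bytes_per_param / 1e9

for params in (7e9, 70e9):                       # 7B- and 70B-parameter models
    for precision, nbytes in (("FP16", 2), ("FP32", 4)):
        gb = weight_memory_gb(params, nbytes)
        print(f"{params / 1e9:.0f}B params @ {precision}: ~{gb:.0f} GB")

# A 70B-parameter model in FP16 already needs ~140 GB for weights alone, more
# than a single 96 GB GPU, which is why multi-GPU nodes with large memory
# footprints matter for this kind of research.
```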

“There’s a tremendous amount of research we have done in explainable AI, trustworthy AI, and understanding how inference works,” Gropp explains, emphasizing key questions driving this work: “Why do the models work this way? How can you improve their quality and reliability?” 

Understanding how AI models arrive at specific conclusions is crucial for identifying biases, ensuring fairness, and improving accuracy, especially in high-stakes applications like healthcare and finance. Explainable AI has emerged as a response to “black box” models whose inner workings are difficult to inspect and that offer little transparency into how they turn inputs into outputs. 

As AI adoption accelerates, the demand for explainability and accuracy grows in parallel, prompting questions like “How can you reduce what is essentially interpolation error in these models so that people can depend on what they’re getting out of it?” Gropp said. “Seeing that demand is why we proposed this. I think that’s why NSF funded it, and it’s why we’re so excited.” 

Democratizing AI … and HPC?

DeltaAI will be made available to researchers nationwide through the NSF ACCESS program and the National Artificial Intelligence Research Resource (NAIRR) pilot initiative. This broad accessibility is designed to foster collaboration and extend the reach of DeltaAI’s advanced compute capabilities. 

“We are really looking forward to seeing more and more users taking advantage of our state-of-the-art GPUs, as well as taking advantage of the kind of support that we can offer, and the ability to work with other groups and share our resources,” Gropp said. 

Gropp says the new system will serve a dual role, advancing both AI and more conventional computational science. While DeltaAI’s nodes are optimized for AI-specific workloads and tools, they are equally accessible to HPC users, making the system a versatile platform for both communities. 

HPC workloads like molecular dynamics, fluid mechanics, and structural mechanics will benefit significantly from the system’s advanced architecture, particularly its multi-GPU nodes and unified memory. These features ease a common HPC bottleneck, memory bandwidth, by delivering the throughput that computationally intensive tasks need. 
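
A common way to reason about why bandwidth, rather than raw FLOPS, limits many HPC kernels is the roofline model: attainable performance is the lesser of peak compute and memory bandwidth times arithmetic intensity. The sketch below uses assumed peak and bandwidth figures, not DeltaAI's measured specifications.

```python
# Simple roofline estimate: attainable performance is capped either by peak
# compute or by memory bandwidth times arithmetic intensity (FLOPs per byte).
# The peak and bandwidth figures are assumed values, not DeltaAI specifications.
def attainable_tflops(peak_tflops: float, bandwidth_tb_per_s: float,
                      flops_per_byte: float) -> float:
    # 1 TB/s of traffic at N FLOPs/byte sustains N TFLOP/s.
    return min(peak_tflops, bandwidth_tb_per_s * flops_per_byte)

PEAK_TFLOPS = 30.0        # assumed per-GPU FP64 peak, TFLOP/s
BANDWIDTH_TB_S = 4.0      # assumed per-GPU memory bandwidth, TB/s

# Stencil-style kernels (~0.5 FLOPs/byte) are bandwidth-bound, while dense
# matrix multiplication (~50 FLOPs/byte) can approach the compute peak.
for name, intensity in (("stencil update", 0.5), ("dense matmul", 50.0)):
    tflops = attainable_tflops(PEAK_TFLOPS, BANDWIDTH_TB_S, intensity)
    print(f"{name}: ~{tflops:.1f} TFLOP/s attainable")
```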

Balancing AI Hype with Practical Scientific Progress

DeltaAI is integrated with the original Delta system on the same Slingshot network and shared file system, representing a forward-thinking approach to infrastructure design. This interconnected setup not only maximizes resource efficiency but also lays the groundwork for future scalability. 

Gropp says that plans are already in place to add new systems over the next year or two, reflecting a shift toward a continuous upgrade model rather than waiting for current hardware to reach obsolescence. While this approach introduces challenges in managing a more heterogeneous system, he sees the benefits of staying at the forefront of innovation as outweighing the added complexity. 

This approach to infrastructure design keeps traditional computing workloads maintained and integrated alongside AI advancements, fostering a balanced and versatile research environment at a time when the AI saturation of modern computing can lead to fatigue. 

“The hype surrounding AI can be exhausting,” Gropp notes. “We do have to be careful because there is tremendous value in what AI can do. But there are a lot of things that it can’t do, and I think it will never be able to do, at least with the technologies we have.” 

DeltaAI exemplifies NCSA’s commitment to advancing both the frontiers of scientific understanding and the practical application of AI and HPC technologies. Scientific applications such as turbulence modeling are benefiting from combining HPC and AI. 

“I think that’s an exciting example of what we really want to be able to do. Not only do we want to understand it and satisfy our curiosity about it, but we’d like to be able to take that knowledge and use that to just make life better for humanity. Being able to do that translation is important,” Gropp said.
