Covering Scientific & Technical AI | Saturday, November 2, 2024

Nvidia’s Blackwell Platform Powers AI Progress in Open Compute Project 

Nvidia announced it has contributed foundational elements of its Blackwell accelerated computing platform design to the Open Compute Project (OCP).

Shared at the OCP Global Summit in San Jose today, Nvidia said that key portions of the design of its full-rack Blackwell system, the GB200 NVL72, will be contributed to the OCP community. This design information includes the system’s rack architecture, compute and switch tray mechanicals, liquid-cooling and thermal environment specs, and the NVLink cable cartridge volumetrics.

Designed to train LLMs of up to 27 trillion parameters, Nvidia’s GB200 NVL72 rack consists of 36 GB200 Grace Blackwell superchips, which together comprise 36 Grace CPUs and 72 Blackwell GPUs. Nvidia says the rack delivers 720 petaflops of training performance and 1.4 exaflops of inference performance. The system is liquid cooled, and its NVLink interconnect technology, with a bandwidth of 1.8 TB/s, allows the rack to act as a single massive GPU.
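The rack-level figures above hang together arithmetically. A quick back-of-envelope sketch, using only the numbers quoted in this article (the one-CPU/two-GPU superchip composition is Nvidia's published design):

```python
# Sanity check of the GB200 NVL72 figures quoted above.
superchips = 36            # GB200 Grace Blackwell superchips per rack
cpus_per_superchip = 1     # one Grace CPU per superchip
gpus_per_superchip = 2     # two Blackwell GPUs per superchip

total_cpus = superchips * cpus_per_superchip   # 36 Grace CPUs
total_gpus = superchips * gpus_per_superchip   # 72 Blackwell GPUs

train_pflops = 720                             # quoted rack training performance
per_gpu_pflops = train_pflops / total_gpus     # implied petaflops per GPU

print(total_cpus, total_gpus, per_gpu_pflops)  # 36 72 10.0
```

The implied 10 petaflops of training performance per GPU is simply the quoted rack number divided evenly across the 72 GPUs; actual per-GPU throughput depends on precision mode and workload.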

The namesake of the Blackwell rack is the Blackwell GPU, Nvidia’s newest chip, which packs 208 billion transistors fabricated on TSMC’s 4nm-class process. A single Blackwell GPU can train a 1-trillion-parameter model, according to Nvidia, and is up to 30x faster than its predecessor, the Hopper H100. The chips also require less energy than the H100, Nvidia claims: a single training run of a 1.8-trillion-parameter model would previously have taken 8,000 Hopper GPUs and 15 megawatts of power, whereas it now takes only 2,000 Blackwell GPUs consuming 4 megawatts.
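Nvidia's Hopper-versus-Blackwell comparison can be unpacked with a little arithmetic on the figures quoted above (all inputs are from the article; the derived per-GPU numbers are simple divisions, not Nvidia-published specs):

```python
# Nvidia's quoted comparison for one training run of a
# 1.8-trillion-parameter model.
hopper_gpus, hopper_mw = 8000, 15      # Hopper: GPU count, total megawatts
blackwell_gpus, blackwell_mw = 2000, 4 # Blackwell: GPU count, total megawatts

gpu_reduction = hopper_gpus / blackwell_gpus   # 4x fewer GPUs
power_reduction = hopper_mw / blackwell_mw     # 3.75x less total power

# Implied average power draw per GPU, in kilowatts
hopper_kw = hopper_mw * 1000 / hopper_gpus         # 1.875 kW per Hopper GPU
blackwell_kw = blackwell_mw * 1000 / blackwell_gpus  # 2.0 kW per Blackwell GPU

print(gpu_reduction, power_reduction)  # 4.0 3.75
```

Note the efficiency gain comes from needing far fewer GPUs, not from each GPU drawing less power: the implied per-GPU draw actually rises slightly, but the total power for the job drops by nearly 4x.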

The GB200 NVL72 rack. Source: Nvidia

Other notable Nvidia contributions to OCP include the NVIDIA HGX H100 baseboard, which has become the de facto baseboard standard for AI servers, and the NVIDIA ConnectX-7 adapter, which now serves as the foundation of the OCP Network Interface Card (NIC) 3.0 design. The company also announced it would broaden NVIDIA Spectrum-X support for OCP standards.

“Building on a decade of collaboration with OCP, Nvidia is working alongside industry leaders to shape specifications and designs that can be widely adopted across the entire data center,” said Nvidia founder and CEO Jensen Huang. “By advancing open standards, we’re helping organizations worldwide take advantage of the full potential of accelerated computing and create the AI factories of the future.”

Nvidia's contribution to OCP reflects an important trend toward open hardware for AI and HPC. By sharing elements of its Blackwell platform, Nvidia is broadening access to its technology, which could improve interoperability with other open systems and help data centers gain efficiency by making its energy-efficient, AI-optimized architecture available in open designs. The move also supports the growth of the AI and HPC ecosystem, giving developers and organizations more options for applying advanced computing to AI in scientific research and large-scale workloads.

A compute tray in the GB200 NVL72 rack. Source: Nvidia

As AI models grow in size and complexity, particularly with the advent of multi-trillion-parameter models, the need for more powerful and scalable computing infrastructure becomes critical. Democratizing access to AI is not just about making software more available but also about ensuring that the hardware required to train and deploy these models is within reach for a wider range of organizations. Scientific AI, in particular, demands robust computing infrastructure capable of handling vast datasets and sophisticated models, pushing the limits of traditional architectures. Contributions like Nvidia’s to OCP help address this gap by fostering open, scalable solutions that make advanced hardware more accessible, enabling more institutions to participate in AI-driven scientific discovery.

At the OCP Global Summit, Nvidia also announced that Blackwell is now in full production. A recent example of the GB200 NVL72 platform's promise is in Taiwan, where Nvidia and Taiwanese electronics manufacturer Foxconn are building what they call the island’s largest supercomputing project, the Hon Hai Kaohsiung Super Computing Center. The project will be built around the Blackwell architecture and will include 64 GB200 NVL72 racks, for a total of 4,608 Tensor Core GPUs.
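The 4,608-GPU figure follows directly from the rack specification described earlier, as a quick check confirms:

```python
racks = 64          # GB200 NVL72 racks in the Kaohsiung deployment
gpus_per_rack = 72  # Blackwell GPUs per NVL72 rack

total_gpus = racks * gpus_per_rack
print(total_gpus)   # 4608
```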

Foxconn plans to use the supercomputer, housed in Kaohsiung, Taiwan, to power breakthroughs in cancer research, LLM development, and smart city innovations, according to Nvidia, with full deployment expected by 2026.

To learn more about NVIDIA’s GB200 NVL72 OCP contribution, see the OCP specification documentation.

AIwire