AI Exascale: In This Club, You Must ‘Earn the Exa’
There have been some recent press releases and headlines with the phrase “AI Exascale” in them. Other than flaunting the word exascale, or even zettascale, these stories do not provide enough information to justify using the term exascale. Those in the HPC community understand words like exascale to mean the computer in question has achieved an exaFLOPS of sustained performance. For the newbies, the prefix exa is shorthand for 10^18, and FLOPS stands for Floating Point Operations Per Second (things like addition and multiplication). An exaFLOPS is one quintillion floating point operations per second.
More specifically, per Wikipedia, “Exascale computing refers to computing systems capable of calculating at least 10^18 IEEE 754 Double Precision (64-bit) operations (multiplications and/or additions) per second (exaFLOPS).”
Measuring the FLOPS rate of a system means running a benchmark, and the standard choice is the open-source High-Performance LINPACK (HPLinpack) program. There are other programs that measure a FLOPS rate, but HPLinpack has a historical record dating back to 1993. As a matter of fact, there is a list, the TOP500, updated twice a year, that reports the performance of this benchmark using double precision FLOPS. Why double precision? Well, that is what gives the best answers for many of the numerical problems these huge systems solve.
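For reference, the number a system earns is not read off a spec sheet: HPLinpack solves one very large dense linear system and credits the run with roughly 2/3 N^3 + 2 N^2 operations divided by the wall-clock time. A minimal sketch of that bookkeeping in Python, with a made-up problem size and run time for illustration:

```python
# A minimal sketch of how an HPLinpack run turns into a FLOPS rate.
# The benchmark solves one dense N x N linear system and credits the
# run with roughly 2/3*N^3 + 2*N^2 operations, regardless of how the
# machine actually did the work.

def hpl_flops(n: int, seconds: float) -> float:
    """Return the HPLinpack performance in FLOPS for a problem of
    dimension n solved in the given wall-clock time."""
    operations = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return operations / seconds

# Hypothetical numbers, for illustration only (not a real submission).
n = 24_000_000       # matrix dimension
seconds = 7_000      # wall-clock time for the solve, in seconds
print(f"{hpl_flops(n, seconds) / 1e18:.2f} exaFLOPS")
```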
Currently, there are two systems worldwide that are in the exascale club according to the TOP500 list:
1. Frontier from DOE/SC/Oak Ridge National Laboratory reaching 1.206 ExaFLOPS with a theoretical peak of 1.715 ExaFLOPS
2. Aurora from DOE/SC/Argonne National Laboratory reaching 1.012 ExaFLOPS with a theoretical peak of 1.980 ExaFLOPS
A few things to note about these numbers. First, for each machine, the theoretical peak performance is the sum of the maximum performance rates of all the components of the system; that is, as if each component were running full-tilt with no regard for any underlying application (in the TOP500, this is reported as Rpeak). In practice, theoretical rates are never achieved because, in technical terms, “there is other stuff going on” in real applications.
Second, the maximum achieved performance (Rmax, as reported by the TOP500 list) is measured using the HPLinpack benchmark. Other benchmarks or applications may squeeze more FLOPS out of a machine, but HPLinpack is used because it has a long historical record and serves as a standard yardstick.
Finally, the operators of other large machines may have chosen not to run the benchmark or not to submit their results to the TOP500. Other exascale-class machines are under construction, so the club will expand.
In addition, the HPC community recognizes the emerging convergence of high-performance computing (HPC) and artificial intelligence (AI) workloads. Traditional “TOP500 HPC” machines focus on computing for modeling phenomena in physics, chemistry, and biology, and the mathematical models that drive these computations require, for the most part, 64-bit accuracy. On the other hand, the machine learning methods used in AI achieve their desired results using 32-bit and even lower precision floating point formats. There is a mixed-precision benchmark, HPL-MxP, that is being used to evaluate new mixed-mode (HPC and AI) systems.
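For the curious, the core idea behind HPL-MxP is to do the expensive factorization in low precision and then recover full FP64 accuracy with a cheap refinement loop. A toy sketch of that idea using numpy (the real benchmark uses a distributed LU factorization with GMRES-based refinement, not numpy):

```python
import numpy as np

# A toy version of the mixed-precision idea: factor and solve in FP32
# (the "fast" part), then use a few FP64 residual corrections to pull
# the answer back to double-precision accuracy.

rng = np.random.default_rng(0)
n = 500
A64 = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test matrix
b64 = rng.standard_normal(n)
A32 = A64.astype(np.float32)

# Low-precision solve (stands in for the low-precision LU factorization)
x = np.linalg.solve(A32, b64.astype(np.float32)).astype(np.float64)

# Iterative refinement: residual in FP64, correction in FP32
for _ in range(5):
    r = b64 - A64 @ x
    x += np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)

print("residual norm after refinement:", np.linalg.norm(b64 - A64 @ x))
```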
Running and managing these systems is not trivial. These systems are the pinnacle of high-performance computation. They are designed, built, and tested with the best technology available.
Nip This in the Bud
Given the current understanding and consensus on the meaning of exascale, one can certainly understand the surprise when recent announcements touted “exascale” and even “zettascale” (10^21 FLOPS) systems based on the Nvidia Blackwell GPU. Sure, the Blackwell GPU is a powerhouse of SIMD computation for both HPC and AI applications, but tagging it with unmeasured and contrived performance metrics is disingenuous, to say the least.
One must ask, how do these “snort your coffee” numbers arise from unbuilt systems? The process of beating the world’s fastest machine, cough, on paper, cough, is actually very simple; first, however, we need to take a short detour and talk about floating point numbers.
A Few Bits About Floating Point Format
Representing numbers in computers is a tricky task. Computers are finite and thus cannot represent all possible numbers. In scientific computing, applications are programmed using floating point (FP) numbers.
Two basic types of FP numbers are used in scientific and technical computing. These formats are distinguished by the number of bits (ones and zeros) used to represent a number:
- 32-bit single precision types with a range of around -3.40282347E+38 to -1.17549435E-38, or from 1.17549435E-38 to 3.40282347E+38 and a precision of about seven decimal digits
- 64-bit double precision types with a range of around -1.797693134862315E+308 to -2.225073858507201E-308, or from 2.225073858507201E-308 to 1.797693134862315E+308, and a precision of about fifteen decimal digits.
Values too large or too small for these ranges will cause an error.
The bit-level representations of 32-bit single-precision and 64-bit double-precision floating point numbers are shown in the figures below:
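For readers who prefer code to pictures, the same fields (1 sign, 8 exponent, and 23 mantissa bits for FP32; 1, 11, and 52 for FP64) can be pulled apart with a few lines of standard Python:

```python
import struct

# Reinterpret a float's bytes as an integer and slice out the fields.

def fp32_fields(x: float):
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF          # 1 / 8 / 23 bits

def fp64_fields(x: float):
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)  # 1 / 11 / 52 bits

print("FP32 fields of -1.5:", fp32_fields(-1.5))   # sign=1, exponent=127, mantissa=2**22
print("FP64 fields of -1.5:", fp64_fields(-1.5))   # sign=1, exponent=1023, mantissa=2**51
```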
Almost all HPC calculations use FP64 (or a combination of FP32 and FP64) because answers are more useful when they have more precision. Better precision makes for better results, but it comes at a computational cost: math with double precision requires twiddling two 64-bit numbers to get a third 64-bit number. There are tested and optimized libraries for CPUs that use single and double precision to do complex math. GPU vendors also provide single and double precision math libraries. For HPC systems, top performance is always measured in double precision.
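A quick illustration, using numpy, of what those extra bits buy:

```python
import numpy as np

# FP32 carries about 7 decimal digits, FP64 about 15 to 16, so small
# contributions can silently vanish in single precision.

print(f"{np.float32(1/3):.17f}")   # correct to about 7 digits
print(f"{np.float64(1/3):.17f}")   # correct to about 16 digits

# Add a tiny term to a big one: FP32 loses it entirely, FP64 does not.
print(np.float32(1.0) + np.float32(1e-9) == np.float32(1.0))   # True
print(np.float64(1.0) + np.float64(1e-9) == np.float64(1.0))   # False
```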
Enter GenAI. The point of GenAI and LLMs is to create (train) models and determine “weights” using massive amounts of data. Once a model is trained, these weights are used to steer the model when it is queried (inference). These weights are trained (calculated) with high precision and require large amounts of memory and computation to traverse. One trick used with LLMs is called quantization, where the precision of the weights is reduced. In many cases, the model will behave nearly the same with lower precision weights, thus reducing the computational requirements to run it. (The models that can be downloaded from Hugging Face and run on your laptop have been quantized.)
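As a rough sketch of the idea (a simple symmetric integer scheme with a single scale factor, not any particular production format), quantization with numpy looks something like this:

```python
import numpy as np

# Quantize pretend FP32 layer weights to 4-bit integer codes with one
# shared scale factor, then reconstruct and measure the damage.

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=4096).astype(np.float32)

levels = 16                                    # 4 bits -> 16 codes
scale = np.abs(weights).max() / (levels / 2 - 1)
codes = np.clip(np.round(weights / scale), -levels / 2, levels / 2 - 1)
reconstructed = (codes * scale).astype(np.float32)

print("max abs error :", np.abs(weights - reconstructed).max())
print("storage, FP32 :", weights.nbytes, "bytes")
print("storage, 4-bit:", weights.size // 2, "bytes (plus one FP32 scale)")
```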
In the quantization game, less is often better. For this reason, many new lower precision formats have been introduced for Generative AI (FP16, BFLOAT16, and FP8). The most recent, and possibly the smallest, is the FP4 format. That is correct: four bits to represent a floating point number.
The format of these low-precision numbers is far from settled. A recent post on X/Twitter commenting on yet another FP8 format noted, “Unless I’ve missed one, *excluding* block floats, this puts us at 18 total FP8 formats.”
Returning to the FP4 format: for those who did not pay attention in computer science class, four bits provide only sixteen possible bit patterns, or levels of difference, for a weight. FP4 is the smallest possible float size that follows all the IEEE principles, including normalized numbers, subnormal numbers, signed zero, signed infinity, and multiple NaN values. It is a 4-bit float with a 1-bit sign, 2-bit exponent, and 1-bit mantissa. All the possible numbers, ranging from -3 to 3, are shown in the table below. The columns have different values for the sign and mantissa bits, and the rows have different values for the exponent bits.
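The table is easy to reproduce in a few lines of Python. The sketch below decodes all sixteen bit patterns of the layout just described; note that this is the generic 4-bit minifloat layout, not necessarily any vendor’s FP4:

```python
# Decode all sixteen bit patterns of a 1-bit sign, 2-bit exponent
# (bias 1), 1-bit mantissa float, with the top exponent code reserved
# for infinities and NaN.

def decode_fp4(bits: int) -> float:
    sign = -1.0 if (bits >> 3) & 1 else 1.0
    exponent = (bits >> 1) & 0b11
    mantissa = bits & 1
    if exponent == 0b11:                 # reserved code: infinity or NaN
        return sign * float("inf") if mantissa == 0 else float("nan")
    if exponent == 0:                    # subnormal: 0.0 or 0.5
        return sign * mantissa * 0.5
    return sign * (1 + mantissa / 2) * 2.0 ** (exponent - 1)

for pattern in range(16):
    print(f"{pattern:04b} -> {decode_fp4(pattern)}")
```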
Spec Sheet Supercomputing
FP4 is a good optimization for Generative AI and can speed up inference with quantized models. The Nvidia Blackwell Architecture Technical Brief lists the Tensor Core FP4 rate at a theoretical 20 petaFLOPS for dense matrices. And here is the tip-off for the recent exascale and even zettascale announcements.
Recently, there was an announcement that a 90 exaFLOPS machine was being built using 4,608 Nvidia Blackwell GPUs. Simple math: 20 petaFLOPS x 4,608 GPUs equals 92,160 petaFLOPS, or roughly 92 exaFLOPS. And BAM! We have an exaFLOPS machine. The “AI exaFLOPS” moniker does not help because no AI was run to obtain this number.
Similarly, a zettascale machine was announced that used 131,072 Nvidia Blackwell GPUs. Again, 20 petaFLOPS x 131,072 GPUs equals 2,621,440 petaFLOPS, or roughly 2.6 zettaFLOPS. And BAM! We have a zettaFLOPS machine. Again, calling it “AI zettaFLOPS” is silly because no AI was run on this unfinished machine.
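The entire “spec sheet supercomputing” method fits in a few lines. The per-GPU figure below is the theoretical FP4 Tensor Core peak quoted above; nothing is measured or run:

```python
# Spec sheet supercomputing in four lines: multiply one GPU's
# theoretical FP4 tensor core peak by the GPU count. No benchmark,
# no application, no interconnect, no memory system.

FP4_PEAK_PER_GPU = 20e15   # 20 petaFLOPS, the theoretical figure quoted above

for name, gpus in [("announced exascale system", 4_608),
                   ("announced zettascale system", 131_072)]:
    total = FP4_PEAK_PER_GPU * gpus
    print(f"{name}: {gpus:>7,} GPUs -> {total / 1e18:,.1f} exaFLOPS of FP4, on paper")
```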
More impressive than these Blackwell machines is the worldwide smartphone supercomputer. Assuming all phones are connected over a worldwide network, their computation could, in theory, be combined. As of June 2024, there are approximately 7.2 billion (7.2 x 10^9) smartphones in the world, and the average cell phone processor can run at approximately ten teraFLOPS (tera = 10^12) of single precision. Using the “add the FLOPS” method creates a smartphone supercomputer that boasts well over 10^21 FLOPS. BAM! zettaFLOPS for everybody.
Of course, some minor details need to be worked out before we land on the TOP500 list. By the way, if your phone starts getting hot for no reason, it might be an HPLinpack run or, more likely, some crypto mining running as part of the fun new app you just downloaded.
The Requisite Car Analogy
Every good argument needs a car analogy. In the case of FP4 computing, it goes something like this. The average double precision FP64 car weighs about 4,000 pounds (1,814 kilos). It is great at navigating terrain, holds four people comfortably, and gets 30 MPG. Now, consider the FP4 car, which has been stripped down to 250 pounds (113 kilos) and gets an astounding 480 MPG.
Great news. You have the best gas mileage ever! Except, you don’t mention a few features of your fantastic FP4 car.
First, the car has been stripped of everything except a small engine and maybe a seat. What’s more, the wheels are 16-sided (2^4) and provide a bumpy ride compared to the smooth FP64 sedan ride on wheels with somewhere around 2^64 sides. There may be places where your FP4 car works just fine, like cruising down Inference Lane, but it will not do well heading down the FP64 HPC highway. Different strokes for different folks.
Going Forward
The spec sheet exascale numbers are often reported as “AI exaFLOPS,” which does not earn the use of exascale. To get into the exascale club, you need to supply the bouncers with the following information:
- The hardware and application that was used to measure the FLOPS rate
- The precision of the floating point used in the measurement (e.g., FP64, FP32, FP6, FP4)
It is good form to refer to non-computed numbers (spec sheet summations) as “theoretical peak” for a specific precision, but this will not get you into the club. Fuzzing things up with “AI FLOPS” will not help either. The Nvidia Blackwell is a blazingly fast GPU, and providing actual measured numbers with the simple details mentioned above will easily get you into the club.