
GPUs, wake up!

2024-09-24


The popularity of generative AI has allowed NVIDIA to work a miracle with the GPU.

According to a study by semiconductor analyst firm TechInsights, Nvidia's data center GPU shipments exploded in 2023, totaling about 3.76 million units. That is an increase of more than 1 million units over 2022, when Nvidia shipped about 2.64 million data center GPUs.

According to the quarterly GPU shipment report released by Jon Peddie Research in September, GPU shipments grew 1.8% from the first quarter of 2024 to the second, and overall shipments rose 16% year on year.

However, all signs indicate that the GPU miracle is coming to an end.


01 The basic principles of the GPU

Architecturally, a single GPU consists of multiple processor clusters (PCs), each of which contains multiple streaming multiprocessors (SMs). Each SM has its own level 1 (L1) instruction cache that interacts closely with its cores. An SM typically works out of its L1 cache and a shared level 2 (L2) cache before reaching out to high-bandwidth dynamic random access memory (DRAM). The architecture is designed to tolerate memory latency by emphasizing computation: as long as the GPU has enough work to keep it busy, any potential memory access latency is effectively masked.

The SM is the workhorse of the GPU, responsible for executing parallel tasks, managing memory access, and performing a variety of calculations, from basic arithmetic and logical operations to complex matrix operations and specialized graphics or scientific calculations. All of these are optimized for parallel execution to maximize the efficiency and performance of the GPU.
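To make this concrete, here is a minimal CUDA sketch of how work is expressed for the SMs (the kernel name, sizes, and values are illustrative, not from the original article): the problem is split into thread blocks, the hardware schedules those blocks onto SMs, and each thread handles one data element. Because many blocks can be resident at once, an SM can switch to ready threads while others wait on DRAM, which is exactly the latency hiding described above.

#include <cstdio>
#include <cuda_runtime.h>

// Each thread handles one element; thread blocks are scheduled onto SMs,
// and having many resident blocks lets the GPU hide DRAM latency with math.
__global__ void scale_add(const float* x, const float* y, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global element index
    if (i < n) {
        out[i] = 2.0f * x[i] + y[i];
    }
}

int main() {
    const int n = 1 << 20;
    float *x, *y, *out;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int threadsPerBlock = 256;                        // threads in one block
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale_add<<<blocks, threadsPerBlock>>>(x, y, out, n);
    cudaDeviceSynchronize();

    printf("out[0] = %f\n", out[0]);                  // expect 4.0
    cudaFree(x); cudaFree(y); cudaFree(out);
    return 0;
}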

FMA (Fused Multiply-Add) is the most common operation in modern neural networks and is the building block of fully connected layers and convolutional layers, both of which can be viewed as a collection of vector dot products. This operation combines multiplication and addition into a single step, providing computational efficiency and numerical accuracy.

Written out, the operation is c = a × b + d: a and b are multiplied, and the product is added to d to get c. Multiply-add operations are used heavily in matrix multiplication, where each element of the result matrix is the sum of many multiply-add operations.
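As a rough illustration (a hypothetical kernel, not taken from the article), this is how a per-element FMA might look in CUDA; fmaf() performs the multiply and the add as one fused, singly rounded operation:

#include <cuda_runtime.h>

// Each thread computes one fused multiply-add: c[i] = a[i] * b[i] + d[i].
// fmaf() does the multiply and add as a single rounded operation, which is
// both faster and slightly more accurate than separate multiply then add.
__global__ void fma_kernel(const float* a, const float* b, const float* d,
                           float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = fmaf(a[i], b[i], d[i]);
    }
}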

Consider two matrices A and B, where A is of size m×n and B is of size n×p. The result C will be a matrix of size m×p, where each element c_ij is calculated as follows:

c_ij = a_i1·b_1j + a_i2·b_2j + … + a_in·b_nj (that is, the sum of a_ik·b_kj over k = 1 … n).

Each element of the resulting matrix C is the sum of the products of the corresponding elements in a row of A and a column of B. Since these computations are independent of each other, they can be performed in parallel.

Parallelizing matrix multiplication efficiently is not trivial: performance depends heavily on the specific hardware used and the size of the problem being solved. But the work itself consists of a large number of independent element-wise operations, and GPUs are designed to handle exactly this kind of parallel workload, with thousands of cores performing these operations simultaneously.

GPUs are often viewed as SIMD (Single Instruction Multiple Data) parallel processing units that can execute the same instructions on large amounts of data simultaneously. Due to the parallel SIMD nature of GPUs, matrix multiplication speeds can be significantly increased, and this acceleration is critical for applications that require real-time or near-real-time processing.
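A minimal sketch of this idea in CUDA, assuming row-major storage and one thread per output element (the kernel and launch parameters below are illustrative, not a tuned implementation):

#include <cuda_runtime.h>

// Naive parallel matrix multiply: C = A * B, with A (m x n), B (n x p).
// One thread computes one element c_ij, i.e. one row-times-column dot product,
// so all m*p output elements are computed independently and in parallel.
__global__ void matmul(const float* A, const float* B, float* C,
                       int m, int n, int p) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // i
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // j
    if (row < m && col < p) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k) {
            sum = fmaf(A[row * n + k], B[k * p + col], sum);  // multiply-add
        }
        C[row * p + col] = sum;
    }
}

// Hypothetical launch: a 16x16 block of threads per tile of C.
// dim3 block(16, 16);
// dim3 grid((p + 15) / 16, (m + 15) / 16);
// matmul<<<grid, block>>>(A, B, C, m, n, p);

Production libraries go much further (tiling into shared memory, reusing data across threads), but the core of the speedup is the same: every output element is an independent dot product that the SIMD hardware can process in bulk.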

02 From 3D Rendering to HPC

Because of these characteristics, GPUs were originally created to power 3D graphics rendering. Over time, they became more versatile and programmable. They revolutionized gaming by adding capabilities for better visuals and realistic scenes with advanced lighting and shading.

Let’s start with a simple processor task — displaying an image on the screen (shown below).

Although it seems simple, this task involves several steps: geometry transformation, rasterization, fragment processing, frame buffer manipulation, and output merging. These steps make up the GPU pipeline for rendering 3D graphics.

In the GPU pipeline, the image is converted to a polygonal mesh representation, as shown below:

A single teapot image is converted into a mesh structure consisting of hundreds of triangles, each processed individually in the same way.

What does the GPU offer that the CPU cannot for this "simple" task? A high-end server CPU may have up to 128 cores, so it could process 128 of the teapot's triangles at a time. The user would see a partially rendered teapot that fills in slowly as each core finishes one triangle and picks up the next.

As you can see from this example, the GPU performs vector-based math calculations and matrix multiplications to render the image. Rendering a simple teapot requires about 192 bytes, while a complex GTA scene with 100 objects requires about 10KB.
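As a simplified illustration of the geometry-transformation stage mentioned above (a conceptual compute kernel, not how any particular graphics driver implements it), every vertex of the mesh can be multiplied by the same 4×4 transform matrix, one thread per vertex:

#include <cuda_runtime.h>

// Geometry transformation sketched as a compute kernel: each vertex of the
// mesh is multiplied by one shared 4x4 transform matrix, one thread per vertex.
// "verts" holds homogeneous (x, y, z, w) coordinates; "M" is stored row-major.
__global__ void transform_vertices(const float* M, const float4* verts,
                                   float4* out, int numVerts) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v < numVerts) {
        float4 p = verts[v];
        out[v] = make_float4(
            M[0]  * p.x + M[1]  * p.y + M[2]  * p.z + M[3]  * p.w,
            M[4]  * p.x + M[5]  * p.y + M[6]  * p.z + M[7]  * p.w,
            M[8]  * p.x + M[9]  * p.y + M[10] * p.z + M[11] * p.w,
            M[12] * p.x + M[13] * p.y + M[14] * p.z + M[15] * p.w);
    }
}

The same pattern, small vector and matrix operations repeated across thousands of independent items, is what later made GPUs attractive far beyond graphics.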

But it didn't stop there.

Because the built-in parallelism and high throughput of GPUs accelerate computation, researchers began using GPUs for tasks such as protein folding simulations and physics calculations. These early results showed that GPUs could accelerate computationally intensive tasks beyond graphics rendering, such as the matrix and vector operations used in neural networks. Neural networks can be implemented without GPUs, but their scale is limited by the available computing power. The advent of GPUs provided the resources needed to train deep and complex neural networks effectively, driving the rapid development and widespread adoption of deep learning.

In order to enable GPUs to efficiently handle a variety of tasks, Nvidia developed different types of GPU cores, specialized for various functions:

  • CUDA Cores: For general-purpose parallel processing, including rendering graphics, scientific computing, and basic machine learning tasks. 

  • Tensor Cores: Designed for deep learning and AI, they accelerate tensor operations such as matrix multiplication, which are critical for training and inference of neural networks. 

  • RT Cores: Focused on real-time ray tracing, providing realistic lighting, shadows, and reflections in graphics.

Among them, Tensor Cores are dedicated hardware units designed to accelerate tensor operations, which are a generalized form of matrix multiplication, especially in mixed-precision calculations common in AI. Compared to CPUs, GPUs are not only faster but also more energy efficient in matrix multiplication tasks. GPUs can perform more calculations per watt of power consumed. This efficiency is critical in data centers and cloud environments where energy consumption is a significant issue. By combining multiplication and addition into a single optimized operation, GPUs can provide significant performance and accuracy advantages.
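To illustrate the mixed-precision idea in plain CUDA (a conceptual sketch only; real Tensor Cores operate on whole matrix tiles through dedicated instructions): inputs are kept in FP16 to save memory bandwidth, while the multiply-add chain accumulates in FP32 to preserve accuracy.

#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Mixed-precision dot product: half-precision inputs, single-precision
// accumulation. "result" must be zero-initialized before the kernel launch.
__global__ void mixed_precision_dot(const __half* a, const __half* b,
                                    float* result, int n) {
    float acc = 0.0f;                          // FP32 accumulator
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int k = i; k < n; k += stride) {      // grid-stride loop over elements
        acc = fmaf(__half2float(a[k]), __half2float(b[k]), acc);
    }
    atomicAdd(result, acc);                    // combine per-thread partial sums
}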

We have now identified the key characteristics of GPUs: massive parallelism and high throughput, specialized hardware, high memory bandwidth, energy efficiency, real-time processing, and acceleration. By leveraging these capabilities, especially for matrix math, GPUs deliver unmatched performance and efficiency for HPC and AI tasks, making them the first choice for researchers, developers, and organizations working on advanced technologies and complex computing challenges. They are used in applications such as molecular dynamics simulation, weather and climate modeling, seismic data processing, training deep neural networks, real-time object detection, and natural language processing (NLP). This in turn has fueled the prosperity of Nvidia, the largest player in GPUs.

But recent signals indicate that the GPU myth may be coming to an end.


03 Don't just focus on the GPU

In a recent interview with the Wall Street Journal, AMD CEO Lisa Su said that as the industry focuses on more standardized model design, there will be opportunities to build more customized chips that are less demanding in terms of programmability and flexibility. Such chips will be more energy-efficient, smaller, and less expensive.

"Currently, GPUs are the preferred architecture for large language models because they are very efficient in parallel processing but lack programmability," Su said. "Will it still be the preferred architecture in more than five years? I think things will change."

Su predicts that GPUs will not lose their dominance within the next five to seven years, but that new forces beyond the GPU will emerge.

The Wall Street Journal further pointed out that large cloud computing providers such as Amazon and Google have developed their own custom AI chips for internal use, such as Amazon's AWS Trainium and AWS Inferentia, and Google's tensor processing unit (TPU). These chips perform only specific functions: for example, Trainium only trains models, while Inferentia only performs inference. Inference requires less computing power than training; it is the stage where a trained model processes new information and produces a response.

Broadcom CEO Hock Tan said in an internal speech this year that the company's custom chip division, which mainly helps Google make AI chips, has a quarterly operating profit of more than $1 billion.

Shane Rau, vice president of computing semiconductor research at market intelligence firm International Data Corp., said custom chips have great advantages in energy saving and cost, and are much smaller because they can be hard-wired to a certain extent: they can perform a specific function, run a specific type of model, or even run a specific model.

But Rau said the market for commercial sales of these highly customized, special-purpose chips is still immature, a reflection of how rapidly AI models are still changing.

Highly customized chips also have the problem of insufficient flexibility and interoperability, said Chirag Dekate, vice president analyst at research firm Gartner. Such chips are very difficult to program, usually requiring a custom software stack, and it is difficult to make them work with other types of chips.

But many chip offerings today fall somewhere in between: some GPUs can be customized to a degree, and some specialized chips retain a degree of programmability. That presents an opportunity for chipmakers, even before generative AI becomes more standardized, but it is also a challenge.

"That's a big problem we've been struggling with," said Gavin Uberti, co-founder and CEO of Etched. The startup's chips perform inference exclusively on the Transformer architecture, which Google developed in 2017 and has since become the standard for large language models. While customizable to a degree, the chips must also be flexible enough to accommodate smaller jobs that vary by model.

"Right now, the models are stable enough that I think betting on Transformer makes sense, but I don't think betting on Llama 3.1 405B makes sense at this point," said Uberti, referring to Meta Platforms' AI models. "Transformers will still be there, but they'll be bigger and more evolved." He added, "You have to be careful not to get too specialized."

There's no one-size-fits-all solution in computing, either, said AMD CEO Lisa Su. Future AI models will use a combination of different types of chips, including today's dominant GPUs and more specialized chips still to be developed, to perform various functions.

"There will be other architectures," she said. "It's just that it will depend on how the model evolves.”


As IEEE reports, it's clear that Nvidia does not lack competitors. It is equally clear that no competitor will be able to challenge Nvidia, let alone beat it, in the next few years. Everyone interviewed for that report agrees that Nvidia's current dominance is unparalleled, but that doesn't mean competitors will always be shut out.

"Listen, the market needs choice," said analyst Moorhead. "If by 2026, I can't imagine AMD's market share being less than 10% or 20%, and the same goes for Intel. Generally, the market likes three companies, and we have three reasonable competitors." Kimball, another analyst, said that at the same time, hyperscale companies may challenge Nvidia as they move more AI services to internal hardware.

And then there are the uncertainties. Cerebras, SambaNova, and Groq are among the many startups that hope to eat into Nvidia's market share with novel solutions. In addition, there are dozens of other companies joining in, including d-Matrix, Untether, Tenstorrent, and Etched, all of which are pinning their hopes on new chip architectures optimized for generative AI.

Many of these startups may fail, but perhaps the next Nvidia will emerge from the survivors.


Reference link

 https://www.hpcwire.com/2024/06/10/nvidia-shipped-3-76-million-data-center-gpus-in-2023-according-to-study/


