
Nvidia's Invisible Moat: What Is It?

2024-06-05


GPU? AI? Hardware/Software?


Today it is hard to sum up this trillion-dollar chip leader in just a word or two. Yet Nvidia's moat has not changed for many years, and it rests on three pillars. Two of them are familiar to everyone: GPU hardware and the CUDA software platform. The hardware stacks up computing power; the software builds the ecosystem. This combination of hardware and software has left Nvidia without a serious rival in the tide of artificial intelligence, and it has likewise put the company's market capitalization in a class of its own among chip companies.


However, Nvidia has another, less visible moat that the public rarely talks about: the network.


01 Demand: Having It Both Ways

As so often these days, the story starts with the explosion of artificial intelligence and large models. A large model is "large" because it has a huge number of parameters, is trained on a huge amount of data, and requires a training system of huge scale. That makes it a beast for computing power, and the cost of training one is astronomical.

Not long ago, Professor Fei-Fei Li's team at Stanford University released its annual AI Index report, which estimated that training GPT-4 cost more than US$78 million, while training Google's Gemini Ultra cost an astonishing US$191 million.


Most of these astronomical costs go to GPUs.

Other published figures tell the same story: Meta consumed about 1 million GPU-hours to train the 65-billion-parameter LLaMA model, and Google consumed about 2.56 × 10²⁴ FLOPs of compute to train the 540-billion-parameter PaLM model.
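
To put these figures in perspective, here is a rough back-of-the-envelope calculation. The rental price, sustained throughput, and cluster size are illustrative assumptions, not numbers from the report:

```python
# Rough back-of-the-envelope: turning the published training figures into
# time and cost. The rental price, sustained throughput, and cluster size
# below are illustrative assumptions, NOT numbers from the report.

GPU_HOURS_LLAMA_65B = 1_000_000   # Meta's reported figure for LLaMA 65B
PALM_TOTAL_FLOPS = 2.56e24        # Google's reported figure for PaLM 540B

ASSUMED_PRICE_PER_GPU_HOUR = 2.0  # USD per GPU-hour (assumed cloud rate)
ASSUMED_SUSTAINED_FLOPS = 150e12  # ~150 TFLOPS sustained per chip (assumed)
ASSUMED_CLUSTER_SIZE = 6000       # number of accelerators (assumed)

# Cost of the LLaMA run at the assumed rental rate.
llama_cost = GPU_HOURS_LLAMA_65B * ASSUMED_PRICE_PER_GPU_HOUR
print(f"LLaMA 65B: ~${llama_cost / 1e6:.0f}M at ${ASSUMED_PRICE_PER_GPU_HOUR}/GPU-hour")

# Wall-clock time to burn through PaLM's compute budget on the assumed cluster.
seconds = PALM_TOTAL_FLOPS / (ASSUMED_SUSTAINED_FLOPS * ASSUMED_CLUSTER_SIZE)
print(f"PaLM 540B: ~{seconds / 86400:.0f} days on {ASSUMED_CLUSTER_SIZE} accelerators")
```

Even under these generous assumptions, a single run occupies thousands of chips for weeks, which is exactly why no single chip, however fast, is enough.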


No matter how powerful a single chip is, it can never meet the needs of large-model training on its own, so interconnecting many chips has become a key technology of the large-model era. Some industry figures have even argued that with slightly weaker GPUs, as long as the multi-chip interconnect is done well, overall system performance need not drop, because the system's bottleneck has shifted from computing the data to moving it.
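
A toy calculation makes that shift concrete. Under the assumed (hypothetical) per-GPU throughput and link bandwidth below, the time spent synchronizing gradients can rival or exceed the time spent computing; at that point a faster GPU no longer speeds up training, only a faster network does:

```python
# Toy comparison of per-step compute time vs. gradient-sync time. All numbers
# are illustrative assumptions; the point is the ratio, not the exact values.

PARAMS = 65e9              # model parameters (LLaMA-65B scale)
BYTES_PER_GRAD = 2         # fp16/bf16 gradient
STEP_FLOPS_PER_GPU = 2e15  # assumed compute per GPU per training step
GPU_FLOPS = 150e12         # assumed sustained throughput per GPU
LINK_GBPS = 100            # assumed per-GPU network bandwidth, Gb/s
NUM_GPUS = 1024

# A ring all-reduce moves roughly 2 * (N - 1) / N * payload bytes per GPU.
payload_bytes = PARAMS * BYTES_PER_GRAD
allreduce_bytes = 2 * (NUM_GPUS - 1) / NUM_GPUS * payload_bytes

t_compute = STEP_FLOPS_PER_GPU / GPU_FLOPS        # seconds doing math
t_comm = allreduce_bytes * 8 / (LINK_GBPS * 1e9)  # seconds on the wire

print(f"compute: {t_compute:.1f}s  gradient sync: {t_comm:.1f}s per step")
# When t_comm rivals t_compute, faster GPUs alone no longer help --
# only a faster interconnect does.
```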

Don't forget that the network is Nvidia's "invisible" moat.


To cut costs and let more people train large models, or deploy trained models into real applications, the technology giants have tried many tricks. One of them is the concept of the AI data center. Unlike a traditional data center, an AI data center is designed around AI workloads from day one and, as the name suggests, is dedicated to AI services.


Look closely at the AI data center, though, and it splits into two main scenarios. The first is the "AI factory" that Jensen Huang has mentioned many times. Picture a real factory whose thousands of "workers" are the most powerful GPUs available, and whose products are trained large models. Broadly speaking, the AI factory model targets ultra-large-scale, heavy workloads. Its advantage is that it spares technology companies the tedious work of building AI infrastructure from scratch: training is outsourced to the factory.

In an AI factory, the ultimate goal is maximum performance, so NVIDIA interconnects the GPUs with ultra-high-speed, ultra-low-latency network technologies such as NVLink and InfiniBand. The most advanced NVLink configurations can connect anywhere from 8 GPUs up to hundreds, and NVLink is an inherently lossless fabric, so it can reach the performance ceiling. But these custom networks are expensive, and they were never going to suit everyone.

Hence the second AI data center scenario: the AI cloud. Much like the cloud computing we already know, the AI cloud is essentially AI infrastructure and computing power delivered as a cloud service, letting more people use AI resources at lower cost. Unlike the AI factory, the AI cloud targets lighter workloads: model fine-tuning, training small and medium models, and all kinds of inference.

Because of this, raw performance may not be the deciding factor here; cost is.

Of course, it would be better still if you could have both.


In traditional cloud computing, thousands of machines are interconnected over Ethernet. Ever since it was invented in the 1970s, Ethernet has been a cornerstone technology of data centers, cloud computing, network communications, industrial control, and other key fields. For the AI cloud, building a brand-new network stack from scratch would face two major challenges, technology and ecosystem, so the wisest choice is to stay compatible with the existing Ethernet-based cloud network architecture.

In the AI era, however, the biggest problem with traditional Ethernet is performance. If you want Ethernet's ecosystem and flexibility together with high performance, you have to re-engineer Ethernet itself.

And that is exactly the logic behind the emergence of NVIDIA's Spectrum-X network platform.


02 Solution: Full Stack

The question is, what is the difference between Ethernet for AI computing and traditional Ethernet?

Let's start with the conclusion: Ethernet for AI computing needs new qualities such as high performance, high stability, low jitter, predictable behavior, and the ability to absorb the bursty traffic of AI workloads. Let's go through these in detail.

As mentioned earlier, as large models exploded in scale, the system's performance bottleneck shifted from the computing power of a single GPU to the bandwidth and performance of the network that interconnects many GPUs. When the GPU count grows to the tens of thousands, even a single data center can no longer hold them, and data centers in different regions must cooperate, which raises the bar for network performance even higher. In short, performance is a requirement that must be guaranteed.

In addition, from a programming and usability standpoint, it is unrealistic to ask programmers to program tens of thousands of GPU cards individually. These computing resources must be unified by software that hides the underlying hardware details, so that to a developer it looks as if they are programming a single GPU. This is NVIDIA's idea of the "data center as the computer."
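
As one concrete illustration of this idea, here is a minimal sketch using PyTorch's DistributedDataParallel, a widely used framework feature rather than NVIDIA's specific stack: the developer writes the model as if it ran on one device, and the framework takes care of the cross-GPU communication.

```python
# One concrete instance of "program it like a single GPU": PyTorch's
# DistributedDataParallel (a generic framework feature, used here as a
# stand-in for NVIDIA's own stack). The model is written for one device;
# the wrapper handles all cross-GPU gradient traffic behind the scenes.
# Launch with: torchrun --nproc_per_node=<num_gpus> this_file.py
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="gloo")   # "nccl" on real GPU clusters

model = torch.nn.Linear(128, 10)          # written as if for one device
ddp_model = DDP(model)                    # interconnect details hidden here

opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
x, y = torch.randn(32, 128), torch.randn(32, 10)

loss = F.mse_loss(ddp_model(x), y)
loss.backward()                           # gradients sync over the network here
opt.step()

dist.destroy_process_group()
```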


The "data center as computer" idea resembles virtualization in traditional cloud computing, but in traditional clouds different users and workloads are relatively loose and independent, and individual tasks are not necessarily sensitive to network jitter or stability. At worst, a video stream buffers a little longer, and a retransmission fixes it.


By contrast, the AI cloud's stability requirements are on an entirely different level. Because many GPUs must run a single AI workload in lockstep, any packet loss or jitter can cause a training run to fail or become the system's performance bottleneck. On top of that, AI training routinely produces burst traffic: the moment the GPUs finish a round of computation, the model's gradients are synchronized across the network all at once, creating sudden traffic peaks. The network therefore has to absorb bursts while still delivering predictable performance.
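
Here is a minimal sketch of where those bursts come from, again using PyTorch's collective-communication API as a generic stand-in for whatever stack a real cluster runs: every rank computes quietly, then all ranks hit the same all-reduce call at nearly the same instant.

```python
# Minimal sketch of where the traffic bursts come from, using PyTorch's
# collective-communication API as a generic stand-in. Every rank computes
# quietly, then all ranks hit the same all-reduce at nearly the same moment.
# Launch with: torchrun --nproc_per_node=2 this_file.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")   # "nccl" on real GPU clusters

# Stand-in for one bucket of gradients produced by backward().
grads = torch.randn(1_000_000)

# Compute phase: the network is quiet while each rank works locally.
# ... backward pass would happen here ...

# Sync phase: all ranks reach this call together, so the fabric sees a
# sudden, synchronized traffic spike, then goes quiet until the next step.
dist.all_reduce(grads, op=dist.ReduceOp.SUM)
grads /= dist.get_world_size()            # average the gradients

dist.destroy_process_group()
```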

Traditional Ethernet is simply not enough to solve these problems, so NVIDIA launched a new Ethernet technology called Spectrum-X. Its core is still the Ethernet protocol, but optimized for the characteristics of AI computing.

First, it is worth stressing that Spectrum-X is not a single technology but a system-level network architecture composed of multiple hardware and software components. At the hardware level it includes the Spectrum-4 Ethernet switch, which integrates 100 billion transistors and delivers a total switching bandwidth of 51.2 Tb/s, supporting 128 ports of 400G or 64 ports of 800G (128 × 400 Gb/s = 51.2 Tb/s). It is the heart of the entire Spectrum-X network platform.




03 Example: From Impossible to Possible


With these underlying technologies in place, the key capabilities of an AI cloud network can be built. Take performance as an example: on Spectrum-X, multiple jobs can run in parallel with performance isolation, meaning that even when several different workloads share the network, each can still achieve bare-metal performance.

Underneath this capability is a more efficient congestion control algorithm, which ensures that no single job can grab all the network bandwidth and starve the others. Technically, if one large job's traffic is not drained smoothly, it can back up and block the entire network, dragging down the performance of every other job. Through end-to-end cooperation between the SuperNIC and the switches, Spectrum-X implements hardware-based congestion control and priority-based flow control, keeping the lossless Ethernet fabric free of packet loss and jitter. It sounds simple, but it depends on very tight end-to-end coordination between SuperNIC and switch, which is precisely why traditional NICs and switches cannot do it.
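
For intuition only, here is a toy simulation of the general idea behind such congestion control. The classic additive-increase/multiplicative-decrease (AIMD) scheme shown is a textbook stand-in, not NVIDIA's actual Spectrum-X algorithm:

```python
# Toy simulation of the *idea* behind congestion control: a classic
# additive-increase/multiplicative-decrease (AIMD) loop, used here as a
# textbook stand-in -- NOT NVIDIA's actual Spectrum-X algorithm.
# Two flows share a 100-unit link; when demand exceeds capacity, the
# "switch" signals congestion and senders back off, then probe upward.

LINK_CAPACITY = 100.0

def step(rates):
    congested = sum(rates) > LINK_CAPACITY  # switch detects queue buildup
    return [r * 0.8 if congested            # marked: multiplicative decrease
            else r + 5.0                    # unmarked: additive increase
            for r in rates]

rates = [95.0, 10.0]                        # one flow starts as a bandwidth hog
for _ in range(50):
    rates = step(rates)

# AIMD drives both flows toward an equal share of the link.
print([round(r, 1) for r in rates])
```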


Another interesting example is the digital twin. The concept originated with the metaverse and refers to a virtual representation of a physical entity, such as a digital twin of each of us. It turns out to be very useful in AI data centers. Building a real AI cluster is a complex undertaking that requires enormous investment, and the traditional approach is to build first, then debug and optimize; once a problem is found, the cost of adjustment and rework is also huge.

With digital twin technology, you can instead build a digital AI cluster first and complete the simulation, verification, debugging, and optimization on the virtual cluster, accelerating the deployment of the physical cluster and greatly reducing cost.

To build a digital twin of an AI cluster, software is clearly the key. NVIDIA offers the NVIDIA AIR platform, which can simulate a data center's key network software, operating systems, and the NetQ network management software free of charge. Today it can virtualize the complete switching network of an entire data center, and support for host-side BlueField SuperNICs is likely to follow.


04 Revelation: The Logic Behind the Trend

We have said a lot about the network transformation of AI data centers and looked in depth at how the network became NVIDIA's invisible moat. NVIDIA's approach to AI networking holds several lessons.

First, we have to credit Jensen Huang's technical foresight; or rather, not Huang alone, but the collective wisdom of the many NVIDIA leaders behind him. The reason NVIDIA can seize so many opportunities is its long-term investment in and cultivation of technology. When NVIDIA began to focus on the BlueField DPU, the wave of AI and large models had not yet arrived. Who would have guessed that this DPU technology, originally built for traditional cloud data centers, would become an indispensable key to AI networks?

Second, the era when a single trick could conquer the world is over. Strong as NVIDIA is, it has still laid out multiple directions in AI networking, NVLink, InfiniBand, and Spectrum-X among them, forming what Chinese Internet jargon would call a "combination of punches."

Finally, NVIDIA knows that the key to solving problems is finding the right problems. In its own white paper, "Network Technology in the AI Era," NVIDIA summarized the differences between traditional Ethernet and AI Ethernet, and sorted out the differences between CPU-centric and GPU-centric networks.


