At the 2024 GTC conference, Jensen Huang held a B200 in his right hand and an H100 in his left. As the saying goes, the new favorite arrives and the old one is forgotten: "We need a bigger GPU. If it can't be bigger, we will combine more GPUs together to become a larger virtual GPU."
Nvidia announced the Blackwell-architecture B200 GPU and, in doing so, personally left the H100, the card that had become an internet celebrity, stranded on the beach.
According to Jensen Huang's presentation, the theoretical AI performance of the B200 can reach 20 PFLOPS, five times that of the H100. Against the H100's 80 billion transistors, the B200 packs as many as 208 billion.
Generally speaking, the most common way to increase chip computing power is to use advanced manufacturing processes to stuff more transistors into the chip with higher density. As Moore's Law says:
The number of transistors that can be accommodated on an integrated circuit doubles approximately every 18 to 24 months.
For example, the A100 GPU using the 7nm process has a chip (Die) area of 826mm² and contains 54.2 billion transistors; the H100 using the 5nm (TSMC N4) process has a chip area reduced to 814mm², but the number of transistors has soared to 80 billion.
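The density gain behind those figures can be checked with a few lines of arithmetic. The sketch below uses only the die areas and transistor counts quoted above; the function and variable names are illustrative.

```python
# Die area (mm^2) and transistor count, as quoted in the text above.
chips = {
    "A100": (826, 54.2e9),
    "H100": (814, 80e9),
}


def density_mtr_per_mm2(area_mm2, transistors):
    """Transistor density in millions of transistors per mm^2."""
    return transistors / area_mm2 / 1e6


densities = {name: density_mtr_per_mm2(*spec) for name, spec in chips.items()}
for name, d in densities.items():
    print(f"{name}: {d:.1f} million transistors per mm^2")

# How much denser the H100 is than the A100 on a slightly smaller die.
ratio = densities["H100"] / densities["A100"]
print(f"density gain from the process change: {ratio:.2f}x")
```

On these numbers, the move from 7nm to the 5nm-class process raised density by roughly half while the die itself barely changed in size.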
However, while the transistor count has grown about 2.6-fold, the B200 does not use the more advanced 3nm process; it uses the same 5nm-class process as the H100. The "bigger" and "combine" that Jensen Huang talks about are literal:
From a technical perspective, the B200 actually "joins" two chips into one large chip.
In Nvidia's slide presentation, two GPU dies were "seamlessly bonded" along their edges, doubling the computing power while also doubling the area.
This 1+1=2 method looks simple and crude, but behind it is an all-out charge at the edge of physics.
There are two ways for a factory to raise output: one is to expand the factory and install more production lines; the other is to upgrade the production lines so that more of them fit while the factory area stays unchanged.
Chip companies have been adopting the second method: through production line innovation (process), they can pack more transistors into the limited chip area to avoid the increase in rent costs caused by expanding factories.
However, the limitation of this approach is that the research and development costs corresponding to production line innovation (process) are getting higher and higher, and may even be higher than the rent. The 5nm process used by the H100 is likely to be the limit of GPU mass production. If it continues to go down to 3nm, it is likely to suffer in terms of cost.
Expanding factories is indeed one way, but when it comes to chip production, we will encounter a problem that the Chinese are very familiar with: limited land supply.
Each chip is "cut" from a 12-inch silicon wafer (land), so the larger the chip (factory) area, the fewer chips can be "cut" from each wafer.
Taking into account the yield and heat dissipation issues of large-area chips (construction accidents), the cost of a single chip will increase exponentially.
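To make the "limited land" argument concrete, here is a rough sketch of how die area affects the number of good dies per wafer. It assumes a 300mm wafer, a widely used approximation for gross dies per wafer, and a simple Poisson yield model; the defect density of 0.1 per cm² is an illustrative assumption, not a figure from this article or from any foundry.

```python
import math

WAFER_DIAMETER_MM = 300.0   # a "12-inch" wafer
DEFECTS_PER_CM2 = 0.1       # assumed defect density, not from the article


def gross_dies_per_wafer(die_area_mm2, wafer_diameter_mm=WAFER_DIAMETER_MM):
    """Approximate die count: wafer area / die area, minus an edge-loss term."""
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(wafer_area / die_area_mm2 - edge_loss)


def poisson_yield(die_area_mm2, defects_per_cm2=DEFECTS_PER_CM2):
    """Poisson model: probability that a die of this area has zero defects."""
    return math.exp(-defects_per_cm2 * die_area_mm2 / 100.0)


def good_dies_per_wafer(die_area_mm2):
    """Expected number of defect-free dies per wafer."""
    return gross_dies_per_wafer(die_area_mm2) * poisson_yield(die_area_mm2)


baseline = good_dies_per_wafer(100)
for area in (100, 200, 400, 800):
    good = good_dies_per_wafer(area)
    # Wafer cost is fixed, so cost per good die scales as 1 / good dies.
    print(f"{area:4d} mm^2: {gross_dies_per_wafer(area):4d} gross, "
          f"{good:6.1f} good, relative cost per good die {baseline / good:5.1f}x")
```

Under these assumed numbers, an 800mm² die (close to the H100's size) comes out roughly 20 times as expensive per good die as a 100mm² die, even though it is only 8 times larger. The model ignores heat dissipation, test cost, and salvaging partially defective dies, so it only shows the shape of the curve.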
From this, a third idea emerged: build an identical factory and allow two factories to produce at the same time, which not only avoids cost problems, but also improves production efficiency.
This method sounds simple, but is difficult to achieve in practice. A chip goes through two stages when performing a computing task: data transmission and calculation. If data transmission takes too long, the calculation units sit idle and computing power is wasted. It is like two factories sharing one foreman who relays instructions: while the foreman is giving a speech in factory A, the workers in factory B are all slacking off.
This means that if 10 chips are packaged on one motherboard, not only will the performance not be improved by 10 times, but it will probably not even be doubled.
In 2011, Nvidia released the GTX 590 graphics card, whose biggest feature was two GPU chips mounted on a single PCB.
However, using the computing power of both GPUs in a given game not only required dedicated software support, but also delivered only about 130% of the performance of a single chip.
The reason is that a large amount of computing power is wasted by inefficient data transmission.
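One way to see why 10 chips may not even double performance is to treat the time lost to data transmission as a fixed, non-parallel share of the work and apply Amdahl's law. This is a simplifying assumption for illustration, not how the GTX 590 was actually analyzed; the only input taken from the text is the figure of about 130% performance on two chips.

```python
def serial_fraction(speedup, n_chips):
    """Invert Amdahl's law, speedup = 1 / (f + (1 - f) / n), to solve for f."""
    return (1 / speedup - 1 / n_chips) / (1 - 1 / n_chips)


def amdahl_speedup(f, n_chips):
    """Speedup on n chips when a fraction f of the work cannot be parallelized."""
    return 1 / (f + (1 - f) / n_chips)


# About 130% of single-chip performance on two chips, as quoted in the text.
f = serial_fraction(1.3, 2)
print(f"implied non-parallel (transfer-bound) share: {f:.1%}")

for n in (2, 4, 10):
    s = amdahl_speedup(f, n)
    print(f"{n:2d} chips: {s:.2f}x speedup, {s / n:.0%} scaling efficiency")
```

With that implied share held constant, ten chips stay below a 2x speedup in this model, which is consistent with the claim above. Real workloads differ in how transfer time grows with chip count, so the numbers are only indicative.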
To keep the production line workers from idling while the foreman is away, an NVIDIA team published a paper in 2017 proposing an architecture called the "multi-chip-module GPU" (MCM-GPU). The core idea is to integrate multiple GPU chips into the same chip package.
Traditional chip packaging is "package first, then assemble": two chips are packaged separately and then connected with wires. NVIDIA's solution is "assemble first, then package": the two chips are first joined into one large chip and then packaged together.
With the physical distance between the chips (factory buildings) reduced to practically zero, the foreman delivers instructions once and workers on both sides receive and carry them out at the same time, cutting data transmission time and achieving 1+1=2.
A few months later, old rival AMD, as if to say that anyone can write a paper, published one of its own showing a design with four GPUs integrated in the same package. It claimed performance 45.5% higher than the most powerful GPU of the time and promised that the design was coming soon.
But neither Nvidia nor AMD actually delivered this solution "soon".
The first one to make 1+1=2 was Apple.
The meaning of 1+1=2 is as Apple said in the press release:
M1 Ultra still shows the integrity of a chip when working, and will be recognized by all software as a complete chip. Developers can directly use its powerful performance without rewriting the code. This has no precedent in history.
M1 Ultra is made up of two identical M1 Max chips
Before Apple, almost no "stitching" solution could eliminate the losses incurred at the connection between chips, so performance often came out as "1+1<2". Behind the M1 Ultra is a "stitching technology" called UltraFusion.
According to Apple's official statement, UltraFusion was jointly developed by Apple and TSMC. But judging from experience, Apple's biggest role was to reimburse TSMC's R&D expenses in the form of "technology naming fees."
The core of the stitching of two chips is to solve the problem of data transmission between chips.
In order to achieve "seamless bonding", Apple used TSMC's most expensive and advanced packaging technology: the fifth-generation CoWoS-S.
The traditional transmission method is to package two chips on a substrate, and the transmission between the chips is solved by wires. The CoWoS solution adds a silicon interposer between the substrate and the chip. By wiring in the silicon interposer, the two small chips are indirectly connected. The connection density is twice that of the existing technology.
The key to this technology lies in the silicon interposer, which is also where the money burns.
The silicon interposer is essentially a piece of silicon wafer, the same raw material from which chips are "cut". Making the connection alone adds the cost of an extra layer of silicon wafer. Probably only Apple could afford to do this.
Later, Nvidia adopted a more mature version of CoWoS on the H100, which still cost more than $4,000. As the one doing the original trial and error, Apple can only have paid more.
In addition to CoWoS, Apple's money is also spent on "stitching" technology.
The essence of chip manufacturing is to carve complex circuits on silicon wafers. However, in the actual manufacturing process, the circuit is not directly engraved on the silicon wafer, but is first engraved on a mask, and then "transferred" to the silicon wafer through photolithography and etching.
The problem Nvidia ran into back then was that a GPU die is already large. Once two GPUs were spliced together, the result would exceed the size of a normal mask (the area of the H100 is close to the limit of TSMC's 5nm mask), and the circuit could not be patterned in full.
The solution proposed by Apple is that one mask is not enough, so let's just use four.
Through the "stitching" of four masks, the area over which circuits can be patterned is increased to 2500mm², more than three times that of Nvidia's GPU of the same period (815mm²).
In chip manufacturing, a large part of the cost comes from mask production.
Mask production requires a mask writer, a machine as precise as a photolithography machine. Moreover, the mask writer is only used during mask production, and the masks for each chip design are made only once, making it difficult to spread the cost.
In addition, UltraFusion uses many new technologies, such as high-aspect-ratio through-silicon vias (TSVs) to connect the chips and a new non-gel thermal interface material (TIM) for heat dissipation. TSMC took the invoices for all of them to Apple for reimbursement.
When the M1 Ultra was released, the industry did not have accurate cost estimates. It's not that the researchers are not up to par, it's that the technology is too advanced and cannot be calculated.
The most critical issue in the high-tech industry is not how the technology is implemented, but who will pay to turn the data in papers and laboratories into products that can be mass-produced. One wonders whether, looking at the stitching diagram of the M1 Ultra, Jensen Huang was ambushed by long-dormant memories.
The first person to try to solve the 1+1<2 problem was neither Nvidia nor Apple, but TSMC veteran Chiang Shang-yi.
In 2009, Morris Chang, who had returned to run TSMC, invited the retired Chiang Shang-yi back. Under Chiang's leadership, TSMC overtook Samsung and took the lead in mass-producing the 28nm process with its "gate-last" technology route. However, during research and development, Chiang found that the unit manufacturing cost of transistors was rising instead of falling, and that upgrading the process was becoming a less cost-effective way to improve performance.
With a budget of US$100 million approved by Morris Chang and an engineering team of more than 400 people, Chiang launched the "Beyond Moore Project".
With traditional interconnect technology, the transmission rate had hit a ceiling. Chiang began to try a new idea:
By packaging two chips together, the physical distance is shortened and the transmission speed naturally increases. To distinguish it from traditional packaging, Chiang named it "advanced packaging".
In 2011, TSMC received an order from Xilinx, a major FPGA manufacturer. With CoWoS and jointly developed through silicon via (TSV) and other technologies, TSMC successfully spliced four 28nm FPGA chips together and launched the largest FPGA chip in history.
However, most customers had little interest in CoWoS, and Xilinx's orders were too small to matter.
Executives from Qualcomm, a long-time customer, said bluntly over lunch with Chiang Shang-yi that CoWoS was very good technology, but "I am only willing to spend 1 cent per square millimeter for it," while TSMC's price at the time was 7 cents per square millimeter.
It is said that Nvidia is also one of the first target customers of TSMC CoWoS, because the bottleneck of data transmission has always been a core problem plaguing GPU computing. But after hearing TSMC's offer, Nvidia said on the spot that the old technology could still be used for a few more years.
On the other hand, advanced manufacturing processes are still advancing steadily, and the concept of advanced packaging seems too advanced. After all, leaders are still driving Corollas, so don’t rush to switch to BMWs.
As a result, the advanced packaging team was for a time marginalized within TSMC, and was even regarded as a retirement home for veteran staff. Liang Mong-song, who later moved to Samsung, regarded his transfer to the advanced packaging business as a demotion.
Subsequently, TSMC began to pare CoWoS down and came up with an alternative solution, "InFO", which replaces the expensive silicon interposer with other materials, sacrificing connection density in exchange for a significantly lower cost.
Immediately afterwards, TSMC met a super client capable of single-handedly changing the fate of its suppliers: Apple.
Around 2013, due to competition with Samsung in the mobile phone market, Apple began to hand over chip foundry to TSMC.
With the InFO solution, TSMC used its 16nm process to produce an A10 processor that outperformed what Samsung's 14nm process could deliver, and contributed to the iPhone 7, the second thinnest and lightest iPhone in history [5].
With Apple's big orders, TSMC's advanced packaging business quickly revived, and in 2022 it produced the M1 Ultra chip that shocked the industry. At the beginning of 2024, this "glue method", by then more than ten years old, was used on Nvidia's new "nuclear bomb", the B200. NVIDIA took the opportunity to claim the naming rights and called the technology "NV-HBI".
Advanced packaging solutions are still expensive, but for today's NVIDIA, they may have forgotten how to write the word cost.
In addition to CoWoS, HBM, another technology popularized by generative AI, can also be traced back ten years.
When CoWoS got its first order from Xilinx, Chiang Shang-yi was overjoyed, but Xilinx's motive left him dumbfounded: put four old chips together and sell them directly as a new product at a higher price, with no need to develop a new product itself [3].
In an interview with the Computer History Museum in the United States, Chiang Shang-yi recalled [3]: "My original intention of developing technology was to solve the performance bottleneck problem. In my opinion, my innovation was not used in a good way." It is difficult for technological revolution to promote technological innovation, but technological innovation makes technological revolution possible. Those who create history can never foresee their own coordinates in the course of history.
On frontiers of physics where we have never set foot, countless great innovations still lie unnoticed in obscure corners.
References:
[1] NVIDIA Blackwell Architecture and B200/B100 Accelerators Announced: Going Bigger With Smaller Data, Anandtech
[2] Apple UltraFusion technology, Xiamen Yuntian Semiconductor
[3] Chiang Shangyi's 10,000-word autobiography reveals TSMC's road to the top, Sprout
[4] This is how TSMC's advanced packaging is made, World Magazine
[5] The new packaging of Apple iPhone 7 A10 processor has had a huge impact both technically and commercially, Yole Development
[6] Apple M1 Ultra Decrypted: The Industry's First GPU Die Integration, How to Implement It, Ji Micronet
[7] Apple Will Help TSMC to Be in the Leading Position in the Next Era, utmel