
The era of AI chip customization is coming

2024-10-22


The increasing complexity of AI models and the explosion in the number and variety of networks have left chipmakers treading a fine line between fixed-function and programmable accelerators, and creating new approaches that encompass both.

General-purpose approaches to AI processing generally fall short. General-purpose processors are just that - they are not designed or optimized for any specific workload. And because AI consumes a large portion of system power, focusing on specific use cases or workloads can deliver greater power savings and better performance in a smaller footprint.

01 The impact of AI on computing and semiconductors

"AI has had a profound impact on the computing and semiconductor industries over the last decade - to the point where there are now specialized processor architectures and specialized components developed and adopted to serve only the AI market,"said Steven Woo, Rambus Fellow and Distinguished Inventor.

But this specialization comes at a cost. "For ML and AI, the demand for compute is insatiable," said Ian Bratt, Arm Fellow and VP of Machine Learning Technologies. "If you can do 10x more compute, people will use it because you can do better when you run a 10x larger model. Because that demand is insatiable, it drives you to optimize for that workload, and different types of NPUs have been built that achieve very good energy efficiency on specific classes of neural network models, and you get excellent operations per watt and performance in those spaces. However, that comes at the expense of flexibility because no one knows where the models are going to go. So it sacrifices the future-proofing aspect."

As a result, some engineering teams are looking at different optimization approaches. "General-purpose computing platforms like CPUs and GPUs have been adding more internal acceleration for neural networks without sacrificing the general-purpose programmability of those platforms," Bratt said. Arm has maintained a CPU instruction roadmap and has been adding architecture features and CPUs to improve ML performance for years. "While this is still on a general-purpose platform, you get a lot of benefit there. It's not as good as a dedicated NPU, but it's a more flexible and future-proof platform," he said.

Improving efficiency is critical, and it affects everything from the energy required to train AI models in hyperscale data centers to the battery life of edge devices doing inference.

"If you take a classic neural network, where you have multiple layers of nodes and information is passed from one node to another, the essential difference between training and execution is that during training, you have backpropagation. You take a data set and run it through the nodes. You then calculate the error function, which is how wrong the answer is compared to the labeled outcome that you know you need to achieve. Then you take that error and backpropagate it and adjust all the weights on the nodes and the connections between them to reduce the error. Then you sweep again with more data and backpropagate the error again. You go back and forth, and that's training. Each time you sweep through you improve the weights, and eventually you hope to converge to a set of trillions of weights and values of nodes, biases, and weights and values that can give reliable outputs. Once you have the weights and all the parameters for each node and you're executing the actual AI algorithm, then you don't need to backpropagate. You don't need to correct it anymore. You just feed in the data and pass it on. It's a much simpler, one-way way of processing data.”

02 Backpropagation

Backpropagation requires a lot of energy to do all the calculations.

"You have to average all the nodes and all the data to form the error function, and then weight it and divide it and so on," Swinnen explained. "Backpropagation requires all the math that doesn't happen in the actual execution (during inference). That’s one of the biggest differences. There's much less math that needs to be done in inference."

However, it still requires a lot of processing, and as AI algorithms become more complex and the number of floating point operations increases, the trend line is only going to point upwards and to the right.

"The number of floating point operations performed by the winning ImageNet 'Top 1' algorithm has increased 100x over the past five years,” said Russ Klein, program director for Advanced Synthesis at Siemens Digital Industries Software. "Certainly, LLMs are setting new records for model parameters. As the computational load increases, it becomes increasingly impractical to run these models on general-purpose CPUs. AI algorithms are often highly data-parallel, meaning operations can be distributed across multiple CPUs. This means performance requirements can be met by simply applying more CPUs to the problem. But the energy required to perform these calculations on a CPU can be very high. GPUs and TPUs typically have higher power consumption but compute faster, reducing energy consumption for the same operation."


Despite this, the need for more processing power continues to grow. Gordon Cooper, product manager for the Solutions Group at Synopsys, noted a sharp rise in the number of benchmark requests for generative AI inference, indicating growing interest. "More than 50% of our recent benchmark requests have at least one generative AI model on the list," he said. "What's harder to assess is whether they have a specific use case, or whether they're hedging their bets and saying, 'This is the trend. I have to tell people I have this.' I think the need to claim the capability is still ahead of the use case."

At the same time, the pace at which these models change is accelerating. "We're a long way from hard-wired AI (i.e., ASICs), to the point where 'this is it, the standards are set, these are the baselines, and these are going to be the most efficient,'" Cooper said. "So programmability is still critical because you have to be able to provide a certain level of programmability for whatever comes next to make sure you have some flexibility. But if you're too programmable, then you're just a general-purpose CPU or even a GPU, and then you're not taking advantage of the power and area efficiencies of edge devices. The challenge is how to optimize as much as possible while still providing programmability for the future. That's where we and some of our competitors are trying to tread, in the realm of being flexible enough. An example is activation functions, like ReLUs (rectified linear units). We used to hard-wire them, but now we find that's ridiculous because we can't guess what they're going to need next time. So now we have a programmable lookup table to support whatever comes next. It took us a couple of generations to realize we had to start making it more flexible."
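A minimal sketch of the kind of programmable activation lookup table Cooper describes (the details below are assumptions for illustration, not Synopsys' implementation): the hardware holds a small table of samples and interpolates between them, so the "activation" is whatever curve firmware loads into the table. ReLU today can become GELU tomorrow without new silicon.

```python
import numpy as np

class ActivationLUT:
    """Activation approximated from a small, reprogrammable sample table."""
    def __init__(self, fn, lo=-8.0, hi=8.0, entries=256):
        self.x = np.linspace(lo, hi, entries)    # fixed sample points
        self.y = fn(self.x)                      # programmable table contents

    def __call__(self, v):
        # Hardware would index the table and interpolate between neighbors;
        # np.interp performs the same piecewise-linear lookup here.
        return np.interp(np.clip(v, self.x[0], self.x[-1]), self.x, self.y)

# Same "silicon" (the table), two different activations loaded into it.
relu = ActivationLUT(lambda x: np.maximum(x, 0.0))
gelu = ActivationLUT(lambda x: 0.5 * x * (1.0 + np.tanh(
       np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3))))

v = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(v))    # close to true ReLU, small error near the kink at 0
print(gelu(v))    # close to true GELU, with no hardware change
```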

03 AI processing continues to evolve

AI's rapid growth has been fueled by huge advances in computing performance and capacity. "We're in AI 2.0 now," said Rambus' Woo. "AI 1.0 was really about the first attempts to apply AI to the entire computing landscape. Things like voice assistants and recommendation engines started to get attention because they were able to use AI to provide higher-quality results. But looking back, they were limited in some ways. The systems could consume certain types of inputs and outputs, but they didn't really generate the high-quality information that we can generate today. Where we are today is built on top of AI 1.0. AI 2.0 is about systems now being able to create new things from the data they learn from and the inputs they get."

The most important of these technologies are large language models and generative AI, as well as co-pilots and digital assistants that help humans be more productive. "The hallmark of these systems is multimodal inputs and outputs," explained Woo. "They can take many inputs - text, video, speech, even code - and can generate something new from them. In fact, they can generate many types of media from them as well. All of this is another step towards the larger goal of artificial general intelligence (AGI), where we as an industry are working to deliver more human-like behaviors that build on the foundations that AI 1.0 and AI 2.0 set for us. The idea here is to be able to really adapt to our environment and tailor the results to specific users and specific use cases. There will be improvements in the way content is generated, especially in areas like video, and even in the future, using AGI as a way to guide autonomous agents, such as robot assistants that can both learn and adapt."

In the process, the size of AI models has been growing dramatically—about 10 times or more every year. "Today, the largest models available in 2024 have already passed the trillion parameter mark," he said. "This is because larger models provide higher accuracy, and we are still in the early stages of getting models to very efficient levels. This is still a stepping stone to AGI, of course."

Three or four years ago, before the advent of vision transformers and LLMs, SoC specifications for new NPU capabilities were typically limited to a small set of well-known and optimized detectors and image classifiers, such as Resnet50, ImageNet v2, and the traditional VGG16. "Semiconductor companies typically evaluate third-party IP for these networks, but ultimately decide to build their own accelerators for the common building block graph operators in these benchmark networks," said Steve Roddy, chief marketing officer at Quadric. "In fact, the vast majority of AI acceleration in volume SoCs is homegrown accelerators. A teardown of all leading mobile SoCs in 2024 will demonstrate that all six of the top volume mobile SoCs use in-house NPUs."

Many of these are likely to be replaced or supplemented by more flexible commercial NPU designs. "Requests for proposals for new NPU IP typically include 20, 30 or more networks, covering a range of classic CNNs such as Resnet, ResNext, etc., new complex CNNs (e.g., ConvNext), vision transformers (e.g., SWIN transformer and deformable transformer), and GenAI LLM/SLM, with too many model variants to count," said Roddy. "It is not feasible to build hardwired logic to accelerate such a diverse set of networks consisting of hundreds of different AI graph operator variants. As a result, SoC architects are looking for more fully programmable solutions, and most internal teams are looking to external third-party IP vendors who can provide the more powerful compiler toolsets needed to quickly compile new networks, rather than the previous labor-intensive approach of manually porting ML graphs."

04 History repeats itself

This evolution in AI is similar to what has happened in computing over time. "First, computers appeared in the data center, and then computing started to spread outward," said Jason Lawley, director of product marketing for Cadence Neo NPUs. "We moved to the desktop, and then into people's homes, and spread outward. Then we had laptops, and then mobile phones. It's the same with AI. We can look at the intensity of compute required to start doing AI in the data center. We're seeing that now with NVIDIA.

"That being said, there will always be a place for mainframes and data centers. We're going to see AI spread outward from the data center to the edge. As you move to the edge, you get all sorts of different types of applications. Cadence focuses on video, audio and radar, and other computing classes around those, and each of those pillars is an accelerator for the application processor. In each of those pillars, they may need to do more AI, so the AI NPU becomes an accelerator of accelerators."

Customer behavior is also evolving. "More and more system companies and end users have their own proprietary models, or models that are retrained using proprietary data sets," Roddy said. "These OEMs and downstream users cannot or will not release proprietary models to silicon vendors, who then have their porting teams develop new models. Even if you could put NDA protections in place up and down the supply chain, a working model that relies on manual tuning and porting of ML models will not scale well enough to support the entire consumer and industrial electronics ecosystem. The new working model is a fully programmable, compiler-based toolchain that can be handed off to the data scientist or software developer who creates the final application, which is how the toolchains for leading CPUs, DSPs, and GPUs have been deployed for decades."

05 The increasing complexity of algorithms puts more pressure on engineering teams

As algorithms continue to grow in complexity, designers are forced to pursue ever-higher levels of acceleration. "The more tailored an accelerator is to a specific model, the faster and more efficient it is, but the less general it is," said Siemens' Klein. "It also becomes less adaptable to changes in applications and requirements."

Figure 1: Power and performance for different execution platforms running AI models (CPU, GPU, TPU, and custom accelerators)


Figure 2: Increasing complexity of inference


Rambus' Woo also sees a trend toward larger AI models because they can provide higher quality, more powerful and more accurate results. "This trend shows no signs of slowing down, and we expect the demand for larger DRAM capacities and larger DRAM bandwidths to continue to increase significantly in the future," he said. "We all know that AI training engines are the showpiece of AI, at least from a hardware perspective. Compute engines from companies like NVIDIA and AMD, as well as specialized engines (TPUs) from companies like Google, have made huge advances in the industry's ability to compute and deliver better AI. But these engines have to be fed with a lot of data, and data movement is one of the key factors limiting the speed at which we can train models today. If these high-performance engines are waiting for data, then they are not getting their work done. We have to make sure that the entire pipeline is designed to feed data in a way that allows these engines to keep running.

"If we look from left to right, what's typically happening is that there's a lot of data being stored, sometimes in a very unstructured way, so it's stored on devices like SSDs or hard drives, and the job of these systems is to extract the most relevant, most important data to train the model we're training, and get it into a form that the engine can use. These storage systems also have a lot of conventional memory, which is used for buffers and so on. For example, some of these storage systems can have memory capacities as high as 1TB. Once the data is pulled from storage, it is sent to a set of servers for data preparation. Some people call this the read layer. The idea here is to take this unstructured data and prepare it so that the AI engine can train on it optimally."
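The pipeline Woo outlines can be sketched as a producer/consumer arrangement (an illustrative assumption, not Rambus' design): a prep stage pulls raw records from storage, transforms them, and keeps a queue of ready batches full so the compute engine never stalls waiting for data.

```python
import queue
import threading
import time

batch_queue = queue.Queue(maxsize=4)   # buffer between the "read layer" and the engine

def prep_worker(num_batches):
    """Read layer: pull raw records from storage and shape them for training."""
    for i in range(num_batches):
        time.sleep(0.002)                          # stand-in for SSD/HDD reads + decode
        batch_queue.put({"id": i, "tensors": [0.0] * 1024})
    batch_queue.put(None)                          # signal the end of the dataset

def training_engine():
    """Accelerator side: stalls on get() whenever the prep stage falls behind."""
    steps = 0
    while (batch := batch_queue.get()) is not None:
        time.sleep(0.005)                          # stand-in for the compute step
        steps += 1
    print(f"engine processed {steps} batches")

threading.Thread(target=prep_worker, args=(50,), daemon=True).start()
training_engine()
```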

Meanwhile, alternative number representations can further improve PPA. "Floating point numbers are commonly used for AI training and inference in Python ML frameworks, but they are not an ideal format for these calculations," Klein explained. "Numbers in AI calculations are primarily between -1.0 and 1.0, and data is often normalized to this range. While 32-bit floating point numbers can range from roughly -10^38 to 10^38, this leaves a lot of unused space in the numbers and in the operators that perform calculations on them. The hardware for the operators and the memory to store the values take up silicon area and consume power."

Google created a 16-bit floating point format called brain float (bfloat) that is targeted at AI calculations. PPA improves substantially because the storage for model parameters and intermediate results is cut in half. Vectorized (SIMD) bfloat instructions are now an optional instruction set extension for RISC-V processors. Some algorithms are deployed using integer or fixed-point representations. Moving from 32-bit floating point numbers to 8-bit integers requires one-fourth the memory area, data moves through the design four times faster, and the multipliers are 97% smaller. Smaller multipliers allow more operators in the same silicon area and power budget, enabling greater parallelism. Posits are another novel representation that works well for AI algorithms.
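As a rough, hedged illustration of those storage savings (sizes only, not any vendor's format), the snippet below compares the footprint of the same weights in float32, bfloat16, and int8, using a simple scale-based int8 quantization. NumPy has no native bfloat16, so it is approximated here by truncating the low mantissa bits.

```python
import numpy as np

rng = np.random.default_rng(3)
weights = rng.uniform(-1.0, 1.0, size=1_000_000).astype(np.float32)  # normalized data

# bfloat16 keeps float32's exponent range in half the bits; approximate it by
# zeroing the low 16 mantissa bits of each float32 value.
bf16_approx = (weights.view(np.uint32) & np.uint32(0xFFFF0000)).view(np.float32)

# int8: map the observed range onto [-127, 127] with a single scale factor.
scale = np.abs(weights).max() / 127.0
w_int8 = np.round(weights / scale).astype(np.int8)

print("float32 storage :", weights.nbytes, "bytes")       # 4,000,000
print("bfloat16 storage:", weights.nbytes // 2, "bytes")  # 2,000,000 if packed as 16-bit
print("int8 storage    :", w_int8.nbytes, "bytes")        # 1,000,000
print("max bfloat16 error:", np.abs(bf16_approx - weights).max())
print("max int8 error    :", np.abs(w_int8 * scale - weights).max())
```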

"General-purpose AI accelerators, such as those produced by NVIDIA and Google, must support 32-bit floating point numbers because some AI algorithms require them," said Klein. "Also, they could add support for integers of varying sizes, and perhaps floating point or hypotheticals. But supporting each new numeric representation requires operators for that representation, which means more silicon area and power, hurting PPA. Some Google TPUs support 8-bit and 16-bit integer formats in addition to 32-bit floating point. But if the optimal size for an application is 11-bit features and 7-bit weights, that's not a good fit. 16-bit integer operators are needed. But a custom accelerator with 11 x 7 integer multipliers will use about 3.5 times the area and power. For some applications, that would be a strong reason to consider a custom accelerator."

All roads lead to customization, and there are many considerations for chip designers to be aware of about custom AI engines.

"When you license a highly customized or a somewhat customized product, you get different things," said Paul Karazuba, vice president of marketing at Expedera. "It's not a standard product. So, it takes a little bit of time to learn. You're getting boutique products, and those products are going to have some hooks in them that are unique to you as a chip designer. That means there's a learning curve as a chip designer and architect to understand exactly how these products are going to work in your system. There are advantages to that. If there's something in a standard IP like PCIe or USB that you don't want or need, then there are hooks in there that may not be compatible with the architecture that you've chosen as a chip designer."

This is essentially margin in the design, and it affects both performance and power. "When you get a custom AI engine, you can make sure those hooks that you don't like are not there," Karazuba said. "You can make sure that the IP works well in your system. So, there are definitely benefits to that. But there are also disadvantages. You don't get the scale that you get with standard IP. But with something that's highly customized, you have it. You get some customization, which has some benefits to your system, but you need to deal with longer lead times. You may have to deal with some unique things. There are some complications."

The benefits, however, can outweigh the learning curve. In one early customer example, Karazuba recalled, "They had developed their own internal AI network designed to reduce noise in a 4K video stream, and they wanted to achieve 4K video rates. This was a network that they had developed in-house. They spent millions of dollars to build it. They were originally going to use the existing NPU on their application processor, which was, as you might guess, a general-purpose NPU. They put their algorithm on that NPU and got a frame rate of two frames per second, which is obviously not video rate. They came to us and we licensed them a targeted, customized version of our IP. They built a chip with our IP, ran the exact same network, and got 40 frames per second, so they got a 20x performance improvement by building a focused engine. Another benefit was that, because it was focused, they were able to run it at half the power that the NPU on the application processor consumed. So they got 20 times the throughput at less than half the power.

"To be fair, it's on the same process node as the application processor, so it's really an apples-to-apples comparison. These are the benefits you see from things like this. Now, there's obviously a cost aspect. It's much more expensive to build your own chip than to use something that's already on a chip that you already bought. But if you can leverage this AI to differentiate your product, and you can get this level of performance, then the extra cost is probably not a barrier."

06 Conclusion

In terms of where the future is headed, Arm's Bratt said there is enough going on in AI/ML for both approaches. "In situations where people really care about energy efficiency and the workloads are slower-moving, like deeply embedded environments, you're going to see these specialized NPUs with highly optimized models for those NPUs, and you're going to get great performance. But in general, programmable platforms like CPUs are going to continue to advance. They're going to continue to advance in ML, and they're going to run those brand-new workloads that maybe you can't map to an existing NPU because they have new operators or new data types.

"But as things settle down, for certain verticals, you're going to take those models that run on programmable platforms and optimize them for NPUs, and you're going to get the best performance in embedded verticals like surveillance cameras or other applications. Those two modes are going to coexist for quite some time to come."

Cadence's Lawley said chip architects and design engineers need to understand the changes that AI processing brings, and it comes down to three things: storing, moving and computing data.

"Fundamentally, those three things haven't changed since the beginning of Moore's Law, but the most important thing they have to be aware of is the trend toward lower power and optimal data usage, and advances in quantization - the ability to pin memory into the system and reuse it efficiently. So what kind of layer fusion should be used in data movement, data storage, and data computation? Software plays as much of a role in this as hardware, so that the algorithms are able to compute things that don't need to be computed correctly and move things that don't need to be moved - that's what we focus on. How do we get the most performance with the least energy? That's a hard problem to solve."


Reference link: https://semiengineering.com/mass-customization-for-ai-inference/

