
Analog chips can also be trained

2024-10-15


When analog chips are used for language models, their physical properties limit them to inference. But IBM Research scientists are working on several new algorithms that allow these energy-efficient processors to train models.

Training a deep neural network requires many processors running simultaneously for days, and as AI systems continue to scale, finding cheaper and more efficient ways to perform that training becomes increasingly important. IBM Research scientist Tayfun Gokmen and his team are taking a creative approach to the problem, developing algorithms that let analog AI devices accelerate deep neural network training, and do it more energy-efficiently than CPUs or GPUs.

Until now, inference has been the main focus of in-memory computing. But Gokmen believes even greater energy and computational savings are available in training, which is far more computationally expensive than inference. Unfortunately, when researchers use these in-memory computing devices for training, they don't always perform well. The materials in these devices, such as the atomic filaments in resistive random-access memory or the chalcogenide glass in phase-change memory, suffer from noise and switching issues, so new algorithms have to be designed to let these devices accelerate deep neural network workloads.

One of the big problems they encountered along the way is that many in-memory training algorithms demand a level of computational fidelity that is impractical on analog devices. The team's approach goes a long way toward solving this problem, with algorithms that can meet this requirement.


01 Simulating in-memory computing

Most traditional chip designs, like CPUs and GPUs, have separate memory and processing units, and data must be transferred back and forth between the two. This constant shuttling, which limits how fast the chip can work, is known as the von Neumann bottleneck. In analog in-memory chips there is no separation between compute and memory, which makes these processors far more efficient than traditional designs: the data (in the case of AI, the model weights) never has to pass back and forth through the von Neumann bottleneck.
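To make the difference concrete, here is a minimal, purely illustrative Python sketch (not IBM's design) that counts how many weight values have to cross the memory-to-processor boundary in a conventional load-then-compute loop, versus an in-memory scheme where the multiply-accumulate happens where the weights are stored. The `weight_memory` dictionary, the movement counter, and the `AnalogCrossbar` class are all assumptions invented for this toy model.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))    # model weights
x = rng.standard_normal(8)         # input activations

# --- von Neumann style: weights live in "memory" and must be fetched ---
weight_memory = {"layer1": W}      # stand-in for off-chip memory
data_moves = 0

def von_neumann_matvec(name, x):
    global data_moves
    tile = weight_memory[name].copy()   # fetch the weights into the compute unit
    data_moves += tile.size             # every element crosses the bottleneck
    return tile @ x                     # multiply in the separate processing unit

# --- in-memory style: the multiply-accumulate happens at the storage site ---
class AnalogCrossbar:
    def __init__(self, conductances):
        self.G = conductances           # weights stored as device conductances

    def matvec(self, x):
        # Ohm's law plus current summation: no weight ever leaves the array
        return self.G @ x

crossbar = AnalogCrossbar(W)
print(np.allclose(von_neumann_matvec("layer1", x), crossbar.matvec(x)))  # True
print("weight values moved (von Neumann):", data_moves)                  # 32
print("weight values moved (in-memory):", 0)
```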

In analog devices, a neural network's model weights are not stored in transistors but in memory cells that hold them in physical form. These cells contain special materials whose conductivity, or resistance, can be tuned to encode intermediate values between 0 and 1. That property means a single analog memory device can hold more values than a single transistor, and the crossbar arrays these cells are packed into make efficient use of space. But analog cells also have drawbacks: training an AI model adjusts its weights billions or trillions of times, a simple task for digital transistors that can be switched on and off indefinitely, but one these physical memory devices can't handle. Changing their physical state trillions of times degrades their structure and reduces their computational fidelity.
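As a rough illustration of what "storing a weight in physical form" means, the sketch below maps a normalized weight onto a single cell's conductance range and reads it back. The conductance range and the programming-noise figure are made-up numbers, not measurements of any real device.

```python
import numpy as np

rng = np.random.default_rng(1)

G_MIN, G_MAX = 5e-6, 25e-6      # assumed conductance range of one cell (siemens)
PROG_NOISE = 0.3e-6             # assumed write noise when programming the cell

def weight_to_conductance(w):
    """Map a normalized weight in [0, 1] onto the cell's conductance range."""
    return G_MIN + w * (G_MAX - G_MIN)

def program_cell(w):
    """'Write' the weight: the cell lands near the target conductance, not on it."""
    return weight_to_conductance(w) + rng.normal(0.0, PROG_NOISE)

def read_weight(g):
    """Read the cell back and convert its conductance to a weight."""
    return (g - G_MIN) / (G_MAX - G_MIN)

w = 0.62                        # an intermediate value a single transistor bit can't hold
g = program_cell(w)
print(f"stored {w:.2f}, read back {read_weight(g):.2f}")   # close, but not exact
```

The readback is close but not exact, which is fine for inference; the trouble, as described next, starts when the cell has to be rewritten over and over during training.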

So training is often done on digital hardware, then the weights are ported to analog devices, where they're locked in for inference and not adjusted further. "It's basically a one-time effort," Gokmen says. "Then you use the same weights over and over again."


Training requires incremental adjustments, however, so the fundamental challenge is how to do those updates efficiently and reliably. The solution they came up with: use electrical pulses to compute each weight gradient and apply the corresponding weight update at the same time. But doing it this way means relying on the device to perform the update correctly, and the devices often fall short, either because of inherent randomness or because of differences from one device to another. "One device might update by a certain amount, but when you go to another device, that amount might be different," he says.
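The following NumPy sketch shows the general shape of this idea: an outer-product weight update applied as coincident pulses, where each cell responds with its own step size (device-to-device variability) plus per-pulse randomness (cycle-to-cycle noise). The step sizes and noise levels are invented for illustration; this is not IBM's circuit or published update rule.

```python
import numpy as np

rng = np.random.default_rng(2)

n_out, n_in = 4, 8
W = np.zeros((n_out, n_in))                  # weights held on the analog array

ideal_step = 0.01                            # nominal conductance change per pulse
# Each cell has its own step size (device-to-device variability)...
device_step = ideal_step * (1.0 + 0.3 * rng.standard_normal((n_out, n_in)))
cycle_noise = 0.2                            # ...and each pulse adds its own noise

def pulsed_update(W, x, delta):
    """Apply the outer-product update of delta and x via pulse coincidences.

    Ideally the array would add np.outer(delta, x); on a real device every
    cell instead moves by its own noisy step size for each pulse it sees.
    """
    ideal = np.outer(delta, x)
    n_pulses = np.round(ideal / ideal_step)                 # pulses per cell
    noisy_step = device_step * (1 + cycle_noise * rng.standard_normal(W.shape))
    return W + n_pulses * noisy_step

x = rng.standard_normal(n_in)        # activations driven along the columns
delta = rng.standard_normal(n_out)   # error signal driven along the rows
W_real = pulsed_update(W, x, delta)
W_ideal = np.outer(delta, x)
print("mean |update error|:", np.abs(W_real - W_ideal).mean())
```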

Beyond this inconsistency, there are issues with the materials themselves. Depending on where a weight sits in the device's conductance range and how much you're trying to change it, an analog memory cell can be harder or easier to adjust. Specifically, Gokmen says, the increments of change tend to be large at first, but once the material reaches a high conductance it saturates and becomes difficult to push further. Likewise, when you reduce the material's conductance, the weight drops quickly at first and then saturates near the bottom of the range. And these, Gokmen said, are just a few of the more than 10 different factors that can go wrong when training AI models on these kinds of devices.
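This saturating behavior is often captured with a simple "soft bounds" device model in the analog-training literature; the sketch below uses that model, with made-up numbers, to show how the first pulses move the conductance a lot while later pulses barely move it at all.

```python
G_MIN, G_MAX = 0.0, 1.0      # normalized conductance range
STEP = 0.05                  # nominal update fraction far from the bounds

def pulse_up(g):
    """Increase conductance; the increment shrinks as the device nears G_MAX."""
    return g + STEP * (G_MAX - g)

def pulse_down(g):
    """Decrease conductance; the mirror-image decrement shrinks near G_MIN."""
    return g - STEP * (g - G_MIN)

g, trace = G_MIN, []
for _ in range(60):          # keep pulsing "up"
    g = pulse_up(g)
    trace.append(g)

print("after 10 up-pulses:", round(trace[9], 3))    # large early moves (~0.40)
print("after 60 up-pulses:", round(trace[-1], 3))   # creeping toward saturation (~0.95)
```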

02 In-memory training algorithms

Materials scientists at IBM Research are working to address some of these issues at the physical level, but in the meantime, other researchers like Gokmen and his team are developing algorithms to overcome obstacles in analog devices.

The team took two approaches to the problem of training models on analog memory devices. The resulting algorithms are called Analog Gradient Accumulation with Dynamic reference (AGAD) and Chopped Tiki-Taka version 2 (c-TTv2). Both are revisions of the team's earlier Tiki-Taka algorithms, named after the Spanish national team's famous "tiki-taka" style of soccer, which relies on lots of short passes to keep possession of the ball.

These approaches address several of the issues that arise from the non-ideal properties of in-memory computing devices, including noise, both from cycle to cycle and in the variability from one device to another. "We can also address the nonlinear switching behavior of the device," Gokmen said, otherwise known as the saturation problem described above. Any of these three issues can lead to inconsistent updates to the model weights during analog in-memory training. The algorithms also help correct for noise in the symmetry point, a measure describing the conductance level at which a memory device settles when it is fed a series of electrical pulses. "It might not be a fixed point, it might be drifting around, and it might vary from device to device," Gokmen said.
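The "chopped" part of c-TTv2 alludes to a classic chopper trick: periodically flip the sign of the signal fed into the analog accumulator and flip it back on readout, so the wanted signal survives while static offsets, such as a drifted symmetry point, average toward zero. The scalar sketch below shows only that generic idea with invented numbers; it is not the published c-TTv2 algorithm.

```python
import numpy as np

rng = np.random.default_rng(3)

true_grad = 0.2      # the signal we want to accumulate
offset = 0.5         # static device offset, e.g. a drifted symmetry point
noise = 0.1          # random read/accumulation noise

def analog_accumulate(signal):
    """One noisy analog accumulation step with a fixed, unknown offset."""
    return signal + offset + noise * rng.standard_normal()

# Plain accumulation: the offset piles up right along with the gradient.
plain = np.mean([analog_accumulate(true_grad) for _ in range(200)])

# Chopped accumulation: flip the input sign every step and undo it on readout.
# The gradient survives the double flip; the static offset alternates sign
# and averages toward zero.
chopped = np.mean([s * analog_accumulate(s * true_grad)
                   for s in (1 if i % 2 == 0 else -1 for i in range(200))])

print(f"true {true_grad:.2f} | plain {plain:.2f} | chopped {chopped:.2f}")
```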

In simulations of in-memory model training, they found that both AGAD and c-TTv2 achieved significantly lower error rates than the previous TTv2 algorithm.

One of the major advances they made with these algorithms was the ability to perform model weight updates entirely in memory, rather than offloading them to digital devices. "We're really pushing this internally," Gokmen said. "We're ahead of the curve in terms of algorithm development." Now they're preparing to train small models on available analog devices, but those plans are still in the works and depend on having the right analog hardware.

03 Next steps

The field of analog computing is still in its infancy. Although the team's algorithms address roughly half of the material problems involved in training on in-memory analog processors, test results still showed a performance gap for large neural networks. Gokmen said their follow-up work will explore why. "We still don't understand why there is such a gap." To build on these results, Gokmen and his team are working with researchers at Rensselaer Polytechnic Institute to devise a mathematical explanation and demonstration of the effects they observed in their experiments.

Scientists at IBM Research are also developing hardware to run future AI models, including an experimental core design demonstrated last year that uses analog in-memory computing to handle inference workloads. There are digital processors that can perform in-memory computing as well, including IBM's brain-inspired AIU NorthPole chip. Researchers working on analog in-memory computing believe deep neural networks will run better on this hardware if their architecture is co-designed with the algorithms that run them on analog devices, and these training algorithms are a step toward that goal.



Reference link: 

https://research.ibm.com/blog/analog-in-memory-training-algorithms

