
NPU, how to see it?

2024-11-16


Source: content compiled from Quadric

There are dozens of NPU options on the market today, each with competing and conflicting claims about efficiency, programmability, and flexibility. One of the most obvious differences among these options is the answer to a seemingly simple question: where is the "best" place for memory relative to compute in the NPU system hierarchy?

Some NPU architectural styles rely heavily, or even exclusively, on direct access to system DRAM, exploiting the cost-per-bit advantage of high-volume commodity DRAM over other memory options, but at the price of partitioning the problem across multiple chips. Other NPU options rely heavily or exclusively on on-chip SRAM for speed and simplicity, but at the expense of high silicon area cost and reduced flexibility. Still others employ novel memory types (such as MRAM) or novel analog circuit structures, neither of which has a proven, high-volume manufacturing record. Despite this variety, NPU options generally align with one of three styles of memory locality, and those three styles bear a striking resemblance (pun intended) to the children's story of the three bears!

In the fairy tale "Goldilocks and the Three Bears," Goldi must choose among three beds, three chairs, and three bowls of porridge. One bowl is "too hot," another is "too cold," and the last is "just right." If Goldi were making the architectural choices for AI processing in a modern edge or device SoC, she would likewise face three options for placing compute relative to the local memory that holds activations and weights.

 

01 In, at, or near?

The terms compute-in-memory (CIM) and compute-near-memory (CNM) originated in architectural discussions of data center system design, and a large body of literature debates the merits of the various approaches. All of that analysis boils down to minimizing the power consumed, and the latency incurred, in moving working data sets between the processing and storage elements of the data center.

In the world of systems-on-chip (SoCs) optimized specifically for AI inference on edge devices, the same principles apply, but there are three levels of proximity to consider: compute-in-memory, compute-at-memory, and compute-near-memory. Let's quickly examine each.

 

02 In-memory computing: a mirage

In-memory computing refers to the various attempts over the past decade to build computation into the memory bit cells or memory macros used in SoC designs. Almost all of these attempts employ some form of analog computing inside the bit cells of the SRAM or DRAM (or a more exotic memory such as MRAM) in question. In theory, these approaches speed up computation and reduce power by performing the arithmetic (particularly the multiplications) in the analog domain and in a massively parallel manner. While this seems like a compelling idea, it has failed to deliver to date.

The reasons for this failure are multifaceted. First, on-chip SRAM has been refined and optimized for nearly 40 years, as has off-chip DRAM. Grafting analog compute onto these highly optimized bit cells yields area and power inefficiencies compared with the unmodified memory, and injecting such cells into the tried-and-true standard-cell design flow used by SoC companies has proven unworkable. Another major drawback of in-memory computing is that these analog approaches perform only a very limited subset of the computations required for AI inference, namely the matrix multiplications at the heart of convolution operations. No in-memory compute block has been built with enough flexibility to cover all possible convolution variations (kernel size, stride, dilation) and all possible MatMul configurations, and in-memory analog compute cannot implement the other 2,300 operations in the PyTorch model world. An in-memory compute solution therefore still needs full-fledged digital NPU capability alongside it, at which point the in-memory "enhancement" becomes a burden in area and power compared with simply using that memory in the conventional way for all of the computation that runs on the accompanying digital NPU.
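To make the operator-coverage argument concrete, here is a minimal sketch (assuming PyTorch and torchvision are installed, and using an off-the-shelf ResNet-18 purely as an example) that separates the MatMul-like layers an analog in-memory array could in principle accelerate from everything else that still needs a general-purpose digital NPU. The split shown is illustrative, not a statement about any particular in-memory product.

```python
# Illustrative sketch: count which leaf layers of a small model are
# matrix-multiply style work (in principle analog-friendly) versus the
# many other operators that require general-purpose digital compute.
import torch.nn as nn
import torchvision.models as models

# Layer types whose core arithmetic is a (possibly strided/dilated) MatMul.
MATMUL_LIKE = (nn.Conv2d, nn.Linear)

model = models.resnet18(weights=None)  # example model, chosen for illustration

matmul_like, everything_else = [], []
for name, module in model.named_modules():
    if len(list(module.children())) > 0:   # skip container modules
        continue
    (matmul_like if isinstance(module, MATMUL_LIKE) else everything_else).append(
        (name, type(module).__name__)
    )

print(f"MatMul-like layers (analog-friendly):   {len(matmul_like)}")
print(f"Other operators (need digital compute): {len(everything_else)}")
for name, kind in everything_else[:5]:
    print(f"  e.g. {name}: {kind}")         # BatchNorm2d, ReLU, MaxPool2d, ...
```

Even in this simple network, a substantial share of the layers falls outside the MatMul-only envelope, which is the point the paragraph above is making.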

In the final analysis, in-memory compute solutions for edge device SoCs are "too limited" to be of any use to our intrepid chip designer Goldi.

 

03 Near-Memory Computing: Near, Yet Still Far Away

At the other end of the spectrum of SoC inference design is the approach that minimizes on-chip SRAM and maximizes the use of mass-produced, low-cost, high-capacity memory (primarily DDR DRAM). This concept leans on the cost advantage of high-volume DRAM and assumes that, with minimal SRAM on the SoC and sufficient bandwidth to that low-cost DRAM, the AI inference subsystem can lower SoC cost while relying on a fast connection to external memory (usually a dedicated DDR interface managed solely by the AI engine) to maintain performance.

At first glance, the near-memory approach does reduce the SoC die area devoted to AI and thereby trims system cost slightly, but it has two major drawbacks that undermine system performance. First, the power consumption of such a system is excessive. Consider the following table, which shows the relative energy cost of moving a 32-bit word of data to or from the multiply-accumulate (MAC) logic of an AI NPU core:

[Table: relative energy cost of moving a 32-bit word to or from the MAC unit, from adjacent local SRAM out to off-chip DDR]
Each data transfer from the SoC to external DDR consumes 225 to 600 times the energy (power) of a transfer to the SRAM immediately adjacent to the MAC unit. Even on-chip SRAM that is comparatively "far" from the MAC unit is 3 to 8 times more energy efficient than going off-chip. Since most of these SoCs go into power-constrained consumer devices, relying primarily on external memory makes the near-memory design point impractical on power grounds alone. Furthermore, always going to external memory adds latency: as newer, more complex models emerge, likely with more irregular data access patterns than an old ResNet, a near-memory solution will suffer severe performance degradation.
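A back-of-envelope calculation shows why this gap dominates at the system level. The per-word energies below are placeholder values chosen only to reflect the orders of magnitude discussed above (they are not taken from the table), and the traffic count is an assumed workload:

```python
# Back-of-envelope energy comparison for one inference's activation traffic.
# All numbers here are illustrative assumptions; only the relative
# orders of magnitude matter.
LOCAL_SRAM_PJ   = 1.0     # energy per 32-bit word, SRAM adjacent to the MACs
FAR_SRAM_PJ     = 5.0     # larger on-chip SRAM, farther from the MACs
OFF_CHIP_DDR_PJ = 400.0   # external DDR access, hundreds of times costlier

WORDS_MOVED = 50_000_000  # assumed 32-bit activation words moved per inference

def traffic_energy_mj(picojoules_per_word: float) -> float:
    """Total energy in millijoules for moving WORDS_MOVED words."""
    return picojoules_per_word * WORDS_MOVED * 1e-9  # 1 pJ = 1e-9 mJ

for label, pj in [("local SRAM", LOCAL_SRAM_PJ),
                  ("far on-chip SRAM", FAR_SRAM_PJ),
                  ("off-chip DDR", OFF_CHIP_DDR_PJ)]:
    print(f"{label:18s}: {traffic_energy_mj(pj):8.2f} mJ per inference")
```

Under these assumed numbers, the same activation traffic costs tens of millijoules per inference over DDR versus a fraction of a millijoule in adjacent SRAM, which is exactly the power problem a battery-powered consumer device cannot absorb.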

The double whammy of excessive power and degraded performance makes the near-memory approach "too hot" for our chip architect Goldi.

04 At-Memory Computing: Just Right

Just as the Goldilocks fable always offers a "just right" alternative, at-memory computing architectures are the just-right solution for edge and device SoCs. Referring again to the data-transfer energy costs in the table above, the best location for memory is clearly the on-chip SRAM immediately adjacent to the compute. Saving a computed intermediate activation value into local SRAM consumes roughly 200 times less power than pushing that value off-chip. That does not mean you should use only on-chip SRAM, however; doing so would place a hard upper limit on the model size (weight footprint) that a given implementation can support.

For the SoC designer, the best option is to combine small local SRAMs (preferably distributed in large numbers across the array of compute elements) with intelligent scheduling of data movement between those SRAMs and off-chip DDR memory, so as to minimize both system power consumption and data-access latency.
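As a rough illustration of that scheduling idea, the hypothetical sketch below greedily pins the weights with the highest reuse-per-byte into a fixed local SRAM budget and streams the rest from DDR. The layer names, sizes, and the greedy heuristic itself are assumptions made for illustration, not any vendor's actual scheduler.

```python
# Hypothetical residency planner: pin the most frequently reused weights in
# local SRAM, stream the rest from DDR, and report the resulting DDR traffic.
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    weight_bytes: int
    reuse: int            # how many times the weights are reused per inference

def plan_residency(layers: list[Layer], sram_budget: int) -> tuple[list[str], int]:
    """Return the layers pinned in local SRAM and the DDR bytes streamed per inference."""
    # Rank layers by reuse per byte so the SRAM budget buys the most traffic savings.
    ranked = sorted(layers, key=lambda l: l.reuse / l.weight_bytes, reverse=True)
    pinned, used, ddr_bytes = [], 0, 0
    for layer in ranked:
        if used + layer.weight_bytes <= sram_budget:
            pinned.append(layer.name)
            used += layer.weight_bytes
        else:
            # Streamed weights cross the DDR interface on every reuse.
            ddr_bytes += layer.weight_bytes * layer.reuse
    return pinned, ddr_bytes

# Illustrative layer footprints and a 1 MB local SRAM budget.
layers = [Layer("conv1", 64_000, 8), Layer("conv2", 512_000, 4),
          Layer("fc", 4_000_000, 1)]
pinned, ddr = plan_residency(layers, sram_budget=1_000_000)
print("pinned in SRAM:", pinned)
print("streamed from DDR per inference:", ddr, "bytes")
```

Real schedulers are far more sophisticated (tiling activations as well as weights, overlapping DMA with compute), but even this toy version shows the goal: spend the scarce on-chip SRAM where it removes the most off-chip traffic.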


