
IBM chip vs. GPU

2024-09-30


Lower latency typically comes at the expense of energy efficiency, but in newly published experimental results, IBM's brain-inspired NorthPole research prototype chip achieved significantly lower latency than the next fastest GPU and significantly higher energy efficiency than the next most efficient GPU.

As researchers race to develop the next generation of computer chips, artificial intelligence is front and center. With the recent surge in applications for generative AI, including large language models, it's clear that traditional CPUs and GPUs struggle to deliver the necessary combination of speed and energy efficiency. To deliver AI at scale, especially for agent workflows and digital workers, the hardware running these models needs to be faster. At the same time, the environmental impact of AI's energy use is a pressing issue, so cutting its power consumption is critical. At the IBM Research lab in Almaden, California, a team has been rethinking the foundations of chip architecture to achieve both goals, and its latest work shows how future processors can consume less energy while running faster.

AIU NorthPole is an AI inference accelerator chip that IBM Research first unveiled last year. In inference tests on a 3-billion-parameter LLM derived from the IBM Granite-8B-Code-Base model, NorthPole achieved latency below 1 millisecond per token, 46.9 times faster than the next-lowest-latency GPU. In an off-the-shelf, thin-profile 2U server running 16 NorthPole processors communicating over PCIe, the team behind the chip found it could sustain a throughput of 28,356 tokens per second on the same model. It reached those speeds while remaining 72.7 times more energy efficient than the next-most-efficient GPU.
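As a rough sanity check on how those two headline figures relate, consider the sketch below. The latency and throughput numbers come from the article; the implied number of concurrent sequences is inferred here purely for illustration and is not something IBM reports.

```python
# Back-of-envelope check (illustrative only): how per-token latency and
# aggregate throughput relate for a pipelined, multi-stream appliance.

latency_s = 1e-3          # < 1 ms per token per sequence (from the article)
throughput_tps = 28_356   # tokens/s across the 16-card server (from the article)

# A single sequence can produce at most 1/latency tokens per second, so
# sustaining the aggregate rate implies roughly this many sequences in flight:
min_streams = throughput_tps * latency_s
print(f"~{min_streams:.0f}+ concurrent sequences implied")  # prints ~28+
```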

The research prototype NorthPole achieves lower latency and higher energy efficiency than four GPUs typically used for LLM inference.


The team presented its findings today at the IEEE High Performance Extreme Computing (HPEC) Conference. The new performance numbers build on results from last October, when the team showed that NorthPole could perform neural inference faster and more efficiently than other chips on the market for edge applications. In those experiments, NorthPole was 25 times more energy efficient than common 12 nm GPUs and 14 nm CPUs, measured in frames interpreted per unit of energy.

NorthPole is manufactured on a 12 nm process, with each chip packing 22 billion transistors into 795 square millimeters. According to the results published in Science, the chip also had lower latency than every other chip tested, even those built on smaller process nodes. Those tests ran the ResNet-50 image-recognition and YOLOv4 object-detection models, as the team was focused on visual recognition tasks for applications such as self-driving cars. A year later, the new results come from trying the NorthPole chip on the larger, 3-billion-parameter Granite LLM.

"The most important thing here is the huge improvement in quality. These new results are on par with our scientific results, but in completely different application domains," said Dharmendra Modha, an IBM researcher who led the team that developed the chip. "Given that NorthPole's architecture works so well in completely different domains, these new results not only underscore the broad applicability of the architecture, but also the importance of fundamental research."

A standard 2U server can accommodate four NorthPole cards in each of its four bays


Modha said low latency is critical for AI to run smoothly as enterprises deploy agent workflows, digital workers, and interactive conversations. But there is a fundamental tension between latency and energy efficiency: often, improvements in one come at the expense of the other.

One of the main obstacles to reducing latency and power consumption for AI inference is the so-called von Neumann bottleneck. Almost all modern microprocessors, CPUs and GPUs included, use a von Neumann architecture, in which memory is physically separated from the processor. While this design has historically had the advantage of being simple and flexible, shuttling data back and forth between memory and compute limits the speed of the processor. This is especially true for AI models, whose computations are simple but enormously numerous. And while processor efficiency has tripled every two years, the bandwidth between memory and compute has grown at only about half that rate. On top of that, high-bandwidth memory is expensive.
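To make the bottleneck concrete, here is a rough estimate of why memory traffic, not arithmetic, often sets the latency floor for single-stream LLM decoding. The 3-billion-parameter figure is from the article; the 16-bit weight size and the ~200 GB/s off-chip bandwidth are illustrative assumptions, not measurements of any particular GPU.

```python
# Illustrative estimate: each generated token must read (roughly) every
# model weight once, so off-chip bandwidth bounds per-token latency.

params = 3e9            # 3B-parameter model (from the article)
bytes_per_weight = 2    # assume 16-bit weights (assumption)
dram_bw = 200e9         # assume ~200 GB/s off-chip bandwidth (assumption)

bytes_per_token = params * bytes_per_weight
floor_ms = bytes_per_token / dram_bw * 1e3
print(f"bandwidth-bound floor: ~{floor_ms:.0f} ms/token")  # ~30 ms
```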

NorthPole's design eliminates this mismatch by putting memory and processing in the same place, an architecture called memory-on-chip or in-memory computing. Inspired by the brain, NorthPole tightly couples memory with the chip's compute units and control logic. This results in a massive 13 TB per second of on-chip memory bandwidth.
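Rerunning the same rough estimate with on-chip numbers suggests why this design can reach sub-millisecond tokens. The 13 TB/s figure is from the article; the 4-bit weights and the even 16-way split are described later in the piece, and treating the weight read as the only cost is a deliberate simplification.

```python
# Same back-of-envelope estimate, revisited with on-chip figures.

params = 3e9            # 3B-parameter model (from the article)
bytes_per_weight = 0.5  # 4-bit quantized weights (from the article)
cards = 16              # weights partitioned across 16 cards (from the article)
on_chip_bw = 13e12      # 13 TB/s of on-chip bandwidth per chip (from the article)

# Each card streams only its own slice of the weights from on-chip memory:
bytes_per_card = params * bytes_per_weight / cards
floor_us = bytes_per_card / on_chip_bw * 1e6
print(f"per-card weight-read floor: ~{floor_us:.0f} us/token")  # ~7 us
```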

The NorthPole team mapped the 3-billion-parameter LLM onto 16 cards: one transformer layer per card on 14 cards, with the output layer split across the remaining two cards


The team’s next challenge was to see whether NorthPole, designed for edge inference, could handle language models in the data center. At first this seemed a daunting task, as an LLM would not fit into a single NorthPole chip's on-chip memory.

To tackle the challenge, the team chose to run the 3-billion-parameter Granite LLM on a 16-card NorthPole setup. They mapped one transformer layer onto each of 14 cards and split the output layer across the remaining two. LLMs are typically limited by memory bandwidth, but in this pipeline-parallel setup very little data needs to move between cards: PCIe is sufficient, and no high-speed network is required. Because each card's on-chip memory holds its model weights and its portion of the key-value (KV) cache, little data has to pass between the separate PCIe cards as tokens are generated. The model was quantized to 4-bit weights and activations, and the quantized model was fine-tuned to maintain accuracy.
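A minimal sketch of that card mapping, as described above: 14 cards each hold one transformer layer, and the output (vocabulary projection) layer is split across the last two. The card and layer names below are purely illustrative; this is not IBM's software interface.

```python
# Sketch of the 16-card pipeline mapping described in the article.

N_TRANSFORMER_LAYERS = 14  # one transformer layer per card (from the article)

def build_pipeline() -> dict[str, list[str]]:
    mapping = {}
    # Cards 0..13: one transformer layer each, in pipeline order.
    for i in range(N_TRANSFORMER_LAYERS):
        mapping[f"card{i}"] = [f"transformer_layer_{i}"]
    # Cards 14-15: the output layer, split into two shards.
    mapping["card14"] = ["output_layer_shard_0"]
    mapping["card15"] = ["output_layer_shard_1"]
    return mapping

if __name__ == "__main__":
    for card, layers in build_pipeline().items():
        print(card, "->", layers)
```

Because activations flow card to card in pipeline order, only a single layer's output crosses PCIe at each step, which is why no high-speed interconnect is needed.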

Based on the success of the latest experiments, Modha said his team is currently working on building appliances with more NorthPole chips, with plans to map larger models onto them.

IBM Research scientists are working to develop server racks filled with hundreds of NorthPole cards to perform large numbers of inference operations at faster speeds and lower energy consumption than comparable GPU-based hardware.

While the new performance results are groundbreaking, Modha is confident his team can push NorthPole further, improving its energy efficiency while lowering its latency. The key, he said, is to innovate across the entire vertical stack. That will require co-designing algorithms from scratch to run on next-generation hardware, leveraging technology scaling and packaging, and envisioning entirely new systems and inference devices, advances he and others at IBM Research are already working on.


Reference link: 

https://research.ibm.com/blog/northpole-llm-inference-results


