power cut

Researchers upend AI status quo by eliminating matrix multiplication in LLMs

Running AI models without floating point matrix math could mean far less power consumption.

Benj Edwards

Researchers claim to have developed a new way to run AI language models more efficiently by eliminating matrix multiplication from the process. This fundamentally redesigns neural network operations that are currently accelerated by GPU chips. The findings, detailed in a recent preprint paper from researchers at the University of California Santa Cruz, UC Davis, LuxiTech, and Soochow University, could have deep implications for the environmental impact and operational costs of AI systems.

Matrix multiplication (often abbreviated to "MatMul") is at the center of most neural network computational tasks today, and GPUs are particularly good at executing the math quickly because they can perform large numbers of multiplication operations in parallel. That ability momentarily made Nvidia the most valuable company in the world last week; the company currently holds an estimated 98 percent market share for data center GPUs, which are commonly used to power AI systems like ChatGPT and Google Gemini.
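For a sense of scale, a single fully connected layer in a transformer is one MatMul, and modern models stack thousands of them per token. The short Python sketch below (with made-up layer sizes, not figures from the paper) counts the multiply-add operations a GPU would parallelize for just one such layer.

```python
import numpy as np

# Illustrative sizes only; real transformer layers vary by model.
batch, d_in, d_out = 32, 4096, 4096
x = np.random.randn(batch, d_in).astype(np.float32)   # activations
W = np.random.randn(d_in, d_out).astype(np.float32)   # floating-point weights

y = x @ W  # one MatMul that a GPU executes as many multiplications in parallel
print(f"multiply-adds in this single layer: {batch * d_in * d_out:,}")  # ~537 million
```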

In the new paper, titled "Scalable MatMul-free Language Modeling," the researchers describe creating a custom 2.7 billion parameter model without using MatMul that features similar performance to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per second on a GPU that was accelerated by a custom-programmed FPGA chip that uses about 13 watts of power (not counting the GPU's power draw). The implication is that a more efficient FPGA "paves the way for the development of more efficient and hardware-friendly architectures," they write.

The technique has not yet been peer-reviewed, but the researchers—Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian—claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment on resource-constrained hardware like smartphones.

Doing away with matrix math

In the paper, the researchers mention BitNet (the so-called "1-bit" transformer technique that made the rounds as a preprint in October) as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights in language models, successfully scaling up to 3 billion parameters while maintaining competitive performance.

However, they note that BitNet still relied on matrix multiplications in its self-attention mechanism. That limitation served as a motivation for the current study, pushing them to develop a completely "MatMul-free" architecture that could maintain performance while eliminating matrix multiplications even in the attention mechanism.

The researchers' approach involves three main innovations: First, they created a custom LLM and constrained it to use only ternary values (-1, 0, 1) instead of traditional floating-point numbers, which allows for simpler computations. Second, the researchers replaced the computationally expensive self-attention mechanism in traditional language models with a simpler, more efficient unit (which they call a MatMul-free Linear Gated Recurrent Unit, or MLGRU) that processes words sequentially using basic arithmetic operations instead of matrix multiplications.
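To make those two ideas concrete, here is a minimal Python sketch, not the authors' code: a ternary "dense" layer whose multiplications reduce to adding, subtracting, or skipping inputs, and a toy gated recurrence that mixes tokens one at a time with elementwise arithmetic. The function names, gate formulas, and sizes are illustrative assumptions rather than the paper's exact MLGRU.

```python
import numpy as np

def ternary_dense(x, W_ternary):
    """x: (d_in,) activations; W_ternary: (d_in, d_out) with entries in {-1, 0, 1}."""
    y = np.zeros(W_ternary.shape[1], dtype=x.dtype)
    for j in range(W_ternary.shape[1]):
        col = W_ternary[:, j]
        y[j] = x[col == 1].sum() - x[col == -1].sum()  # additions/subtractions only
    return y

def toy_gated_recurrence(tokens, W_f, W_c):
    """Mix a sequence token by token with elementwise gating (illustrative only)."""
    h = np.zeros(W_f.shape[1], dtype=np.float32)
    outputs = []
    for x in tokens:                                       # sequential processing
        f = 1.0 / (1.0 + np.exp(-ternary_dense(x, W_f)))   # forget gate in (0, 1)
        c = np.tanh(ternary_dense(x, W_c))                 # candidate state
        h = f * h + (1.0 - f) * c                          # elementwise update, no MatMul
        outputs.append(h)
    return outputs

d = 8
W_f = np.random.choice([-1, 0, 1], size=(d, d)).astype(np.float32)
W_c = np.random.choice([-1, 0, 1], size=(d, d)).astype(np.float32)
x = np.random.randn(d).astype(np.float32)
assert np.allclose(ternary_dense(x, W_f), x @ W_f, atol=1e-5)  # same answer, no multiplies
tokens = [np.random.randn(d).astype(np.float32) for _ in range(5)]
print(len(toy_gated_recurrence(tokens, W_f, W_c)))  # 5 hidden-state vectors
```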

Third, they adapted a Gated Linear Unit (GLU)—a gating mechanism to control information flow in neural networks—to use ternary weights for channel mixing. Channel mixing refers to the process of combining and transforming different aspects or features of the data the AI is working with, similar to how a DJ might mix different audio channels to create a cohesive song.
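Here is a hedged sketch of what a GLU-style channel mixer with ternary weights might look like; the function name, the specific sigmoid gate, and the layer sizes are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ternary_glu(x, W_gate, W_up, W_down):
    """GLU-style channel mixing; all weight matrices contain only {-1, 0, 1}."""
    # With ternary weights, each `@` below could be computed with the
    # add/subtract/skip trick from the earlier sketch rather than a true MatMul.
    gate = sigmoid(x @ W_gate)   # gate decides how much of each channel passes
    mixed = (x @ W_up) * gate    # elementwise gating of the mixed channels
    return mixed @ W_down        # project back to the model dimension

d, h = 16, 32
x = np.random.randn(d).astype(np.float32)
W_gate = np.random.choice([-1, 0, 1], size=(d, h)).astype(np.float32)
W_up = np.random.choice([-1, 0, 1], size=(d, h)).astype(np.float32)
W_down = np.random.choice([-1, 0, 1], size=(h, d)).astype(np.float32)
print(ternary_glu(x, W_gate, W_up, W_down).shape)  # (16,)
```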

These changes, combined with a custom hardware implementation to accelerate ternary operations through the aforementioned FPGA chip, allowed the researchers to achieve what they claim is performance comparable to state-of-the-art models while reducing energy use. Although they ran comparisons on GPUs to benchmark against traditional models, the MatMul-free models are designed to operate efficiently on hardware that is optimized for simpler arithmetic operations, such as FPGAs. This suggests that these models could potentially be run on various types of hardware, including those with more limited computational resources than GPUs.

This chart taken from the paper shows relative performance of the MatMul-free LLM compared to a conventional (Transformer++) LLM on benchmarks. Credit: Zhu, et al

To evaluate their approach, the researchers compared their MatMul-free LM against a reproduced Llama-2-style model (which they call "Transformer++") across three model sizes: 370M, 1.3B, and 2.7B parameters. All models were pre-trained on the SlimPajama dataset, with the larger models trained on 100 billion tokens each. The researchers claim the MatMul-free LM achieved competitive performance against the Llama-2 baseline on several benchmark tasks, including question answering, commonsense reasoning, and physical understanding.

In addition to power reductions, the researchers' MatMul-free LM significantly reduced memory usage. Their optimized GPU implementation decreased memory consumption by up to 61 percent during training compared to an unoptimized baseline.

To be clear, these 2.7 billion-parameter models are a long way from the current best LLMs on the market, such as GPT-4, which is estimated to have over 1 trillion parameters in aggregate. GPT-3 shipped with 175 billion parameters in 2020. A higher parameter count generally means more complexity (and, roughly, more capability) baked into a model, but at the same time, researchers have been finding ways to achieve higher-level LLM performance with fewer parameters.

So, we're not talking ChatGPT-level processing capability here yet, but the UC Santa Cruz technique does not necessarily preclude that level of performance, given more resources.

Extrapolating into the future

The researchers say that scaling laws observed in their experiments suggest the MatMul-free LM may also outperform traditional LLMs at very large scales. They project that their approach could theoretically intersect with and surpass the performance of standard LLMs at scales around 10²³ FLOPs of training compute, which is roughly equivalent to the compute required to train models like Meta's Llama-3 8B or Llama-2 70B.
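As a rough sanity check on that figure, a common rule of thumb estimates training compute as about 6 × parameters × training tokens; the sketch below applies it using publicly reported approximate token counts, which are assumptions here rather than numbers from the paper, and both results land within roughly an order of magnitude of 10²³ FLOPs.

```python
# Rule-of-thumb training compute: FLOPs ≈ 6 × parameters × training tokens.
def train_flops(params, tokens):
    return 6 * params * tokens

print(f"Llama-2 70B: ~{train_flops(70e9, 2e12):.1e} FLOPs")   # ~8.4e+23
print(f"Llama-3 8B:  ~{train_flops(8e9, 15e12):.1e} FLOPs")   # ~7.2e+23
```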

However, the authors note that their work has limitations. The MatMul-free LM has not been tested on extremely large-scale models (e.g., 100 billion-plus parameters) due to computational constraints. They call for institutions with larger resources to invest in scaling up and further developing this lightweight approach to language modeling.

The article was updated on June 26, 2024, at 9:20 AM to remove an inaccurate power estimate, created by the author, for running an LLM locally on an RTX 3060.

Listing image: Getty Images

Benj Edwards, Senior AI Reporter
Benj Edwards is Ars Technica's Senior AI Reporter and founded the site's dedicated AI beat in 2022. He's also a widely cited tech historian. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.