The company also demonstrates initial training of a one-trillion-parameter AI model on a single machine using conventional DDR5 memory chips.

Cerebras Systems announced on Tuesday that it’s made Meta Platforms’s Llama perform as well in a small version as it does in a much larger version by adding the increasingly popular approach in generative artificial intelligence (AI) known as “chain of thought.” The AI computer maker announced the advance at the start of the annual NeurIPS conference on AI.

“This is a closed-source only capability, but we wanted to bring this capability to the most popular ecosystem, which is Llama,” said James Wang, head of Cerebras’s product marketing effort, in an interview with ZDNET.

The project is the latest in a line of open-source projects Cerebras has done to demonstrate the capabilities of its purpose-built AI computer, the “CS-3,” which it sells in competition with the status quo in AI — GPU chips from the customary vendors, Nvidia and AMD.

The company was able to get the open-source Llama 3.1 model that uses only 70 billion parameters to reach the same or better accuracy on various benchmark tests as the much larger 405-billion-parameter version of Llama.

Those tests include CRUX, a test of “complex reasoning tasks” developed at MIT and Meta, and LiveCodeBench, a test of code-generation challenges developed at U.C. Berkeley, MIT, and Cornell University, among others.

Chain of thought can enable models built with less training time, data, and computing power to equal or surpass a larger model’s performance.

“Essentially, we’re now beating Llama 3.1 405B, a model that’s some seven times larger, just by thinking more at inference time,” said Wang.

The idea behind chain-of-thought processing is for the AI model to detail the sequence of calculations performed in pursuit of the final answer, to achieve “explainable” AI. Such explainable AI could conceivably give humans greater confidence in AI’s predictions by disclosing the basis for answers.
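In practice, the difference comes down to how the model is prompted. The snippet below is purely illustrative, contrasting a direct prompt with a chain-of-thought prompt; the wording is an example of the general technique, not a prompt used by Cerebras, Meta, or OpenAI.

```python
# Illustrative only: a direct prompt versus a chain-of-thought prompt.
# These strings are examples of the general technique, not any vendor's prompts.

QUESTION = "A train travels 120 km in 90 minutes. What is its average speed in km/h?"

direct_prompt = f"{QUESTION}\nAnswer with a single number."

chain_of_thought_prompt = (
    f"{QUESTION}\n"
    "Think step by step: write out each intermediate calculation, "
    "then state the final answer on its own line, prefixed with 'Answer:'."
)

if __name__ == "__main__":
    print(direct_prompt)
    print("---")
    print(chain_of_thought_prompt)
```

The chain-of-thought version asks the model to expose its intermediate steps, which is what makes the final answer auditable by a human reader.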

OpenAI has popularized the chain-of-thought approach with its recently released “o1” large language model.

Cerebras’s answer to o1, dubbed “Cerebras Planning and Optimization,” or CePO, operates by requiring Llama — at the time the prompt is submitted — to “produce a plan to solve the given problem step-by-step,” carry out the plan repeatedly, analyze the responses to each execution, and then select a “best of” answer.

“Unlike a traditional LLM, where the code is just literally token by token by token, this will look at its own code that it generated and see, does it make sense?” Wang explained. “Are there syntax errors? Does it actually accomplish what the person asks for? And it will run this kind of logic loop of plan execution and cross-checking multiple times.”
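Cerebras has not published CePO’s code, but the loop Wang describes follows a recognizable plan, execute, cross-check, and select pattern. The sketch below is a minimal illustration of that pattern; the `llm` function is a hypothetical stand-in for a call to a Llama-style model, not a real API.

```python
# A minimal sketch of the plan-execute-verify-select pattern described for CePO.
# `llm` is a hypothetical stand-in for a call to a Llama-style model.

def llm(prompt: str) -> str:
    raise NotImplementedError("Replace with an actual call to a language model.")

def plan_and_select(problem: str, num_attempts: int = 4) -> str:
    # 1. Ask the model for a step-by-step plan to solve the problem.
    plan = llm(f"Produce a plan to solve this problem step by step:\n{problem}")

    candidates = []
    for _ in range(num_attempts):
        # 2. Carry out the plan to produce a candidate answer (or candidate code).
        attempt = llm(f"Follow this plan to solve the problem.\nPlan:\n{plan}\nProblem:\n{problem}")
        # 3. Cross-check the candidate: does it make sense, are there syntax
        #    errors, does it accomplish what was asked?
        review = llm(f"Review this solution for errors and whether it answers the problem:\n{attempt}")
        candidates.append((attempt, review))

    # 4. Select a "best of" answer from the reviewed candidates.
    summary = "\n\n".join(f"Candidate:\n{a}\nReview:\n{r}" for a, r in candidates)
    return llm(f"Given these candidates and their reviews, return the single best final answer:\n{summary}")
```

Each pass through the loop adds model calls, which is why this style of reasoning costs more at inference time.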

In addition to matching or exceeding the 405B model of Llama 3.1, Cerebras was able to take the latest Llama version, 3.3, and make it perform at the level of “frontier” large language models such as Anthropic’s Claude 3.5 Sonnet and OpenAI’s GPT-4 Turbo.

“This is the first time, I think, anyone has taken a 70B model, which is generally considered medium-sized, and achieved a frontier-level performance,” said Wang.

Humorously, Cerebras also put Llama to the “Strawberry Test,” a prompt that alludes to the “strawberry” code name for OpenAI’s o1. When the number of r’s in the word is multiplied, as in “strrrawberrry,” and language models are asked to count them, they often fail. Using chain of thought, Llama 3.1 was able to accurately count the varying numbers of r’s.
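For reference, the test itself is easy to script. The sketch below builds “strrrawberrry”-style words and computes the true letter counts a model should arrive at; the `ask_model` stub marks where a call to Llama would go in an actual harness.

```python
# A small harness for a "strawberry"-style test: pad a word with extra copies
# of a letter and compare a model's count against the true count.

def make_word(base: str = "strawberry", letter: str = "r", extra: int = 2) -> str:
    # Insert extra copies of `letter` after its first occurrence, e.g. "strrrawberry".
    i = base.index(letter)
    return base[: i + 1] + letter * extra + base[i + 1:]

def true_count(word: str, letter: str = "r") -> int:
    return word.count(letter)

def ask_model(word: str) -> int:
    # Hypothetical placeholder: prompt the model to count the r's and parse its reply.
    raise NotImplementedError

if __name__ == "__main__":
    for extra in range(4):
        word = make_word(extra=extra)
        print(word, "contains", true_count(word), "r's")
```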

From a corporate perspective, Cerebras is eager to demonstrate the hardware and software advantage of its AI computer, the CS-3.

The work on Llama was done on CS-3s using Cerebras’s WSE-3 chip, the world’s largest semiconductor. The company was able to run the Llama 3.1 70B model, as well as the newer Llama 3.3, with chain of thought, without the typical lag seen in o1 and other models running on Nvidia and AMD chips, said Wang.

“The classical view was that improvements would plateau and you would need algorithmic breakthroughs,” he said. “Scaling laws say, ‘No, you can just throw more compute at it with no practical limit.’ The type of neural network, reasoning method, etc. affects the rate of improvement, but not its scalable nature.”

In different implementations, chain of thought can output either a verbose series of its intermediate results or a kind of status message saying something like “thinking.” Asked which Cerebras opted for, Wang said that he had not himself seen the actual output, but that “it’s probably verbose. When we release stuff that’s designed to serve Llama and open-source models, people like to see the intermediate results.”

Separately, Cerebras demonstrated the initial training of a one-trillion-parameter AI model on a single CS-3 machine, with the model’s weights held in an attached memory appliance called MemX. The CS-3 system configured this way, Cerebras claims, would replace 287 of Nvidia’s top-of-the-line “Grace Blackwell 200” combined CPU-and-GPU chips needed to access an equivalent amount of memory.
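To get a sense of the arithmetic behind such memory comparisons, the back-of-the-envelope sketch below estimates the training-state footprint of a trillion-parameter model. The ~16 bytes per parameter and the 80 GB device size are common rules of thumb for mixed-precision training, not figures from Cerebras or Nvidia, and the result ignores activations and any duplication across parallel devices.

```python
# Rough estimate of how many fixed-size HBM devices a trillion-parameter model
# needs just to hold its training state (weights, gradients, optimizer state).
# The constants are common rules of thumb, not vendor figures.

def devices_needed(params: int, bytes_per_param: int, hbm_gb_per_device: int) -> int:
    """Minimum number of devices whose combined HBM can hold the training state."""
    total_gb = params * bytes_per_param // 10**9
    return -(-total_gb // hbm_gb_per_device)   # ceiling division

ONE_TRILLION = 10**12
print("Training state: about", ONE_TRILLION * 16 // 10**12, "TB at ~16 bytes/parameter")
print("80 GB HBM devices needed just to hold it:", devices_needed(ONE_TRILLION, 16, 80))
```

The exact device count depends on each accelerator’s memory capacity and on how much extra state training requires, which is why vendor comparisons such as the 287-chip figure vary with the assumptions used.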

The combination of one CS-3 and the MemX takes up two standard telco racks of equipment, said Wang. The company claims this is less than one percent of the space and power of the equivalent GPU arrangement.

The MemX device uses commodity DRAM, known as DDR5, in contrast to GPU cards, which use more expensive “high-bandwidth memory,” or HBM.

“It does not touch the HBM supply chain so it’s extremely easy to procure, and it’s inexpensive,” said Wang.

Cerebras is betting the real payoff is in the programming model. To program the hundreds of GPUs in concert, said Wang, a total of 20,507 lines of code are needed to coordinate an AI model’s Python, C, C++, and shell code, among other resources. The same task can be carried out on the CS-3 machine with 565 lines of code.

“This is not just a need from a hardware perspective, it’s so much simpler from a programming perspective,” he said, “because you can drop this trillion-parameter model directly into this block of memory,” whereas the GPUs involve “managing” across “thousands of 80-gigabyte blocks” of HBM memory to coordinate parameters.
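A toy layout calculation illustrates the bookkeeping gap Wang is pointing to: the same trillion parameters map onto hundreds of separate fixed-size HBM blocks, versus a single range in one unified pool. The sketch below uses the same rough 16-bytes-per-parameter assumption as above and is not Cerebras’s software or any real GPU framework.

```python
# Toy comparison of parameter bookkeeping: many fixed-size HBM blocks versus
# one unified memory pool. Assumes ~16 bytes of training state per parameter.

def shard_map(total_bytes: int, block_bytes: int) -> list[tuple[int, int]]:
    """Byte ranges assigned to each fixed-size memory block."""
    return [(start, min(start + block_bytes, total_bytes))
            for start in range(0, total_bytes, block_bytes)]

TOTAL_BYTES = 10**12 * 16           # one trillion parameters x ~16 bytes each
HBM_BLOCK = 80 * 10**9              # one 80 GB HBM block

print("HBM layout:", len(shard_map(TOTAL_BYTES, HBM_BLOCK)), "separate blocks to coordinate")
print("Unified pool:", len(shard_map(TOTAL_BYTES, TOTAL_BYTES)), "block")
```

Keeping all of those ranges consistent during training is the kind of coordination Wang says accounts for much of the extra code on GPU clusters.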

The research project trained the AI model, whose details were not disclosed, for 50 training steps, though it did not yet train it to “convergence,” meaning a finished state. To train a trillion-parameter model to convergence would require many more machines and more time.

However, Cerebras subsequently worked with Sandia National Laboratories to run the training on 16 of the CS-3 machines. Performance increased in a “linear” fashion, said Wang, rising in proportion to the number of computers added to the cluster.

“The GPU has always claimed linear scaling, but it’s very, very difficult to achieve,” said Wang. “The whole point of our wafer-scale cluster is that because memory is this unified block, compute is separate, and we have a fabric in between, you do not have to worry about that.”
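For context, linear scaling is commonly quantified as measured throughput relative to the ideal of the single-machine rate multiplied by the number of machines. The figures in the sketch below are placeholders, not measurements from Cerebras or Sandia.

```python
# Scaling efficiency: measured cluster throughput as a fraction of perfect
# linear scaling. The numbers below are placeholders, not reported results.

def scaling_efficiency(single_machine_rate: float, n_machines: int,
                       measured_cluster_rate: float) -> float:
    ideal = single_machine_rate * n_machines
    return measured_cluster_rate / ideal

# Example: if one machine processed 1,000 samples/s and 16 machines together
# processed 15,800 samples/s, efficiency would be ~0.99.
print(f"{scaling_efficiency(1_000, 16, 15_800):.2f}")
```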

Although the work with Sandia did not train the model to convergence, such large-model training “is very important to our customers,” said Wang. “This is literally step one before you do a large run which costs so much money,” meaning, full convergence, he said.

One of the company’s largest customers, investment firm G42 of the United Arab Emirates, “is very much motivated to achieve a world-class result,” he said. “They want to train a very, very large model.”

Sandia will probably publish on the experiment when they have some “final results,” said Wang. 

The NeurIPS conference is one of the premier events in AI, often featuring the first public disclosure of breakthroughs. The full schedule for the one-week event can be found on the NeurIPS website.