Cerebras Systems, the pioneer in innovative compute solutions for Artificial Intelligence (AI), unveiled the world’s first brain-scale AI solution. The human brain contains on the order of 100 trillion synapses.
The largest AI hardware clusters were on the order of 1% of human brain scale, or about 1 trillion synapse equivalents, called parameters. At only a fraction of full human brain-scale, these clusters of graphics processors consume acres of space and megawatts of power, and require dedicated teams to operate.
The technology enables a single CS-2 accelerator—the size of a dorm room refrigerator—to support models of over 120 trillion parameters in size. Cerebras’ new technology portfolio contains four industry-leading innovations: Cerebras Weight Streaming, a new software execution architecture; Cerebras MemoryX, a memory extension technology; Cerebras SwarmX, a high-performance interconnect fabric technology; and Selectable Sparsity, a dynamic sparsity harvesting technology.
Cerebras Weight Streaming technology enables – for the first time – the ability to store model parameters off-chip while delivering the same training and inference performance as if they were on chip. This new execution model disaggregates compute and parameter storage – allowing researchers to flexibly scale size and speed independently – and eliminates the latency and memory bandwidth issues that challenge large clusters of small processors. This dramatically simplifies the workload distribution model, and is designed so users can scale from using 1 to up to 192 CS-2s with no software changes.
Cerebras MemoryX is a memory extension technology. MemoryX will provide the second-generation Cerebras Wafer Scale Engine (WSE-2) with up to 2.4 Petabytes of high-performance memory, all of which behaves as if it were on-chip. With MemoryX, the CS-2 can support models with up to 120 trillion parameters.
Cerebras SwarmX is a high-performance, AI-optimized communication fabric that extends the Cerebras Swarm on-chip fabric to off-chip. SwarmX is designed to enable Cerebras to connect up to 163 million AI optimized cores across up to 192 CS-2s, working in concert to train a single neural network.
Selectable Sparsity enables users to select the level of weight sparsity in their model and provides a direct reduction in FLOPs and time-to-solution. Weight sparsity is an exciting area of ML research that has been challenging to study as it is extremely inefficient on graphics processing units. Selectable sparsity enables the CS-2 to accelerate work and use every available type of sparsity—including unstructured and dynamic weight sparsity—to produce answers in less time.
This combination of technologies will allow users to unlock brain-scale neural networks and distribute work over enormous clusters of AI-optimized cores with push-button ease. With this, Cerebras sets the new benchmark in model size, compute cluster horsepower, and programming simplicity at scale.
“Today, Cerebras moved the industry forward by increasing the size of the largest networks possible by 100 times,” said Andrew Feldman, CEO and co-founder of Cerebras. “Larger networks, such as GPT-3, have already transformed the natural language processing (NLP) landscape, making possible what was previously unimaginable. The industry is moving past 1 trillion parameter models, and we are extending that boundary by two orders of magnitude, enabling brain-scale neural networks with 120 trillion parameters.”
“The last several years have shown us that, for NLP models, insights scale directly with parameters – the more parameters, the better the results,” says Rick Stevens, Associate Director, Argonne National Laboratory. “Cerebras’ inventions, which will provide a 100x increase in parameter capacity, may have the potential to transform the industry. For the first time we will be able to explore brain-sized models, opening up vast new avenues of research and insight.”
“One of the largest challenges of using large clusters to solve AI problems is the complexity and time required to set up, configure and then optimize them for a specific neural network,” said Karl Freund, founder and principal analyst, Cambrian AI. “The Weight Streaming execution model is so elegant in its simplicity, and it allows for a much more fundamentally straightforward distribution of work across the CS-2 clusters’ incredible compute resources. With Weight Streaming, Cerebras is removing all the complexity we have to face today around building and efficiently using enormous clusters – moving the industry forward in what I think will be a transformational journey.”
Cerebras Weight Streaming: Disaggregating Memory and Compute
The Cerebras CS-2 is powered by the Wafer Scale Engine (WSE-2), the largest chip ever made and the fastest AI processor. Purpose-built for AI work, the 7nm-based WSE-2 delivers a massive leap forward for AI compute. The WSE-2 is a single wafer-scale chip with 2.6 trillion transistors and 850,000 AI optimized cores. By comparison, the largest graphics processing unit has only 54 billion transistors, 2.55 trillion fewer than the WSE-2. The WSE-2 also has 123x more cores and 1,000x more high-performance on-chip memory than graphics processing unit competitors.
Cerebras Weight Streaming builds on the foundation of the massive size of the WSE. It is a new software execution mode where compute and parameter storage are fully disaggregated from each other. A small parameter store can be linked with many wafers housing tens of millions of cores, or 2.4 Petabytes of storage, enough for 120-trillion-parameter models, can be paired with a single CS-2.
In Weight Streaming, the model weights are held in a central off-chip storage location. They are streamed onto the wafer where they are used to compute each layer of the neural network. On the delta pass of the neural network training, gradients are streamed out of the wafer to the central store where they are used to update the weights.
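The flow can be pictured as a simple loop: weights in, activations computed, gradients out, update applied at the store. The sketch below is an illustrative toy in NumPy, not Cerebras' actual software stack; all names (weight_store, forward, backward) are hypothetical stand-ins, assuming a plain feed-forward network.

```python
# Illustrative sketch of the weight-streaming flow described above (hypothetical
# names, not Cerebras software). Weights live in a central off-chip store; each
# layer's weights are streamed in for the forward pass, and gradients are
# streamed back out on the delta pass so the update happens at the store.
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [(512, 512), (512, 256), (256, 10)]   # toy model shape

# Central parameter store (stand-in for the off-chip weight storage).
weight_store = [rng.standard_normal(shape) * 0.01 for shape in layer_sizes]

def forward(x, weights):
    """Stream each layer's weights in, compute, keep activations on the 'wafer'."""
    activations = [x]
    for w in weights:                    # w is "streamed" layer by layer
        x = np.maximum(x @ w, 0.0)       # toy ReLU layer
        activations.append(x)
    return activations

def backward(activations, weights, grad_out, lr=1e-3):
    """Stream gradients back out to the store, where the weights are updated."""
    for i in reversed(range(len(weights))):
        a_in, a_out = activations[i], activations[i + 1]
        grad_out = grad_out * (a_out > 0)           # ReLU derivative
        grad_w = a_in.T @ grad_out                  # gradient leaves the wafer
        weight_store[i] = weights[i] - lr * grad_w  # update happens off-chip
        grad_out = grad_out @ weights[i].T

x = rng.standard_normal((32, 512))
acts = forward(x, weight_store)
backward(acts, list(weight_store), rng.standard_normal(acts[-1].shape))
```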
The Weight Streaming technique is particularly well suited to the Cerebras architecture because of the WSE-2's size. Unlike graphics processing units, whose small on-chip memory forces large models to be partitioned across multiple chips, the WSE-2 can fit and execute even the largest layers without the traditional blocking or partitioning used to break them down.
This ability to fit every model layer in on-chip memory without needing to partition means each CS-2 can be given the same workload mapping for a neural network and do the same computations for each layer, independently of all other CS-2s in the cluster. For users, this simplicity allows them to scale their model from running on a single CS-2, to running on a cluster of arbitrary size without any software changes.
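A minimal sketch of that idea, assuming simple data parallelism: every simulated system runs the identical compiled layer program on its own shard of the batch, so growing the cluster changes only how the batch is split, never the per-system code. The names here are hypothetical, not part of any Cerebras API.

```python
# Minimal sketch of "same mapping on every system" (hypothetical names).
# Each simulated CS-2 runs the identical layer computation on its own shard
# of the batch; adding systems only changes how the batch is split.
import numpy as np

def layer_mapping(x, w):
    # The one compiled program every system runs, unchanged.
    return np.maximum(x @ w, 0.0)

def run_cluster(batch, w, num_systems):
    shards = np.array_split(batch, num_systems)      # data-parallel split
    outputs = [layer_mapping(shard, w) for shard in shards]
    return np.concatenate(outputs)

rng = np.random.default_rng(0)
batch = rng.standard_normal((1024, 512))
w = rng.standard_normal((512, 256)) * 0.01

out_1  = run_cluster(batch, w, num_systems=1)    # single system
out_16 = run_cluster(batch, w, num_systems=16)   # larger cluster, same code
assert np.allclose(out_1, out_16)                # identical results either way
```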
Cerebras MemoryX: Enabling Hundred-Trillion Parameter Models
Over the past three years, the largest AI models have increased their parameter count by three orders of magnitude, with the largest models now using 1 trillion parameters. A human-brain-scale model, with on the order of a hundred trillion parameters, requires roughly 2 Petabytes of memory to store.
Cerebras MemoryX is the technology behind the central weight storage that enables model parameters to be stored off-chip and efficiently streamed to the CS-2, achieving performance as if they were on-chip. It contains both the storage for the weights and the intelligence to precisely schedule and perform weight updates to prevent dependency bottlenecks. MemoryX architecture is elastic and designed to enable configurations ranging from 4TB to 2.4PB, supporting parameter sizes from 200 billion to 120 trillion.
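As a back-of-envelope check of the figures quoted above, both ends of the MemoryX range work out to roughly 20 bytes of storage per parameter (weights plus optimizer state); that per-parameter ratio is inferred from the quoted numbers here, not a published specification.

```python
# Back-of-envelope check of the MemoryX figures quoted above. Both endpoints
# imply roughly 20 bytes of storage per parameter; the ratio is inferred from
# the quoted capacities and model sizes, not a published spec.
TB = 1e12
PB = 1e15

configs = [
    ("small", 4 * TB,   200e9),    # 4 TB for ~200 billion parameters
    ("large", 2.4 * PB, 120e12),   # 2.4 PB for ~120 trillion parameters
]

for name, capacity_bytes, params in configs:
    print(f"{name}: {capacity_bytes / params:.0f} bytes per parameter")
# -> both configurations come out to about 20 bytes per parameter
```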
Cerebras SwarmX: Providing Bigger, More Efficient Clusters
The Cerebras SwarmX technology extends the boundary of AI clusters by expanding Cerebras’ on-chip fabric to off-chip. Historically, bigger AI clusters came with a significant performance and power penalty. In compute terms, performance has scaled sub-linearly while power and cost scaled super-linearly. As more graphics processors were added to a cluster, each contributed less and less to solving the problem.
Cerebras SwarmX fabric enables clusters to achieve near linear performance scaling, meaning that 10 CS-2s are expected to achieve the same solution 10x faster than a single CS-2. The SwarmX fabric scales independently of MemoryX resources – a single MemoryX unit can be used to target any number of CS-2s. In this fully disaggregated mode, the SwarmX fabric is designed to scale from 2 CS-2 systems to up to 192 systems and, since each CS-2 delivers 850,000 AI-optimized cores, will enable clusters of up to 163 million AI-optimized cores.
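The quick arithmetic behind those cluster figures is shown below; the linear speedup line is an idealization of the "near linear" scaling claim, not a measured result.

```python
# Quick arithmetic behind the cluster figures quoted above. The speedup line
# idealizes the "near linear" scaling claim; it is not a measured result.
CORES_PER_CS2 = 850_000
MAX_SYSTEMS = 192

total_cores = CORES_PER_CS2 * MAX_SYSTEMS
print(f"{total_cores:,} cores")          # 163,200,000 -> "up to 163 million"

for n in (1, 10, 192):
    # Under ideal linear scaling, n systems finish the same job n times faster.
    print(f"{n:3d} CS-2s -> {n}x speedup over a single system")
```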
Cerebras Sparsity: Smarter Math for Reduced Time-to-Answer
Cerebras is also enabling new algorithms to reduce the amount of computational work necessary to find the solution, and thereby reducing time-to-answer. Sparsity is one of the most powerful levers to make computation more efficient. Evolution selected for sparsity in the human brain: neurons have “activation sparsity” in that not all neurons are firing at the same time. They have “weight sparsity” in that not all synapses are fully connected. Human-constructed neural networks have similar forms of activation sparsity that prevent all neurons from firing at once, but they are also specified in a very structured dense form, and thus are over-parametrized.
With sparsity, the premise is simple: multiplying by zero is a bad idea, especially when it consumes time and electricity. And yet, graphics processing units multiply by zero routinely. In neural networks, there are many types of sparsity. Sparsity can be in the activations as well as in the parameters, and sparsity can be structured or unstructured. As the AI community grapples with the exponentially increasing cost to train large models, the use of sparsity and other algorithmic techniques to reduce the compute FLOPs required to train a model to state-of-the-art accuracy is increasingly important.
The Cerebras WSE is based on a fine-grained data flow architecture. Its 850,000 AI optimized compute cores are capable of individually ignoring zeros regardless of the pattern in which they arrive. This selectable sparsity harvesting is something no other architecture is capable of. The dataflow scheduling and tremendous memory bandwidth unique to the Cerebras architecture enables this type of fine-grained processing to accelerate all forms of sparsity. The result is that the CS-2 can select and dial in sparsity to produce a specific level of FLOP reduction, and therefore a reduction in time-to-answer.
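The relationship between weight sparsity and FLOP reduction can be illustrated with a toy example: every weight that is exactly zero is a multiply that never needs to happen. The sketch below is plain NumPy bookkeeping under an assumed unstructured sparsity pattern; it does not model the WSE-2's dataflow hardware.

```python
# Toy illustration of how weight sparsity maps to FLOP reduction: each zero
# weight is a multiply that can be skipped. Plain NumPy bookkeeping only,
# not a model of the WSE-2 dataflow cores.
import numpy as np

rng = np.random.default_rng(0)
sparsity = 0.8                                 # "dial in" 80% weight sparsity
w = rng.standard_normal((512, 512))
w[rng.random(w.shape) < sparsity] = 0.0        # unstructured zero pattern

x = rng.standard_normal((64, 512))
dense_multiplies  = x.shape[0] * w.shape[0] * w.shape[1]
useful_multiplies = x.shape[0] * np.count_nonzero(w)

print(f"dense:  {dense_multiplies:,} multiplies")
print(f"sparse: {useful_multiplies:,} multiplies "
      f"({useful_multiplies / dense_multiplies:.0%} of dense FLOPs)")
```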
Push Button Configuration of Massive AI Clusters
Large clusters have historically been plagued by set up and configuration challenges, often taking months to fully prepare before they are ready to run real applications. Preparing and optimizing a neural network to run on large clusters of GPUs takes yet more time. To achieve reasonable utilization on a GPU cluster takes painful, manual work from researchers who typically need to partition the model, spreading it across the many tiny compute units; manage both data parallel and model parallel partitions; manage memory size and memory bandwidth constraints; and deal with synchronization overheads. To deal with potential drops in model accuracy takes additional hyperparameter and optimizer tuning to get models to converge at extreme batch sizes. And this task needs to be repeated for each network.
By bringing together the technologies in Weight Streaming, MemoryX and SwarmX, Cerebras makes the process of large cluster building push-button simple. Cerebras' approach is not to hide distribution complexity by papering over it with software. Cerebras has instead developed a fundamentally different architecture which removes the scaling complexity altogether. Because of the size of the WSE-2, there is no need to partition the layers of a neural network across multiple CS-2s – even today's largest network layers can be mapped to a single CS-2.
Unlike in GPU clusters where each graphics processor has a different part of the neural network, each CS-2 in a Cerebras cluster will have the same software configuration. Adding another CS-2 changes almost nothing in the execution of the work, so running a neural network on dozens of CS-2s will look the same to a researcher as running on a single system. Setting up a cluster will be as easy as compiling a workload for a single machine and applying that same mapping to all the machines in the desired cluster size.
Cerebras Weight Streaming technology enables users to run neural network applications on massive clusters of CS-2 systems with the programming ease of a single graphics processing unit.