Groq’s Tensor Streaming Processor (TSP) architecture is capable of 1 PetaOp/s in a single-chip implementation. It is the first architecture in the world to reach this level of performance: one quadrillion operations per second, or 1e15 ops/s. The architecture is also capable of up to 250 trillion floating-point operations per second (250 TFLOPS).
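To put the peak figures in perspective, the short sketch below converts them into a best-case time per inference. The two peak rates come from the text above; the operation count per inference is a made-up number chosen only for illustration, and real throughput depends on how much of the peak rate a given model actually sustains.

```python
# Peak rates quoted in the text above.
PEAK_OPS_PER_S = 1e15      # 1 PetaOp/s
PEAK_FLOPS = 250e12        # 250 TFLOPS

# Hypothetical workload size (an assumption, not a Groq figure).
HYPOTHETICAL_OPS_PER_INFERENCE = 8e9


def min_seconds(total_ops: float, peak_rate: float) -> float:
    """Lower bound on run time: assumes the chip sustains its peak rate."""
    return total_ops / peak_rate


def max_inferences_per_second(ops_per_inference: float, peak_rate: float) -> float:
    """Upper bound on throughput under the same peak-rate assumption."""
    return peak_rate / ops_per_inference


if __name__ == "__main__":
    t = min_seconds(HYPOTHETICAL_OPS_PER_INFERENCE, PEAK_OPS_PER_S)
    n = max_inferences_per_second(HYPOTHETICAL_OPS_PER_INFERENCE, PEAK_OPS_PER_S)
    print(f"best-case latency: {t * 1e6:.1f} microseconds")
    print(f"best-case throughput: {n:,.0f} inferences/s")
```

These are bounds, not predictions: they only show how a peak rate translates into the latency and inferences-per-second terms used in this overview.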
For inference, the Groq architecture is many times faster than anything else available, in both latency and inferences per second. Built with a software-first mindset, the TSP architecture offers a new paradigm that achieves both compute flexibility and massive parallelism without the synchronization overhead of traditional GPU and CPU architectures.
The architecture supports both traditional and new machine learning models and is currently in operation at customer sites in both x86 and non-x86 systems. It is designed specifically for the performance requirements of computer vision, machine learning, and other AI-related workloads. Execution planning happens in software, which frees up valuable silicon real estate that would otherwise be dedicated to dynamic instruction execution.
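The idea of planning execution in software can be illustrated with a toy static scheduler. This is a minimal sketch, not Groq's compiler: the `Op` type, the functional-unit names, and the cycle latencies are all hypothetical. It assumes every operation has a fixed latency known ahead of time, so each start cycle and the total run time are computed before anything executes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    unit: str           # hypothetical functional unit the op runs on
    latency: int        # fixed latency in cycles, assumed known in advance
    deps: tuple = ()    # names of ops whose results this op consumes


def plan(ops):
    """Assign every op a start cycle ahead of time.

    `ops` must be listed in dependency (topological) order. An op starts
    once its inputs are finished and its functional unit is free.
    """
    start, finish, unit_free = {}, {}, {}
    for op in ops:
        inputs_ready = max((finish[d] for d in op.deps), default=0)
        begin = max(inputs_ready, unit_free.get(op.unit, 0))
        start[op.name] = begin
        finish[op.name] = begin + op.latency
        unit_free[op.unit] = finish[op.name]
    return start, max(finish.values())


# Hypothetical program: two loads feed a matrix multiply, then an activation.
PROGRAM = [
    Op("load_a", "memory", 2),
    Op("load_b", "memory", 2),
    Op("matmul", "matrix", 4, ("load_a", "load_b")),
    Op("relu", "vector", 1, ("matmul",)),
    Op("store", "memory", 2, ("relu",)),
]

SCHEDULE, TOTAL_CYCLES = plan(PROGRAM)

if __name__ == "__main__":
    for name, cycle in SCHEDULE.items():
        print(f"cycle {cycle:2d}: issue {name}")
    print(f"total: {TOTAL_CYCLES} cycles")
```

Because the plan is fixed before execution, running it again yields the same schedule and the same cycle count, which is the property that makes hardware for run-time instruction scheduling unnecessary in this toy model.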
The tight control this architecture offers yields deterministic processing, which is especially valuable for applications where safety and accuracy are paramount. Compared to complex traditional architectures based on CPUs, GPUs, and FPGAs, Groq’s chip also streamlines qualification and deployment, enabling customers to implement scalable, high performance-per-watt systems simply and quickly.
Well suited to deep learning inference across a wide range of applications, the Groq solution is designed for a broad class of workloads. Its performance, coupled with its simplicity, makes it an ideal platform for any high-performance, data- or compute-intensive workload.