Cerebras' 1.2 Trillion Transistor Deep Learning Processor

08:49PM EDT - Some of the big news of today is Cerebras announcing its wafer-scale 1.2 trillion transistor solution for deep learning. The talk today goes into detail about the technology.

08:51PM EDT - Wafer-scale chip: 46,225 mm², 1.2 trillion transistors, 400,000 AI cores, fed by 18 GB of on-chip SRAM

08:51PM EDT - TSMC 16nm

08:51PM EDT - 215mm x 215mm - 8.5 inches per side

08:51PM EDT - 56 times larger than the largest GPU today
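
(ed: quick sanity check on that 56x figure - assuming the comparison point is an ~815 mm² GPU die such as NVIDIA's V100, which wasn't named in the talk)

```python
# Rough sanity check of the "56x larger" claim.
# Assumption: the reference is an ~815 mm^2 GPU die (e.g. NVIDIA V100);
# the talk doesn't say which GPU it is comparing against.
wse_area_mm2 = 215 * 215      # 46,225 mm^2 wafer-scale engine
gpu_area_mm2 = 815            # assumed "largest GPU today"
print(wse_area_mm2 / gpu_area_mm2)   # ~56.7
```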

08:52PM EDT - Built for Deep Learning

08:52PM EDT - DL training is hard (ed: this is an understatement)

08:52PM EDT - Peta-to-exa scale compute range

08:53PM EDT - The shape of the problem is difficult to scale

08:53PM EDT - Fine grain has a lot of parallelism

08:53PM EDT - Coarse grain is inherently serial

08:53PM EDT - Training is the process of applying small changes, serially

08:53PM EDT - Size and shape of the problem makes training NN really hard

08:53PM EDT - Today we have dense vector compute

08:54PM EDT - For coarse-grained scaling, you need a high-speed interconnect to run multiple instances. Still limited

08:54PM EDT - Scaling is limited and costly

08:54PM EDT - Specialized accelerators are the answer

08:55PM EDT - NN: what is the right architecture

08:55PM EDT - Need a core to be optimized for NN primitives

08:55PM EDT - Need a programmable NN core

08:55PM EDT - Needs to do sparse compute fast

08:55PM EDT - Needs fast local memory

08:55PM EDT - All of the cores should be connected with a fast interconnect

08:56PM EDT - Cerebras uses flexible cores. Flexible general ops for control processing

08:56PM EDT - Core should handle tensor operations very efficiently

08:56PM EDT - Forms the bulk of the compute in any neural network

08:56PM EDT - Tensors as first class operands

08:57PM EDT - fmac native op

08:57PM EDT - NN naturally creates sparse networks

08:58PM EDT - The core has native sparse processing in the hardware with dataflow scheduling

08:58PM EDT - All the compute is triggered by the data

08:58PM EDT - Filters out all the zeros in the sparse data, and with them the wasted work

08:58PM EDT - Saves power and energy, and gains performance by moving on to the next piece of useful work

08:58PM EDT - Enabled because arch has fine-grained execution datapaths

08:58PM EDT - Many small cores with independent instructions

08:59PM EDT - Allows for very non-uniform work
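
(ed: a minimal sketch of what data-triggered sparse compute buys you - this is not Cerebras' core, just the idea that multiply-accumulates on zero activations can be filtered out entirely)

```python
import numpy as np

def dense_fmac(weights, activations):
    """Baseline: every weight/activation pair costs a multiply-accumulate."""
    acc = 0.0
    for w, a in zip(weights, activations):
        acc += w * a                      # fmac fires on every element, zeros included
    return acc

def sparse_fmac(weights, activations):
    """Data-triggered version: work only happens for non-zero activations,
    which is where the power and performance win described above comes from."""
    acc = 0.0
    for w, a in zip(weights, activations):
        if a != 0.0:                      # filter the zeros, skip the wasted work
            acc += w * a
    return acc

# ReLU-style activations are naturally sparse, as the talk notes
acts = np.maximum(np.random.randn(1024), 0.0)
wts = np.random.randn(1024)
print(np.isclose(dense_fmac(wts, acts), sparse_fmac(wts, acts)))  # same answer, roughly half the fmacs
```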

08:59PM EDT - Next is memory

08:59PM EDT - Traditional memory architectures are not optimized for DL

08:59PM EDT - Traditional memory requires high data reuse for performance

09:00PM EDT - The fundamental Mat*Vec operation has low data reuse

09:00PM EDT - Batching translates Mat*Vec into Mat*Mat to recover reuse, but that changes the training dynamics

09:00PM EDT - Cerebras has high-perf, fully distributed on-chip SRAM next to the cores

09:01PM EDT - Getting orders of magnitude higher bandwidth

09:01PM EDT - ML can be done the way it wants to be done
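
(ed: the data-reuse argument in numbers - a rough FLOPs-per-byte comparison of Mat*Vec vs batched Mat*Mat; the sizes and the fp32 assumption are mine, purely for illustration)

```python
def arithmetic_intensity(n, batch):
    """FLOPs per byte for an (n x n) weight matrix applied to `batch` input vectors,
    assuming fp32 and that weights, inputs and outputs each move once from memory."""
    flops = 2 * n * n * batch                            # one multiply + one add per weight per sample
    bytes_moved = 4 * (n * n + n * batch + n * batch)    # weights + inputs + outputs
    return flops / bytes_moved

n = 4096
print(arithmetic_intensity(n, 1))     # Mat*Vec: ~0.5 FLOPs/byte -> memory bound on off-chip DRAM
print(arithmetic_intensity(n, 256))   # Mat*Mat: ~100+ FLOPs/byte, but only by growing the batch
```

With distributed on-chip SRAM next to each core, the Mat*Vec case no longer needs that batching trick to keep the datapaths fed.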

09:01PM EDT - High bandwidth, low latency interconnect

09:01PM EDT - fast and fully configurable fabric

09:01PM EDT - All HW-based communication avoids SW overhead

09:02PM EDT - 2D mesh topology

09:02PM EDT - Higher utilization and efficiency than global topologies
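
(ed: the locality argument for a 2D mesh in one toy calculation - on a mesh, cores mostly talk to near neighbours, so most traffic is a handful of hops; the ~600x600 grid below is my own rough illustration, not a disclosed figure)

```python
def mesh_hops(src, dst):
    """XY-routing hop count between two cores on a 2D mesh (Manhattan distance)."""
    (x0, y0), (x1, y1) = src, dst
    return abs(x1 - x0) + abs(y1 - y0)

# Adjacent layers placed next to each other keep communication local and cheap ...
print(mesh_hops((10, 10), (11, 10)))    # 1 hop
# ... while the worst case on a hypothetical ~600x600 core grid is corner to corner.
print(mesh_hops((0, 0), (599, 599)))    # 1198 hops
```

The utilization claim follows from that locality: short local links carry the bulk of the traffic instead of everything contending for one global fabric.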

09:02PM EDT - Need more than a single die

09:02PM EDT - Solution is wafer scale

09:03PM EDT - Build Big chips

09:03PM EDT - Cluster scale perf on a single chip

09:03PM EDT - GB of fast memory (SRAM) 1 clock cycle from the core

09:03PM EDT - That's impossible with off-chip memory

09:03PM EDT - Full on-chip interconnect fabric

09:03PM EDT - Model parallel, linear performance scaling

09:04PM EDT - Map the entire neural network onto the chip at once

09:04PM EDT - One instance of NN, don't have to increase batch size to get cluster scale perf

09:04PM EDT - Vastly lower power and less space

09:04PM EDT - Can use TensorFlow and PyTorch

09:05PM EDT - Performs place and route to map neural network layers onto the fabric

09:05PM EDT - Entire wafer operates on the single neural network
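
(ed: a deliberately naive sketch of "map layers onto a 2D fabric" - a greedy strip allocation with compute-proportional areas, nothing like the real compiler, just to make the place-and-route idea concrete; the layer sizes are invented)

```python
# Toy placement: give each layer a column strip of cores proportional to its compute,
# sweeping left to right across the fabric. Purely illustrative - the actual software
# does real place-and-route that also accounts for communication between layers.
def place_layers(layer_flops, fabric_w, fabric_h):
    total = sum(layer_flops)
    placements, x = [], 0
    for i, flops in enumerate(layer_flops):
        width = max(1, round(fabric_w * flops / total))   # columns proportional to work
        placements.append({"layer": i,
                           "cols": (x, min(x + width, fabric_w)),
                           "rows": (0, fabric_h)})
        x += width
    return placements

# Hypothetical per-layer compute for a small network (arbitrary units)
for p in place_layers([5, 20, 40, 20, 5], fabric_w=100, fabric_h=100):
    print(p)
```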

09:05PM EDT - Challenges of wafer scale

09:05PM EDT - Need cross-die connectivity, yield, thermal expansion

09:06PM EDT - Scribe line separates the die. On top of the scribe line, create wires

09:07PM EDT - Extends 2D mesh fabric across all die

09:07PM EDT - Same connectivity between cores and between die

09:07PM EDT - More efficient than off-chip

09:07PM EDT - Full BW at the die level

09:08PM EDT - Redundancy helps yield

09:08PM EDT - Redundant cores and redundant fabric links

09:08PM EDT - Reconnect the fabric with links

09:08PM EDT - Drive yields high

09:09PM EDT - Transparent to software
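
(ed: the redundancy idea in miniature - remap a clean logical core grid onto a physical grid that has a few dead cores, using spares, so software never sees the holes; the spare-column scheme here is my assumption, not a disclosed detail)

```python
# Toy redundancy remap: each physical row has spare columns; defective cores are
# skipped and logical coordinates point at good physical cores instead.
def build_remap(phys_cols, phys_rows, logical_cols, defective):
    remap = {}
    for row in range(phys_rows):
        good_cols = [c for c in range(phys_cols) if (c, row) not in defective]
        if len(good_cols) < logical_cols:
            raise RuntimeError(f"not enough spare cores in row {row}")
        for logical_c in range(logical_cols):
            remap[(logical_c, row)] = (good_cols[logical_c], row)   # software sees a clean grid
    return remap

defects = {(3, 0), (7, 1)}                  # hypothetical failed cores found at wafer test
remap = build_remap(phys_cols=10, phys_rows=2, logical_cols=8, defective=defects)
print(remap[(3, 0)], remap[(7, 1)])         # logical cores quietly routed around the defects
```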

09:09PM EDT - Thermal expansion

09:09PM EDT - With normal attach technology, mismatched thermal expansion causes too much mechanical stress

09:09PM EDT - Custom connector developed

09:09PM EDT - Connector absorbs the variation in thermal expansion

09:10PM EDT - All components need to be held with precise alignment - custom packaging tools

09:10PM EDT - Power and Cooling

09:11PM EDT - Power planes don't work - there isn't enough copper in the PCB to do it that way

09:11PM EDT - Heat density too high for direct air cooling

09:12PM EDT - Current is delivered perpendicular to the wafer. Water cooling is perpendicular too

09:14PM EDT - Q&A Time

09:14PM EDT - Already in use? Yes

09:15PM EDT - Can you make a round chip? Square is more convenient

09:15PM EDT - Yield? Mature processes are quite good and uniform

09:16PM EDT - Does it cost less than a house? Everything is amortised across the wafer

09:17PM EDT - Regular processor for housekeeping? They can all do it

09:17PM EDT - Is it fully synchronous? No

09:20PM EDT - Clock rate? Not disclosed

09:20PM EDT - That's a wrap. Next is Habana
