Convolutional networks and Transformers: Tensor Cores > FLOPs > Memory Bandwidth > 16-bit capability
Recurrent networks: Memory Bandwidth > 16-bit capability > Tensor Cores > FLOPs
A simple and effective way to think about matrix multiplication AB = C is that it is memory-bandwidth bound: copying the memory of A and B onto the chip is more costly than performing the computation of AB itself. This means memory bandwidth is the most important feature of a GPU if you want to use LSTMs and other recurrent networks that do lots of small matrix multiplications. The smaller the matrix multiplications, the more important memory bandwidth becomes.
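To make this concrete, here is a minimal back-of-the-envelope sketch in plain Python (no GPU required). The 100 TFLOPs and 1 TB/s figures are illustrative assumptions, not the specs of any particular card: a matmul is compute bound only when its arithmetic intensity (FLOPs per byte moved) exceeds the ratio of the GPU's compute rate to its memory bandwidth.

```python
def arithmetic_intensity(n, bytes_per_element=2):  # 2 bytes per 16-bit float
    flops = 2 * n**3                             # multiply-adds in an n x n matmul
    bytes_moved = 3 * n**2 * bytes_per_element   # read A and B, write C
    return flops / bytes_moved

# Hypothetical GPU: 100 TFLOPs of 16-bit compute, 1 TB/s of memory bandwidth.
machine_balance = 100e12 / 1e12  # FLOPs the GPU can do per byte it can move

for n in (64, 256, 1024, 4096):
    ai = arithmetic_intensity(n)
    bound = "compute bound" if ai > machine_balance else "bandwidth bound"
    print(f"n={n:5d}  intensity={ai:7.1f} FLOPs/byte  -> {bound}")
```

For small matrices the intensity is low, so the GPU spends most of its time waiting on memory rather than computing, which is exactly the regime recurrent networks live in.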
In contrast, convolution is bound by computation speed. Thus, TFLOPs on a GPU is the best indicator for the performance of ResNets and other convolutional architectures. Tensor Cores can increase the available FLOPs dramatically.
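If you want to see the Tensor Core effect directly, a rough sketch (assuming PyTorch and a CUDA GPU with Tensor Cores; the actual speedup varies by hardware) is to time the same large matmul in 32-bit and 16-bit:

```python
import torch

def time_matmul(dtype, n=4096, iters=50):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    a @ b  # warm-up so one-time kernel launch cost is excluded
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        a @ b
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per matmul

if torch.cuda.is_available():
    print(f"32-bit: {time_matmul(torch.float32):6.2f} ms")
    print(f"16-bit: {time_matmul(torch.float16):6.2f} ms  (Tensor Core path)")
```

On Tensor Core hardware the 16-bit run is typically several times faster, since the cores provide a dedicated fast path for 16-bit matrix math.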