The results are quite impressive! We compared against compression algorithms on MNIST, where sparse momentum outperforms most other methods. This is a pretty good result given that compression methods start from a dense network and usually retrain repeatedly, while we train a sparse network from scratch! Another impressive result is that we can match or even exceed the performance of dense networks while using only 20% of the weights (80% sparsity). On CIFAR-10, we compare against Single-shot Network Pruning, which is designed for simplicity rather than performance, so it is not surprising that sparse momentum does better. What is more interesting is that we can train both VGG16-D (a version of VGG16 with two fully connected layers) and Wide Residual Network (WRN) 16-10 (a WRN that is 16 layers deep and very wide) to dense performance levels with just 5% of the weights. For other networks, sparse momentum comes close to dense performance levels. Furthermore, as I will show later, with an optimized sparse convolution algorithm we would be able to train a variety of networks to the same performance levels while training between 3.0x and 5.6x faster!
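
To make the sparsity budgets concrete, here is a minimal sketch, assuming PyTorch, of what "training with 5% of the weights" looks like: binary masks keep a fixed fraction of each weight tensor and are re-applied after every optimizer step. This is not the sparse momentum algorithm itself; the random masks, layer sizes, and hyperparameters below are illustrative assumptions only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_masks(model, density=0.05):
    """Create a random binary mask per weight tensor, keeping ~`density` of entries.
    (Sparse momentum allocates weights across layers using momentum statistics;
    random masks are used here only to illustrate a fixed sparsity budget.)"""
    masks = {}
    for name, param in model.named_parameters():
        if param.dim() > 1:  # mask weight matrices/filters, leave biases dense
            masks[name] = (torch.rand_like(param) < density).float()
    return masks

def apply_masks(model, masks):
    """Zero out pruned weights so only the sparse subset is effectively trained."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.mul_(masks[name])

# Usage sketch on dummy MNIST-shaped data: re-apply masks after each step,
# since gradients can make pruned entries nonzero again.
model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
masks = make_masks(model, density=0.05)  # 5% of weights = 95% sparsity
apply_masks(model, masks)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

x, y = torch.randn(64, 784), torch.randint(0, 10, (64,))
loss = F.cross_entropy(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
apply_masks(model, masks)  # keep the network sparse throughout training
```

In sparse momentum itself the mask is not fixed: during training, small-magnitude weights are pruned and new weights are grown in the layers and positions where momentum magnitudes are largest, while the overall weight budget stays constant.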