a bewildering array of comparisons, useful as a literature review of sorts
The concept of non-linearity in a neural network is introduced by an activation function, which plays an integral role in the training and performance of the network. Over the years of theoretical research, many activation functions have been proposed, yet only a few are widely used across most applications: ReLU (Rectified Linear Unit), TanH (hyperbolic tangent), Sigmoid, Leaky ReLU, and Swish. In this work, a novel neural activation function called Mish is proposed. The experiments show that Mish tends to work better than both ReLU and Swish, along with other standard activation functions, in many deep networks across challenging datasets. For instance, in Squeeze-Excite Net-18 for CIFAR-100 classification, the network with Mish improved Top-1 test accuracy by 0.494% and 1.671% compared to the same network with Swish and ReLU, respectively. Its similarity to Swish, the boost in performance, and its simple implementation make it easy for researchers and developers to use Mish in their neural network models.
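The excerpt doesn't give the formula, but the Mish paper defines the activation as f(x) = x · tanh(softplus(x)). A minimal PyTorch sketch of a drop-in module (the class name and test values below are just for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    """Mish activation: f(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + e^x)).
    Sketch for illustration; formula from the Mish paper, not this excerpt."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.tanh(F.softplus(x))

if __name__ == "__main__":
    act = Mish()
    x = torch.linspace(-5.0, 5.0, steps=5)
    print(x)
    print(act(x))  # smooth and non-monotonic: slightly negative for small negative inputs
```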
We propose a decentralized variant of Monte Carlo tree search (MCTS) that is suitable for a variety of tasks in multi-robot active perception. Our algorithm allows each robot to optimize its own actions by maintaining a probability distribution over plans in the joint-action space. Robots periodically communicate a compressed form of their search trees, which are used to update the joint distribution using a distributed optimization approach inspired by variational methods. Our method admits any objective function defined over robot action sequences, assumes intermittent communication, is anytime, and is suitable for online replanning.
Deep reinforcement learning has achieved great successes in recent years; however, one main challenge is sample inefficiency. In this paper, we focus on how to use action guidance by means of a non-expert demonstrator to improve sample efficiency in a domain with sparse, delayed, and possibly deceptive rewards: the recently proposed multi-agent benchmark of Pommerman. We propose a new framework where even a non-expert simulated demonstrator, e.g., planning algorithms such as Monte Carlo tree search with a small number of rollouts, can be integrated within asynchronous distributed deep reinforcement learning methods. Compared to a vanilla deep RL algorithm, our proposed methods both learn faster and converge to better policies on a two-player mini version of the Pommerman game.
The results are quite impressive! We compared against compression algorithms on MNIST, where sparse momentum outperforms most other methods. This is a pretty good result given that compression methods start from a dense network and usually retrain repeatedly, while we train a sparse network from scratch! Another impressive result is that we can match or even exceed the performance of dense networks by using 20% of the weights (80% sparsity). On CIFAR-10, we compare against Single-Shot Network Pruning, which is designed for simplicity and not performance — so it is not surprising that sparse momentum does better. However, what is interesting is that we can train both VGG16-D (a version of VGG16 with two fully connected layers) and Wide Residual Network (WRN) 16-10 (a 16-layer-deep, very wide WRN) to dense performance levels with just 5% of the weights. For other networks, sparse momentum comes close to dense performance levels. Furthermore, as I will show later, with an optimized sparse convolution algorithm we would be able to train a variety of networks to the same performance levels while training 3.0-5.6x faster!
DeepMind's new architecture -- MEMO -- solves novel reasoning tasks with less computation than several baseline AI models.
Convolutional networks and Transformers: Tensor Cores > FLOPs > Memory Bandwidth > 16-bit capability
Recurrent networks: Memory Bandwidth > 16-bit capability > Tensor Cores > FLOPs
A simple and effective way to think about matrix multiplication AB = C is that it is memory-bandwidth bound: copying A and B onto the chip costs more than doing the computations of AB. This means memory bandwidth is the most important feature of a GPU if you want to use LSTMs and other recurrent networks that do lots of small matrix multiplications. The smaller the matrix multiplications, the more important memory bandwidth becomes.
In contrast, convolution is bound by computation speed. Thus TFLOPs on a GPU is the best indicator of performance for ResNets and other convolutional architectures. Tensor Cores can increase FLOPs dramatically.
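A rough way to see the bandwidth-versus-compute split is to compare a matmul's arithmetic intensity (FLOPs per byte moved) with the GPU's own FLOPs-to-bandwidth ratio. The sketch below uses fp16 operands and approximate V100-class specs (~112 TFLOPs tensor-core, ~900 GB/s) purely as illustrative assumptions: below the GPU's ratio a kernel is bandwidth bound, above it compute bound.

```python
def matmul_intensity(m: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """Arithmetic intensity of C = A @ B with A (m x k), B (k x n).

    FLOPs: 2*m*k*n (one multiply + one add per term).
    Traffic: read A and B, write C, each element once (idealized caching).
    """
    flops = 2 * m * k * n
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

# Assumed, approximate V100-class figures (illustrative only).
PEAK_FLOPS = 112e12          # fp16 tensor-core FLOP/s
PEAK_BW = 900e9              # bytes/s of memory bandwidth
ridge = PEAK_FLOPS / PEAK_BW # ~124 FLOPs per byte

for name, (m, k, n) in {
    "LSTM-ish small matmul (batch 32)": (32, 1024, 1024),
    "conv-as-matmul large tile": (4096, 4608, 256),
}.items():
    ai = matmul_intensity(m, k, n)
    bound = "memory-bandwidth bound" if ai < ridge else "compute (FLOPs) bound"
    print(f"{name}: ~{ai:.0f} FLOPs/byte vs ridge ~{ridge:.0f} -> {bound}")
```

The small, LSTM-sized matmul lands at roughly 30 FLOPs/byte, well below the ridge point, while the large convolution-shaped one lands above it, which is exactly the split the two rankings above describe.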
Ax and BoTorch leverage probabilistic models that make efficient use of data and are able to meaningfully quantify the costs and benefits of exploring new regions of problem space. In these cases, probabilistic models can offer significant benefits over standard deep learning methods such as neural networks, which often require large amounts of data to make accurate predictions and don’t provide good estimates of uncertainty.
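As a generic illustration of that point (not Ax's or BoTorch's actual API), the sketch below fits a tiny Gaussian-process surrogate to a handful of observations and scores candidates by expected improvement, so the "benefit of exploring" shows up explicitly through the predictive variance. The kernel choice, hyperparameters, and toy objective are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import norm

def rbf(a, b, length=0.3, var=1.0):
    """Squared-exponential kernel between 1-D input arrays a and b."""
    d = a[:, None] - b[None, :]
    return var * np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_tr, y_tr, x_te, noise=1e-6):
    """Zero-mean GP regression: predictive mean and variance at x_te."""
    K = rbf(x_tr, x_tr) + noise * np.eye(len(x_tr))
    K_s = rbf(x_tr, x_te)
    mean = K_s.T @ np.linalg.solve(K, y_tr)
    var = np.diag(rbf(x_te, x_te)) - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mean, np.maximum(var, 1e-12)

def expected_improvement(mean, var, best_y):
    """EI for minimization: expected amount by which a candidate beats best_y."""
    std = np.sqrt(var)
    z = (best_y - mean) / std
    return (best_y - mean) * norm.cdf(z) + std * norm.pdf(z)

# Toy objective observed at a few points; uncertainty drives where to look next.
# (Generic Bayesian-optimization illustration, not Ax/BoTorch code.)
f = lambda x: np.sin(3 * x) + 0.5 * x
x_obs = np.array([0.1, 0.4, 0.9])
y_obs = f(x_obs)

x_cand = np.linspace(0.0, 1.0, 101)
mu, s2 = gp_posterior(x_obs, y_obs, x_cand)
ei = expected_improvement(mu, s2, y_obs.min())
print("next point to evaluate:", x_cand[np.argmax(ei)])
```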
Wide neural networks with random weights and biases are Gaussian processes, as originally observed by Neal (1995) and more recently by Lee et al. (2018) and Matthews et al. (2018) for deep fully-connected networks, as well as by Novak et al. (2019) and Garriga-Alonso et al. (2019) for deep convolutional networks. We show that this Neural Network-Gaussian Process correspondence surprisingly extends to all modern feedforward or recurrent neural networks composed of multilayer perceptrons, RNNs (e.g. LSTMs, GRUs), (nD or graph) convolution, pooling, skip connections, attention, batch normalization, and/or layer normalization. More generally, we introduce a language for expressing neural network computations, and our result encompasses all such expressible neural networks. This work serves as a tutorial on the tensor programs technique formulated in Yang (2019) and elucidates the Gaussian process results obtained there. We provide open-source implementations of the Gaussian process kernels of the simple RNN, GRU, transformer, and batchnorm+ReLU networks at github.com/thegregyang/GP4A.
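A quick empirical check of the width limit for the simplest case (one hidden ReLU layer with standard Gaussian init, as assumed here; the paper's result covers far more architectures): over random initializations, the covariance of the network's outputs at two inputs should approach the arc-cosine kernel, and the output distribution should look Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
d, width, n_nets = 5, 1024, 10000

x = rng.normal(size=d)
y = rng.normal(size=d)

def f(inp, W, v):
    """One-hidden-layer ReLU net, scaled so the output variance stays O(1).
    Simplest-case sketch only; the NN-GP result above is far more general."""
    return v @ np.maximum(W @ inp, 0.0) / np.sqrt(W.shape[0])

# Monte Carlo over random networks: W ~ N(0, I), v ~ N(0, I).
outs = np.empty((n_nets, 2))
for i in range(n_nets):
    W = rng.normal(size=(width, d))
    v = rng.normal(size=width)
    outs[i] = f(x, W, v), f(y, W, v)

emp_cov = np.cov(outs.T)[0, 1]

# Analytic NNGP kernel for ReLU (arc-cosine kernel of degree 1):
# E[relu(w.x) relu(w.y)] = |x||y| (sin t + (pi - t) cos t) / (2 pi), t = angle(x, y)
t = np.arccos(np.clip(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)), -1, 1))
analytic = (np.linalg.norm(x) * np.linalg.norm(y)
            * (np.sin(t) + (np.pi - t) * np.cos(t)) / (2 * np.pi))

fx = outs[:, 0]
excess_kurtosis = ((fx - fx.mean()) ** 4).mean() / fx.var() ** 2 - 3
print(f"empirical cov {emp_cov:.3f} vs arc-cosine kernel {analytic:.3f}")
print(f"excess kurtosis of f(x) over inits (Gaussian -> 0): {excess_kurtosis:.3f}")
```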
Planning problems are among the most important and well-studied problems in artificial intelligence. They are most typically solved by tree-search algorithms that simulate ahead into the future, evaluate future states, and back up those evaluations to the root of a search tree. Among these algorithms, Monte-Carlo tree search (MCTS) is one of the most general, powerful and widely used. A typical implementation of MCTS uses cleverly designed rules, optimized to the particular characteristics of the domain. These rules control where the simulation traverses, what to evaluate in the states that are reached, and how to back up those evaluations. In this paper we instead learn where, what and how to search. Our architecture, which we call an MCTSnet, incorporates simulation-based search inside a neural network, by expanding, evaluating and backing up a vector embedding. The parameters of the network are trained end-to-end using gradient-based optimisation. When applied to small searches in the well-known planning problem Sokoban, the learned search algorithm significantly outperformed MCTS baselines.
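For contrast with the learned search, here is a minimal hand-designed UCT-style MCTS on a toy problem (generic UCT, not MCTSnet, and not Sokoban; the environment and constants are made up for illustration). It shows the three pieces the abstract refers to: traversing simulations with a selection rule, evaluating reached states with random rollouts, and backing the evaluations up to the root.

```python
import math
import random

# Toy deterministic task (made up for illustration): start at position 0,
# actions move -1 or +1, the episode ends after 6 steps or on reaching +3,
# and the only reward is 1.0 for ending at position >= +3.
ACTIONS = (-1, +1)
HORIZON = 6

def step(state, action):
    pos, t = state
    return (pos + action, t + 1)

def is_terminal(state):
    pos, t = state
    return pos >= 3 or t >= HORIZON

def terminal_reward(state):
    return 1.0 if state[0] >= 3 else 0.0

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}      # action -> Node
        self.visits = 0
        self.value_sum = 0.0

    def ucb_child(self, c=1.4):
        # Selection rule: descend to the child maximizing UCB1.
        def ucb(child):
            return child.value_sum / child.visits + c * math.sqrt(
                math.log(self.visits) / child.visits)
        return max(self.children.values(), key=ucb)

def rollout(state):
    # Evaluation: play uniformly random actions until the episode ends.
    while not is_terminal(state):
        state = step(state, random.choice(ACTIONS))
    return terminal_reward(state)

def mcts(root_state, n_sims=500):
    root = Node(root_state)
    for _ in range(n_sims):
        node = root
        # 1. Select: walk down while the node is fully expanded and non-terminal.
        while not is_terminal(node.state) and len(node.children) == len(ACTIONS):
            node = node.ucb_child()
        # 2. Expand one untried action.
        if not is_terminal(node.state):
            action = random.choice([a for a in ACTIONS if a not in node.children])
            node.children[action] = Node(step(node.state, action), parent=node)
            node = node.children[action]
        # 3. Evaluate the reached state with a random rollout.
        value = rollout(node.state)
        # 4. Back up the evaluation along the path to the root.
        while node is not None:
            node.visits += 1
            node.value_sum += value
            node = node.parent
    # Recommend the most-visited root action.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

print("chosen first action:", mcts((0, 0)))  # should usually prefer +1
```

MCTSnet replaces the hand-designed selection, evaluation, and backup rules above with learned operations on vector embeddings, trained end-to-end.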
"it achieves state-of-the-art results in 12 summarization tasks spanning news, science, stories, instructions, emails, patents, and legislative bills, and that it shows “surprising” performance on low-resource summarization, surpassing previous top results on six data sets with only 1,000 examples."