Theoretical17 VI - More about Behaviors of DNNs
Presenter | Papers | Paper URL | Our Slides |
---|---|---|---|
SE | Equivariance Through Parameter-Sharing, ICML17 1 | ||
SE | Why Deep Neural Networks for Function Approximation?, ICLR17 2 | ||
SE | Geometry of Neural Network Loss Surfaces via Random Matrix Theory, ICML17 3 | ||
| Sharp Minima Can Generalize For Deep Nets, ICML17 4 | ||
- Sharp Minima Can Generalize For Deep Nets, ICML17 / Laurent Dinh, Razvan Pascanu, Samy Bengio, Yoshua Bengio / Despite their overwhelming capacity to overfit, deep learning architectures tend to generalize relatively well to unseen data, allowing them to be deployed in practice. However, explaining why this is the case is still an open area of research. One standing hypothesis that is gaining popularity, e.g. Hochreiter & Schmidhuber (1997); Keskar et al. (2017), is that the flatness of minima of the loss function found by stochastic gradient based methods results in good generalization. This paper argues that most notions of flatness are problematic for deep models and cannot be directly applied to explain generalization. Specifically, when focusing on deep networks with rectifier units, we can exploit the particular geometry of parameter space induced by the inherent symmetries that these architectures exhibit to build equivalent models corresponding to arbitrarily sharper minima. Furthermore, if we allow a function to be reparametrized, the geometry of its parameters can change drastically without affecting its generalization properties. ↩
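As a quick illustration of the rescaling symmetry the paper exploits, here is a minimal NumPy sketch (my own toy example, not the authors' code; the two-layer ReLU net, the random data, and the finite-difference sharpness proxy are all assumptions made purely for illustration): rescaling (W1, W2) to (αW1, W2/α) leaves the function, and hence the loss and generalization, untouched, while a naive curvature-based sharpness measure at the same point can be made arbitrarily large.

```python
# Minimal sketch of the non-negative homogeneity argument for ReLU nets:
# (W1, W2) -> (a*W1, W2/a) gives the same function but a "sharper" minimum
# under a naive Hessian-diagonal sharpness proxy.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))            # toy inputs (assumed for illustration)
Y = rng.normal(size=(64, 1))            # toy targets
W1, W2 = rng.normal(size=(5, 8)), rng.normal(size=(8, 1))

def forward(W1, W2, X):
    return np.maximum(X @ W1, 0.0) @ W2          # two-layer ReLU network

def loss(W1, W2):
    return np.mean((forward(W1, W2, X) - Y) ** 2)

def sharpness_proxy(W1, W2, eps=1e-4):
    """Crude sharpness measure: largest diagonal entry of the loss Hessian,
    estimated by central finite differences over every weight."""
    params = [W1.copy(), W2.copy()]
    diag = []
    for W in params:
        for idx in np.ndindex(W.shape):
            orig = W[idx]
            W[idx] = orig + eps; lp = loss(*params)
            W[idx] = orig - eps; lm = loss(*params)
            W[idx] = orig
            diag.append((lp - 2 * loss(*params) + lm) / eps ** 2)
    return max(diag)

for a in [1.0, 10.0, 100.0]:
    W1a, W2a = a * W1, W2 / a                    # equivalent reparametrization
    same = np.allclose(forward(W1, W2, X), forward(W1a, W2a, X))
    print(f"alpha={a:6.1f}  same function: {same}  "
          f"sharpness proxy: {sharpness_proxy(W1a, W2a):.3e}")
```

The printed outputs stay identical across α while the sharpness proxy grows roughly like α², which is the sense in which "flatness" alone cannot explain generalization.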
- Measuring Sample Quality with Kernels, NIPS15 / To improve the efficiency of Monte Carlo estimation, practitioners are turning to biased Markov chain Monte Carlo procedures that trade off asymptotic exactness for computational speed. The reasoning is sound: a reduction in variance due to more rapid sampling can outweigh the bias introduced. However, the inexactness creates new challenges for sampler and parameter selection, since standard measures of sample quality like effective sample size do not account for asymptotic bias. To address these challenges, we introduce a new computable quality measure based on Stein’s method that quantifies the maximum discrepancy between sample and target expectations over a large class of test functions. We use our tool to compare exact, biased, and deterministic sample sequences and illustrate applications to hyperparameter selection, convergence rate assessment, and quantifying bias-variance tradeoffs in posterior inference. ↩
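For a concrete feel of the quality measure, below is a simplified 1-D sketch of a kernelized Stein discrepancy (my own illustration under assumed choices: a standard-normal target, an RBF base kernel, and a plain V-statistic estimator, which need not match the paper's exact construction). The point it demonstrates is that only the target's score function is needed, so no normalizing constant, and that a biased sampler is assigned a visibly larger discrepancy than an exact one.

```python
# Simplified 1-D kernel Stein discrepancy sketch (assumed choices:
# standard-normal target, RBF base kernel, V-statistic estimator).
import numpy as np

def ksd(x, score, h=1.0):
    """Kernel Stein discrepancy of samples x against a target with the
    given score function grad log p, using an RBF kernel of bandwidth h."""
    x = np.asarray(x, dtype=float)
    d = x[:, None] - x[None, :]                   # pairwise differences
    k = np.exp(-d ** 2 / (2 * h ** 2))            # base kernel k(x, y)
    dkx = -d / h ** 2 * k                         # d/dx k(x, y)
    dky = d / h ** 2 * k                          # d/dy k(x, y)
    dkxy = (1 / h ** 2 - d ** 2 / h ** 4) * k     # d^2/dxdy k(x, y)
    s = score(x)
    kp = (s[:, None] * s[None, :] * k             # Stein kernel k_p(x, y)
          + s[:, None] * dky
          + s[None, :] * dkx
          + dkxy)
    return np.sqrt(kp.mean())

score = lambda x: -x                              # grad log N(0, 1)
rng = np.random.default_rng(0)
good = rng.normal(size=500)                       # exact sample
biased = rng.normal(loc=0.5, size=500)            # biased "fast" sampler
print("KSD exact :", ksd(good, score))
print("KSD biased:", ksd(biased, score))          # noticeably larger
```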
- Equivariance Through Parameter-Sharing, ICML17 / We propose to study equivariance in deep neural networks through parameter symmetries. In particular, given a group G that acts discretely on the input and output of a standard neural network layer φ_W : ℝ^M → ℝ^N, we show that φ_W is equivariant with respect to G-action iff G explains the symmetries of the network parameters W. Inspired by this observation, we then propose two parameter-sharing schemes to induce the desirable symmetry on W. Our procedures for tying the parameters achieve G-equivariance and, under some conditions on the action of G, they guarantee sensitivity to all other permutation groups outside G. ↩
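A toy special case (my own simplification, not the paper's general construction) makes the parameter-sharing idea concrete: tying the weights of a linear layer into a circulant pattern makes φ_W equivariant to the cyclic group of shifts, i.e. φ_W(shift(x)) = shift(φ_W(x)), while a generic untied W is not.

```python
# Circulant weight tying as a toy example of equivariance through
# parameter sharing: the tied layer commutes with cyclic shifts.
import numpy as np

rng = np.random.default_rng(0)
M = 6
w = rng.normal(size=M)                            # M free parameters, shared
W = np.stack([np.roll(w, i) for i in range(M)])   # circulant weight tying

phi = lambda x: W @ x                             # the tied linear layer
x = rng.normal(size=M)

shifted_then_mapped = phi(np.roll(x, 1))          # act on the input first
mapped_then_shifted = np.roll(phi(x), 1)          # act on the output instead
print(np.allclose(shifted_then_mapped, mapped_then_shifted))  # True

# An untied (generic) W breaks the equivariance:
W_untied = rng.normal(size=(M, M))
print(np.allclose(W_untied @ np.roll(x, 1), np.roll(W_untied @ x, 1)))  # False
```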
- Geometry of Neural Network Loss Surfaces via Random Matrix Theory, ICML17 / Understanding the geometry of neural network loss surfaces is important for the development of improved optimization algorithms and for building a theoretical understanding of why deep learning works. In this paper, we study the geometry in terms of the distribution of eigenvalues of the Hessian matrix at critical points of varying energy. We introduce an analytical framework and a set of tools from random matrix theory that allow us to compute an approximation of this distribution under a set of simplifying assumptions. The shape of the spectrum depends strongly on the energy and another key parameter, ϕ, which measures the ratio of parameters to data points. Our analysis predicts and numerical simulations support that for critical points of small index, the number of negative eigenvalues scales like the 3/2 power of the energy. We leave as an open problem an explanation for our observation that, in the context of a certain memorization task, the energy of minimizers is well-approximated by the function 1/2(1−ϕ)^2. ↩
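To get a feel for the random-matrix picture, here is a short NumPy sketch that follows the paper's simplifying model of the Hessian at a critical point as a Wishart-like part plus a Wigner-like part; how the two parts are weighted by the energy ε, and the particular value of ϕ, are chosen here purely for illustration rather than taken from the paper.

```python
# Toy Wishart-plus-Wigner sketch of the Hessian spectrum at a critical
# point. The sqrt(eps) weighting of the Wigner part and phi = 0.5 are
# illustrative assumptions, not the paper's exact parameterization.
import numpy as np

rng = np.random.default_rng(0)

def hessian_spectrum(n_params=400, phi=0.5, eps=0.1):
    n_data = int(n_params / phi)                  # phi = parameters / data points
    # Wishart-like positive-semidefinite part (Gauss-Newton-style term)
    J = rng.normal(size=(n_params, n_data)) / np.sqrt(n_data)
    H0 = J @ J.T
    # Wigner-like symmetric part, with weight growing with the energy
    A = rng.normal(size=(n_params, n_params)) / np.sqrt(n_params)
    H1 = np.sqrt(eps) * (A + A.T) / np.sqrt(2)
    return np.linalg.eigvalsh(H0 + H1)

for eps in [0.0, 0.1, 1.0]:
    ev = hessian_spectrum(eps=eps)
    frac_neg = np.mean(ev < 0)                    # index of the critical point
    print(f"eps={eps:4.1f}  fraction of negative eigenvalues: {frac_neg:.3f}")
```

The fraction of negative eigenvalues (the index) grows with ε, illustrating how the shape of the spectrum depends on the energy and on ϕ.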