Optimization I  Understanding DNN Optimization
31 Oct 2017 4Optimization optimization Curriculum DifferentiationPresenter  Papers  Paper URL  Our Slides 

Ceyer  An overview of gradient optimization algorithms, ^{1}  
Shijia  Osborne  Probabilistic numerics for deep learning ^{2}  DLSS 2017 + Video  PDF / PDF2 
Jack  Automated Curriculum Learning for Neural Networks, ICML17 ^{3}  
DLSS17  Johnson  Automatic Differentiation ^{4}  slide + video 

_{ An overview of gradient optimization algorithms, by S Ruder  2016 Cite as: arXiv:1609.04747 / Gradient descent optimization algorithms, while increasingly popular, are often used as blackbox optimizers, as practical explanations of their strengths and weaknesses are hard to come by. This article aims to provide the reader with intuitions with regard to the behaviour of different algorithms that will allow her to put them to use. In the course of this overview, we look at different variants of gradient descent, summarize challenges, introduce the most common optimization algorithms, review architectures in a parallel and distributed setting, and investigate additional strategies for optimizing gradient descent. } ↩

_{ Osborne  Probabilistic numerics for deep learning / http://probabilisticnumerics.org/ Numerical algorithms, such as methods for the numerical solution of integrals and ordinary differential equations, as well as optimization algorithms can be interpreted as estimation rules. They estimate the value of a latent, intractable quantity – the value of an integral, the solution of a differential equation, the location of an extremum – given the result of tractable computations (“observations”, such as function values of the integrand, evaluations of the differential equation, function values of the gradient of an objective). So these methods perform inference, and are accessible to the formal frameworks of probability theory. They are learning machines. Taking this observation seriously, a probabilistic numerical method is a numerical algorithm that takes in a probability distribution over its inputs, and returns a probability distribution over its output. Recent research shows that it is in fact possible to directly identify existing numerical methods, including some real classics, with specific probabilistic models. Interpreting numerical methods as learning algorithms offers various benefits. It can offer insight into the algebraic assumptions inherent in existing methods. As a joint framework for methods developed in separate communities, it allows transfer of knowledge among these areas. But the probabilistic formulation also explicitly provides a richer output than simple convergence bounds. If the probability measure returned by a probabilistic method is wellcalibrated, it can be used to monitor, propagate and control the quality of computations. } ↩

_{ Johnson  Automatic Differentiation and more: The simple essence of automatic differentiation / https://arxiv.org/abs/1804.00746 / Automatic differentiation (AD) in reverse mode (RAD) is a central component of deep learning and other uses of largescale optimization. Commonly used RAD algorithms such as backpropagation, however, are complex and stateful, hindering deep understanding, improvement, and parallel execution. This paper develops a simple, generalized AD algorithm calculated from a simple, natural specification. The general algorithm is then specialized by varying the representation of derivatives. In particular, applying wellknown constructions to a naive representation yields two RAD algorithms that are far simpler than previously known. In contrast to commonly used RAD implementations, the algorithms defined here involve no graphs, tapes, variables, partial derivatives, or mutation. They are inherently parallelfriendly, correct by construction, and usable directly from an existing programming language with no need for new data types or programming style, thanks to use of an ADagnostic compiler plugin. } ↩

_{ Automated Curriculum Learning for Neural Networks, ICML17 / lex Graves, Marc G. Bellemare, Jacob Menick, Remi Munos, Koray Kavukcuoglu/ We introduce a method for automatically selecting the path, or syllabus, that a neural network follows through a curriculum so as to maximise learning efficiency. A measure of the amount that the network learns from each data sample is provided as a reward signal to a nonstationary multiarmed bandit algorithm, which then determines a stochastic syllabus. We consider a range of signals derived from two distinct indicators of learning progress: rate of increase in prediction accuracy, and rate of increase in network complexity. Experimental results for LSTM networks on three curricula demonstrate that our approach can significantly accelerate learning, in some cases halving the time required to attain a satisfactory performance level. } ↩