Optimization17: Optimization in DNN
| Presenter | Papers | Paper URL | Our Slides |
| --- | --- | --- | --- |
| Muthu | Optimization Methods for Large-Scale Machine Learning, Léon Bottou, Frank E. Curtis, Jorge Nocedal ^{1} | | |
| Muthu | Fast Training of Recurrent Networks Based on EM Algorithm (1998) ^{2} | | |
| Muthu | FitNets: Hints for Thin Deep Nets, ICLR15 ^{3} | | |
| Muthu | Two NIPS 2015 Deep Learning Optimization Papers | | |
| Muthu | Difference Target Propagation (2015) ^{4} | | |

_{ Optimization Methods for Large-Scale Machine Learning, Léon Bottou, Frank E. Curtis, Jorge Nocedal / This paper provides a review and commentary on the past, present, and future of numerical optimization algorithms in the context of machine learning applications. Through case studies on text classification and the training of deep neural networks, we discuss how optimization problems arise in machine learning and what makes them challenging. A major theme of our study is that large-scale machine learning represents a distinctive setting in which the stochastic gradient (SG) method has traditionally played a central role while conventional gradient-based nonlinear optimization techniques typically falter. Based on this viewpoint, we present a comprehensive theory of a straightforward, yet versatile SG algorithm, discuss its practical behavior, and highlight opportunities for designing algorithms with improved performance. This leads to a discussion about the next generation of optimization methods for large-scale machine learning, including an investigation of two main streams of research on techniques that diminish noise in the stochastic directions and methods that make use of second-order derivative approximations. } ↩
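The SG method the abstract centers on can be illustrated with a minimal sketch: a single-sample stochastic gradient loop on a synthetic least-squares problem. The data, step size, and iteration count below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Synthetic least-squares problem (illustrative data, not from the paper)
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)

# Basic SG loop: one randomly sampled example per step, fixed step size
w = np.zeros(d)
alpha = 0.05
for step in range(2000):
    i = rng.integers(n)                  # sample one example uniformly
    grad = (X[i] @ w - y[i]) * X[i]      # gradient of 0.5 * (x_i . w - y_i)^2
    w -= alpha * grad

loss = 0.5 * np.mean((X @ w - y) ** 2)   # full-batch loss for monitoring
```

A fixed step size leaves a small noise floor around the optimum; the diminishing-step schedules and noise-reduction/second-order variants surveyed in the paper are aimed at exactly that behavior.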

_{ Fast Training of Recurrent Networks Based on EM Algorithm (1998) / In this work, a probabilistic model is established for recurrent networks. The expectation-maximization (EM) algorithm is then applied to derive a new fast training algorithm for recurrent networks through mean-field approximation. This new algorithm converts training a complicated recurrent network into training an array of individual feedforward neurons. These neurons are then trained via a linear weighted regression algorithm. The training time has been improved by five to 15 times on benchmark problems. Published in: IEEE Transactions on Neural Networks (Volume: 9, Issue: 1, Jan 1998) } ↩
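The per-neuron M-step the abstract alludes to, training each neuron by linear weighted regression, can be sketched as a weighted least-squares solve. The data, targets, and mean-field weights below are made-up placeholders; the paper's actual E-step for computing those weights is not reproduced here.

```python
import numpy as np

# Illustrative per-neuron weighted regression step (assumed setup, not the
# paper's full algorithm): inputs X, targets t, and per-example mean-field
# weights w_mf that an E-step would supply.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # inputs to one neuron
t = X @ np.array([1.0, -2.0, 0.5])       # noiseless targets for that neuron
w_mf = rng.uniform(0.1, 1.0, size=100)   # placeholder mean-field weights

# M-step: solve the weighted normal equations (X^T W X) w = X^T W t
W = np.diag(w_mf)
w = np.linalg.solve(X.T @ W @ X, X.T @ W @ t)
```

Because each neuron's update reduces to a linear solve like this, the recurrent network can be trained as an array of independent regressions rather than by backpropagation through time.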

_{ FitNets: Hints for Thin Deep Nets, ICLR15 / Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, Yoshua Bengio (Submitted on 19 Dec 2014 (v1), last revised 27 Mar 2015 (this version, v4)) While depth tends to improve network performances, it also makes gradient-based training more difficult since deeper networks tend to be more nonlinear. The recently proposed knowledge distillation approach is aimed at obtaining small and fast-to-execute models, and it has shown that a student network could imitate the soft output of a larger teacher network or ensemble of networks. In this paper, we extend this idea to allow the training of a student that is deeper and thinner than the teacher, using not only the outputs but also the intermediate representations learned by the teacher as hints to improve the training process and final performance of the student. Because the student intermediate hidden layer will generally be smaller than the teacher's intermediate hidden layer, additional parameters are introduced to map the student hidden layer to the prediction of the teacher hidden layer. This allows one to train deeper students that can generalize better or run faster, a trade-off that is controlled by the chosen student capacity. For example, on CIFAR-10, a deep student network with almost 10.4 times less parameters outperforms a larger, state-of-the-art teacher network. } ↩
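The hint idea can be sketched as follows: a regressor maps the thinner student layer into the teacher's hint space, and an L2 "hint loss" between the two guides training. The layer sizes, the simple linear regressor, and the learning rate below are illustrative assumptions; FitNets itself uses a convolutional regressor for convolutional layers and also updates the student weights.

```python
import numpy as np

rng = np.random.default_rng(0)
B = 32                                  # batch size (illustrative)
teacher_h = rng.normal(size=(B, 64))    # teacher's intermediate "hint" layer
student_h = rng.normal(size=(B, 16))    # thinner student guided layer

# Regressor: extra parameters mapping the smaller student layer
# into the teacher's hint space
W_r = rng.normal(scale=0.1, size=(16, 64))
pred = student_h @ W_r

# Hint loss: L2 distance between the teacher hint and the regressed
# student features, averaged over the batch
hint_loss = 0.5 * np.mean(np.sum((teacher_h - pred) ** 2, axis=1))

# One gradient step on the regressor (the student would be updated too)
grad_W = student_h.T @ (pred - teacher_h) / B
W_r -= 0.01 * grad_W
```

The regressor exists only because the student layer is smaller than the teacher's; with matching widths the hint loss could compare the two layers directly.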

_{ Difference Target Propagation (2015) / 13 pages, 8 figures, Accepted in ECML/PKDD 2015 / Dong-Hyun Lee, Saizheng Zhang, Asja Fischer, Yoshua Bengio / Backpropagation has been the workhorse of recent successes of deep learning but it relies on infinitesimal effects (partial derivatives) in order to perform credit assignment. This could become a serious issue as one considers deeper and more nonlinear functions, e.g., consider the extreme case of nonlinearity where the relation between parameters and cost is actually discrete. Inspired by the biological implausibility of backpropagation, a few approaches have been proposed in the past that could play a similar credit assignment role. In this spirit, we explore a novel approach to credit assignment in deep networks that we call target propagation. The main idea is to compute targets rather than gradients, at each layer. Like gradients, they are propagated backwards. In a way that is related but different from previously proposed proxies for backpropagation which rely on a backwards network with symmetric weights, target propagation relies on autoencoders at each layer. Unlike backpropagation, it can be applied even when units exchange stochastic bits rather than real numbers. We show that a linear correction for the imperfectness of the autoencoders, called difference target propagation, is very effective to make target propagation actually work, leading to results comparable to backpropagation for deep networks with discrete and continuous units and denoising autoencoders and achieving state of the art for stochastic networks. } ↩
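The linear correction the abstract describes can be sketched in one line: the target for layer l-1 is computed from the layer-l target via the learned approximate inverse g, plus the correction term that cancels g's reconstruction error. The shapes and the toy decoder g below are illustrative assumptions, not the paper's trained autoencoders.

```python
import numpy as np

rng = np.random.default_rng(0)
W_g = rng.normal(size=(4, 4))
g = lambda h: np.tanh(h @ W_g)    # toy stand-in for a learned, imperfect inverse

h_prev = rng.normal(size=4)                   # activation at layer l-1
h_l = rng.normal(size=4)                      # activation at layer l
target_l = h_l + 0.1 * rng.normal(size=4)     # target handed down from layer l

# Difference target propagation: the correction (h_prev - g(h_l)) offsets
# the autoencoder's reconstruction error, so target_prev = h_prev + g(target_l) - g(h_l)
target_prev = h_prev + g(target_l) - g(h_l)
```

A useful property of this form: if the layer-l target equals the layer-l activation, the propagated target below is exactly h_prev, no matter how imperfect g is — plain target propagation (target_prev = g(target_l)) lacks this guarantee.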