Deep Learning Readings We Covered in 2017 Team Meetings


2017Reads (Index of Posts):

| No. | Read Date | Title and Information We Read | @ |
|-----|-----------|-------------------------------|---|
| 1 | Jan 12, 2017 | Basic16- Basic DNN Reads we finished for NLP/Text | 2017-team |
| 2 | Jan 12, 2017 | Basic16- Basic DNN Embedding we read for Ranking/QA | 2017-team |
| 3 | Jan 18, 2017 | Basic16- Basic Deep NN with Memory | 2017-team |
| 4 | Jan 19, 2017 | Basic16- Basic Deep NN and Robustness | 2017-team |
| 5 | Jan 20, 2017 | Basic16- DNN to be Scalable | 2017-team |
| 6 | Jan 21, 2017 | Basic BioMed Application Reads we finished before 2017 | 2017-team |
| 7 | Feb 22, 2017 | Reliable17- Secure Machine Learning | 2017-team |
| 8 | Apr 22, 2017 | Optimization17- Optimization in DNN | 2017-team |
| 9 | May 22, 2017 | Generative17- Generative Deep Networks | 2017-team |
| 10 | Jun 2, 2017 | Structures17- Adaptive Deep Networks I | 2017-team |
| 11 | Jun 22, 2017 | Structures17- Adaptive Deep Networks II | 2017-team |
| 12 | Jul 22, 2017 | Reliable17- Testing and Machine Learning Basics | 2017-team |

Basic16- Basic DNN Reads we finished for NLP/Text

Tags: 0Basics, 2Architecture, 9DiscreteApp, embedding, text, BERT, seq2seq, attention, NLP, curriculum, BackProp, relational

| Presenter | Papers | Paper URL | Our Slides |
|-----------|--------|-----------|------------|
| NLP | A Neural Probabilistic Language Model | PDF | |
| Text | Bag of Tricks for Efficient Text Classification | PDF | |
| Text | Character-level Convolutional Networks for Text Classification | PDF | |
| NLP | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | PDF | |
| seq2seq | Neural Machine Translation by Jointly Learning to Align and Translate | PDF | |
| NLP | Natural Language Processing (almost) from Scratch | PDF | |
| Train | Curriculum Learning | PDF | |
| Muthu | NeurIPS Embedding Papers survey, 2012 to 2015 | NIPS | PDF |
| Basics | Efficient BackProp | PDF | |

Basic16- Basic DNN Embedding we read for Ranking/QA

Tags: 0Basics, 2Architecture, 9DiscreteApp, embedding, recommendation, QA, graph, relational

| Presenter | Papers | Paper URL | Our Slides |
|-----------|--------|-----------|------------|
| QA | Learning to rank with (a lot of) word features | PDF | |
| Relation | A semantic matching energy function for learning with multi-relational data | PDF | |
| Relation | Translating embeddings for modeling multi-relational data | PDF | |
| QA | Reading Wikipedia to answer open-domain questions | PDF | |
| QA | Question answering with subgraph embeddings | PDF | |

Basic16- Basic Deep NN with Memory

Tags: 0Basics, 2Architecture, 7MetaDomain, memory, NTM, seq2seq, pointer, set, attention, meta-learning, few-shot, matching net, metric-learning

| Presenter | Papers | Paper URL | Our Slides |
|-----------|--------|-----------|------------|
| seq2seq | Sequence to Sequence Learning with Neural Networks | PDF | |
| Set | Pointer Networks | PDF | |
| Set | Order Matters: Sequence to Sequence for Sets | PDF | |
| Point Attention | Multiple Object Recognition with Visual Attention | PDF | |
| Memory | End-To-End Memory Networks | PDF | Jack Survey |
| Memory | Neural Turing Machines | PDF | |
| Memory | Hybrid computing using a neural network with dynamic external memory | PDF | |
| Muthu | Matching Networks for One Shot Learning (NIPS16) [1] | PDF | PDF |
| Jack | Meta-Learning with Memory-Augmented Neural Networks (ICML16) [2] | PDF | PDF |
| Metric | ICML07 Best Paper - Information-Theoretic Metric Learning | PDF | |
  1. Matching Networks for One Shot Learning (NIPS16): Learning from a few examples remains a key challenge in machine learning. Despite recent advances in important domains such as vision and language, the standard supervised deep learning paradigm does not offer a satisfactory solution for learning new concepts rapidly from little data. In this work, we employ ideas from metric learning based on deep neural features and from recent advances that augment neural networks with external memories. Our framework learns a network that maps a small labelled support set and an unlabelled example to its label, obviating the need for fine-tuning to adapt to new class types. We then define one-shot learning problems on vision (using Omniglot, ImageNet) and language tasks. Our algorithm improves one-shot accuracy on ImageNet from 87.6% to 93.2% and from 88.0% to 93.8% on Omniglot compared to competing approaches. We also demonstrate the usefulness of the same model on language modeling by introducing a one-shot task on the Penn Treebank.  
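The set-to-label mapping this abstract describes reduces to attention over the support set. Below is a minimal NumPy sketch of that mechanism (our own illustration, not the authors' code): the query attends over support embeddings with a cosine-similarity softmax, and the prediction is the attention-weighted sum of support labels. The `embed` placeholder stands in for the deep embedding networks trained end to end in the paper.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def matching_net_predict(support_x, support_y, query_x, embed=lambda x: x):
    """Attention-based one-shot classification (sketch).
    support_x: (k, d) support examples; support_y: (k, c) one-hot labels;
    query_x: (d,) unlabelled query."""
    sims = np.array([cosine(embed(x), embed(query_x)) for x in support_x])
    att = np.exp(sims) / np.exp(sims).sum()   # softmax attention over the support set
    return att @ support_y                    # (c,) predicted label distribution

# toy usage: 3-way, one-shot
support_x = np.eye(3)                         # three "embedded" support examples
support_y = np.eye(3)                         # their one-hot labels
print(matching_net_predict(support_x, support_y, np.array([0.9, 0.1, 0.0])))
```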

  2. Meta-Learning with Memory-Augmented Neural Networks (ICML16) Despite recent breakthroughs in the applications of deep neural networks, one setting that presents a persistent challenge is that of “one-shot learning.” Traditional gradient-based networks require a lot of data to learn, often through extensive iterative training. When new data is encountered, the models must inefficiently relearn their parameters to adequately incorporate the new information without catastrophic interference. Architectures with augmented memory capacities, such as Neural Turing Machines (NTMs), offer the ability to quickly encode and retrieve new information, and hence can potentially obviate the downsides of conventional models. Here, we demonstrate the ability of a memory-augmented neural network to rapidly assimilate new data, and leverage this data to make accurate predictions after only a few samples. We also introduce a new method for accessing an external memory that focuses on memory content, unlike previous methods that additionally use memory location-based focusing mechanisms.  
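The "focuses on memory content" part of this abstract is cosine-similarity addressing. A minimal NumPy sketch (ours; it omits the paper's LSTM controller and usage-based write rule): similarity between a key and each memory row gives softmax read weights, and the read vector is the weighted sum of rows.

```python
import numpy as np

def content_read(memory, key, beta=1.0):
    """Content-based memory read (NTM/MANN-style addressing, sketch).
    memory: (n_slots, width); key: (width,); beta sharpens the focus."""
    norms = np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8
    sims = memory @ key / norms        # cosine similarity per slot
    w = np.exp(beta * sims)
    w /= w.sum()                       # softmax read weights
    return w @ memory, w               # read vector and the weights used

mem = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
r, w = content_read(mem, np.array([1.0, 0.1]), beta=5.0)
print(np.round(w, 3), np.round(r, 3))
```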

Basic16- Basic Deep NN and Robustness

Tags: 0Basics, 3Reliable, Adversarial-Examples, robustness, visualizing, Interpretable, Certified-Defense

| Presenter | Papers | Paper URL | Our Slides |
|-----------|--------|-----------|------------|
| AE | Intriguing properties of neural networks | PDF | |
| AE | Explaining and Harnessing Adversarial Examples | PDF | |
| AE | Towards Deep Learning Models Resistant to Adversarial Attacks | PDF | |
| AE | DeepFool: a simple and accurate method to fool deep neural networks | PDF | |
| AE | Towards Evaluating the Robustness of Neural Networks, by Carlini and Wagner | PDF | PDF |
| Data | Basic Survey of ImageNet - LSVRC competition | URL | PDF |
| Understand | Understanding Black-box Predictions via Influence Functions | PDF | |
| Understand | Deep inside convolutional networks: Visualising image classification models and saliency maps | PDF | |
| Understand | Been Kim, Interpretable Machine Learning, ICML17 Tutorial [^1] | PDF | |
| Provable | Provable defenses against adversarial examples via the convex outer adversarial polytope, Eric Wong and J. Zico Kolter | URL | |

[^1]: Notes on Interpretability in Machine Learning, from Been Kim's ICML17 tutorial, taken by Brandon Liu.

Important Criteria in ML Systems
  • Safety
  • Nondiscrimination
  • Avoiding technical debt
  • Providing the right to explanation
  • Example: self-driving cars and other autonomous vehicles, where it is almost impossible to enumerate all possible unit tests.
What is interpretability?
  • The ability to give explanations to humans.
Two Branches of Interpretability
  • In the context of an application: if the system is useful in either a practical application or a simplified version of it, then it must be somehow interpretable.
  • Via a quantifiable proxy: a researcher might first claim that some model class—e.g. sparse linear models, rule lists, gradient boosted trees—are interpretable and then present algorithms to optimize within that class.
Before building any model
  • Visualization
  • Exploratory data analysis
Building a new model
  • Rule-based, per-feature-based
  • Case-based
  • Sparsity
  • Monotonicity
After building a model
  • Sensitivity analysis and gradient-based methods (see the saliency sketch below)
  • Mimic/surrogate models
  • Investigation of hidden layers
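As a concrete example of the sensitivity-analysis bucket, here is a minimal black-box sketch (our illustration, assuming NumPy; `sensitivity` is a made-up name). It perturbs each input dimension and measures how much the model's score moves; gradient-based saliency replaces this loop with a single backward pass through the network.

```python
import numpy as np

def sensitivity(f, x, eps=1e-4):
    """Finite-difference saliency: per-dimension effect on f's scalar score.
    f: any callable mapping an input vector to a scalar (treated as a black box)."""
    base = f(x)
    grad = np.zeros_like(x)
    for i in range(x.size):
        xp = x.copy()
        xp[i] += eps
        grad[i] = (f(xp) - base) / eps
    return np.abs(grad)                # per-feature importance scores

# toy model: the score depends mostly on feature 0
f = lambda x: 3.0 * x[0] + 0.1 * x[1] ** 2
print(sensitivity(f, np.array([1.0, 2.0, 3.0])))
```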

Basic16- DNN to be Scalable

Tags: 0Basics, 2Architecture, 8Scalable, scalable, random, sparsity, binary, hash, compression, low-rank, distributed, dimension reduction, pruning, sketch, parallel

| Presenter | Papers | Paper URL | Our Slides |
|-----------|--------|-----------|------------|
| Scalable | Sanjiv Kumar (Columbia EECS 6898), Lecture: Introduction to large-scale machine learning, 2010 (syllabus below) | PDF | |
| Scalable | Alex Smola - Berkeley SML: Scalable Machine Learning, syllabus 2012 (syllabus below) | PDF / 2014 PDF | |
| Binary | Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1 | | |
| Model | Binary embeddings with structured hashed projections [1] | PDF | PDF |
| Model | Deep Compression: Compressing Deep Neural Networks (ICLR 2016) [2] | PDF | PDF |

Syllabus from the Sanjiv Kumar course (a randomized low-rank SVD sketch follows the list):

+ Randomized Algorithms
+ Matrix Approximations I (low-rank approximation, decomposition)
+ Matrix Approximations II (sparse matrices, matrix completion)
+ Approximate Nearest Neighbor Search I (trees)
+ Approximate Nearest Neighbor Search II (hashes)
+ Fast Optimization (first-order methods)
+ Kernel Methods I (fast training)
+ Kernel Methods II (fast testing)
+ Dimensionality Reduction (linear and nonlinear methods)
+ Sparse Methods/Streaming (sparse coding...)
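To make the "Randomized Algorithms" and "Matrix Approximations I" items concrete, here is a small randomized low-rank SVD in NumPy (our illustration of the Halko/Martinsson/Tropp-style recipe, not course code; the rank and oversampling values are arbitrary):

```python
import numpy as np

def randomized_svd(A, rank, oversample=5, seed=0):
    """Randomized low-rank SVD: project onto a random sketch of A's range,
    then do an exact SVD on the much smaller projected matrix."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((A.shape[1], rank + oversample))
    Q, _ = np.linalg.qr(A @ Omega)            # orthonormal basis for approx. range
    B = Q.T @ A                               # small (rank+p) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :rank], s[:rank], Vt[:rank]

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 10)) @ rng.standard_normal((10, 100))  # true rank 10
U, s, Vt = randomized_svd(A, rank=10)
print(np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A))          # ~1e-13
```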

Syllabus from the Smola course (a CountMin sketch example follows this syllabus):

SML: Systems
  • Hardware: Processor, RAM, buses, GPU, disk, SSD, network, switches, racks, server centers / Bandwidth, latency and faults
  • Basic parallelization paradigms: Trees, stars, rings, queues / Hashing (consistent, proportional) / Distributed hash tables and P2P
  • Storage: RAID / Google File System / HadoopFS / Distributed (key, value) storage
  • Processing: MapReduce / Dryad / S4 / stream processing
  • Structured access beyond SQL: BigTable / Cassandra
SML: Data Streams
  • Random subsets: Hash functions / Approximate moments / Alon-Matias-Szegedy sketch
  • Heavy hitter detection: Lossy counting / Space saving
  • Randomized statistics: Flajolet counter / Bloom filter and logarithmic randomized techniques / CountMin sketch
SML: optimization:
  • Batch methods: Distributed subgradient / Bundle methods
  • Online methods: Unconstrained subgradient / Gradient projections / Parallel optimization
  • Efficient Kernel algorithms: Dual space (using α) / Reduced dimensionality (low rank expansions) / Function space (using fast Kα) / Primal space (hashing & random kitchen sinks)
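As promised above, a minimal CountMin sketch in plain Python (our illustration; the width and depth values are arbitrary). It answers approximate frequency queries in sublinear space, and by construction only ever overestimates:

```python
import random

class CountMin:
    """CountMin sketch: approximate counts over a stream in fixed space."""
    def __init__(self, width=1024, depth=4, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        for row, salt in enumerate(self.salts):
            yield row, hash((salt, item)) % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def query(self, item):
        # min over rows bounds the overcount caused by hash collisions
        return min(self.table[row][col] for row, col in self._cells(item))

cm = CountMin()
for word in ["cat"] * 100 + ["dog"] * 3:
    cm.add(word)
print(cm.query("cat"), cm.query("dog"))   # ~100, ~3 (never underestimates)
```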
  1. Binary embeddings with structured hashed projections / Anna Choromanska, Krzysztof Choromanski, Mariusz Bojarski, Tony Jebara, Sanjiv Kumar, Yann LeCun (Submitted on 16 Nov 2015 (v1), last revised 1 Jul 2016 (this version, v5)) / We consider the hashing mechanism for constructing binary embeddings, that involves pseudo-random projections followed by nonlinear (sign function) mappings. The pseudo-random projection is described by a matrix, where not all entries are independent random variables but instead a fixed “budget of randomness” is distributed across the matrix. Such matrices can be efficiently stored in sub-quadratic or even linear space, provide reduction in randomness usage (i.e. number of required random values), and very often lead to computational speed ups. We prove several theoretical results showing that projections via various structured matrices followed by nonlinear mappings accurately preserve the angular distance between input high-dimensional vectors. To the best of our knowledge, these results are the first that give theoretical ground for the use of general structured matrices in the nonlinear setting. In particular, they generalize previous extensions of the Johnson-Lindenstrauss lemma and prove the plausibility of the approach that was so far only heuristically confirmed for some special structured matrices. Consequently, we show that many structured matrices can be used as an efficient information compression mechanism. Our findings build a better understanding of certain deep architectures, which contain randomly weighted and untrained layers, and yet achieve high performance on different learning tasks. We empirically verify our theoretical findings and show the dependence of learning via structured hashed projections on the performance of neural network as well as nearest neighbor classifier.
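A tiny NumPy sketch of the core mechanism (ours; we use a dense Gaussian projection where the paper studies structured/hashed matrices): with b(x) = sign(Ax), the expected fraction of disagreeing bits equals the angle between the inputs divided by π, so Hamming distance estimates angular distance.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 4096                       # input dimension, binary code length
A = rng.standard_normal((m, d))       # dense Gaussian stand-in for a structured matrix

def code(x):
    return np.sign(A @ x)             # binary embedding b(x) = sign(Ax)

x = rng.standard_normal(d)
y = x + 0.3 * rng.standard_normal(d)
ham = np.mean(code(x) != code(y))     # fraction of disagreeing bits
true = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
print(f"estimated angle {np.pi * ham:.3f} vs true angle {true:.3f}")
```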

  2. Deep Compression: Compressing Deep Neural Networks (ICLR 2016) / Song Han, Huizi Mao, William J. Dally / conference paper at ICLR 2016 (oral) / Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. To address this limitation, we introduce “deep compression”, a three stage pipeline: pruning, trained quantization and Huffman coding, that work together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy. Our method first prunes the network by learning only the important connections. Next, we quantize the weights to enforce weight sharing, finally, we apply Huffman coding. After the first two steps we retrain the network to fine tune the remaining connections and the quantized centroids. Pruning reduces the number of connections by 9x to 13x; Quantization then reduces the number of bits that represent each connection from 32 to 5. On the ImageNet dataset, our method reduced the storage required by AlexNet by 35x, from 240MB to 6.9MB, without loss of accuracy. Our method reduced the size of VGG-16 by 49x from 552MB to 11.3MB, again with no loss of accuracy. This allows fitting the model into on-chip SRAM cache rather than off-chip DRAM memory. Our compression method also facilitates the use of complex neural networks in mobile applications where application size and download bandwidth are constrained. Benchmarked on CPU, GPU and mobile GPU, the compressed network has 3x to 4x layerwise speedup and 3x to 7x better energy efficiency.
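A toy NumPy sketch of the first two pipeline stages (ours, not the authors' code; the `keep` fraction and cluster count are arbitrary): magnitude pruning, then k-means weight sharing. Stage three would Huffman-code the resulting cluster indices.

```python
import numpy as np

def prune(W, keep=0.10):
    """Stage 1, magnitude pruning: zero all but the largest |weights|."""
    thresh = np.quantile(np.abs(W), 1.0 - keep)
    return np.where(np.abs(W) >= thresh, W, 0.0)

def quantize(W, n_clusters=16, iters=20):
    """Stage 2, weight sharing: nonzero weights snap to k-means centroids."""
    nz = W[W != 0]
    centroids = np.linspace(nz.min(), nz.max(), n_clusters)
    for _ in range(iters):
        assign = np.argmin(np.abs(nz[:, None] - centroids[None, :]), axis=1)
        for k in range(n_clusters):
            if np.any(assign == k):
                centroids[k] = nz[assign == k].mean()
    Wq = W.copy()
    Wq[W != 0] = centroids[assign]     # each weight stored as a small cluster index
    return Wq

W = np.random.default_rng(0).standard_normal((64, 64))
Wq = quantize(prune(W))
print(f"nonzeros: {np.mean(Wq != 0):.2%}, distinct values: {len(np.unique(Wq))}")
```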

Basic BioMed Application Reads we finished before 2017

Tags: 9DiscreteApp, DNA, RNA, protein, invariant, binary, random, relational

| Presenter | Papers | Paper URL | Our Slides |
|-----------|--------|-----------|------------|
| DeepBind | Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning | PDF | |
| DeepSEA | Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk | PDF | |
| DeepSEA | Deep Supervised and Convolutional Generative Stochastic Network for Protein Secondary Structure Prediction, ICML 2014 | | |
| BioBasics | A method for integrating and ranking the evidence for biochemical pathways by mining reactions from text, Bioinformatics 2013 | | |
| BioBasics | Efficient counting of k-mers in DNA sequences using a Bloom filter, Melsted P. and Pritchard J.K., BMC Bioinformatics 2011 | | |
| BioBasics | Fast String Kernels using Inexact Matching for Protein Sequences, JMLR 2004 | | |
| BioBasics | NIPS09: Locality-Sensitive Binary Codes from Shift-Invariant Kernels | | |
| MedSignal | Segmenting Time Series: A Survey and Novel Approach | PDF | |

Reliable17- Secure Machine Learning

Tags: 3Reliable, secure, Privacy, Cryptography

| Presenter | Papers | Paper URL | Our Slides |
|-----------|--------|-----------|------------|
| Tobin | Summary of a few papers on machine learning and cryptography (e.g., Learning to Protect Communications with Adversarial Neural Cryptography) [1] | PDF | PDF |
| Tobin | Privacy Aware Learning (NIPS12) [2] | PDF | PDF |
| Tobin | Can Machine Learning Be Secure? (2006) | PDF | PDF |
  1. Learning to protect communications with adversarial neural cryptography Abadi & Anderson, arXiv 2016: We ask whether neural networks can learn to use secret keys to protect information from other neural networks. Specifically, we focus on ensuring confidentiality properties in a multiagent system, and we specify those properties in terms of an adversary. Thus, a system may consist of neural networks named Alice and Bob, and we aim to limit what a third neural network named Eve learns from eavesdropping on the communication between Alice and Bob. We do not prescribe specific cryptographic algorithms to these neural networks; instead, we train end-to-end, adversarially. We demonstrate that the neural networks can learn how to perform forms of encryption and decryption, and also how to apply these operations selectively in order to meet confidentiality goals.  

  2. Privacy Aware Learning (NIPS12) / John C. Duchi, Michael I. Jordan, Martin J. Wainwright/ We study statistical risk minimization problems under a privacy model in which the data is kept confidential even from the learner. In this local privacy framework, we establish sharp upper and lower bounds on the convergence rates of statistical estimation procedures. As a consequence, we exhibit a precise tradeoff between the amount of privacy the data preserves and the utility, as measured by convergence rate, of any statistical estimator or learning procedure.  

Optimization17- Optimization in DNN

Tags: 4Optimization, optimization, scalable, EM, propagation, mimic

| Presenter | Papers | Paper URL | Our Slides |
|-----------|--------|-----------|------------|
| Muthu | Optimization Methods for Large-Scale Machine Learning, Léon Bottou, Frank E. Curtis, Jorge Nocedal [1] | PDF | PDF |
| Muthu | Fast Training of Recurrent Networks Based on the EM Algorithm (1998) [2] | PDF | PDF |
| Muthu | FitNets: Hints for Thin Deep Nets, ICLR15 [3] | PDF | PDF |
| Muthu | Two NIPS 2015 Deep Learning Optimization Papers | PDF | PDF |
| Muthu | Difference Target Propagation (2015) [4] | PDF | PDF |
  1. Optimization Methods for Large-Scale Machine Learning, Léon Bottou, Frank E. Curtis, Jorge Nocedal / This paper provides a review and commentary on the past, present, and future of numerical optimization algorithms in the context of machine learning applications. Through case studies on text classification and the training of deep neural networks, we discuss how optimization problems arise in machine learning and what makes them challenging. A major theme of our study is that large-scale machine learning represents a distinctive setting in which the stochastic gradient (SG) method has traditionally played a central role while conventional gradient-based nonlinear optimization techniques typically falter. Based on this viewpoint, we present a comprehensive theory of a straightforward, yet versatile SG algorithm, discuss its practical behavior, and highlight opportunities for designing algorithms with improved performance. This leads to a discussion about the next generation of optimization methods for large-scale machine learning, including an investigation of two main streams of research on techniques that diminish noise in the stochastic directions and methods that make use of second-order derivative approximations.  
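Since the survey centers on the stochastic gradient (SG) method, here is a minimal SG loop on a least-squares toy problem (our sketch, assuming NumPy; the step size and epoch count are arbitrary): one sampled example per parameter update.

```python
import numpy as np

def sgd(grad_fn, data, w0, lr=0.1, epochs=5, seed=0):
    """Plain stochastic gradient: update from one sampled example at a time."""
    rng, w = np.random.default_rng(seed), w0.copy()
    for _ in range(epochs):
        for i in rng.permutation(len(data)):   # one pass in random order
            w -= lr * grad_fn(w, data[i])
    return w

# toy least squares: loss_i = 0.5 * (w @ x_i - y_i)^2
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.standard_normal(500)
grad = lambda w, ex: (w @ ex[0] - ex[1]) * ex[0]
print(sgd(grad, list(zip(X, y)), np.zeros(3)))  # approaches w_true
```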

  2. Fast Training of Recurrent Networks Based on EM Algorithm (1998) / In this work, a probabilistic model is established for recurrent networks. The expectation-maximization (EM) algorithm is then applied to derive a new fast training algorithm for recurrent networks through mean-field approximation. This new algorithm converts training a complicated recurrent network into training an array of individual feedforward neurons. These neurons are then trained via a linear weighted regression algorithm. The training time has been improved by five to 15 times on benchmark problems. Published in: IEEE Transactions on Neural Networks ( Volume: 9 , Issue: 1 , Jan 1998 )  

  3. FitNets: Hints for Thin Deep Nets, ICLR15 / Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, Yoshua Bengio (Submitted on 19 Dec 2014 (v1), last revised 27 Mar 2015 (this version, v4)) While depth tends to improve network performances, it also makes gradient-based training more difficult since deeper networks tend to be more non-linear. The recently proposed knowledge distillation approach is aimed at obtaining small and fast-to-execute models, and it has shown that a student network could imitate the soft output of a larger teacher network or ensemble of networks. In this paper, we extend this idea to allow the training of a student that is deeper and thinner than the teacher, using not only the outputs but also the intermediate representations learned by the teacher as hints to improve the training process and final performance of the student. Because the student intermediate hidden layer will generally be smaller than the teacher’s intermediate hidden layer, additional parameters are introduced to map the student hidden layer to the prediction of the teacher hidden layer. This allows one to train deeper students that can generalize better or run faster, a trade-off that is controlled by the chosen student capacity. For example, on CIFAR-10, a deep student network with almost 10.4 times less parameters outperforms a larger, state-of-the-art teacher network.  
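The hint idea reduces to a simple intermediate-layer loss. A NumPy sketch (ours; `Wr` is the extra regressor the abstract mentions, which the paper learns jointly with the student):

```python
import numpy as np

def hint_loss(student_h, teacher_h, Wr):
    """FitNets hint training (sketch): a regressor Wr maps the thinner student
    hidden layer into the teacher's space; minimize the L2 gap between them."""
    diff = student_h @ Wr - teacher_h          # (batch, d_teacher)
    return 0.5 * np.mean(np.sum(diff ** 2, axis=1))

rng = np.random.default_rng(0)
student_h = rng.standard_normal((8, 32))       # thin student hidden layer
teacher_h = rng.standard_normal((8, 128))      # wider teacher hidden layer
Wr = rng.standard_normal((32, 128)) * 0.1      # learned jointly in the paper
print(hint_loss(student_h, teacher_h, Wr))
```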

  4. Difference Target Propagation (2015) / 13 pages, 8 figures, Accepted in ECML/PKDD 2015 / Dong-Hyun Lee, Saizheng Zhang, Asja Fischer, Yoshua Bengio/ Back-propagation has been the workhorse of recent successes of deep learning but it relies on infinitesimal effects (partial derivatives) in order to perform credit assignment. This could become a serious issue as one considers deeper and more non-linear functions, e.g., consider the extreme case of nonlinearity where the relation between parameters and cost is actually discrete. Inspired by the biological implausibility of back-propagation, a few approaches have been proposed in the past that could play a similar credit assignment role. In this spirit, we explore a novel approach to credit assignment in deep networks that we call target propagation. The main idea is to compute targets rather than gradients, at each layer. Like gradients, they are propagated backwards. In a way that is related but different from previously proposed proxies for back-propagation which rely on a backwards network with symmetric weights, target propagation relies on auto-encoders at each layer. Unlike back-propagation, it can be applied even when units exchange stochastic bits rather than real numbers. We show that a linear correction for the imperfectness of the auto-encoders, called difference target propagation, is very effective to make target propagation actually work, leading to results comparable to back-propagation for deep networks with discrete and continuous units and denoising auto-encoders and achieving state of the art for stochastic networks.  
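The paper's central update is compact enough to state directly; a LaTeX transcription of the difference target propagation rule (our paraphrase of the paper's notation):

```latex
% With feedforward activations h_l = f_l(h_{l-1}) and a learned approximate
% inverse (auto-encoder) g_l at each layer, targets propagate backwards as
\hat{h}_{l-1} = h_{l-1} + g_l(\hat{h}_l) - g_l(h_l)
% The correction term g_l(\hat{h}_l) - g_l(h_l) vanishes when \hat{h}_l = h_l,
% compensating for the imperfect inverse; each layer then minimizes the
% local loss \| f_l(h_{l-1}) - \hat{h}_l \|^2 instead of a backpropagated gradient.
```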

Generative17- Generative Deep Networks

Tags: 5Generative, generative, GAN

| Presenter | Papers | Paper URL | Our Slides |
|-----------|--------|-----------|------------|
| Tobin | Energy-Based Generative Adversarial Network [1] | PDF | PDF |
| Jack | Three Deep Generative Models | PDF | PDF |
  1. Energy-Based Generative Adversarial Network, Junbo Zhao, Michael Mathieu, Yann LeCun (Submitted on 11 Sep 2016 (v1), last revised 6 Mar 2017 (this version, v4))/ We introduce the “Energy-based Generative Adversarial Network” model (EBGAN) which views the discriminator as an energy function that attributes low energies to the regions near the data manifold and higher energies to other regions. Similar to the probabilistic GANs, a generator is seen as being trained to produce contrastive samples with minimal energies, while the discriminator is trained to assign high energies to these generated samples. Viewing the discriminator as an energy function allows to use a wide variety of architectures and loss functionals in addition to the usual binary classifier with logistic output. Among them, we show one instantiation of EBGAN framework as using an auto-encoder architecture, with the energy being the reconstruction error, in place of the discriminator. We show that this form of EBGAN exhibits more stable behavior than regular GANs during training. We also show that a single-scale architecture can be trained to generate high-resolution images.  
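The energy view fits in two lines; the discriminator and generator objectives from the paper, with margin m, energy function D, and generator G:

```latex
\mathcal{L}_D(x, z) = D(x) + \big[\, m - D(G(z)) \,\big]^{+}, \qquad
\mathcal{L}_G(z) = D\big(G(z)\big)
% In the auto-encoder instantiation described in the abstract, the energy is
% the reconstruction error: D(x) = \lVert \mathrm{Dec}(\mathrm{Enc}(x)) - x \rVert.
```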

Structures17- Adaptive Deep Networks I

Tags: 2Architecture, 8Scalable, low-rank, binary, dynamic, learn2learn, optimization

| Presenter | Papers | Paper URL | Our Slides |
|-----------|--------|-----------|------------|
| Arshdeep | HyperNetworks, David Ha, Andrew Dai, Quoc V. Le, ICLR 2017 [1] | PDF | PDF |
| Arshdeep | Learning feed-forward one-shot learners [2] | PDF | PDF |
| Arshdeep | Learning to Learn by gradient descent by gradient descent [3] | PDF | PDF |
| Arshdeep | Dynamic Filter Networks [4] | https://arxiv.org/abs/1605.09673 / PDF | PDF |
  1. HyperNetworks, David Ha, Andrew Dai, Quoc V. Le ICLR 2017 / This work explores hypernetworks: an approach of using one network, also known as a hypernetwork, to generate the weights for another network. Hypernetworks provide an abstraction that is similar to what is found in nature: the relationship between a genotype - the hypernetwork - and a phenotype - the main network. Though they are also reminiscent of HyperNEAT in evolution, our hypernetworks are trained end-to-end with backpropagation and thus are usually faster. The focus of this work is to make hypernetworks useful for deep convolutional networks and long recurrent networks, where hypernetworks can be viewed as relaxed form of weight-sharing across layers. Our main result is that hypernetworks can generate non-shared weights for LSTM and achieve near state-of-the-art results on a variety of sequence modelling tasks including character-level language modelling, handwriting generation and neural machine translation, challenging the weight-sharing paradigm for recurrent networks. Our results also show that hypernetworks applied to convolutional networks still achieve respectable results for image recognition tasks compared to state-of-the-art baseline models while requiring fewer learnable parameters.
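A minimal NumPy sketch of the genotype/phenotype idea (ours; the sizes and the purely linear hypernetwork are arbitrary simplifications of the paper): one small map H generates every main-network layer's weights from a per-layer embedding z, a relaxed form of weight sharing across layers.

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_in, d_out = 4, 16, 16        # layer-embedding size; main-layer shape

# hypernetwork: a small linear map from a layer embedding to that layer's weights
H = rng.standard_normal((d_in * d_out, d_z)) * 0.1

def layer_weights(z):
    """Generate one main-network layer's weight matrix from its embedding."""
    return (H @ z).reshape(d_in, d_out)

def main_net(x, z_per_layer):
    for z in z_per_layer:           # weights are generated on the fly, not stored
        x = np.tanh(x @ layer_weights(z))
    return x

zs = [rng.standard_normal(d_z) for _ in range(3)]  # 3 layers, all sharing H
print(main_net(rng.standard_normal(d_in), zs).shape)
```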

  2. Learning feed-forward one-shot learners, Arxiv 2016, Luca Bertinetto, João F. Henriques, Jack Valmadre, Philip H. S. Torr, Andrea Vedaldi/ One-shot learning is usually tackled by using generative models or discriminative embeddings. Discriminative methods based on deep learning, which are very effective in other learning scenarios, are ill-suited for one-shot learning as they need large amounts of training data. In this paper, we propose a method to learn the parameters of a deep model in one shot. We construct the learner as a second deep network, called a learnet, which predicts the parameters of a pupil network from a single exemplar. In this manner we obtain an efficient feed-forward one-shot learner, trained end-to-end by minimizing a one-shot classification objective in a learning to learn formulation. In order to make the construction feasible, we propose a number of factorizations of the parameters of the pupil network. We demonstrate encouraging results by learning characters from single exemplars in Omniglot, and by tracking visual objects from a single initial exemplar in the Visual Object Tracking benchmark.  

  3. Learning to Learn by gradient descent by gradient descent, Arxiv 2016/ The move from hand-designed features to learned features in machine learning has been wildly successful. In spite of this, optimization algorithms are still designed by hand. In this paper we show how the design of an optimization algorithm can be cast as a learning problem, allowing the algorithm to learn to exploit structure in the problems of interest in an automatic way. Our learned algorithms, implemented by LSTMs, outperform generic, hand-designed competitors on the tasks for which they are trained, and also generalize well to new tasks with similar structure. We demonstrate this on a number of tasks, including simple convex problems, training neural networks, and styling images with neural art.  

  4. Dynamic Filter Networks, Bert De Brabandere, Xu Jia, Tinne Tuytelaars, Luc Van Gool (Submitted on 31 May 2016 (v1), last revised 6 Jun 2016 (this version, v2))/ In a traditional convolutional layer, the learned filters stay fixed after training. In contrast, we introduce a new framework, the Dynamic Filter Network, where filters are generated dynamically conditioned on an input. We show that this architecture is a powerful one, with increased flexibility thanks to its adaptive nature, yet without an excessive increase in the number of model parameters. A wide variety of filtering operations can be learned this way, including local spatial transformations, but also others like selective (de)blurring or adaptive feature extraction. Moreover, multiple such layers can be combined, e.g. in a recurrent architecture. We demonstrate the effectiveness of the dynamic filter network on the tasks of video and stereo prediction, and reach state-of-the-art performance on the moving MNIST dataset with a much smaller model. By visualizing the learned filters, we illustrate that the network has picked up flow information by only looking at unlabelled training data. This suggests that the network can be used to pretrain networks for various supervised tasks in an unsupervised way, like optical flow and depth estimation.  
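A 1-D NumPy caricature of the idea (ours; the paper generates convolutional filters with a trained filter-generating network, for which a hand-written rule stands in here): the filter applied to the input is itself computed from that input.

```python
import numpy as np

def dynamic_filter_1d(signal, filter_net):
    """Dynamic filter layer (1-D sketch): the kernel is generated per input."""
    kernel = filter_net(signal)                 # input-conditioned filter
    return np.convolve(signal, kernel, mode="same")

def filter_net(signal):
    """Toy stand-in for the filter-generating network: smooth more when noisy."""
    width = 3 if np.std(np.diff(signal)) < 0.5 else 7
    return np.ones(width) / width               # adaptive box filter

rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0, 6, 50))
noisy = clean + rng.standard_normal(50)
smoothed = dynamic_filter_1d(noisy, filter_net)
print(len(filter_net(clean)), len(filter_net(noisy)))  # 3 vs 7: the filter adapts
```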

Structures17- Adaptive Deep Networks II

Tags: 2Architecture, 8Scalable, low-rank, binary, dynamic, learn2learn, optimization

| Presenter | Papers | Paper URL | Our Slides |
|-----------|--------|-----------|------------|
| Arshdeep | Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction [1] | PDF | PDF |
| Arshdeep | Decoupled Neural Interfaces Using Synthetic Gradients [2] | PDF | PDF |
| Arshdeep | Diet Networks: Thin Parameters for Fat Genomics [3] | PDF | PDF |
| Arshdeep | Metric Learning with Adaptive Density Discrimination [4] | PDF | PDF |
  1. Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction / Hyeonwoo Noh, Paul Hongsuck Seo, Bohyung Han (Submitted on 18 Nov 2015)/ We tackle image question answering (ImageQA) problem by learning a convolutional neural network (CNN) with a dynamic parameter layer whose weights are determined adaptively based on questions. For the adaptive parameter prediction, we employ a separate parameter prediction network, which consists of gated recurrent unit (GRU) taking a question as its input and a fully-connected layer generating a set of candidate weights as its output. However, it is challenging to construct a parameter prediction network for a large number of parameters in the fully-connected dynamic parameter layer of the CNN. We reduce the complexity of this problem by incorporating a hashing technique, where the candidate weights given by the parameter prediction network are selected using a predefined hash function to determine individual weights in the dynamic parameter layer. The proposed network—joint network with the CNN for ImageQA and the parameter prediction network—is trained end-to-end through back-propagation, where its weights are initialized using a pre-trained CNN and GRU. The proposed algorithm illustrates the state-of-the-art performance on all available public ImageQA benchmarks.  

  2. Decoupled Neural Interfaces Using Synthetic Gradients / Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, David Silver, Koray Kavukcuoglu (Submitted on 18 Aug 2016 (v1), last revised 3 Jul 2017 (this version, v2))/ Training directed neural networks typically requires forward-propagating data through a computation graph, followed by backpropagating error signal, to produce weight updates. All layers, or more generally, modules, of the network are therefore locked, in the sense that they must wait for the remainder of the network to execute forwards and propagate error backwards before they can be updated. In this work we break this constraint by decoupling modules by introducing a model of the future computation of the network graph. These models predict what the result of the modelled subgraph will produce using only local information. In particular we focus on modelling error gradients: by using the modelled synthetic gradient in place of true backpropagated error gradients we decouple subgraphs, and can update them independently and asynchronously i.e. we realise decoupled neural interfaces. We show results for feed-forward models, where every layer is trained asynchronously, recurrent neural networks (RNNs) where predicting one’s future gradient extends the time over which the RNN can effectively model, and also a hierarchical RNN system with ticking at different timescales. Finally, we demonstrate that in addition to predicting gradients, the same framework can be used to predict inputs, resulting in models which are decoupled in both the forward and backwards pass – amounting to independent networks which co-learn such that they can be composed into a single functioning corporation.  
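A toy NumPy sketch of the decoupling (ours; the "true gradient" is a stand-in for the signal that would eventually arrive from the rest of the computation graph): a linear model M predicts dL/dh from the local activation h, the layer updates immediately from that prediction instead of waiting for backprop, and M itself is regressed toward the true gradient whenever it becomes available.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((d, d)) * 0.1   # the decoupled layer's weights
M = np.zeros((d, d))                    # synthetic-gradient model: dL/dh ~= h @ M

for step in range(200):
    x = rng.standard_normal(d)
    h = np.tanh(x @ W)                  # forward through the module
    syn_grad = h @ M                    # predicted dL/dh from local info only
    # update W immediately using the synthetic gradient (no backward locking)
    W -= 0.01 * np.outer(x, syn_grad * (1.0 - h ** 2))
    # later, when the true gradient arrives, regress M toward it
    true_grad = 2.0 * h                 # stand-in for the backpropagated dL/dh
    M -= 0.01 * np.outer(h, syn_grad - true_grad)

print(np.linalg.norm(h @ M - 2.0 * h))  # the predictor tracks its target
```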

  3. Diet Networks: Thin Parameters for Fat Genomics / Adriana Romero, Pierre Luc Carrier, Akram Erraqabi, Tristan Sylvain, Alex Auvolat, Etienne Dejoie, Marc-André Legault, Marie-Pierre Dubé, Julie G. Hussin, Yoshua Bengio / ICLR17/ Learning tasks such as those involving genomic data often poses a serious challenge: the number of input features can be orders of magnitude larger than the number of training examples, making it difficult to avoid overfitting, even when using the known regularization techniques. We focus here on tasks in which the input is a description of the genetic variation specific to a patient, the single nucleotide polymorphisms (SNPs), yielding millions of ternary inputs. Improving the ability of deep learning to handle such datasets could have an important impact in precision medicine, where high-dimensional data regarding a particular patient is used to make predictions of interest. Even though the amount of data for such tasks is increasing, this mismatch between the number of examples and the number of inputs remains a concern. Naive implementations of classifier neural networks involve a huge number of free parameters in their first layer: each input feature is associated with as many parameters as there are hidden units. We propose a novel neural network parametrization which considerably reduces the number of free parameters. It is based on the idea that we can first learn or provide a distributed representation for each input feature (e.g. for each position in the genome where variations are observed), and then learn (with another neural network called the parameter prediction network) how to map a feature’s distributed representation to the vector of parameters specific to that feature in the classifier neural network (the weights which link the value of the feature to each of the hidden units). We show experimentally on a population stratification task of interest to medical studies that the proposed approach can significantly reduce both the number of parameters and the error rate of the classifier.  

  4. Metric Learning with Adaptive Density Discrimination / ICLR 2016 / Oren Rippel, Manohar Paluri, Piotr Dollar, Lubomir Bourdev/ Distance metric learning (DML) approaches learn a transformation to a representation space where distance is in correspondence with a predefined notion of similarity. While such models offer a number of compelling benefits, it has been difficult for these to compete with modern classification algorithms in performance and even in feature extraction. In this work, we propose a novel approach explicitly designed to address a number of subtle yet important issues which have stymied earlier DML algorithms. It maintains an explicit model of the distributions of the different classes in representation space. It then employs this knowledge to adaptively assess similarity, and achieve local discrimination by penalizing class distribution overlap. We demonstrate the effectiveness of this idea on several tasks. Our approach achieves state-of-the-art classification results on a number of fine-grained visual recognition datasets, surpassing the standard softmax classifier and outperforming triplet loss by a relative margin of 30-40%. In terms of computational performance, it alleviates training inefficiencies in the traditional triplet loss, reaching the same error in 5-30 times fewer iterations. Beyond classification, we further validate the saliency of the learnt representations via their attribute concentration and hierarchy recovery properties, achieving 10-25% relative gains on the softmax classifier and 25-50% on triplet loss in these tasks.  

Reliable17- Testing and Machine Learning Basics

Tags: 3Reliable, software-testing, white-box, black-box, robustness, Metamorphic, Influence Functions

| Presenter | Papers | Paper URL | Our Slides |
|-----------|--------|-----------|------------|
| GaoJi | A Few Useful Things to Know about Machine Learning | PDF | PDF |
| GaoJi | A few papers related to testing learning, e.g., Understanding Black-box Predictions via Influence Functions | PDF | PDF |
| GaoJi | DeepXplore: Automated Whitebox Testing of Deep Learning Systems [1] | PDF | PDF |
| GaoJi | Testing and Validating Machine Learning Classifiers by Metamorphic Testing [2] | PDF | PDF |
| GaoJi | Software Testing: A Research Travelogue (2000-2014) | PDF | PDF |
  1. DeepXplore: Automated Whitebox Testing of Deep Learning Systems / Kexin Pei, Yinzhi Cao, Junfeng Yang, Suman Jana / published in SOSP’17/ Deep learning (DL) systems are increasingly deployed in safety- and security-critical domains including self-driving cars and malware detection, where the correctness and predictability of a system’s behavior for corner case inputs are of great importance. Existing DL testing depends heavily on manually labeled data and therefore often fails to expose erroneous behaviors for rare inputs. We design, implement, and evaluate DeepXplore, the first whitebox framework for systematically testing real-world DL systems. First, we introduce neuron coverage for systematically measuring the parts of a DL system exercised by test inputs. Next, we leverage multiple DL systems with similar functionality as cross-referencing oracles to avoid manual checking. Finally, we demonstrate how finding inputs for DL systems that both trigger many differential behaviors and achieve high neuron coverage can be represented as a joint optimization problem and solved efficiently using gradient-based search techniques. DeepXplore efficiently finds thousands of incorrect corner case behaviors (e.g., self-driving cars crashing into guard rails and malware masquerading as benign software) in state-of-the-art DL models with thousands of neurons trained on five popular datasets including ImageNet and Udacity self-driving challenge data. For all tested DL models, on average, DeepXplore generated one test input demonstrating incorrect behavior within one second while running only on a commodity laptop. We further show that the test inputs generated by DeepXplore can also be used to retrain the corresponding DL model to improve the model’s accuracy by up to 3%.  
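The neuron-coverage metric itself is easy to state; a NumPy sketch (ours, with an arbitrary firing threshold): the fraction of neurons activated above threshold by at least one test input.

```python
import numpy as np

def neuron_coverage(activations, threshold=0.0):
    """DeepXplore-style neuron coverage (sketch): fraction of neurons whose
    activation exceeds the threshold on at least one test input.
    activations: list of (n_inputs, n_neurons) arrays, one per layer."""
    fired = total = 0
    for layer_act in activations:
        fired += np.any(layer_act > threshold, axis=0).sum()
        total += layer_act.shape[1]
    return fired / total

rng = np.random.default_rng(0)
acts = [np.maximum(rng.standard_normal((10, 100)), 0) for _ in range(3)]  # ReLU-like
print(f"coverage: {neuron_coverage(acts, threshold=0.5):.2%}")
```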

  2. Testing and Validating Machine Learning Classifiers by Metamorphic Testing / 2011/ Abstract: Machine learning algorithms have provided core functionality to many application domains - such as bioinformatics, computational linguistics, etc. However, it is difficult to detect faults in such applications because often there is no "test oracle" to verify the correctness of the computed outputs. To help address the software quality, in this paper we present a technique for testing the implementations of machine learning classification algorithms which support such applications. Our approach is based on the technique "metamorphic testing", which has been shown to be effective to alleviate the oracle problem. Also presented are a case study on a real-world machine learning application framework and a discussion of how programmers implementing machine learning algorithms can avoid the common pitfalls discovered in our study. We also conduct mutation analysis and cross-validation, which reveal that our method has high effectiveness in killing mutants, and that observing expected cross-validation result alone is not sufficiently effective to detect faults in a supervised classification program. The effectiveness of metamorphic testing is further confirmed by the detection of real faults in a popular open-source classification program.
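A self-contained example of the approach (ours, not from the paper): permuting the training set is a metamorphic relation for a deterministic k-NN classifier, so predictions before and after permutation must agree, and we can check that without any ground-truth oracle.

```python
import numpy as np

def knn_predict(train_x, train_y, x, k=3):
    """Deterministic k-NN: majority label among the k nearest neighbors."""
    idx = np.argsort(np.linalg.norm(train_x - x, axis=1))[:k]
    return np.bincount(train_y[idx]).argmax()

# Metamorphic relation: reordering the training data must not change predictions.
rng = np.random.default_rng(0)
train_x = rng.standard_normal((100, 4))
train_y = (train_x[:, 0] > 0).astype(int)
perm = rng.permutation(100)
for _ in range(20):
    x = rng.standard_normal(4)
    assert knn_predict(train_x, train_y, x) == \
           knn_predict(train_x[perm], train_y[perm], x)
print("metamorphic relation held on 20 follow-up inputs")
```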
