Structure II: DNN with Varying Structures
21 Sep 2017 | Tags: Architecture, Scalable, sparsity, blocking, nonparametric, structured QA, Interpretable

| Presenter | Papers | Paper URL | Our Slides |
| --- | --- | --- | --- |
| Shijia | Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (Dean), ICLR17 [^1] | | |
| Ceyer | Sequence Modeling via Segmentations, ICML17 [^2] | | |
| Arshdeep | Input Switched Affine Networks: An RNN Architecture Designed for Interpretability, ICML17 [^3] | | |

[^1]: Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (Dean), ICLR17. "The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost."
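
The mechanism in this abstract is compact enough to sketch: a trainable gate scores every expert for each example, only the top-k scores are kept and renormalized with a softmax, and the layer output is the gate-weighted sum of those k experts' outputs. The PyTorch sketch below is a minimal reading of that idea, not the paper's implementation; names like `SparseMoE`, `d_model`, and `n_experts` are illustrative, and the paper's noisy top-k gating and load-balancing losses are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Sketch of a sparsely-gated MoE layer: route each example to its
    top-k experts and combine their outputs with the gate weights."""

    def __init__(self, d_model: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))
        self.gate = nn.Linear(d_model, n_experts)    # trainable gating network
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, d_model)
        top_val, top_idx = self.gate(x).topk(self.k, dim=-1)
        weights = F.softmax(top_val, dim=-1)         # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                   # loop kept simple for clarity
            idx, w = top_idx[:, slot], weights[:, slot:slot + 1]
            for e in idx.unique().tolist():          # run each chosen expert once
                mask = idx == e
                out[mask] += w[mask] * self.experts[e](x[mask])
        return out
```

Because only `k` experts run per example, parameter count grows with `n_experts` while per-example compute grows only with `k`, which is the capacity-versus-computation trade the abstract describes.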

[^2]: Sequence Modeling via Segmentations, ICML17. "Segmental structure is a common pattern in many types of sequences such as phrases in human languages. In this paper, we present a probabilistic model for sequences via their segmentations. The probability of a segmented sequence is calculated as the product of the probabilities of all its segments, where each segment is modeled using existing tools such as recurrent neural networks. Since the segmentation of a sequence is usually unknown in advance, we sum over all valid segmentations to obtain the final probability for the sequence. An efficient dynamic programming algorithm is developed for forward and backward computations without resorting to any approximation. We demonstrate our approach on text segmentation and speech recognition tasks. In addition to quantitative results, we also show that our approach can discover meaningful segments in their respective application contexts."
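
The "sum over all valid segmentations" admits a standard forward dynamic program: define alpha[t] as the log-probability of the length-t prefix, marginalized over all of its segmentations, and build it from shorter prefixes. A plain-Python sketch under that reading, where `segment_log_prob` is a stand-in for the paper's RNN segment model and `max_seg_len` is an illustrative cap on segment length (both names are assumptions, not from the paper's code):

```python
import math

def sequence_log_prob(seq, segment_log_prob, max_seg_len=None):
    """Sum over all segmentations of `seq` by dynamic programming:
    alpha[t] = logsumexp over j < t of alpha[j] + segment_log_prob(seq[j:t])."""
    n = len(seq)
    alpha = [-math.inf] * (n + 1)
    alpha[0] = 0.0                                   # empty prefix has probability 1
    for t in range(1, n + 1):
        lo = 0 if max_seg_len is None else max(0, t - max_seg_len)
        terms = [alpha[j] + segment_log_prob(seq[j:t]) for j in range(lo, t)]
        m = max(terms)
        alpha[t] = m + math.log(sum(math.exp(v - m) for v in terms))  # stable logsumexp
    return alpha[n]

# Toy usage: a segment model that simply penalizes segment length.
print(sequence_log_prob("abcab", lambda seg: -1.5 * len(seg), max_seg_len=3))
```

Each prefix value is reused by every longer prefix, so the sum over exponentially many segmentations costs only O(n * max_seg_len) segment evaluations.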

[^3]: Input Switched Affine Networks: An RNN Architecture Designed for Interpretability, ICML17. "There exist many problem domains where the interpretability of neural network models is essential for deployment. Here we introduce a recurrent architecture composed of input-switched affine transformations, in other words an RNN without any explicit nonlinearities, but with input-dependent recurrent weights. This simple form allows the RNN to be analyzed via straightforward linear methods: we can exactly characterize the linear contribution of each input to the model predictions; we can use a change-of-basis to disentangle input, output, and computational hidden unit subspaces; we can fully reverse-engineer the architecture's solution to a simple task. Despite this ease of interpretation, the input-switched affine network achieves reasonable performance on a text modeling task, and allows greater computational efficiency than networks with standard nonlinearities."
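
The interpretability claim rests on the recurrence being affine: with h_t = W[x_t] h_{t-1} + b[x_t], the final state unrolls into an exact sum of one term per input symbol plus one for the initial state, which is the "linear contribution of each input" the abstract mentions. A small NumPy sketch of that decomposition (all names and sizes here are illustrative, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden = 5, 8

# One affine map (W[s], b[s]) per input symbol; no nonlinearity anywhere.
W = [rng.normal(scale=0.3, size=(hidden, hidden)) for _ in range(vocab)]
b = [rng.normal(scale=0.1, size=hidden) for _ in range(vocab)]

def isan_final_state(inputs, h0):
    """Run the switched-affine recurrence h_t = W[x_t] @ h_{t-1} + b[x_t]."""
    h = h0
    for x in inputs:
        h = W[x] @ h + b[x]
    return h

def contributions(inputs, h0):
    """Exact linear decomposition of the final state: the term for step s is
    b[x_s] pushed through the product of all later transition matrices."""
    terms, acc = [], np.eye(hidden)
    for x in reversed(inputs):
        terms.append(acc @ b[x])
        acc = acc @ W[x]
    terms.append(acc @ h0)           # contribution of the initial state
    return terms

xs, h0 = [2, 0, 4, 1], np.zeros(hidden)
assert np.allclose(sum(contributions(xs, h0)), isan_final_state(xs, h0))
```

The assertion checks the decomposition exactly (up to floating-point error): no nonlinearity ever mixes the terms, so each input's contribution to the final state can be read off individually.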