Generative II - Deep Generative Models

Tags: generative, attention, composition, graphical-model, autoregressive, structured
| Presenter | Papers | Paper URL | Our Slides |
|-----------|--------|-----------|------------|
| ChaoJiang | Courville - Generative Models II, DLSS17 | Slide + video | PDF |
| GaoJi | Attend, Infer, Repeat: Fast Scene Understanding with Generative Models, NIPS16 [1] | PDF + talk | PDF |
| Arshdeep | Composing graphical models with neural networks for structured representations and fast inference, NIPS16 [2] | PDF | PDF |
|  | Johnson - Graphical Models and Deep Learning, DLSS | Slide + video |  |
|  | Parallel Multiscale Autoregressive Density Estimation, ICML17 [3] | PDF |  |
| Beilun | Conditional Image Generation with Pixel CNN Decoders, NIPS16 [4] | PDF | PDF |
| Shijia | Marrying Graphical Models & Deep Learning, DLSS17 | Video | PDF |
  1. Attend, Infer, Repeat: Fast Scene Understanding with Generative Models, NIPS16 / Google DeepMind / We present a framework for efficient inference in structured image models that explicitly reason about objects. We achieve this by performing probabilistic inference using a recurrent neural network that attends to scene elements and processes them one at a time. Crucially, the model itself learns to choose the appropriate number of inference steps. We use this scheme to learn to perform inference in partially specified 2D models (variable-sized variational auto-encoders) and fully specified 3D models (probabilistic renderers). We show that such models learn to identify multiple objects (counting, locating, and classifying the elements of a scene) without any supervision, e.g., decomposing 3D images with various numbers of objects in a single forward pass of a neural network at unprecedented speed. We further show that the networks produce accurate inferences when compared to supervised counterparts, and that their structure leads to improved generalization. (See sketch 1 below.)

  2. Composing graphical models with neural networks for structured representations and fast inference, NIPS16 / We propose a general modeling and inference framework that combines the complementary strengths of probabilistic graphical models and deep learning methods. Our model family composes latent graphical models with neural network observation likelihoods. For inference, we use recognition networks to produce local evidence potentials, then combine them with the model distribution using efficient message-passing algorithms. All components are trained simultaneously with a single stochastic variational inference objective. We illustrate this framework by automatically segmenting and categorizing mouse behavior from raw depth video, and demonstrate several other example models. (See sketch 2 below.)

  3. Parallel Multiscale Autoregressive Density Estimation, ICML17 / Nando de Freitas / PixelCNN achieves state-of-the-art results in density estimation for natural images. Although training is fast, inference is costly, requiring one network evaluation per pixel: O(N) for N pixels. This can be sped up by caching activations, but still involves generating each pixel sequentially. In this work, we propose a parallelized PixelCNN that allows more efficient inference by modeling certain pixel groups as conditionally independent. Our new PixelCNN model achieves competitive density estimation and an orders-of-magnitude speedup (O(log N) sampling instead of O(N)), enabling the practical generation of 512x512 images. We evaluate the model on class-conditional image generation, text-to-image synthesis, and action-conditional video generation, showing that our model achieves the best results among non-pixel-autoregressive density models that allow efficient sampling. (See sketch 3 below.)

  4. Conditional Image Generation with Pixel CNN Decoders, NIPS16 / Google DeepMind / This work explores conditional image generation with a new image density model based on the PixelCNN architecture. The model can be conditioned on any vector, including descriptive labels or tags, or latent embeddings created by other networks. When conditioned on class labels from the ImageNet database, the model is able to generate diverse, realistic scenes representing distinct animals, objects, landscapes and structures. When conditioned on an embedding produced by a convolutional network given a single image of an unseen face, it generates a variety of new portraits of the same person with different facial expressions, poses and lighting conditions. We also show that conditional PixelCNN can serve as a powerful decoder in an image autoencoder. Additionally, the gated convolutional layers in the proposed model improve the log-likelihood of PixelCNN to match the state-of-the-art performance of PixelRNN on ImageNet, with greatly reduced computational cost. (See sketch 4 below.)
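
Sketch 1, for Attend, Infer, Repeat: a minimal sketch of the attend-infer-repeat inference loop, assuming a GRU recurrence, a fixed maximum number of steps, and hypothetical layer sizes; `AIRSketch`, `z_head`, and `pres_head` are illustrative names, not the authors' code. Each step emits a latent code and a presence probability, and sampling presence = 0 ends object discovery, which is how the model chooses its own number of inference steps.

```python
# Toy attend-infer-repeat loop (illustrative only; sizes and names are assumptions).
import torch
import torch.nn as nn

class AIRSketch(nn.Module):
    def __init__(self, img_dim=784, hid_dim=256, z_dim=20, max_steps=3):
        super().__init__()
        self.max_steps, self.hid_dim, self.z_dim = max_steps, hid_dim, z_dim
        self.rnn = nn.GRUCell(img_dim + z_dim, hid_dim)
        self.z_head = nn.Linear(hid_dim, 2 * z_dim)    # mean and log-variance of the step's latent
        self.pres_head = nn.Linear(hid_dim, 1)         # probability that another object is present

    def forward(self, x):
        B = x.size(0)
        h = x.new_zeros(B, self.hid_dim)
        z_prev = x.new_zeros(B, self.z_dim)
        alive = x.new_ones(B)                          # stays 1 while objects keep being found
        zs, count = [], x.new_zeros(B)
        for _ in range(self.max_steps):
            h = self.rnn(torch.cat([x, z_prev], dim=-1), h)
            mu, logvar = self.z_head(h).chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterised latent
            p_pres = torch.sigmoid(self.pres_head(h)).squeeze(-1)
            pres = torch.bernoulli(p_pres) * alive     # once a step says "no object", stop adding
            zs.append(z * pres.unsqueeze(-1))
            count = count + pres
            alive = alive * pres
            z_prev = z
        return torch.stack(zs, dim=1), count           # per-step latents and inferred object count

z, n_objects = AIRSketch()(torch.rand(8, 784))         # e.g. flattened 28x28 images
```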
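
Sketch 2, for the graphical-model composition paper: a toy version of the "recognition network emits local evidence potentials, conjugate message passing combines them with the latent model" recipe, reduced to a single Gaussian latent per datapoint so the message-passing step is just a precision-weighted update. `RecognitionNet` and `posterior_from_potentials` are illustrative assumptions, not the paper's implementation.

```python
# Toy structured-VAE-style update: neural evidence potential + conjugate Gaussian prior.
import torch
import torch.nn as nn

class RecognitionNet(nn.Module):
    """Maps an observation to natural parameters of a diagonal-Gaussian evidence potential."""
    def __init__(self, x_dim=10, z_dim=2, hid=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim, hid), nn.Tanh(), nn.Linear(hid, 2 * z_dim))

    def forward(self, x):
        mu, log_prec = self.net(x).chunk(2, dim=-1)
        prec = log_prec.exp()                      # node-wise diagonal precision
        return prec * mu, prec                     # natural parameters of the local potential

def posterior_from_potentials(prior_mu, prior_prec, evid_eta, evid_prec):
    # Conjugate Gaussian update: add the natural parameters of prior and evidence.
    # For richer latent graphs this step becomes full message passing (e.g. Kalman smoothing).
    post_prec = prior_prec + evid_prec
    post_mu = (prior_prec * prior_mu + evid_eta) / post_prec
    return post_mu, post_prec

x = torch.randn(5, 10)
eta, prec = RecognitionNet()(x)
mu_post, prec_post = posterior_from_potentials(torch.zeros(2), torch.ones(2), eta, prec)
```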
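
Sketch 3, for Parallel Multiscale Autoregressive Density Estimation: a toy version of group-wise sampling, where pixels in the same group are modeled as conditionally independent given previously generated groups, so sampling costs one network evaluation per group rather than one per pixel. The four-way interleaving and `GroupNet` are simplifications of the paper's multiscale hierarchy, chosen only to keep the example short.

```python
# Toy group-wise parallel sampling: one network call per pixel group, not per pixel.
import torch
import torch.nn as nn

class GroupNet(nn.Module):
    """Predicts per-pixel intensity logits for the next group, given the partial image."""
    def __init__(self, levels=4):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, levels, 3, padding=1))

    def forward(self, partial_img):
        return self.net(partial_img)            # (B, levels, H, W) logits

def sample(size=8, levels=4):
    net = GroupNet(levels)
    img = torch.zeros(1, 1, size, size)
    # Four interleaved groups (even/odd rows x even/odd columns); each group is
    # filled in a single parallel step, conditioned on all previously filled groups.
    for dr, dc in [(0, 0), (1, 1), (0, 1), (1, 0)]:
        logits = net(img)                       # one network evaluation per group
        probs = logits.softmax(dim=1).permute(0, 2, 3, 1)
        samples = torch.distributions.Categorical(probs).sample().float() / (levels - 1)
        mask = torch.zeros(size, size)
        mask[dr::2, dc::2] = 1.0                # pixels that belong to the current group
        img = img * (1 - mask) + samples * mask
    return img

image = sample()                                # untrained, but shows the per-group sampling loop
```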
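
Sketch 4, for Conditional Image Generation with PixelCNN Decoders: a minimal sketch of the gated, conditioned activation y = tanh(W_f * x + V_f h) * sigmoid(W_g * x + V_g h) described in the abstract; the masked-convolution machinery of the actual PixelCNN is omitted, and the layer sizes and the name `GatedConditionalConv` are assumptions for illustration.

```python
# Toy gated convolutional layer conditioned on a vector h (e.g. a class embedding).
import torch
import torch.nn as nn

class GatedConditionalConv(nn.Module):
    def __init__(self, channels=32, cond_dim=10, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(channels, 2 * channels, kernel_size, padding=kernel_size // 2)
        self.cond = nn.Linear(cond_dim, 2 * channels)    # projects the conditioning vector h

    def forward(self, x, h):
        # x: (B, C, H, W) feature map, h: (B, cond_dim) conditioning vector
        out = self.conv(x) + self.cond(h)[:, :, None, None]
        f, g = out.chunk(2, dim=1)
        return torch.tanh(f) * torch.sigmoid(g)          # gated activation unit

x = torch.randn(4, 32, 16, 16)
h = torch.randn(4, 10)
y = GatedConditionalConv()(x, h)                          # (4, 32, 16, 16)
```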