RL IV - RL with varying structures

4 minute read

Presenter	Papers	Paper URL	Our Slides
Ceyer	Reinforcement Learning with Unsupervised Auxiliary Tasks, ICLR17 ¹	PDF	PDF
Beilun	Why is Posterior Sampling Better than Optimism for Reinforcement Learning? Ian Osband, Benjamin Van Roy ²	PDF	PDF
Ji	Deeply AggreVaTeD: Differentiable Imitation Learning for Sequential Prediction, ICML17 ³	PDF	PDF
Xueying	End-to-End Differentiable Adversarial Imitation Learning, ICML17 ⁴	PDF	PDF
	Deep Value Networks Learn to Evaluate and Iteratively Refine Structured Outputs, ICML17	PDF
	FeUdal Networks for Hierarchical Reinforcement Learning, ICML17 ⁵	PDF

_{^{Reinforcement Learning with Unsupervised Auxiliary Tasks, ICLR17 / Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, Koray Kavukcuoglu/ Deep reinforcement learning agents have achieved state-of-the-art results by directly maximising cumulative reward. However, environments contain a much wider variety of possible training signals. In this paper, we introduce an agent that also maximises many other pseudo-reward functions simultaneously by reinforcement learning. All of these tasks share a common representation that, like unsupervised learning, continues to develop in the absence of extrinsic rewards. We also introduce a novel mechanism for focusing this representation upon extrinsic rewards, so that learning can rapidly adapt to the most relevant aspects of the actual task. Our agent significantly outperforms the previous state-of-the-art on Atari, averaging 880\% expert human performance, and a challenging suite of first-person, three-dimensional \emph{Labyrinth} tasks leading to a mean speedup in learning of 10× and averaging 87\% expert human performance on Labyrinth.}} ↩
_{^{Deeply AggreVaTeD: Differentiable Imitation Learning for Sequential Prediction, ICML17/ Researchers have demonstrated state-of-the-art performance in sequential decision making problems (e.g., robotics control, sequential prediction) with deep neural network models. One often has access to near-optimal oracles that achieve good performance on the task during training. We demonstrate that AggreVaTeD — a policy gradient extension of the Imitation Learning (IL) approach of (Ross & Bagnell, 2014) — can leverage such an oracle to achieve faster and better solutions with less training data than a less-informed Reinforcement Learning (RL) technique. Using both feedforward and recurrent neural network predictors, we present stochastic gradient procedures on a sequential prediction task, dependency-parsing from raw image data, as well as on various high dimensional robotics control problems. We also provide a comprehensive theoretical study of IL that demonstrates we can expect up to exponentially lower sample complexity for learning with AggreVaTeD than with RL algorithms, which backs our empirical findings. Our results and theory indicate that the proposed approach can achieve superior performance with respect to the oracle when the demonstrator is sub-optimal.}} ↩
_{^{End-to-End Differentiable Adversarial Imitation Learning, ICML17/ Generative Adversarial Networks (GANs) have been successfully applied to the problem of policy imitation in a model-free setup. However, the computation graph of GANs, that include a stochastic policy as the generative model, is no longer differentiable end-to-end, which requires the use of high-variance gradient estimation. In this paper, we introduce the Model-based Generative Adversarial Imitation Learning (MGAIL) algorithm. We show how to use a forward model to make the computation fully differentiable, which enables training policies using the exact gradient of the discriminator. The resulting algorithm trains competent policies using relatively fewer expert samples and interactions with the environment. We test it on both discrete and continuous action domains and report results that surpass the state-of-the-art.}} ↩
_{^{Deep Value Networks Learn to Evaluate and Iteratively Refine Structured Outputs, ICML17/ We approach structured output prediction by optimizing a deep value network (DVN) to precisely estimate the task loss on different output configurations for a given input. Once the model is trained, we perform inference by gradient descent on the continuous relaxations of the output variables to find outputs with promising scores from the value network. When applied to image segmentation, the value network takes an image and a segmentation mask as inputs and predicts a scalar estimating the intersection over union between the input and ground truth masks. For multi-label classification, the DVN’s objective is to correctly predict the F1 score for any potential label configuration. The DVN framework achieves the state-of-the-art results on multi-label prediction and image segmentation benchmarks.}} ↩
_{^{FeUdal Networks for Hierarchical Reinforcement Learning, ICML17 / Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, Koray Kavukcuoglu/ We introduce FeUdal Networks (FuNs): a novel architecture for hierarchical reinforcement learning. Our approach is inspired by the feudal reinforcement learning proposal of Dayan and Hinton, and gains power and efficacy by decoupling end-to-end learning across multiple levels – allowing it to utilise different resolutions of time. Our framework employs a Manager module and a Worker module. The Manager operates at a lower temporal resolution and sets abstract goals which are conveyed to and enacted by the Worker. The Worker generates primitive actions at every tick of the environment. The decoupled structure of FuN conveys several benefits – in addition to facilitating very long timescale credit assignment it also encourages the emergence of sub-policies associated with different goals set by the Manager. These properties allow FuN to dramatically outperform a strong baseline agent on tasks that involve long-term credit assignment or memorisation. We demonstrate the performance of our proposed system on a range of tasks from the ATARI suite and also from a 3D DeepMind Lab environment.}} ↩

Twitter Facebook LinkedIn

Dr. Yanjun Qi

RL IV - RL with varying structures

You May Also Enjoy

Safety Benchmark WMDP

KV Cache and Tooling

Advanced Transformer Architectures

LLM fine tuning