6Reinforcement


Recent Readings for RL and Deep RL (since 2017) (Index of Posts):

| No. | Read Date | Title and Information | We Read @ |
| --- | --- | --- | --- |
| 1 | 2022, Dec, 3 | RLHF + InstructGPT | 2022-W6 |
| 2 | 2022, Jun, 3 | Decision Transformers | 2022-W3 |
| 3 | 2022, May, 3 | A Generalist Agent + offline RL + UniMask | 2022-W1 |
| 4 | 2020, Mar, 5 | Deep Reinforcement Learning | 2020-W3 |
| 5 | 2019, Dec, 8 | deep2reproduce 2019 Fall - 6Reinforcement papers | 2019-fall Students deep2reproduce |
| 6 | 2018, Aug, 13 | Application18- DNNs in a Few BioMedical Tasks | 2018-team |
| 7 | 2018, Aug, 3 | Reliable18- Testing and Verifying DNNs | 2018-team |
| 8 | 2017, Nov, 30 | RL IV - RL with varying structures | 2017-W15 |
| 9 | 2017, Nov, 28 | RL III - Basic tutorial RLSS17 (2) | 2017-W14 |
| 10 | 2017, Nov, 21 | RL II - Basic tutorial RLSS17 | 2017-W14 |
| 11 | 2017, Aug, 29 | Reinforcement I - Pineau - RL Basic Concepts | 2017-W2 |


Here is a detailed list of posts!



[1]: RLHF + InstructGPT


RL AGI language model Human Alignment
| Papers | Paper URL | Abstract |
| --- | --- | --- |
| Training language models to follow instructions with human feedback | URL | “further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT.” |
| Deep reinforcement learning from human preferences | URL | “explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function” |
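
To make the preference-based setup concrete, below is a minimal PyTorch sketch of the reward-learning step shared by both papers: a reward model is fit to pairwise human comparisons of trajectory segments (a Bradley-Terry-style objective), and the learned reward is then optimized with an RL algorithm such as PPO. All module names, shapes, and hyperparameters here are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of learning a reward model from pairwise preferences over
# trajectory segments (Bradley-Terry style). Names and shapes are illustrative.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Scores a single (state, action) pair with a scalar reward estimate."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def preference_loss(model, seg_a, seg_b, pref_b):
    """seg_a / seg_b: (obs, act) tensors of shape (batch, T, obs_dim) and
    (batch, T, act_dim); pref_b: 1.0 where the human preferred segment B."""
    ret_a = model(*seg_a).sum(dim=-1)        # predicted return of segment A
    ret_b = model(*seg_b).sum(dim=-1)        # predicted return of segment B
    # P(B preferred over A) = sigmoid(ret_b - ret_a); fit with cross-entropy.
    return nn.functional.binary_cross_entropy_with_logits(ret_b - ret_a, pref_b)

# Toy usage with random data, just to show the shapes involved.
obs_dim, act_dim, T, batch = 8, 2, 25, 16
model = RewardModel(obs_dim, act_dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
seg_a = (torch.randn(batch, T, obs_dim), torch.randn(batch, T, act_dim))
seg_b = (torch.randn(batch, T, obs_dim), torch.randn(batch, T, act_dim))
pref_b = torch.randint(0, 2, (batch,)).float()
loss = preference_loss(model, seg_a, seg_b, pref_b)
opt.zero_grad(); loss.backward(); opt.step()
```

In the InstructGPT pipeline the analogous reward model scores whole model responses ranked by labelers, and the language model is then fine-tuned against that reward with PPO.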

[2]: Decision Transformers


RL AGI

Decision Transformer: Reinforcement Learning via Sequence Modeling

  • Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch
  • https://arxiv.org/abs/2106.01345
  • We introduce a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem. This allows us to draw upon the simplicity and scalability of the Transformer architecture, and associated advances in language modeling such as GPT-x and BERT. In particular, we present Decision Transformer, an architecture that casts the problem of RL as conditional sequence modeling. Unlike prior approaches to RL that fit value functions or compute policy gradients, Decision Transformer simply outputs the optimal actions by leveraging a causally masked Transformer. By conditioning an autoregressive model on the desired return (reward), past states, and actions, our Decision Transformer model can generate future actions that achieve the desired return. Despite its simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.
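
As a rough illustration of this return-conditioned sequence-modeling view, here is a minimal PyTorch sketch that interleaves return-to-go, state, and action tokens and predicts the next action with a causally masked Transformer. Layer sizes, dimensions, and names are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of the Decision Transformer idea: interleave (return-to-go,
# state, action) tokens and predict actions with a causally masked Transformer.
# Layer sizes, names, and shapes are illustrative, not the paper's code.
import torch
import torch.nn as nn

class TinyDecisionTransformer(nn.Module):
    def __init__(self, state_dim, act_dim, embed_dim=64, n_layers=2, n_heads=4, max_len=60):
        super().__init__()
        self.embed_rtg = nn.Linear(1, embed_dim)          # return-to-go token
        self.embed_state = nn.Linear(state_dim, embed_dim)
        self.embed_action = nn.Linear(act_dim, embed_dim)
        self.pos = nn.Embedding(3 * max_len, embed_dim)   # one slot per interleaved token
        layer = nn.TransformerEncoderLayer(embed_dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.predict_action = nn.Linear(embed_dim, act_dim)

    def forward(self, rtg, states, actions):
        # rtg: (B, T, 1), states: (B, T, state_dim), actions: (B, T, act_dim)
        B, T, _ = states.shape
        tokens = torch.stack(
            [self.embed_rtg(rtg), self.embed_state(states), self.embed_action(actions)], dim=2
        ).reshape(B, 3 * T, -1)                           # (R_1, s_1, a_1, R_2, s_2, a_2, ...)
        tokens = tokens + self.pos(torch.arange(3 * T, device=tokens.device))
        causal = torch.triu(torch.full((3 * T, 3 * T), float("-inf")), diagonal=1)
        h = self.encoder(tokens, mask=causal)
        return self.predict_action(h[:, 1::3, :])         # predict a_t from the s_t position

# Toy usage: condition on a desired return and recent context to get actions.
model = TinyDecisionTransformer(state_dim=4, act_dim=2)
rtg = torch.full((1, 5, 1), 90.0)                         # desired remaining return
states, actions = torch.randn(1, 5, 4), torch.randn(1, 5, 2)
pred_actions = model(rtg, states, actions)                # shape (1, 5, 2)
```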

Prompting Decision Transformer for Few-Shot Policy Generalization

  • Mengdi Xu, Yikang Shen, Shun Zhang, Yuchen Lu, Ding Zhao, Joshua B. Tenenbaum, Chuang Gan
  • https://arxiv.org/abs/2206.13499
  • Humans can leverage prior experience and learn novel tasks from a handful of demonstrations. In contrast to offline meta-reinforcement learning, which aims to achieve quick adaptation through better algorithm design, we investigate the effect of architecture inductive bias on the few-shot learning capability. We propose a Prompt-based Decision Transformer (Prompt-DT), which leverages the sequential modeling ability of the Transformer architecture and the prompt framework to achieve few-shot adaptation in offline RL. We design the trajectory prompt, which contains segments of the few-shot demonstrations, and encodes task-specific information to guide policy generation. Our experiments in five MuJoCo control benchmarks show that Prompt-DT is a strong few-shot learner without any extra finetuning on unseen target tasks. Prompt-DT outperforms its variants and strong meta offline RL baselines by a large margin with a trajectory prompt containing only a few timesteps. Prompt-DT is also robust to prompt length changes and can generalize to out-of-distribution (OOD) environments.
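
A small, hedged sketch of the trajectory-prompt idea: a few demonstration timesteps from the target task are prepended to the agent's recent context, and the combined (return-to-go, state, action) sequence is fed to a Decision-Transformer-style model such as the one sketched above. Shapes and names below are assumptions, not Prompt-DT's code.

```python
# Illustrative sketch of a "trajectory prompt": prepend a few demonstration
# timesteps from the target task to the agent's recent context, then feed the
# concatenated (return-to-go, state, action) sequence to a Decision-Transformer-
# style model. Names and shapes here are assumptions.
import torch

def build_prompted_context(prompt, context):
    """Each dict holds 'rtg' (B, T, 1), 'states' (B, T, S), 'actions' (B, T, A)."""
    return {k: torch.cat([prompt[k], context[k]], dim=1) for k in ("rtg", "states", "actions")}

prompt = {"rtg": torch.randn(1, 3, 1),      # 3-step demo segment from the target task
          "states": torch.randn(1, 3, 4),
          "actions": torch.randn(1, 3, 2)}
context = {"rtg": torch.randn(1, 5, 1),     # agent's recent 5-step history
           "states": torch.randn(1, 5, 4),
           "actions": torch.randn(1, 5, 2)}
full = build_prompted_context(prompt, context)
print({k: tuple(v.shape) for k, v in full.items()})   # each sequence now has 3 + 5 = 8 steps
```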

[3]: A Generalist Agent + offline RL + UniMask


RL AGI
| Papers | Paper URL | Abstract |
| --- | --- | --- |
| A Generalist Agent | URL | Gato works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens. |
| Why should we prefer offline reinforcement learning over behavioral cloning? ICLR 2022 | URL | It is natural to ask: when can an offline RL method outperform BC with an equal amount of expert data, even when BC is a natural choice? |
| Uni[MASK]: Unified Inference in Sequential Decision Problems | URL | Shows how sequential decision-making tasks can be thought of in terms of corresponding input maskings, enabling the training of a single model to perform all tasks at once. This applies naturally to sequential decision making, where many well-studied tasks like behavior cloning, offline RL, inverse dynamics, and waypoint conditioning correspond to different sequence maskings over a sequence of states, actions, and returns. |
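
To illustrate the Uni[MASK] framing, the toy sketch below writes a few of the tasks named in the abstract as visibility masks over a trajectory's state and action tokens (1 = given as input, 0 = to be inferred). The exact mask layouts are a simplification for illustration, not the paper's implementation.

```python
# Toy sketch of the Uni[MASK] framing: each task is a visibility mask over the
# trajectory's state and action tokens (1 = given as input, 0 = to be inferred).
# Mask layouts below are an illustrative simplification, not the paper's code.
import numpy as np

def example_masks(T=4):
    s_all = np.ones(T, dtype=int)
    a_none = np.zeros(T, dtype=int)
    s_past = np.array([1] * (T - 1) + [0])            # all but the final state observed
    s_endpoints = np.zeros(T, dtype=int); s_endpoints[[0, -1]] = 1
    return {
        # Behavior cloning: states observed, actions inferred.
        "behavior_cloning": {"states": s_all,       "actions": a_none},
        # Forward dynamics: past states and the actions observed, next state inferred.
        "forward_dynamics": {"states": s_past,      "actions": np.ones(T, dtype=int)},
        # Waypoint conditioning: start and goal states observed, actions inferred.
        "waypointing":      {"states": s_endpoints, "actions": a_none},
    }

for task, m in example_masks().items():
    print(f"{task:17s} states={m['states']} actions={m['actions']}")
```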

[4]: Deep Reinforcement Learning


RL Generalization
| Index | Papers | Our Slides |
| --- | --- | --- |
| 1 | Actor-Critic Methods for Control | Jake Survey |
| 2 | Generalization in Deep Reinforcement Learning | Jake Survey |
| 3 | Sample Efficient RL (Part 1) | Jake Survey |
| 4 | Sample Efficient RL (Part 2) | Jake Survey |
| 5 | Model-Free Value Methods in Deep RL | Jake Survey |
| 6 | Investigating Human Priors for Playing Video Games | Arsh Survey |

[5]: deep2reproduce 2019 Fall - 6Reinforcement papers


verification RL
| Team INDEX | Title & Link | Tags | Our Slide |
| --- | --- | --- | --- |
| T1 | Safe Reinforcement Learning via Shielding | RL, safety, verification | OurSlide |

[6]: Application18- DNNs in a Few BioMedical Tasks


brain RNA DNA Genomics generative
| Presenter | Papers | Paper URL | Our Slides |
| --- | --- | --- | --- |
| Arshdeep | DeepLesion: automated mining of large-scale lesion annotations and universal lesion detection with deep learning. | PDF | PDF |
| Arshdeep | Solving the RNA design problem with reinforcement learning, PLOSCB 1 | PDF | PDF |
| Arshdeep | Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk 2 | PDF | PDF |
| Arshdeep | Towards Gene Expression Convolutions using Gene Interaction Graphs, Francis Dutil, Joseph Paul Cohen, Martin Weiss, Georgy Derevyanko, Yoshua Bengio 3 | PDF | PDF |
| Brandon | Kipoi: Accelerating the Community Exchange and Reuse of Predictive Models for Genomics | PDF | PDF |
| Arshdeep | Feedback GAN (FBGAN) for DNA: a Novel Feedback-Loop Architecture for Optimizing Protein Functions 2 | PDF | PDF |

[7]: Reliable18- Testing and Verifying DNNs


RL Fuzzing Adversarial-Examples verification software-testing black-box white-box
| Presenter | Papers | Paper URL | Our Slides |
| --- | --- | --- | --- |
| GaoJi | Deep Reinforcement Fuzzing, Konstantin Böttinger, Patrice Godefroid, Rishabh Singh | PDF | PDF |
| GaoJi | Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks, Guy Katz, Clark Barrett, David Dill, Kyle Julian, Mykel Kochenderfer | PDF | PDF |
| GaoJi | DeepTest: Automated Testing of Deep-Neural-Network-driven Autonomous Cars, Yuchi Tian, Kexin Pei, Suman Jana, Baishakhi Ray | PDF | PDF |
| GaoJi | A few recent (2018) papers on black-box adversarial attacks, like Prior Convictions: Black-Box Adversarial Attacks with Bandits and Priors 1 | PDF | PDF |
| GaoJi | A few recent papers on adversarial attacks against reinforcement learning, like Adversarial Attacks on Neural Network Policies (Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, Pieter Abbeel) | PDF | PDF |
| Testing | DeepXplore: Automated Whitebox Testing of Deep Learning Systems | PDF |  |

[8]: RL IV - RL with varying structures


Auxiliary Sampling Value-Networks structured Imitation-Learning Hierarchical
| Presenter | Papers | Paper URL | Our Slides |
| --- | --- | --- | --- |
| Ceyer | Reinforcement Learning with Unsupervised Auxiliary Tasks, ICLR17 1 | PDF | PDF |
| Beilun | Why is Posterior Sampling Better than Optimism for Reinforcement Learning? Ian Osband, Benjamin Van Roy 2 | PDF | PDF |
| Ji | Deeply AggreVaTeD: Differentiable Imitation Learning for Sequential Prediction, ICML17 3 | PDF | PDF |
| Xueying | End-to-End Differentiable Adversarial Imitation Learning, ICML17 4 | PDF | PDF |
|  | Deep Value Networks Learn to Evaluate and Iteratively Refine Structured Outputs, ICML17 | PDF |  |
|  | FeUdal Networks for Hierarchical Reinforcement Learning, ICML17 5 | PDF |  |

[9]: RL III - Basic tutorial RLSS17 (2)


alphaGO Planning Temporal-Difference
| Presenter | Papers | Paper URL | Our Slides |
| --- | --- | --- | --- |
| Anant | The Predictron: End-to-End Learning and Planning, ICLR17 1 | PDF | PDF |
| ChaoJiang | Szepesvari - Theory of RL 2 | RLSS.pdf + Video | PDF |
| GaoJi | Mastering the game of Go without human knowledge / Nature 2017 3 | PDF | PDF |
|  | Thomas - Safe Reinforcement Learning | RLSS17.pdf + video |  |
|  | Sutton - Temporal-Difference Learning | RLSS17.pdf + Video |  |

[10]: RL II - Basic tutorial RLSS17


RL Multi-Task
| Presenter | Papers | Paper URL | Our Slides |
| --- | --- | --- | --- |
| Jack | Hasselt - Deep Reinforcement Learning | RLSS17.pdf + video | PDF |
| Tianlu | Roux - RL in the Industry | RLSS17.pdf + video | PDF / PDF-Bandit |
| Xueying | Singh - Steps Towards Continual Learning | pdf + video | PDF |
| GaoJi | Distral: Robust Multitask Reinforcement Learning 1 | PDF | PDF |

[11]: Reinforcement I - Pineau - RL Basic Concepts


RL

Pineau - RL Basic Concepts

| Presenter | Papers | Paper URL | Our Slides |
| --- | --- | --- | --- |
|  | DLSS16 | video |  |
|  | RLSS17 | slideRaw + video + slide |  |


