RLHF + InstructGPT
| Paper | Paper URL | Abstract |
|---|---|---|
| Training language models to follow instructions with human feedback | URL | “further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT.” |
| Deep reinforcement learning from human preferences | URL | “explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function” |
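
Both papers rest on the same core step: fit a reward model to pairwise human preferences, then optimize a policy against it with RL. Below is a minimal sketch of that reward-modeling step, assuming PyTorch and a Bradley–Terry-style logistic loss over preference pairs; the class and function names (`RewardModel`, `preference_loss`) and the fixed-size embeddings are illustrative, not from either paper.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy reward model: maps a fixed-size segment/response embedding to a scalar score."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)  # (batch,) scalar rewards

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry / logistic loss: -log sigmoid(r_chosen - r_rejected).
    # Minimizing it pushes the preferred segment's reward above the rejected one's.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# One training step on random stand-in embeddings for the two segments in each pair.
model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
chosen = torch.randn(32, 128)    # embeddings of human-preferred segments (hypothetical)
rejected = torch.randn(32, 128)  # embeddings of dispreferred segments (hypothetical)
loss = preference_loss(model(chosen), model(rejected))
opt.zero_grad()
loss.backward()
opt.step()
```

The learned scalar reward then stands in for the true reward function when fine-tuning the policy with RL (PPO in InstructGPT), which is how these methods work “without access to the reward function”.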