1.
2.
Self-rewarding large models: Meta lets Llama 2 fine-tune itself. The authors fine-tuned Llama 2 70B for three iterations, and the resulting model outperforms a number of major existing models on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4.
We show that during Iterative DPO training, not only does instruction following ability improve, but also the ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While only a preliminary study, this work opens the door to the possibility of models that can continually improve in both axes.
Figure 1: Self-Rewarding Language Models. Our self-alignment method consists of two steps:
(i) Self-Instruction creation: newly created prompts are used to generate candidate responses from model M_t, which also predicts its own rewards via LLM-as-a-Judge prompting.
(ii) Instruction following training: preference pairs are selected from the generated data and used for training via DPO, resulting in model M_{t+1}.
This whole procedure can then be iterated, resulting in both improved instruction following and reward modeling ability.
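A minimal Python sketch of one such iteration, assuming hypothetical generate, judge_score, and dpo_update helpers (placeholder names, not the paper's released code):

```python
# Sketch of one Self-Rewarding iteration. `generate`, `judge_score`, and
# `dpo_update` are hypothetical helpers standing in for real implementations.
def self_rewarding_iteration(model, new_prompts, n_candidates=4):
    preference_pairs = []
    for prompt in new_prompts:
        # (i) Self-Instruction creation: sample several candidate responses
        # from the current model M_t ...
        candidates = [model.generate(prompt) for _ in range(n_candidates)]
        # ... and let the same model score them via LLM-as-a-Judge prompting
        # (the paper scores each response out of 5 points).
        scores = [model.judge_score(prompt, resp) for resp in candidates]

        # (ii) Instruction following training data: the best- and worst-scored
        # responses form a (chosen, rejected) preference pair.
        chosen = candidates[scores.index(max(scores))]
        rejected = candidates[scores.index(min(scores))]
        if chosen != rejected:
            preference_pairs.append((prompt, chosen, rejected))

    # Train M_{t+1} from M_t on the self-generated preference pairs via DPO.
    return model.dpo_update(preference_pairs)
```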
3.
4.
PPO
bair.berkeley
Figure 1: The diagram illustrates the difference between reinforcement learning from absolute feedback and relative feedback. By incorporating a new component, the pairwise policy gradient, we can unify the reward modeling stage and the RL stage, enabling direct updates based on pairwise responses.
Figure 2: A description of the three stages of RLHF from an OpenAI blog post. Note that the third stage falls under Reinforcement Learning with Absolute Feedback, as shown on the left side of Figure 1.
… who provide guidance to labelers through written instructions, direct feedback on specific examples, and informal conversations.
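A minimal sketch of the basic pairwise policy-gradient idea behind this: two responses to the same prompt are compared, and only their relative reward matters. The blog's actual P3O objective adds importance ratios and clipping on top of this, so the code below is an illustration, not the exact method.

```python
def pairwise_pg_loss(logp_a, logp_b, reward_a, reward_b):
    """Vanilla pairwise policy-gradient loss for two responses (a, b) to the
    same prompt.

    logp_a, logp_b: summed log-probabilities of each response under the
        current policy (torch tensors carrying gradients).
    reward_a, reward_b: scalar rewards / judge scores for the two responses.

    Only the reward difference matters, so any prompt-level baseline cancels;
    the update raises the log-probability of the better response and lowers
    that of the worse one.
    """
    reward_gap = reward_a - reward_b
    return -(reward_gap * (logp_a - logp_b)).mean()
```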
5.
DPO (Direct Preference Optimization)
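For reference, a minimal PyTorch sketch of the standard DPO loss; the argument names are placeholders for the summed per-response log-probabilities under the policy and the frozen reference model.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO objective: -log sigmoid(beta * (policy margin - reference margin)),
    where each margin is the log-prob gap of the chosen over the rejected response."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```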
6.
Pairwise Policy Gradient
Renmin University of China (人大)
7.
8.
Environment model (environment dynamics)
An environment model can generally be abstracted mathematically as a state transition function and a reward function.
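A small illustration of how those two functions are used in model-based RL: once both are learned, the agent can roll out imagined trajectories inside the model instead of the real environment (all names below are illustrative).

```python
from typing import Callable, TypeVar

State = TypeVar("State")
Action = TypeVar("Action")

# The environment model as two learned functions:
#   transition_fn: (s, a) -> s'   (state transition function)
#   reward_fn:     (s, a) -> r    (reward function)
TransitionFn = Callable[[State, Action], State]
RewardFn = Callable[[State, Action], float]

def imagined_rollout(transition_fn: TransitionFn, reward_fn: RewardFn,
                     policy: Callable[[State], Action], s0: State,
                     horizon: int) -> float:
    """Roll out a trajectory inside the learned model and return its return."""
    state, total_reward = s0, 0.0
    for _ in range(horizon):
        action = policy(state)
        total_reward += reward_fn(state, action)
        state = transition_fn(state, action)
    return total_reward
```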
9.
Real Toxicity Prompts
real-toxicity-prompts
allenai • Updated Jan 23, 2024
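A minimal example of loading the dataset from the Hugging Face Hub with the datasets library (assumes the library is installed; check the dataset card for the exact record schema):

```python
from datasets import load_dataset

# RealToxicityPrompts ships as a single "train" split; each record pairs a
# prompt (with Perspective API toxicity scores) with a continuation.
ds = load_dataset("allenai/real-toxicity-prompts", split="train")
print(ds[0])
```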
10.
11.
InstructGPT
InstructGPT: Training Language Models to Follow Instructions with Human Feedback
12.
13.
https://scale.com (again)
14.
https://opendilab.github.io/DI-engine/02_algo/model_based_rl_zh.html#:~:text=基于模型的强化学习(Model-Based%20Reinforcement%20Learning%2C,或者利用模型进行规划。
15.
3.2.3 Latent Dynamics Model
The drawback of image-based dynamics models is that they need to reconstruct images: even after the model is trained, images still have to be reconstructed at test time, which is computationally expensive. Researchers have therefore proposed combining representation learning with a latent dynamics model to alleviate this problem.
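A minimal PyTorch sketch of the idea (an illustrative architecture, not the specific model described in the DI-engine docs): the encoder maps an image observation to a latent state, and both the next latent state and the reward are predicted entirely in latent space, so no image reconstruction is needed at test time.

```python
import torch
import torch.nn as nn

class LatentDynamicsModel(nn.Module):
    """Encode image observations into a latent state, then predict the next
    latent state and reward entirely in latent space (no decoder at test time)."""

    def __init__(self, latent_dim=32, action_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(              # image observation -> latent state
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(latent_dim),
        )
        self.dynamics = nn.Sequential(             # (latent, action) -> next latent
            nn.Linear(latent_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.reward_head = nn.Linear(latent_dim, 1)  # next latent -> predicted reward

    def forward(self, obs, action):
        z = self.encoder(obs)                                # (B, latent_dim)
        z_next = self.dynamics(torch.cat([z, action], -1))   # (B, latent_dim)
        reward = self.reward_head(z_next)                     # (B, 1)
        return z_next, reward
```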
16.
Contrastive Learning (对比学习)
Contrastive loss
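A minimal sketch of an InfoNCE-style contrastive loss, one common form of contrastive loss used for representation learning (illustrative, not tied to a specific paper):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE-style contrastive loss: row i of z1 and row i of z2 are a
    positive pair; every other row in the batch serves as a negative."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / temperature                      # (batch, batch) similarities
    labels = torch.arange(z1.size(0), device=z1.device)   # positives on the diagonal
    return F.cross_entropy(logits, labels)
```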
17.
mbrl-lib
facebookresearch • Updated Dec 22, 2024
18.