1.
2.
Self-rewarding large models: Meta lets Llama 2 fine-tune itself. The authors fine-tuned Llama 2 70B for three iterations, and the resulting model outperforms a number of major existing models on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4.
We show that during Iterative DPO training, not only does instruction following ability improve, but also the ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While only a preliminary study, this work opens the door to the possibility of models that can continually improve in both axes.
Figure 1: Self-Rewarding Language Models. Our self-alignment method consists of two steps:
(i) Self-Instruction creation: newly created prompts are used to generate candidate responses from model M_t, which also predicts its own rewards via LLM-as-a-Judge prompting.
(ii) Instruction following training: preference pairs are selected from the generated data and used for training via DPO, resulting in model M_{t+1}.
This whole procedure can then be iterated, resulting in both improved instruction following and reward modeling ability.
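A minimal Python sketch of one such iteration, assuming hypothetical generate, judge_score, and dpo_update helpers (placeholder names, not the paper's released code):

```python
# Sketch of one Self-Rewarding iteration. `generate`, `judge_score`, and
# `dpo_update` are hypothetical helpers standing in for real implementations.
def self_rewarding_iteration(model, new_prompts, n_candidates=4):
    preference_pairs = []
    for prompt in new_prompts:
        # (i) Self-Instruction creation: sample several candidate responses
        # from the current model M_t ...
        candidates = [model.generate(prompt) for _ in range(n_candidates)]
        # ... and let the same model score them via LLM-as-a-Judge prompting
        # (the paper scores each response out of 5 points).
        scores = [model.judge_score(prompt, resp) for resp in candidates]

        # (ii) Instruction following training data: the best- and worst-scored
        # responses form a (chosen, rejected) preference pair.
        chosen = candidates[scores.index(max(scores))]
        rejected = candidates[scores.index(min(scores))]
        if chosen != rejected:
            preference_pairs.append((prompt, chosen, rejected))

    # Train M_{t+1} from M_t on the self-generated preference pairs via DPO.
    return model.dpo_update(preference_pairs)
```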
3.
4.
PPO
bair.berkeley
Figure 1: The diagram illustrates the difference between reinforcement learning from absolute feedback and relative feedback. By incorporating a new component, the pairwise policy gradient, we can unify the reward modeling stage and the RL stage, enabling direct updates based on pairwise responses.
Figure 2: A description of the three stages of RLHF from an OpenAI blog post. Note that the third stage falls under Reinforcement Learning with Absolute Feedback, as shown on the left side of Figure 1.
… who provide guidance to labelers through written instructions, direct feedback on specific examples, and informal conversations.
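A minimal sketch of the basic pairwise policy-gradient idea behind this: two responses to the same prompt are compared, and only their relative reward matters. The blog's actual P3O objective adds importance ratios and clipping on top of this, so the code below is an illustration, not the exact method.

```python
def pairwise_pg_loss(logp_a, logp_b, reward_a, reward_b):
    """Vanilla pairwise policy-gradient loss for two responses (a, b) to the
    same prompt.

    logp_a, logp_b: summed log-probabilities of each response under the
        current policy (torch tensors carrying gradients).
    reward_a, reward_b: scalar rewards / judge scores for the two responses.

    Only the reward difference matters, so any prompt-level baseline cancels;
    the update raises the log-probability of the better response and lowers
    that of the worse one.
    """
    reward_gap = reward_a - reward_b
    return -(reward_gap * (logp_a - logp_b)).mean()
```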
5.
DPO (Direct Preference Optimization)
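For reference, a minimal PyTorch sketch of the standard DPO loss; the argument names are placeholders for the summed per-response log-probabilities under the policy and the frozen reference model.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO objective: -log sigmoid(beta * (policy margin - reference margin)),
    where each margin is the log-prob gap of the chosen over the rejected response."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```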
6.
Pairwise Policy Gradient
Renmin University of China (人大)
7.
8.
Environment model (environment dynamics)
An environment model can generally be abstracted mathematically as a state transition function and a reward function.
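A small illustration of how those two functions are used in model-based RL: once both are learned, the agent can roll out imagined trajectories inside the model instead of the real environment (all names below are illustrative).

```python
from typing import Callable, TypeVar

State = TypeVar("State")
Action = TypeVar("Action")

# The environment model as two learned functions:
#   transition_fn: (s, a) -> s'   (state transition function)
#   reward_fn:     (s, a) -> r    (reward function)
TransitionFn = Callable[[State, Action], State]
RewardFn = Callable[[State, Action], float]

def imagined_rollout(transition_fn: TransitionFn, reward_fn: RewardFn,
                     policy: Callable[[State], Action], s0: State,
                     horizon: int) -> float:
    """Roll out a trajectory inside the learned model and return its return."""
    state, total_reward = s0, 0.0
    for _ in range(horizon):
        action = policy(state)
        total_reward += reward_fn(state, action)
        state = transition_fn(state, action)
    return total_reward
```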
9.
Real Toxicity Prompts
real-toxicity-prompts
allenai • Updated Jan 23, 2024
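A minimal example of loading the dataset from the Hugging Face Hub with the datasets library (assumes the library is installed; check the dataset card for the exact record schema):

```python
from datasets import load_dataset

# RealToxicityPrompts ships as a single "train" split; each record pairs a
# prompt (with Perspective API toxicity scores) with a continuation.
ds = load_dataset("allenai/real-toxicity-prompts", split="train")
print(ds[0])
```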
10.
11.
InstructGPT
InstructGPT: Training Language Models to Follow Instructions with Human Feedback
12.
13.
https://scale.com (again)
14.
https://opendilab.github.io/DI-engine/02_algo/model_based_rl_zh.html#:~:text=基于模型的强化学习(Model-Based%20Reinforcement%20Learning%2C,或者利用模型进行规划。
15.
3.2.3 Latent Dynamics Model
The drawback of image-based dynamics models is that they need to reconstruct images: even after the model is trained, images still have to be reconstructed at test time, which is computationally expensive. Researchers have therefore proposed combining representation learning with a latent dynamics model to alleviate this problem.
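A minimal PyTorch sketch of the idea (an illustrative architecture, not the specific model described in the DI-engine docs): the encoder maps an image observation to a latent state, and both the next latent state and the reward are predicted entirely in latent space, so no image reconstruction is needed at test time.

```python
import torch
import torch.nn as nn

class LatentDynamicsModel(nn.Module):
    """Encode image observations into a latent state, then predict the next
    latent state and reward entirely in latent space (no decoder at test time)."""

    def __init__(self, latent_dim=32, action_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(              # image observation -> latent state
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(latent_dim),
        )
        self.dynamics = nn.Sequential(             # (latent, action) -> next latent
            nn.Linear(latent_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.reward_head = nn.Linear(latent_dim, 1)  # next latent -> predicted reward

    def forward(self, obs, action):
        z = self.encoder(obs)                                # (B, latent_dim)
        z_next = self.dynamics(torch.cat([z, action], -1))   # (B, latent_dim)
        reward = self.reward_head(z_next)                     # (B, 1)
        return z_next, reward
```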
16.
Contrastive Learning (对比学习)
Contrastive loss
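A minimal sketch of an InfoNCE-style contrastive loss, one common form of contrastive loss used for representation learning (illustrative, not tied to a specific paper):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE-style contrastive loss: row i of z1 and row i of z2 are a
    positive pair; every other row in the batch serves as a negative."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / temperature                      # (batch, batch) similarities
    labels = torch.arange(z1.size(0), device=z1.device)   # positives on the diagonal
    return F.cross_entropy(logits, labels)
```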
17.
mbrl-lib
facebookresearch • Updated Dec 22, 2024
18.