23.12.21
Date: Dec 21, 2023
1. Researchers uncover on/off switch for breast cancer metastasis

New research from Stanford and the Arc Institute could lead to a new and more effective immunotherapy and help clinicians better predict patient response to existing medicines.
2.
3.
4.
5. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Notes (Comment: NeurIPS 2023 Datasets and Benchmarks Track):
“We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them.” (Zheng et al., 2023, p. 1)
“Additionally, we show our benchmark and traditional benchmarks complement each other by evaluating several variants of LLaMA and Vicuna.” (Zheng et al., 2023, p. 1)
“We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Chatbot Arena, a crowdsourced battle platform.” (Zheng et al., 2023, p. 1)
“traditional LLM benchmarks – benchmarks like MMLU [19] and HELM [24] cannot effectively tell the difference between these aligned models and the base models. This phenomenon suggests that there is a fundamental discrepancy between user perceptions of the usefulness of chatbots and the criteria adopted by conventional benchmarks. We argue that this discrepancy primarily arises due to existing evaluation that o” (Zheng et al., 2023, p. 1)
“Multi-turn dialogues between a user and two AI assistants—LLaMA-13B (Assistant A) and Vicuna-13B (Assistant B)—initiated by a question from the MMLU benchmark and a follow-up instruction. GPT-4 is then presented with the context to determine which assistant answers better.” (Zheng et al., 2023, p. 2)
“GPT-4 Judgment: Assistant A provided an incorrect response to the user's question about how the Federal Reserve buying bonds in the secondary market affects daily life. The answer given is repetitive and lacks clear examples of how the action impacts daily life. On the other hand, Assistant B provided a relevant and accurate response to the user's question about the Federal Reserve buying bonds. The answer includes three clear examples of how the action impacts daily life, such as interest rates, inflation, and employment. Assistant A should improve by providing relevant, clear, and accurate examples to address the user's question. Assistant B's response is more helpful, relevant, and detailed.” (Zheng et al., 2023, p. 2)
“MT-bench and Chatbot Arena” (Zheng et al., 2023, p. 3)
“However, evaluating their broad capabilities also becomes more challenging. Despite the availability of numerous benchmarks for language models, they primarily focus on evaluating models on closed-ended questions with short responses.” (Zheng et al., 2023, p. 3)
“• Core-knowledge benchmarks, including MMLU [19], HellaSwag [50], ARC [9], WinoGrande [36], HumanEval [6], GSM-8K [10], and AGIEval [51], evaluate the core capabilities of pre-trained LLMs using zero-shot and few-shot benchmark sets. They typically require LLMs to generate a short, specific answer to benchmark questions that can be automatically validated.
• Instruction-following benchmarks, such as Flan [27, 46], Self-instruct [44], NaturalInstructions [28], Super-NaturalInstructions [45], expand to slightly more open-ended questions and more diverse tasks and are used to evaluate LLMs after instruction fine-tuning.
• Conversational benchmarks, like CoQA [35], MMDialog [15] and OpenAssistant [23], are closest to our intended use cases. However, the diversity and complexity of their questions often fall short in challenging the capabilities of the latest chatbots.” (Zheng et al., 2023, p. 3)
“To bridge this gap, we introduce two novel benchmarks expressly tailored to assess human preferences. Simultaneously, these benchmarks are designed to distinguish the core capabilities of state-of-the-art models.” (Zheng et al., 2023, p. 3)
“Chatbot Arena Our second approach is Chatbot Arena, a crowdsourcing benchmark platform featuring anonymous battles. On this platform, users can interact with two anonymous models simultaneously, posing the same question to both. They vote for which model provides the preferred response, with the identities of the models disclosed post-voting. After running Chatbot Arena for one month, we have collected around 30K votes. Since the platform does not use pre-defined questions, it allows gathering a wide range of unrestricted use cases and votes in the wild, based on the diverse interests of users. A screenshot of the platform can be found at Appendix C.2.” (Zheng et al., 2023, p. 4)
“As LLMs continue to improve, they show potential in replacing human annotators in many tasks [17, 20]. Specifically, we are interested in whether LLMs can effectively evaluate the responses of chat assistants and match human preferences. Next, we discuss the use and limitations of LLM-as-a-judge.” (Zheng et al., 2023, p. 4)
“LLM-as-a-judge offers two key benefits: scalability and explainability” (Zheng et al., 2023, p. 4)
“biases and limitations of LLM judges” (Zheng et al., 2023, p. 4)
“Verbosity bias is when an LLM judge favors longer, verbose responses, even if they are not as clear, high-quality, or accurate as shorter alternatives.” (Zheng et al., 2023, p. 5)
“Self-enhancement bias.” (Zheng et al., 2023, p. 5)
“We adopt the term “self-enhancement bias” from social cognition literature [4] to describe the effect that LLM judges may favor the answers generated by themselves.” (Zheng et al., 2023, p. 5)
“Addressing limitations” (Zheng et al., 2023, p. 6)
“The position bias can be addressed by simple solutions. A conservative approach is to call a judge twice by swapping the order of two answers and only declare a win when an answer is preferred in both orders. If the results are inconsistent after swapping, we can call it a tie. Another more aggressive approach is to assign positions randomly, which can be effective at a large scale with the correct expectations. In the following experiments, we use the conservative one” (Zheng et al., 2023, p. 6)
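A minimal sketch of that conservative swap-and-agree rule, assuming a hypothetical judge(question, answer_a, answer_b) callable that returns "A", "B", or "tie" (not the paper's released code):

```python
# Sketch of the conservative position-bias mitigation described in the quote above.
# `judge` is a hypothetical callable (e.g. a wrapper around a GPT-4 prompt) that
# returns "A", "B", or "tie" for the two answers in the order they are shown.

def consistent_verdict(judge, question: str, answer_1: str, answer_2: str) -> str:
    """Call the judge twice with the answer order swapped; only declare a win
    when the same underlying answer is preferred in both orders."""
    first = judge(question, answer_1, answer_2)   # answer_1 is shown as "A"
    second = judge(question, answer_2, answer_1)  # answer_1 is shown as "B"

    if first == "A" and second == "B":
        return "answer_1"
    if first == "B" and second == "A":
        return "answer_2"
    return "tie"  # explicit tie or inconsistent verdicts after swapping
```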
“Metrics. We define the agreement between two types of judges as the probability of randomly selected individuals (but not identical) of each type agreeing on a randomly selected question. See more explanation in Appendix D.3. Average win rate is the average of win rates against all other players. These metrics can be computed with or without including tie votes.” (Zheng et al., 2023, p. 7)
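As a quick reference for those two metrics, here is a rough sketch over hypothetical vote records of the form (judge_id, question_id, verdict); the data layout and the flat averaging over cross-type pairs are my assumptions, not the paper's implementation (see their Appendix D.3):

```python
from itertools import product

def agreement(votes_type1, votes_type2, include_ties=True):
    """Fraction of cross-type judge pairs (non-identical individuals) that agree
    on the questions both types voted on. Votes are (judge_id, question_id, verdict)
    tuples with verdict in {"A", "B", "tie"}."""
    by_q1, by_q2 = {}, {}
    for judge, question, verdict in votes_type1:
        by_q1.setdefault(question, []).append((judge, verdict))
    for judge, question, verdict in votes_type2:
        by_q2.setdefault(question, []).append((judge, verdict))

    agree = total = 0
    for question in by_q1.keys() & by_q2.keys():
        for (j1, v1), (j2, v2) in product(by_q1[question], by_q2[question]):
            if j1 == j2:
                continue  # "randomly selected individuals (but not identical)"
            if not include_ties and "tie" in (v1, v2):
                continue  # optionally exclude tie votes, as in the quote
            total += 1
            agree += v1 == v2
    return agree / total if total else float("nan")

def average_win_rate(battles):
    """Average of each player's win rates against all other players.
    `battles` maps (player, opponent) -> (wins, total_battles)."""
    per_player = {}
    for (player, _opponent), (wins, total) in battles.items():
        if total:
            per_player.setdefault(player, []).append(wins / total)
    return {player: sum(rates) / len(rates) for player, rates in per_player.items()}
```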
“7 Conclusion In this paper, we propose LLM-as-a-judge for chatbot evaluation and systematically examine its efficacy using human preference data from 58 experts on MT-bench, as well as thousands of crowd users on Chatbot Arena. Our results reveal that strong LLMs can achieve an agreement rate of over 80%, on par with the level of agreement among human experts, establishing a foundation for an LLM-based evaluation framework.” (Zheng et al., 2023, p. 10)

6. OpenPipe's Mistral-based model

We started by evaluating existing Mistral variants to see how they’d perform as a base model. After playing around with a number of models we selected six that seemed promising: OpenHermes 2.5, Zephyr, Cybertron, Intel Neural Chat, Hermes Neural, and Metamath Cybertron Starling. We created a fine-tuned version of each of these models on each of the 3 evaluation datasets, using a development build of OpenPipe that supports custom base models. This gave us 18 new models in total.
Beauty in the eye of GPT-4 (Evals)
To test each model’s performance, we used our recently released automated LLM-as-judge evals scored by GPT-4, which allowed us to quickly compare our fine-tunes to each other and gauge their strength.
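For context, a pairwise GPT-4 judgment of this kind can be sketched with a plain chat-completion call; the prompt wording and the pick_winner helper below are illustrative assumptions, not OpenPipe's actual eval harness:

```python
from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

client = OpenAI()

JUDGE_INSTRUCTIONS = (
    "You are comparing two assistant responses to the same prompt. "
    "Decide which response is more helpful, accurate, and relevant. "
    "Reply with exactly 'A', 'B', or 'tie'."
)

def pick_winner(prompt: str, response_a: str, response_b: str) -> str:
    """Illustrative GPT-4 pairwise judgment; returns 'A', 'B', or 'tie'."""
    completion = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_INSTRUCTIONS},
            {
                "role": "user",
                "content": (
                    f"Prompt:\n{prompt}\n\n"
                    f"Response A:\n{response_a}\n\n"
                    f"Response B:\n{response_b}"
                ),
            },
        ],
    )
    verdict = completion.choices[0].message.content.strip()
    return verdict if verdict in {"A", "B", "tie"} else "tie"
```

In practice the A/B order would also be swapped between calls, as in the MT-Bench notes above, to control for position bias.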
7.
8.
9. GO MPG model

10.
11. Beeper mini
12.
13.
14. Autonomous chemical research with large language models
15. Talk to any arXiv paper just by changing the URL

Talk to any arXiv paper using ChatGPT!