TÜLU 3: Pushing Frontiers in Open Language Model Post-Training

news 2026/5/12 17:11:50

Basic information

📝 Paper link: https://arxiv.org/abs/2411.15124
👥 Authors: Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, Hannaneh Hajishirzi
🏷️ Keywords: TÜLU 3
📚 Categories: Machine Learning, Natural Language Processing

Abstract

Original abstract

Language model post-training is applied to refine behaviors and unlock new skills across a wide range of recent language models, but open recipes for applying these techniques lag behind proprietary ones. The underlying training data and recipes for post-training are simultaneously the most important pieces of the puzzle and the portion with the least transparency. To bridge this gap, we introduce TÜLU 3, a family of fully-open state-of-the-art post-trained models, alongside its data, code, and training recipes, serving as a comprehensive guide for modern post-training techniques. TÜLU 3, which builds on Llama 3.1 base models, achieves results surpassing the instruct versions of Llama 3.1, Qwen 2.5, Mistral, and even closed models such as GPT-4o-mini and Claude 3.5-Haiku. The training algorithms for our models include supervised finetuning (SFT), Direct Preference Optimization (DPO), and a novel method we call Reinforcement Learning with Verifiable Rewards (RLVR). With TÜLU 3, we introduce a multi-task evaluation scheme for post-training recipes with development and unseen evaluations, standard benchmark implementations, and substantial decontamination of existing open datasets on said benchmarks. We conclude with analysis and discussion of training methods that did not reliably improve performance. In addition to the TÜLU 3 model weights and demo, we release the complete recipe – including datasets for diverse core skills, a robust toolkit for data curation and evaluation, the training code and infrastructure, and, most importantly, a detailed report for reproducing and further adapting the TÜLU 3 approach to more domains.

Paper analysis

One-sentence summary

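This paper introduces TÜLU 3, a family of fully open, state-of-the-art post-trained language models; by releasing its data and training recipes, it advances open post-training research.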

Question 1: What specific problem does the paper aim to solve?

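• Background: Post-training is applied across a wide range of recent language models, but open post-training recipes lag behind proprietary ones, and the underlying training data and recipes are the least transparent part of the process.
• Shortcomings of existing approaches: Open post-trained models typically rely on simpler pipelines and cheaper data, and have become outdated on many metrics.
• Research goal: Develop TÜLU 3, a fully open, state-of-the-art family of post-trained models released with its data, code, and training recipes, to advance open post-training.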

Question 2: What are the paper's core innovations?

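• Technical innovation: TÜLU 3 builds on Llama 3.1 base models and combines supervised finetuning (SFT) and Direct Preference Optimization (DPO) with a new method, Reinforcement Learning with Verifiable Rewards (RLVR).
• Methodological improvements: TÜLU 3 introduces new datasets, an evaluation framework, and a training pipeline, with tuned data mixtures, methods, and hyperparameters.
• Advantages: TÜLU 3 outperforms comparable open models on multiple benchmarks, including Llama 3.1 Instruct, Qwen 2.5 Instruct, and Mistral-Instruct, and at the 70B scale it is competitive with closed models such as Claude 3.5 Haiku and GPT-4o mini.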
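To make the preference and reinforcement learning stages named above more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: `dpo_loss` is the standard DPO objective for a single preference pair (the exact variant and hyperparameters used for TÜLU 3 are not restated in this summary), and `verifiable_reward` shows the general idea behind a verifiable reward, where a completion earns a fixed reward only if its answer can be checked against a known ground truth. The function names, the `extract_final_number` parser, and the default values of `beta` and `alpha` are all hypothetical.

```python
import math
import re


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs.

    beta=0.1 is an illustrative default, not a value taken from the paper.
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # response over the rejected one, relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)), computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def extract_final_number(completion):
    """Hypothetical parser: return the last number in the text, or None."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return matches[-1] if matches else None


def verifiable_reward(completion, ground_truth, alpha=10.0):
    """Return alpha if the extracted answer matches the ground truth, else 0.

    alpha is an illustrative constant; treat its value as an assumption.
    """
    predicted = extract_final_number(completion)
    if predicted is None:
        return 0.0
    return alpha if float(predicted) == float(ground_truth) else 0.0


if __name__ == "__main__":
    # The loss is lower when the policy prefers the chosen response.
    print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
    print(dpo_loss(-5.0, -1.0, -2.0, -2.0))
    # Reward is granted only for a checkable, correct final answer.
    print(verifiable_reward("So the total is 42.", "42"))
    print(verifiable_reward("I believe it is 41.", "42"))
```

The design point is that the reward in the second function comes from a deterministic check rather than a learned reward model, which is what makes it "verifiable"; a real pipeline would need task-specific checkers (for example for math answers or instruction constraints) in place of this toy number parser.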

Question 3: How do the experimental results validate the approach?

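• Key experiments: TÜLU 3 is evaluated on multiple benchmarks, including MMLU, PopQA, TruthfulQA, BigBenchHard, DROP, MATH, GSM8K, HumanEval, IFEval, AlpacaEval 2, and safety evaluations.
• Performance gains: TÜLU 3 outperforms the baseline models on most benchmarks, with notable gains on some tasks.
• Comparison: At the 70B scale, TÜLU 3 even surpasses closed models such as Claude 3.5 Haiku and GPT-4o mini.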
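Benchmark comparisons like these are only meaningful if the training data does not overlap with the test sets, which is why the abstract highlights decontamination of open datasets against the evaluation benchmarks. The sketch below shows one common way to check for such overlap, using exact n-gram matching. It is an illustrative assumption, not the paper's tooling: the function names, the whitespace tokenization, and the default `n=8` are all hypothetical choices.

```python
def ngrams(text, n):
    """Return the set of lowercase whitespace-token n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_fraction(train_texts, eval_texts, n=8):
    """Fraction of eval examples sharing at least one n-gram with training data.

    n=8 is an illustrative default; treat it as an assumption.
    """
    if not eval_texts:
        return 0.0
    # Collect every n-gram that appears anywhere in the training set.
    train_grams = set()
    for text in train_texts:
        train_grams |= ngrams(text, n)
    # Count eval examples that overlap with the training n-grams.
    hits = sum(1 for text in eval_texts if ngrams(text, n) & train_grams)
    return hits / len(eval_texts)


if __name__ == "__main__":
    train = ["please solve the following problem what is two plus two equals four"]
    evals = ["what is two plus two equals four", "name the capital of france"]
    print(contaminated_fraction(train, evals, n=4))
```

A dataset whose overlap fraction exceeds a chosen threshold could then be filtered or dropped before training; the threshold itself is a policy decision that this sketch leaves to the caller.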

Question 4: What is the practical value of this research?

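• Application scenarios: TÜLU 3 can be applied to a range of natural language processing tasks, such as question answering, text generation, machine translation, and code generation.
• Implementation advice: Because TÜLU 3 is fully open, researchers can readily apply it to their own tasks and further improve and extend it.
• Limitations and outlook: TÜLU 3 currently targets mainly English data and could be extended to multilingual support; long-context and multi-turn dialogue capabilities also merit further study.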