Constitutional AI

news 2026/5/17 17:58:24

List the key points of this lecture transcript as a structured tree:
Although you can use a reward model to eliminate the need for human evaluation during RLHF fine-tuning, the human effort required to produce the trained reward model in the first place is huge. The labeled data set used to train the reward model typically requires large teams of labelers, sometimes many thousands of people, each evaluating many prompts. This work requires a lot of time and other resources, which can be important limiting factors. As the number of models and use cases increases, human effort becomes a limited resource. Methods to scale human feedback are an active area of research.

One idea to overcome these limitations is to scale through model self-supervision. Constitutional AI is one approach to scalable supervision. First proposed in 2022 by researchers at Anthropic, Constitutional AI is a method for training models using a set of rules and principles that govern the model's behavior. Together with a set of sample prompts, these form the constitution. You then train the model to self-critique and revise its responses to comply with those principles.

Constitutional AI is useful not only for scaling feedback; it can also help address some unintended consequences of RLHF. For example, depending on how the prompt is structured, an aligned model may end up revealing harmful information as it tries to provide the most helpful response it can. As an example, imagine you ask the model to give you instructions on how to hack your neighbor's WiFi. Because this model has been aligned to prioritize helpfulness, it actually tells you about an app that lets you do this, even though this activity is illegal. Providing the model with a set of constitutional principles can help the model balance these competing interests and minimize the harm.

Here are some example rules from the research paper that Constitutional AI asks LLMs to follow. For example, you can tell the model to choose the response that is the most helpful, honest, and harmless.
But you can place some bounds on this, asking the model to prioritize harmlessness by assessing whether its response encourages illegal, unethical, or immoral activity. Note that you don't have to use the rules from the paper; you can define your own set of rules that is best suited for your domain and use case.

When implementing the Constitutional AI method, you train your model in two distinct phases. In the first stage, you carry out supervised learning. To start, you prompt the model in ways that try to get it to generate harmful responses; this process is called red teaming. You then ask the model to critique its own harmful responses according to the constitutional principles and revise them to comply with those rules. Once done, you'll fine-tune the model using the pairs of red-team prompts and the revised constitutional responses.

Let's look at an example of how one of these prompt-completion pairs is generated. Let's return to the WiFi hacking problem. As you saw earlier, this model gives you a harmful response as it tries to maximize its helpfulness. To mitigate this, you augment the prompt using the harmful completion and a set of predefined instructions that ask the model to critique its response. Using the rules outlined in the constitution, the model detects the problems in its response. In this case, it correctly acknowledges that hacking into someone's WiFi is illegal. Lastly, you put all the parts together and ask the model to write a new response that removes all of the harmful or illegal content. The model generates a new answer that puts the constitutional principles into practice and does not include the reference to the illegal app. The original red-team prompt and this final constitutional response can then be used as training data. You'll build up a data set of many examples like this to create a fine-tuned LLM that has learned how to generate constitutional responses.

The second part of the process performs reinforcement learning.
This stage is similar to RLHF, except that instead of human feedback, we now use feedback generated by a model. This is sometimes referred to as reinforcement learning from AI feedback, or RLAIF. Here you use the fine-tuned model from the previous step to generate a set of responses to your prompt. You then ask the model which of the responses is preferred according to the constitutional principles. The result is a model-generated preference dataset that you can use to train a reward model. With this reward model, you can now fine-tune your model further using a reinforcement learning algorithm like PPO, as discussed earlier.

Aligning models is a very important topic and an active area of research. The foundations of RLHF that you've explored in this lesson will allow you to follow along as the field evolves. I'm really excited to see what new discoveries researchers make in this area. I encourage you to keep an eye out for any new methods and best practices that emerge in the coming months and years.
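
The critique-and-revision loop from the first (supervised) stage can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation from the research paper: `generate` stands in for whatever function calls your model, and `PRINCIPLE`, the prompt wording, and the function names `critique_and_revise` and `build_sft_dataset` are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical principle text, paraphrasing the example rules in the transcript.
PRINCIPLE = (
    "Choose the response that is the most helpful, honest, and harmless, "
    "and that does not encourage illegal, unethical, or immoral activity."
)


def critique_and_revise(
    generate: Callable[[str], str],
    red_team_prompt: str,
    principle: str = PRINCIPLE,
) -> Dict[str, str]:
    """Run one critique-and-revision pass and return a prompt/completion pair."""
    # Step 1: the red-team prompt elicits an initial, possibly harmful, response.
    initial = generate(red_team_prompt)

    # Step 2: augment the prompt with that response and ask for a critique.
    critique_prompt = (
        f"Prompt: {red_team_prompt}\n"
        f"Response: {initial}\n"
        f"Critique the response according to this principle: {principle}"
    )
    critique = generate(critique_prompt)

    # Step 3: put all the parts together and ask for a revised response.
    revision_prompt = (
        f"{critique_prompt}\n"
        f"Critique: {critique}\n"
        "Rewrite the response to remove all harmful or illegal content."
    )
    revised = generate(revision_prompt)

    # Only the original red-team prompt and the final revision are kept.
    return {"prompt": red_team_prompt, "completion": revised}


def build_sft_dataset(
    generate: Callable[[str], str], red_team_prompts: List[str]
) -> List[Dict[str, str]]:
    """Collect one training pair per red-team prompt."""
    return [critique_and_revise(generate, p) for p in red_team_prompts]
```

Passing the model in as a plain callable keeps the sketch independent of any particular library; the resulting list of pairs is what you would hand to your own supervised fine-tuning code.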

RLHF
- Use of reward model to eliminate need for human evaluation
  - Large human effort required to produce trained reward model
    - Large teams of labelers needed for labeled data set used to train reward model
  - Human effort becomes limited resource as number of models and use cases increases
  - Methods to scale human feedback an active area of research
- Constitutional AI as approach to scale through model self supervision
  - Method for training models using set of rules and principles that govern model's behavior and form constitution
  - Train model to self critique and revise responses to comply with principles
  - Can help address unintended consequences of RLHF, such as revealing harmful information
  - Example constitutional principles/rules:
    - Choose most helpful, honest, and harmless response
    - Prioritize harmlessness by assessing whether response encourages illegal, unethical, or immoral activity
    - Can define own set of rules suited for domain/use case
  - Train model using two distinct phases:
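    - Supervised learning: red-team the model to elicit harmful responses, have it critique and revise them according to the constitutional principles, then fine-tune on the red-team prompts and revised responses
    - Reinforcement learning from AI feedback (RLAIF): use model-generated preferences to train a reward model, then fine-tune further with it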
Fine-tuned LLM
Reinforcement learning algorithms (PPO)
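
The second phase in the outline above, where model feedback replaces human labels, could be sketched as follows. This is a simplified illustration under stated assumptions: `generate` and `judge` are hypothetical callables wrapping the fine-tuned model, the judge is assumed to reply with a bare "A" or "B", and the `prompt`/`chosen`/`rejected` record layout is just one common convention for preference data, not something specified in the lecture.

```python
from itertools import combinations
from typing import Callable, Dict, List


def build_preference_dataset(
    generate: Callable[[str], str],
    judge: Callable[[str], str],
    prompts: List[str],
    principle: str,
    num_samples: int = 2,
) -> List[Dict[str, str]]:
    """Build a model-generated preference dataset for reward model training."""
    records: List[Dict[str, str]] = []
    for prompt in prompts:
        # Sample several candidate responses from the fine-tuned model.
        responses = [generate(prompt) for _ in range(num_samples)]

        # Ask the model to compare every pair against the principle.
        for resp_a, resp_b in combinations(responses, 2):
            question = (
                f"Principle: {principle}\n"
                f"Prompt: {prompt}\n"
                f"Response A: {resp_a}\n"
                f"Response B: {resp_b}\n"
                "Which response better follows the principle? Answer A or B."
            )
            verdict = judge(question).strip().upper()
            if verdict == "A":
                chosen, rejected = resp_a, resp_b
            elif verdict == "B":
                chosen, rejected = resp_b, resp_a
            else:
                # Skip comparisons where the verdict is not a bare A or B.
                continue
            records.append(
                {"prompt": prompt, "chosen": chosen, "rejected": rejected}
            )
    return records
```

The records produced this way play the role that human rankings play in RLHF: they would be used to train the reward model, which in turn drives a reinforcement learning algorithm such as PPO.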

Deep Reinforcement Learning
Reward Model
Human Evaluation
Labeled Dataset (used to train the reward model)
Large Teams of Labelers
Self-Supervision
Constitutional AI
Rules and Principles in the Constitution
Unintended Consequences of RLHF
Example Rules in the Constitution
Supervised Learning
Red Teaming
Training of the Reward Model
Reinforcement Learning
Reinforcement Learning from AI Feedback (RLAIF)

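A reward model removes the need for human evaluation during RLHF fine-tuning
Training the reward model itself requires a large amount of human labor
Model self-supervision is one way to scale human feedback
Constitutional AI is a method for scaling feedback that trains the model's behavior using a set of rules and principles
Constitutional AI can help address some unintended consequences of RLHF
The rules used in Constitutional AI can be defined and adjusted to suit the domain and use case
Training with Constitutional AI has two stages: supervised learning first, then reinforcement learning
In the reinforcement learning stage, model-generated feedback is used to train the reward model; this is known as RLAIF
Keep an eye out for new methods and best practices in the field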
