
Transformer-BERT: assorted notes on MLM and NSP

news 2026/5/11 14:19:15

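These are assorted notes I took while working through BERT after learning the Transformer architecture. I will tidy them up and add to them as time allows, with the aim of eventually covering the whole Transformer series.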

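1. Self-supervised learning
Unlike the original Transformer, BERT is pre-trained on large amounts of unlabeled data and then fine-tuned on a small amount of labeled data for each downstream task. The pre-training stage uses self-supervised learning.
Put plainly, self-supervised learning attaches an auxiliary (pretext) task to the raw data so that the labels can be generated from the data itself.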

Here are a few simple examples of common self-supervised tasks. (Side note: BERT uses MLM, which is explained in the last example.)

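1.1 Images

1.1.1 Inpainting

Cut a patch out of an image and have the model fill it in.
Input: the image with the patch removed
Output: the predicted content of the missing patch
Label: the patch that was cut out of the original image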
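
As a rough illustration of how such a sample is built from the image alone, here is a minimal sketch. The function name make_inpainting_sample, the nested-list "image", and the fill value 0 are all assumptions made for this example, not part of any particular library.

```python
def make_inpainting_sample(image, top, left, size, fill=0):
    """Return (masked_image, target_patch) for a square hole of side `size`."""
    # The label is simply the region we are about to remove.
    target = [row[left:left + size] for row in image[top:top + size]]
    # Copy the image so the original stays untouched, then blank the region.
    masked = [list(row) for row in image]
    for r in range(top, top + size):
        for c in range(left, left + size):
            masked[r][c] = fill
    return masked, target

# Toy 4x4 "image" with pixel values 1..16 (hypothetical data).
image = [[4 * r + c + 1 for c in range(4)] for r in range(4)]
masked, target = make_inpainting_sample(image, top=1, left=1, size=2)
```

The model would receive masked as input and be trained to reproduce target, so no human annotation is needed.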

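1.1.2 Jigsaw (relative position)

Take a patch A from the image together with one of its neighboring patches B as input, and predict where B lies relative to A.
Input: (patch A) + (patch B)
Output: an integer from 1 to 8 giving the position of B relative to A
Label: 5 (the patch numbered 5 in the original image)
This kind of auxiliary task trains the model to recognize how local features are arranged spatially.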
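
The sample construction for this task might look like the sketch below. It assumes a 3x3 grid of patches with A in the center and the eight neighbors numbered 1 to 8 in reading order; that numbering, the helper name make_relative_position_sample, and the toy image are illustrative assumptions.

```python
import random

# (row, col) offsets of the 8 neighbors around the center patch, numbered
# 1..8 in reading order. This numbering is an assumption for illustration.
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                    (0, -1),           (0, 1),
                    (1, -1),  (1, 0),  (1, 1)]

def make_relative_position_sample(image, patch, rng):
    """Split the top-left 3x3 grid of `patch`-sized tiles; return (A, B, label)."""
    def crop(grid_row, grid_col):
        rows = image[grid_row * patch:(grid_row + 1) * patch]
        return [r[grid_col * patch:(grid_col + 1) * patch] for r in rows]

    label = rng.randint(1, 8)               # which neighbor to use as B
    d_row, d_col = NEIGHBOR_OFFSETS[label - 1]
    patch_a = crop(1, 1)                    # center tile
    patch_b = crop(1 + d_row, 1 + d_col)    # the chosen neighbor
    return patch_a, patch_b, label

# Toy 6x6 "image" whose pixel value encodes its (row, col) position.
image = [[10 * r + c for c in range(6)] for r in range(6)]
a, b, label = make_relative_position_sample(image, patch=2, rng=random.Random(0))
```

Because the label is derived from where B was cropped, the image supervises itself.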

1.2 Text

1.2.1 Cloze
Simply put, remove one or more words from the raw text and have the model fill them back in.

Original: All the world's a stage, and all the men and women merely players.
Input: All the world's a stage, and all the __ and women merely players.
Output: the predicted word
Label: men

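1.2.2 Masked Language Model (MLM) (the key part)
MLM randomly selects the tokens to mask, roughly 15% of them. The task mainly teaches the model semantics and syntax.
Note: because the selection is random, implementations usually set a parameter max_pred that caps the number of masked tokens per sentence.

Original: All the world's a stage, and all the men and women merely players.
Input: All the world's a stage, and all the [MASK] and [MASK] merely players.
Output: the predicted words
Label: men, women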
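
The selection step can be sketched roughly as follows. This is a simplified illustration, not BERT's actual data pipeline: it splits on whitespace instead of using WordPiece, and the function name mask_tokens and the default max_pred=5 are assumptions.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, max_pred=5, rng=None):
    """Pick about mask_prob of the positions (capped by max_pred) and mask them."""
    rng = rng or random.Random()
    # About 15% of the tokens, at least 1, and never more than max_pred.
    n_pred = min(max_pred, max(1, round(len(tokens) * mask_prob)))
    positions = sorted(rng.sample(range(len(tokens)), n_pred))
    masked = list(tokens)
    labels = {}                      # position -> original token
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = "[MASK]"
    return masked, labels

# Whitespace tokenization is a simplification for this example.
tokens = "all the world 's a stage , and all the men and women merely players .".split()
masked, labels = mask_tokens(tokens, rng=random.Random(0))
```

With 16 tokens and a 15% rate, two positions are selected; labels keeps the original tokens so the loss can be computed only at those positions.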

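To better match downstream tasks, the BERT authors adjusted the MLM masking rule. A token selected for prediction, here men, is replaced as follows:
men -> [MASK] ---------------------- 80%
men -> apple (a random token) ------ 10%
men -> men (left unchanged) -------- 10%
In all three cases the model still has to predict the original token at the selected position.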
The original paper describes this as follows:

In order to train a deep bidirectional representation, we simply mask some percentage of the input tokens at random, and then predict those masked tokens. We refer to this procedure as a “masked LM” (MLM), although it is often referred to as a Cloze task in the literature (Taylor, 1953). In this case, the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM. In all of our experiments, we mask 15% of all WordPiece tokens in each sequence at random. In contrast to denoising auto-encoders (Vincent et al., 2008), we only predict the masked words rather than reconstructing the entire input.

Although this allows us to obtain a bidirectional pre-trained model, a downside is that we are creating a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning. To mitigate this, we do not always replace “masked” words with the actual [MASK] token. The training data generator chooses 15% of the token positions at random for prediction. If the i-th token is chosen, we replace the i-th token with (1) the [MASK] token 80% of the time (2) a random token 10% of the time (3) the unchanged i-th token 10% of the time. Then, T_i will be used to predict the original token with cross entropy loss. We compare variations of this procedure in Appendix C.2.
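
The 80% / 10% / 10% replacement rule can be sketched as below. The helper name corrupt_token and the tiny vocabulary are hypothetical; a real implementation would draw the random token from the full WordPiece vocabulary.

```python
import random

def corrupt_token(token, vocab, rng):
    """Apply the 80/10/10 rule to one token already selected for prediction.

    Returns (replacement, kind), where kind records which branch was taken.
    """
    p = rng.random()
    if p < 0.8:                       # 80%: replace with [MASK]
        return "[MASK]", "mask"
    if p < 0.9:                       # 10%: replace with a random token
        return rng.choice(vocab), "random"
    return token, "keep"              # 10%: leave the token unchanged

# Hypothetical toy vocabulary for the example.
vocab = ["apple", "stage", "world", "men", "women", "players"]
rng = random.Random(0)
counts = {"mask": 0, "random": 0, "keep": 0}
trials = 10000
for _ in range(trials):
    _, kind = corrupt_token("men", vocab, rng)
    counts[kind] += 1
fractions = {kind: n / trials for kind, n in counts.items()}
```

Whichever branch is taken, the training label for that position stays the original token (men here), so the model cannot rely on seeing [MASK] to know where a prediction is required.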

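2. The NSP task

NSP (next sentence prediction) in BERT is essentially a binary classification task.
The model predicts whether sentence 2 is the actual next sentence after sentence 1. In 50% of the examples sentence 2 really is the next sentence; in the other 50% it is a sentence picked at random from the corpus. The goal is to have the model learn relationships between sentences.
Input: sentence 1 [SEP] sentence 2
Note: [SEP] is a special token in the vocabulary that marks the end of a sentence and is also used to separate the two sentences.
Output: 0 or 1
Label: 0 or 1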
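
A minimal sketch of how NSP training pairs could be assembled is shown below. The function name make_nsp_examples, the toy sentences, and the convention that label 1 means "is the next sentence" are assumptions for this example (some libraries use the opposite convention). The sketch also adds the [CLS] token and the trailing [SEP] that BERT places around the pair.

```python
import random

def make_nsp_examples(document, corpus, rng):
    """Build (tokens, label) pairs; label 1 means B really follows A (assumed convention)."""
    examples = []
    for i in range(len(document) - 1):
        sent_a = document[i]
        if rng.random() < 0.5:
            sent_b, label = document[i + 1], 1     # true next sentence
        else:
            sent_b, label = rng.choice(corpus), 0  # random sentence from elsewhere
        tokens = ["[CLS]"] + sent_a.split() + ["[SEP]"] + sent_b.split() + ["[SEP]"]
        examples.append((tokens, label))
    return examples

# Hypothetical toy data; `corpus` holds sentences from other documents.
document = ["the cat sat on the mat", "then it fell asleep", "later it woke up hungry"]
corpus = ["stock prices rose sharply", "the recipe needs two eggs"]
examples = make_nsp_examples(document, corpus, random.Random(0))
```

Drawing the negative sentence from a separate pool keeps a random pick from accidentally being the true next sentence.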
