
Transformer Study Notes: Self-Attention (2)

news 2025/6/18 0:11:58

The previous post, Transformer Study Notes: Self-Attention (1), covered the theory. This post demonstrates several of its key points in code:

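1. Overall QKV attention computation

Start with the simplest case, where Q, K and V are used without any transformation:

import torch

Q = torch.tensor([[3.0, 3.0, 0.0], [0.5, 4.0, 0.0]])
K = Q.T
V = Q
scores = Q @ K  # inner product between every pair of tokens
weights = torch.softmax(scores, dim=-1)  # normalize each row (one row per query token)
print(f"Probability distribution: {weights}")
newQ = weights @ V
print(f"Output: {newQ}")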

Next, a version in which the input is first transformed by Wq/Wk/Wv:

import torch

Q = torch.tensor([[3.0, 3.0, 0.0], [0.5, 4.0, 0.0]])
torch.manual_seed(123)
d_q, d_k, d_v = 4, 4, 5  # number of columns of W_query, W_key, W_value
d = Q.shape[1]  # number of rows of W_query, W_key, W_value equals the input token dimension

# Create W_query, W_key, W_value (randomly generated)
W_query = torch.nn.Parameter(torch.rand(d, d_q))
W_key = torch.nn.Parameter(torch.rand(d, d_k))
W_value = torch.nn.Parameter(torch.rand(d, d_v))
print("W_query:", W_query)
print("W_key:", W_key)
print("W_value:", W_value)

# First compute only the attention of "apple" over the whole sentence to see the effect
apple = Q[0]
query_apple = apple @ W_query
keys = Q @ W_key
values = Q @ W_value
print(f"query_apple:{query_apple}")
print(f"keys:{keys}")
print(f"values:{values}")
scores = query_apple @ keys.T
print(f"scores:{scores}")
weights = torch.softmax(scores, dim=0)  # scores is 1-D here, so dim=0 is its only axis
print(f"weights:{weights}")
newQ = weights @ values
print(f"newQ:{newQ}")

# Now the whole sentence at once
querys = Q @ W_query
all_scores = querys @ keys.T
print(f"all_scores:{all_scores}")
all_weights = torch.softmax(all_scores, dim=-1)
print(f"all_weights:{all_weights}")
output = all_weights @ values
print(f"output:{output}")

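The final output has one row per token, and its last dimension equals d_v, the number of columns of W_value. Here that gives a shape of (2, 5).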
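The shape claim above can be checked directly. The following is a minimal sketch, not the article's code: it wraps the same unscaled computation in a helper called `attention` (a name chosen here for illustration) and asserts the output shape and that each row of weights sums to 1.

```python
import torch

def attention(x, w_q, w_k, w_v):
    """Single-head, unscaled attention over a (seq_len, d) input.

    Hypothetical helper for illustration; returns (output, weights).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T                         # (seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)  # each row is a probability distribution
    return weights @ v, weights              # output is (seq_len, d_v)

torch.manual_seed(123)
x = torch.tensor([[3.0, 3.0, 0.0], [0.5, 4.0, 0.0]])
w_q, w_k, w_v = torch.rand(3, 4), torch.rand(3, 4), torch.rand(3, 5)

out, weights = attention(x, w_q, w_k, w_v)
print(out.shape)             # torch.Size([2, 5]): 2 tokens, d_v = 5
print(weights.sum(dim=-1))   # every row sums to 1
```

Note that d_q and d_k must be equal for `q @ k.T` to be defined, while d_v is free to differ, which is why only d_v shows up in the output shape.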

2. Swapping the token order does not change the result

import torch

def simple_attention(Q):
    K = Q.T
    V = Q
    scores = Q @ K  # inner products
    weights = torch.softmax(scores, dim=-1)
    print(f"Probability distribution: {weights}")
    newQ = weights @ V
    print(f"Output: {newQ}")

Q = torch.tensor([[3.0, 3.0, 0.0], [0.5, 4.0, 0.0]])
Q1 = torch.tensor([[0.5, 4.0, 0.0], [3.0, 3.0, 0.0]])
print("Simulating 'apple pear':")
simple_attention(Q)
print("Simulating 'pear apple':")
simple_attention(Q1)

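As the output shows, swapping the token order between "apple pear" and "pear apple" does not change the values of the new pear vector or the new apple vector; only the order of the rows changes. Because softmax is used here to obtain the probability distribution, the numbers differ from the example in the previous post. That detail can be ignored.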
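The same property can be stated as a check rather than read off the printed output: permuting the input rows permutes the output rows in the same way. The sketch below reuses the article's two-token input; the return-value version of `simple_attention` and the variable `perm` are introduced here for illustration only.

```python
import torch

def simple_attention(x):
    """Untransformed attention (Q = K = V = x); returns the new token vectors."""
    weights = torch.softmax(x @ x.T, dim=-1)
    return weights @ x

x = torch.tensor([[3.0, 3.0, 0.0], [0.5, 4.0, 0.0]])
perm = torch.tensor([1, 0])  # swap the two tokens

out = simple_attention(x)
out_swapped = simple_attention(x[perm])

# Reordering the output of the original sentence gives the output of the swapped one.
print(torch.allclose(out[perm], out_swapped))  # True
```

This is why a Transformer needs some form of positional information: the attention computation by itself cannot tell the two word orders apart.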

3. softmax:

import numpy as np

def softmax(x):
    e_x = np.exp(x)
    return e_x / e_x.sum(axis=0)

def softmax_with_temperature(x, T):
    e_x = np.exp(x / T)
    return e_x / e_x.sum(axis=0)

# Example usage
if __name__ == "__main__":
    input_vector = np.array([2.0, 1.0, 0.1])
    output = softmax(input_vector)
    print("Softmax Output:", output)
    print("Softmax with Temperature 0.5 Output:", softmax_with_temperature(input_vector, 0.5))
    print("Softmax with Temperature 1 Output:", softmax_with_temperature(input_vector, 1))
    print("Softmax with Temperature 5 Output:", softmax_with_temperature(input_vector, 5))

As T increases, the probability distribution moves closer and closer to a uniform distribution.
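One practical caveat with a plain `np.exp(x)` softmax is that it overflows for large inputs. A common remedy, shown here as a sketch and not as part of the article's code, is to subtract the maximum before exponentiating; this leaves the result unchanged because softmax is invariant to adding a constant to every input. The name `stable_softmax` is chosen for this illustration.

```python
import numpy as np

def stable_softmax(x, temperature=1.0):
    """Softmax with temperature that avoids overflow for large inputs."""
    z = np.asarray(x, dtype=float) / temperature
    z = z - z.max()  # shift so the largest exponent is exp(0) = 1
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
for t in (0.5, 1.0, 5.0):
    print(f"T={t}: {stable_softmax(logits, t)}")

# Inputs this large would overflow np.exp without the shift.
print(stable_softmax(np.array([1000.0, 999.0])))
```

The largest probability shrinks as T grows, matching the trend toward a uniform distribution described above.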

4. Dividing the softmax input by $\sqrt{d_k}$

Using the same softmax function as above, the following demonstrates the effect of dividing by $\sqrt{d_k}$:

import numpy as np
import matplotlib.pyplot as plt

# softmax() is the function defined in section 3.
# High-dimensional input vector: a 100-dimensional Gaussian random vector,
# multiplied by 10 to mimic large inner products
input_vector_high_dim = np.random.randn(100) * 10
output_high_dim = softmax(input_vector_high_dim)
print("High Dimension Softmax Output:", output_high_dim)

# Print the range of the high-dimensional probability distribution
print("Max Probability in High Dimension:", np.max(output_high_dim))
print("Min Probability in High Dimension:", np.min(output_high_dim))

# The same input vector divided by 10
input_vector_high_dim_div10 = input_vector_high_dim / 10
output_high_dim_div10 = softmax(input_vector_high_dim_div10)
print("High Dimension Softmax Output (Divided by 10):", output_high_dim_div10)

# Print the range of the scaled probability distribution
print("Max Probability in High Dimension (Divided by 10):", np.max(output_high_dim_div10))
print("Min Probability in High Dimension (Divided by 10):", np.min(output_high_dim_div10))

# Plot both probability distributions
plt.figure(figsize=(10, 6))
plt.plot(output_high_dim, label='High Dim')
plt.plot(output_high_dim_div10, label='High Dim Divided by 10')
plt.legend()
plt.title('High Dimension Softmax Output Comparison')
plt.xlabel('Index')
plt.ylabel('Probability')
plt.show()

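Before dividing by $\sqrt{d_k}$, the large inner products make the softmax distribution very sharp: one entry is close to 1 and the rest are close to 0. In that saturated regime the derivative of the loss through softmax is close to zero at almost every position, so backpropagation cannot update the weights effectively. If you are interested, try differentiating the loss through the softmax function yourself.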
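Why the factor is $\sqrt{d_k}$ specifically can be checked numerically. The sketch below assumes that query and key components are independent with zero mean and unit variance, which is an idealization and not something the article's random weights guarantee. Under that assumption the dot product of two $d_k$-dimensional vectors has a standard deviation of about $\sqrt{d_k}$, and dividing by $\sqrt{d_k}$ brings it back to roughly 1. The helper name `dot_product_std` is made up for this illustration.

```python
import numpy as np

def dot_product_std(d_k, n_samples=10000, seed=0):
    """Return (std of q.k, std of q.k / sqrt(d_k)) for random unit-variance q and k."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    dots = (q * k).sum(axis=1)  # one dot product per sampled (q, k) pair
    return dots.std(), (dots / np.sqrt(d_k)).std()

for d_k in (4, 64, 256):
    raw, scaled = dot_product_std(d_k)
    print(f"d_k={d_k:4d}  raw std={raw:6.2f}  scaled std={scaled:4.2f}")
```

The article's demo multiplies a 100-dimensional vector by 10 and then divides by 10, which corresponds to $\sqrt{100} = 10$ in this picture.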

Continuing from the code above, compute the gradient of a loss on the softmax output:

import numpy as np
import torch
import torch.nn.functional as F

def test_grad(dim_vector):
    # Assumed input logits
    z = torch.tensor(dim_vector, requires_grad=True)
    print(z)
    # Compute the softmax output
    p = F.softmax(z, dim=0)
    true_label = np.zeros(100)
    true_label[3] = 1
    # Simulated loss function (cross-entropy)
    y = torch.tensor(true_label)  # one-hot encoded true label
    loss = -torch.sum(y * torch.log(p))
    # Backpropagate and read the gradient with respect to z
    loss.backward()
    return z.grad

grad_div10 = test_grad(input_vector_high_dim_div10)
grad = test_grad(input_vector_high_dim)
print(f"grad_div10:{grad_div10}")
print(f"grad:{grad}")

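The gradient computed without dividing by $\sqrt{d_k}$ is clearly close to 0 at almost every position. The code above relies on the automatic differentiation that torch already implements. The gradient could also be derived by hand from the loss function, but the goal here is only to demonstrate the effect, so we stop at this point.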
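For readers who do want the hand derivation: with $p = \mathrm{softmax}(z)$ and a one-hot label $y$, the cross-entropy loss $L = -\sum_i y_i \log p_i$ has the gradient $\partial L / \partial z_k = p_k - y_k$. The sketch below compares that closed form with autograd for a scaled and an unscaled input. It uses `F.log_softmax` in place of `torch.log(F.softmax(...))` for numerical safety, and the helper name `cross_entropy_grad` is hypothetical.

```python
import torch
import torch.nn.functional as F

def cross_entropy_grad(logits, target_index):
    """Gradient of -log softmax(logits)[target_index] w.r.t. logits, via autograd."""
    z = logits.detach().clone().requires_grad_(True)
    loss = -F.log_softmax(z, dim=0)[target_index]
    loss.backward()
    return z.grad

torch.manual_seed(0)
base = torch.randn(100, dtype=torch.float64)
results = {}
for scale in (10.0, 1.0):  # 10.0 mimics unscaled scores, 1.0 the scores after division
    z = base * scale
    y = torch.zeros_like(z)
    y[3] = 1.0
    auto = cross_entropy_grad(z, 3)
    manual = torch.softmax(z, dim=0) - y  # closed form: p - y
    near_zero = int((auto.abs() < 1e-3).sum())
    results[scale] = (torch.allclose(auto, manual), near_zero)
    print(f"scale={scale}: matches p - y = {results[scale][0]}, entries with |grad| < 1e-3: {near_zero}")
```

When softmax saturates, $p$ is nearly one-hot, so $p_k - y_k$ is nearly zero everywhere except at the true label and at the position that dominates the softmax.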

5. Multi-head attention:

import torch
import torch.nn as nn

torch.manual_seed(123)

# Input matrix Q
Q = torch.tensor([[3.0, 3.0, 0.0], [0.5, 4.0, 0.0]])

# Dimension settings
d_q, d_k, d_v = 4, 4, 5  # query, key, value dimensions of each head
d_model = Q.shape[1]     # input token dimension
num_heads = 2            # number of heads

# Initialize the weight matrices of each head
W_query = nn.ParameterList([nn.Parameter(torch.rand(d_model, d_q)) for _ in range(num_heads)])
W_key = nn.ParameterList([nn.Parameter(torch.rand(d_model, d_k)) for _ in range(num_heads)])
W_value = nn.ParameterList([nn.Parameter(torch.rand(d_model, d_v)) for _ in range(num_heads)])

# Output weight matrix
W_output = nn.Parameter(torch.rand(num_heads * d_v, d_model))

# Print the weight matrices
for i in range(num_heads):
    print(f"W_query_{i+1}:\n{W_query[i]}")
    print(f"W_key_{i+1}:\n{W_key[i]}")
    print(f"W_value_{i+1}:\n{W_value[i]}")

# Compute Q, K, V for each head
queries = [Q @ W_query[i] for i in range(num_heads)]
keys = [Q @ W_key[i] for i in range(num_heads)]
values = [Q @ W_value[i] for i in range(num_heads)]

# Compute the attention scores and weights of each head
outputs = []
for i in range(num_heads):
    scores = queries[i] @ keys[i].T / (d_k ** 0.5)
    weights = torch.softmax(scores, dim=-1)
    output = weights @ values[i]
    outputs.append(output)

# Concatenate the outputs of all heads
concat_output = torch.cat(outputs, dim=-1)
print(f"concat_output:\n{concat_output}")

# Final linear transformation
final_output = concat_output @ W_output

# Print the result
print(f"Final Output:\n{final_output}")
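The per-head Python loop above is easy to read, but the heads are independent and can be computed in one batched operation. The sketch below is one possible arrangement, not the article's method: it stores each projection as a single tensor with a leading head dimension (the names `w_q`, `w_k`, `w_v` and the fresh random weights are assumptions of this example) and checks the result against an explicit loop.

```python
import torch

torch.manual_seed(123)
x = torch.tensor([[3.0, 3.0, 0.0], [0.5, 4.0, 0.0]])  # (seq_len, d_model)
num_heads, d_model, d_k, d_v = 2, 3, 4, 5

# One weight tensor per projection, with the head index as the leading dimension.
w_q = torch.rand(num_heads, d_model, d_k)
w_k = torch.rand(num_heads, d_model, d_k)
w_v = torch.rand(num_heads, d_model, d_v)

# Batched version: x broadcasts across the head dimension in matmul.
q, k, v = x @ w_q, x @ w_k, x @ w_v                     # (num_heads, seq_len, d_*)
weights = torch.softmax(q @ k.transpose(-2, -1) / d_k ** 0.5, dim=-1)
heads = weights @ v                                     # (num_heads, seq_len, d_v)
concat = heads.transpose(0, 1).reshape(x.shape[0], num_heads * d_v)

# Reference version: one head at a time, concatenated along the feature axis.
loop = torch.cat(
    [
        torch.softmax((x @ w_q[i]) @ (x @ w_k[i]).T / d_k ** 0.5, dim=-1) @ (x @ w_v[i])
        for i in range(num_heads)
    ],
    dim=-1,
)

print(concat.shape)                    # torch.Size([2, 10])
print(torch.allclose(concat, loop))    # True
```

The `transpose(0, 1)` before `reshape` matters: it puts the token axis first so that each row holds head 1's features followed by head 2's, matching `torch.cat(..., dim=-1)`.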

6. Masked attention:

import torch

# Original Q matrix
Q = torch.tensor([[3.0, 3.0, 0.0], [0.5, 4.0, 0.0], [1.0, 2.0, 0.0], [2.0, 1.0, 0.0]])

torch.manual_seed(123)
d_q, d_k, d_v = 4, 4, 5  # query, key, value dimensions
d = Q.shape[1]           # number of rows of the weight matrices equals the input token dimension

# Initialize the weight matrices
W_query = torch.nn.Parameter(torch.rand(d, d_q))
W_key = torch.nn.Parameter(torch.rand(d, d_k))
W_value = torch.nn.Parameter(torch.rand(d, d_v))
print("W_query:", W_query)
print("W_key:", W_key)
print("W_value:", W_value)

# Compute Q, K, V
querys = Q @ W_query
keys = Q @ W_key
values = Q @ W_value
print(f"querys:\n{querys}")
print(f"keys:\n{keys}")
print(f"values:\n{values}")

# Compute the attention scores
all_scores = querys @ keys.T / (d_k ** 0.5)
print(f"all_scores:\n{all_scores}")

# Build the mask: True strictly above the diagonal, i.e. at future positions
seq_len = Q.shape[0]
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
masked_scores = all_scores.masked_fill(mask, float('-inf'))
print(f"Mask:\n{mask}")
print(f"Masked Scores:\n{masked_scores}")

# Compute the weights
all_weights = torch.softmax(masked_scores, dim=-1)
print(f"all_weights:\n{all_weights}")

# Compute the output
output = all_weights @ values
print(f"output:\n{output}")

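The main things to look at are the generated mask matrix and the weight distribution after the mask is applied. The mask is True strictly above the diagonal, so after softmax each token assigns zero weight to every token that comes after it.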
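The effect of the mask can also be verified programmatically. The sketch below wraps the masked computation in a helper named `causal_attention` (a name introduced here, with fresh random weights rather than the article's parameters) and checks two things: the weight matrix is lower triangular, and changing the last token leaves the outputs of all earlier tokens untouched.

```python
import torch

def causal_attention(x, w_q, w_k, w_v):
    """Scaled attention in which token i may only attend to tokens 0..i."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / k.shape[-1] ** 0.5
    seq_len = x.shape[0]
    mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
    weights = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
    return weights @ v, weights

torch.manual_seed(123)
x = torch.tensor([[3.0, 3.0, 0.0], [0.5, 4.0, 0.0], [1.0, 2.0, 0.0], [2.0, 1.0, 0.0]])
w_q, w_k, w_v = torch.rand(3, 4), torch.rand(3, 4), torch.rand(3, 5)

out, weights = causal_attention(x, w_q, w_k, w_v)

# Replace the last token with something very different and recompute.
x_changed = x.clone()
x_changed[-1] = torch.tensor([9.0, 9.0, 9.0])
out_changed, _ = causal_attention(x_changed, w_q, w_k, w_v)

print(torch.equal(weights, torch.tril(weights)))      # True: no weight on future tokens
print(torch.allclose(out[:-1], out_changed[:-1]))     # True: earlier outputs unchanged
```

This is the property that lets a decoder be trained on a whole sequence at once while still predicting each position only from what precedes it.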
