
Detailed Analysis of LLM Per-Layer Parameters (Using LLaMA as an Example)

news 2026/2/10 22:34:34

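Most articles online that analyze LLM parameters are fairly coarse-grained, which is not much help when an LLM has to be deployed precisely. This post records my process of analyzing the parameters of an LLM.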

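First, look at Q, K, and V. Start with the original Transformer paper:
[Image: excerpt from the original Transformer paper]
In other words, when h (heads) = 1, in the default case ($d_k = d_v = d_{model}/h$), $W_i^Q$, $W_i^K$, and $W_i^V$ are all 2-D square matrices of size $d_{model} \times d_{model}$.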
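
For reference, the multi-head attention definition in the Transformer paper ("Attention Is All You Need") can be written as below. This is restated from the paper, not recovered from the missing image, and it assumes the paper's default $d_k = d_v = d_{model}/h$.

$$
\begin{aligned}
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O \\
\mathrm{head}_i &= \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V) \\
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V \\
W_i^Q, W_i^K &\in \mathbb{R}^{d_{model} \times d_k}, \quad
W_i^V \in \mathbb{R}^{d_{model} \times d_v}, \quad
W^O \in \mathbb{R}^{h d_v \times d_{model}}
\end{aligned}
$$

Setting $h = 1$ gives $d_k = d_v = d_{model}$, which is why all three projection matrices become $d_{model} \times d_{model}$ squares.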

Now compare this with the LLaMA source code (https://github.com/facebookresearch/llama/blob/main/llama/model.py):

    class ModelArgs:
        dim: int = 4096
        n_layers: int = 32
        n_heads: int = 32
        n_kv_heads: Optional[int] = None
        vocab_size: int = -1  # defined later by tokenizer
        multiple_of: int = 256  # make SwiGLU hidden layer size multiple of large power of 2
        ffn_dim_multiplier: Optional[float] = None
        norm_eps: float = 1e-5
        max_batch_size: int = 32
        max_seq_len: int = 2048

    # ...

    class Attention(nn.Module):
        """Multi-head attention module."""

        def __init__(self, args: ModelArgs):
            """
            Initialize the Attention module.

            Args:
                args (ModelArgs): Model configuration parameters.

            Attributes:
                n_kv_heads (int): Number of key and value heads.
                n_local_heads (int): Number of local query heads.
                n_local_kv_heads (int): Number of local key and value heads.
                n_rep (int): Number of repetitions for local heads.
                head_dim (int): Dimension size of each attention head.
                wq (ColumnParallelLinear): Linear transformation for queries.
                wk (ColumnParallelLinear): Linear transformation for keys.
                wv (ColumnParallelLinear): Linear transformation for values.
                wo (RowParallelLinear): Linear transformation for output.
                cache_k (torch.Tensor): Cached keys for attention.
                cache_v (torch.Tensor): Cached values for attention.
            """
            super().__init__()
            self.n_kv_heads = args.n_heads if args.n_kv_heads is None else args.n_kv_heads
            model_parallel_size = fs_init.get_model_parallel_world_size()
            self.n_local_heads = args.n_heads // model_parallel_size
            self.n_local_kv_heads = self.n_kv_heads // model_parallel_size
            self.n_rep = self.n_local_heads // self.n_local_kv_heads
            self.head_dim = args.dim // args.n_heads

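With the default n_kv_heads = None, this gives:
self.n_kv_heads = h = 32
self.head_dim = 4096/32 = 128
So each of $W_i^Q$, $W_i^K$, and $W_i^V$ has size (4096, 128). These are the weight shapes. The activation shapes depend on the sequence length: for an input of $n$ tokens, the input is $(n, 4096)$, so each head's $Q$, $K$, and $V$ is $(n, 128)$. $Q \times K^T$ then has size $(n, n)$, which equals (4096, 4096) only when $n = 4096$. The scaling division and the softmax leave this size unchanged, and multiplying by $V$ brings it back to $(n, 128)$. Attention therefore does not change the size (in the default case $d_k = d_v$).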
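
The shape bookkeeping above can be checked with a few lines of plain Python. This is only a sketch: `attention_shapes` is a hypothetical helper, not part of the LLaMA code. It assumes a model-parallel size of 1, and `seq_len=2048` is just the `max_seq_len` default from `ModelArgs`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelArgs:
    # Only the fields needed for the shape calculation are repeated here.
    dim: int = 4096
    n_heads: int = 32
    n_kv_heads: Optional[int] = None


def attention_shapes(args: ModelArgs, seq_len: int) -> dict:
    """Hypothetical helper: per-head weight and activation shapes."""
    n_kv_heads = args.n_heads if args.n_kv_heads is None else args.n_kv_heads
    head_dim = args.dim // args.n_heads
    return {
        "n_kv_heads": n_kv_heads,
        "head_dim": head_dim,
        "W_i": (args.dim, head_dim),       # per-head projection weight
        "Q_i": (seq_len, head_dim),        # X @ W_i^Q, with X of shape (seq_len, dim)
        "scores": (seq_len, seq_len),      # Q_i @ K_i^T, unchanged by scale + softmax
        "head_out": (seq_len, head_dim),   # softmax(scores) @ V_i
    }


shapes = attention_shapes(ModelArgs(), seq_len=2048)
for name, value in shapes.items():
    print(f"{name}: {value}")
# head_dim: 128, W_i: (4096, 128), scores: (2048, 2048), head_out: (2048, 128)
```

Only the weight shape `W_i` counts toward the parameter total. The `scores` and `head_out` shapes are activations and matter for memory at inference time, not for model size.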

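After Concat, the separate heads are merged again. For a sequence of $n$ tokens the result has size $(n, 32 \times 128) = (n, 4096)$. The $W^O$ fully connected layer is itself a (4096, 4096) square matrix, so after it the size is still $(n, 4096)$.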
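
Putting the attention weights together gives a per-layer parameter count. The sketch below assumes the four projections `wq`, `wk`, `wv`, and `wo` carry no bias terms (as far as I can tell, the linked source creates them with `bias=False`) and that `n_kv_heads` equals `n_heads`, as in the defaults above.

```python
dim = 4096
n_heads = 32
n_kv_heads = 32               # default: same as n_heads
head_dim = dim // n_heads     # 128

# All heads of one projection are stored in a single fused weight matrix.
wq = dim * (n_heads * head_dim)      # 4096 x 4096
wk = dim * (n_kv_heads * head_dim)   # 4096 x 4096
wv = dim * (n_kv_heads * head_dim)   # 4096 x 4096
wo = (n_heads * head_dim) * dim      # 4096 x 4096

attn_params = wq + wk + wv + wo
print(f"{attn_params:,}")  # 67,108,864
```

That is, the 32 per-head (4096, 128) matrices of each projection add up to one (4096, 4096) matrix, and the attention block holds four of these per layer.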

Next, look at the feed-forward layer. According to the source code:

    class TransformerBlock(nn.Module):
        def __init__(self, layer_id: int, args: ModelArgs):
            """
            Initialize a TransformerBlock.

            Args:
                layer_id (int): Identifier for the layer.
                args (ModelArgs): Model configuration parameters.

            Attributes:
                n_heads (int): Number of attention heads.
                dim (int): Dimension size of the model.
                head_dim (int): Dimension size of each attention head.
                attention (Attention): Attention module.
                feed_forward (FeedForward): FeedForward module.
                layer_id (int): Identifier for the layer.
                attention_norm (RMSNorm): Layer normalization for attention output.
                ffn_norm (RMSNorm): Layer normalization for feedforward output.
            """
            super().__init__()
            self.n_heads = args.n_heads
            self.dim = args.dim
            self.head_dim = args.dim // args.n_heads
            self.attention = Attention(args)
            self.feed_forward = FeedForward(
                dim=args.dim,
                hidden_dim=4 * args.dim,
                multiple_of=args.multiple_of,
                ffn_dim_multiplier=args.ffn_dim_multiplier,
            )
            self.layer_id = layer_id
            self.attention_norm = RMSNorm(args.dim, eps=args.norm_eps)
            self.ffn_norm = RMSNorm(args.dim, eps=args.norm_eps)

        def forward(
            self,
            x: torch.Tensor,
            start_pos: int,
            freqs_cis: torch.Tensor,
            mask: Optional[torch.Tensor],
        ):
            """
            Perform a forward pass through the TransformerBlock.

            Args:
                x (torch.Tensor): Input tensor.
                start_pos (int): Starting position for attention caching.
                freqs_cis (torch.Tensor): Precomputed cosine and sine frequencies.
                mask (torch.Tensor, optional): Masking tensor for attention. Defaults to None.

            Returns:
                torch.Tensor: Output tensor after applying attention and feedforward layers.
            """
            h = x + self.attention.forward(
                self.attention_norm(x), start_pos, freqs_cis, mask
            )
            out = h + self.feed_forward.forward(self.ffn_norm(h))
            return out

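After the multi-head attention layer, the output goes through the residual addition and a norm (RMSNorm), then enters the feed_forward fully connected layers. The first dimension of the fully connected layer is args.dim = 4096, and the hidden_dim argument passed in is 4 * args.dim = 4 * 4096 = 16384. This part is not settled yet: 16384 is only the value handed to FeedForward. Its constructor in the same file rescales it (to 2/3 of the value, rounded up to a multiple of multiple_of), so 16384 is not the final width of the weight matrices.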
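
The 16384 above is the value passed into `FeedForward`, not necessarily the width the layer ends up with. The sketch below reproduces the adjustment as I read it in `FeedForward.__init__` in the linked `model.py` (that code is not quoted in this post, so treat the steps as an assumption to check against the source). `ffn_hidden_dim` is a hypothetical helper name, and the count assumes three bias-free linear layers `w1`, `w2`, `w3` for SwiGLU.

```python
from typing import Optional


def ffn_hidden_dim(
    dim: int,
    multiple_of: int = 256,
    ffn_dim_multiplier: Optional[float] = None,
) -> int:
    """Hypothetical helper mirroring the hidden-size adjustment for SwiGLU."""
    hidden_dim = 4 * dim                      # value passed by TransformerBlock
    hidden_dim = int(2 * hidden_dim / 3)      # SwiGLU uses 2/3 of the 4x width
    if ffn_dim_multiplier is not None:
        hidden_dim = int(ffn_dim_multiplier * hidden_dim)
    # Round up to the next multiple of `multiple_of`.
    return multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)


dim = 4096
hidden = ffn_hidden_dim(dim)
print(hidden)  # 11008

# w1: dim -> hidden, w3: dim -> hidden, w2: hidden -> dim
ffn_params = 3 * dim * hidden
print(f"{ffn_params:,}")  # 135,266,304
```

Under these assumptions the feed-forward weights are (4096, 11008) rather than (4096, 16384). The block has three weight matrices because of the gated SwiGLU activation.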
