当前位置：首页 > news >正文

如何在24GB的GPU上运行DeepSeek-R1-Distill-Qwen-32B

news 2025/12/4 8:22:19

如何在24GB的GPU上运行DeepSeek-R1-Distill-Qwen-32B

- 一、背景
- 二、解决方案
- 三、操作步骤
- - 1.下载模型
  - 2.安装依赖
  - 3.量化
  - 4.生成推理代码
  - 5.运行
  - - A.缓存上限为128条
    - B.不限制缓存上限
    - C.输出内容

一、背景

随着深度学习的不断发展，大型语言模型（LLM，Large Language Model）在自然语言处理领域展现出了强大的能力。然而，伴随着模型参数规模的指数级增长，运行这些模型所需的计算资源也变得异常庞大，尤其是对显存（GPU内存）的需求。因此，如何在有限的GPU显存下有效地运行超大规模的LLM，成为了一个亟待解决的挑战。

本文验证在GPU显存受限的情况下，如何高效地运行超出GPU内存容量的LLM模型。通过对模型权重的量化和内存管理策略的优化，期望能够突破硬件瓶颈，为大型模型的部署和应用提供新的思路。

二、解决方案

下面的方案，主要包括权重量化、内存缓存机制以及自定义Linear的设计。具体方案如下：

权重的INT4块量化
- 量化策略：将模型的权重参数进行INT4（4位整数）块量化处理，量化的块大小设定为128。这种量化方式能够大幅度减少模型权重所占用的存储空间。
- 内存优势：经过INT4量化后的权重占用空间显著降低，使得所有权重可以加载到主机（HOST）内存中。这不仅缓解了GPU显存的压力，还为后续的高效读取奠定了基础。
减少磁盘I/O操作
- 全量加载：将所有量化后的INT4权重一次性加载到HOST内存中，避免了在模型运行过程中频繁进行磁盘读写操作。这种方式有效减少了磁盘I/O带来的时间开销和性能瓶颈。
设备内存缓存机制
- 缓存设计：在GPU设备内存中建立一个缓存机制，设定最大缓存条目数为N。N的取值与具体的GPU配置相关，目的是充分利用可用的设备内存，最大化其占用率，提升数据读取效率。
- 动态管理：缓存机制需要智能地管理内存的分配和释放，确保在不超过设备内存上限的情况下，高效地存取所需的数据。
权重预加载线程
- 职责分离：引入一个专门的权重预加载线程，负责将HOST内存中的INT4权重进行反量化处理（即将INT4还原为计算所需的格式），并将处理后的权重加载到GPU设备内存的缓存中。
- 效率优化：通过预加载线程的异步处理，提升了数据准备的效率，确保模型在需要数据时可以及时获取，最大程度减少等待时间。
自定义Linear模块
- 模块替换：将原有的nn.Linear层替换为自定义的Module。在模型构建和加载过程中，使用该自定义模块来承载线性计算任务。
- 运行机制：自定义的Module在前向传播（forward）过程中，从设备内存的缓存中获取所需的权重进行计算。计算完成后，立即释放权重占用的设备内存，以供后续的计算任务使用。
- 优势：这种动态加载和释放的机制，避免了在整个计算过程中权重长时间占用设备内存，极大地提高了内存的利用效率。

三、操作步骤

1.下载模型

# 模型介绍: https://www.modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B# 下载模型
apt install git-lfs -y
git clone https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B.git

2.安装依赖

MAX_JOBS=4 pip install flash-attn==2.3.6
pip install torch-tb-profiler

3.量化

cat > extract_weights.py << EOF
import torch
import os
from tqdm import tqdm
from glob import glob
import torch
import sys
from safetensors.torch import safe_open, save_filedef quantize_tensor_int4(tensor):"""将bfloat16的Tensor按照块大小128进行量化为int4，并返回每个块的scale。参数：tensor (torch.Tensor): bfloat16类型的输入Tensor。返回：int4_tensor (torch.Tensor): 量化后的uint8类型的Tensor，存储int4值，每个元素包含两个int4值。scales (torch.Tensor): 每个块对应的bfloat16类型的scale值。"""# 确保输入Tensor为bfloat16类型tensor = tensor.to(torch.bfloat16)# 将Tensor展平为一维flat_tensor = tensor.flatten()N = flat_tensor.numel()block_size = 128num_blocks = (N + block_size - 1) // block_size  # 计算块的数量# 计算每个元素的块索引indices = torch.arange(N, device=flat_tensor.device)block_indices = indices // block_size  # shape: [N]# 计算每个块的x_maxabs_tensor = flat_tensor.abs()zeros_needed = num_blocks * block_size - N# 对张量进行填充，使其长度为num_blocks * block_sizeif zeros_needed > 0:padded_abs_tensor = torch.cat([abs_tensor, torch.zeros(zeros_needed, device=abs_tensor.device, dtype=abs_tensor.dtype)])else:padded_abs_tensor = abs_tensorreshaped_abs_tensor = padded_abs_tensor.view(num_blocks, block_size)x_max = reshaped_abs_tensor.max(dim=1).values  # shape: [num_blocks]# 处理x_max为0的情况，避免除以0x_max_nonzero = x_max.clone()x_max_nonzero[x_max_nonzero == 0] = 1.0  # 防止除以0# 计算scalescales = x_max_nonzero / 7.0  # shape: [num_blocks]scales = scales.to(torch.bfloat16)# 量化scales_expanded = scales[block_indices]  # shape: [N]q = torch.round(flat_tensor / scales_expanded).clamp(-8, 7).to(torch.int8)# 将有符号int4转换为无符号表示q_unsigned = q & 0x0F  # 将范围[-8,7]映射到[0,15]# 如果元素数量是奇数，补充一个零if N % 2 != 0:q_unsigned = torch.cat([q_unsigned, torch.zeros(1, dtype=torch.int8, device=q.device)])# 打包两个int4到一个uint8q_pairs = q_unsigned.view(-1, 2)int4_tensor = (q_pairs[:, 0].to(torch.uint8) << 4) | q_pairs[:, 1].to(torch.uint8)return int4_tensor, scalestorch.set_default_device("cuda")
if len(sys.argv)!=3:print(f"{sys.argv[0]} input_model_dir output_dir")
else:input_model_dir=sys.argv[1]output_dir=sys.argv[2]if not os.path.exists(output_dir):os.makedirs(output_dir)state_dicts = {}for file_path in tqdm(glob(os.path.join(input_model_dir, "*.safetensors"))):with safe_open(file_path, framework="pt", device="cuda") as f:for name in f.keys():param: torch.Tensor = f.get_tensor(name)#print(name,param.shape,param.dtype)if "norm" in name or "embed" in name:state_dicts[name] = paramelse:if "weight" in name:int4_tensor, scales=quantize_tensor_int4(param)state_dict={}state_dict["w"]=int4_tensor.datastate_dict["scales"]=scales.datastate_dict["shape"]=param.shapetorch.save(state_dict, os.path.join(output_dir, f"{name}.pt"))else:torch.save(param.data, os.path.join(output_dir, f"{name}.pt"))torch.save(state_dicts, os.path.join(output_dir, "others.pt"))
EOF
python extract_weights.py DeepSeek-R1-Distill-Qwen-32B ./data

4.生成推理代码

cat > infer.py << EOF
import sys
import os
from transformers import AutoModelForCausalLM, AutoTokenizer,BitsAndBytesConfig
import torch
import time
import numpy as np
import json
from torch.utils.data import Dataset, DataLoader
import threading
from torch import Tensor
from tqdm import tqdm
import triton
import triton.language as tl
import time
import queue
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")@triton.jit
def dequantize_kernel(int4_ptr,     # 量化后的 int4 张量指针scales_ptr,   # 每个块的 scale 值指针output_ptr,   # 输出张量指针N,            # 总元素数量num_blocks,   # 总块数BLOCK_SIZE: tl.constexpr  # 每个线程块处理的元素数量
):# 计算全局元素索引pid = tl.program_id(axis=0)offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)mask = offs < N# 计算 int4 张量中的索引int4_idxs = offs // 2  # 每个 uint8 包含两个 int4 值int4_vals = tl.load(int4_ptr + int4_idxs, mask=int4_idxs < (N + 1) // 2)# 提取高 4 位和低 4 位的 int4 值shift = 4 * (1 - (offs % 2))q = (int4_vals >> shift) & 0x0Fq = q.to(tl.int8)# 将无符号 int4 转换为有符号表示q = (q + 8) % 16 - 8  # 将范围 [0, 15] 映射回 [-8, 7]# 计算每个元素所属的块索引block_size = 128block_idxs = offs // block_sizescales = tl.load(scales_ptr + block_idxs, mask=block_idxs < num_blocks)# 反量化dequantized = q.to(tl.float32) * scales# 存储结果tl.store(output_ptr + offs, dequantized, mask=mask)def dequantize_tensor_int4_triton(int4_tensor, scales, original_shape):N = original_shape.numel()num_blocks = scales.numel()output = torch.empty(N, dtype=torch.bfloat16, device=int4_tensor.device)# 动态调整块大小（A100建议512-1024）BLOCK_SIZE = min(1024, triton.next_power_of_2(N))grid = (triton.cdiv(N, BLOCK_SIZE),)dequantize_kernel[grid](int4_tensor, scales, output,N, scales.numel(), BLOCK_SIZE=BLOCK_SIZE)output = output.view(original_shape)return outputdef load_pinned_tensor(path):data = torch.load(path, map_location='cpu',weights_only=True)  # 先加载到CPU# 递归遍历所有对象，对Tensor设置pin_memorydef _pin(tensor):if isinstance(tensor, torch.Tensor):return tensor.pin_memory()elif isinstance(tensor, dict):return {k: _pin(v) for k, v in tensor.items()}elif isinstance(tensor, (list, tuple)):return type(tensor)(_pin(x) for x in tensor)else:return tensorreturn _pin(data)class WeightCache:def __init__(self, weight_names, weight_dir, max_cache_size):self.weight_names = weight_namesself.weight_dir = weight_dirif max_cache_size==-1:self.max_cache_size = len(weight_names)else:self.max_cache_size = max_cache_sizeself.cache = {}self.cache_lock = threading.Lock()self.condition = threading.Condition(self.cache_lock)self.index = 0self.weight_cpu = []self.dequantized = {}self.accessed_weights = set()  # 用于记录被 get 过的权值for name in tqdm(self.weight_names):weight_path = os.path.join(self.weight_dir, name + ".pt")self.weight_cpu.append(load_pinned_tensor(weight_path))self.loader_thread = threading.Thread(target=self._loader)self.loader_thread.daemon = Trueself.loader_thread.start()self.last_ts = time.time()def _loader(self):stream = torch.cuda.Stream()while True:with self.condition:while len(self.cache) > self.max_cache_size:# 尝试删除已被 get 过的权值removed = Falsefor weight_name in list(self.cache.keys()):if weight_name in self.accessed_weights:del self.cache[weight_name]self.accessed_weights.remove(weight_name)removed = Truebreak  # 每次删除一个if not removed:self.condition.wait()# 加载新的权值到缓存if self.index >= len(self.weight_names):self.index = 0weight_name = self.weight_names[self.index]if weight_name in self.cache:time.sleep(0.01)continuew = self.weight_cpu[self.index]with torch.cuda.stream(stream):if "weight" in weight_name:new_weight = {"w": w['w'].to(device, non_blocking=False),"scales": w['scales'].to(device, non_blocking=False),"shape": w['shape']}else:new_weight = w.to(device, non_blocking=False)with self.condition:self.cache[weight_name] = new_weightself.index += 1self.condition.notify_all()def wait_full(self):with self.condition:while len(self.cache) < self.max_cache_size:self.condition.wait()print(len(self.cache), self.max_cache_size)def get(self, weight_name):with self.condition:while weight_name not in self.cache:self.condition.wait()weight = self.cache[weight_name]  # 不再从缓存中删除self.accessed_weights.add(weight_name)  # 记录被 get 过的权值self.condition.notify_all()return weightclass TextGenerationDataset(Dataset):def __init__(self, json_data):self.data = json.loads(json_data)def __len__(self):return len(self.data)def __getitem__(self, idx):item = self.data[idx]input_text = item['input']expected_output = item['expected_output']return input_text, expected_outputclass Streamer:def __init__(self, tokenizer):self.cache = []self.tokenizer = tokenizerself.start_time = None      # 用于记录开始时间self.token_count = 0        # 用于记录生成的令牌数量def put(self, token):if self.start_time is None:self.start_time = time.time()  # 初始化开始时间decoded = self.tokenizer.decode(token[0], skip_special_tokens=True)self.cache.append(decoded)self.token_count += token.numel()  # 增加令牌计数elapsed_time = time.time() - self.start_timetokens_per_sec = self.token_count / elapsed_time if elapsed_time > 0 else 0print(f"{tokens_per_sec:.2f} tokens/sec| {''.join(self.cache)}", end="\r", flush=True)def end(self):total_time = time.time() - self.start_time if self.start_time else 0print("\nGeneration complete.")if total_time > 0:avg_tokens_per_sec = self.token_count / total_timeprint(f"总令牌数: {self.token_count}, 总耗时: {total_time:.2f}s, 平均速度: {avg_tokens_per_sec:.2f} tokens/sec.")else:print("总耗时过短，无法计算每秒生成的令牌数。")class MyLinear(torch.nn.Module):__constants__ = ['in_features', 'out_features']in_features: intout_features: intweight: Tensordef __init__(self, in_features: int,out_features: int,bias: bool = True,device=None,dtype=None) -> None:factory_kwargs = {'device': device, 'dtype': dtype}super().__init__()self.in_features = in_featuresself.out_features = out_featuresself.weight = Trueif bias:self.bias = Trueelse:self.bias = Falsedef forward(self, x):w = self.weight_cache.get(f"{self.w_name}.weight")weight=dequantize_tensor_int4_triton(w['w'], w['scales'],w['shape'])if self.bias:bias = self.weight_cache.get(f"{self.w_name}.bias")else:bias = Nonereturn torch.nn.functional.linear(x,weight,bias)def set_linear_name(model,weight_cache):for name, module in model.named_modules():if isinstance(module, torch.nn.Linear):module.w_name=namemodule.weight_cache=weight_cachetorch.nn.Linear=MyLinearinput_model_dir=sys.argv[1]
input_weights_dir=sys.argv[2]
cache_queue_size=int(sys.argv[3])torch.set_default_device('cuda')
tokenizer = AutoTokenizer.from_pretrained(input_model_dir)
from transformers.models.qwen2 import Qwen2ForCausalLM,Qwen2Config
config=Qwen2Config.from_pretrained(f"{input_model_dir}/config.json")
config.use_cache=True
config.torch_dtype=torch.float16
config._attn_implementation="flash_attention_2"
model =Qwen2ForCausalLM(config).bfloat16().bfloat16().to(device)
checkpoint=torch.load(f"{input_weights_dir}/others.pt",weights_only=True)
model.load_state_dict(checkpoint)weight_map=[]
with open(os.path.join(input_model_dir,"model.safetensors.index.json")) as f:for name in json.load(f)["weight_map"].keys():if "norm" in name or "embed" in name:passelse:weight_map.append(name)json_data =r'''
[{"input": "1.1+2.3=?", "expected_output": "TODO"}
]
'''weight_cache = WeightCache(weight_map,input_weights_dir,cache_queue_size)
print("wait done")
set_linear_name(model,weight_cache)
model.eval()test_dataset = TextGenerationDataset(json_data)
test_dataloader = DataLoader(test_dataset, batch_size=1, shuffle=False)dataloader_iter = iter(test_dataloader)
input_text, expected_output=next(dataloader_iter)
inputs = tokenizer(input_text, return_tensors="pt").to(device)
streamer = Streamer(tokenizer)if True:stream = torch.cuda.Stream()with torch.cuda.stream(stream):with torch.inference_mode():#outputs = model.generate(**inputs, max_length=4096,streamer=streamer,do_sample=True,pad_token_id=tokenizer.eos_token_id,num_beams=1,repetition_penalty=1.1)outputs = model.generate(**inputs, max_length=4096,streamer=streamer,use_cache=config.use_cache)
else:def trace_handler(prof):print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=-1))prof.export_chrome_trace("output.json")import torch.autograd.profiler as profilerfrom torch.profiler import profile, record_function, ProfilerActivitywith torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU,torch.profiler.ProfilerActivity.CUDA],on_trace_ready=trace_handler) as p:stream = torch.cuda.Stream()with torch.cuda.stream(stream):with torch.inference_mode():outputs = model.generate(**inputs, max_length=4096,streamer=streamer,use_cache=True,do_sample=True,pad_token_id=tokenizer.eos_token_id,num_beams=1,repetition_penalty=1.1)#outputs = model.generate(**inputs, max_length=4096,streamer=streamer)#outputs = model.generate(**inputs, max_length=8)p.step()
EOF

5.运行

A.缓存上限为128条

export TRITON_CACHE_DIR=$PWD/cache
python infer.py DeepSeek-R1-Distill-Qwen-32B data 128

性能

总令牌数: 403, 总耗时: 572.34s, 平均速度: 0.70 tokens/sec.

GPU利用率

|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:03:00.0 Off |                  N/A |
| 71%   62C    P0            186W /  350W |   10354MiB /  24576MiB |     95%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|    0   N/A  N/A     47129      C   python                                      10258MiB |
+-----------------------------------------------------------------------------------------+

B.不限制缓存上限

export TRITON_CACHE_DIR=$PWD/cache
python infer.py DeepSeek-R1-Distill-Qwen-32B data -1

性能

总令牌数: 403, 总耗时: 72.84s, 平均速度: 5.53 tokens/sec.

GPU利用率

|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:03:00.0 Off |                  N/A |
| 73%   65C    P0            330W /  350W |   22678MiB /  24576MiB |     97%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|    0   N/A  N/A     47903      C   python                                      22582MiB |
+-----------------------------------------------------------------------------------------+

C.输出内容

1.1+2.3=? Let me think.

Okay, so I need to add 1.1 and 2.3 together. Hmm, let me visualize this. I remember that when adding decimals, it’s important to line up the decimal points to make sure each place value is correctly added. So, I can write them one under the other like this:

 1.1
+2.3
------

Starting from the rightmost digit, which is the tenths place. 1 (from 1.1) plus 3 (from 2.3) equals 4. So, I write down 4 in the tenths place.

Next, moving to the units place. 1 (from 1.1) plus 2 (from 2.3) equals 3. So, I write down 3 in the units place.

Putting it all together, the sum is 3.4. Let me double-check to make sure I didn’t make a mistake. 1.1 plus 2 is 3.1, and then adding the 0.3 more gives me 3.4. Yep, that seems right.

I think I got it! The answer should be 3.4.

To add 1.1 and 2.3, follow these steps:

Align the decimal points:

 1.1
+2.3
------

Add the tenths place:

1 (from 1.1) + 3 (from 2.3) = 4

Add the units place:

1 (from 1.1) + 2 (from 2.3) = 3

Combine the results:

3.4

Final Answer: \boxed{3.4}
Generation complete.

如何在24GB的GPU上运行DeepSeek-R1-Distill-Qwen-32B

一、背景

二、解决方案

三、操作步骤

1.下载模型

2.安装依赖

3.量化

4.生成推理代码

5.运行

A.缓存上限为128条

B.不限制缓存上限

C.输出内容

相关文章：