
From 0 to 1: A Hands-On Walkthrough from Model Deployment to Evaluation

article 2026/3/24 10:09:39

The commands below are for macOS. On Windows, substitute the equivalent commands where needed.

Contents

    1. Install Python
    2. Create a Python virtual environment
    3. Install the evaluation framework
    4. Download a small model
       Common issue 1: the script fails because PyTorch is not installed
       Common issue 2: the script times out because of network problems (use a domestic mirror)
    5. Run the evaluation command
       Common issue 1: ModuleNotFoundError: No module named 'accelerate'
       Common issue 2: httpx.ConnectTimeout: [Errno 60] Operation timed out
       Common issue 3: timed out thrown while requesting HEAD https://huggingface.co/datasets/Rowan/hellaswag/resolve/main/README.md Retrying in 1s [Retry 1/5].
    Supplement: YAML only, without a Python utils.py file
    Viewing the contents of a .parquet file

1. Install Python

Python 3.9 or later is recommended.

2. Create a Python virtual environment

    # Create the virtual environment
    python3 -m venv venv
    # Activate it; all remaining shell commands must be run inside the virtual environment
    source venv/bin/activate

When everything is finished, leave the virtual environment:

    deactivate

3. Install the evaluation framework

    # Download the evaluation framework
    git clone https://github.com/EleutherAI/lm-evaluation-harness
    # Install it in editable mode
    cd lm-evaluation-harness
    pip install -e .

4. Download a small model

You can download a small model manually from https://huggingface.co/ or download it from code. Some candidates:

    Model name                       Description
    gpt2                             GPT-2 base model; very small, well suited to a first pass through the evaluation pipeline
    EleutherAI/pythia-160m           Small model with about 160M parameters; fast to train and evaluate
    stabilityai/stablelm-2-1_6b      Mid-sized open model with a good balance of quality and speed; runs locally

Taking gpt2 as the example:

    # First install transformers
    pip install transformers
    # Then install torch
    pip install torch
    # Then install accelerate
    pip install accelerate
    # Once everything is installed, verify with:
    python -c "import torch; import transformers; import accelerate; print('All good!')"

Download the gpt2 model from Python:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # could also be "EleutherAI/pythia-160m"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

This code downloads the model weights into the local Hugging Face cache under ~/.cache/huggingface/ (the hub subdirectory in recent versions of transformers, the transformers subdirectory in older ones).

Common issue 1: the script fails because PyTorch is not installed

Install command (CPU build):

    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

For the GPU build (CUDA 11.8):

    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

If you are unsure about your graphics card or driver, start with the CPU build. It is enough for practicing with small models.

To verify the installation, run in Python:

    import torch
    print(torch.__version__)
    print(torch.cuda.is_available())

Output similar to the following means PyTorch is installed and the CPU backend works (the GPU is optional):

    2.1.0
    False

Common issue 2: the script times out because of network problems (use a domestic mirror)

    import os

    # Point Hugging Face downloads at a mirror; set this before importing transformers
    os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"
    # Let transformers manage the cache itself; do not specify a path by hand
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    print("Model loaded successfully!")
    print(f"Parameter count: {sum(p.numel() for p in model.parameters()):,}")

When loading completes, the script prints the success message followed by the parameter count. If loading still fails, download the model's core files directly from the Hugging Face model page.

5. Run the evaluation command

List the available evaluation tasks:

    lm-eval ls tasks

To evaluate basic model capability, take GPT-2 on the HellaSwag benchmark as the example:

    lm_eval --model hf --model_args pretrained=gpt2 --tasks hellaswag --device cpu --batch_size 4 --output_path results.json

Note: if this fails with a connection error, see common issue 3 and run against a local copy of the dataset.

Parameters:

    --model hf                       use the Hugging Face model backend
    --model_args pretrained=gpt2     model name; can be replaced with a local path
    --tasks hellaswag                name of the evaluation task
    --device cpu                     set to cuda:0 if a GPU is available
    --batch_size 4                   number of samples per batch
    --output_path results.json       JSON file for the evaluation results

The author estimates 5 to 10 minutes for the evaluation; the CPU run logged below took about 16 minutes. When it finishes you will see something like the following (this sample comes from the local-dataset run described under common issue 3, hence the task name hellaswag_local):

    {
      "results": {
        "hellaswag_local": {
          "name": "hellaswag_local",
          "alias": "hellaswag_local",
          "sample_len": 10042,
          "acc,none": 0.2891854212308305,
          "acc_stderr,none": 0.004524575892953094,
          "acc_norm,none": 0.31139215295757816,
          "acc_norm_stderr,none": 0.004621163476949437
        }
      }
    }

This means GPT-2 reaches an accuracy of about 28.92% on HellaSwag:

- acc,none → accuracy, 28.92%
- acc_stderr,none → standard error, 0.45% (the number after the ±)
- acc_norm,none → normalized accuracy, 31.14%
- acc_norm_stderr,none → standard error, 0.46%

The same numbers also appear in the process log file eval_output.log and in the console output.

You can also evaluate several tasks at once. Example:

    lm_eval --model hf \
      --model_args pretrained=gpt2 \
      --tasks hellaswag,mmlu \
      --device cpu \
      --batch_size 4 \
      --output_path full_results.json

The run log is listed here:

    2026-03-20:14:26:02 INFO [_cli.run:377] Including path: /Users/hongshao/dataset/tasks
    2026-03-20:14:26:02 INFO [_cli.run:378] Selected Tasks: ['hellaswag_local']
    2026-03-20:14:26:03 INFO [evaluator:213] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
    2026-03-20:14:26:03 INFO [evaluator:238] Initializing hf model, with arguments: {'pretrained': '/Users/hongshao/models/gpt2'}
    2026-03-20:14:26:05 INFO [models.huggingface:256] Using device 'cpu'
    2026-03-20:14:26:05 INFO [models.huggingface:518] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cpu'}
    Loading weights:   0%|          | 0/148 [00:00<?, ?it/s]
    Loading weights: 100%|██████████| 148/148 [00:00<00:00, 66519.18it/s]
    2026-03-20:14:26:06 INFO [evaluator_utils:446] Selected tasks:
    2026-03-20:14:26:06 INFO [evaluator_utils:480] Task: hellaswag_local (/Users/hongshao/dataset/tasks/hellaswag_local.yaml)
    2026-03-20:14:26:06 INFO [api.task:312] Building contexts for hellaswag_local on rank 0...
      0%|          | 0/10042 [00:00<?, ?it/s]
      3%|▎         | 296/10042 [00:00<00:08, 1216.45it/s]
      7%|▋         | 727/10042 [00:00<00:03, 2359.78it/s]
     12%|█▏        | 1181/10042 [00:00<00:02, 3112.42it/s]
    (middle omitted) ---------------------------
    Running loglikelihood requests: 100%|█████████▉| 40164/40168 [16:02<00:00, 90.43it/s]
    Running loglikelihood requests: 100%|██████████| 40168/40168 [16:02<00:00, 41.73it/s]
    fatal: not a git repository (or any of the parent directories): .git
    2026-03-20:14:42:21 INFO [loggers.evaluation_tracker:247] Saving results aggregated
    hf ({'pretrained': '/Users/hongshao/models/gpt2'}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: 4
    |     Tasks     |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
    |---------------|------:|------|-----:|--------|---|-----:|---|-----:|
    |hellaswag_local|      1|none  |     0|acc     |↑  |0.2892|±  |0.0045|
    |               |       |none  |     0|acc_norm|↑  |0.3114|±  |0.0046|

Common issue 1: ModuleNotFoundError: No module named 'accelerate'

Inside the virtual environment, run:

    pip install accelerate

Common issue 2: httpx.ConnectTimeout: [Errno 60] Operation timed out

The evaluation loads the model over the network, so it is affected by network problems. The fix is to download the GPT-2 model to local disk and switch to loading it from there:

    from transformers import AutoTokenizer, AutoModelForCausalLM

    model_dir = "/Users/hongshao/models/gpt2"
    tokenizer = AutoTokenizer.from_pretrained(model_dir, local_files_only=True)
    model = AutoModelForCausalLM.from_pretrained(model_dir, local_files_only=True)

Run the evaluation command with the local path as well:

    # Loading the model from a local path avoids the unstable network
    lm_eval --model hf --model_args pretrained=/Users/hongshao/models/gpt2 --tasks hellaswag --device cpu --batch_size 4 --output_path results.json

Common issue 3: timed out thrown while requesting HEAD https://huggingface.co/datasets/Rowan/hellaswag/resolve/main/README.md Retrying in 1s [Retry 1/5].

Cause: the model has finished loading, but lm-evaluation-harness is still trying to download the benchmark dataset from the Hugging Face Hub. The hellaswag benchmark dataset is not local by default and has to be fetched over the network. If your network is unstable or the site is blocked, the request times out.

Solution:

1. Open the HellaSwag dataset page: https://huggingface.co/datasets/Rowan/hellaswag

2. Click "Files and versions" and download the files to /Users/hongshao/dataset/ (the later steps assume they end up under /Users/hongshao/dataset/hellaswag). From here on the dataset has to be wired in through code and configuration, because lm-evaluation-harness has no CLI parameter for loading a local evaluation dataset.

3. Handle the differences in field layout.

Fields in the original hellaswag dataset:

    {
      "activity_label": "Removing ice from car",
      "ctx_a": "Then, the man writes over the snow...",
      "ctx_b": "then",
      "endings": ["option1", "option2", "option3", "option4"],
      "label": "3"   # a string
    }

Fields that lm-eval needs:

    {
      "query": "Removing ice from car: Then, the man writes...",   # must be concatenated
      "choices": ["option1", "option2", "option3", "option4"],
      "gold": 3   # must be an integer
    }

4. Run the evaluation.

4.1 Create the local YAML config file /Users/hongshao/dataset/tasks/hellaswag_local.yaml:

    task: hellaswag_local
    dataset_path: /Users/hongshao/dataset/hellaswag
    dataset_name: null
    output_type: multiple_choice
    training_split: null
    validation_split: validation
    test_split: null
    process_docs: !function utils.process_docs
    doc_to_text: "{{query}}"
    doc_to_target: "{{gold}}"
    doc_to_choice: "choices"
    metric_list:
      - metric: acc
        aggregation: mean
        higher_is_better: true
      - metric: acc_norm
        aggregation: mean
        higher_is_better: true
    metadata:
      version: 1.0

4.2 Create the local utils file /Users/hongshao/dataset/tasks/utils.py. (The same thing can be done in pure YAML; see the supplement further down.)

    import re


    def preprocess(text):
        """Clean up bracket artifacts and doubled spaces in HellaSwag text."""
        text = text.strip()
        text = text.replace(" [title]", ". ")
        text = re.sub(r"\[.*?\]", "", text)
        text = text.replace("  ", " ")
        return text


    def process_docs(dataset):
        def _process_doc(doc):
            # Join the two context fragments into one sentence prefix
            ctx = doc["ctx_a"] + " " + doc["ctx_b"].capitalize()
            # The label is stored as a string; fall back to 0 if it cannot be parsed
            label = doc.get("label", 0)
            try:
                gold = int(label)
            except (ValueError, TypeError):
                gold = 0
            out_doc = {
                "query": preprocess(doc["activity_label"] + ": " + ctx),
                "choices": [preprocess(ending) for ending in doc["endings"]],
                "gold": gold,
            }
            return out_doc

        return dataset.map(_process_doc)

The process_docs function does three things:

1. Field concatenation: joins activity_label, ctx_a and ctx_b into a complete query.
2. Type conversion: turns label from the string "3" into the integer 3.
3. Text cleaning: preprocess removes extra spaces and bracket artifacts.

Run the following inside the virtual environment:

    HF_ENDPOINT=https://hf-mirror.com lm-eval run \
      --model hf \
      --model_args pretrained=/Users/hongshao/models/gpt2 \
      --tasks hellaswag_local \
      --include_path /Users/hongshao/dataset/tasks \
      --device cpu \
      --batch_size 4 \
      --output_path /Users/hongshao/results.json

Now just wait for the results.

Supplement: YAML only, without a Python utils.py file

    task: hellaswag_simple
    dataset_path: /Users/hongshao/dataset/hellaswag
    dataset_name: null
    output_type: multiple_choice
    validation_split: validation
    doc_to_text: "{{activity_label}}: {{ctx_a}} {{ctx_b | capitalize}}"
    doc_to_target: "{{label | int}}"
    doc_to_choice: "{{endings}}"
    metric_list:
      - metric: acc
        aggregation: mean
        higher_is_better: true
    metadata:
      version: 1.0

Note that this variant skips the preprocess text cleaning and reports only acc.

Viewing the contents of a .parquet file

1) Use Python pandas (the simplest way):

    source venv/bin/activate
    python -c "
    import pandas as pd
    df = pd.read_parquet('/Users/hongshao/dataset/hellaswag/data/validation-00000-of-00001.parquet')
    print(df.head(2))    # print the first 2 rows
    print(df.columns)    # print the column names
    print(df.shape)      # print the shape
    "

2) Use the datasets library that is installed alongside lm-eval:

    source venv/bin/activate
    python -c "
    from datasets import load_dataset
    ds = load_dataset('/Users/hongshao/dataset/hellaswag', split='validation')
    print(ds.features)   # inspect the fields
    print(ds[0])         # inspect the first record
    "

Output:

    Field definitions
    {'ind': Value('int32'), 'activity_label': Value('string'), 'ctx_a': Value('string'), 'ctx_b': Value('string'), 'ctx': Value('string'), 'endings': List(Value('string')), 'source_id': Value('string'), 'split': Value('string'), 'split_type': Value('string'), 'label': Value('string')}

    First record
    ind: 24
    activity_label: Roof shingle removal
    ctx_a: A man is sitting on a roof.
    ctx_b: he
    ctx: A man is sitting on a roof. he
    endings: ['is using wrap to wrap a pair of skis.', 'is ripping level tiles off.', 'is holding a rubiks cube.', 'starts pulling up roofing on a roof.']
    source_id: activitynet~v_-JhWjGDPHMY
    split: val
    split_type: indomain
    label: 3
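Reading the reported numbers: the acc and stderr values in the section 5 results can be sanity-checked with a few lines of Python. The sketch below is not part of lm-evaluation-harness. The helper names are hypothetical, and the 95% interval uses the usual normal approximation (value ± 1.96 × stderr), which is an assumption made here for illustration. It also checks that the reported standard error is what the binomial formula gives for 10042 samples, and that the 40168 loglikelihood requests in the log equal 10042 documents times 4 endings.

```python
import math

# Values copied from the results JSON shown in section 5
results = {
    "sample_len": 10042,
    "acc,none": 0.2891854212308305,
    "acc_stderr,none": 0.004524575892953094,
    "acc_norm,none": 0.31139215295757816,
    "acc_norm_stderr,none": 0.004621163476949437,
}


def confidence_interval(value, stderr, z=1.96):
    """Normal-approximation interval (hypothetical helper, z=1.96 for ~95%)."""
    return value - z * stderr, value + z * stderr


def binomial_stderr(p, n):
    """Standard error of a mean of n 0/1 outcomes, using the sample (n - 1) variance."""
    return math.sqrt(p * (1.0 - p) / (n - 1))


for metric in ("acc", "acc_norm"):
    value = results[f"{metric},none"]
    stderr = results[f"{metric}_stderr,none"]
    low, high = confidence_interval(value, stderr)
    print(f"{metric}: {value:.2%} (approx. 95% CI {low:.2%} to {high:.2%})")

# The reported stderr should be close to the binomial formula
recomputed = binomial_stderr(results["acc,none"], results["sample_len"])
print(f"recomputed acc stderr: {recomputed:.6f}")

# One loglikelihood request per (document, ending) pair
total_requests = results["sample_len"] * 4
print(f"loglikelihood requests: {total_requests}")
```

With a standard error of about 0.45 percentage points, differences of a fraction of a point between two runs on this benchmark are within noise.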
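On the difference between acc and acc_norm in the results table: for a multiple_choice task, the harness scores each ending by the log-likelihood the model assigns to it. As far as I understand it, acc picks the ending with the highest raw log-likelihood, while acc_norm first divides each log-likelihood by the length of the ending so that longer endings are not penalized. The sketch below illustrates that idea only. The log-likelihood values are invented, the function name is hypothetical, and it is not the harness's own code.

```python
def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


# Endings and gold label from the first validation record shown above
endings = [
    "is using wrap to wrap a pair of skis.",
    "is ripping level tiles off.",
    "is holding a rubiks cube.",
    "starts pulling up roofing on a roof.",
]
gold = 3

# Invented log-likelihoods, one per ending (NOT real model output)
loglikelihoods = [-40.0, -22.0, -26.0, -25.0]

# acc-style choice: highest raw log-likelihood
raw_choice = argmax(loglikelihoods)

# acc_norm-style choice: log-likelihood divided by ending length in characters
normalized = [ll / len(ending) for ll, ending in zip(loglikelihoods, endings)]
norm_choice = argmax(normalized)

acc = 1.0 if raw_choice == gold else 0.0
acc_norm = 1.0 if norm_choice == gold else 0.0
print(f"raw choice: {raw_choice}, normalized choice: {norm_choice}")
print(f"acc: {acc}, acc_norm: {acc_norm}")
```

In this made-up case the short second ending wins on raw score, but the longer correct ending wins after normalization, which is the kind of effect that can make acc_norm (31.14%) come out higher than acc (28.92%).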
