Chapter 7: Instruction Fine-Tuning Study Notes (5): Extracting and saving responses

article 2026/5/23 13:18:19

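7.7 Extracting and saving responses

After fine-tuning the LLM on the training portion of the instruction dataset, we now evaluate its performance on the held-out test set. First, we extract the model-generated response for each input in the test set and collect the responses for manual analysis. Then we evaluate the LLM with the method shown in figure 7.18 to quantify the quality of the responses.

1. Test-set instruction responses

To complete the response instruction step, we use the generate function. We then print the model responses next to the expected test-set answers for the first three test-set entries, so the two can be compared side by side:

    torch.manual_seed(123)

    for entry in test_data[:3]:
        input_text = format_input(entry)

        token_ids = generate(
            model=model,
            idx=text_to_token_ids(input_text, tokenizer).to(device),
            max_new_tokens=256,
            context_size=BASE_CONFIG["context_length"],
            eos_id=50256
        )
        generated_text = token_ids_to_text(token_ids, tokenizer)

        # Drop the prompt and the "### Response:" marker, keep only the answer
        response_text = (
            generated_text[len(input_text):]
            .replace("### Response:", "")
            .strip()
        )

        print(input_text)
        print(f"\nCorrect response:\n{entry['output']}")
        print(f"\nModel response:\n{response_text.strip()}")
        print("-------------------------------------")

As noted earlier, the generate function returns the input text and the output text combined, so we slice generated_text and call .replace() to extract only the model's response.

Result:

    Below is an instruction that describes a task. Write a response that appropriately completes the request.

    ### Instruction:
    Rewrite the sentence using a simile.

    ### Input:
    The car is very fast.

    Correct response:
    The car is as fast as lightning.

    Model response:
    The car is as fast as a cheetah.
    -------------------------------------
    Below is an instruction that describes a task. Write a response that appropriately completes the request.

    ### Instruction:
    What type of cloud is typically associated with thunderstorms?

    Correct response:
    The type of cloud typically associated with thunderstorms is cumulonimbus.

    Model response:
    The type of cloud associated with thunderstorms is a cumulus cloud.
    -------------------------------------
    Below is an instruction that describes a task. Write a response that appropriately completes the request.

    ### Instruction:
    Name the author of 'Pride and Prejudice'.

    Correct response:
    Jane Austen.

    Model response:
    The author of 'Pride and Prejudice' is Jane Austen.
    -------------------------------------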
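The slicing-and-replace step can be looked at in isolation. The sketch below is only an illustration: extract_response is a hypothetical helper name that does not appear in the notes, and it assumes the decoded text begins with exactly the prompt string, which is what the slice by len(input_text) relies on.

```python
def extract_response(generated_text: str, prompt: str,
                     marker: str = "### Response:") -> str:
    """Return only the model's answer from a prompt + completion string.

    Assumes generated_text starts with prompt, character for character.
    """
    completion = generated_text[len(prompt):]      # cut off the echoed prompt
    return completion.replace(marker, "").strip()  # remove marker and whitespace


# Hand-written stand-in for what generate + token_ids_to_text would return
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
    "\n\n### Instruction:\nName the author of 'Pride and Prejudice'."
)
generated = prompt + "\n\n### Response:\nJane Austen."

print(extract_response(generated, prompt))  # prints: Jane Austen.
```

If the tokenizer round trip ever changed the prompt text (for example through whitespace normalization), the slice would cut at the wrong position, so checking generated_text.startswith(prompt) first would be a reasonable safeguard.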
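The three test-set responses show that the model performs relatively well. The answers to the first and last instructions are clearly correct, while the second answer is close but not fully accurate: the model answered "cumulus cloud" instead of "cumulonimbus". It is worth noting, though, that cumulus clouds can develop into cumulonimbus clouds, which are capable of producing thunderstorms.

Most importantly, model evaluation is not as straightforward as it is for classification fine-tuning, where we only need to compute the fraction of correctly classified spam/non-spam labels to obtain the classification accuracy.

2. Model evaluation

In practice, instruction-fine-tuned large language models (LLMs) are evaluated with several approaches:

(1) Short-answer and multiple-choice benchmarks, such as MMLU (Measuring Massive Multitask Language Understanding, https://arxiv.org/abs/2009.03300), which test a model's general knowledge.
(2) Human preference comparisons against other LLMs, such as the LMSYS chatbot arena (https://arena.lmsys.org).
(3) Automated conversational benchmarks, in which an LLM such as GPT-4 rates the quality of the responses, for example AlpacaEval (https://tatsu-lab.github.io/alpaca_eval/).

In practice, it is useful to consider all three kinds of evaluation: multiple-choice question answering, human evaluation, and automated metrics that measure conversational performance. However, because our main interest is conversational performance itself rather than the ability to answer multiple-choice questions, human evaluation and automated metrics are likely to be more relevant. Human evaluation is time-consuming, so we use automated evaluation.

3. Automated evaluation

We adopt an approach inspired by AlpacaEval and use another LLM to evaluate the responses of our fine-tuned model. Instead of relying on a public benchmark dataset, however, we use our own custom test set. This customization allows a more targeted and relevant assessment of the model's performance in the intended use cases, that is, the scenarios represented in our instruction dataset.

To prepare the responses for this evaluation, we add the generated model responses to the entries of test_data (a list with one dictionary per test example) and save the updated data to the file "instruction-data-with-response.json" for record keeping. Saving the file also means the responses can be loaded and analyzed later in a separate session.

The following listing uses generate in the same way as before, but this time we iterate over the entire test set. Instead of printing the model responses, we store them in each entry under the key "model_response". Finally, we print one entry to check that the response was added correctly.

    import json
    from tqdm import tqdm

    for i, entry in tqdm(enumerate(test_data), total=len(test_data)):
        input_text = format_input(entry)

        token_ids = generate(
            model=model,
            idx=text_to_token_ids(input_text, tokenizer).to(device),
            max_new_tokens=256,
            context_size=BASE_CONFIG["context_length"],
            eos_id=50256
        )
        generated_text = token_ids_to_text(token_ids, tokenizer)
        response_text = (
            generated_text[len(input_text):]
            .replace("### Response:", "")
            .strip()
        )

        # Store the response next to the instruction and the expected output
        test_data[i]["model_response"] = response_text

    with open("instruction-data-with-response.json", "w") as file:
        json.dump(test_data, file, indent=4)  # indent=4 for readable output

    print(test_data[0])

Finally, we save the model:

    import re

    # Remove spaces and parentheses from the model name to build the file name
    file_name = f"{re.sub(r'[ ()]', '', CHOOSE_MODEL)}-sft.pth"
    torch.save(model.state_dict(), file_name)
    print(f"Model saved as {file_name}")

Summary

The complete code is as follows:

    # Insturction_fine-tuning_pretrained_LLM_5_20
    import json
    import re
    import time

    import matplotlib.pyplot as plt
    import tiktoken
    import torch
    from matplotlib.ticker import MaxNLocator
    from tqdm import tqdm

    from pre_training import calc_loss_loader
    from Download_instruction_dataset5_9 import train_loader, val_loader
    from Training_an_LLM_3_16 import train_model_simple
    from load_pretrained_model5_20 import val_data, test_data, CHOOSE_MODEL
    from load_pretrained_model5_20 import (
        model, generate, text_to_token_ids, token_ids_to_text, BASE_CONFIG
    )

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    torch.manual_seed(123)

    # Initial loss before fine-tuning
    with torch.no_grad():
        train_loss = calc_loss_loader(train_loader, model, device, num_batches=5)
        val_loss = calc_loss_loader(val_loader, model, device, num_batches=5)

    print("Training loss:", train_loss)
    print("Validation loss:", val_loss)


    def format_input(entry):
        instruction_text = (
            f"Below is an instruction that describes a task. "
            f"Write a response that appropriately completes the request."
            f"\n\n### Instruction:\n{entry['instruction']}"
        )
        # The "### Input:" section is only added when the entry has an input
        input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
        return instruction_text + input_text


    # Fine-tune the model
    start_time = time.time()
    torch.manual_seed(123)
    optimizer = torch.optim.AdamW(model.parameters(), lr=0.00005, weight_decay=0.1)
    num_epochs = 2
    tokenizer = tiktoken.get_encoding("gpt2")

    train_losses, val_losses, tokens_seen = train_model_simple(
        model, train_loader, val_loader, optimizer, device,
        num_epochs=num_epochs, eval_freq=5, eval_iter=5,
        start_context=format_input(val_data[0]), tokenizer=tokenizer
    )

    end_time = time.time()
    execution_time_minutes = (end_time - start_time) / 60
    print(f"Training completed in {execution_time_minutes:.2f} minutes.")


    def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses):
        fig, ax1 = plt.subplots(figsize=(5, 3))
        ax1.plot(epochs_seen, train_losses, label="Training loss")
        ax1.plot(epochs_seen, val_losses, linestyle="-.", label="Validation loss")
        ax1.set_xlabel("Epochs")
        ax1.set_ylabel("Loss")
        ax1.legend(loc="upper right")
        ax1.xaxis.set_major_locator(MaxNLocator(integer=True))
        # Second x-axis that shows the number of tokens seen
        ax2 = ax1.twiny()
        ax2.plot(tokens_seen, train_losses, alpha=0)
        ax2.set_xlabel("Tokens seen")
        fig.tight_layout()
        plt.show()


    epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
    plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)

    # 5.22
    # Compare the first three model responses with the expected answers
    torch.manual_seed(123)

    for entry in test_data[:3]:
        input_text = format_input(entry)

        token_ids = generate(
            model=model,
            idx=text_to_token_ids(input_text, tokenizer).to(device),
            max_new_tokens=256,
            context_size=BASE_CONFIG["context_length"],
            eos_id=50256
        )
        generated_text = token_ids_to_text(token_ids, tokenizer)
        response_text = (
            generated_text[len(input_text):]
            .replace("### Response:", "")
            .strip()
        )

        print(input_text)
        print(f"\nCorrect response:\n{entry['output']}")
        print(f"\nModel response:\n{response_text.strip()}")
        print("-------------------------------------")

    # Generate and store responses for the whole test set
    for i, entry in tqdm(enumerate(test_data), total=len(test_data)):
        input_text = format_input(entry)

        token_ids = generate(
            model=model,
            idx=text_to_token_ids(input_text, tokenizer).to(device),
            max_new_tokens=256,
            context_size=BASE_CONFIG["context_length"],
            eos_id=50256
        )
        generated_text = token_ids_to_text(token_ids, tokenizer)
        response_text = (
            generated_text[len(input_text):]
            .replace("### Response:", "")
            .strip()
        )

        test_data[i]["model_response"] = response_text

    with open("instruction-data-with-response.json", "w") as file:
        json.dump(test_data, file, indent=4)

    print(test_data[0])

    # Save the fine-tuned weights
    file_name = f"{re.sub(r'[ ()]', '', CHOOSE_MODEL)}-sft.pth"
    torch.save(model.state_dict(), file_name)
    print(f"Model saved as {file_name}")

This section covered (1) generating responses for the test set, (2) analyzing them manually, and (3) saving the responses in preparation for automated evaluation.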
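Since the saved file is meant to be reloaded in a later session and handed to a judge LLM, a minimal sketch of that follow-up step may help. Everything here is illustrative: the sample entry is hand-written, build_judge_prompt and its prompt wording are assumptions rather than the method used in these notes, the temporary directory only keeps the example self-contained, and the model name "gpt2-medium (355M)" is an assumed value for CHOOSE_MODEL.

```python
import json
import re
import tempfile
from pathlib import Path

# Hand-written sample with the same keys as the saved test entries
sample = [
    {
        "instruction": "Rewrite the sentence using a simile.",
        "input": "The car is very fast.",
        "output": "The car is as fast as lightning.",
        "model_response": "The car is as fast as a cheetah.",
    }
]


def build_judge_prompt(entry):
    """Assemble an illustrative scoring prompt for a judge LLM (assumed wording)."""
    task = entry["instruction"]
    if entry["input"]:
        task += "\n" + entry["input"]
    return (
        f"Given the task `{task}` and the correct output `{entry['output']}`, "
        f"score the model response `{entry['model_response']}` "
        f"on a scale from 0 to 100. Respond with the integer number only."
    )


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "instruction-data-with-response.json"
    path.write_text(json.dumps(sample, indent=4), encoding="utf-8")

    # Later session: reload the saved responses
    with open(path, "r", encoding="utf-8") as file:
        loaded = json.load(file)

# Entries whose response is missing or empty would need to be regenerated
missing = [i for i, e in enumerate(loaded) if not e.get("model_response")]
prompts = [build_judge_prompt(e) for e in loaded]

print(f"{len(loaded)} entries loaded, {len(missing)} without a response")
print(prompts[0])

# Same file-name cleanup as in the save step, applied to the assumed model name
example_name = "gpt2-medium (355M)"
print(f"{re.sub(r'[ ()]', '', example_name)}-sft.pth")
```

The prompts would then be sent to whichever judge model is available; that call is left out because it depends on the chosen service. Checking for missing responses before scoring avoids rating empty strings.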
