1.2 Kaggle in Plain Language: Eedi Competition Transformer-Framework Solution 02 - Generating Missing Training-Set Data with GPT_4o

news 2025/11/21 6:26:33

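- 0. Competition summary table for this column
- 1. Purpose of this article
- 2. AI engineering architecture
- 3. Data preprocessing module
  - 3.1 Configure the data path and processing parameters
  - 3.2 Configure API parameters
  - 3.3 Configure output paths
- 4. AI parallel processing module
  - 4.1 Define the LLM client class
  - 4.2 Define the row-processing function
  - 4.3 Define the JSON save function
  - 4.4 Define the range-slicing function
  - 4.5 Define the slice-pairing function
  - 4.6 Define the filename sort function
- 5. Data integration module
  - 5.1 Load the data and generate slices
  - 5.2 Initialize the LLM client and test it
  - 5.3 Generate data in parallel
  - 5.4 Merge the results
  - 5.5 Save the final result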

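0. Competition summary table for this column

Kaggle competition summary

1. Purpose of this article

In plain language: the data exploration in the previous article showed that some training rows are missing their misconception explanations. This article fills those gaps by calling GPT_4o with a persona-style system prompt.
Skills covered: calling an AI model through an API, a worked example of persona prompt engineering, and batch data processing with on-disk caching.
Previous article: Eedi LLM distillation solution 01 - competition overview and data understanding

2. AI engineering architecture

3. Data preprocessing module

3.1 Configure the data path and processing parameters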

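import pandas as pd

data_path = "~/work/eedi_synthetic_data/MalAlgoQA_format.csv"
df = pd.read_csv(data_path)  # loaded here because index_end depends on len(df)
index_start = 0
index_end = len(df)
step = 100        # rows per cached batch
max_workers = 2   # number of parallel workers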

3.2 Configure API parameters

model_config = dict(
    openai_api_base="https://testshellapi.kimi.asia/v1",
    api_key="****",
    model="gpt-4o",
    # system prompt
    default_system_prompt="""##Task
You are a Mathematics teacher. Your task is to reason and identify the ConstructName and SubjectName and then the misconception behind the user input Incorrect Answers with the Question.
ConstructName is Most granular level of knowledge related to question, appears to describe the specific mathematical method or procedure used to solve the question. It explains the technique or approach needed to reach the answer.
SubjectName is More general context than the construct, represents the broader mathematical topic or category that the question belongs to.
Misconceptions are a mistake in conceptual understanding and they have relations with all the applications of those concepts. For example, a single misconception on the connections among proportional relationships (part/whole, part/part, whole/part) can cause problems in identifying those patterns in drawings and can be the cause of failing to realize all parts must be of equal size, therefore associating the denominator of the fraction with the total number of parts regardless their size.
Answer concisely what misconception it is to lead to getting the incorrect answer.
Do not use "The misconception is" to start your answers.
Do not mention the concrete details of the question or answers.

##User input
Question: The question text
A: multiple choice answer A text
B: multiple choice answer B text
C: multiple choice answer C text
D: multiple choice answer D text
Correct Answer: The correct answer text

##You should answer in the following JSON format
{
"ConstructName": "here writes the constructName",
"SubjectName": "here writes the SubjectName",
"MisconceptionAName": "here writes the answer A's misconception.",
"MisconceptionBName": "here writes the answer B's misconception.",
"MisconceptionCName": "here writes the answer C's misconception.",
"MisconceptionDName": "here writes the answer D's misconception."
}
""",
    default_temperature=0.5,
    max_tokens=256,
)
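The system prompt asks for a JSON object with six fixed keys. A minimal sketch of checking a reply against that contract follows. `EXPECTED_KEYS`, `parse_reply`, and the sample reply are hypothetical illustrations and not part of the original solution.

```python
import json

# Keys requested by the JSON template in the system prompt.
EXPECTED_KEYS = [
    "ConstructName",
    "SubjectName",
    "MisconceptionAName",
    "MisconceptionBName",
    "MisconceptionCName",
    "MisconceptionDName",
]


def parse_reply(raw_reply):
    """Parse a model reply and return (record, missing_keys).

    Keys the model left out are filled with None so that every record
    ends up with the same columns.
    """
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    missing = [k for k in EXPECTED_KEYS if k not in data]
    record = {k: data.get(k) for k in EXPECTED_KEYS}
    return record, missing


# Made-up reply that only contains two of the six keys.
sample_reply = '{"ConstructName": "Add fractions", "SubjectName": "Fractions"}'
record, missing = parse_reply(sample_reply)
print(missing)
```

Listing the missing keys per row makes it easy to spot replies that were cut short, which can happen when `max_tokens` is small relative to the six fields requested.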

3.3 Configure output paths

import os

# Per-batch results are cached here so an interrupted run can resume.
cache_folder = f"./cache_{model_config['model']}_model_misconceptions_result"
os.makedirs(cache_folder, exist_ok=True)
output_data_path = f"misconception_data_{os.path.splitext(os.path.basename(data_path))[0]}_{model_config['model']}.csv"

4. AI parallel processing module

4.1 Define the LLM client class

from openai import OpenAI


class LLMChat:
    def __init__(self, openai_api_base, api_key, model, default_temperature, default_system_prompt, max_tokens=512):
        self.client = OpenAI(api_key=api_key, base_url=openai_api_base)
        self.model = model
        self.default_temperature = default_temperature
        self.default_system_prompt = default_system_prompt
        self.max_tokens = max_tokens

    def chat(self, user_prompt, system_prompt=None, temperature=None):
        if not system_prompt:
            system_prompt = self.default_system_prompt
        # `is None` so that an explicit temperature of 0 is not overridden.
        if temperature is None:
            temperature = self.default_temperature
        chat_response = self.client.chat.completions.create(
            model=self.model,
            temperature=temperature,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            max_tokens=self.max_tokens,
            response_format={"type": "json_object"},  # ask for a JSON object reply
        )
        return chat_response.choices[0].message.content
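Remote API calls can fail transiently, and the original code makes a single attempt per row. One possible addition, which is not part of the original solution, is a small retry wrapper with exponential backoff. `call_with_retry` and `flaky_chat` below are hypothetical names, and `flaky_chat` stands in for `LLMChat.chat` so the sketch runs without network access.

```python
import time


def call_with_retry(func, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(*args); after a failure wait base_delay * 2**attempt and retry.

    The last failure is re-raised once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return func(*args)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Stand-in for LLMChat.chat that fails twice before succeeding.
calls = {"n": 0}


def flaky_chat(prompt):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return '{"ok": true}'


delays = []  # record the waits instead of actually sleeping
reply = call_with_retry(flaky_chat, "demo prompt", sleep=delays.append)
print(reply, delays)
```

In the real pipeline the call would look like `call_with_retry(vc.chat, input_user_prompt)`, with the default `time.sleep` left in place.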

4.2 Define the row-processing function

import json


def process_row(args, debug=False):
    user_prompt = """Question: {question}
A: {answer_a}
B: {answer_b}
C: {answer_c}
D: {answer_d}
Correct Answer: {correct_answer}
"""
    index, row = args
    ca = row["CorrectAnswer"]
    correctanswer = row[f"Answer{ca}Text"]
    input_user_prompt = user_prompt.format(
        question=row["QuestionText"],
        answer_a=row["AnswerAText"],
        answer_b=row["AnswerBText"],
        answer_c=row["AnswerCText"],
        answer_d=row["AnswerDText"],
        correct_answer=correctanswer,
    )
    try:
        # `vc` is the global LLMChat client created in 5.2.
        ret_data = vc.chat(input_user_prompt)
    except Exception as e:
        print(f"An exception occurred: {e}")
        # Return a JSON string on failure too, so every cached entry
        # has the same type when it is decoded in 5.4.
        ret_data = json.dumps({"error": str(e)})
    if debug:
        print("system: ", model_config["default_system_prompt"])
        print(">" * 50)
        print("user_input: ", input_user_prompt)
        print(">" * 50)
        print("assistant: ", ret_data)
    return ret_data

4.3 Define the JSON save function

def save_json(fn, obj):
    with open(fn, 'w') as f:
        json.dump(obj, f, ensure_ascii=False, indent=4)
    print(f"save file to {fn}")

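4.4 Define the range-slicing function

def slice_range(start, end, step):
    """Return batch boundaries from start to end (inclusive) in increments of step."""
    if step <= 0:
        raise ValueError("step must be greater than 0")
    result = []
    while start <= end:
        result.append(start)
        start += step
    # Make sure the last boundary is `end`; `result` is empty when start > end.
    if result and result[-1] < end:
        result.append(end)
    return result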

4.5 Define the slice-pairing function

def process_pairs(sliced_range):
    # Turn consecutive boundaries into [first, second] index pairs.
    slices = []
    for first, second in zip(sliced_range, sliced_range[1:]):
        slices.append([first, second])
    return slices
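To see how the two helpers from 4.4 and 4.5 fit together, the sketch below restates them in compact form and runs them on a made-up dataset of 250 rows with a step of 100. The numbers are illustrative only.

```python
def slice_range(start, end, step):
    """Boundaries from start to end (inclusive) in increments of step."""
    if step <= 0:
        raise ValueError("step must be greater than 0")
    result = []
    while start <= end:
        result.append(start)
        start += step
    if result and result[-1] < end:
        result.append(end)
    return result


def process_pairs(sliced_range):
    """Pair consecutive boundaries into [first, second] slices."""
    return [[a, b] for a, b in zip(sliced_range, sliced_range[1:])]


boundaries = slice_range(0, 250, 100)
batches = process_pairs(boundaries)
print(boundaries)  # [0, 100, 200, 250]
print(batches)     # [[0, 100], [100, 200], [200, 250]]
```

Each pair is used later as `df.iloc[first:second]`, so the second value is exclusive and the batches do not overlap.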

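4.6 Define the filename sort function

import re


def natural_sort_key(filename):
    # Sort by the numbers embedded in the name, compared as integers.
    parts = re.findall(r'\d+', filename)
    return tuple(map(int, parts))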
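The reason for a numeric key is that plain string sorting would merge the cache files in the wrong order once the batch start index has more digits. A short illustration with made-up filenames:

```python
import re


def natural_sort_key(filename):
    # Extract every run of digits and compare them as integers.
    return tuple(map(int, re.findall(r"\d+", filename)))


names = ["cache_res_1000.json", "cache_res_200.json", "cache_res_0.json"]
lexicographic = sorted(names)
natural = sorted(names, key=natural_sort_key)
print(lexicographic)  # ['cache_res_0.json', 'cache_res_1000.json', 'cache_res_200.json']
print(natural)        # ['cache_res_0.json', 'cache_res_200.json', 'cache_res_1000.json']
```

Because the merged results are later joined to the DataFrame by position, the order of the cache files has to match the order of the rows.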

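5. Data integration module

5.1 Load the data and generate slices

import pandas as pd

df = pd.read_csv(data_path)
df.head()
# List of [start, end) row-index pairs, one per cached batch.
sliced_range = process_pairs(slice_range(index_start, index_end, step))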

Inspecting df: (the screenshot of the df.head() output is not preserved)

5.2 Initialize the LLM client and test it

vc = LLMChat(**model_config)
r = process_row((7, df.iloc[7]), debug=True)

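5.3 Generate data in parallel

from concurrent.futures import ProcessPoolExecutor

from tqdm import tqdm

for slices in tqdm(sliced_range, total=len(sliced_range)):
    output_filepath = f'{cache_folder}/cache_res_{slices[0]}.json'
    # Resume support: a batch that already has a cache file is skipped.
    if os.path.exists(output_filepath):
        print(f'cache file exists, skip {output_filepath}')
        continue
    df_tasks = df.iloc[slices[0]:slices[1]]
    # process_row reads the global `vc`, so this relies on worker processes
    # inheriting it, which holds with the fork start method.
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        results = list(tqdm(executor.map(process_row, df_tasks.iterrows()), total=len(df_tasks)))
    save_json(output_filepath, results)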
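Since the work in each batch is dominated by waiting on network requests, a thread pool is a possible alternative to the process pool used above, and threads share the client object without any pickling. The sketch below is an assumption-laden variant and not the original author's code. `run_batches` and `fake_worker` are hypothetical names, and `fake_worker` replaces `process_row` so the example runs offline.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def run_batches(rows, batches, worker, cache_dir, max_workers=2):
    """Process each [start, end) batch once, skipping batches already cached.

    Returns the start indices of the batches processed in this call.
    """
    processed = []
    for start, end in batches:
        path = os.path.join(cache_dir, f"cache_res_{start}.json")
        if os.path.exists(path):
            continue  # finished in an earlier run
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            # executor.map returns results in input order.
            results = list(executor.map(worker, rows[start:end]))
        with open(path, "w", encoding="utf-8") as f:
            json.dump(results, f, ensure_ascii=False)
        processed.append(start)
    return processed


def fake_worker(row):
    # Stand-in for process_row; a real worker would call the API here.
    return json.dumps({"echo": row})


rows = [f"q{i}" for i in range(5)]
batches = [[0, 2], [2, 4], [4, 5]]
with tempfile.TemporaryDirectory() as cache_dir:
    first_run = run_batches(rows, batches, fake_worker, cache_dir)
    second_run = run_batches(rows, batches, fake_worker, cache_dir)
print(first_run, second_run)
```

The second call processes nothing because every batch already has a cache file, which is the same resume behavior the loop in this section provides.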

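5.4 Merge the results

import glob
import json

f_names = glob.glob(f'{cache_folder}/*.json')
# Sort numerically so batches are merged in row order.
f_names = sorted(f_names, key=natural_sort_key)
results = []
for fn in f_names:
    with open(fn, 'r') as f:
        batch_results = json.load(f)
    results.extend(batch_results)
l = len(results)
# Each cached entry is normally the JSON string returned by the model;
# entries that are already dicts are kept as they are.
results = [json.loads(r) if isinstance(r, str) else r for r in results]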
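A single truncated or malformed reply would make `json.loads` raise and stop the merge. A more tolerant decoding step could look like the sketch below. `decode_results` is a hypothetical helper and the sample entries are made up.

```python
import json


def decode_results(raw_results):
    """Decode cached entries one by one, keeping their positions.

    Returns (decoded, failed), where failed lists the indices of entries
    that could not be decoded or that carry an "error" field.
    """
    decoded, failed = [], []
    for i, raw in enumerate(raw_results):
        try:
            item = json.loads(raw) if isinstance(raw, str) else raw
            if not isinstance(item, dict):
                raise ValueError("reply is not a JSON object")
        except ValueError as e:  # json.JSONDecodeError is a ValueError subclass
            item = {"error": str(e)}
        if "error" in item:
            failed.append(i)
        decoded.append(item)
    return decoded, failed


raw = ['{"SubjectName": "Algebra"}', '{"SubjectName": "Geo', {"error": "timeout"}]
decoded, failed = decode_results(raw)
print(len(decoded), failed)
```

Keeping one output entry per input entry matters because the merged list is joined to the original rows by position in the next step. The indices in `failed` can then be retried or inspected by hand.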

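5.5 Save the final result

# Keep only the rows that were processed, then append the generated
# columns side by side; both frames are aligned on a fresh 0..l-1 index.
df = df.iloc[:l].reset_index(drop=True)
gen_df = pd.DataFrame(results)
df = pd.concat([df, gen_df], axis=1)
df.to_csv(output_data_path, index=False)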

(To be continued)
