
Using Large Language Model Disagreement to Optimize NLP Annotation

In this blogpost I’d like to talk about large language models. There’s a bunch of hype, sure, but there’s also an opportunity to revisit one of my favourite machine learning techniques: disagreement. This is the most basic version of the idea.

The setup

Let’s say that you’re interested in running an NLP model. You have text as input and you’d like to emit some structured information from it: things like named entities, categories, spans … that sort of thing. You could try to leverage a large language model, armed with a prompt, to fetch this information. It might work, but there’s a fair amount of evidence that you might be better off training a custom model in the long run, especially if you’re in a specific domain. Not to mention the costs that might be involved with running an LLM, the latency involved, or the practical constraints of working with a text-to-text system. So instead of fully relying on a large language model, how might we use it effectively in existing pipelines?

The trick

Suppose that we have an NLP pipeline locally. Let’s also assume that we have a large set of unlabelled data that we’d like to annotate to improve said pipeline. Then it would be nice to have a trick that allows us to look at the subset of data that will most likely improve the model. We could try an active learning approach, where we use uncertainty estimates of the pipeline to find relevant candidates … but with LLMs there’s another trick you could use. You could run the LLM and your own pipeline against all of the unlabelled data. You can do this on a large batch of examples with no human input required. Then, after the fact, we can look at the examples where the two models disagree and prioritise these for annotation.
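The prioritisation step can be sketched in a few lines. This is a minimal illustration of the idea, not code from the post: I assume each model’s output is a list of (start, end, label) entity tuples per example, and score disagreement with a simple Jaccard-style overlap.

```python
# Minimal sketch of the disagreement trick. Assumes each model's output is
# a list of (start, end, label) entity tuples per example; the helper names
# here are hypothetical, not from the original post.

def disagreement(spans_a, spans_b):
    """Score how much two sets of predicted spans differ (0 = identical)."""
    a, b = set(spans_a), set(spans_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)  # 1 minus the Jaccard overlap

def prioritise(texts, preds_local, preds_llm):
    """Order unlabelled examples so the biggest disagreements come first."""
    scored = zip(texts, preds_local, preds_llm)
    return sorted(scored, key=lambda t: disagreement(t[1], t[2]), reverse=True)

texts = ["Meta Plans New Data Centre For Europe", "The weather was mild."]
local = [[], []]                      # the local pipeline misses "Meta"
llm = [[(0, 4, "ORGANISATION")], []]  # the LLM catches it -> disagreement
ranked = prioritise(texts, local, llm)
print(ranked[0][0])  # the headline the models disagree on comes out on top
```

Examples where both models emit the same spans score zero and sink to the bottom of the queue, which is exactly the "less interesting" bucket described above.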
The “cool trick” here is that, given a good prompt, we’re likely to see some examples where our original pipeline made a mistake. And it’s these examples that can now get prioritised early on in the annotation process.

The hypothesis is that the examples where both models agree are “less interesting”. These examples might confirm the model’s beliefs, but the examples where disagreement occurs might be more impactful when it comes to actually making an update.

It’s like active learning, but based on the difference between two models instead of the confidence of a single one. It’s a trick that I can see working especially well in the early parts of an annotation project. The local model will benefit from the extra annotations, but if you see the LLM make the same kind of mistake over and over … it might also inspire an improvement to your prompt.

Another big benefit of this approach is that the human remains in the loop. Large language models are amazing feats of technology, but they’re certainly at risk of generating harmful text. By keeping a human in the loop, you reduce the risk of these issues seeping into your own custom model.

Brief demo

To demonstrate this trick, I figured I’d put together a demo. For this demo I’ll be leveraging data from the Guardian’s content API. This gives me access to a stream of news texts from which, for demo purposes, I’ll try to extract organisation and person entities. I wrote some code to help me do this in Prodigy.
And thanks to the recent release of spacy-llm, it’s been easy to integrate this with LLMs. What follows is a setup for OpenAI.

First, you’ll want to define a config.cfg file that sets up a spaCy pipeline for NER using OpenAI as the large language model. I’m going with OpenAI in this example, but you can also configure one of the alternative LLM providers.

[nlp]
lang = "en"
pipeline = ["ner"]

[components]

[components.ner]
factory = "llm"

[components.ner.task]
@llm_tasks = "spacy.NER.v1"
labels = "ORGANISATION,PERSON,DATE"

[components.ner.backend]
@llm_backends = "spacy.REST.v1"
api = "OpenAI"
config = {"model": "text-davinci-003", "temperature": 0.3}

Next, you can load this configuration and immediately save the spaCy model to disk.

dotenv run -- spacy assemble config.cfg en_openai_llm

This command ensures that we have an en_openai_llm pipeline stored on disk that we can load just like any other spaCy model.

Why are you using dotenv run --? Prefixing the spacy command with dotenv run -- makes sure that the environment variables in my .env file are loaded before the script that follows runs. I find it a convenient way to source environment variables just for a single script.

Next, I’ve made a script that makes predictions with an en_core_web_md model as well as this new en_openai_llm model. The script is set up so that the entity names map to the same value. The spaCy pipeline refers to an organisation via the “ORG” label name while the LLM model uses “ORGANISATION”.
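The label-normalisation step might look something like the sketch below. This is an assumption about how a --ner argument of the form "ORG:ORGANISATION,ORGANISATION,PERSON,DATE" could be interpreted (entries with a colon rename a label, bare entries keep it, and anything unmapped is dropped); the post does not show the contents of cli.py, so the helper names here are my own.

```python
# Hypothetical sketch of the label-mapping logic inside a predict script.
# Not the author's actual cli.py, which is not shown in the post.

def parse_ner_arg(arg):
    """Turn a spec like "ORG:ORGANISATION,PERSON" into {source: target}."""
    mapping = {}
    for entry in arg.split(","):
        if ":" in entry:
            src, dst = entry.split(":")
        else:
            src = dst = entry  # bare entries keep their own name
        mapping[src] = dst
    return mapping

def normalise_spans(spans, mapping):
    """Rename entity labels and drop entities outside the mapping."""
    return [
        {**span, "label": mapping[span["label"]]}
        for span in spans
        if span["label"] in mapping
    ]

mapping = parse_ner_arg("ORG:ORGANISATION,ORGANISATION,PERSON,DATE")
spans = [{"start": 0, "end": 4, "label": "ORG"},
         {"start": 10, "end": 15, "label": "GPE"}]  # GPE is not mapped
print(normalise_spans(spans, mapping))
# -> [{'start': 0, 'end': 4, 'label': 'ORGANISATION'}]
```

With both outputs normalised to the same label set, a span-level comparison between the two models becomes a straightforward equality check.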
This was a bit of custom code to write, but not a huge hurdle.

python cli.py predict en_core_web_md examples.jsonl out-spacy.jsonl \
    --ner ORG:ORGANISATION,ORGANISATION,PERSON,DATE --annot-name spacy-md
python cli.py predict en_openai_llm examples.jsonl out-llm.jsonl \
    --ner ORG:ORGANISATION,ORGANISATION,PERSON,DATE --annot-name llm

Contents of cli.py (a script with the prediction logic: it runs a spaCy model over the text stream, extracts entities, and attaches the annotator information).

Finally, you can upload these two files to Prodigy and start the review recipe to see where the models disagree.

# Load the data into Prodigy
python -m prodigy db-in guardian out-spacy.jsonl
python -m prodigy db-in guardian out-llm.jsonl

# Start the review recipe
python -m prodigy review reviewed guardian --view-id ner_manual --label ORGANISATION,PERSON,DATE

Here are some examples of the two models disagreeing.

Example 1

This example is interesting because both models are wrong. It’s also an example where each word is capitalised, which is likely confusing the models. I can also imagine that the spaCy model was trained on a corpus that’s unaware of “Meta” as a company, which isn’t helping.

Example 2

Once again the spaCy model has trouble with some company names, but note that the large language model struggles with dates too. These examples used a zero-shot prompt that only mentions the label names, so few-shot learning, or a more elaborate description of the labels, might help.

Example 3

This is an example where both models agreed, which also happens. Examples like this are easy to skip, or possibly even auto-accept.

The Gist

I’ve found that going through examples like this really helps me appreciate the difference between the LLM approach and the pretrained model.
Every time I look at the disagreement between them, I’m usually inspired to write a better prompt for the LLM while I also gather more training data that improves the local pipeline. It really seems like a nice evolution of the “disagreement between models” trick that I’ve already enjoyed using for so long. It remains just a “trick”, not an everything-solving mega-technique, but having a new trick that’s easily mouldable to a lot of situations is still very nice.

New ways to iterate on data and models

These LLM techniques to help annotate excite me. They offer new ways to kickstart NLP projects while still getting the best of both worlds. Large language models offer a lot of flexibility and are easy to configure, but they are typically very heavy. With disagreement techniques they can become an aid to quickly create training data for a much more lightweight (and therefore more deployable) model for a specific use-case.

While this blogpost highlights a technique for named entities, it’s good to know that there are other use-cases for large language models too! You can find the new OpenAI features for Prodigy here, which also lists recipes for text classification. There’s also a very interesting recipe for terms which allows you to generate relevant terms that can be re-used for weak supervision modelling.
There are even recipes that allow you to do prompt engineering, which can help you write better prompts for your language models.

A caveat on performance

You may have noticed that I’m using en_core_web_md which, on paper at least, isn’t as performant as en_core_web_lg or en_core_web_trf. You may also observe that I’m running OpenAI in a zero-shot manner, and the predictions would likely improve if I did some few-shot tricks. I could also have chosen to add a pretrained Hugging Face model to the mix. These are all fair observations, and it makes sense to consider all of this in a real-life scenario. But the main point I’m trying to make here is that with very low effort we now have tools at our disposal to “pull off the disagreement trick” in a short amount of time. That is very new and exciting, and something that can really shave off some time when you’re getting started with a new project.
