
Using Large Language Model Disagreement to Optimize NLP Annotation

In this blogpost I’d like to talk about large language models. There’s a bunch of hype, sure, but there’s also an opportunity to revisit one of my favourite machine learning techniques: disagreement. This is the most basic version of the idea.

The setup

Let’s say that you’re interested in running an NLP model. You have text as input and you’d like to emit some structured information from it: things like named entities, categories, spans … that sort of thing. You could try to leverage a large language model, armed with a prompt, to fetch this information. It might work, but there’s a fair amount of evidence that you might be better off training a custom model in the long run, especially if you’re in a specific domain. Not to mention the costs that might be involved with running an LLM, the latency involved, or the practical constraints of working with a text-to-text system. So instead of fully relying on a large language model, how might we use it effectively in existing pipelines?

The trick

Suppose that we have an NLP pipeline locally. Let’s also assume that we have a large set of unlabelled data that we’d like to annotate to improve said pipeline. Then it would be nice to have a trick that allows us to look at the subset of data that will most likely improve the model. We could try an active learning approach, where we use uncertainty estimates of the pipeline to find relevant candidates … but with LLMs there’s another trick you could use. You could run the LLM and your own pipeline against all of the unlabelled data. You can do this on a large batch of examples with no human input required. Then, after the fact, we can look at the examples where the two models disagree and prioritise these for annotation.
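The prioritisation step can be sketched in a few lines. This is a minimal illustration of the idea, not code from the post: I assume each model’s output is a list of (start, end, label) entity tuples per example, and score disagreement with a simple Jaccard-style overlap.

```python
# Minimal sketch of the disagreement trick. Assumes each model's output is
# a list of (start, end, label) entity tuples per example; the helper names
# here are hypothetical, not from the original post.

def disagreement(spans_a, spans_b):
    """Score how much two sets of predicted spans differ (0 = identical)."""
    a, b = set(spans_a), set(spans_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)  # 1 minus the Jaccard overlap

def prioritise(texts, preds_local, preds_llm):
    """Order unlabelled examples so the biggest disagreements come first."""
    scored = zip(texts, preds_local, preds_llm)
    return sorted(scored, key=lambda t: disagreement(t[1], t[2]), reverse=True)

texts = ["Meta Plans New Data Centre For Europe", "The weather was mild."]
local = [[], []]                      # the local pipeline misses "Meta"
llm = [[(0, 4, "ORGANISATION")], []]  # the LLM catches it -> disagreement
ranked = prioritise(texts, local, llm)
print(ranked[0][0])  # the headline the models disagree on comes out on top
```

Examples where both models emit the same spans score zero and sink to the bottom of the queue, which is exactly the "less interesting" bucket described above.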
The “cool trick” here is that, given a good prompt, we’re likely to see some examples where our original pipeline made a mistake. And it’s these examples that can now get prioritised early on in the annotation process.

The hypothesis is that the examples where both models agree are “less interesting”. These examples might confirm the model’s beliefs, but the examples where disagreement occurs might be more impactful when it comes to actually making an update.

It’s like active learning, but based on the difference between two models instead of the confidence of a single one. It’s a trick that I can see working especially well in the early parts of an annotation project. The local model will benefit from the extra annotations, but if you see the LLM make the same kind of mistake over and over … it might also inspire an improvement to your prompt.

Another big benefit of this approach is that the human remains in the loop. Large language models are amazing feats of technology, but they’re certainly at risk of generating harmful text. By keeping a human in the loop, you reduce the risk of these issues seeping into your own custom model.

Brief demo

To demonstrate this trick, I figured I’d put together a demo. For this demo I’ll be leveraging data from the Guardian’s content API. This gives me access to a stream of news texts from which, for demo purposes, I’ll try to extract organisation and person entities. I wrote some code to help me do this in Prodigy.
And thanks to the recent release of spacy-llm, it’s been easy to integrate this with LLMs. What follows is a setup for OpenAI.

First, you’ll want to define a config.cfg file that sets up a spaCy pipeline for NER using OpenAI as the large language model. I’m going with OpenAI in this example, but you can also configure one of the alternative LLM providers.

[nlp]
lang = "en"
pipeline = ["ner"]

[components]

[components.ner]
factory = "llm"

[components.ner.task]
@llm_tasks = "spacy.NER.v1"
labels = "ORGANISATION,PERSON,DATE"

[components.ner.backend]
@llm_backends = "spacy.REST.v1"
api = "OpenAI"
config = {"model": "text-davinci-003", "temperature": 0.3}

Next, you can load this configuration and immediately save the spaCy model to disk.

dotenv run -- spacy assemble config.cfg en_openai_llm

This command ensures that we have an en_openai_llm pipeline stored on disk that we can load just like any other spaCy model.

Why are you using dotenv run --? Prefixing the spacy command with dotenv run -- makes sure that the environment variables in my .env file are loaded before the script that follows runs. I find it a convenient way to source environment variables just for a single script.

Next, I’ve made a script that makes predictions with an en_core_web_md model as well as this new en_openai_llm model. The script is set up so that the entity names map to the same value. The spaCy pipeline refers to an organisation via the “ORG” label name while the LLM model uses “ORGANISATION”.
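The label-normalisation step might look something like the sketch below. This is an assumption about how a --ner argument of the form "ORG:ORGANISATION,ORGANISATION,PERSON,DATE" could be interpreted (entries with a colon rename a label, bare entries keep it, and anything unmapped is dropped); the post does not show the contents of cli.py, so the helper names here are my own.

```python
# Hypothetical sketch of the label-mapping logic inside a predict script.
# Not the author's actual cli.py, which is not shown in the post.

def parse_ner_arg(arg):
    """Turn a spec like "ORG:ORGANISATION,PERSON" into {source: target}."""
    mapping = {}
    for entry in arg.split(","):
        if ":" in entry:
            src, dst = entry.split(":")
        else:
            src = dst = entry  # bare entries keep their own name
        mapping[src] = dst
    return mapping

def normalise_spans(spans, mapping):
    """Rename entity labels and drop entities outside the mapping."""
    return [
        {**span, "label": mapping[span["label"]]}
        for span in spans
        if span["label"] in mapping
    ]

mapping = parse_ner_arg("ORG:ORGANISATION,ORGANISATION,PERSON,DATE")
spans = [{"start": 0, "end": 4, "label": "ORG"},
         {"start": 10, "end": 15, "label": "GPE"}]  # GPE is not mapped
print(normalise_spans(spans, mapping))
# -> [{'start': 0, 'end': 4, 'label': 'ORGANISATION'}]
```

With both outputs normalised to the same label set, a span-level comparison between the two models becomes a straightforward equality check.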
This was a bit of custom code to write, but not a huge hurdle.

python cli.py predict en_core_web_md examples.jsonl out-spacy.jsonl \
    --ner ORG:ORGANISATION,ORGANISATION,PERSON,DATE --annot-name spacy-md
python cli.py predict en_openai_llm examples.jsonl out-llm.jsonl \
    --ner ORG:ORGANISATION,ORGANISATION,PERSON,DATE --annot-name llm

Contents of cli.py (a script with the prediction logic: it runs a spaCy model over the text stream, extracts entities, and attaches the annotator information).

Finally, you can upload these two files to Prodigy and start the review recipe to see where the models disagree.

# Load the data into Prodigy
python -m prodigy db-in guardian out-spacy.jsonl
python -m prodigy db-in guardian out-llm.jsonl

# Start the review recipe
python -m prodigy review reviewed guardian --view-id ner_manual --label ORGANISATION,PERSON,DATE

Here are some examples of the two models disagreeing.

Example 1

This example is interesting because both models are wrong. It’s also an example where each word is capitalised, which is likely confusing the models. I can also imagine that the spaCy model was trained on a corpus that’s unaware of “Meta” as a company, which isn’t helping.

Example 2

Once again the spaCy model has trouble with some company names, but note that the large language model struggles with dates too. These examples used a zero-shot prompt that only mentions the label names, so few-shot learning, or a more elaborate description of the labels, might help.

Example 3

This is an example where both models agreed, which also happens. Examples like this are easy to skip, or possibly even auto-accept.

The Gist

I’ve found that going through examples like this really helps me appreciate the difference between the LLM approach and the pretrained model.
Every time I look at the disagreement between them, I’m usually inspired to write a better prompt for the LLM while I also gather more training data that improves the local pipeline. It really seems like a nice evolution of the “disagreement between models” trick that I’ve already enjoyed using for so long. It remains just a “trick”, not an everything-solving mega-technique, but having a new trick that’s easily mouldable to a lot of situations is still very nice.

New ways to iterate on data and models

These LLM techniques to help annotate excite me. They offer new ways to kickstart NLP projects while still getting the best of both worlds. Large language models offer a lot of flexibility and are easy to configure, but they are typically very heavy. With disagreement techniques they can become an aid to quickly create training data for a much more lightweight (and therefore more deployable) model for a specific use-case.

While this blogpost highlights a technique for named entities, it’s good to know that there are other use-cases for large language models too! You can find the new OpenAI features for Prodigy here, which also lists recipes for text classification. There’s also a very interesting recipe for terms which allows you to generate relevant terms that can be re-used for weak supervision modelling.
There are even recipes that allow you to do prompt engineering, which can help you write better prompts for your language models.

A caveat on performance

You may have noticed that I’m using en_core_web_md which, on paper at least, isn’t as performant as en_core_web_lg or en_core_web_trf. You may also observe that I’m running OpenAI in a zero-shot manner, and the predictions would likely improve if I did some few-shot tricks. I could also have chosen to add a pretrained Hugging Face model to the mix. These are all fair observations, and it makes sense to consider all of this in a real-life scenario. But the main point I’m trying to make here is that with very low effort we now have tools at our disposal to “pull off the disagreement trick” in a short amount of time. That is very new and exciting, and something that can really shave off some time when you’re getting started with a new project.
