
Preprocessing data for NLP

news 2026/2/10 14:29:12

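The main tool for processing text data is the tokenizer. A tokenizer splits text into tokens according to a set of rules. The tokens are then converted to numbers and finally to tensors, which become the model's inputs. The tokenizer also adds any additional inputs the model requires.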
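To make the text-to-tokens-to-numbers pipeline concrete, here is a minimal toy sketch in plain Python. The vocabulary and the functions toy_tokenize and toy_convert_tokens_to_ids are invented for illustration; a real pretrained tokenizer uses its own splitting rules and a much larger vocab.

```python
import re

# Toy vocabulary, made up for this illustration (not a real model's vocab).
TOY_VOCAB = {"[UNK]": 0, "do": 1, "not": 2, "meddle": 3, ",": 4, ".": 5}


def toy_tokenize(text):
    """Lowercase the text and split it into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def toy_convert_tokens_to_ids(tokens):
    """Look each token up in the vocabulary; unknown tokens map to [UNK]."""
    return [TOY_VOCAB.get(token, TOY_VOCAB["[UNK]"]) for token in tokens]


tokens = toy_tokenize("Do not meddle, wizards.")
ids = toy_convert_tokens_to_ids(tokens)
print(tokens)  # ['do', 'not', 'meddle', ',', 'wizards', '.']
print(ids)     # [1, 2, 3, 4, 0, 5]
```

The word "wizards" is not in the toy vocabulary, so it falls back to the [UNK] id. This is one reason the vocabulary has to match the one the model was trained with.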

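If you plan to use a pretrained model, it is important to use the pretrained tokenizer associated with it. This ensures that the text is split the same way as the pretraining corpus, and that the same token-to-index mapping (usually called the vocabulary, or vocab) is used as during pretraining.

To get started, load a pretrained tokenizer with the AutoTokenizer.from_pretrained() method. This downloads the vocab the model was pretrained with: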

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

Then pass your text to the tokenizer:

encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
print(encoded_input)
{'input_ids': [101, 2079, 2025, 19960, 10362, 1999, 1996, 3821, 1997, 16657, 1010, 2005, 2027, 2024, 11259, 1998, 4248, 2000, 4963, 1012, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

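The tokenizer returns a dictionary with three important items:

input_ids are the indices corresponding to each token in the sentence.
attention_mask indicates whether a token should be attended to.
token_type_ids identifies which sequence a token belongs to when there is more than one sequence.

Decode the input_ids to get your input back: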

tokenizer.decode(encoded_input["input_ids"])

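The decoded string shows that the tokenizer added two special tokens to the sentence: [CLS] and [SEP] (classifier and separator). Not all models need special tokens, but if they do, the tokenizer adds them for you automatically.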
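The effect of special tokens on encoding and decoding can be sketched with a toy example. The vocabulary and the helpers toy_encode and toy_decode are hypothetical; only the ids 101 and 102 for [CLS] and [SEP] are taken from the outputs in this article.

```python
# Hypothetical toy vocabulary; only the ids 101 and 102 for [CLS] and [SEP]
# are taken from the example outputs in this article.
TOY_VOCAB = {"[CLS]": 101, "[SEP]": 102, "hello": 7, "wizards": 8}
ID_TO_TOKEN = {idx: token for token, idx in TOY_VOCAB.items()}


def toy_encode(tokens):
    """Wrap the token ids in [CLS] ... [SEP], as a BERT-style tokenizer does."""
    return [TOY_VOCAB["[CLS]"]] + [TOY_VOCAB[t] for t in tokens] + [TOY_VOCAB["[SEP]"]]


def toy_decode(ids):
    """Map ids back to tokens and join them with spaces."""
    return " ".join(ID_TO_TOKEN[i] for i in ids)


ids = toy_encode(["hello", "wizards"])
print(ids)              # [101, 7, 8, 102]
print(toy_decode(ids))  # [CLS] hello wizards [SEP]
```

Decoding is simply the inverse lookup, which is why the special tokens show up in the decoded text even though you never typed them.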

If you have several sentences to preprocess, pass them to the tokenizer as a list:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

batch_sentences = [
    # The first item is a sentence pair, the other two are single sentences.
    ["But what about second breakfast?", "i am a sentence"],
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True, truncation=True)
print(encoded_input)

{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 178, 1821, 170, 5650, 102, 0, 0],
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}

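Note how token_type_ids comes into play in the example above: in the first row, the tokens of the second sentence in the pair are marked with 1. The ids 101 and 102 belong to [CLS] and [SEP], which mark the start of the input and the end of each sentence.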
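The layout of a sentence pair can be reproduced by hand. The following is a simplified sketch, not the library's implementation; toy_encode_pair is a hypothetical helper, and the token ids are copied from the first row of the batch output above.

```python
CLS_ID, SEP_ID = 101, 102  # ids of [CLS] and [SEP] in the output above


def toy_encode_pair(ids_a, ids_b=None):
    """Build input_ids and token_type_ids for one sentence or a sentence pair."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID]
    token_type_ids = [0] * len(input_ids)
    if ids_b is not None:
        # The second sentence and its closing [SEP] get segment id 1.
        input_ids += ids_b + [SEP_ID]
        token_type_ids += [1] * (len(ids_b) + 1)
    return {"input_ids": input_ids, "token_type_ids": token_type_ids}


# Token ids copied from the first row of the batch output above.
pair = toy_encode_pair([1252, 1184, 1164, 1248, 6462, 136], [178, 1821, 170, 5650])
print(pair["input_ids"])
print(pair["token_type_ids"])
```

Before padding, this gives the same 13 ids and the same run of eight 0s followed by five 1s as the first row of the real output.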

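1.2.3.1.1 Padding

Sentences are not always the same length, which can be a problem because the tensors fed to the model need a uniform shape. Padding is a strategy that adds a special padding token to shorter sentences so that the tensors are rectangular.

Set the padding parameter to True to pad the shorter sequences in the batch to the length of the longest sequence: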

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True)
print(encoded_input)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
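What padding does to input_ids and attention_mask can be sketched in a few lines of plain Python. toy_pad is a hypothetical helper written for this illustration, and it assumes the padding id is 0, as in the output above.

```python
PAD_ID = 0  # the padding id that appears in the output above


def toy_pad(batch_input_ids):
    """Pad every sequence to the longest one and build matching attention masks."""
    max_len = max(len(ids) for ids in batch_input_ids)
    padded, masks = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        padded.append(ids + [PAD_ID] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        masks.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": padded, "attention_mask": masks}


batch = toy_pad([[101, 1252, 136, 102], [101, 1327, 102]])
print(batch["input_ids"])       # [[101, 1252, 136, 102], [101, 1327, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The attention mask is what lets the model tell padding apart from real tokens, which is why the two lists always have the same shape.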

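1.2.3.1.2 Truncation

At the other end of the spectrum, a sequence may sometimes be too long for the model to handle. In that case, you need to truncate the sequence to a shorter length.

Set the truncation parameter to True to truncate a sequence to the maximum length the model accepts: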

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True, truncation=True)
print(encoded_input)
{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}

See the padding and truncation concept guide to learn more about the padding and truncation arguments.
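None of the three sentences above is long enough to be truncated, so the output does not change. To see what truncation does, here is a simplified sketch. toy_truncate is a hypothetical helper, not the library's algorithm, and it assumes the sequence already ends with [SEP] and that max_length is at least 2.

```python
def toy_truncate(input_ids, max_length):
    """Cut a [CLS] ... [SEP] sequence to max_length, keeping the final [SEP]."""
    if len(input_ids) <= max_length:
        return input_ids
    # Keep the first max_length - 1 ids, then re-append the closing token.
    return input_ids[: max_length - 1] + [input_ids[-1]]


ids = [101, 1790, 112, 189, 1341, 1119, 3520, 102]
print(toy_truncate(ids, 5))   # [101, 1790, 112, 189, 102]
print(toy_truncate(ids, 20))  # unchanged, already short enough
```

The point of the sketch is that truncation drops content tokens while the special tokens at both ends are kept, so the model still sees a well-formed input.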

1.2.3.1.3 Building tensors

Finally, the tokenizer can return the actual tensors that are fed to the model.

Set the return_tensors parameter to pt (for PyTorch) or tf (for TensorFlow):

PyTorch:

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
print(encoded_input)
{'input_ids': tensor([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
                      [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
                      [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                           [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                           [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
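Returning tensors only works when every row of the batch has the same length, which is why padding is enabled in the call above. The following plain-Python sketch illustrates that requirement without any tensor library. batch_shape is a hypothetical helper, not part of transformers.

```python
def batch_shape(batch_input_ids):
    """Return (rows, columns) for a rectangular batch, or raise for a ragged one."""
    lengths = {len(ids) for ids in batch_input_ids}
    if len(lengths) != 1:
        raise ValueError(f"ragged batch, sequence lengths differ: {sorted(lengths)}")
    return (len(batch_input_ids), lengths.pop())


# A padded batch is rectangular, so it has a well-defined shape.
print(batch_shape([[101, 7, 102, 0], [101, 8, 9, 102]]))  # (2, 4)

# An unpadded batch is ragged and cannot become a single tensor.
try:
    batch_shape([[101, 7, 102], [101, 8, 9, 102]])
except ValueError as err:
    print(err)  # ragged batch, sequence lengths differ: [3, 4]
```

For the batch in this section, the same reasoning gives a shape of 3 rows by 15 columns for each of the three returned tensors.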
