
[Transformers Basics, Part 2] Core Components: Pipeline

news 2025/7/17 2:33:10

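Contents

1. What Is a Pipeline
2. Viewing the Task Types Pipeline Supports
3. Creating and Using a Pipeline
- 3.1 Create a Pipeline directly from the task type (the default model is English)
- 3.2 Specify the task type and a model to create a Pipeline based on that model
- 3.3 Load the model first, then create the Pipeline
- 3.4 Run inference on the GPU
- 3.5 Check the device
- 3.6 Measure the inference time
- 3.7 Determining a Pipeline's parameters
4. How a Pipeline Works Under the Hood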

This article is a set of study notes for the video series at https://space.bilibili.com/21060026/channel/collectiondetail?sid=1357748

Project repository: https://github.com/zyds/transformers-code

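1. What Is a Pipeline

A Pipeline chains three stages (data preprocessing, model inference, and result postprocessing) into a single workflow, as the flowchart in the original post shows.
It lets us pass in raw text and get the final answer directly, without having to handle the intermediate details ourselves.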
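To make the three stages concrete, here is a minimal sketch in plain Python. It does not use the transformers library: the vocabulary, the word-counting "model", and the function names are all hypothetical stand-ins, chosen only to show how preprocessing, inference, and postprocessing compose into one call.

```python
from typing import Dict, List

# Hypothetical toy components; none of these are transformers APIs.
VOCAB: Dict[str, int] = {"good": 1, "great": 2, "bad": 3, "awful": 4}
POSITIVE_IDS = {1, 2}
NEGATIVE_IDS = {3, 4}


def preprocess(text: str) -> List[int]:
    """Stage 1: turn raw text into token ids (unknown words map to 0)."""
    return [VOCAB.get(word, 0) for word in text.lower().split()]


def forward(token_ids: List[int]) -> List[float]:
    """Stage 2: stand-in "model" that returns raw scores for [NEGATIVE, POSITIVE]."""
    neg = sum(1.0 for t in token_ids if t in NEGATIVE_IDS)
    pos = sum(1.0 for t in token_ids if t in POSITIVE_IDS)
    return [neg, pos]


def postprocess(scores: List[float]) -> str:
    """Stage 3: map the highest score back to a human-readable label."""
    labels = ["NEGATIVE", "POSITIVE"]
    return labels[scores.index(max(scores))]


def toy_pipeline(text: str) -> str:
    """Chain the three stages so the caller only sees text in, label out."""
    return postprocess(forward(preprocess(text)))


print(toy_pipeline("very good"))
```

A real Pipeline follows the same shape, with a tokenizer as stage 1, a neural network as stage 2, and a label lookup as stage 3.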

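2. Viewing the Task Types Pipeline Supports

from transformers.pipelines import SUPPORTED_TASKS

# Print each task name together with its implementation details
for k, v in SUPPORTED_TASKS.items():
    print(k, v)

This prints every task type that Pipeline currently supports, along with the implementation each task can call.
Sample output: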

audio-classification {'impl': <class 'transformers.pipelines.audio_classification.AudioClassificationPipeline'>, 'tf': (), 'pt': (<class 'transformers.models.auto.modeling_auto.AutoModelForAudioClassification'>,), 'default': {'model': {'pt': ('superb/wav2vec2-base-superb-ks', '372e048')}}, 'type': 'audio'}

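k: the task name, for example audio-classification.
v: the task's implementation details, such as which Pipeline class handles it, whether a TensorFlow model is available, whether a PyTorch model is available, and which model is used by default.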
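The structure of one entry can be explored with a short sketch. To keep it runnable without transformers installed, it uses a hand-written dict (SAMPLE_TASKS, an assumption) shaped like the sample output above, with class objects replaced by their names; default_pt_model is a hypothetical helper, not a library function.

```python
from typing import Any, Dict, Optional

# Hand-written stand-in shaped like the sample entry above; class objects
# are replaced by their names so the snippet runs without transformers.
SAMPLE_TASKS: Dict[str, Dict[str, Any]] = {
    "audio-classification": {
        "impl": "AudioClassificationPipeline",
        "tf": (),
        "pt": ("AutoModelForAudioClassification",),
        "default": {"model": {"pt": ("superb/wav2vec2-base-superb-ks", "372e048")}},
        "type": "audio",
    },
}


def default_pt_model(entry: Dict[str, Any]) -> Optional[str]:
    """Return the default PyTorch checkpoint name, or None if the entry has none."""
    pt = entry.get("default", {}).get("model", {}).get("pt")
    return pt[0] if pt else None


for task, entry in SAMPLE_TASKS.items():
    has_pt = bool(entry["pt"])  # an empty tuple means no PyTorch model class
    has_tf = bool(entry["tf"])  # an empty tuple means no TensorFlow model class
    print(task, entry["type"], has_pt, has_tf, default_pt_model(entry))
```

Entries in the real SUPPORTED_TASKS may be nested differently for some tasks, so the helper returns None rather than failing when the expected keys are missing.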

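3. Creating and Using a Pipeline

3.1 Create a Pipeline directly from the task type (the default model is English)

from transformers import pipeline

pipe = pipeline("text-classification")  # create a Pipeline from the task name alone
pipe("very good")  # run it on a test sentence and return the prediction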

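3.2 Specify the task type and a model to create a Pipeline based on that model

Note: I have already downloaded the model to a local directory.

# Browse available models at https://huggingface.co/models
pipe = pipeline("text-classification", model="./models/roberta-base-finetuned-dianping-chinese")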

3.3 Load the model first, then create the Pipeline

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# With this approach, both model and tokenizer must be specified
model = AutoModelForSequenceClassification.from_pretrained("./models_roberta-base-finetuned-dianping-chinese")
tokenizer = AutoTokenizer.from_pretrained("./models_roberta-base-finetuned-dianping-chinese")
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)

3.4 Run inference on the GPU

pipe = pipeline("text-classification", model="./models_roberta-base-finetuned-dianping-chinese", device=0)
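Hard-coding device=0 fails on a machine without a GPU. A minimal sketch of a fallback is shown below, assuming the usual convention that the pipeline's integer device argument uses -1 for CPU and 0 for the first CUDA device. pick_device is a hypothetical helper, and the import is guarded so the snippet also runs where PyTorch is not installed.

```python
def pick_device(cuda_available: bool) -> int:
    """Return 0 (first GPU) when CUDA is usable, otherwise -1 (CPU)."""
    return 0 if cuda_available else -1


try:
    import torch  # optional here; only used to probe for CUDA

    cuda_available = torch.cuda.is_available()
except ImportError:
    cuda_available = False

device = pick_device(cuda_available)
print("device argument:", device)
```

The resulting value can then be passed as device=device when creating the Pipeline.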

3.5 Check the device

pipe.model.device

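3.6 Measure the inference time

import torch
import time

times = []
for i in range(100):
    torch.cuda.synchronize()  # wait for pending GPU work before starting the clock
    start = time.time()
    pipe("我觉得不太行！")  # "I don't think this is very good!"
    torch.cuda.synchronize()  # wait for the GPU to finish before stopping the clock
    end = time.time()
    times.append(end - start)
print(sum(times) / 100)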
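The same measurement can be wrapped in a reusable helper. The sketch below is one possible shape, not part of the original notes: average_latency, its warmup parameter, and the optional sync callback are assumptions. On a GPU you would pass sync=torch.cuda.synchronize; the demo here times a plain Python function so it runs anywhere.

```python
import time
from typing import Callable, List, Optional


def average_latency(
    fn: Callable[[], object],
    runs: int = 100,
    warmup: int = 5,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Return the average wall-clock seconds per call of fn over `runs` timed calls."""
    for _ in range(warmup):  # untimed calls so one-off setup cost is excluded
        fn()
    total = 0.0
    for _ in range(runs):
        if sync is not None:
            sync()  # make sure earlier asynchronous work has finished
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # make sure this call's work has finished
        total += time.perf_counter() - start
    return total / runs


calls: List[int] = []
avg = average_latency(lambda: calls.append(sum(range(1000))), runs=20, warmup=3)
print(f"{avg:.6f} s per call over 20 runs")
```

time.perf_counter is used instead of time.time because it is a monotonic, high-resolution clock intended for measuring short intervals.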

3.7 Determining a Pipeline's parameters

# First create a pipeline
qa_pipe = pipeline("question-answering", model="../../models/models")
qa_pipe

The output is a QuestionAnsweringPipeline object.

Looking at the class definition shows how this pipeline is meant to be used:

class QuestionAnsweringPipeline(ChunkPipeline):
    """
    Question Answering pipeline using any `ModelForQuestionAnswering`. See the [question answering
    examples](../task_summary#question-answering) for more information.

    Example:

    ```python
    >>> from transformers import pipeline

    >>> oracle = pipeline(model="deepset/roberta-base-squad2")
    >>> oracle(question="Where do I live?", context="My name is Wolfgang and I live in Berlin")
    {'score': 0.9191, 'start': 34, 'end': 40, 'answer': 'Berlin'}
    ```

    Learn more about the basics of using a pipeline in the [pipeline tutorial](../pipeline_tutorial)

    This question answering pipeline can currently be loaded from [`pipeline`] using the following task identifier:
    `"question-answering"`.

    The models that this pipeline can use are models that have been fine-tuned on a question answering task. See the
    up-to-date list of available models on
    [huggingface.co/models](https://huggingface.co/models?filter=question-answering).
    """

Open the pipeline class and look at __call__, which lists the additional parameters it supports:

    def __call__(self, *args, **kwargs):
        """
        Answer the question(s) given as inputs by using the context(s).

        Args:
            args ([`SquadExample`] or a list of [`SquadExample`]):
                One or several [`SquadExample`] containing the question and context.
            X ([`SquadExample`] or a list of [`SquadExample`], *optional*):
                One or several [`SquadExample`] containing the question and context (will be treated the same way as if
                passed as the first positional argument).
            data ([`SquadExample`] or a list of [`SquadExample`], *optional*):
                One or several [`SquadExample`] containing the question and context (will be treated the same way as if
                passed as the first positional argument).
            question (`str` or `List[str]`):
                One or several question(s) (must be used in conjunction with the `context` argument).
            context (`str` or `List[str]`):
                One or several context(s) associated with the question(s) (must be used in conjunction with the
                `question` argument).
            topk (`int`, *optional*, defaults to 1):
                The number of answers to return (will be chosen by order of likelihood). Note that we return less than
                topk answers if there are not enough options available within the context.
            doc_stride (`int`, *optional*, defaults to 128):
                If the context is too long to fit with the question for the model, it will be split in several chunks
                with some overlap. This argument controls the size of that overlap.
            max_answer_len (`int`, *optional*, defaults to 15):
                The maximum length of predicted answers (e.g., only answers with a shorter length are considered).
            max_seq_len (`int`, *optional*, defaults to 384):
                The maximum length of the total sentence (context + question) in tokens of each chunk passed to the
                model. The context will be split in several chunks (using `doc_stride` as overlap) if needed.
            max_question_len (`int`, *optional*, defaults to 64):
                The maximum length of the question after tokenization. It will be truncated if needed.
            handle_impossible_answer (`bool`, *optional*, defaults to `False`):
                Whether or not we accept impossible as an answer.
            align_to_words (`bool`, *optional*, defaults to `True`):
                Attempts to align the answer to real words. Improves quality on space separated languages. Might hurt on
                non-space-separated languages (like Japanese or Chinese)

        Return:
            A `dict` or a list of `dict`: Each result comes as a dictionary with the following keys:

            - **score** (`float`) -- The probability associated to the answer.
            - **start** (`int`) -- The character start index of the answer (in the tokenized version of the input).
            - **end** (`int`) -- The character end index of the answer (in the tokenized version of the input).
            - **answer** (`str`) -- The answer to the question.
        """
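The doc_stride description above amounts to a sliding window over the context. The sketch below illustrates that idea only: split_with_overlap is a hypothetical helper working on a plain list of tokens, and it ignores the room the real pipeline must reserve for the question and special tokens.

```python
from typing import List


def split_with_overlap(tokens: List[str], max_len: int, stride: int) -> List[List[str]]:
    """Split tokens into chunks of at most max_len, with consecutive chunks sharing `stride` tokens."""
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    chunks: List[List[str]] = []
    start = 0
    while True:
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):  # this chunk reached the end
            break
        start += max_len - stride  # step forward, keeping `stride` tokens of overlap
    return chunks


tokens = list("abcdefghij")
chunks = split_with_overlap(tokens, max_len=4, stride=2)
for chunk in chunks:
    print("".join(chunk))
```

Because neighbouring chunks overlap, an answer that straddles a chunk boundary still appears whole in at least one chunk, as long as it is no longer than the overlap.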

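Consider the following example.

We pass in the question "中国的首都是哪里？" ("What is the capital of China?") together with the context "中国的首都是北京" ("The capital of China is Beijing"):

qa_pipe(question="中国的首都是哪里？", context="中国的首都是北京")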


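If max_answer_len is set to cap the answer length, the returned answer is forced to fit within that limit, because only spans no longer than max_answer_len are considered: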

qa_pipe(question="中国的首都是哪里？", context="中国的首都是北京", max_answer_len=1)
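Why a small max_answer_len shortens the answer can be sketched with a toy span search. Everything here is an assumption for illustration: the character-level tokens, the made-up start and end scores, and the best_span helper. The real pipeline works on the model's token-level probabilities, but the principle of discarding candidate spans longer than the limit is the same.

```python
from typing import List, Tuple


def best_span(start_scores: List[float], end_scores: List[float], max_answer_len: int) -> Tuple[int, int]:
    """Return inclusive (start, end) indices of the best-scoring span of at most max_answer_len tokens."""
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        last = min(s + max_answer_len, len(end_scores))
        for e in range(s, last):  # spans longer than the limit are never examined
            score = s_score * end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Made-up scores for the context "中国的首都是北京", one token per character.
tokens = ["中", "国", "的", "首", "都", "是", "北", "京"]
start_scores = [0.02, 0.02, 0.02, 0.02, 0.02, 0.1, 0.7, 0.1]
end_scores = [0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.2, 0.68]

for limit in (15, 1):
    s, e = best_span(start_scores, end_scores, limit)
    print(limit, "".join(tokens[s:e + 1]))
```

With these scores, the two-token span wins under a generous limit, while a limit of 1 leaves only single-token candidates, so the result looks like a truncated answer.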


4. How a Pipeline Works Under the Hood

Step 1: Initialize the components (tokenizer and model)

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Step 1: initialize the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("../../models/models_roberta-base-finetuned-dianping-chinese")
model = AutoModelForSequenceClassification.from_pretrained("../../models/models_roberta-base-finetuned-dianping-chinese")

Step 2: Preprocessing

# Preprocess the text; the result is a dict-like object holding PyTorch tensors
input_text = "我觉得不太行！"
inputs = tokenizer(input_text, return_tensors="pt")
inputs


Step 3: Model prediction

res = model(**inputs)
res

The prediction result contains several fields, such as loss and logits.

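Step 4: Postprocessing

import torch

logits = res.logits
logits = torch.softmax(logits, dim=-1)  # convert the logits to probabilities
pred = torch.argmax(logits).item()  # index of the most probable class
result = model.config.id2label.get(pred)  # map the class index to its label
result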
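The postprocessing step is small enough to reproduce in plain Python, which makes the arithmetic visible. In this sketch the logits and the id2label mapping are made-up values (the real ones come from the model and its config), and softmax is written by hand instead of calling torch.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw scores to probabilities; subtracting the max keeps exp() numerically stable."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.2, -0.8]  # made-up model output for one sentence
id2label: Dict[int, str] = {0: "negative", 1: "positive"}  # hypothetical label map

probs = softmax(logits)
pred = max(range(len(probs)), key=probs.__getitem__)  # argmax over the classes
result = id2label.get(pred)
print(result, round(probs[pred], 4))
```

Softmax does not change which class is largest, so the argmax could be taken on the raw logits; applying softmax first is what yields a probability to report as the score.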
