
LLM - Testing LoRA Instruction Fine-Tuning of Qwen2-VL with ModelScope SWIFT: A Tutorial (2)

news 2025/10/16 12:26:20

Follow my CSDN blog: https://spike.blog.csdn.net/
Article URL: https://spike.blog.csdn.net/article/details/142827217

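Disclaimer: this article is based on personal knowledge and public materials and is intended for academic exchange only. Discussion is welcome; reposting is not permitted.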

SWIFT

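SWIFT (Scalable lightWeight Infrastructure for FineTuning) is an efficient, lightweight framework for model fine-tuning and inference. It supports training, inference, evaluation, and deployment of large language models (LLMs) and multimodal large models (MLLMs). SWIFT can be applied directly in research and production environments, covering the complete workflow from model training and evaluation to application.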

1. Dataset

OCR test datasets:

Curated (Parquet format): https://modelscope.cn/datasets/AI-ModelScope/LaTeX_OCR
Original: https://github.com/LinXueyuanStdio/Data-for-LaTeX_OCR

Dataset cache location (MODELSCOPE_CACHE): modelscope_models/AI-ModelScope/LaTeX_OCR

Test images:

[your path]/llm/vision_test_data/latex-print.png
[your path]/llm/vision_test_data/latex-fullhand.png

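Test the OCR capability of qwen2-vl-7b-instruct. The prompt entered at <<< is Chinese for "Use OCR to recognize the LaTeX formula in the image":

CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen2-vl-7b-instruct
<<< <image>使用OCR识别图像中的Latex公式
Input an image path or URL <<< [your path]/llm/vision_test_data/latex-print.png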
ds^2 = (1 - \frac{qcos\theta}{r})^{\frac{2}{1 + \alpha^2}} \{dr^2 + r^2 d\theta^2 + r^2 sin^2\theta d\phi^2 \} - \frac{dt^2}{(1 - \frac{qcos\theta}{r})^{\frac{2}{1 + \alpha^2}}}.

Original image: [figure: print]

Recognition result (printed):

$ds^2 = (1 - \frac{qcos\theta}{r})^{\frac{2}{1 + \alpha^2}} \{dr^2 + r^2 d\theta^2 + r^2 sin^2\theta d\phi^2 \} - \frac{dt^2}{(1 - \frac{qcos\theta}{r})^{\frac{2}{1 + \alpha^2}}}$

Original image: [figure: fullhand]

Recognition result (handwritten):

$ds^2 = (1 - \frac{qcos\theta}{r})^{\frac{2}{1 + \alpha^2}} \{d\delta^2 + r^2 d\theta^2 + n^2 s/n^2 d\phi^2 \} - \frac{dt^2}{(1 - \frac{qcos\theta}{r})^{\frac{2}{1 + \alpha^2}}}.$

The preprocess_func() of the latex-ocr-print dataset is as follows:

def _preprocess_latex_ocr_dataset(dataset: DATASET_TYPE) -> DATASET_TYPE:
    from datasets import Image
    prompt = 'Using LaTeX to perform OCR on the image.'

    def _process(d):
        return {'query': prompt, 'response': d['text']}

    kwargs = {}
    if not isinstance(dataset, HfIterableDataset):
        kwargs['load_from_cache_file'] = dataset_enable_cache
    return dataset.map(_process, **kwargs).rename_column('image', 'images')
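To make the effect of this preprocessing concrete, here is a minimal sketch that applies the same per-row mapping to plain Python dictionaries instead of a Hugging Face dataset. The helper name preprocess_rows and the sample row are made up for illustration; only the prompt string and the column names (text, image, query, response, images) come from the function above.

```python
from typing import Any, Dict, List

PROMPT = 'Using LaTeX to perform OCR on the image.'


def preprocess_rows(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Mimic dataset.map(_process).rename_column('image', 'images') on plain dicts."""
    out = []
    for row in rows:
        new_row = dict(row)                       # map() keeps the existing columns
        new_row['query'] = PROMPT                 # same instruction for every sample
        new_row['response'] = row['text']         # the LaTeX label becomes the target
        new_row['images'] = new_row.pop('image')  # rename 'image' -> 'images'
        out.append(new_row)
    return out


# Hypothetical sample row; in the real dataset the image is picture data, not a file name.
sample = [{'image': 'formula_0001.png', 'text': r'\frac { a } { b }'}]
processed = preprocess_rows(sample)
print(processed[0])
```

Every sample therefore shares one fixed query, and only the response and the image differ between rows.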

Datasets downloaded through ModelScope are stored under modelscope_models/hub/datasets. They are cached in Arrow format, which is not compatible with the default format:

├── [4.0K]  AI-ModelScope___la_te_x_ocr
│   └── [4.0K]  synthetic_handwrite-eb02dd1cc52afa40
│       └── [4.0K]  0.0.0
│           ├── [4.0K]  master
│           │   ├── [752K]  cache-8f28bc5f38ad58b9-fa2020342a21.arrow
│           │   ├── [6.3M]  cache-a7c7e67013e13072-fa2020342a21.arrow
│           │   ├── [606M]  cache-c67a1e1eba314afd-fa2020342a21.arrow
│           │   ├── [7.9K]  cache-e9fb6f7ceeaa8304-fa2020342a21.arrow
│           │   ├── [1.2K]  dataset_info.json
│           │   ├── [ 59M]  la_te_x_ocr-test.arrow
│           │   ├── [474M]  la_te_x_ocr-train.arrow
│           │   └── [ 59M]  la_te_x_ocr-validation.arrow
│           ├── [   0]  master.incomplete_info.lock
│           └── [   0]  master_builder.lock

2. Supervised fine-tuning

Supervised Fine-Tuning (SFT). To list the available arguments:

python [your path]/llm/ms-swift/swift/cli/sft.py --help

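During the run, the dataset is downloaded automatically to MODELSCOPE_CACHE and converted to the Arrow format that SWIFT supports; the default dataset cannot be used directly: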

MAX_STEPS=2000 SIZE_FACTOR=8 MAX_PIXELS=602112 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8 nohup swift sft \
--model_type qwen2-vl-7b-instruct \
--model_id_or_path qwen/Qwen2-VL-7B-Instruct \
--sft_type lora \
--num_train_epochs 2 \
--batch_size 4 \
--eval_steps 1000 \
--save_steps 1000 \
--dataset latex-ocr-handwrite \
> nohup.latex-ocr-handwrite.out &
tail -f nohup.latex-ocr-handwrite.out

To use a custom dataset, see "Swift - Custom Dataset"; the data must be converted to standard json or jsonl format.
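As a rough illustration, the sketch below writes OCR samples to a jsonl file with one JSON object per line. The field names query, response, and images mirror the columns produced by the preprocessing function shown in section 1; treat them as an assumption and check the custom dataset documentation of your SWIFT version for the exact schema. The file name, image paths, and sample content are hypothetical.

```python
import json
import os
import tempfile

# Hypothetical samples; the image paths are placeholders.
samples = [
    {
        'query': 'Using LaTeX to perform OCR on the image.',
        'response': r'\frac { a } { b }',
        'images': ['images/formula_0001.png'],
    },
    {
        'query': 'Using LaTeX to perform OCR on the image.',
        'response': r'x ^ { 2 } + y ^ { 2 } = r ^ { 2 }',
        'images': ['images/formula_0002.png'],
    },
]


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, 'w', encoding='utf-8') as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + '\n')


def read_jsonl(path):
    """Read the file back, skipping blank lines."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


out_path = os.path.join(tempfile.mkdtemp(), 'latex_ocr_custom.jsonl')
write_jsonl(samples, out_path)
loaded = read_jsonl(out_path)
print(len(loaded), 'records written to', out_path)
```

Reading the file back is a cheap sanity check that every line is valid JSON before a long training run starts.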

Training finished after a total of 11808 steps. The output log:

[INFO:swift] Saving model checkpoint to [your folder]/output/qwen2-vl-7b-instruct/v0-20241011-205638/checkpoint-11808
Train: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11808/11808 [6:16:15<00:00,  1.91s/it]
[INFO:swift] last_model_checkpoint: [your folder]/output/qwen2-vl-7b-instruct/v0-20241011-205638/checkpoint-11808
[INFO:swift] best_model_checkpoint: [your folder]/output/qwen2-vl-7b-instruct/v0-20241011-205638/checkpoint-11000
[INFO:swift] images_dir: [your folder]/output/qwen2-vl-7b-instruct/v0-20241011-205638/images
[INFO:swift] End time of running main: 2024-10-12 03:17:31.020443
{'eval_loss': 0.12784964, 'eval_acc': 0.96368307, 'eval_runtime': 44.673, 'eval_samples_per_second': 21.355, 'eval_steps_per_second': 5.35, 'epoch': 2.0, 'global_step/max_steps': '11808/11808', 'percentage': '100.00%', 'elapsed_time': '6h 16m 14s', 'remaining_time': '0s'}
{'train_runtime': 22574.9994, 'train_samples_per_second': 8.369, 'train_steps_per_second': 0.523, 'train_loss': 0.14006881, 'epoch': 2.0, 'global_step/max_steps': '11808/11808', 'percentage': '100.00%', 'elapsed_time': '6h 16m 15s', 'remaining_time': '0s'}
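The throughput figures in this log can be cross-checked with a little arithmetic. The sketch below only uses numbers copied from the log lines above; the interpretation of the result is an assumption and is not stated by the log itself.

```python
# Numbers copied from the training log above.
train_runtime_s = 22574.9994
total_steps = 11808
samples_per_second = 8.369

# Optimizer steps per second, and how many samples each step consumed.
steps_per_second = total_steps / train_runtime_s
samples_per_step = samples_per_second / steps_per_second

# Convert the runtime into hours / minutes / seconds.
hours, rem = divmod(round(train_runtime_s), 3600)
minutes, seconds = divmod(rem, 60)

print(f'steps/s         : {steps_per_second:.3f}')
print(f'samples per step: {samples_per_step:.1f}')
print(f'elapsed         : {hours}h {minutes}m {seconds}s')
```

The result is about 16 samples per optimizer step. With --batch_size 4 this suggests a factor of 4 coming from gradient accumulation and/or data parallelism, but that split is a guess here, since the log does not show those settings.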

The output directory is shown below; the images folder stores the plots drawn during training:

[your folder]/output/qwen2-vl-7b-instruct/v0-20241011-205638
├── [4.0K]  checkpoint-11000
├── [4.0K]  checkpoint-11808
├── [4.0K]  images
├── [1.1M]  logging.jsonl
├── [4.0K]  runs
├── [ 11K]  sft_args.json
└── [4.8K]  training_args.json

Read the training logs with TensorBoard:

# http://127.0.0.1:6006/
tensorboard --logdir=[your folder]/output/qwen2-vl-7b-instruct/v0-20241011-205638/runs/ --host=0.0.0.0 --port=6006

Training loss (Smooth=0.9): [figure: Loss]

Learning rate: [figure]

Validation loss (eval_steps=1000): [figure: Loss]

GPU memory usage (BatchSize=4): [figure: GPU]

Alternatively, the loss curve can be plotted with Matplotlib from the TensorBoard data, with smoothing set to 0.9:

import os
from typing import Dict, List, Tuple

import matplotlib.pyplot as plt
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

Item = Dict[str, float]
TB_COLOR, TB_COLOR_SMOOTH = '#FFE2D9', '#FF7043'


def read_tensorboard_file(fpath: str) -> Dict[str, List[Item]]:
    if not os.path.isfile(fpath):
        raise FileNotFoundError(f'fpath: {fpath}')
    ea = EventAccumulator(fpath)
    ea.Reload()
    res: Dict[str, List[Item]] = {}
    tags = ea.Tags()['scalars']
    print(f"[Info] tags: {tags}")
    for tag in tags:
        values = ea.Scalars(tag)
        r: List[Item] = []
        for v in values:
            r.append({'step': v.step, 'value': v.value})
        res[tag] = r
    return res


def tensorboard_smoothing(values: List[float], smooth: float = 0.9) -> List[float]:
    norm_factor = 0
    x = 0
    res: List[float] = []
    for i in range(len(values)):
        x = x * smooth + values[i]  # Exponential decay
        norm_factor *= smooth
        norm_factor += 1
        res.append(x / norm_factor)
    return res


def plot_images(images_dir: str,
                tb_dir: str,
                smooth_key: List[str],
                smooth_val: float = 0.9,
                figsize: Tuple[int, int] = (8, 5),
                dpi: int = 100) -> None:
    """Using tensorboard's data content to plot images"""
    os.makedirs(images_dir, exist_ok=True)
    fname = [fname for fname in os.listdir(tb_dir) if os.path.isfile(os.path.join(tb_dir, fname))][0]
    tb_path = os.path.join(tb_dir, fname)
    data = read_tensorboard_file(tb_path)

    for k in data.keys():
        _data = data[k]
        steps = [d['step'] for d in _data]
        values = [d['value'] for d in _data]
        if len(values) == 0:
            continue
        _, ax = plt.subplots(1, 1, squeeze=True, figsize=figsize, dpi=dpi)
        ax.set_title(k)
        if len(values) == 1:
            ax.scatter(steps, values, color=TB_COLOR_SMOOTH)
        elif k in smooth_key:
            ax.plot(steps, values, color=TB_COLOR)
            values_s = tensorboard_smoothing(values, smooth_val)
            ax.plot(steps, values_s, color=TB_COLOR_SMOOTH)
        else:
            ax.plot(steps, values, color=TB_COLOR_SMOOTH)
        # fpath = os.path.join(images_dir, k.replace('/', '_'))
        # plt.savefig(fpath, dpi=dpi, bbox_inches='tight')
        # plt.close()
        plt.show()
        plt.close()
        break


ckpt_dir = "[your path]/llm/ms-swift/output"
images_dir = os.path.join(ckpt_dir, 'images')
tb_dir = "[your path]/run_cuda/output/qwen2-vl-7b-instruct/v0-20241011-205638/runs/"
plot_images(images_dir, tb_dir, ['train/loss'], 0.9)
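The smoothing step is an exponential moving average with a debiasing term: for smoothing factor s, it accumulates x_t = s * x_{t-1} + v_t and n_t = s * n_{t-1} + 1, and plots x_t / n_t. The short check below repeats the function so that it runs on its own; the two input series are made up for illustration.

```python
from typing import List


def tensorboard_smoothing(values: List[float], smooth: float = 0.9) -> List[float]:
    norm_factor = 0.0
    x = 0.0
    res: List[float] = []
    for v in values:
        x = x * smooth + v                      # decayed running sum of values
        norm_factor = norm_factor * smooth + 1  # decayed running sum of weights
        res.append(x / norm_factor)             # debiased average
    return res


# A constant series stays constant, including at the very first step.
constant = tensorboard_smoothing([2.0] * 5)

# A single spike is damped instead of showing up at full height.
spiky = tensorboard_smoothing([1.0, 1.0, 10.0, 1.0, 1.0])

print(constant)
print([round(v, 3) for v in spiky])
```

Dividing by the running weight sum is what keeps the first points from being pulled toward zero, which a plain moving average initialized at 0 would do.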

3. Merging the LoRA model

After training, the LoRA checkpoint contains:

(rag) output/qwen2-vl-7b-instruct/v0-20241011-205638/checkpoint-11808# tree -L 1 -h .
.
├── [5.0K]  README.md
├── [ 712]  adapter_config.json
├── [ 39M]  adapter_model.safetensors
├── [  67]  additional_config.json
├── [ 383]  configuration.json
├── [ 219]  generation_config.json
├── [ 77M]  optimizer.pt
├── [ 14K]  rng_state.pth
├── [1.0K]  scheduler.pt
├── [ 11K]  sft_args.json
├── [608K]  trainer_state.json
└── [7.2K]  training_args.bin

Merge the LoRA weights into the base model and evaluate the model at the same time:

CUDA_VISIBLE_DEVICES=0 swift infer \
--ckpt_dir [your path]/run_cuda/output/qwen2-vl-7b-instruct/v0-20241011-205638/checkpoint-11808/ \
--load_dataset_config true \
--merge_lora true
# Evaluate the model directly
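Conceptually, merging folds the low-rank update into the base weight so that no separate adapter is needed at inference time. In the standard LoRA formulation the merged weight is W' = W + (alpha / r) * B * A. The sketch below shows this with tiny hand-written matrices in plain Python. It is not SWIFT's implementation, and all shapes and values are invented for illustration.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product a @ b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def matvec(m: Matrix, x: List[float]) -> List[float]:
    """Matrix-vector product m @ x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in m]


def merge_lora(w: Matrix, lora_a: Matrix, lora_b: Matrix, alpha: float, r: int) -> Matrix:
    """Return W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


# Toy layer: 3 inputs, 2 outputs, rank-1 adapter (all values are made up).
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
lora_a = [[1.0, 2.0, 3.0]]   # shape (r, in_features)
lora_b = [[0.5], [-1.0]]     # shape (out_features, r)
alpha, r = 2.0, 1

merged = merge_lora(w, lora_a, lora_b, alpha, r)

# The merged layer gives the same output as base layer + adapter branch.
x = [1.0, 1.0, 1.0]
y_merged = matvec(merged, x)
base = matvec(w, x)
low_rank = matvec(lora_b, matvec(lora_a, x))
y_adapter = [b + (alpha / r) * d for b, d in zip(base, low_rank)]
print(merged, y_merged, y_adapter)
```

Because both paths produce the same output, the merged checkpoint can be loaded like an ordinary full model, at the cost of storing full-size weights again.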

Run inference with the merged model:

# [your path]/run_cuda/output/qwen2-vl-7b-instruct/v0-20241011-205638/checkpoint-11808-merged
# CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen2-vl-7b-instruct
CUDA_VISIBLE_DEVICES=0 swift infer --ckpt_dir [your path]/run_cuda/output/qwen2-vl-7b-instruct/v0-20241011-205638/checkpoint-11808-merged

Compare the outputs:

<<< <image>使用OCR识别图像中的Latex公式
Input an image path or URL <<< [your path]/llm/vision_test_data/latex-fullhand.png
d s ^ { 2 } = ( 1 - \frac { q c o s \theta } { r } ) ^ { \frac { 2 } { 1 + \kappa ^ { 2 } } } \{ d r ^ { 2 } + r ^ { 2 } d \theta ^ { 2 } + r ^ { 2 } s i n ^ { 2 } \theta d \varphi ^ { 2 } \} - \frac { d t ^ { 2 } } { ( 1 - \frac { q c o s \theta } { r } ) ^ { \frac { 2 } { 1 + \kappa ^ { 2 } } } } .
# Previous format
# ds^2 = (1 - \frac{qcos\theta}{r})^{\frac{2}{1 + \alpha^2}} \{d\delta^2 + r^2 d\theta^2 + n^2 s/n^2 d\phi^2 \} - \frac{dt^2}{(1 - \frac{qcos\theta}{r})^{\frac{2}{1 + \alpha^2}}}.

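Note: the format differs considerably from before. The model has learned the new OCR output format: the previous output had no spaces between tokens, while the new output is space-separated, consistent with the fine-tuning data.

Samples from the fine-tuning data use the same format as the LoRA output, which confirms that training worked: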

d s ^ { 2 } = ( 1 - { \frac { q c o s \theta } { r } } ) ^ { \frac { 2 } { 1 + \alpha ^ { 2 } } } \lbrace d r ^ { 2 } + r ^ { 2 } d \theta ^ { 2 } + r ^ { 2 } s i n ^ { 2 } \theta d \varphi ^ { 2 } \rbrace - { \frac { d t ^ { 2 } } { ( 1 - { \frac { q c o s \theta } { r } } ) ^ { \frac { 2 } { 1 + \alpha ^ { 2 } } } } } \, .
\widetilde \gamma _ { \mathrm { h o p f } } \simeq \sum _ { n > 0 } \widetilde { G } _ { n } { \frac { ( - a ) ^ { n } } { 2 ^ { 2 n - 1 } } }
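To compare outputs written in the two styles, one option is to bring both into the space-separated token form. The sketch below uses a simple regular expression that treats a backslash command, an escaped character, or any other non-space character as one token. It is a rough approximation and does not reproduce the normalization used to build the dataset (for example, it does not add braces around single-character exponents); the function names are hypothetical.

```python
import re
from typing import List

# One token = a command like \frac, an escaped symbol like \{, or a single character.
_TOKEN = re.compile(r'\\[A-Za-z]+|\\.|\S')


def latex_tokens(formula: str) -> List[str]:
    """Split a LaTeX string into coarse tokens, ignoring whitespace."""
    return _TOKEN.findall(formula)


def to_spaced(formula: str) -> str:
    """Join the tokens with single spaces."""
    return ' '.join(latex_tokens(formula))


compact = r'\frac{dt^2}{(1 - \frac{qcos\theta}{r})}'
spaced = r'\frac { d t ^ 2 } { ( 1 - \frac { q c o s \theta } { r } ) }'

print(to_spaced(compact))
print(to_spaced(compact) == spaced)
```

Comparing token lists rather than raw strings avoids counting pure spacing differences as recognition errors.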

4. How the --dataset training argument is resolved

Datasets are declared in swift/llm/utils/dataset.py, for example:

latex_ocr_print = 'latex-ocr-print'

register_dataset(
    DatasetName.latex_ocr_print,  # dataset_name
    'AI-ModelScope/LaTeX_OCR',  # dataset_id_or_path
    ['full'],  # subsets
    _preprocess_latex_ocr_dataset,  # preprocess_func
    get_dataset_from_repo,  # get_function
    split=['validation', 'test'],  # There are some problems in the training dataset.
    hf_dataset_id='linxy/LaTeX_OCR',
    tags=['chat', 'ocr', 'multi-modal', 'vision'])

The register_dataset function stores dataset_info in DATASET_MAPPING:

dataset_info = {
    'dataset_id_or_path': dataset_id_or_path,
    'subsets': subsets,
    'preprocess_func': preprocess_func,
    'split': split,
    'hf_dataset_id': hf_dataset_id,
    'is_local': is_local,
    **kwargs
}
DATASET_MAPPING[dataset_name] = dataset_info
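This is a plain registry pattern: a module-level dictionary maps a short dataset name to everything needed to load it. The toy version below illustrates the idea only. The names TOY_DATASET_MAPPING and register_toy_dataset, the reduced argument list, and the duplicate-name check are assumptions made for this sketch and do not reproduce SWIFT's actual signature.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical stand-in for DATASET_MAPPING.
TOY_DATASET_MAPPING: Dict[str, Dict[str, Any]] = {}


def register_toy_dataset(dataset_name: str,
                         dataset_id_or_path: Optional[str],
                         subsets: Optional[List[str]],
                         preprocess_func: Callable[[Any], Any],
                         *,
                         exist_ok: bool = False,
                         **kwargs: Any) -> None:
    """Store the loading recipe for a dataset under a short name."""
    if not exist_ok and dataset_name in TOY_DATASET_MAPPING:
        raise ValueError(f'dataset {dataset_name!r} is already registered')
    TOY_DATASET_MAPPING[dataset_name] = {
        'dataset_id_or_path': dataset_id_or_path,
        'subsets': subsets,
        'preprocess_func': preprocess_func,
        **kwargs,  # extra metadata such as split or tags
    }


def identity(dataset: Any) -> Any:
    """Placeholder preprocess function."""
    return dataset


register_toy_dataset('latex-ocr-print', 'AI-ModelScope/LaTeX_OCR', ['full'], identity,
                     split=['validation', 'test'])

info = TOY_DATASET_MAPPING['latex-ocr-print']
print(info['dataset_id_or_path'], info['split'])
```

A lookup by name is then all that the command-line argument needs: the string given to --dataset selects one recipe from the mapping.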

The args.dataset argument is consumed in the _get_train_val_dataset function:

sft_main = get_sft_main(SftArguments, llm_sft)


def llm_sft(args: SftArguments) -> Dict[str, Any]:
    # ...
    train_dataset, val_dataset = prepare_dataset(args, template, msg)  # call


def prepare_dataset(args, template: Template, msg: Optional[Dict[str, Any]] = None):
    # ...
    train_dataset, val_dataset = _get_train_val_dataset(args)  # call


def _get_train_val_dataset(args: SftArguments) -> Tuple[HfDataset, Optional[HfDataset]]:
    # ...
    train_dataset, val_dataset = get_dataset(
        args.dataset,
        args.dataset_test_ratio,
        args.dataset_seed,
        check_dataset_strategy=args.check_dataset_strategy,
        model_name=args.model_name,
        model_author=args.model_author,
        streaming=args.streaming,
        streaming_val_size=args.streaming_val_size,
        streaming_buffer_size=args.streaming_buffer_size)

The call chain is swift/llm/sft.py#llm_sft() -> prepare_dataset() -> _get_train_val_dataset() -> get_dataset().

In swift/llm/utils/dataset.py:

def get_dataset(
        dataset_name_list: Union[List[str], str],
        dataset_test_ratio: float = 0.,
        dataset_seed: Union[int, RandomState] = 42,
        check_dataset_strategy: Literal['none', 'discard', 'error', 'warning'] = 'none',
        *,
        # for self-cognition
        model_name: Union[Tuple[str, str], List[str], None] = None,
        model_author: Union[Tuple[str, str], List[str], None] = None,
        **kwargs) -> Tuple[DATASET_TYPE, Optional[DATASET_TYPE]]:
    """Returns train_dataset and val_dataset"""
    # ...
    if isinstance(dataset_name_list, str):
        dataset_name_list = [dataset_name_list]
    # ...
    # dataset_id_or_path -> dataset_name
    dataset_name_list = _dataset_id_to_name(dataset_name_list)

The call to _dataset_id_to_name() leads to the following steps:

Calls register_dataset_info()
Calls register_local_dataset()
Calls register_dataset()
Calls get_local_dataset()
Calls load_dataset_from_local()
Handles .jsonl, .json, and .csv files
Or calls preprocess_func()

That is:

if dataset_path.endswith('.csv'):
    dataset = HfDataset.from_csv(dataset_path, na_filter=False)
elif dataset_path.endswith('.jsonl') or dataset_path.endswith('.json'):
    dataset = HfDataset.from_json(dataset_path)
else:
    raise ValueError('The custom dataset only supports CSV, JSONL or JSON format.')
dataset = preprocess_func(dataset)
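The same dispatch-by-extension idea can be tried without any third-party package. The sketch below is a standard-library approximation that returns plain lists of dictionaries instead of an HfDataset; the function name load_rows and the demo file names are hypothetical, and only the error message is taken from the excerpt above.

```python
import csv
import json
import os
import tempfile
from typing import Any, Dict, List


def load_rows(dataset_path: str) -> List[Dict[str, Any]]:
    """Load a local dataset file into a list of dicts, chosen by file extension."""
    if dataset_path.endswith('.csv'):
        with open(dataset_path, newline='', encoding='utf-8') as f:
            return list(csv.DictReader(f))
    if dataset_path.endswith('.jsonl'):
        with open(dataset_path, encoding='utf-8') as f:
            return [json.loads(line) for line in f if line.strip()]
    if dataset_path.endswith('.json'):
        with open(dataset_path, encoding='utf-8') as f:
            return json.load(f)
    raise ValueError('The custom dataset only supports CSV, JSONL or JSON format.')


# Demo with a temporary jsonl file.
tmp_dir = tempfile.mkdtemp()
jsonl_path = os.path.join(tmp_dir, 'demo.jsonl')
with open(jsonl_path, 'w', encoding='utf-8') as f:
    f.write(json.dumps({'query': 'q1', 'response': 'r1'}) + '\n')
    f.write(json.dumps({'query': 'q2', 'response': 'r2'}) + '\n')
rows = load_rows(jsonl_path)
print(rows)

# Any other extension is rejected before the file is even opened.
try:
    load_rows(os.path.join(tmp_dir, 'demo.parquet'))
    unsupported_error = None
except ValueError as err:
    unsupported_error = str(err)
print(unsupported_error)
```

The rejected .parquet path matches the observation in section 1: files cached in other formats have to be converted to CSV, JSON, or JSONL before they can be passed as a custom local dataset.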
