
Three Approaches to Text Similarity: Rules, Vectors, and LLM-as-a-Judge

news 2026/6/2 4:21:00

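Text Similarity Computation

Background

Many tasks require measuring how similar (or how related) two strings are:
For example, in the retrieval stage of a RAG question-answering system, the user's query is scored against every text in the corpus and the top-K candidates are returned.
When evaluating LLM output, the generated text must likewise be compared with the expected answer for similarity or relevance.
Both information retrieval and the evaluation of generative LLMs therefore depend on text similarity algorithms, which makes them useful across many applications.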

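Introduction

A typical case is measuring how close an LLM's output is to a predefined answer.
This article introduces three families of methods for scoring the similarity of two strings: rules, vectors, and LLM-as-a-judge.

Rules: similarity based on n-gram overlap; common algorithms are ROUGE and BLEU;
Vectors: encode each string as a vector with a popular embedding model (Jina), then compute the similarity between the two vectors;
LLM-as-a-judge: use a large language model to assess how related the two strings are.

In short, the article covers three ways to score the similarity of two strings: n-gram based rule algorithms (such as ROUGE and BLEU), encoding text as vectors with an embedding model and computing cosine similarity, and asking an LLM to judge relevance directly. It walks through the implementation details and applicable scenarios of each, with Python examples.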

Rules

Find a metric on the Hub

This article focuses on the Metric type of evaluation:

Metric: measures the performance of a model on a given dataset, usually by comparing the model’s predictions to some ground truth labels – these are covered in this space.

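Install the packages. The google_bleu metric used below relies on nltk, so install it as well:

pip install transformers evaluate nltk

Many NLP evaluation methods are published through the evaluate package.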

For more examples, see the google_bleu page: https://huggingface.co/spaces/evaluate-metric/google_bleu
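Loading a metric through evaluate downloads it from the Hugging Face Hub, so a proxy may be needed in some network environments. Possible workarounds:

Switch your proxy client to global mode, or set the proxy variables before importing evaluate (adjust the address and port to your own proxy):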

import os
os.environ['HTTP_PROXY'] = 'http://127.0.0.1:7890'
os.environ['HTTPS_PROXY'] = 'http://127.0.0.1:7890'

import evaluate
google_bleu = evaluate.load("google_bleu")

sentence1 = "the cat sat on the mat"
sentence2 = "the cat ate the mat"
result1 = google_bleu.compute(predictions=[sentence1], references=[[sentence2]])
print(result1)
# result1 {'google_bleu': 0.3333333333333333}

result2 = google_bleu.compute(predictions=[sentence1], references=[[sentence1]])
print(result2)
# result2 {'google_bleu': 1.0}

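Note: references is a nested, two-dimensional list.

references is two-dimensional because one question may have several acceptable answers; each prediction is scored against whichever of its references matches best.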

predictions = ["The cat is on the mat."]
references = [["The cat is on the mat.", "There is a cat on the mat."]]
print(google_bleu.compute(predictions=predictions, references=references))
# >>> {'google_bleu': 1.0}
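
To make the scores above less opaque, here is a minimal pure-Python sketch of the idea behind the metric: count all n-grams of order 1 to 4, take the clipped overlap, and divide by the larger of the two totals (which equals the minimum of precision and recall). The names ngram_counts and sentence_gleu are illustrative, not part of evaluate; the sketch splits on whitespace only and scores one sentence at a time, so it will not match the library on every input.

```python
from collections import Counter


def ngram_counts(tokens, max_n=4):
    """Count every n-gram of order 1..max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def sentence_gleu(prediction, references, max_n=4):
    """Best min(precision, recall) of n-gram overlap across references."""
    pred = ngram_counts(prediction.split(), max_n)
    best = 0.0
    for reference in references:
        ref = ngram_counts(reference.split(), max_n)
        matches = sum((pred & ref).values())  # clipped n-gram overlap
        # Dividing by the larger total gives min(precision, recall).
        denominator = max(sum(pred.values()), sum(ref.values()))
        if denominator:
            best = max(best, matches / denominator)
    return best


# 6 matching n-grams out of max(18, 14) gives 1/3, as in the first example.
print(sentence_gleu("the cat sat on the mat", ["the cat ate the mat"]))
```

With several references, the loop keeps the highest score, which is why an exact match among the references yields 1.0.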

Here is a Chinese example:

google_bleu.compute(predictions=["我爱你"], references=[["我爱我的祖国"]])
# >>> {'google_bleu': 0.0}

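The strings 我爱你 and 我爱我的祖国 share the characters 我 and 爱, yet the score is 0.
As this shows, google_bleu does not natively support Chinese: English text can be split on spaces, but Chinese has no spaces between words, so each unsegmented sentence is treated as a single token.
For example, ["我爱我的祖国"] could be segmented as:

["我 爱 我 的 祖 国"], one character per token, or
["我 爱 我 的 祖国"], with no space inside 祖国.

Keeping 祖国 ("homeland") as one word is clearly better; splitting it into 祖 and 国 loses the original meaning.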

google_bleu.compute(predictions=["我 爱 你"], references=[["我 爱 我 的 祖 国"]])
# >>> {'google_bleu': 0.16666666666666666}

google_bleu.compute(predictions=["我 爱 你"], references=[["我 爱 我 的 祖国"]])
# >>> {'google_bleu': 0.21428571428571427}

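Appropriate Chinese word segmentation can raise the google_bleu score. As shown above, once 祖国 is a single token the score rises from about 0.17 to 0.21: the reference now contains fewer n-grams (14 instead of 18) while the number of matching n-grams stays at 3.
To try Chinese word segmentation, install jieba with pip install jieba; it also supports adding new words to its dictionary.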
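
The space-separated strings used above can be produced programmatically. The sketch below is one possible helper (to_spaced is a hypothetical name, not a library function): it takes any function that turns a string into a list of tokens and joins the tokens with spaces. By default it falls back to one token per character; passing jieba's segmenter is shown only in comments, since the exact segmentation depends on jieba's dictionary.

```python
def to_spaced(text, cut=list):
    """Segment text with `cut` and join the tokens with single spaces."""
    tokens = [token for token in cut(text) if token.strip()]
    return " ".join(tokens)


# Character-level fallback: every character becomes its own token.
print(to_spaced("我爱我的祖国"))  # 我 爱 我 的 祖 国

# With jieba installed, pass its list-returning segmenter instead:
#   import jieba
#   jieba.add_word("祖国")  # optionally register a domain-specific word
#   to_spaced("我爱我的祖国", cut=jieba.lcut)
```

Apply the same helper to both predictions and references, so that the two sides are tokenized consistently before they are passed to google_bleu.compute.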

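Vectors

Use a trained embedding model to encode each text as a vector, then compute the cosine similarity of the two vectors.
An introduction to jina-embeddings-v2-base-zh is available at https://modelscope.cn/models/jinaai/jina-embeddings-v2-base-zh

A simple example: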

!pip install modelscope
from modelscope import AutoModel
from numpy.linalg import norm

cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))
# trust_remote_code is needed to use the encode method
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)
embeddings = model.encode(['How is the weather today?', '今天天气怎么样?'])
print(cos_sim(embeddings[0], embeddings[1]))

from numpy.linalg import norm
from modelscope import AutoModel

# Cosine similarity of two 1-D vectors
cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))

# Load the model
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)

# Input string and candidate strings
input_string = 'How is the weather today?'
candidates = ['今天天气怎么样?', '我今天很高兴', '天气预报说今天会下雨', '你最喜欢的颜色是什么?']

# Embed the input string
input_embedding = model.encode([input_string])[0]

# Embed the candidate strings
candidate_embeddings = model.encode(candidates)

# Compute the similarities and sort from highest to lowest
similarities = [cos_sim(input_embedding, candidate_embedding) for candidate_embedding in candidate_embeddings]
sorted_candidates = sorted(zip(candidates, similarities), key=lambda x: x[1], reverse=True)

# Print the ranked results
for candidate, similarity in sorted_candidates:
    print(f"({input_string} - {candidate}), Similarity: {similarity:.4f}")

The code above computes the cosine similarity between input_string and each string in candidates, and prints the results from highest to lowest:

Downloading Model to directory: C:\Users\user_name\.cache\modelscope\hub\jinaai/jina-embeddings-v2-base-zh
(How is the weather today? - 今天天气怎么样?), Similarity: 0.7861
(How is the weather today? - 天气预报说今天会下雨), Similarity: 0.5470
(How is the weather today? - 我今天很高兴), Similarity: 0.4202
(How is the weather today? - 你最喜欢的颜色是什么?), Similarity: 0.1032
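
The same ranking logic can be tried without downloading a model. The sketch below spells out the cosine formula with the standard library only and returns the top-K candidates, as a retrieval stage would. The three-dimensional vectors are made up for illustration and are not real embeddings; cosine and top_k are illustrative names.

```python
import math


def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k(query, candidates, k):
    """Return the k (name, score) pairs most similar to query."""
    scored = [(name, cosine(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up 3-dimensional vectors standing in for real embeddings.
query = [1.0, 0.0, 1.0]
candidates = {
    "same direction": [2.0, 0.0, 2.0],
    "partly aligned": [1.0, 1.0, 0.0],
    "orthogonal": [0.0, 1.0, 0.0],
}
for name, score in top_k(query, candidates, k=2):
    print(f"{name}: {score:.4f}")
```

Because cosine similarity ignores vector length, the candidate pointing in the same direction as the query ranks first even though its magnitude differs.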

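LLM-as-a-Judge

Writing a rule-based program to evaluate outputs is very challenging. Traditional metrics based on the similarity between an output and a reference answer (for example ROUGE and BLEU) are also ineffective in such cases.^[1]

In complex scenarios where neither rule-based metrics nor vector similarity works well, an LLM can be asked to make the judgment.

A reference prompt: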

JUDGE_PROMPT = """
You will be given a user_question and system_answer couple.
Your task is to provide a 'total rating' scoring how well the system_answer answers the user concerns expressed in the user_question.
Give your answer as a float on a scale of 0 to 10, where 0 means that the system_answer is not helpful at all, and 10 means that the answer completely and helpfully addresses the question.

Provide your feedback as follows:

Feedback:::
Total rating: (your rating, as a float between 0 and 10)

Now here are the question and answer.

Question: {question}
Answer: {answer}

Feedback:::
Total rating: """
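
A prompt like this still needs two pieces of plumbing: filling in the placeholders and turning the model's reply into a number. The sketch below shows one way to do both. PROMPT_TEMPLATE is a shortened stand-in for the prompt above, and ask_llm is a hypothetical callable (prompt string in, reply string out) that you would replace with a call to whichever model you use; no real LLM client is assumed.

```python
import re

# Shortened stand-in for the judge prompt; only the placeholders matter here.
PROMPT_TEMPLATE = "Question: {question}\nAnswer: {answer}\nFeedback:::\nTotal rating: "


def build_prompt(question, answer):
    """Fill the question/answer placeholders of the template."""
    return PROMPT_TEMPLATE.format(question=question, answer=answer)


def parse_rating(reply):
    """Extract the first number in the reply; None if absent or off-scale."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    rating = float(match.group())
    return rating if 0.0 <= rating <= 10.0 else None


def judge(question, answer, ask_llm):
    """ask_llm is a hypothetical callable: prompt string in, reply string out."""
    return parse_rating(ask_llm(build_prompt(question, answer)))


# A fake model that always replies " 7.5", just to exercise the plumbing.
print(judge("What is BLEU?", "An n-gram overlap metric.", lambda prompt: " 7.5"))
```

Returning None for unparseable or out-of-range replies lets the caller decide whether to retry or discard the sample, rather than silently recording a wrong score.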

References

[1] Using LLM-as-a-judge 🧑‍⚖️ for an automated and versatile evaluation
[2] https://github.com/huggingface/evaluate
