
chatGPT2: How to build an AI that can answer questions about your website using embeddings

news 2026/4/26 1:14:37

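This approach does not seem very useful at the moment, because customers can ask general-purpose ChatGPT directly and get up-to-date information about your site, unless ChatGPT cannot access your site.
That said, embeddings are still quite useful for tasks such as automated booking and ticket purchasing.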

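What are embeddings?

OpenAI embeddings are a technique for converting text, code, or other kinds of data into numeric vectors. These vectors capture the key features and meaning of the original data, so that computers and algorithms can process and analyze it more effectively.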
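To make the idea concrete, here is a minimal sketch of how closeness between embedding vectors is usually measured, namely with cosine similarity. The three-dimensional vectors and the example phrases are made up for illustration; real vectors returned by the API are much longer (1536 values for text-embedding-ada-002).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, not real API output
toy_vectors = {
    "refund policy": [0.9, 0.1, 0.0],
    "how to get my money back": [0.8, 0.2, 0.1],
    "office opening hours": [0.0, 0.2, 0.9],
}

query = toy_vectors["how to get my money back"]
for text in ("refund policy", "office opening hours"):
    # Texts with related meaning should score closer to 1
    print(text, round(cosine_similarity(query, toy_vectors[text]), 3))
```

In this toy setup the refund-related phrase scores much higher against the query than the unrelated one, which is the property a search over embeddings relies on.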

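Passing data to a model usually involves the following steps:

Data preparation: First, prepare the data you want to analyze or process. This may be text, images, audio, or other types of data.
Format conversion: Convert the data into a format the model can understand. For text data, this usually means converting it to a string.
Calling the API: If you use the OpenAI API, format your data as the API requires and send it in an HTTP request. This usually involves writing some code in a language such as Python.
Handling the response: The model processes your data and returns a result. For the Embeddings API the result is a numeric vector; other endpoints return other kinds of data, such as generated text or images.
Post-processing: Depending on your needs, you may have to process or analyze the model's output further.

The process varies with the type of data and the use case, but the basic principles are similar. If you want to know how to use a specific OpenAI model or API, the best approach is usually to consult its official documentation, which has detailed instructions and sample code.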
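The steps above can be sketched for the Embeddings API as follows. This is a minimal sketch, assuming the `openai` Python package at version 1.0 or later; the helper name `embed_texts` is hypothetical, and the client is passed in as a parameter so the function can be exercised without network access.

```python
def embed_texts(client, texts, model="text-embedding-ada-002"):
    """Send a list of strings to the Embeddings API and return one vector per string."""
    # Calling the API: one request carries the whole list of inputs
    response = client.embeddings.create(model=model, input=texts)
    # Handling the response: each item has an .index and an .embedding (list of floats);
    # sorting by index keeps the vectors aligned with the input order
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]


# Usage (needs an API key, so it is left commented out):
# from openai import OpenAI
# client = OpenAI()  # reads the OPENAI_API_KEY environment variable
# vectors = embed_texts(client, ["What are your opening hours?"])
# print(len(vectors[0]))
```

Batching several strings into one request reduces the number of HTTP calls; post-processing (for example storing the vectors in a CSV file) can then work on the returned list.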

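Steps to build an AI that can answer questions about your website:

1 Crawl the website

Start from the root URL, visit each page, look for further links, and visit those pages as well.
The crawler walks through all reachable links and converts each page into a text file (with the HTML tags stripped).
(If the content is too long, it has to be split into smaller chunks.)
Convert the result into a CSV structure.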
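The crawl described above can be sketched with the Python standard library alone. This is a simplified illustration, not the crawler used by the original tutorial: all names (`PageParser`, `crawl`, `split_into_chunks`, `pages_to_csv`) are hypothetical, chunk size is measured in words as a stand-in for tokens, and robots.txt handling and rate limiting are left out.

```python
import csv
import io
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collects visible text and href targets from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links, self.text_parts, self._skip = [], [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>
        if not self._skip and data.strip():
            self.text_parts.append(data.strip())


def fetch_html(url):
    """Default fetcher (assumption: pages are UTF-8 encoded HTML)."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="ignore")


def crawl(root_url, fetch=fetch_html, max_pages=50):
    """Breadth-first crawl of same-domain pages; returns {url: plain_text}."""
    domain = urlparse(root_url).netloc
    queue, seen, pages = deque([root_url]), {root_url}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to load
        parser = PageParser()
        parser.feed(html)
        pages[url] = " ".join(parser.text_parts)
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


def split_into_chunks(text, max_words=200):
    """Split long text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def pages_to_csv(pages, max_words=200):
    """Serialize all chunks as CSV text with 'url' and 'text' columns."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["url", "text"])
    for url, text in pages.items():
        for chunk in split_into_chunks(text, max_words):
            writer.writerow([url, chunk])
    return buf.getvalue()
```

Calling `pages_to_csv(crawl("https://example.com/"))` would produce the CSV text for a real site; the `fetch` parameter exists so the crawl logic can be tried against an in-memory dictionary of pages first.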

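2 Use the Embeddings API to convert the crawled pages into embeddings

The first step is to convert the stored embeddings into NumPy arrays (other representations work too).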

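import ast

import numpy as np
import pandas as pd
# Legacy helper, shipped only with openai<1.0:
# from openai.embeddings_utils import distances_from_embeddings

df = pd.read_csv('processed/embeddings.csv', index_col=0)
# Each cell holds the vector as a string; parse it safely, then wrap it in an array
df['embeddings'] = df['embeddings'].apply(ast.literal_eval).apply(np.array)
df.head()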
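Why the parsing step is needed may be easier to see on a tiny example. The sketch below uses the standard `csv` module in place of `pd.read_csv`, and the two rows are invented data: a vector written to CSV comes back as a plain string and has to be parsed before NumPy can work with it.

```python
import ast
import csv
import io

import numpy as np

# Hypothetical two-row file, written to memory instead of disk
raw = io.StringIO()
writer = csv.writer(raw)
writer.writerow(["text", "embeddings"])
writer.writerow(["hello", str([0.1, 0.2, 0.3])])
writer.writerow(["world", str([0.4, 0.5, 0.6])])
raw.seek(0)

rows = list(csv.DictReader(raw))
print(type(rows[0]["embeddings"]))  # the vector is still a string here

# Parse each string into a list of floats, then stack the vectors into one matrix
vectors = [np.array(ast.literal_eval(row["embeddings"])) for row in rows]
matrix = np.vstack(vectors)
print(matrix.shape)
```

Stacking the vectors into a single matrix makes it possible to compare a question against every stored chunk in one vectorized operation.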

About NumPy arrays: here is a simple example using NumPy, a powerful library for numerical processing in Python. It demonstrates how to create a NumPy array and perform some basic operations:

```python
import numpy as np

# Creating a simple NumPy array
array = np.array([1, 2, 3, 4, 5])
print("Original Array:", array)

# Performing basic operations
# Adding a constant to each element
added_array = array + 10
print("Array after adding 10 to each element:", added_array)

# Multiplying each element by 2
multiplied_array = array * 2
print("Array after multiplying each element by 2:", multiplied_array)

# Computing the mean of the array
mean_value = np.mean(array)
print("Mean of the array:", mean_value)

# Reshaping into a 2x2 matrix
# Note: The total number of elements must remain the same, so only the
# first four of the five elements are used here.
reshaped_array = np.reshape(array[:4], (2, 2))
print("Reshaped array into a 2x2 matrix:\n", reshaped_array)
```

In this example, we first import the NumPy library. Then we create a basic array and perform operations like addition, multiplication, and calculating the mean. Finally, we reshape the first four elements into a 2x2 matrix, because the reshape function requires the total number of elements to remain the same.

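3 Create a basic search function that lets users ask questions about the embedded information

Now that the data is ready, the next step is to generate a natural answer from the retrieved text.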

import numpy as np
from openai import OpenAI

# Assumes openai>=1.0; the client reads the OPENAI_API_KEY environment variable
client = OpenAI()

def create_context(question, df, max_len=1800, size="ada"):
    """Create a context for a question by finding the most similar context from the dataframe"""
    # Get the embedding for the question
    q_embedding = client.embeddings.create(
        input=question, model='text-embedding-ada-002'
    ).data[0].embedding

    # Cosine distance between the question and every stored embedding
    # (computed with NumPy in place of the legacy distances_from_embeddings helper)
    doc_matrix = np.vstack(df['embeddings'].values)
    q = np.array(q_embedding)
    df['distances'] = 1 - (doc_matrix @ q) / (
        np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(q)
    )

    returns = []
    cur_len = 0
    # Sort by distance and add the text to the context until the context is too long
    for i, row in df.sort_values('distances', ascending=True).iterrows():
        # Add the length of the text to the current length
        cur_len += row['n_tokens'] + 4
        # If the context is too long, break
        if cur_len > max_len:
            break
        # Else add it to the text that is being returned
        returns.append(row["text"])

    # Return the context
    return "\n\n###\n\n".join(returns)
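The selection logic in `create_context` can be tried without an API key. The following sketch re-implements only the ranking and the token budget in plain Python; the function name `select_context`, the two-dimensional vectors, and the `n_tokens` values are all made up for illustration.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norms


def select_context(q_embedding, rows, max_len=1800):
    """rows: list of dicts with 'text', 'n_tokens' and 'embedding' keys."""
    ranked = sorted(rows, key=lambda r: cosine_distance(q_embedding, r["embedding"]))
    picked, cur_len = [], 0
    for row in ranked:
        # Same budget rule as create_context: token count plus 4 for the separator
        cur_len += row["n_tokens"] + 4
        if cur_len > max_len:
            break
        picked.append(row["text"])
    return "\n\n###\n\n".join(picked)


# Hypothetical chunks with toy embeddings
rows = [
    {"text": "Refunds are issued within 14 days.", "n_tokens": 8, "embedding": [0.9, 0.1]},
    {"text": "We are open 9 to 5.", "n_tokens": 8, "embedding": [0.1, 0.9]},
    {"text": "Contact support to request a refund.", "n_tokens": 8, "embedding": [0.8, 0.3]},
]
context = select_context([1.0, 0.0], rows, max_len=25)
print(context)
```

With a budget of 25 the two refund-related chunks fit and the unrelated one is cut off, which shows how `max_len` trades context size against relevance.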

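The answer prompt tries to extract relevant facts from the retrieved context in order to form a coherent answer. If the context contains no relevant answer, the prompt instructs the model to reply "I don't know".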

# Assumes an openai>=1.0 client named `client` (e.g. client = OpenAI())
# and the create_context() function defined earlier.
def answer_question(
    df,
    model="gpt-3.5-turbo",
    question="Am I allowed to publish model outputs to Twitter, without a human review?",
    max_len=1800,
    size="ada",
    debug=False,
    max_tokens=150,
    stop_sequence=None,
):
    """Answer a question based on the most similar context from the dataframe texts"""
    context = create_context(question, df, max_len=max_len, size=size)
    # If debug, print the retrieved context
    if debug:
        print("Context:\n" + context)
        print("\n\n")

    try:
        # Create a chat completion using the question and context
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "Answer the question based on the context below, and if the question can't be answered based on the context, say \"I don't know\"\n\n"},
                {"role": "user", "content": f"Context: {context}\n\n---\n\nQuestion: {question}\nAnswer:"},
            ],
            temperature=0,
            max_tokens=max_tokens,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
            stop=stop_sequence,
        )
        # The reply text lives in message.content
        return response.choices[0].message.content.strip()
    except Exception as e:
        print(e)
        return ""
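When an answer looks wrong, it can help to inspect exactly what is sent to the model. One possible approach is to pull the prompt construction out of `answer_question` into a small helper; `build_messages` and `SYSTEM_PROMPT` are hypothetical names introduced here, while the prompt wording is copied from the function above.

```python
SYSTEM_PROMPT = (
    "Answer the question based on the context below, and if the question "
    "can't be answered based on the context, say \"I don't know\"\n\n"
)


def build_messages(context, question):
    """Build the chat messages list from retrieved context and a user question."""
    user_content = f"Context: {context}\n\n---\n\nQuestion: {question}\nAnswer:"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]


# Example with made-up context text
messages = build_messages("We are open 9 to 5.", "When do you open?")
for message in messages:
    print(message["role"], "->", repr(message["content"]))
```

Keeping the prompt in one place also makes it easier to adjust the fallback wording without touching the API call.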

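Testing the Q&A system

Run some tests to check the quality of the output.
If the system cannot answer a question you expected it to handle, it is worth searching the raw text files to check whether the information you expected it to know actually ended up being embedded.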
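A check of this kind could look like the sketch below. It assumes the crawled chunks are available as a list of strings; the helper name `find_chunks` and the sample chunks are hypothetical.

```python
def find_chunks(chunks, phrase):
    """Return (index, chunk) pairs whose text contains phrase, ignoring case."""
    needle = phrase.lower()
    return [(i, chunk) for i, chunk in enumerate(chunks) if needle in chunk.lower()]


# Made-up chunks standing in for the crawled text
chunks = [
    "Our store is open Monday to Friday.",
    "Refunds are issued within 14 days of purchase.",
    "Contact support for shipping questions.",
]

hits = find_chunks(chunks, "refund")
print(hits if hits else "Phrase not found: the page may not have been crawled or embedded.")
```

An empty result points to a problem in the crawl or chunking step rather than in the prompt or the model.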
