当前位置：首页 > news >正文

2024-10月的“冷饭热炒“--解读GUI Agent 之computer use？phone use?——多模态大语言模型的应用进阶之路

news 2026/2/9 18:19:06

GUI Agent 之computer use？phone use?——多模态大语言模型的进阶之路

1.最新技术事件浅析
- 三、思考和方案设计
- 工具代码部分
- - 1.提示词
  - 2.工具类API定义，这里主要看computer tool就够了
总结

本文会总结概括这一应用的利弊，然后给出分析和工具代码部分解析，以及优化方案，相关项目其实多不胜数，去年就有个open-interperter、autogpt\metagpt、s-agent等等相关一直迭代的agent开源项目，只能说是多模态大模型的能力相比去年有了质变。在纯应用层面目前看来，高级多模态自动化和人类交互记忆机制是两个开源项目类型的热点。从算法模型层面看，是多模态输入和输出（语音、文本、视频、图像）和强推理、工具调用方面是模型卷的点，比上下文、长输出热度会高，智源和智谱近期的工作谁是国产META呢？我个人偏向前者。

1.最新技术事件浅析

事件1. 2024年10月 Anthropics放出了最新的claude3.5 sonnect，编码、复杂推理能力大幅提升，确实很强，其中最吸引人眼球的是展示了computer use的demo演示，虽然很初级，但是未来可期的意思。
事件2. 前后时间内，智谱也发布了autoglm和autoweb系列，展示了web use和phone use的演示，其中autoweb作为chrome商店可以插件下载体验，看仓库是空的估计这块他们也做了小半年了，目前在开始发布阶段。

“冷饭热炒”——AI Agent(LLM)&&RPA
实际上，从2023年就一直涌现相关开源项目，通过软硬件交互的RPA、自动化执行指令相关技术一直都有，唯一区别：多模态大语言模型能力的进化，帮助了人们可以通过截屏视觉和自然语言交互，来更准确执行软硬件的功能函数指令。

			导致一切变化，归因“大模型的能力迭代”

因此不论是哪家的相关技术，本质仍是多模态大语言模型起到决定性作用，这一应用点取决于视觉理解屏幕信息和 LLM决策能力（Agent——主要是通过工具调用来决定执行那个操作函数，如鼠标左键点击、右键点击），这是一种多模态大模型通过感知屏幕和用户提问，走出思考，一步步调用预定义好的指令函数来完成回答的一个过程。

快速总结
其实几个核心难点分别在于：大模型思考能力（特别是function call工具调用能力）、截屏视觉分析能力（视觉模型要进行精准的语义分析和像素坐标）、需要适配操作场景中的各种软件APP的交互。
从算法层面来说，其实关键问题仍旧是得有一个强大的LLM能够精准理解和执行，这对模型能力要求很高，目前来看几B参数的模型很难做到，但是如果聚焦场景微调也不是不可行。
因此claude3.5这个demo实际上仍旧是对自己模型能力的一个宣传，在评测报告中其复杂COT能力提高，正是对于GUI-AGENT应用一个关键助力，同理，智谱也如此。

我自费一个api key，用官方docker体验了下，自媒体上面也有很多总体在它沙盒里面感觉还可以，（这里我推荐一个GIT项目——agent.exe，用JS做的DEMO）

试用场景和使用限制：假设在模型能力达标前提下，电脑低风险任务、简单明确的规则场景，具体怎么理解：比如说订票会给你订错、安装软件因为网络、条款等等使用问题，无法识别和进行后续操作，总之BAD CASE超多，但是claude3.5 确实强，后面需要进行每个软件适配。

三、思考和方案设计

不考虑适配问题，我们的核心是AI AGENT（这里的AGENT更简单，加一些提示词规则和工具函数调用配置），那么最终问题还是模型本身——工具调用能力和COT能力。

那么最终落到自研角度，最关键的问题——首先，如何拥有或者训练一个这种类似能力的模型去平替？有感兴趣的朋友可以用一些开源的VL去试试，我简单放了个截图给自己微调的VL-7B模型进行一个试探，就想看看小模型能不能微调来适应这个场景
在这里插入图片描述
我们看到一般的视觉语言模型是有基础的视觉定位能力、逻辑思考能力，但是还远未到标准，所以要是想不通过三方实现，训练微调这些函数调用、多模态视觉这个门槛是一定要迈的。
这里我测试了两种DEMO，一种是按照官方DEMO DOCKER沙盒里面的系统进行操作，理论上是最好的。另一种是脱离Docker容器化，那么就是直接在用户系统上进行交互，并且要将代码部分函数操作变成windows的函数，实际上还是很不稳定：
这里要强调下系统操作时候打开是有区别的：我发现在windows上实现DEMO，双击和左键点击混淆严重，应该是官方linux虚拟机中左键为打开，windows 双击。
比如说输入网址操作时候，你的界面上有多个文本框，就会识别错误，更严重的是帮你执行了比较有风险的操作，所以大家用时候还是要监控下，还有很多AGENT的常规缺陷（在过程中遇到和预期不符的结果进入更多的推理，最终结果仍然无效化），下面展示下官方的例子，理论上也是最好的

1.我输入一个问题，看是否能帮我正确执行？
在这里插入图片描述
2.Claude 会逐步分析思考，因为是基于视觉，所以每次第一步都会截图去分析图标语义和像素坐标信息，调用screenshot函数

3.根据截屏图像理解，找到firefox，根据坐标移动mouse_move函数调用
在这里插入图片描述
4.进行点击
5.点击火狐后，模型做出了思考，需要继续点击地址栏，再输入网址，继续第一步截屏，需要分析画面哪个是地址栏

6.因为我搜索过有历史记录，所以模型发现可以偷懒，直接触发，所以他分析了CSDN的坐标位置，调用Mouse_move移动鼠标
在这里插入图片描述
7.继续左键点击函数调用，就成功进去了

在这里插入图片描述
8.然后就继续思考要执行我的问题关键信息”Mixture-of-Depths“的问题，还是每次执行前必须要进行截图

在这里插入图片描述
9.根据图像信息理解，定位到搜索框进行输入词语查找，所以继续mouse_move到搜索框坐标那里

10.同理，鼠标点击

11.打印相关信息

12.执行等待返回结果

13.这里貌似漏了个截屏分析，不重要，继续执行鼠标移动到BOX位置
在这里插入图片描述
14.左键点击

15.这里应该是截屏分析最终结果

最终就是根据屏幕信息进行总结，但是由于是视觉方案，所以仅针对截图部分进行了理解，模型本身还是赞的！

PS：双屏容易发生错误，因为其程序只针对一个屏幕进行截屏识别；
国内你要用速率受限制的那么开发是不太可能了。
非官方docker的个性化用户界面测试
直接给结论：效果很差，这里说一个主要原因就是：
1.每个人的UI界面不一样，主题颜色、界面复杂度等等，这会导致claude模型出现视觉定位或者目标识别的错误
界面复杂度这个很重要，如果你有大量页面打开，大概率会降低截屏识别的准确性。
2.还有一种是模型的思考和视觉界面操作没有对齐：
比如这种很容易发生理解和执行错误，模型操作到下方Chrome了，但是这个时候我已经存在分页了，点击并不会新建，所以这种情况执行会无效化。
3.bad case太多了在自己电脑上体验下就知道了
在这里插入图片描述

三应用可行性思考
1.通过上述，我们可以得到一个经验性结论：强如claude3.5也会有着很多问题，一方面是模型视觉能力和在个性化用户界面存在理解的准确性GAP，即使Claude思考的基本都是对的，但是视觉能力的差异导致整个操作链路崩溃。
2.操作浏览器和软件都需要经过定制化的适配，因为有很多权限或者证书、网络问题
3.那么这个应用没有价值进行落地嘛？
答：不是，恰恰就是没那么成熟，所以目前业界一线巨头都要去冲这个应用，利用云端大模型这种能力进行任何能联网的操作系统GUI设备的适配，电脑、手机等等，智谱已经释放了手机DEMO，这块应用会蔓延，对于个人来说，我比较关心如何能够保持特定场景准确性？如何使用开源VL模型平替（微调）？

目前针对上述问题1 截屏识别优化，简单分享下我的思路：

1.端到端微调操作系统界面的图标，进行训练，适合大模型算法相关人员，需要继续后训练
2.进行检测器+Caption, 不需要训练大模型，进行小模型堆叠，增强截屏识别的准确性，这点微软已经这么做了,omniparser,简单来说通过YOLO检测器训练了Windows的相关图标，能够准确定位识别；同时需要caption模型这样有助于大模型对图标的理解正确，作为一个视觉增强插件，目前结合我实际体会确实这样能Work,对于GPT4、cladue这种模型，他们的文本推理能力并不差。
在这里插入图片描述
这种方案输出：图标类型和坐标，并且进行了丰富语义描述，这样能帮助大模型进行正确的理解，但是如果你这两小模型不行，那么一样会拉垮，所以这种方案输出给的目标是VLM模型还是LLM？我个人觉得仍然要给VLM视觉语言模型，

从实现难度上来看，直接端到端训练是相对比较难的，而第二种方案是风险较小的，但是会增强推理开销。
我们继续思考，那么这套Detect&caption的组合直接给LLM，而不是VLM呢？微软没提这个默认是给各种开源和闭源视觉模型，从综合来看还是扔给VLM，还是缺少完整的图像信息，但是硬给这些文本结构化信息呢？LLM也并不是不行。

工具代码部分

官方代码

其实就是Fuctionc call ,我们主要看下sys prompt和工具怎么写的

1.提示词

# System prompt adapted for Windows environment
SYSTEM_PROMPT = f"""<SYSTEM_CAPABILITY>
* You are utilizing a Windows {platform.machine()} system with internet access.
* You have access to mouse and keyboard control through PyAutoGUI.
* You can perform clicks, type text, and take screenshots.
* The system maintains exact monitor resolution for perfect accuracy.
* Screenshots are taken at full monitor resolution and compressed only if needed.
* The system properly accounts for Windows DPI scaling and taskbar position.
* When using your computer function calls, they take a while to run and send back to you. Where possible/feasible, try to chain multiple of these calls all into one function calls request.
* The current date is {datetime.today().strftime('%A, %B %d, %Y')}.
</SYSTEM_CAPABILITY><IMPORTANT>
* The system maintains exact screen dimensions - no resolution scaling is performed.
* Screenshots are taken at full monitor resolution to maintain perfect accuracy.
* Only compression is applied if needed to stay under size limits.
* The system handles Windows-specific elements like taskbar and DPI scaling.
* When viewing a page it can be helpful to zoom out so that you can see everything on the page.
</IMPORTANT><COORDINATE_HANDLING>
* Coordinates are maintained at exact screen resolution.
* DPI scaling is properly handled for accurate positioning.
* Taskbar position is accounted for in coordinate calculations.
* The system maintains perfect 1:1 pixel mapping with the screen.
* Coordinate translations preserve exact screen positions.
</COORDINATE_HANDLING>"""

Windows系统信息：You are utilizing a Windows {platform.machine()} system with internet access. 这里使用了platform.machine()来获取Windows系统的机器类型。

鼠标和键盘控制：You have access to mouse and keyboard control through PyAutoGUI. PyAutoGUI是一个跨平台的自动化工具，但在Windows环境下使用时，可以进行鼠标点击、键盘输入等操作。

屏幕分辨率和DPI缩放：The system maintains exact monitor resolution for perfect accuracy. 和 The system properly accounts for Windows DPI scaling and taskbar position. 这些提示说明了系统如何处理Windows特有的DPI缩放和任务栏位置。

截图和压缩：Screenshots are taken at full monitor resolution and compressed only if needed. 说明了截图的处理方式，特别是在Windows环境下如何处理全屏截图和压缩。

2.工具类API定义，这里主要看computer tool就够了

ComputerTool：这个工具可能包含了一些特定于Windows的操作，例如使用PyAutoGUI进行鼠标点击、键盘输入、截图等操作。
这里贴下源码：
简单来说就算计算机工具的各种操作函数和动作描述等。

import subprocess
import platform
import pyautogui
import asyncio
import base64
import os
from enum import StrEnum
from pathlib import Path
from typing import Literal, TypedDict
from uuid import uuid4from anthropic.types.beta import BetaToolComputerUse20241022Paramfrom .base import BaseAnthropicTool, ToolError, ToolResult
from .run import runOUTPUT_DIR = "/tmp/outputs"TYPING_DELAY_MS = 12
TYPING_GROUP_SIZE = 50Action = Literal["key","type","mouse_move","left_click","left_click_drag","right_click","middle_click","double_click","screenshot","cursor_position",
]class Resolution(TypedDict):width: intheight: intMAX_SCALING_TARGETS: dict[str, Resolution] = {"XGA": Resolution(width=1024, height=768),  # 4:3"WXGA": Resolution(width=1280, height=800),  # 16:10"FWXGA": Resolution(width=1366, height=768),  # ~16:9
}class ScalingSource(StrEnum):COMPUTER = "computer"API = "api"class ComputerToolOptions(TypedDict):display_height_px: intdisplay_width_px: intdisplay_number: int | Nonedef chunks(s: str, chunk_size: int) -> list[str]:return [s[i : i + chunk_size] for i in range(0, len(s), chunk_size)]class ComputerTool(BaseAnthropicTool):"""A tool that allows the agent to interact with the screen, keyboard, and mouse of the current computer.Adapted for Windows using 'pyautogui'."""name: Literal["computer"] = "computer"api_type: Literal["computer_20241022"] = "computer_20241022"width: intheight: intdisplay_num: int | None_screenshot_delay = 2.0_scaling_enabled = True@propertydef options(self) -> ComputerToolOptions:width, height = self.scale_coordinates(ScalingSource.COMPUTER, self.width, self.height)return {"display_width_px": width,"display_height_px": height,"display_number": self.display_num,}def to_params(self) -> BetaToolComputerUse20241022Param:return {"name": self.name, "type": self.api_type, **self.options}def __init__(self):super().__init__()# Get screen width and height using Windows commandself.width, self.height = self.get_screen_size()self.display_num = None# Path to cliclickself.cliclick = "cliclick"self.key_conversion = {"Page_Down": "pagedown", "Page_Up": "pageup", "Super_L": "win"}async def __call__(self,*,action: Action,text: str | None = None,coordinate: tuple[int, int] | None = None,**kwargs,):if action in ("mouse_move", "left_click_drag"):if coordinate is None:raise ToolError(f"coordinate is required for {action}")if text is not None:raise ToolError(f"text is not accepted for {action}")if not isinstance(coordinate, (list, tuple)) or len(coordinate) != 2:raise ToolError(f"{coordinate} must be a tuple of length 2")if not all(isinstance(i, int) and i >= 0 for i in coordinate):raise ToolError(f"{coordinate} must be a tuple of non-negative ints")x, y = self.scale_coordinates(ScalingSource.API, coordinate[0], coordinate[1])if action == "mouse_move":pyautogui.moveTo(x, y)return ToolResult(output=f"Moved mouse to ({x}, {y})")elif action == "left_click_drag":current_x, current_y = pyautogui.position()pyautogui.dragTo(x, y, duration=0.5)  # Adjust duration as neededreturn ToolResult(output=f"Dragged mouse from ({current_x}, {current_y}) to ({x}, {y})")if action in ("key", "type"):if text is None:raise ToolError(f"text is required for {action}")if coordinate is not None:raise ToolError(f"coordinate is not accepted for {action}")if not isinstance(text, str):raise ToolError(output=f"{text} must be a string")if action == "key":# Handle key combinationskeys = text.split('+')for key in keys:key = self.key_conversion.get(key.strip(), key.strip())pyautogui.keyDown(key)  # Press down each keyfor key in reversed(keys):key = self.key_conversion.get(key.strip(), key.strip())pyautogui.keyUp(key)    # Release each key in reverse orderreturn ToolResult(output=f"Pressed keys: {text}")elif action == "type":pyautogui.typewrite(text, interval=TYPING_DELAY_MS / 1000)  # Convert ms to secondsscreenshot_base64 = (await self.screenshot()).base64_imagereturn ToolResult(output=text, base64_image=screenshot_base64)if action in ("left_click","right_click","double_click","middle_click","screenshot","cursor_position",):if text is not None:raise ToolError(f"text is not accepted for {action}")if coordinate is not None:raise ToolError(f"coordinate is not accepted for {action}")if action == "screenshot":return await self.screenshot()elif action == "cursor_position":x, y = pyautogui.position()x, y = self.scale_coordinates(ScalingSource.COMPUTER, x, y)return ToolResult(output=f"X={x},Y={y}")else:if action == "left_click":pyautogui.click()elif action == "right_click":pyautogui.rightClick()elif action == "middle_click":pyautogui.middleClick()elif action == "double_click":pyautogui.doubleClick()return ToolResult(output=f"Performed {action}")raise ToolError(f"Invalid action: {action}")async def screenshot(self):"""Take a screenshot of the current screen and return a ToolResult with the base64 encoded image."""output_dir = Path(OUTPUT_DIR)output_dir.mkdir(parents=True, exist_ok=True)path = output_dir / f"screenshot_{uuid4().hex}.png"# Take screenshot using pyautoguiscreenshot = pyautogui.screenshot()screenshot.save(str(path))if path.exists():# Return a ToolResult instance instead of a dictionaryreturn ToolResult(base64_image=base64.b64encode(path.read_bytes()).decode())raise ToolError(f"Failed to take screenshot: {path} does not exist.")async def shell(self, command: str, take_screenshot=True) -> ToolResult:"""Run a shell command and return the output, error, and optionally a screenshot."""_, stdout, stderr = await run(command)base64_image = Noneif take_screenshot:# delay to let things settle before taking a screenshotawait asyncio.sleep(self._screenshot_delay)base64_image = (await self.screenshot()).base64_imagereturn ToolResult(output=stdout, error=stderr, base64_image=base64_image)def scale_coordinates(self, source: ScalingSource, x: int, y: int):"""Scale coordinates to a target maximum resolution."""if not self._scaling_enabled:return x, yratio = self.width / self.heighttarget_dimension = Nonefor dimension in MAX_SCALING_TARGETS.values():# allow some error in the aspect ratio - not ratios are exactly 16:9if abs(dimension["width"] / dimension["height"] - ratio) < 0.02:if dimension["width"] < self.width:target_dimension = dimensionbreakif target_dimension is None:return x, y# should be less than 1x_scaling_factor = target_dimension["width"] / self.widthy_scaling_factor = target_dimension["height"] / self.heightif source == ScalingSource.API:if x > self.width or y > self.height:raise ToolError(f"Coordinates {x}, {y} are out of bounds")# scale upreturn round(x / x_scaling_factor), round(y / y_scaling_factor)# scale downreturn round(x * x_scaling_factor), round(y * y_scaling_factor)def get_screen_size(self):if platform.system() == "Windows":# Command to get screen resolution on Windowscmd = "wmic path Win32_VideoController get CurrentHorizontalResolution,CurrentVerticalResolution /format:value"elif platform.system() == "Darwin":  # macOScmd = "system_profiler SPDisplaysDataType | grep Resolution"else:  # Linux or other OScmd = "xrandr | grep '*' | awk '{print $1}'"try:output = subprocess.check_output(cmd, shell=True).decode()if platform.system() == "Windows":lines = output.strip().split('\n')width = height = Nonefor line in lines:if line.strip():  # Check for non-empty linekey, value = line.split('=')value = value.strip()  # Strip whitespace from value# Assign only if value is non-empty and can be converted to intif value and key == 'CurrentHorizontalResolution':width = int(value) if value.isdigit() else Noneelif value and key == 'CurrentVerticalResolution':height = int(value) if value.isdigit() else None# After the loop, check if we have valid width and heightif width is not None and height is not None:print(f"Resolution: {width}x{height}")else:print("Could not retrieve valid resolution values.")elif platform.system() == "Darwin":resolution = output.split()[0]width, height = map(int, resolution.split('x'))else:resolution = output.strip().split()[0]width, height = map(int, resolution.split('x'))return width, heightexcept subprocess.CalledProcessError as e:print(f"Error occurred: {e}")return None, None  # Return None or some default valuesdef get_mouse_position(self):# TODO: enhance this funcfrom AppKit import NSEventfrom Quartz import CGEventSourceCreate, kCGEventSourceStateCombinedSessionStateloc = NSEvent.mouseLocation()# Adjust for different coordinate systemreturn int(loc.x), int(self.height - loc.y)def map_keys(self, text: str):"""Map text to cliclick key codes if necessary."""# For simplicity, return text as is# Implement mapping if special keys are neededreturn text

总结

当前对于自己应用难点就是你需要一个类似claude能力极强的模型，并且其functioncall能力也要很强，然后才可以进行自主开发，模型是第一关，后面可以用视觉方案增强也好，再微调也好都可以，更进一步当你的操作系统有N个tools怎么办呢？这里实际上也有两种业内方案，点赞超100 我补更。
额外就是即使行业内后续多模态做的再准也无法做到人类的满意水准，因为人类的语言其实意图并不具备明确性，比如我对AI说帮我找一个我喜欢的电视剧看，那么大概率他不会找对，因为我自己都不知道我此时此刻想看什么，所以需要模型对交互人有了解和记忆机制回溯，专属模型的需求就会越来越多，那么记忆机制方案和微调模型就会变成重要实现手段。

2024-10月的“冷饭热炒“--解读GUI Agent 之computer use？phone use?——多模态大语言模型的应用进阶之路

GUI Agent 之computer use？phone use?——多模态大语言模型的进阶之路

1.最新技术事件浅析

三、思考和方案设计

工具代码部分

1.提示词

2.工具类API定义，这里主要看computer tool就够了

总结

相关文章：

2024-10月的“冷饭热炒“--解读GUI Agent 之computer use？phone use?——多模态大语言模型的应用进阶之路

sheng的学习笔记-AI基础-激活函数

重构代码之重复的观察数据

SpringBoot【实用篇】- 热部署

C语言核心语法2

【论文阅读】Real-ESRGAN

安达发|零部件APS车间排程系统销售预测的优点

Android 同花顺面经

搜维尔科技：Manus数据手套在水下捕捉精确的手指动作，可以在有水的条件下使用

网络：IP分片和组装

Oracle dblink创建使用

Classic GNNs are Strong Baselines: Reassessing GNNs for Node Classification

Android 字节飞书面经

选择好友窗口（三）

【含文档】基于ssm+jsp的音乐播放系统（含源码+数据库+lw）

【C语言】动态内存开辟

Redis缓存在thinkPHP/fastAdmin框架中的应用

Ceisum无人机巡检视频投放

分享几款开源好用的图片在线编辑，适合做快速应用嵌入

闪存学习_1：Flash-Aware Computing from Jihong Kim

盘古信息PCB行业解决方案：以全域场景重构，激活智造新未来

shell脚本--常见案例

STM32F4基本定时器使用和原理详解

[ICLR 2022]How Much Can CLIP Benefit Vision-and-Language Tasks?

今日科技热点速览

【HarmonyOS 5 开发速记】如何获取用户信息（头像/昵称/手机号）

Angular微前端架构：Module Federation + ngx-build-plus (Webpack)

django blank 与 null的区别

深度学习之模型压缩三驾马车：模型剪枝、模型量化、知识蒸馏

HybridVLA——让单一LLM同时具备扩散和自回归动作预测能力：训练时既扩散也回归，但推理时则扩散