Putting Open-Source Models into Practice - Trying Out GLM Models - glm-4-9b-chat - vLLM Integration (Part 4)

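1. Introduction

    GLM-4 is a foundation model released by the Zhipu AI team on January 16, 2024. It is designed to understand and plan around complex user instructions on its own, and it can invoke a web browser. Its capabilities include data analysis, chart creation, and PPT generation. It supports a 128K context window, which gives it strong long-text handling and recall accuracy, and its Chinese alignment is reported to exceed that of GPT-4. Compared with earlier GLM releases, GLM-4 is reported to improve overall performance by 60%, with notably stronger instruction following and multimodal features, which makes it suitable for a wide range of applications. Although it still trails top international models in some areas, its Chinese-language capability places it among the leading large models in China. Development of the series began in 2020 and went through multiple iterations before arriving at this system.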

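    Part 1 of this series, "glm-4-9b-chat - Quick Start", covered the basics of getting started with glm-4-9b-chat.

    Part 2, "glm-4-9b-chat - Batch Inference", covered batch inference with glm-4-9b-chat.

    Part 3, "glm-4-9b-chat - Gradio Integration", covered how to integrate Gradio for a web-based chat interface.

    This part shows how to integrate vLLM to speed up inference.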


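2. Terminology

2.1. GLM-4-9B

    GLM-4-9B is an open-source pretrained model from Zhipu AI in the GLM-4 series. Released on June 6, 2024, it is designed for efficient language understanding and generation. The series supports long-context input, up to 1M tokens (roughly two million Chinese characters) in the dedicated 1M-context variant. The model has stronger base capabilities, supports 26 languages, and, for the first time in this line, shows notable progress in multimodal capability.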

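The base capabilities of GLM-4-9B include:

- Overall Chinese and English performance reported to improve by 40% over the previous generation, with marked gains in Chinese alignment, instruction following, and engineering code tasks

- Better performance than Llama 3 8B, especially on complex tasks such as math problem solving and code writing

- Stronger function calling, with a reported 40% improvement

- Multi-turn dialogue, plus advanced features such as web browsing, code execution, and custom tool calling, so it can process large amounts of information quickly and give high-quality answers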

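2.2. GLM-4-9B-Chat

    GLM-4-9B-Chat is the chat version of the GLM-4-9B series from Zhipu AI. It is tuned for multi-turn dialogue and adds the advanced features listed above, such as tool calling, which make it more flexible for natural language processing tasks.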

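2.3. vLLM

    vLLM is an open-source inference acceleration framework for large models. It uses PagedAttention to manage the key/value tensors cached by the attention layers efficiently, and is reported to deliver 14-24x higher throughput than HuggingFace Transformers.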
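
To make the idea of paged KV-cache management more concrete, here is a toy sketch in plain Python. It is not vLLM's implementation: the class name `ToyBlockAllocator`, its method names, and the block size of 4 tokens are all hypothetical, chosen only to illustrate how a per-sequence block table maps token positions onto fixed-size physical blocks that are allocated on demand and returned to a free pool when the sequence finishes.

```python
class ToyBlockAllocator:
    """Toy model of paged KV-cache bookkeeping (illustrative only, not vLLM code)."""

    def __init__(self, num_blocks: int, block_size: int = 4):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # ids of unused physical blocks
        self.block_tables = {}                      # seq_id -> list of physical block ids
        self.seq_lens = {}                          # seq_id -> number of cached tokens

    def append_token(self, seq_id: str) -> int:
        """Reserve cache space for one more token and return its physical block id."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        # A new physical block is needed only when the last one is full.
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop(0))
        self.seq_lens[seq_id] = length + 1
        return table[-1]

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


if __name__ == "__main__":
    alloc = ToyBlockAllocator(num_blocks=8, block_size=4)
    for _ in range(6):
        alloc.append_token("seq-a")   # 6 tokens need 2 blocks
    for _ in range(3):
        alloc.append_token("seq-b")   # 3 tokens need 1 block
    print(alloc.block_tables)
    alloc.free("seq-a")
    print(len(alloc.free_blocks))
```

Because memory is handed out one small block at a time, a sequence never reserves space for tokens it has not produced yet, which is the intuition behind the higher batch sizes and throughput.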


3. Prerequisites

3.1. Base environment and prerequisites

     1. Operating system: CentOS 7

     2. GPU: NVIDIA Tesla V100 32GB, CUDA Version: 12.2

     3. Minimum hardware requirements

3.2. Download the model

Hugging Face:

https://huggingface.co/THUDM/glm-4-9b-chat/tree/main

ModelScope:

ModelScope community

The model can also be downloaded with git-lfs.

3.3. Create a virtual environment

conda create --name glm4 python=3.10
conda activate glm4

3.4. Install dependencies

pip install "torch>=2.5.0"
pip install "torchvision>=0.20.0"
pip install "transformers>=4.46.0"
pip install "huggingface-hub>=0.25.1"
pip install "sentencepiece>=0.2.0"
pip install "jinja2>=3.1.4"
pip install "pydantic>=2.9.2"
pip install "timm>=1.0.9"
pip install "tiktoken>=0.7.0"
pip install "numpy==1.26.4"
pip install "accelerate>=1.0.1"
pip install "sentence_transformers>=3.1.1"
pip install "openai>=1.51.0"
pip install "einops>=0.8.0"
pip install "pillow>=10.4.0"
pip install "sse-starlette>=2.1.3"
pip install "bitsandbytes>=0.43.3"
# using with VLLM Framework
pip install "vllm>=0.6.3"

The version specifiers are quoted because an unquoted ">" is treated by the shell as output redirection, in which case pip would install the latest version and write its output to a file named "=2.5.0".
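
After installation, it can be useful to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library. The helper name `installed_versions` is hypothetical, and the package list simply mirrors a few of the requirements above.

```python
from importlib import metadata


def installed_versions(packages):
    """Return a dict mapping each package name to its installed version, or None."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return versions


if __name__ == "__main__":
    wanted = ["torch", "transformers", "vllm", "openai", "sse-starlette"]
    for name, version in installed_versions(wanted).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```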

4. Implementation

4.1. vLLM server implementation

# -*- coding: utf-8 -*-
import time
from asyncio.log import logger
import re
import uvicorn
import gc
import json
import torch
import random
import string
from vllm import SamplingParams, AsyncEngineArgs, AsyncLLMEngine
from fastapi import FastAPI, HTTPException, Response
from fastapi.middleware.cors import CORSMiddleware
from contextlib import asynccontextmanager
from typing import List, Literal, Optional, Union
from pydantic import BaseModel, Field
from transformers import AutoTokenizer, LogitsProcessor
from sse_starlette.sse import EventSourceResponse

EventSourceResponse.DEFAULT_PING_INTERVAL = 1000
MAX_MODEL_LENGTH = 8192


@asynccontextmanager
async def lifespan(app: FastAPI):
    yield
    # Release cached GPU memory on shutdown
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.ipc_collect()


app = FastAPI(lifespan=lifespan)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)


def generate_id(prefix: str, k=29) -> str:
    suffix = ''.join(random.choices(string.ascii_letters + string.digits, k=k))
    return f"{prefix}{suffix}"


class ModelCard(BaseModel):
    id: str = ""
    object: str = "model"
    created: int = Field(default_factory=lambda: int(time.time()))
    owned_by: str = "owner"
    root: Optional[str] = None
    parent: Optional[str] = None
    permission: Optional[list] = None


class ModelList(BaseModel):
    object: str = "list"
    data: List[ModelCard] = ["glm-4"]


class FunctionCall(BaseModel):
    name: Optional[str] = None
    arguments: Optional[str] = None


class ChoiceDeltaToolCallFunction(BaseModel):
    name: Optional[str] = None
    arguments: Optional[str] = None


class UsageInfo(BaseModel):
    prompt_tokens: int = 0
    total_tokens: int = 0
    completion_tokens: Optional[int] = 0


class ChatCompletionMessageToolCall(BaseModel):
    index: Optional[int] = 0
    id: Optional[str] = None
    function: FunctionCall
    type: Optional[Literal["function"]] = 'function'


class ChatMessage(BaseModel):
    # Note on the "function" role:
    # with older OpenAI API versions, add "function" to the role Literal here and add the
    # matching conversion to the "observation" role in process_messages.
    role: Literal["user", "assistant", "system", "tool"]
    content: Optional[str] = None
    function_call: Optional[ChoiceDeltaToolCallFunction] = None
    tool_calls: Optional[List[ChatCompletionMessageToolCall]] = None


class DeltaMessage(BaseModel):
    role: Optional[Literal["user", "assistant", "system"]] = None
    content: Optional[str] = None
    function_call: Optional[ChoiceDeltaToolCallFunction] = None
    tool_calls: Optional[List[ChatCompletionMessageToolCall]] = None


class ChatCompletionResponseChoice(BaseModel):
    index: int
    message: ChatMessage
    finish_reason: Literal["stop", "length", "tool_calls"]


class ChatCompletionResponseStreamChoice(BaseModel):
    delta: DeltaMessage
    finish_reason: Optional[Literal["stop", "length", "tool_calls"]]
    index: int


class ChatCompletionResponse(BaseModel):
    model: str
    id: Optional[str] = Field(default_factory=lambda: generate_id('chatcmpl-', 29))
    object: Literal["chat.completion", "chat.completion.chunk"]
    choices: List[Union[ChatCompletionResponseChoice, ChatCompletionResponseStreamChoice]]
    created: Optional[int] = Field(default_factory=lambda: int(time.time()))
    system_fingerprint: Optional[str] = Field(default_factory=lambda: generate_id('fp_', 9))
    usage: Optional[UsageInfo] = None


class ChatCompletionRequest(BaseModel):
    model: str
    messages: List[ChatMessage]
    temperature: Optional[float] = 0.8
    top_p: Optional[float] = 0.8
    max_tokens: Optional[int] = None
    stream: Optional[bool] = False
    tools: Optional[Union[dict, List[dict]]] = None
    tool_choice: Optional[Union[str, dict]] = None
    repetition_penalty: Optional[float] = 1.1


class InvalidScoreLogitsProcessor(LogitsProcessor):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        if torch.isnan(scores).any() or torch.isinf(scores).any():
            scores.zero_()
            scores[..., 5] = 5e4
        return scores


def process_response(output: str, tools: dict | List[dict] = None, use_tool: bool = False) -> Union[str, dict]:
    lines = output.strip().split("\n")
    arguments_json = None
    special_tools = ["cogview", "simple_browser"]
    tools = {tool['function']['name'] for tool in tools} if tools else {}

    # A tool call is expected as: function name on the first line, JSON arguments after it
    if len(lines) >= 2 and lines[1].startswith("{"):
        function_name = lines[0].strip()
        arguments = "\n".join(lines[1:]).strip()
        if function_name in tools or function_name in special_tools:
            try:
                arguments_json = json.loads(arguments)
                is_tool_call = True
            except json.JSONDecodeError:
                is_tool_call = function_name in special_tools

            if is_tool_call and use_tool:
                content = {
                    "name": function_name,
                    "arguments": json.dumps(arguments_json if isinstance(arguments_json, dict) else arguments,
                                            ensure_ascii=False)
                }
                if function_name == "simple_browser":
                    search_pattern = re.compile(r'search\("(.+?)"\s*,\s*recency_days\s*=\s*(\d+)\)')
                    match = search_pattern.match(arguments)
                    if match:
                        content["arguments"] = json.dumps({
                            "query": match.group(1),
                            "recency_days": int(match.group(2))
                        }, ensure_ascii=False)
                elif function_name == "cogview":
                    content["arguments"] = json.dumps({
                        "prompt": arguments
                    }, ensure_ascii=False)

                return content
    return output.strip()


@torch.inference_mode()
async def generate_stream_glm4(params):
    messages = params["messages"]
    tools = params["tools"]
    tool_choice = params["tool_choice"]
    temperature = float(params.get("temperature", 1.0))
    repetition_penalty = float(params.get("repetition_penalty", 1.0))
    top_p = float(params.get("top_p", 1.0))
    max_new_tokens = int(params.get("max_tokens", 8192))

    messages = process_messages(messages, tools=tools, tool_choice=tool_choice)
    inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
    params_dict = {
        "n": 1,
        "best_of": 1,
        "presence_penalty": 1.0,
        "frequency_penalty": 0.0,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": -1,
        "repetition_penalty": repetition_penalty,
        "stop_token_ids": [151329, 151336, 151338],
        "ignore_eos": False,
        "max_tokens": max_new_tokens,
        "logprobs": None,
        "prompt_logprobs": None,
        "skip_special_tokens": True,
    }
    sampling_params = SamplingParams(**params_dict)
    # Each yielded item carries the full text generated so far, not just the newest tokens
    async for output in engine.generate(prompt=inputs, sampling_params=sampling_params, request_id=f"{time.time()}"):
        output_len = len(output.outputs[0].token_ids)
        input_len = len(output.prompt_token_ids)
        ret = {
            "text": output.outputs[0].text,
            "usage": {
                "prompt_tokens": input_len,
                "completion_tokens": output_len,
                "total_tokens": output_len + input_len
            },
            "finish_reason": output.outputs[0].finish_reason,
        }
        yield ret
    gc.collect()
    torch.cuda.empty_cache()


def process_messages(messages, tools=None, tool_choice="none"):
    _messages = messages
    processed_messages = []
    msg_has_sys = False

    def filter_tools(tool_choice, tools):
        function_name = tool_choice.get('function', {}).get('name', None)
        if not function_name:
            return []
        filtered_tools = [
            tool for tool in tools
            if tool.get('function', {}).get('name') == function_name
        ]
        return filtered_tools

    if tool_choice != "none":
        if isinstance(tool_choice, dict):
            tools = filter_tools(tool_choice, tools)
        if tools:
            processed_messages.append(
                {
                    "role": "system",
                    "content": None,
                    "tools": tools
                }
            )
            msg_has_sys = True

    if isinstance(tool_choice, dict) and tools:
        processed_messages.append(
            {
                "role": "assistant",
                "metadata": tool_choice["function"]["name"],
                "content": ""
            }
        )

    for m in _messages:
        role, content, func_call = m.role, m.content, m.function_call
        tool_calls = getattr(m, 'tool_calls', None)

        if role == "function":
            processed_messages.append(
                {
                    "role": "observation",
                    "content": content
                }
            )
        elif role == "tool":
            processed_messages.append(
                {
                    "role": "observation",
                    "content": content,
                    "function_call": True
                }
            )
        elif role == "assistant":
            if tool_calls:
                for tool_call in tool_calls:
                    processed_messages.append(
                        {
                            "role": "assistant",
                            "metadata": tool_call.function.name,
                            "content": tool_call.function.arguments
                        }
                    )
            else:
                for response in content.split("\n"):
                    if "\n" in response:
                        metadata, sub_content = response.split("\n", maxsplit=1)
                    else:
                        metadata, sub_content = "", response
                    processed_messages.append(
                        {
                            "role": role,
                            "metadata": metadata,
                            "content": sub_content.strip()
                        }
                    )
        else:
            if role == "system" and msg_has_sys:
                msg_has_sys = False
                continue
            processed_messages.append({"role": role, "content": content})

    if not tools or tool_choice == "none":
        for m in _messages:
            if m.role == 'system':
                processed_messages.insert(0, {"role": m.role, "content": m.content})
                break
    return processed_messages


@app.get("/health")
async def health() -> Response:
    """Health check."""
    return Response(status_code=200)


@app.get("/v1/models", response_model=ModelList)
async def list_models():
    model_card = ModelCard(id="glm-4")
    return ModelList(data=[model_card])


@app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
async def create_chat_completion(request: ChatCompletionRequest):
    if len(request.messages) < 1 or request.messages[-1].role == "assistant":
        raise HTTPException(status_code=400, detail="Invalid request")

    gen_params = dict(
        messages=request.messages,
        temperature=request.temperature,
        top_p=request.top_p,
        max_tokens=request.max_tokens or 1024,
        echo=False,
        stream=request.stream,
        repetition_penalty=request.repetition_penalty,
        tools=request.tools,
        tool_choice=request.tool_choice,
    )
    logger.debug(f"==== request ====\n{gen_params}")

    if request.stream:
        predict_stream_generator = predict_stream(request.model, gen_params)
        output = await anext(predict_stream_generator)
        if output:
            return EventSourceResponse(predict_stream_generator, media_type="text/event-stream")
        logger.debug(f"First result output:\n{output}")

        function_call = None
        if output and request.tools:
            try:
                function_call = process_response(output, request.tools, use_tool=True)
            except Exception:
                logger.warning("Failed to parse tool call")

        if isinstance(function_call, dict):
            function_call = ChoiceDeltaToolCallFunction(**function_call)
            generate = parse_output_text(request.model, output, function_call=function_call)
            return EventSourceResponse(generate, media_type="text/event-stream")
        else:
            return EventSourceResponse(predict_stream_generator, media_type="text/event-stream")

    # Non-streaming: drain the generator and keep only the final (complete) result
    response = ""
    async for response in generate_stream_glm4(gen_params):
        pass

    if response["text"].startswith("\n"):
        response["text"] = response["text"][1:]
    response["text"] = response["text"].strip()

    usage = UsageInfo()

    function_call, finish_reason = None, "stop"
    tool_calls = None
    if request.tools:
        try:
            function_call = process_response(response["text"], request.tools, use_tool=True)
        except Exception as e:
            logger.warning(f"Failed to parse tool call: {e}")
    if isinstance(function_call, dict):
        finish_reason = "tool_calls"
        function_call_response = ChoiceDeltaToolCallFunction(**function_call)
        function_call_instance = FunctionCall(
            name=function_call_response.name,
            arguments=function_call_response.arguments
        )
        tool_calls = [
            ChatCompletionMessageToolCall(
                id=generate_id('call_', 24),
                function=function_call_instance,
                type="function")]

    message = ChatMessage(
        role="assistant",
        content=None if tool_calls else response["text"],
        function_call=None,
        tool_calls=tool_calls,
    )

    logger.debug(f"==== message ====\n{message}")

    choice_data = ChatCompletionResponseChoice(
        index=0,
        message=message,
        finish_reason=finish_reason,
    )
    task_usage = UsageInfo.model_validate(response["usage"])
    for usage_key, usage_value in task_usage.model_dump().items():
        setattr(usage, usage_key, getattr(usage, usage_key) + usage_value)

    return ChatCompletionResponse(
        model=request.model,
        choices=[choice_data],
        object="chat.completion",
        usage=usage
    )


async def predict_stream(model_id, gen_params):
    output = ""
    is_function_call = False
    has_send_first_chunk = False
    created_time = int(time.time())
    function_name = None
    response_id = generate_id('chatcmpl-', 29)
    system_fingerprint = generate_id('fp_', 9)
    tools = {tool['function']['name'] for tool in gen_params['tools']} if gen_params['tools'] else {}
    delta_text = ""
    async for new_response in generate_stream_glm4(gen_params):
        decoded_unicode = new_response["text"]
        delta_text += decoded_unicode[len(output):]
        output = decoded_unicode
        lines = output.strip().split("\n")

        # Check whether the output is a tool call.
        # This is a simple comparison and is not guaranteed to catch every non-tool output,
        # for example special cases such as misaligned arguments.
        # TODO: extend the logic here if you need more thorough handling.
        if not is_function_call and len(lines) >= 2:
            first_line = lines[0].strip()
            if first_line in tools:
                is_function_call = True
                function_name = first_line
                delta_text = lines[1]

        # Tool call response
        if is_function_call:
            if not has_send_first_chunk:
                function_call = {"name": function_name, "arguments": ""}
                tool_call = ChatCompletionMessageToolCall(
                    index=0,
                    id=generate_id('call_', 24),
                    function=FunctionCall(**function_call),
                    type="function"
                )
                message = DeltaMessage(
                    content=None,
                    role="assistant",
                    function_call=None,
                    tool_calls=[tool_call]
                )
                choice_data = ChatCompletionResponseStreamChoice(
                    index=0,
                    delta=message,
                    finish_reason=None
                )
                chunk = ChatCompletionResponse(
                    model=model_id,
                    id=response_id,
                    choices=[choice_data],
                    created=created_time,
                    system_fingerprint=system_fingerprint,
                    object="chat.completion.chunk"
                )
                yield ""
                yield chunk.model_dump_json(exclude_unset=True)
                has_send_first_chunk = True

            function_call = {"name": None, "arguments": delta_text}
            delta_text = ""
            tool_call = ChatCompletionMessageToolCall(
                index=0,
                id=None,
                function=FunctionCall(**function_call),
                type="function"
            )
            message = DeltaMessage(
                content=None,
                role=None,
                function_call=None,
                tool_calls=[tool_call]
            )
            choice_data = ChatCompletionResponseStreamChoice(
                index=0,
                delta=message,
                finish_reason=None
            )
            chunk = ChatCompletionResponse(
                model=model_id,
                id=response_id,
                choices=[choice_data],
                created=created_time,
                system_fingerprint=system_fingerprint,
                object="chat.completion.chunk"
            )
            yield chunk.model_dump_json(exclude_unset=True)

        # The user requested function calling, but it is not yet clear whether this output is a function call
        elif (gen_params["tools"] and gen_params["tool_choice"] != "none") or is_function_call:
            continue

        # Regular response
        else:
            finish_reason = new_response.get("finish_reason", None)
            if not has_send_first_chunk:
                message = DeltaMessage(
                    content="",
                    role="assistant",
                    function_call=None,
                )
                choice_data = ChatCompletionResponseStreamChoice(
                    index=0,
                    delta=message,
                    finish_reason=finish_reason
                )
                chunk = ChatCompletionResponse(
                    model=model_id,
                    id=response_id,
                    choices=[choice_data],
                    created=created_time,
                    system_fingerprint=system_fingerprint,
                    object="chat.completion.chunk"
                )
                yield chunk.model_dump_json(exclude_unset=True)
                has_send_first_chunk = True

            message = DeltaMessage(
                content=delta_text,
                role="assistant",
                function_call=None,
            )
            delta_text = ""
            choice_data = ChatCompletionResponseStreamChoice(
                index=0,
                delta=message,
                finish_reason=finish_reason
            )
            chunk = ChatCompletionResponse(
                model=model_id,
                id=response_id,
                choices=[choice_data],
                created=created_time,
                system_fingerprint=system_fingerprint,
                object="chat.completion.chunk"
            )
            yield chunk.model_dump_json(exclude_unset=True)

    # A tool call needs one extra chunk to stay aligned with the OpenAI interface
    if is_function_call:
        yield ChatCompletionResponse(
            model=model_id,
            id=response_id,
            system_fingerprint=system_fingerprint,
            choices=[
                ChatCompletionResponseStreamChoice(
                    index=0,
                    delta=DeltaMessage(
                        content=None,
                        role=None,
                        function_call=None,
                    ),
                    finish_reason="tool_calls"
                )],
            created=created_time,
            object="chat.completion.chunk",
            usage=None
        ).model_dump_json(exclude_unset=True)
    elif delta_text != "":
        message = DeltaMessage(
            content="",
            role="assistant",
            function_call=None,
        )
        choice_data = ChatCompletionResponseStreamChoice(
            index=0,
            delta=message,
            finish_reason=None
        )
        chunk = ChatCompletionResponse(
            model=model_id,
            id=response_id,
            choices=[choice_data],
            created=created_time,
            system_fingerprint=system_fingerprint,
            object="chat.completion.chunk"
        )
        yield chunk.model_dump_json(exclude_unset=True)

        finish_reason = 'stop'
        message = DeltaMessage(
            content=delta_text,
            role="assistant",
            function_call=None,
        )
        delta_text = ""
        choice_data = ChatCompletionResponseStreamChoice(
            index=0,
            delta=message,
            finish_reason=finish_reason
        )
        chunk = ChatCompletionResponse(
            model=model_id,
            id=response_id,
            choices=[choice_data],
            created=created_time,
            system_fingerprint=system_fingerprint,
            object="chat.completion.chunk"
        )
        yield chunk.model_dump_json(exclude_unset=True)
        yield '[DONE]'
    else:
        yield '[DONE]'


async def parse_output_text(model_id: str, value: str, function_call: ChoiceDeltaToolCallFunction = None):
    delta = DeltaMessage(role="assistant", content=value)
    if function_call is not None:
        delta.function_call = function_call

    choice_data = ChatCompletionResponseStreamChoice(
        index=0,
        delta=delta,
        finish_reason=None
    )
    chunk = ChatCompletionResponse(
        model=model_id,
        choices=[choice_data],
        object="chat.completion.chunk"
    )
    yield "{}".format(chunk.model_dump_json(exclude_unset=True))
    yield '[DONE]'


if __name__ == "__main__":
    MODEL_PATH = "/data/model/glm-4-9b-chat"
    tensor_parallel_size = 1
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
    engine_args = AsyncEngineArgs(
        model=MODEL_PATH,
        tokenizer=MODEL_PATH,
        # If you have multiple GPUs, set this to the number of GPUs
        tensor_parallel_size=tensor_parallel_size,
        dtype=torch.float16,
        trust_remote_code=True,
        # Fraction of GPU memory to use; choose a value that fits your card.
        # For example, to use only 24 GB of an 80 GB card, set 24/80 = 0.3
        gpu_memory_utilization=0.9,
        enforce_eager=True,
        worker_use_ray=False,
        disable_log_requests=True,
        max_model_len=MAX_MODEL_LENGTH,
    )
    engine = AsyncLLMEngine.from_engine_args(engine_args)
    uvicorn.run(app, host='0.0.0.0', port=8000, workers=1)
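
The tool-call detection in `process_response` relies on a simple convention: the first line of the model output is the function name, and the remaining lines are a JSON object holding the arguments. Below is a stripped-down sketch of just that check, independent of FastAPI and vLLM. The helper `split_tool_call` and the `get_weather` tool name are hypothetical and used only for illustration; the special-case handling of `simple_browser` and `cogview` in the server is deliberately left out.

```python
import json
from typing import Optional, Set, Tuple


def split_tool_call(output: str, tool_names: Set[str]) -> Optional[Tuple[str, dict]]:
    """Return (function_name, arguments) if the output looks like a tool call, else None."""
    lines = output.strip().split("\n")
    # Need a name line plus at least one line that starts a JSON object.
    if len(lines) < 2 or not lines[1].startswith("{"):
        return None
    name = lines[0].strip()
    if name not in tool_names:
        return None
    try:
        arguments = json.loads("\n".join(lines[1:]).strip())
    except json.JSONDecodeError:
        return None  # the name matched, but the arguments are not valid JSON
    return name, arguments


if __name__ == "__main__":
    known_tools = {"get_weather"}  # hypothetical tool registered by the client
    print(split_tool_call('get_weather\n{"city": "Guangzhou"}', known_tools))
    print(split_tool_call("Guangzhou is sunny today.", known_tools))
```

Plain answers fall through as `None`, which corresponds to the server returning the text unchanged.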

4.2. Starting the vLLM server

(glm4) [root@gpu test]# python -u glm_server.py 
WARNING 11-06 12:11:19 config.py:1668] Casting torch.bfloat16 to torch.float16.
WARNING 11-06 12:11:23 config.py:395] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 11-06 12:11:23 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='/data/model/glm-4-9b-chat', speculative_config=None, tokenizer='/data/model/glm-4-9b-chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/model/glm-4-9b-chat, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=False, mm_processor_kwargs=None)
WARNING 11-06 12:11:24 tokenizer.py:169] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 11-06 12:11:24 selector.py:224] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 11-06 12:11:24 selector.py:115] Using XFormers backend.
/usr/local/miniconda3/envs/glm4/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/usr/local/miniconda3/envs/glm4/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 11-06 12:11:25 model_runner.py:1056] Starting to load model /data/model/glm-4-9b-chat...
INFO 11-06 12:11:25 selector.py:224] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 11-06 12:11:25 selector.py:115] Using XFormers backend.
Loading safetensors checkpoint shards:   0% Completed | 0/10 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  10% Completed | 1/10 [00:00<00:08,  1.01it/s]
Loading safetensors checkpoint shards:  20% Completed | 2/10 [00:01<00:07,  1.13it/s]
Loading safetensors checkpoint shards:  30% Completed | 3/10 [00:02<00:06,  1.14it/s]
Loading safetensors checkpoint shards:  40% Completed | 4/10 [00:03<00:05,  1.15it/s]
Loading safetensors checkpoint shards:  50% Completed | 5/10 [00:04<00:04,  1.18it/s]
Loading safetensors checkpoint shards:  60% Completed | 6/10 [00:05<00:03,  1.08it/s]
Loading safetensors checkpoint shards:  70% Completed | 7/10 [00:06<00:02,  1.07it/s]
Loading safetensors checkpoint shards:  80% Completed | 8/10 [00:07<00:01,  1.13it/s]
Loading safetensors checkpoint shards:  90% Completed | 9/10 [00:08<00:00,  1.10it/s]
Loading safetensors checkpoint shards: 100% Completed | 10/10 [00:08<00:00,  1.10it/s]
Loading safetensors checkpoint shards: 100% Completed | 10/10 [00:08<00:00,  1.11it/s]
INFO 11-06 12:11:35 model_runner.py:1067] Loading model weights took 17.5635 GB
INFO 11-06 12:11:37 gpu_executor.py:122] # GPU blocks: 12600, # CPU blocks: 6553
INFO 11-06 12:11:37 gpu_executor.py:126] Maximum concurrency for 8192 tokens per request: 24.61x
INFO:     Started server process [1627618]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
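
The "Maximum concurrency" figure in the log can be reproduced from the reported number of GPU blocks. The sketch below assumes a KV-cache block size of 16 tokens, which is vLLM's default on CUDA but is not printed in the log above; the function name `max_concurrency` is hypothetical.

```python
def max_concurrency(num_gpu_blocks: int, max_model_len: int, block_size: int = 16) -> float:
    """How many full-length requests fit into the KV cache at the same time."""
    total_cached_tokens = num_gpu_blocks * block_size  # token slots across all GPU blocks
    return total_cached_tokens / max_model_len


if __name__ == "__main__":
    # 12600 GPU blocks and max_seq_len=8192 are taken from the startup log.
    print(f"{max_concurrency(12600, 8192):.2f}x")  # prints 24.61x
```

Lowering `max_model_len` or raising `gpu_memory_utilization` changes this ratio, since the first shrinks the per-request budget and the second leaves more memory for KV-cache blocks.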

4.3. Client implementation

# -*- coding: utf-8 -*-
from openai import OpenAI

base_url = "http://127.0.0.1:8000/v1/"
client = OpenAI(api_key="EMPTY", base_url=base_url)
MODEL_PATH = "/data/model/glm-4-9b-chat"


def chat(use_stream=False):
    messages = [
        {
            "role": "system",
            # "You are a professional tour guide."
            "content": "你是一名专业的导游。",
        },
        {
            "role": "user",
            # "Please recommend some of Guangzhou's signature attractions."
            "content": "请推荐一些广州特色的景点?",
        }
    ]
    response = client.chat.completions.create(
        model=MODEL_PATH,
        messages=messages,
        stream=use_stream,
        max_tokens=8192,
        temperature=0.4,
        presence_penalty=1.2,
        top_p=0.9,
    )
    if response:
        if use_stream:
            for chunk in response:
                msg = chunk.choices[0].delta.content
                # Some chunks carry no text (content is None), so skip them
                if msg:
                    print(msg, end='', flush=True)
        else:
            print(response)
    else:
        print("Error: empty response")


if __name__ == "__main__":
    chat(use_stream=True)

4.4. Running the client

(glm4) [root@gpu test]# python -u glm_client.py
当然可以!广州是中国广东省的省会,历史悠久,文化底蕴深厚,同时也是一座现代化的大都市。以下是一些广州的特色景点推荐:

1. **白云山** - 广州著名的风景区,有“羊城第一秀”之称。山上空气清新,景色优美,是登山和观赏广州市区全景的好地方。
2. **珠江夜游** - 乘坐游船在珠江上欣赏两岸的夜景,可以看到广州塔、海心沙等著名地标,以及璀璨的灯光秀。
3. **长隆旅游度假区** - 包括长隆野生动物世界、长隆水上乐园、长隆国际大马戏等多个主题公园,适合家庭游玩。
4. **陈家祠** - 又称陈氏书院,是一座典型的岭南传统建筑,以其精美的木雕、石雕和砖雕闻名。
5. **越秀公园** - 公园内有五羊雕像,是广州的象征之一。还有中山纪念碑、镇海楼等历史遗迹。
6. **北京路步行街** - 这里集合了购物、餐饮、娱乐于一体,是一条充满活力的商业街区。
7. **上下九步行街** - 这条古老的街道以骑楼建筑为特色,两旁有许多老字号商店和小吃店,是体验广州传统文化的好去处。
8. **广州塔(小蛮腰)** - 作为广州的地标性建筑,游客可以从这里俯瞰整个城市的壮丽景观。
9. **南越王宫博物馆** - 展示了两千多年前南越国的历史文化,馆内有一座复原的宫殿模型。
10. **荔湾湖公园** - 一个集自然风光与人文景观于一体的公园,湖水清澈,环境宜人。
11. **广州动物园** - 是中国最大的城市动物园之一,拥有多种珍稀动物。
12. **广州艺术博物院** - 收藏了大量珍贵的艺术品和历史文物,是了解广东乃至华南地区文化艺术的重要场所。

这些景点不仅展示了广州的自然美景,也体现了其丰富的文化遗产和现代都市的风貌。希望您在广州旅行时能有一个愉快的体验!
