
Python Web Deployment Essentials, No. 2: How to Build a High-Concurrency AI Inference Service with FastAPI and OLLAMA (Part 1)

article 2026/2/17 8:08:50


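Abstract:

This article describes how to build a high-concurrency AI inference service on a FastAPI and OLLAMA architecture. It uses asyncio's asynchronous I/O to raise throughput and shorten response times, and adds performance-metric collection and a caching strategy. The solution is presented in several parts: an overview of the basic architecture, the implementation steps, performance-metric collection, the integration of FastAPI with OLLAMA, and optimization options (batch inference, load balancing, dynamic cache invalidation, monitoring and alerting). The second half of the article outlines the design of a small blog site that uses high-concurrency inference. The series has three parts: Part 2 covers the concrete site implementation, and Part 3 covers performance testing and monitoring of the high-concurrency site.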

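I. Basic approach and optimization methods:

1. Architecture overview

FastAPI: a high-performance asynchronous web framework with native support for async request handlers.
OLLAMA: assumed here to be an AI inference engine (for example, an LLM-based inference service) that can be called through an API or a library.
Asyncio: manages asynchronous tasks so that I/O-bound operations (network requests, database access, and so on) do not block the event loop.
Caching strategy: stores frequently requested inference results in a cache (in-process with functools.lru_cache, or shared with Redis) to avoid repeated computation.
Performance-metric collection: uses asynchronous tasks to record performance data (request latency, error rate, and so on) and forward it to a monitoring system.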
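
To make the role of asyncio concrete, the following minimal sketch runs several simulated inference calls concurrently. The function fake_inference is a hypothetical stand-in that only sleeps; it is not an OLLAMA API. Because each call yields to the event loop while it waits, ten calls of 0.1 s each should finish in roughly 0.1 s rather than 1 s.

```python
import asyncio
import time


async def fake_inference(prompt: str, delay: float = 0.1) -> str:
    """Hypothetical stand-in for a network call to the inference service."""
    await asyncio.sleep(delay)  # yields to the event loop while "waiting"
    return f"echo: {prompt}"


async def run_concurrently(prompts: list[str]) -> tuple[list[str], float]:
    """Start all calls at once and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = await asyncio.gather(*(fake_inference(p) for p in prompts))
    return list(results), time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = asyncio.run(run_concurrently([f"p{i}" for i in range(10)]))
    print(len(results), f"{elapsed:.2f}s")
```

asyncio.gather returns the results in the same order as the inputs, so callers can match each result to its prompt.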


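2. Implementation steps

(1) Asynchronous inference calls

Use asyncio to call the OLLAMA inference service asynchronously. Assuming OLLAMA exposes an HTTP API, the httpx library (which supports asynchronous requests) can be used to talk to it.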

import httpx
from fastapi import FastAPI

app = FastAPI()

# Assumed address of the OLLAMA inference service
OLLAMA_API_URL = "http://ollama-service:8000/inference"

async def call_ollama(prompt: str):
    async with httpx.AsyncClient() as client:
        response = await client.post(
            OLLAMA_API_URL,
            json={"prompt": prompt},
            timeout=30.0,
        )
        response.raise_for_status()  # raise on 4xx/5xx responses
        return response.json()
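
Under heavy load it can help to cap how many inference calls are in flight at once, so that the backend is not flooded. The sketch below shows one way to do this with asyncio.Semaphore. InferenceLimiter and fake_call are hypothetical names introduced for illustration; in the service, the call passed to run would be call_ollama.

```python
import asyncio


class InferenceLimiter:
    """Caps the number of in-flight inference calls (hypothetical helper)."""

    def __init__(self, max_concurrency: int) -> None:
        self._semaphore = asyncio.Semaphore(max_concurrency)
        self.in_flight = 0
        self.peak = 0  # highest concurrency observed so far

    async def run(self, call, prompt: str):
        async with self._semaphore:  # waits here when the cap is reached
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                return await call(prompt)
            finally:
                self.in_flight -= 1


async def fake_call(prompt: str) -> dict:
    """Stand-in for call_ollama; sleeps instead of making a request."""
    await asyncio.sleep(0.01)
    return {"response": prompt.upper()}


async def demo() -> tuple[list, int]:
    limiter = InferenceLimiter(max_concurrency=3)
    results = await asyncio.gather(
        *(limiter.run(fake_call, f"p{i}") for i in range(10))
    )
    return list(results), limiter.peak
```

The right value for max_concurrency depends on how many parallel requests the inference backend can actually serve, which has to be measured.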

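(2) Caching strategy

A cache reduces the cost of repeated inference. Two common approaches follow.

In-memory cache (suitable for small deployments):
Use functools.lru_cache or the aiocache library for a simple in-memory cache.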

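import asyncio
from functools import lru_cache

import httpx

def call_ollama_sync(prompt: str):
    # Synchronous counterpart of call_ollama: lru_cache cannot cache coroutine results
    response = httpx.post(OLLAMA_API_URL, json={"prompt": prompt}, timeout=30.0)
    response.raise_for_status()
    return response.json()

@lru_cache(maxsize=128)
def cached_inference(prompt: str):
    # The inference call is synchronous here; an async call needs a different cache
    return call_ollama_sync(prompt)

async def call_ollama_with_cache(prompt: str):
    # Run the blocking cached call in a worker thread so the event loop stays free
    return await asyncio.to_thread(cached_inference, prompt)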
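
Because functools.lru_cache only works with synchronous functions, an alternative is to cache at the coroutine level. The following sketch, under the assumption that results can be kept in process memory, also de-duplicates concurrent requests for the same prompt so that only one inference call is made. AsyncCache is a hypothetical helper and the cache is unbounded, so a real service would add a size limit or expiry.

```python
import asyncio


class AsyncCache:
    """In-process cache for coroutine results with in-flight de-duplication."""

    def __init__(self, fetch) -> None:
        self._fetch = fetch  # async callable: prompt -> result
        self._tasks: dict[str, asyncio.Future] = {}

    async def get(self, prompt: str):
        task = self._tasks.get(prompt)
        if task is None:
            # The first caller starts the work; later callers await the same task.
            task = asyncio.ensure_future(self._fetch(prompt))
            self._tasks[prompt] = task
        try:
            # shield keeps the shared task alive if one waiting caller is cancelled
            return await asyncio.shield(task)
        except Exception:
            # Failures are not cached, so a later request can retry.
            if self._tasks.get(prompt) is task:
                del self._tasks[prompt]
            raise
```

In the service, fetch would be call_ollama, and a handler would call `await cache.get(prompt)` instead of calling the inference function directly.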

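Distributed cache (suitable for larger deployments):
Use Redis as a distributed cache, with the aioredis library for asynchronous access. (aioredis has since been merged into redis-py as redis.asyncio.)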

import json

import aioredis

redis = aioredis.from_url("redis://localhost")

async def call_ollama_with_redis_cache(prompt: str):
    cache_key = f"inference:{prompt}"
    cached = await redis.get(cache_key)
    if cached:
        return json.loads(cached)  # cache hit
    # Cache miss: call the inference service
    result = await call_ollama(prompt)
    # Redis stores strings or bytes, so serialize the JSON result; cache for 1 hour
    await redis.setex(cache_key, 3600, json.dumps(result))
    return result
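
Using the raw prompt as the cache key produces very long keys for long prompts. One common alternative is to hash the prompt. The sketch below is an illustration under assumptions: make_cache_key and its model parameter are hypothetical, and collapsing whitespace is only appropriate if the model treats such prompts as equivalent.

```python
import hashlib


def make_cache_key(prompt: str, model: str = "default", prefix: str = "inference") -> str:
    """Build a fixed-length cache key from the model name and the prompt."""
    normalized = " ".join(prompt.split())  # collapse runs of whitespace
    digest = hashlib.sha256(f"{model}\n{normalized}".encode("utf-8")).hexdigest()
    return f"{prefix}:{digest}"
```

Including the model name in the hash keeps results from different models apart if the service ever serves more than one.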

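(3) Performance-metric collection

Record performance metrics such as request latency and success rate through middleware or background tasks.

Middleware that records request latency:
Add a middleware to FastAPI that records the processing time of every request.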

import time
from fastapi import Request, Response

@app.middleware("http")
async def add_process_time_header(request: Request, call_next):
    start_time = time.perf_counter()
    response: Response = await call_next(request)
    process_time = time.perf_counter() - start_time
    # Expose the processing time (in seconds) as a response header
    response.headers["X-Process-Time"] = str(process_time)
    return response
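
Once per-request times are recorded, they are usually summarized as percentiles rather than averages, since a few slow requests dominate user experience. The following is a minimal sketch using only the standard library; latency_summary is a hypothetical helper, and the samples are assumed to be durations in seconds.

```python
import statistics


def latency_summary(samples: list[float]) -> dict[str, float]:
    """Return p50/p95/p99 and the maximum of the recorded latencies."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    # n=100 yields 99 cut points; index k-1 is the k-th percentile
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98], "max": max(samples)}
```

In a running service the samples would typically be kept in a bounded window, for example the last few thousand requests.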

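Background task that collects metrics:
Use asyncio.create_task to start a background task that drains a queue of performance data and forwards it to a monitoring system (such as Prometheus or Datadog).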

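import asyncio

metrics_queue = asyncio.Queue()

async def metrics_collector():
    while True:
        metric = await metrics_queue.get()
        # Send the metric to the monitoring system (printed here as a placeholder)
        print(f"Collected Metric: {metric}")
        metrics_queue.task_done()

# Start the background task when the application starts
@app.on_event("startup")
async def startup_event():
    # Keep a reference so the task is not garbage-collected while it runs
    app.state.metrics_task = asyncio.create_task(metrics_collector())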
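
A background collector of this kind should also be stopped cleanly, otherwise metrics still in the queue are lost on shutdown. The sketch below shows one possible pattern: wait for the queue to drain with Queue.join, then cancel the worker. The names metrics_worker and run_with_graceful_shutdown are hypothetical, and the list sink stands in for a real monitoring system.

```python
import asyncio


async def metrics_worker(queue: asyncio.Queue, sink: list) -> None:
    """Move metrics from the queue to the sink until cancelled."""
    while True:
        metric = await queue.get()
        try:
            sink.append(metric)  # stand-in for sending to a monitoring system
        finally:
            queue.task_done()


async def run_with_graceful_shutdown(metrics: list[dict]) -> list[dict]:
    queue: asyncio.Queue = asyncio.Queue()
    sink: list[dict] = []
    worker = asyncio.create_task(metrics_worker(queue, sink))
    for metric in metrics:
        await queue.put(metric)
    await queue.join()  # wait until every queued metric has been processed
    worker.cancel()     # then stop the worker
    try:
        await worker
    except asyncio.CancelledError:
        pass
    return sink
```

In a FastAPI application the drain-and-cancel steps would belong in the shutdown handler that mirrors the startup handler.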

(4) Putting FastAPI and OLLAMA together

Combine the components above in FastAPI to expose a complete high-concurrency inference endpoint.

@app.post("/infer")
async def infer(prompt: str):
    # prompt is a plain str parameter, so FastAPI reads it from the query string
    start_time = time.time()
    # Call the inference service (with caching)
    try:
        result = await call_ollama_with_redis_cache(prompt)
        status = "success"
    except Exception as e:
        result = {"error": str(e)}
        status = "failure"
    process_time = time.time() - start_time
    # Record the performance metric
    await metrics_queue.put({
        "prompt": prompt,
        "status": status,
        "process_time": process_time,
    })
    return result

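3. Optimization suggestions

(1) Batch inference

If several requests can be merged into one batch inference request, throughput can improve significantly. For example, accumulate a certain number of requests and send them to OLLAMA in a single call.

(2) Load balancing

Under high concurrency, deploy multiple instances and distribute traffic through a load balancer (such as Nginx or a Kubernetes Service).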
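
In production the distribution is done by Nginx or Kubernetes, but the idea can be illustrated with a small client-side sketch. RoundRobinBackends is a hypothetical class that hands out backend URLs in rotation and skips the ones marked as down; the URLs used with it are placeholders.

```python
import itertools


class RoundRobinBackends:
    """Hands out backend URLs in rotation, skipping those marked down."""

    def __init__(self, urls: list[str]) -> None:
        if not urls:
            raise ValueError("at least one backend URL is required")
        self._urls = list(urls)
        self._cycle = itertools.cycle(self._urls)
        self._down: set[str] = set()

    def mark_down(self, url: str) -> None:
        self._down.add(url)

    def mark_up(self, url: str) -> None:
        self._down.discard(url)

    def next_url(self) -> str:
        # Try each backend at most once per call
        for _ in range(len(self._urls)):
            url = next(self._cycle)
            if url not in self._down:
                return url
        raise RuntimeError("no healthy backend available")
```

A real load balancer also performs health checks to decide when to mark a backend down or up; here that decision is left to the caller.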

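(3) Dynamic cache invalidation

For data with strict freshness requirements, use a dynamic invalidation policy, for example refreshing cache entries automatically according to how often the underlying data changes.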
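
The two building blocks of such a policy are per-entry expiry and explicit invalidation. The sketch below shows both in a small in-memory cache; TTLCache is a hypothetical class, and the injectable clock exists only to make the behaviour easy to test. With Redis, the same roles are played by key expiry and key deletion.

```python
import time


class TTLCache:
    """In-memory cache with per-entry expiry and explicit invalidation."""

    def __init__(self, clock=time.monotonic) -> None:
        self._clock = clock
        self._items: dict[str, tuple[float, object]] = {}

    def set(self, key: str, value: object, ttl: float) -> None:
        self._items[key] = (self._clock() + ttl, value)

    def get(self, key: str, default=None):
        entry = self._items.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # evict expired entries lazily on read
            return default
        return value

    def invalidate(self, key: str) -> bool:
        """Remove an entry right away, e.g. when its source data changes."""
        return self._items.pop(key, None) is not None
```

Data that changes often would be stored with a short ttl, and code that updates the underlying data would call invalidate for the affected keys.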

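(4) Monitoring and alerting

Combine Prometheus and Grafana to monitor service performance in real time, and define alert rules (for example, when the request failure rate exceeds a threshold).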
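
The logic of a failure-rate rule can be sketched in a few lines. In practice the rule would be written in the monitoring system itself; the hypothetical FailureRateAlert class below only illustrates the idea of evaluating the failure rate over a sliding window of recent requests, with a minimum sample count to avoid firing on the first few requests.

```python
from collections import deque


class FailureRateAlert:
    """Fires when the failure rate over the last `window` requests is too high."""

    def __init__(self, window: int = 100, threshold: float = 0.05,
                 min_samples: int = 20) -> None:
        self._outcomes: deque = deque(maxlen=window)  # True = success
        self._threshold = threshold
        self._min_samples = min_samples

    def failure_rate(self) -> float:
        if not self._outcomes:
            return 0.0
        return self._outcomes.count(False) / len(self._outcomes)

    def firing(self) -> bool:
        enough = len(self._outcomes) >= self._min_samples
        return enough and self.failure_rate() > self._threshold

    def record(self, success: bool) -> bool:
        """Record one request outcome and return whether the alert fires."""
        self._outcomes.append(success)
        return self.firing()
```

The default window, threshold, and minimum sample count are placeholders and would need tuning against real traffic.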

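4. Summary

The methods above yield an efficient, high-concurrency FastAPI + OLLAMA inference service:

Use asyncio and asynchronous libraries (such as httpx and aioredis) to improve I/O performance.
Use a caching strategy to avoid repeated computation and lower latency.
Use middleware and background tasks to collect performance metrics and keep tuning the service.

Besides meeting high-concurrency requirements, this architecture uses caching and performance monitoring to improve user experience and system stability.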

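II. Complete example code: personal blog

The following is a complete design for a personal blog site built with Flask, combined with the high-concurrency AI inference techniques of the FastAPI and OLLAMA architecture (batch inference, load balancing, dynamic cache invalidation, monitoring and alerting). The design is split into the following parts:

1. Project structure

The project layout is as follows: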

personal_blog/
├── app/                     # Flask application code
│   ├── __init__.py          # Initializes the Flask application
│   ├── routes.py            # Blog routes
│   ├── models.py            # Database models
│   ├── ai_service.py        # AI inference service integration
│   ├── cache.py             # Caching logic
│   └── metrics.py           # Performance-metric collection
├── static/                  # Static assets
│   ├── css/                 # Stylesheets
│   ├── js/                  # JavaScript files
│   └── images/              # Images
├── templates/               # Template files
│   ├── base.html            # Base template
│   ├── index.html           # Home page
│   ├── post.html            # Blog post page
│   └── ai_response.html     # AI inference result page
├── migrations/              # Database migration files
├── requirements.txt         # Python dependencies
├── run.py                   # Startup script
└── README.md                # Project documentation

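2. Technology stack

Frontend: HTML + CSS + JavaScript (Bootstrap or TailwindCSS optional)
Backend: Flask (main framework) + FastAPI (AI inference service)
Database: SQLite (suitable for small projects) or PostgreSQL (recommended for production)
Cache: Redis (dynamic cache invalidation)
Monitoring: Prometheus + Grafana
Load balancing: Nginx or a Kubernetes Service (optional)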

3. Detailed implementation

(1) Flask application initialization

Initialize the Flask application in app/__init__.py and wire up the database and the cache.

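from flask import Flask
from flask_sqlalchemy import SQLAlchemy
import aioredis

# Initialize the Flask application.
# templates/ and static/ sit next to app/ in the project layout, so point Flask at them.
app = Flask(__name__, template_folder="../templates", static_folder="../static")
app.config['SQLALCHEMY_DATABASE_URI'] = 'sqlite:///blog.db'
app.config['SQLALCHEMY_TRACK_MODIFICATIONS'] = False

# Initialize the database
db = SQLAlchemy(app)

# Initialize the Redis cache
redis = aioredis.from_url("redis://localhost")

# Import the routes (after app and db exist, to avoid a circular import)
from app.routes import *

# Create the database tables
with app.app_context():
    db.create_all()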

(2) Database model

Define the database model for blog posts in app/models.py.

from . import db

class Post(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    title = db.Column(db.String(100), nullable=False)
    content = db.Column(db.Text, nullable=False)
    created_at = db.Column(db.DateTime, default=db.func.current_timestamp())

(3) Routes and views

Define the blog's routes and view functions in app/routes.py.

from flask import render_template, request, redirect, url_for
from . import app, db
from .models import Post
from .ai_service import call_ollama_with_cache
from .cache import redis

@app.route('/')
def index():
    posts = Post.query.order_by(Post.created_at.desc()).all()
    return render_template('index.html', posts=posts)

@app.route('/post/<int:post_id>')
def view_post(post_id):
    post = Post.query.get_or_404(post_id)
    return render_template('post.html', post=post)

@app.route('/create', methods=['GET', 'POST'])
def create_post():
    if request.method == 'POST':
        title = request.form['title']
        content = request.form['content']
        new_post = Post(title=title, content=content)
        db.session.add(new_post)
        db.session.commit()
        return redirect(url_for('index'))
    return render_template('create_post.html')

# Async views require Flask 2.0+ installed with the async extra: pip install "flask[async]"
@app.route('/ai-inference', methods=['POST'])
async def ai_inference():
    prompt = request.form['prompt']
    result = await call_ollama_with_cache(prompt)
    return render_template('ai_response.html', result=result)

(4) AI inference service

Implement the integration with the high-concurrency FastAPI AI inference service in app/ai_service.py.

Batch inference

import asyncio

import httpx

BATCH_SIZE = 10
BATCH_TIMEOUT = 2  # seconds to wait for the next queued request
OLLAMA_API_URL = "http://ollama-service:8000/inference"
batch_queue = asyncio.Queue()  # holds (prompt, future) pairs

async def batch_inference_worker():
    while True:
        batch = []
        try:
            while len(batch) < BATCH_SIZE:
                item = await asyncio.wait_for(batch_queue.get(), timeout=BATCH_TIMEOUT)
                batch.append(item)
        except asyncio.TimeoutError:
            pass
        if batch:
            prompts = [prompt for prompt, _ in batch]
            results = await call_ollama_batch(prompts)
            # Resolve each request's own future; keying futures by prompt
            # would collide when two requests carry the same prompt
            for (_, future), result in zip(batch, results):
                future.set_result(result)

async def call_ollama_batch(prompts: list):
    # Assumes the service accepts {"prompts": [...]} and returns {"results": [...]}
    async with httpx.AsyncClient() as client:
        response = await client.post(
            OLLAMA_API_URL,
            json={"prompts": prompts},
            timeout=30.0,
        )
        response.raise_for_status()
        return response.json()["results"]
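
The worker needs a producer side: each request enqueues its prompt together with a future and then awaits that future. The following self-contained sketch shows both sides with a simulated backend. MicroBatcher and fake_batch_call are hypothetical names, and the batching window is simplified: after the first request arrives, the worker waits once for max_wait seconds (unless a full batch is already queued) and then takes whatever is in the queue.

```python
import asyncio


async def fake_batch_call(prompts: list[str]) -> list[str]:
    """Stand-in for a batch inference request; returns one result per prompt."""
    await asyncio.sleep(0.01)
    return [p.upper() for p in prompts]


class MicroBatcher:
    def __init__(self, batch_call, max_size: int = 10, max_wait: float = 0.05) -> None:
        self._batch_call = batch_call
        self._max_size = max_size
        self._max_wait = max_wait
        self._queue: asyncio.Queue = asyncio.Queue()
        self.batch_sizes: list[int] = []  # recorded for inspection

    async def submit(self, prompt: str):
        """Called by request handlers: enqueue the prompt and await its result."""
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((prompt, future))
        return await future

    async def worker(self) -> None:
        while True:
            batch = [await self._queue.get()]  # block until the first request
            if self._queue.qsize() < self._max_size - 1:
                await asyncio.sleep(self._max_wait)  # let more requests arrive
            while len(batch) < self._max_size and not self._queue.empty():
                batch.append(self._queue.get_nowait())
            self.batch_sizes.append(len(batch))
            try:
                results = await self._batch_call([p for p, _ in batch])
            except Exception as exc:
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)  # fail the whole batch
                continue
            for (_, future), result in zip(batch, results):
                if not future.done():
                    future.set_result(result)


async def demo() -> tuple[list[str], list[int]]:
    batcher = MicroBatcher(fake_batch_call, max_size=4, max_wait=0.02)
    worker = asyncio.create_task(batcher.worker())
    results = await asyncio.gather(*(batcher.submit(f"p{i}") for i in range(6)))
    worker.cancel()
    try:
        await worker
    except asyncio.CancelledError:
        pass
    return list(results), batcher.batch_sizes
```

The trade-off is latency against throughput: a longer max_wait produces fuller batches but delays every request that arrives while the queue is short.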

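Dynamic cache invalidation with Redis

import json

# Assumes `redis` and `call_ollama` are available in this module, as defined in the earlier sections

async def call_ollama_with_cache(prompt: str):
    cache_key = f"inference:{prompt}"
    cached = await redis.get(cache_key)
    if cached:
        return json.loads(cached)
    result = await call_ollama(prompt)
    ttl = calculate_ttl(prompt)
    # Serialize before storing: Redis values must be strings or bytes
    await redis.setex(cache_key, ttl, json.dumps(result))
    return result

def calculate_ttl(prompt: str) -> int:
    # Prompts marked "urgent" are cached for 1 minute, everything else for 1 hour
    if "urgent" in prompt.lower():
        return 60
    return 3600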
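
When many entries are written at the same moment with the same TTL, they also expire at the same moment and the backend receives a burst of cache misses. A common mitigation, shown here as an optional sketch, is to add a small random spread to the TTL. The function ttl_with_jitter and its spread parameter are hypothetical and not part of the design above.

```python
import random


def ttl_with_jitter(base_ttl: int, spread: float = 0.1, rng=random) -> int:
    """Return base_ttl varied by up to +/- spread (a fraction), at least 1 second."""
    if base_ttl <= 0:
        raise ValueError("base_ttl must be positive")
    if not 0 <= spread < 1:
        raise ValueError("spread must be in [0, 1)")
    low = base_ttl * (1 - spread)
    high = base_ttl * (1 + spread)
    return max(1, round(rng.uniform(low, high)))
```

It would be applied to the value returned by calculate_ttl before calling setex, for example `ttl_with_jitter(calculate_ttl(prompt))`.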

(5) Monitoring and alerting

Integrate Prometheus in app/metrics.py.

import time

from flask import request
from prometheus_client import Counter, Histogram, start_http_server

from . import app

REQUEST_COUNT = Counter("request_count", "Total number of requests", ["status"])
REQUEST_LATENCY = Histogram("request_latency_seconds", "Request latency in seconds")

@app.before_request
def before_request():
    request.start_time = time.time()

@app.after_request
def after_request(response):
    process_time = time.time() - request.start_time
    status = "success" if response.status_code < 400 else "failure"
    REQUEST_COUNT.labels(status=status).inc()
    REQUEST_LATENCY.observe(process_time)
    return response

# Expose the metrics on port 8001 for Prometheus to scrape
start_http_server(8001)
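
To read the latency histogram correctly it helps to know that Prometheus histogram buckets are cumulative: each bucket counts all observations less than or equal to its upper bound. The sketch below illustrates that counting scheme in plain Python. CumulativeHistogram is a hypothetical class written for this explanation; it is not the prometheus_client implementation, and the bounds are arbitrary.

```python
import bisect


class CumulativeHistogram:
    """Counts observations into buckets and reports cumulative totals."""

    def __init__(self, bounds: list[float]) -> None:
        self.bounds = sorted(bounds)
        self._counts = [0] * (len(self.bounds) + 1)  # last slot is +Inf
        self.total = 0.0

    def observe(self, value: float) -> None:
        # Index of the first bound that is >= value, i.e. the smallest
        # bucket whose upper bound includes the observation
        index = bisect.bisect_left(self.bounds, value)
        self._counts[index] += 1
        self.total += value

    def cumulative(self) -> dict[float, int]:
        """Map each upper bound to the number of observations <= that bound."""
        result = {}
        running = 0
        for bound, count in zip(self.bounds + [float("inf")], self._counts):
            running += count
            result[bound] = running
        return result
```

For example, with bounds 0.1, 0.5, and 1.0, a request that took 0.3 s is counted in the 0.5, 1.0, and +Inf buckets of the cumulative view.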

(6) Static assets and templates

Provide the frontend pages and static assets in templates/ and static/.

Example template (`base.html`)

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Personal Blog</title>
    <link rel="stylesheet" href="{{ url_for('static', filename='css/style.css') }}">
</head>
<body>
    <header>
        <h1>My Personal Blog</h1>
    </header>
    <main>
        {% block content %}{% endblock %}
    </main>
</body>
</html>

4. Startup and deployment

(1) Startup script

Start the Flask application in run.py.

from app import app

if __name__ == '__main__':
    # debug=True is for local development only
    app.run(debug=True)

(2) Deployment suggestions

Start the Flask application with Gunicorn: gunicorn -w 4 app:app
Configure Nginx as a reverse proxy and load balancer.
Deploy Prometheus and Grafana for performance monitoring.
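
Gunicorn settings can also be kept in a Python configuration file instead of command-line flags. The fragment below is a sketch under assumptions: the file name gunicorn.conf.py, the bind address, and the timeout are placeholders, and the worker count follows the commonly cited (2 x cores) + 1 starting point rather than a measured value.

```python
# gunicorn.conf.py (assumed file name), used as: gunicorn -c gunicorn.conf.py app:app
import multiprocessing

# Address Nginx forwards to; placeholder value
bind = "127.0.0.1:8000"

# Starting point for the number of worker processes; tune under real load
workers = multiprocessing.cpu_count() * 2 + 1

# Seconds before an unresponsive worker is restarted; kept above the
# 30-second inference timeout used earlier
timeout = 60

# "-" sends the access log to stdout
accesslog = "-"
```

With this file in place the -w flag is no longer needed, since workers is read from the configuration.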

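5. Summary

The design above yields a complete personal blog site that integrates a high-concurrency AI inference service (FastAPI + OLLAMA) and applies batch inference, dynamic cache invalidation, and monitoring and alerting. The architecture is suited to small and medium-sized blog applications that need to handle concurrent inference requests.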