当前位置: 首页 > news >正文

完整地实现了推荐系统的构建、实验和评估过程,为不同推荐算法在同一数据集上的性能比较提供了可重复实验的框架

{"cells": [{"cell_type": "markdown","metadata": {},"source": ["# 基于用户的协同过滤算法"]},{"cell_type": "code","execution_count": 1,"metadata": {},"outputs": [],"source": ["# 导入包\n","import random\n","import math\n","import time\n","from tqdm import tqdm"]},{"cell_type": "markdown","metadata": {},"source": ["## 一. 通用函数定义"]},{"cell_type": "code","execution_count": 2,"metadata": {},"outputs": [],"source": ["# 定义装饰器,监控运行时间\n","def timmer(func):\n","    def wrapper(*args, **kwargs):\n","        start_time = time.time()\n","        res = func(*args, **kwargs)\n","        stop_time = time.time()\n","        print('Func %s, run time: %s' % (func.__name__, stop_time - start_time))\n","        return res\n","    return wrapper"]},{"cell_type": "markdown","metadata": {},"source": ["### 1. 数据处理相关\n","1. load data\n","2. split data"]},{"cell_type": "code","execution_count": 3,"metadata": {},"outputs": [],"source": ["class Dataset():\n","    \n","    def __init__(self, fp):\n","        # fp: data file path\n","        self.data = self.loadData(fp)\n","    \n","    @timmer\n","    def loadData(self, fp):\n","        data = []\n","        for l in open(fp):\n","            data.append(tuple(map(int, l.strip().split('::')[:2])))\n","        return data\n","    \n","    @timmer\n","    def splitData(self, M, k, seed=1):\n","        '''\n","        :params: data, 加载的所有(user, item)数据条目\n","        :params: M, 划分的数目,最后需要取M折的平均\n","        :params: k, 本次是第几次划分,k~[0, M)\n","        :params: seed, random的种子数,对于不同的k应设置成一样的\n","        :return: train, test\n","        '''\n","        train, test = [], []\n","        random.seed(seed)\n","        for user, item in self.data:\n","            # 这里与书中的不一致,本人认为取M-1较为合理,因randint是左右都覆盖的\n","            if random.randint(0, M-1) == k:  \n","                test.append((user, item))\n","            else:\n","                train.append((user, item))\n","\n","        # 处理成字典的形式,user->set(items)\n","        def convert_dict(data):\n","            data_dict = {}\n","            for user, item in data:\n","                if user not in data_dict:\n","                    data_dict[user] = set()\n","                data_dict[user].add(item)\n","            data_dict = {k: list(data_dict[k]) for k in data_dict}\n","            return data_dict\n","\n","        return convert_dict(train), convert_dict(test)"]},{"cell_type": "markdown","metadata": {},"source": ["### 2. 评价指标\n","1. Precision\n","2. Recall\n","3. Coverage\n","4. Popularity(Novelty)"]},{"cell_type": "code","execution_count": 4,"metadata": {},"outputs": [],"source": ["class Metric():\n","    \n","    def __init__(self, train, test, GetRecommendation):\n","        '''\n","        :params: train, 训练数据\n","        :params: test, 测试数据\n","        :params: GetRecommendation, 为某个用户获取推荐物品的接口函数\n","        '''\n","        self.train = train\n","        self.test = test\n","        self.GetRecommendation = GetRecommendation\n","        self.recs = self.getRec()\n","        \n","    # 为test中的每个用户进行推荐\n","    def getRec(self):\n","        recs = {}\n","        for user in self.test:\n","            rank = self.GetRecommendation(user)\n","            recs[user] = rank\n","        return recs\n","        \n","    # 定义精确率指标计算方式\n","    def precision(self):\n","        all, hit = 0, 0\n","        for user in self.test:\n","            test_items = set(self.test[user])\n","            rank = self.recs[user]\n","            for item, score in rank:\n","                if item in test_items:\n","                    hit += 1\n","            all += len(rank)\n","        return round(hit / all * 100, 2)\n","    \n","    # 定义召回率指标计算方式\n","    def recall(self):\n","        all, hit = 0, 0\n","        for user in self.test:\n","            test_items = set(self.test[user])\n","            rank = self.recs[user]\n","            for item, score in rank:\n","                if item in test_items:\n","                    hit += 1\n","            all += len(test_items)\n","        return round(hit / all * 100, 2)\n","    \n","    # 定义覆盖率指标计算方式\n","    def coverage(self):\n","        all_item, recom_item = set(), set()\n","        for user in self.test:\n","            for item in self.train[user]:\n","                all_item.add(item)\n","            rank = self.recs[user]\n","            for item, score in rank:\n","                recom_item.add(item)\n","        return round(len(recom_item) / len(all_item) * 100, 2)\n","    \n","    # 定义新颖度指标计算方式\n","    def popularity(self):\n","        # 计算物品的流行度\n","        item_pop = {}\n","        for user in self.train:\n","            for item in self.train[user]:\n","                if item not in item_pop:\n","                    item_pop[item] = 0\n","                item_pop[item] += 1\n","\n","        num, pop = 0, 0\n","        for user in self.test:\n","            rank = self.recs[user]\n","            for item, score in rank:\n","                # 取对数,防止因长尾问题带来的被流行物品所主导\n","                pop += math.log(1 + item_pop[item])\n","                num += 1\n","        return round(pop / num, 6)\n","    \n","    def eval(self):\n","        metric = {'Precision': self.precision(),\n","                  'Recall': self.recall(),\n","                  'Coverage': self.coverage(),\n","                  'Popularity': self.popularity()}\n","        print('Metric:', metric)\n","        return metric"]},{"cell_type": "markdown","metadata": {},"source": ["## 二. 算法实现\n","1. Random\n","2. MostPopular\n","3. UserCF\n","4. UserIIF"]},{"cell_type": "code","execution_count": 5,"metadata": {},"outputs": [],"source": ["# 1. 随机推荐\n","def Random(train, K, N):\n","    '''\n","    :params: train, 训练数据集\n","    :params: K, 可忽略\n","    :params: N, 超参数,设置取TopN推荐物品数目\n","    :return: GetRecommendation,推荐接口函数\n","    '''\n","    items = {}\n","    for user in train:\n","        for item in train[user]:\n","            items[item] = 1\n","    \n","    def GetRecommendation(user):\n","        # 随机推荐N个未见过的\n","        user_items = set(train[user])\n","        rec_items = {k: items[k] for k in items if k not in user_items}\n","        rec_items = list(rec_items.items())\n","        random.shuffle(rec_items)\n","        return rec_items[:N]\n","    \n","    return GetRecommendation"]},{"cell_type": "code","execution_count": 6,"metadata": {},"outputs": [],"source": ["# 2. 热门推荐\n","def MostPopular(train, K, N):\n","    '''\n","    :params: train, 训练数据集\n","    :params: K, 可忽略\n","    :params: N, 超参数,设置取TopN推荐物品数目\n","    :return: GetRecommendation, 推荐接口函数\n","    '''\n","    items = {}\n","    for user in train:\n","        for item in train[user]:\n","            if item not in items:\n","                items[item] = 0\n","            items[item] += 1\n","        \n","    def GetRecommendation(user):\n","        # 随机推荐N个没见过的最热门的\n","        user_items = set(train[user])\n","        rec_items = {k: items[k] for k in items if k not in user_items}\n","        rec_items = list(sorted(rec_items.items(), key=lambda x: x[1], reverse=True))\n","        return rec_items[:N]\n","    \n","    return GetRecommendation"]},{"cell_type": "code","execution_count": 7,"metadata": {},"outputs": [],"source": ["# 3. 基于用户余弦相似度的推荐\n","def UserCF(train, K, N):\n","    '''\n","    :params: train, 训练数据集\n","    :params: K, 超参数,设置取TopK相似用户数目\n","    :params: N, 超参数,设置取TopN推荐物品数目\n","    :return: GetRecommendation, 推荐接口函数\n","    '''\n","    # 计算item->user的倒排索引\n","    item_users = {}\n","    for user in train:\n","        for item in train[user]:\n","            if item not in item_users:\n","                item_users[item] = []\n","            item_users[item].append(user)\n","    \n","    # 计算用户相似度矩阵\n","    sim = {}\n","    num = {}\n","    for item in item_users:\n","        users = item_users[item]\n","        for i in range(len(users)):\n","            u = users[i]\n","            if u not in num:\n","                num[u] = 0\n","            num[u] += 1\n","            if u not in sim:\n","                sim[u] = {}\n","            for j in range(len(users)):\n","                if j == i: continue\n","                v = users[j]\n","                if v not in sim[u]:\n","                    sim[u][v] = 0\n","                sim[u][v] += 1\n","    for u in sim:\n","        for v in sim[u]:\n","            sim[u][v] /= math.sqrt(num[u] * num[v])\n","    \n","    # 按照相似度排序\n","    sorted_user_sim = {k: list(sorted(v.items(), \\\n","                               key=lambda x: x[1], reverse=True)) \\\n","                       for k, v in sim.items()}\n","    \n","    # 获取接口函数\n","    def GetRecommendation(user):\n","        items = {}\n","        seen_items = set(train[user])\n","        for u, _ in sorted_user_sim[user][:K]:\n","            for item in train[u]:\n","                # 要去掉用户见过的\n","                if item not in seen_items:\n","                    if item not in items:\n","                        items[item] = 0\n","                    items[item] += sim[user][u]\n","        recs = list(sorted(items.items(), key=lambda x: x[1], reverse=True))[:N]\n","        return recs\n","    \n","    return GetRecommendation"]},{"cell_type": "code","execution_count": 8,"metadata": {},"outputs": [],"source": ["# 4. 基于改进的用户余弦相似度的推荐\n","def UserIIF(train, K, N):\n","    '''\n","    :params: train, 训练数据集\n","    :params: K, 超参数,设置取TopK相似用户数目\n","    :params: N, 超参数,设置取TopN推荐物品数目\n","    :return: GetRecommendation, 推荐接口函数\n","    '''\n","    # 计算item->user的倒排索引\n","    item_users = {}\n","    for user in train:\n","        for item in train[user]:\n","            if item not in item_users:\n","                item_users[item] = []\n","            item_users[item].append(user)\n","    \n","    # 计算用户相似度矩阵\n","    sim = {}\n","    num = {}\n","    for item in item_users:\n","        users = item_users[item]\n","        for i in range(len(users)):\n","            u = users[i]\n","            if u not in num:\n","                num[u] = 0\n","            num[u] += 1\n","            if u not in sim:\n","                sim[u] = {}\n","            for j in range(len(users)):\n","                if j == i: continue\n","                v = users[j]\n","                if v not in sim[u]:\n","                    sim[u][v] = 0\n","                # 相比UserCF,主要是改进了这里\n","                sim[u][v] += 1 / math.log(1 + len(users))\n","    for u in sim:\n","        for v in sim[u]:\n","            sim[u][v] /&#

相关文章:

完整地实现了推荐系统的构建、实验和评估过程,为不同推荐算法在同一数据集上的性能比较提供了可重复实验的框架

{"cells": [{"cell_type": "markdown","metadata": {},"source": ["# 基于用户的协同过滤算法"]},{"cell_type": "code","execution_count": 1,"metadata": {},"ou…...

DRV8311三相PWM无刷直流电机驱动器

1 特性 • 三相 PWM 电机驱动器 – 三相无刷直流电机 • 3V 至 20V 工作电压 – 24V 绝对最大电压 • 高输出电流能力 – 5A 峰值电流驱动能力 • 低导通状态电阻 MOSFET – TA 25C 时,RDS(ON) (HS LS) 为210mΩ(典型值) • 低功耗睡眠模式…...

Mysql--运维篇--备份和恢复(逻辑备份,mysqldump,物理备份,热备份,温备份,冷备份,二进制文件备份和恢复等)

MySQL 提供了多种备份方式,每种方式适用于不同的场景和需求。根据备份的粒度、速度、恢复时间和对数据库的影响,可以选择合适的备份策略。主要备份方式有三大类:逻辑备份(mysqldump),物理备份和二进制文件备…...

机器学习-归一化

文章目录 一. 归一化二. 归一化的常见方法1. 最小-最大归一化 (Min-Max Normalization)2. Z-Score 归一化(标准化)3. MaxAbs 归一化 三. 归一化的选择四. 为什么要进行归一化1. 消除量纲差异2. 提高模型训练速度3. 增强模型的稳定性4. 保证正则化项的有效…...

Linux 串口检查状态的实用方法

在 Linux 系统中,串口通信是非常常见的操作,尤其在嵌入式系统、工业设备以及其他需要串行通信的场景中。为了确保串口设备的正常工作,检查串口的连接状态和配置信息是非常重要的。本篇文章将介绍如何在 Linux 上检查串口的连接状态&#xff0…...

Qt的核心机制概述

Qt的核心机制概述 1. 元对象系统(The Meta-Object System) 基本概念:元对象系统是Qt的核心机制之一,它通过moc(Meta-Object Compiler)工具为继承自QObject的类生成额外的代码,从而扩展了C语言…...

微调神经机器翻译模型全流程

MBART: Multilingual Denoising Pre-training for Neural Machine Translation 模型下载 mBART 是一个基于序列到序列的去噪自编码器,使用 BART 目标在多种语言的大规模单语语料库上进行预训练。mBART 是首批通过去噪完整文本在多种语言上预训练序列到序列模型的方…...

Cesium加载地形

Cesium的地形来源大致可以分为两种,一种是由Cesium官方提供的数据源,一种是第三方的数据源,官方源依赖于Cesium Assets,如果设置了AccessToken后,就可以直接使用Cesium的地形静态构造方法来获取数据源CesiumTerrainPro…...

gitlab runner正常连接 提示 作业挂起中,等待进入队列 解决办法

方案1 作业挂起中,等待进入队列 重启gitlab-runner gitlab-runner stop gitlab-runner start gitlab-runner run方案2 启动 gitlab-runner 服务 gitlab-runner start成功启动如下 [rootdocserver home]# gitlab-runner start Runtime platform …...

C#对动态加载的DLL进行依赖注入,并对DLL注入服务

文章目录 什么是依赖注入概念常用的依赖注入实现什么是动态加载定义示例对动态加载的DLL进行依赖注入什么是依赖注入 概念 依赖注入(Dependency Injection,简称 DI)是一种软件设计模式,用于解耦软件组件之间的依赖关系。在 C# 开发中,它主要解决的是类与类之间的强耦合问题…...

HDMI接口

HDMI接口 前言各版本区别概述(Overview)接口接口类型Type A/E 引脚定义Type B 引脚定义Type C 引脚定义Type D 引脚定义 传输流程概述Control Period前导码字符边界同步Control Period 编/解码 Data Island PeriodLeading/Trailing Guard BandTERC4 编/解…...

A/B 测试:玩转假设检验、t 检验与卡方检验

一、背景:当“审判”成为科学 1.1 虚拟场景——法庭审判 想象这样一个场景:有一天,你在王国里担任“首席审判官”。你面前站着一位嫌疑人,有人指控他说“偷了国王珍贵的金冠”。但究竟是他干的,还是他是被冤枉的&…...

第143场双周赛:最小可整除数位乘积 Ⅰ、执行操作后元素的最高频率 Ⅰ、执行操作后元素的最高频率 Ⅱ、最小可整除数位乘积 Ⅱ

Q1、最小可整除数位乘积 Ⅰ 1、题目描述 给你两个整数 n 和 t 。请你返回大于等于 n 的 最小 整数,且该整数的 各数位之积 能被 t 整除。 2、解题思路 问题拆解: 题目要求我们找到一个整数,其 数位的积 可以被 t 整除。 数位的积 是指将数…...

【STM32】LED状态翻转函数

1.利用状态标志位控制LED状态翻转 在平常编写LED状态翻转函数时,通常利用状态标志位实现LED状态的翻转。如下所示: unsigned char led_turn_flag; //LED状态标志位,1-点亮,0-熄灭/***************************************函…...

uniapp 小程序 textarea 层级穿透,聚焦光标位置错误怎么办?

前言 在开发微信小程序时,使用 textarea 组件可能会遇到一些棘手的问题。最近我在使用 uniapp 开发微信小程序时,就遇到了两个非常令人头疼的问题: 层级穿透:由于 textarea 是原生组件,任何元素都无法遮盖住它。当其…...

汽车 SOA 架构下的信息安全新问题及对策漫谈

摘要:随着汽车行业的快速发展,客户和制造商对车辆功能的新需求促使汽车架构从面向信号向面向服务的架构(SOA)转变。本文详细阐述了汽车 SOA 架构的协议、通信模式,并与传统架构进行对比,深入分析了 SOA 给信…...

Unity-Mirror网络框架-从入门到精通之RigidbodyPhysics示例

文章目录 前言示例一、球体的基础配置二、三个球体的设置差异三、示例意图LatencySimulation前言 在现代游戏开发中,网络功能日益成为提升游戏体验的关键组成部分。本系列文章将为读者提供对Mirror网络框架的深入了解,涵盖从基础到高级的多个主题。Mirror是一个用于Unity的开…...

小程序如何引入腾讯位置服务

小程序如何引入腾讯位置服务 1.添加服务 登录 微信公众平台 注意:小程序要企业版的 第三方服务 -> 服务 -> 开发者资源 -> 开通腾讯位置服务 在设置 -> 第三方设置 中可以看到开通的服务,如果没有就在插件管理中添加插件 2.腾讯位置服务…...

H3CNE-12-静态路由(一)

静态路由应用场景: 静态路由是指由管理员手动配置和维护的路由 路由表:路由器用来妆发数据包的一张“地图” 查看命令: dis ip routing-table 直连路由:接口配置好IP地址并UP后自动生成的路由 静态路由配置: ip…...

多线程锁

在并发编程中,锁(Lock)是一种用于控制多个线程对共享资源访问的机制。正确使用锁可以确保数据的一致性和完整性,避免出现竞态条件(Race Condition)、死锁(Deadlock)等问题。Java 提供…...

盘古信息PCB行业解决方案:以全域场景重构,激活智造新未来

一、破局:PCB行业的时代之问 在数字经济蓬勃发展的浪潮中,PCB(印制电路板)作为 “电子产品之母”,其重要性愈发凸显。随着 5G、人工智能等新兴技术的加速渗透,PCB行业面临着前所未有的挑战与机遇。产品迭代…...

2025年能源电力系统与流体力学国际会议 (EPSFD 2025)

2025年能源电力系统与流体力学国际会议(EPSFD 2025)将于本年度在美丽的杭州盛大召开。作为全球能源、电力系统以及流体力学领域的顶级盛会,EPSFD 2025旨在为来自世界各地的科学家、工程师和研究人员提供一个展示最新研究成果、分享实践经验及…...

力扣-35.搜索插入位置

题目描述 给定一个排序数组和一个目标值,在数组中找到目标值,并返回其索引。如果目标值不存在于数组中,返回它将会被按顺序插入的位置。 请必须使用时间复杂度为 O(log n) 的算法。 class Solution {public int searchInsert(int[] nums, …...

【Java学习笔记】BigInteger 和 BigDecimal 类

BigInteger 和 BigDecimal 类 二者共有的常见方法 方法功能add加subtract减multiply乘divide除 注意点:传参类型必须是类对象 一、BigInteger 1. 作用:适合保存比较大的整型数 2. 使用说明 创建BigInteger对象 传入字符串 3. 代码示例 import j…...

用机器学习破解新能源领域的“弃风”难题

音乐发烧友深有体会,玩音乐的本质就是玩电网。火电声音偏暖,水电偏冷,风电偏空旷。至于太阳能发的电,则略显朦胧和单薄。 不知你是否有感觉,近两年家里的音响声音越来越冷,听起来越来越单薄? —…...

DBLP数据库是什么?

DBLP(Digital Bibliography & Library Project)Computer Science Bibliography是全球著名的计算机科学出版物的开放书目数据库。DBLP所收录的期刊和会议论文质量较高,数据库文献更新速度很快,很好地反映了国际计算机科学学术研…...

DeepSeek源码深度解析 × 华为仓颉语言编程精粹——从MoE架构到全场景开发生态

前言 在人工智能技术飞速发展的今天,深度学习与大模型技术已成为推动行业变革的核心驱动力,而高效、灵活的开发工具与编程语言则为技术创新提供了重要支撑。本书以两大前沿技术领域为核心,系统性地呈现了两部深度技术著作的精华:…...

comfyui 工作流中 图生视频 如何增加视频的长度到5秒

comfyUI 工作流怎么可以生成更长的视频。除了硬件显存要求之外还有别的方法吗? 在ComfyUI中实现图生视频并延长到5秒,需要结合多个扩展和技巧。以下是完整解决方案: 核心工作流配置(24fps下5秒120帧) #mermaid-svg-yP…...

数据库正常,但后端收不到数据原因及解决

从代码和日志来看,后端SQL查询确实返回了数据,但最终user对象却为null。这表明查询结果没有正确映射到User对象上。 在前后端分离,并且ai辅助开发的时候,很容易出现前后端变量名不一致情况,还不报错,只是单…...

深入理解 React 样式方案

React 的样式方案较多,在应用开发初期,开发者需要根据项目业务具体情况选择对应样式方案。React 样式方案主要有: 1. 内联样式 2. module css 3. css in js 4. tailwind css 这些方案中,均有各自的优势和缺点。 1. 方案优劣势 1. 内联样式: 简单直观,适合动态样式和…...