
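A complete implementation of building, experimenting with, and evaluating a recommender system, providing a reproducible framework for comparing different recommendation algorithms on the same dataset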

{"cells": [{"cell_type": "markdown","metadata": {},"source": ["# User-Based Collaborative Filtering"]},
{"cell_type": "code","execution_count": 1,"metadata": {},"outputs": [],"source": ["# Import packages\n","import random\n","import math\n","import time\n","from tqdm import tqdm"]},
{"cell_type": "markdown","metadata": {},"source": ["## I. Common Function Definitions"]},
{"cell_type": "code","execution_count": 2,"metadata": {},"outputs": [],"source": ["# Decorator that reports a function's run time\n","def timmer(func):\n","    def wrapper(*args, **kwargs):\n","        start_time = time.time()\n","        res = func(*args, **kwargs)\n","        stop_time = time.time()\n","        print('Func %s, run time: %s' % (func.__name__, stop_time - start_time))\n","        return res\n","    return wrapper"]},
{"cell_type": "markdown","metadata": {},"source": ["### 1. Data Processing\n","1. load data\n","2. split data"]},
{"cell_type": "code","execution_count": 3,"metadata": {},"outputs": [],"source": ["class Dataset():\n","    \n","    def __init__(self, fp):\n","        # fp: data file path\n","        self.data = self.loadData(fp)\n","    \n","    @timmer\n","    def loadData(self, fp):\n","        data = []\n","        # Each line is '::'-separated; keep only the first two fields (user, item)\n","        with open(fp) as f:\n","            for l in f:\n","                data.append(tuple(map(int, l.strip().split('::')[:2])))\n","        return data\n","    \n","    @timmer\n","    def splitData(self, M, k, seed=1):\n","        '''\n","        :params: data, all loaded (user, item) records\n","        :params: M, number of folds; results are averaged over the M folds\n","        :params: k, index of the current fold, k in [0, M)\n","        :params: seed, random seed; must be the same for every k\n","        :return: train, test\n","        '''\n","        train, test = [], []\n","        random.seed(seed)\n","        for user, item in self.data:\n","            # Differs from the book: M-1 is used because randint includes both endpoints\n","            if random.randint(0, M-1) == k:  \n","                test.append((user, item))\n","            else:\n","                train.append((user, item))\n","\n","        # Convert to a dict mapping user -> list of unique items\n","        def convert_dict(data):\n","            data_dict = {}\n","            for user, item in data:\n","                if user not in data_dict:\n","                    data_dict[user] = set()\n","                data_dict[user].add(item)\n","            data_dict = {k: list(data_dict[k]) for k in data_dict}\n","            return data_dict\n","\n","        return convert_dict(train), convert_dict(test)"]},
{"cell_type": "markdown","metadata": {},"source": ["### 2. Evaluation Metrics\n","1. Precision\n","2. Recall\n","3. Coverage\n","4. Popularity(Novelty)"]},
{"cell_type": "code","execution_count": 4,"metadata": {},"outputs": [],"source": ["class Metric():\n","    \n","    def __init__(self, train, test, GetRecommendation):\n","        '''\n","        :params: train, training data\n","        :params: test, test data\n","        :params: GetRecommendation, function that returns the recommended items for a given user\n","        '''\n","        self.train = train\n","        self.test = test\n","        self.GetRecommendation = GetRecommendation\n","        self.recs = self.getRec()\n","        \n","    # Generate recommendations for every user in test\n","    def getRec(self):\n","        recs = {}\n","        for user in self.test:\n","            rank = self.GetRecommendation(user)\n","            recs[user] = rank\n","        return recs\n","        \n","    # Precision: share of recommended items that appear in the user's test set\n","    def precision(self):\n","        all, hit = 0, 0\n","        for user in self.test:\n","            test_items = set(self.test[user])\n","            rank = self.recs[user]\n","            for item, score in rank:\n","                if item in test_items:\n","                    hit += 1\n","            all += len(rank)\n","        return round(hit / all * 100, 2)\n","    \n","    # Recall: share of test-set items that were recommended\n","    def recall(self):\n","        all, hit = 0, 0\n","        for user in self.test:\n","            test_items = set(self.test[user])\n","            rank = self.recs[user]\n","            for item, score in rank:\n","                if item in test_items:\n","                    hit += 1\n","            all += len(test_items)\n","        return round(hit / all * 100, 2)\n","    \n","    # Coverage: share of training items that appear in some recommendation list\n","    def coverage(self):\n","        all_item, recom_item = set(), set()\n","        for user in self.test:\n","            for item in self.train[user]:\n","                all_item.add(item)\n","            rank = self.recs[user]\n","            for item, score in rank:\n","                recom_item.add(item)\n","        return round(len(recom_item) / len(all_item) * 100, 2)\n","    \n","    # Popularity (novelty): mean log-popularity of the recommended items\n","    def popularity(self):\n","        # Count how popular each item is in the training data\n","        item_pop = {}\n","        for user in self.train:\n","            for item in self.train[user]:\n","                if item not in item_pop:\n","                    item_pop[item] = 0\n","                item_pop[item] += 1\n","\n","        num, pop = 0, 0\n","        for user in self.test:\n","            rank = self.recs[user]\n","            for item, score in rank:\n","                # Take the log so that, given the long tail, popular items do not dominate\n","                pop += math.log(1 + item_pop[item])\n","                num += 1\n","        return round(pop / num, 6)\n","    \n","    def eval(self):\n","        metric = {'Precision': self.precision(),\n","                  'Recall': self.recall(),\n","                  'Coverage': self.coverage(),\n","                  'Popularity': self.popularity()}\n","        print('Metric:', metric)\n","        return metric"]},
{"cell_type": "markdown","metadata": {},"source": ["## II. Algorithm Implementations\n","1. Random\n","2. MostPopular\n","3. UserCF\n","4. UserIIF"]},
{"cell_type": "code","execution_count": 5,"metadata": {},"outputs": [],"source": ["# 1. Random recommendation\n","def Random(train, K, N):\n","    '''\n","    :params: train, training dataset\n","    :params: K, unused\n","    :params: N, hyperparameter: number of items in the top-N recommendation list\n","    :return: GetRecommendation, the recommendation function\n","    '''\n","    items = {}\n","    for user in train:\n","        for item in train[user]:\n","            items[item] = 1\n","    \n","    def GetRecommendation(user):\n","        # Recommend N random unseen items\n","        user_items = set(train[user])\n","        rec_items = {k: items[k] for k in items if k not in user_items}\n","        rec_items = list(rec_items.items())\n","        random.shuffle(rec_items)\n","        return rec_items[:N]\n","    \n","    return GetRecommendation"]},
{"cell_type": "code","execution_count": 6,"metadata": {},"outputs": [],"source": ["# 2. Most-popular recommendation\n","def MostPopular(train, K, N):\n","    '''\n","    :params: train, training dataset\n","    :params: K, unused\n","    :params: N, hyperparameter: number of items in the top-N recommendation list\n","    :return: GetRecommendation, the recommendation function\n","    '''\n","    items = {}\n","    for user in train:\n","        for item in train[user]:\n","            if item not in items:\n","                items[item] = 0\n","            items[item] += 1\n","        \n","    def GetRecommendation(user):\n","        # Recommend the N most popular unseen items\n","        user_items = set(train[user])\n","        rec_items = {k: items[k] for k in items if k not in user_items}\n","        rec_items = list(sorted(rec_items.items(), key=lambda x: x[1], reverse=True))\n","        return rec_items[:N]\n","    \n","    return GetRecommendation"]},
{"cell_type": "code","execution_count": 7,"metadata": {},"outputs": [],"source": ["# 3. Recommendation based on user cosine similarity\n","def UserCF(train, K, N):\n","    '''\n","    :params: train, training dataset\n","    :params: K, hyperparameter: number of most similar users (top-K) to use\n","    :params: N, hyperparameter: number of items in the top-N recommendation list\n","    :return: GetRecommendation, the recommendation function\n","    '''\n","    # Build the item -> users inverted index\n","    item_users = {}\n","    for user in train:\n","        for item in train[user]:\n","            if item not in item_users:\n","                item_users[item] = []\n","            item_users[item].append(user)\n","    \n","    # Compute the user similarity matrix\n","    sim = {}\n","    num = {}\n","    for item in item_users:\n","        users = item_users[item]\n","        for i in range(len(users)):\n","            u = users[i]\n","            if u not in num:\n","                num[u] = 0\n","            num[u] += 1\n","            if u not in sim:\n","                sim[u] = {}\n","            for j in range(len(users)):\n","                if j == i: continue\n","                v = users[j]\n","                if v not in sim[u]:\n","                    sim[u][v] = 0\n","                sim[u][v] += 1\n","    for u in sim:\n","        for v in sim[u]:\n","            sim[u][v] /= math.sqrt(num[u] * num[v])\n","    \n","    # Sort by similarity\n","    sorted_user_sim = {k: list(sorted(v.items(), \\\n","                               key=lambda x: x[1], reverse=True)) \\\n","                       for k, v in sim.items()}\n","    \n","    # Recommendation function\n","    def GetRecommendation(user):\n","        items = {}\n","        seen_items = set(train[user])\n","        for u, _ in sorted_user_sim[user][:K]:\n","            for item in train[u]:\n","                # Skip items the user has already seen\n","                if item not in seen_items:\n","                    if item not in items:\n","                        items[item] = 0\n","                    items[item] += sim[user][u]\n","        recs = list(sorted(items.items(), key=lambda x: x[1], reverse=True))[:N]\n","        return recs\n","    \n","    return GetRecommendation"]},
{"cell_type": "code","execution_count": 8,"metadata": {},"outputs": [],"source": ["# 4. Recommendation based on the improved user cosine similarity\n","def UserIIF(train, K, N):\n","    '''\n","    :params: train, training dataset\n","    :params: K, hyperparameter: number of most similar users (top-K) to use\n","    :params: N, hyperparameter: number of items in the top-N recommendation list\n","    :return: GetRecommendation, the recommendation function\n","    '''\n","    # Build the item -> users inverted index\n","    item_users = {}\n","    for user in train:\n","        for item in train[user]:\n","            if item not in item_users:\n","                item_users[item] = []\n","            item_users[item].append(user)\n","    \n","    # Compute the user similarity matrix\n","    sim = {}\n","    num = {}\n","    for item in item_users:\n","        users = item_users[item]\n","        for i in range(len(users)):\n","            u = users[i]\n","            if u not in num:\n","                num[u] = 0\n","            num[u] += 1\n","            if u not in sim:\n","                sim[u] = {}\n","            for j in range(len(users)):\n","                if j == i: continue\n","                v = users[j]\n","                if v not in sim[u]:\n","                    sim[u][v] = 0\n","                # The main change from UserCF is here\n","                sim[u][v] += 1 / math.log(1 + len(users))\n","    for u in sim:\n","        for v in sim[u]:\n","            sim[u][v] /&#
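
The notebook source breaks off in the middle of the UserIIF normalisation step. As a rough, standalone illustration of the idea visible up to that point (co-occurrences are down-weighted by `1 / math.log(1 + len(users))` so that popular items count for less), the sketch below computes the similarity on a tiny dataset. The function name `user_iif_similarity` and the toy data are hypothetical, and the final normalisation is assumed to mirror the one in `UserCF`, since the original line is truncated.

```python
import math
from collections import defaultdict


def user_iif_similarity(train):
    """Return {u: {v: similarity}} for a user -> list-of-items mapping."""
    # Inverted index: item -> users who interacted with it
    item_users = defaultdict(list)
    for user, items in train.items():
        for item in items:
            item_users[item].append(user)

    sim = defaultdict(lambda: defaultdict(float))
    num = defaultdict(int)  # number of items each user interacted with
    for users in item_users.values():
        # Items shared by many users contribute less to the similarity
        weight = 1 / math.log(1 + len(users))
        for u in users:
            num[u] += 1
            for v in users:
                if u != v:
                    sim[u][v] += weight

    # ASSUMPTION: same cosine-style normalisation as UserCF
    return {u: {v: w / math.sqrt(num[u] * num[v]) for v, w in nbrs.items()}
            for u, nbrs in sim.items()}


# Hypothetical toy data: every item is shared by exactly two users
toy_train = {
    'A': ['a', 'b', 'd'],
    'B': ['a', 'c'],
    'C': ['b', 'e'],
    'D': ['c', 'd', 'e'],
}
toy_sim = user_iif_similarity(toy_train)
print(toy_sim['A'])
```

Users who share no item (here `B` and `C`) simply have no entry in each other's similarity dictionary, which matches how the notebook's `sim` dictionaries are built.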
