Lecture 20: Topic Modelling
Contents
- Topic Modelling
- A Brief History of Topic Models
- LDA
- Evaluation
- Conclusion
Topic Modelling
- Making sense of text
- English Wikipedia: 6M articles
- Twitter: 500M tweets per day
- New York Times: 15M articles
- arXiv: 1M articles
- What can we do if we want to learn something about these document collections?
- Questions
- What are the less popular topics on Wikipedia?
- What are the big trends on Twitter in the past month?
- How did social issues evolve over time in the New York Times from the 1900s to the 2000s?
- What are some influential research areas?
- Topic models to the rescue
- Topic models learn common, overlapping themes in a document collection
- Unsupervised model
- No labels; input is just the documents!
- What’s the output of a topic model?
- Topics: each topic associated with a list of words
- Topic assignments: each document associated with a list of topics
- What do topics look like?
- A list of words
- Collectively describes a concept or subject
- Words of a topic typically appear in the same set of documents in the corpus (i.e. the words overlap across documents)
- Wikipedia topics (broad)
- Twitter topics (short, conversational)
- New York Times topics
- Applications of topic models
- Personalised advertising (e.g. based on the types of products a user buys)
- Search engines
- Discovering senses of polysemous words (e.g. apple: the fruit and the company fall into two different clusters)
A Brief History of Topic Models
- Latent semantic analysis (LSA)
- LSA: apply truncated SVD to the term-document matrix and keep only the top-k dimensions
- Issues:
- $U$ and $V^T$ contain both positive and negative values
- Negative values make the dimensions difficult to interpret
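As a concrete illustration of the truncation step, here is a minimal LSA sketch that applies scikit-learn's TruncatedSVD to a TF-IDF term-document matrix. The toy corpus, the choice of 2 latent dimensions, and all variable names are assumptions made for this example, not part of the lecture.

```python
# Minimal LSA sketch: TF-IDF term-document matrix + truncated SVD.
# The toy corpus and k=2 are illustrative choices, not from the lecture.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "farmers grow rice on the farm",
    "the company released a new phone",
    "agriculture and food production on farms",
    "the phone company announced new software",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)          # documents x terms

svd = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = svd.fit_transform(X)          # documents x latent dimensions

# Each latent dimension mixes positive and negative term weights,
# which is exactly what makes LSA dimensions hard to interpret.
terms = vectorizer.get_feature_names_out()
for i, component in enumerate(svd.components_):
    top = component.argsort()[::-1][:5]
    print(f"dimension {i}:", [(terms[j], round(component[j], 2)) for j in top])
```

Inspecting `svd.components_` shows both positive and negative weights per dimension, which is the interpretability issue noted above.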
- Probabilistic LSA (PLSA)
- Based on a probabilistic model, so there are no more negative values
- Issues:
- PLSA can learn topics and topic assignments for documents in the training corpus
- But it cannot infer the topic distribution of new documents
- PLSA needs to be re-trained for new documents
- Latent Dirichlet allocation (LDA)
- Introduces priors over the document-topic and topic-word distributions
- Fully generative: a trained LDA model can infer topics for unseen documents!
- LDA is a Bayesian version of PLSA
LDA
- LDA
- Core idea: assume each document contains a mix of topics
- But the topic structure is hidden (latent)
- LDA infers the topic structure given the observed words and documents
- LDA produces soft clusters of documents (based on topic overlap), rather than hard clusters
- A trained LDA model can infer topics for new documents (that were not part of the training data)
- Input
- A collection of documents
- Bag-of-words
- Good preprocessing practice:
- Remove stopwords
- Remove low and high frequency word types
- Lemmatisation
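A minimal sketch of this preprocessing pipeline using NLTK and gensim is shown below; the toy documents and the `filter_extremes` thresholds are illustrative choices, not values prescribed by the lecture.

```python
# Illustrative bag-of-words preprocessing for LDA:
# stopword removal, frequency filtering, lemmatisation.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim.corpora import Dictionary

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

stop_words = set(stopwords.words("english"))
lemmatiser = WordNetLemmatizer()

def preprocess(doc: str) -> list[str]:
    tokens = [t.lower() for t in doc.split() if t.isalpha()]
    tokens = [t for t in tokens if t not in stop_words]
    return [lemmatiser.lemmatize(t) for t in tokens]

raw_docs = ["Farmers grow rice on farms", "The company released new phones"]
tokenised = [preprocess(d) for d in raw_docs]

dictionary = Dictionary(tokenised)
# Drop very rare and very frequent word types (thresholds here are illustrative).
dictionary.filter_extremes(no_below=1, no_above=0.9)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenised]
```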
- Output
- Topics: distribution over words in each topic
- Topic assignments: distribution over topics in each document
- Learning
- How do we learn the latent topics?
- Two main families of algorithms:
- Variational methods
- Sampling-based methods
- Sampling method (Gibbs sampling):
1. Randomly assign topics to all tokens in the documents
2. Collect topic-word and document-topic co-occurrence statistics based on the assignments
   - First put small pseudo-counts in every cell of the two matrices (smoothing, so that no event has zero probability)
   - Then collect the co-occurrence counts from the topic assignments
3. Go through every word token in the corpus and sample a new topic for it:
   - Remove the word's current topic assignment and update the two matrices accordingly
   - Compute the sampling distribution $P(t_i|w,d) \propto P(t_i|w)\,P(t_i|d)$, where $P(t_i|w)$ comes from the topic-word matrix and $P(t_i|d)$ from the document-topic matrix
   - Example: $P(t_1|w,d) = P(t_1|\text{mouse}) \times P(t_1|d_1) = \frac{0.01}{0.01+0.01+2.01} \times \frac{1.1}{1.1+1.1+2.1}$
   - Sample the new topic randomly from this distribution
4. Go to step 2 and repeat until convergence
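The sketch below walks through these steps on a toy corpus: it keeps topic-word and document-topic count matrices initialised with pseudo-counts, then repeatedly removes each token's topic, computes $P(t_i|w)\,P(t_i|d)$ from the two matrices, and resamples. The corpus, the number of topics, the priors, and the fixed number of sweeps are all illustrative assumptions.

```python
# Minimal collapsed Gibbs sampler sketch for LDA (toy example).
# Follows the steps above: initialise randomly, then repeatedly remove a
# token's topic, compute P(t|w) * P(t|d), and resample.
import numpy as np

rng = np.random.default_rng(0)

docs = [[0, 1, 2, 1], [2, 3, 3, 4], [0, 4, 1, 2]]  # documents as word ids
V, T = 5, 2                # vocabulary size, number of topics (illustrative)
alpha, beta = 0.1, 0.01    # document-topic and topic-word priors

topic_word = np.full((T, V), beta)            # pseudo-counts (smoothing)
doc_topic = np.full((len(docs), T), alpha)

# Step 1: randomly assign topics to all tokens, and collect counts (step 2).
assignments = []
for d, doc in enumerate(docs):
    z = rng.integers(T, size=len(doc))
    assignments.append(z)
    for w, t in zip(doc, z):
        topic_word[t, w] += 1
        doc_topic[d, t] += 1

# Steps 3-4: sweep over every token and resample its topic.
for _ in range(50):  # a fixed number of sweeps stands in for a convergence check
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t_old = assignments[d][i]
            topic_word[t_old, w] -= 1          # remove current assignment
            doc_topic[d, t_old] -= 1
            p_t_given_w = topic_word[:, w] / topic_word[:, w].sum()
            p_t_given_d = doc_topic[d] / doc_topic[d].sum()
            p = p_t_given_w * p_t_given_d      # P(t|w,d) ∝ P(t|w) P(t|d)
            t_new = rng.choice(T, p=p / p.sum())
            assignments[d][i] = t_new
            topic_word[t_new, w] += 1
            doc_topic[d, t_new] += 1
```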
- When to stop
- Train until convergence
- Convergence = model probability of training set becomes stable
- How to compute model probability?
- $\log P(w_1, w_2, \dots, w_m) = \log \sum_{j=0}^{T} P(w_1|t_j)P(t_j|d_{w_1}) + \dots + \log \sum_{j=0}^{T} P(w_m|t_j)P(t_j|d_{w_m})$
- $m$ = #word tokens
- $P(w_1|t_j)$: based on the topic-word co-occurrence matrix
- $P(t_j|d_{w_1})$: based on the document-topic co-occurrence matrix
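A small sketch of this computation, assuming count matrices shaped like those in the sampler above (topic-word of size $T \times V$, document-topic of size $D \times T$); the function and variable names are made up for illustration.

```python
# Illustrative convergence check: log probability of the training tokens,
# computed from the normalised topic-word and document-topic count matrices.
import numpy as np

def log_probability(docs, topic_word, doc_topic):
    """docs: list of word-id lists; topic_word: T x V counts; doc_topic: D x T counts."""
    p_w_given_t = topic_word / topic_word.sum(axis=1, keepdims=True)   # P(w|t)
    p_t_given_d = doc_topic / doc_topic.sum(axis=1, keepdims=True)     # P(t|d)
    log_p = 0.0
    for d, doc in enumerate(docs):
        for w in doc:
            # log sum_j P(w|t_j) P(t_j|d_w)
            log_p += np.log(np.dot(p_w_given_t[:, w], p_t_given_d[d]))
    return log_p

# Train until this value stops changing much between Gibbs sweeps.
```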
- Inferring topics for new documents:
1. Randomly assign topics to all tokens in the new/test documents
2. Update the document-topic matrix based on the assignments, but use the trained topic-word matrix (kept fixed)
3. Go through every word in the test documents and sample topics: $P(t_i|w,d) \propto P(t_i|w)\,P(t_i|d)$
4. Go to step 2 and repeat until convergence
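A sketch of this held-out inference under the same toy setup: the trained topic-word counts are frozen and only a fresh document-topic count vector is updated while sampling. The function signature and default values are illustrative assumptions.

```python
# Illustrative inference for a new document: the trained topic_word matrix is
# kept fixed; only the new document's topic counts are updated while sampling.
import numpy as np

def infer_topics(new_doc, topic_word, alpha=0.1, sweeps=50, seed=0):
    """new_doc: list of word ids; topic_word: trained T x V count matrix (kept fixed)."""
    rng = np.random.default_rng(seed)
    T = topic_word.shape[0]
    doc_topic = np.full(T, alpha)                   # fresh document-topic counts
    z = rng.integers(T, size=len(new_doc))          # step 1: random assignments
    for t in z:
        doc_topic[t] += 1
    for _ in range(sweeps):                         # steps 3-4
        for i, w in enumerate(new_doc):
            doc_topic[z[i]] -= 1                    # remove current assignment
            p = (topic_word[:, w] / topic_word[:, w].sum()) * (doc_topic / doc_topic.sum())
            z[i] = rng.choice(T, p=p / p.sum())
            doc_topic[z[i]] += 1
    return doc_topic / doc_topic.sum()              # topic distribution for the document
```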
- Hyper-parameters
- $T$: number of topics
- $\beta$: prior on the topic-word distribution
- $\alpha$: prior on the document-topic distribution
- The priors are analogous to $k$ in add-$k$ smoothing for an N-gram LM
- They act as pseudo-counts that initialise the co-occurrence matrices:
- High prior values → flatter distribution (a very large value leads to a uniform distribution)
- Low prior values → peaky distribution
- $\beta$ is generally small (< 0.01): the vocabulary is large, but we want each topic to focus on specific themes
- $\alpha$ is generally larger (> 0.1): a document usually contains multiple topics
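In practice these hyper-parameters map directly onto the arguments of an off-the-shelf implementation. The snippet below is an illustrative gensim call: `num_topics`, `alpha` and `eta` correspond to $T$, $\alpha$ and $\beta$ (gensim names the topic-word prior `eta`); the toy documents and the specific values are arbitrary.

```python
# Illustrative LDA training with gensim; num_topics, alpha and eta are the
# hyper-parameters T, alpha and beta discussed above (values are arbitrary).
from gensim.corpora import Dictionary
from gensim.models import LdaModel

tokenised = [
    ["farmer", "farm", "rice", "agriculture"],
    ["phone", "company", "software", "release"],
    ["farm", "food", "rice", "agriculture"],
]
dictionary = Dictionary(tokenised)
corpus = [dictionary.doc2bow(doc) for doc in tokenised]

lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,       # T
    alpha=0.1,          # document-topic prior
    eta=0.01,           # topic-word prior (beta)
    passes=10,
    random_state=0,
)

print(lda.show_topics(num_words=4))          # topics: distributions over words
print(lda.get_document_topics(corpus[0]))    # topic assignment for a document
```

Note that gensim's `LdaModel` trains with variational inference, one of the two algorithm families mentioned earlier, rather than Gibbs sampling.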
Evaluation
- How to evaluate topic models?
- Unsupervised learning → no labels
- Intrinsic evaluation:
- Model log probability / perplexity on the test documents
- $\log L = \sum_{W}\sum_{T} \log P(w|t)\,P(t|d_w)$
- $ppl = \exp\left(\frac{-\log L}{W}\right)$
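A tiny sketch of turning a test-set log likelihood into perplexity with the formula above; both numbers are placeholders.

```python
# Perplexity from the test-set log likelihood: ppl = exp(-logL / W),
# where W is the number of word tokens (values below are placeholders).
import math

log_likelihood = -15230.7   # logL over all test tokens (placeholder value)
num_tokens = 2500           # W (placeholder value)

perplexity = math.exp(-log_likelihood / num_tokens)
print(round(perplexity, 2))
```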
- Issues with perplexity
- More topics = better (lower) perplexity
- Smaller vocabulary = better perplexity
- Perplexity not comparable for different corpora, or different tokenisation/preprocessing methods
- Does not correlate with human perception of topic quality
- Extrinsic evaluation is the way to go:
- Evaluate topic models based on downstream task
- Topic coherence
- A better intrinsic evaluation method
- Measures how coherent the generated topics are
- A good topic model is one that generates more coherent topics
- Word intrusion
- Idea: inject one random word into a topic
- {farmers, farm, food, rice, agriculture} → {farmers, farm, food, rice, cat, agriculture}
- Ask users to guess which is the intruder word
- Correct guess → topic is coherent
- Try to guess the intruder word in:
- {choice, count, village, i.e., simply, unionist}
- Requires manual effort; does not scale
- PMI ≈ coherence?
- High PMI for a pair of words → the words are correlated
- PMI(farm, rice) ↑
- PMI(choice, village) ↓
- If all word pairs in a topic have high PMI → the topic is coherent
- If most topics have high PMI → good topic model
- Where to get word co-occurrence statistics for PMI?
- Can use the same corpus used to train the topic model
- A better way is to use an external corpus (e.g. Wikipedia)
- PMI
- Compute pairwise PMI of the top-N words in a topic:
- $PMI(t) = \sum_{j=2}^{N} \sum_{i=1}^{j-1} \log \frac{P(w_i, w_j)}{P(w_i)P(w_j)}$
- Given the topic {farmers, farm, food, rice, agriculture}
- Coherence = sum of PMI over all word pairs:
- PMI(farmers, farm) + PMI(farmers, food) + … + PMI(rice, agriculture)
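A minimal sketch of this coherence score, estimating the probabilities from document-level co-occurrences in a reference corpus; the toy reference corpus and the smoothing epsilon are assumptions for illustration.

```python
# Illustrative topic coherence: sum of pairwise PMI over the top-N topic words,
# with document-level co-occurrence counts from a reference corpus.
import math
from itertools import combinations

reference_docs = [
    {"farmers", "farm", "rice", "agriculture"},
    {"farm", "food", "rice"},
    {"choice", "village", "unionist"},
]  # each document as a set of word types (toy reference corpus)

def pmi_coherence(topic_words, docs, eps=1e-12):
    n = len(docs)
    def p(*words):  # probability that all given words co-occur in a document
        return sum(all(w in d for w in words) for d in docs) / n
    score = 0.0
    for w1, w2 in combinations(topic_words, 2):
        score += math.log((p(w1, w2) + eps) / ((p(w1) + eps) * (p(w2) + eps)))
    return score

print(pmi_coherence(["farmers", "farm", "food", "rice", "agriculture"], reference_docs))
```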
- Variants
- Normalised PMI (NPMI):
- $NPMI(t) = \sum_{j=2}^{N} \sum_{i=1}^{j-1} \frac{\log \frac{P(w_i, w_j)}{P(w_i)P(w_j)}}{-\log P(w_i, w_j)}$
- Log conditional probability (shown to work less well than PMI):
- $LCP(t) = \sum_{j=2}^{N} \sum_{i=1}^{j-1} \log \frac{P(w_i, w_j)}{P(w_i)}$
- Note: PMI tends to favour rarer words; NPMI alleviates this problem
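The same idea extends to NPMI by dividing each pair's PMI by $-\log P(w_i, w_j)$; the helper below uses the same document-set representation as the PMI sketch above, and its names are illustrative.

```python
# Illustrative NPMI coherence: each pair's PMI normalised by -log P(wi, wj),
# which damps the preference for very rare words.
import math
from itertools import combinations

def npmi_coherence(topic_words, docs, eps=1e-12):
    n = len(docs)
    def p(*words):  # probability that all given words co-occur in a document
        return sum(all(w in d for w in words) for d in docs) / n
    score = 0.0
    for w1, w2 in combinations(topic_words, 2):
        joint = p(w1, w2) + eps
        pmi = math.log(joint / ((p(w1) + eps) * (p(w2) + eps)))
        score += pmi / -math.log(joint)
    return score
```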
Conclusion
- Topic model: an unsupervised model for learning latent concepts in a document collection
- LDA: a popular topic model
- Learning
- Hyper-parameters
- How to evaluate topic models?
- Topic coherence