
A Spam SMS Detection System Based on Natural Language Processing

news 2025/7/2 11:37:13


🌟 Hi, I'm LucianaiB!


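Design Topic
Design Objectives
Task Description
Design Requirements
Input and Output Requirements
- 5.1 Input Requirements
- 5.2 Output Requirements
Acceptance Criteria
Schedule
System Analysis
Overall Design
Detailed Design
- 10.1 Data Preprocessing Module
- 10.2 Feature Extraction Module
- 10.3 Model Building Module
- 10.4 Performance Evaluation Module
Data Structure Design
Function List and Descriptions
Implementation
- 13.1 Data Preprocessing
- 13.2 Feature Extraction
- 13.3 Model Training
- 13.4 Performance Evaluation
- 13.5 Word Cloud Generation
Test Data and Results
Summary and Reflections
References
Appendix: Code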

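I. Design Topic

A Spam SMS Detection System Based on Natural Language Processing

II. Design Objectives

This project uses natural language processing (NLP) techniques to build an efficient spam SMS detection system. It combines word segmentation, stop-word removal, sentiment analysis, and machine learning models to classify and identify spam messages automatically, improving the accuracy and efficiency of SMS filtering.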

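III. Task Description

Apply Chinese word segmentation to the SMS text data, remove stop words, and refine the results with a custom dictionary.
Preprocess the data with text-mining techniques, including data cleaning, missing-value handling, and outlier detection.
Build a TF-IDF matrix to extract text features.
Classify spam messages with machine learning models such as Naive Bayes and SVM.
Evaluate model performance and plot learning curves, confusion matrices, and ROC curves.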

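IV. Design Requirements

Data preprocessing: word segmentation, stop-word removal, data cleaning.
Feature extraction: TF-IDF matrix.
Model building: Naive Bayes, SVM.
Performance evaluation: accuracy, recall, F1 score, ROC curve.
Visualization: word cloud, learning curve, confusion matrix, ROC curve.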

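V. Input and Output Requirements

Input Requirements

SMS text data set (CSV format).
Stop-word list (TXT format).

Output Requirements

Word segmentation results and part-of-speech tagging results.
TF-IDF matrix.
Word cloud.
Model performance report (accuracy, recall, F1 score).
Confusion matrix and ROC curve.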

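VI. Acceptance Criteria

The system reads the SMS data correctly and completes word segmentation and stop-word removal.
The TF-IDF matrix is generated correctly.
The word cloud clearly shows the high-frequency words.
The Naive Bayes and SVM models reach the target metric (accuracy ≥ 85%).
Complete test data and run results are provided.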

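VII. Schedule

Phase	Time	Tasks
Requirements analysis	Week 1	Define the project requirements and design the project framework
Data preprocessing	Week 2	Complete word segmentation, stop-word removal, and data cleaning
Feature extraction	Week 3	Build the TF-IDF matrix and generate the word cloud
Model building	Week 4	Implement the Naive Bayes and SVM models
Performance evaluation	Week 5	Evaluate the models and plot learning curves, confusion matrices, and ROC curves
Documentation	Week 6	Write the project report and organize the code and documents
Project wrap-up	Week 7	Summarize lessons learned and prepare the demo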

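VIII. System Analysis

Functional requirements:
- Data preprocessing: word segmentation, stop-word removal, data cleaning.
- Feature extraction: TF-IDF matrix.
- Model building: Naive Bayes, SVM.
- Performance evaluation: accuracy, recall, F1 score, ROC curve.
- Visualization: word cloud, learning curve, confusion matrix, ROC curve.
Technology choices:
- Programming language: Python.
- Tokenization tools: jieba, NLTK.
- Machine learning framework: scikit-learn.
- Visualization tools: Matplotlib, pyecharts.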

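IX. Overall Design

The system architecture consists of five modules: data preprocessing, feature extraction, model building, performance evaluation, and visualization.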

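X. Detailed Design

1. Data Preprocessing Module

Word segmentation: segment Chinese text with jieba.
Stop-word removal: load the stop-word list and filter out stop words.
Data cleaning: remove punctuation, digits, and special characters.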
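
The cleaning and stop-word steps above can be sketched without any third-party library. The snippet below is a minimal illustration, not the project's code: `clean_text` and `remove_stop_words` are hypothetical helper names, the regular expression (an assumption) keeps only CJK ideographs and ASCII letters, and the hard-coded token list stands in for jieba output.

```python
import re

# Assumption: keep only CJK unified ideographs and ASCII letters; everything
# else (punctuation, digits, whitespace, symbols) is treated as noise.
_NOISE = re.compile(r"[^\u4e00-\u9fa5A-Za-z]+")

def clean_text(text):
    """Remove punctuation, digits, and special characters from a raw message."""
    return _NOISE.sub("", text)

def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop-word set."""
    return [t for t in tokens if t not in stop_words]

cleaned = clean_text("恭喜您！中奖10000元，回复TD退订。")
print(cleaned)

# Hypothetical segmentation result; in the real pipeline this comes from jieba.
tokens = ["恭喜", "您", "中奖", "元", "回复", "TD", "退订"]
print(remove_stop_words(tokens, {"您", "元"}))
```

Cleaning before segmentation keeps digits and punctuation out of the vocabulary; whether to keep Latin letters (for example "TD") is a design choice that depends on the data set.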

2. Feature Extraction Module

Build the TF-IDF matrix with scikit-learn's TfidfVectorizer.
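
To make the TF-IDF weighting concrete, here is a small pure-Python sketch of the smoothed variant that scikit-learn documents as the default for `TfidfVectorizer`: idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. The function name `tfidf` and the three toy documents are assumptions made for illustration; the project itself relies on `TfidfVectorizer`.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns (vocabulary, L2-normalized weight matrix)."""
    n = len(docs)
    vocab = sorted({tok for doc in docs for tok in doc})
    # Document frequency: number of documents that contain each term
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    matrix = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts in this document
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        matrix.append([w / norm for w in row])
    return vocab, matrix

# Toy corpus (invented): each document is already segmented into words
docs = [["免费", "领取", "红包"], ["明天", "开会"], ["免费", "红包", "红包"]]
vocab, matrix = tfidf(docs)
for doc, row in zip(docs, matrix):
    print(doc, [round(w, 3) for w in row])
```

In the first document, "领取" occurs in only one document and therefore receives a higher weight than "免费", which occurs in two.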

3. Model Building Module

Naive Bayes model: GaussianNB.
SVM model: SVC.
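
For intuition about what a Naive Bayes text classifier computes, the sketch below implements a multinomial Naive Bayes with Laplace smoothing over word counts in plain Python. Note that this is a different variant from the `GaussianNB` chosen in this design; the class name `TinyMultinomialNB` and the four training messages are invented for illustration.

```python
import math
from collections import Counter, defaultdict

class TinyMultinomialNB:
    """Multinomial Naive Bayes with Laplace (add-alpha) smoothing."""

    def fit(self, docs, labels, alpha=1.0):
        self.alpha = alpha
        self.n = len(labels)
        self.vocab = {tok for doc in docs for tok in doc}
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        for doc, y in zip(docs, labels):
            self.word_counts[y].update(doc)
        return self

    def log_posterior(self, doc, y):
        total = sum(self.word_counts[y].values())
        v = len(self.vocab)
        score = math.log(self.class_counts[y] / self.n)  # log prior
        for tok in doc:
            if tok in self.vocab:  # words never seen in training are ignored
                p = (self.word_counts[y][tok] + self.alpha) / (total + self.alpha * v)
                score += math.log(p)
        return score

    def predict(self, doc):
        return max(self.class_counts, key=lambda y: self.log_posterior(doc, y))

# Invented training data: 1 = spam, 0 = normal
docs = [["免费", "红包", "领取"], ["中奖", "免费", "点击"],
        ["明天", "开会"], ["晚上", "吃饭", "明天"]]
labels = [1, 1, 0, 0]
clf = TinyMultinomialNB().fit(docs, labels)
print(clf.predict(["免费", "中奖"]))
print(clf.predict(["明天", "吃饭"]))
```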

4. Performance Evaluation Module

Evaluation metrics: accuracy, recall, F1 score.
Visualization: learning curve, confusion matrix, ROC curve.
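
The metrics listed above all derive from the four cells of a binary confusion matrix. The following sketch computes them by hand, assuming labels are encoded as 1 for spam and 0 for normal messages; the helper names and the eight sample labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred, positive=1):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Invented labels: one spam message is missed and one normal message is flagged
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
print(confusion_counts(y_true, y_pred))
print(classification_metrics(y_true, y_pred))
```

For spam filtering, precision matters as much as recall, because a normal message wrongly flagged as spam is usually costlier than a spam message that slips through.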

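XI. Data Structure Design

Input data structure: a CSV file containing the SMS text and its label.
Output data structure: TF-IDF matrix, model performance report, and visualization charts.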
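
A minimal sketch of loading and validating the input, assuming the CSV has a `text` column and a `label` column. The column names match the code in this article; the 1 = spam, 0 = normal encoding is an assumption. The function drops rows with missing text, invalid labels, and exact duplicates. `load_messages` is a hypothetical helper, and the inline sample stands in for the real file.

```python
import csv
import io

def load_messages(fileobj):
    """Read (text, label) rows, skipping missing values and exact duplicates."""
    seen = set()
    texts, labels = [], []
    for row in csv.DictReader(fileobj):
        text = (row.get("text") or "").strip()
        label = (row.get("label") or "").strip()
        if not text or label not in {"0", "1"}:
            continue  # missing text or invalid label
        if text in seen:
            continue  # exact duplicate of an earlier message
        seen.add(text)
        texts.append(text)
        labels.append(int(label))
    return texts, labels

# Inline sample in place of the real CSV: one empty text and one duplicate row
sample = io.StringIO(
    "text,label\n"
    "恭喜中奖请点击链接,1\n"
    "明天下午三点开会,0\n"
    ",1\n"
    "恭喜中奖请点击链接,1\n"
)
texts, labels = load_messages(sample)
print(texts, labels)
```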

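XII. Function List and Descriptions

preprocess_text(text): segments the text and removes stop words.
generate_tfidf_matrix(corpus): generates the TF-IDF matrix.
train_naive_bayes(x_train, y_train): trains the Naive Bayes model.
train_svm(x_train, y_train): trains the SVM model.
evaluate_model(model, x_test, y_test): evaluates model performance.
plot_confusion_matrix(model, x_test, y_test): plots the confusion matrix.
plot_roc_curve(model, x_test, y_test): plots the ROC curve.
generate_wordcloud(text): generates the word cloud.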

XIII. Implementation

1. Data Preprocessing

import jieba
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Read the data
data = pd.read_csv("spam_data.csv")
texts = data['text'].tolist()

# Load the stop-word list once instead of re-reading the file for every message
with open("stopwords.txt", encoding="utf-8") as f:
    stop_words = set(f.read().split())

# Segment the text and remove stop words
def preprocess_text(text):
    words = jieba.cut(text)
    return " ".join(word for word in words if word.strip() and word not in stop_words)

processed_texts = [preprocess_text(text) for text in texts]

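2. Feature Extraction

# Note: the default token_pattern of TfidfVectorizer only keeps tokens of two or more characters,
# so single-character words are dropped from the vocabulary.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(processed_texts)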

3. Model Training

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

x_train, x_test, y_train, y_test = train_test_split(tfidf_matrix, data['label'], test_size=0.25)

# Naive Bayes model (GaussianNB needs a dense array, hence .toarray())
nb_model = GaussianNB()
nb_model.fit(x_train.toarray(), y_train)

# SVM model
svm_model = SVC(kernel="rbf")
svm_model.fit(x_train.toarray(), y_train)

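4. Performance Evaluation

import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, f1_score, recall_score, precision_score,
                             ConfusionMatrixDisplay, RocCurveDisplay)

def evaluate_model(model, x_test, y_test):
    x_dense = x_test.toarray()
    y_pred = model.predict(x_dense)
    # f1/recall/precision use the default pos_label=1, so the labels are
    # assumed to be encoded as 1 (spam) and 0 (normal)
    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    print(f"Accuracy: {acc}, F1: {f1}, Recall: {recall}, Precision: {precision}")
    # plot_confusion_matrix and plot_roc_curve were removed in scikit-learn 1.2;
    # the Display classes below replace them
    ConfusionMatrixDisplay.from_estimator(model, x_dense, y_test)
    RocCurveDisplay.from_estimator(model, x_dense, y_test)
    plt.show()

evaluate_model(nb_model, x_test, y_test)
evaluate_model(svm_model, x_test, y_test)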
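
The ROC curve plots the true-positive rate against the false-positive rate as the decision threshold moves from high to low. Below is a pure-Python sketch, assuming that a higher score means "more likely spam" and that both classes are present. `roc_points` and `auc` are hypothetical helper names and the scores are made-up numbers. Samples with tied scores are grouped so that they produce a single point.

```python
def roc_points(y_true, scores):
    """Return the (FPR, TPR) points of the ROC curve, starting at (0, 0)."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    pairs = sorted(zip(scores, y_true), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume every sample that shares this score before emitting a point
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Invented example: 1 = spam, 0 = normal
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
print(points)
print(round(auc(points), 4))
```

The area equals the fraction of (spam, normal) pairs in which the spam message gets the higher score; here 8 of the 9 pairs are ordered correctly.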

5. Word Cloud Generation

from wordcloud import WordCloud
import matplotlib.pyplot as plt

def generate_wordcloud(text):
    # font_path must point to a font that contains Chinese glyphs (msyh.ttc is Microsoft YaHei)
    wordcloud = WordCloud(font_path="msyh.ttc", background_color="white").generate(text)
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()

generate_wordcloud(" ".join(processed_texts))
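
A word cloud sizes each word by how often it occurs. The sketch below counts those frequencies with `collections.Counter` on already-segmented text, skipping single-character tokens as the appendix scripts do. `top_words` is a hypothetical helper and the three sample strings are invented.

```python
from collections import Counter

def top_words(processed_texts, k=3):
    """Count space-separated tokens and return the k most frequent ones."""
    counts = Counter(
        word
        for text in processed_texts
        for word in text.split()
        if len(word) > 1  # skip single-character tokens
    )
    return counts.most_common(k)

# Invented, already-segmented messages
sample = ["免费 领取 红包", "红包 免费 送", "明天 开会"]
print(top_words(sample, k=2))
```

Inspecting this list before drawing the cloud is a quick way to spot stop words that still need to be added to the stop-word list.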

XIV. Test Data and Results

Test Data

A public spam SMS data set containing 1,000 messages: 500 spam messages and 500 normal messages.
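
With a balanced data set of 500 spam and 500 normal messages and the 25% test split used in the code, a stratified split keeps the class ratio identical in both parts (375 per class for training, 125 per class for testing). The sketch below is a minimal pure-Python version; `stratified_split` is a hypothetical helper and the seed is arbitrary. scikit-learn's `train_test_split` offers the same idea through its `stratify` argument.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.25, seed=42):
    """Return (train_indices, test_indices) with the same class ratio in both."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# 500 spam (1) and 500 normal (0) messages, as in the test data described above
labels = [1] * 500 + [0] * 500
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx))  # number of spam messages in the test set
```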

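Results

Word cloud: shows the high-frequency words.
Model performance:
- Naive Bayes: accuracy 88%, recall 85%, F1 score 86%.
- SVM: accuracy 92%, recall 90%, F1 score 91%.
Confusion matrix and ROC curve: see the screenshots of the run results.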

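XV. Summary and Reflections

In this project we implemented a spam SMS detection system based on natural language processing. Along the way we learned word segmentation, TF-IDF feature extraction, and how to build and evaluate Naive Bayes and SVM models. In future work, more advanced models such as deep learning models could be tried to improve performance further.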

XVI. References

NLTK official documentation
scikit-learn official documentation
jieba Chinese word segmentation
Python Data Science Handbook

XVII. Appendix: Code

1.1 Tokenization, stop-word removal, word-frequency counting, sentiment analysis, and text classification with NLTK

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy

# The tokenizer model, stop-word list, and VADER lexicon must be downloaded once, e.g.
# nltk.download('punkt'); nltk.download('stopwords'); nltk.download('vader_lexicon')

# Tokenization
text = "Natural language processing is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language."
tokens = word_tokenize(text)
print(tokens)

# Stop-word removal
stop_words = set(stopwords.words('english'))
tokens_filtered = [word for word in tokens if word.lower() not in stop_words]
print(tokens_filtered)

# Word-frequency counting
freq_dist = nltk.FreqDist(tokens_filtered)
print(freq_dist.most_common(5))

# Sentiment analysis
sia = SentimentIntensityAnalyzer()
sentiment_score = sia.polarity_scores(text)
print(sentiment_score)

# Text classification
pos_tweets = [('I love this car', 'positive'), ('This view is amazing', 'positive'), ('I feel great this morning', 'positive'), ('I am so happy today', 'positive'), ('He is my best friend', 'positive')]
neg_tweets = [('I do not like this car', 'negative'), ('This view is horrible', 'negative'), ('I feel tired this morning', 'negative'), ('I am so sad today', 'negative'), ('He is my worst enemy', 'negative')]

# Feature extraction function
def word_feats(words):
    return dict([(word, True) for word in words])

# Build the data set
pos_features = [(word_feats(word_tokenize(tweet)), sentiment) for (tweet, sentiment) in pos_tweets]
neg_features = [(word_feats(word_tokenize(tweet)), sentiment) for (tweet, sentiment) in neg_tweets]
train_set = pos_features + neg_features

# Train the classifier
classifier = NaiveBayesClassifier.train(train_set)

# Classify a new sentence
test_tweet = 'I love this view'
test_feature = word_feats(word_tokenize(test_tweet))
print(classifier.classify(test_feature))

# Classifier accuracy (these samples come from the training set, so this is not a held-out estimate)
test_set = pos_features[:2] + neg_features[:2]
print('Accuracy:', accuracy(classifier, test_set))

1.2 Segmentation results, part-of-speech tagging results, and TF-IDF matrix

# Import the required libraries
import re
import jieba
import jieba.posseg as pseg
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

with open("C:\\Users\\lx\\Desktop\\南词.txt", "r", encoding="utf-8") as file:
    text = file.read()

# 1. Word segmentation in accurate mode
seg_list = jieba.cut(text, cut_all=False)

# 2. Stop-word removal
stop_words = ["的", "了", "和", "是", "在", "有", "也", "与", "对", "中", "等"]
filtered_words = [word for word in seg_list if word not in stop_words]

# 3. Normalization
# Optional: remove punctuation, digits, and special symbols
# filtered_words = [re.sub(r'[^\u4e00-\u9fa5]', '', word) for word in filtered_words]
# Drop empty and whitespace-only tokens
filtered_words = [word for word in filtered_words if word.strip()]

# 4. Part-of-speech tagging with jieba.posseg
words = pseg.cut("".join(filtered_words))

# 5. Build the term-document matrix (TF-IDF)
corpus = [" ".join(filtered_words)]  # wrap the processed text in a list
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Print the results
print("Segmentation result:", "/".join(filtered_words))
print("POS tagging result:", [(word, flag) for word, flag in words])
print("TF-IDF matrix:", X.toarray())

# Convert the TF-IDF matrix to a DataFrame
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
# Reshape the DataFrame so that each row holds one word and its weight
df_melted = df.melt(var_name='word', value_name='weight')
# Export the DataFrame to an Excel file
df_melted.to_excel("C:\\Users\\lx\\Desktop\\2024.xlsx", index=False)

1.3 Dynamic word cloud (pyecharts) from a given document with given excluded words

import jieba
from pyecharts import options as opts
from pyecharts.charts import WordCloud

# Path of the source text
text_road = 'C:\\Users\\lx\\Desktop\\南方词.txt'
# Read the article
text = open(text_road, 'r', encoding='utf-8').read()
# Words to exclude from the word cloud
excludes = {"我们", "什么", '一个', '那里', '一天', '一列', '一定', '上千', '一年', '她们', '数千', '低于', '这些'}
# Segment the text in accurate mode
words = jieba.lcut(text)
# Store each word and its number of occurrences as key-value pairs
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    counts[word] = counts.get(word, 0) + 1  # add 1 for every occurrence
for word in excludes:
    counts.pop(word, None)  # pop avoids a KeyError when an excluded word never occurs
items = list(counts.items())  # convert the key-value pairs to a list
items.sort(key=lambda x: x[1], reverse=True)  # sort by count in descending order
# print(items)  # print the list

# Draw the dynamic word cloud
(
    WordCloud()
    # word_size_range sets the font-size range
    .add(series_name="南方献词", data_pair=items, word_size_range=[6, 66])
    # Set the chart title
    .set_global_opts(
        title_opts=opts.TitleOpts(title="南方献词", title_textstyle_opts=opts.TextStyleOpts(font_size=23)),
        tooltip_opts=opts.TooltipOpts(is_show=True),
    )
    # Render the word cloud inside a notebook
    .render_notebook()
)

1.4 Word cloud from a given document with a given stop-word list

import jieba
from wordcloud import WordCloud
from matplotlib import pyplot as plt
from imageio import imread

# Read the text data
text = open('work/中文词云图.txt', 'r', encoding='utf-8').read()
# Read the stop words and build the stop-word list
stopwords = [line.strip() for line in open('work/停用词.txt', encoding='UTF-8').readlines()]
# Segment the article
words = jieba.cut(text, cut_all=False, HMM=True)
# Clean the text: drop stop words, spaces, and single-character tokens
mytext_list = []
for seg in words:
    if seg not in stopwords and seg != " " and len(seg) != 1:
        mytext_list.append(seg.replace(" ", ""))
cloud_text = ",".join(mytext_list)
# Read the background image (raw string, so the backslashes are not treated as escapes)
jpg = imread(r"C:\Users\lx\Desktop\大学\指定文档和指定停用词.jpeg")
# Create the word cloud object
wordcloud = WordCloud(
    mask=jpg,  # background image
    background_color="white",  # background color
    font_path='work/MSYH.TTC',  # font
    width=1500,  # width
    height=960,  # height
    margin=10,
).generate(cloud_text)
# Draw the image
plt.imshow(wordcloud)
# Hide the axes
plt.axis("off")
# Show the image
plt.show()

2.1 Naive Bayes model

import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay
from sklearn.model_selection import train_test_split, cross_val_score, learning_curve, validation_curve

plt.rcParams['font.sans-serif'] = ['SimHei']  # render Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly
pd.set_option('display.max_columns', None)  # show all columns
pd.set_option('display.max_rows', None)  # show all rows
warnings.filterwarnings('ignore')

data = pd.read_csv(r"D:\card_transdata.csv")  # read the data
x = data.drop(columns=['fraud'], inplace=False)
y = data['fraud']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)  # random train/test split

model = GaussianNB()
model.fit(x_train, y_train)  # fit() takes the training features and targets
y_pred = model.predict(x_test)  # predict() takes the features to predict on
acc = accuracy_score(y_test, y_pred)  # accuracy from the true and predicted labels
print(acc)
y_pred = pd.DataFrame(y_pred)
print(y_pred.value_counts())
print(y_test.value_counts())

# Cross-validation
score = cross_val_score(GaussianNB(), x, y, cv=5)
print("Cross-validation scores: {}".format(score))
print("Mean cross-validation score: {}".format(score.mean()))

# Validation curve over var_smoothing
var_smoothing = [2, 4, 6]
train_score, val_score = validation_curve(model, x, y, param_name='var_smoothing', param_range=var_smoothing, cv=5, scoring='accuracy')
plt.plot(var_smoothing, np.median(train_score, 1), color='blue', label='training score')
plt.plot(var_smoothing, np.median(val_score, 1), color='red', label='validation score')
plt.legend(loc='best')
# plt.ylim(0, 0.1)
plt.xlabel('var_smoothing')
plt.ylabel('score')
plt.show()

# Grid search: GaussianNB has almost nothing to tune (only var_smoothing, explored above), so no grid search is run

# Learning curve (learning_curve returns scores, accuracy by default, not losses)
train_sizes, train_scores, val_scores = learning_curve(model, x, y, cv=5, train_sizes=[0.1, 0.25, 0.3, 0.5, 0.75, 1.0])
train_scores_mean = np.mean(train_scores, axis=1)
val_scores_mean = np.mean(val_scores, axis=1)
plt.plot(train_sizes, train_scores_mean, 'o-', color='r', label='Training')
plt.plot(train_sizes, val_scores_mean, 'o-', color='g', label='Cross-validation')
plt.xlabel('Training_examples')
plt.ylabel('Score')
plt.legend(loc='best')
plt.show()

# Evaluation metrics (true labels first, predictions second)
model.fit(x_train, y_train)
y_pred1 = model.predict(x_test)
acc = accuracy_score(y_test, y_pred1)
f1 = f1_score(y_test, y_pred1)
recall = recall_score(y_test, y_pred1)
precision = precision_score(y_test, y_pred1)
print(acc)
print(f1)
print(recall)
print(precision)

# Confusion matrix
ConfusionMatrixDisplay.from_estimator(model, x_test, y_test)
plt.show()

# ROC curve
RocCurveDisplay.from_estimator(model, x_test, y_test)
plt.show()

2.2 SVM (support vector machine)

import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay
from sklearn.model_selection import train_test_split, cross_val_score, learning_curve, validation_curve, GridSearchCV

plt.rcParams['font.sans-serif'] = ['SimHei']  # render Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly
pd.set_option('display.max_columns', None)  # show all columns
pd.set_option('display.max_rows', None)  # show all rows
warnings.filterwarnings('ignore')

data = pd.read_csv(r"D:\card_transdata.csv")
x = data.drop(columns=['fraud'], inplace=False)
y = data['fraud']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

svm_model = svm.SVC(kernel="rbf", gamma="auto", cache_size=5000)
svm_model.fit(x_train, y_train)
y_pred = svm_model.predict(x_test)
acc = accuracy_score(y_test, y_pred)
print(acc)
y_pred = pd.DataFrame(y_pred)
print(y_pred.value_counts())
print(y_test.value_counts())

# Grid search (the SVC parameter name is lowercase 'kernel')
param_grid = {'kernel': ["linear", "rbf", "sigmoid"]}
grid = GridSearchCV(svm_model, param_grid)
grid.fit(x_train, y_train)
print(grid.best_params_)
# Best model found by the search
svm_model = grid.best_estimator_

# Predictions on the training and test sets
y_pred1 = svm_model.predict(x_train)
y_pred2 = svm_model.predict(x_test)
print(y_pred1)
print(y_pred2)

# Cross-validation
score = cross_val_score(svm_model, x, y, cv=5)
print("Cross-validation scores: {}".format(score))
print("Mean cross-validation score: {}".format(score.mean()))

# Validation curve over the kernel type
kernels = ["linear", "rbf", "sigmoid"]
train_score, val_score = validation_curve(svm_model, x, y, param_name='kernel', param_range=kernels, cv=5, scoring='accuracy')
plt.plot(kernels, np.median(train_score, 1), color='blue', label='training score')
plt.plot(kernels, np.median(val_score, 1), color='red', label='validation score')
plt.legend(loc='best')
plt.xlabel('kernel')
plt.ylabel('score')
plt.show()

# Learning curve
train_sizes, train_scores, val_scores = learning_curve(svm_model, x, y, cv=5, train_sizes=[0.1, 0.25, 0.3, 0.5, 0.75, 1.0])
train_scores_mean = np.mean(train_scores, axis=1)
val_scores_mean = np.mean(val_scores, axis=1)
plt.plot(train_sizes, train_scores_mean, 'o-', color='r', label='Training')
plt.plot(train_sizes, val_scores_mean, 'o-', color='g', label='Cross-validation')
plt.xlabel('Training_examples')
plt.ylabel('Score')
plt.legend(loc='best')
plt.show()

# Evaluation metrics
y_pred1 = svm_model.predict(x_test)
acc = accuracy_score(y_test, y_pred1)
f1 = f1_score(y_test, y_pred1)
recall = recall_score(y_test, y_pred1)
precision = precision_score(y_test, y_pred1)
print(acc)
print(f1)
print(recall)
print(precision)

# Confusion matrix
ConfusionMatrixDisplay.from_estimator(svm_model, x_test, y_test)
plt.show()

# ROC curve
RocCurveDisplay.from_estimator(svm_model, x_test, y_test)
plt.show()

2.3 Grid search

# Grid search
param_grid = {'kernel': ["linear", "rbf", "sigmoid"]}
grid = GridSearchCV(svm_model, param_grid)
grid.fit(x_train, y_train)
print(grid.best_params_)

GaussianNB has almost no hyperparameters to tune (only var_smoothing), so no grid search is run for Naive Bayes.

2.4 Learning curve

# Learning curve
train_sizes, train_scores, val_scores = learning_curve(model, x, y, cv=5, train_sizes=[0.1, 0.25, 0.3, 0.5, 0.75, 1.0])
train_scores_mean = np.mean(train_scores, axis=1)
val_scores_mean = np.mean(val_scores, axis=1)
plt.plot(train_sizes, train_scores_mean, 'o-', color='r', label='Training')
plt.plot(train_sizes, val_scores_mean, 'o-', color='g', label='Cross-validation')
plt.xlabel('Training_examples')
plt.ylabel('Score')
plt.legend(loc='best')
plt.show()

2.5 Evaluation metrics: acc, f1, recall, precision

# Evaluation metrics
model.fit(x_train, y_train)
y_pred1 = model.predict(x_test)
acc = accuracy_score(y_test, y_pred1)
f1 = f1_score(y_test, y_pred1)
recall = recall_score(y_test, y_pred1)
precision = precision_score(y_test, y_pred1)
print(acc)
print(f1)
print(recall)
print(precision)

2.6 Confusion matrix

ConfusionMatrixDisplay.from_estimator(model, x_test, y_test)
plt.show()

2.7 ROC curve

RocCurveDisplay.from_estimator(model, x_test, y_test)
plt.show()


