当前位置：首页 > news >正文

大模型安全相关论文

news 2026/2/9 9:58:36

LLM对于安全的优势

“Generating secure hardware using chatgpt resistant to cwes,” Cryptology ePrint Archive, Paper 2023/212, 2023评估了ChatGPT平台上代码生成过程的安全性，特别是在硬件领域。探索了设计者可以采用的策略，使ChatGPT能够提供安全的硬件代码生成

“Fixing hardware security bugs with large language models,” arXiv preprint arXiv:2302.01215, 2023. 将关注点转移到硬件安全上。研究了LLMs，特别是OpenAI的Codex，在自动识别和修复硬件设计中与安全相关的bug方面的使用。

“Novel approach to cryptography implementation using chatgpt,” 使用ChatGPT实现密码学，最终保护数据机密性。尽管缺乏广泛的编码技巧或编程知识，但作者能够通过ChatGPT成功地实现密码算法。这凸显了个体利用ChatGPT进行密码学任务的潜力。

“Agentsca: Advanced physical side channel analysis agent with llms.” 2023.探索了应用LLM技术来开发侧信道分析方法。该研究包括3种不同的方法：提示工程、微调LLM和基于人类反馈强化学习的微调LLM

LLM的隐私保护

通过最先进的隐私增强技术(例如,零知识证明 ,差分隐私[ 233,175,159 ]和联邦学习[ 140,117,77 ] )来增强LLM

“Privacy and data protection in chatgpt and other ai chatbots: Strategies for securing user information,”
“Differentially private decoding in large language models,”
“Privacy-preserving prompt tuning for large language model services,”
“Federatedscope-llm: A comprehensive package for fine-tuning large language models in federated learning,”
“Chatgpt passing usmle shines a spotlight on the flaws of medical education,”
“Fate-llm: A industrial grade federated learning framework for large language models,”

对LLM的攻击

侧信道攻击

“Privacy side channels in machine learning systems,”引入了隐私侧信道攻击，这是一种利用系统级组件(例如,数据过滤、输出监控等)以远高于单机模型所能实现的速度提取隐私信息的攻击。提出了覆盖整个ML生命周期的4类侧信道，实现了增强型成员推断攻击和新型威胁(例如,提取用户的测试查询)

数据中毒攻击

“Universal jailbreak backdoors from poisoned human feedback,”
“On the exploitability of instruction tuning,”
“Promptspecific poisoning attacks on text-to-image generative models,”
“Poisoning language models during instruction tuning,”

后门攻击

“Chatgpt as an attack tool: Stealthy textual backdoor attack via blackbox generative model trigger,”
“Large language models are better adversaries: Exploring generative clean-label backdoor attacks against text classifiers,”
“Poisonprompt: Backdoor attack on prompt-based large language models,”

属性推断攻击

“Beyond memorization: Violating privacy via inference with large language models,”首次全面考察了预训练的LLMs从文本中推断个人信息的能力

提取训练数据

“Ethicist: Targeted training data extraction through loss smoothed soft prompting and calibrated confidence estimation,”
“Canary extraction in natural language understanding models,”
“What do code models memorize? an empirical study on large language models of code,”
“Are large pre-trained language models leaking your personal information?”
“Text revealer: Private text reconstruction via model inversion attacks against transformers,”

提取模型

“Data-free model extraction,”

对LLM的防御

模型架构防御

“Large language models can be strong differentially private learners,”具有较大参数规模的语言模型可以更有效地以差分隐私的方式进行训练。
“Promptbench: Towards evaluating the robustness of large language models on adversarial prompts,”
“Evaluating the instructionfollowing robustness of large language models to prompt injection,”更广泛的参数规模的LLMs，通常表现出对对抗攻击更高的鲁棒性。
“Revisiting out-of-distribution robustness in nlp: Benchmark, analysis, and llms evaluations,”在Out - of - distribution ( OOD )鲁棒性场景中也验证了这一点
“Synergistic integration of large language models and cognitive architectures for robust ai: An exploratory analysis,”通过将多种认知架构融入LLM来提高人工智能的鲁棒性。
“Building trust in conversational ai: A comprehensive review and solution architecture for explainable, privacy-aware systems using llms and knowledge graph,”与外部模块（知识图谱）相结合来提高LLM的安全性

LLM训练的防御：对抗训练

“Adversarial training for large neural language models,”
“Improving neural language modeling via adversarial training,”
“Freelb: Enhanced adversarial training for natural language understanding,”
“Towards improving adversarial training of nlp models,”
“Token-aware virtual adversarial training in natural language understanding,”
“Towards deep learning models resistant to adversarial attacks,”
“Achieving model robustness through discrete adversarial training,”
“Towards improving adversarial training of nlp models,”
“Improving neural language modeling via adversarial training,”
“Adversarial training for large neural language models,”
“Freelb: Enhanced adversarial training for natural language understanding,”
“Token-aware virtual adversarial training in natural language understanding,”

LLM训练的防御：鲁棒微调

“How should pretrained language models be fine-tuned towards adversarial robustness?”
“Smart: Robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization,”
“Safety-tuned llamas: Lessons from improving the safety of large language models that follow instructions,”

LLM推理的防御：指令预处理

“Baseline defenses for adversarial attacks against aligned language models,”评估了多种针对越狱攻击的基线预处理方法，包括重令牌化和复述。
“On the reliability of watermarks for large language models,”评估了多种针对越狱攻击的基线预处理方法，包括重令牌化和复述
“Text adversarial purification as defense against adversarial attacks,”通过先对输入令牌进行掩码，然后与其他LLMs一起预测被掩码的令牌来净化指令。
“Jailbreak and guard aligned language models with only few in-context demonstrations,”证明了在指令中插入预定义的防御性证明可以有效地防御LLMs的越狱攻击。
“Testtime backdoor mitigation for black-box large language models with defensive demonstrations,”证明了在指令中插入预定义的防御性证明可以有效地防御LLMs的越狱攻击。

LLM推理的防御：恶意检测

提供了对LLM中间结果的深度检查，如神经元激活

“Defending against backdoor attacks in natural language generation,”提出用后向概率检测后门指令。
“A survey on evaluation of large language models,”从掩蔽敏感性的角度区分了正常指令和中毒指令。
“Bddr: An effective defense against textual backdoor attacks,”根据可疑词的文本相关性来识别可疑词。
“Rmlm: A flexible defense framework for proactively mitigating word-level adversarial attacks,”根据多代之间的语义一致性来检测对抗样本
“Shifting attention to relevance: Towards the uncertainty estimation of large language models,”在LLMs的不确定性量化中对此进行了探索
“Onion: A simple and effective defense against textual backdoor attacks,”利用了语言统计特性，例如检测孤立词。

LLM推理的防御：生成后处理

“Jailbreaker in jail: Moving target defense for large language models,”通过与多个模型候选物比较来减轻生成的毒性。
“Llm self defense: By self examination, llms know they are being tricked,”