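Analysis of an abnormal RAC hang (sskgxpsnd2)
The environment is a RAC cluster running Kylin Linux on an x86 virtualization platform. Timeline of the incident: at around 20:24, one node was found hung and did not respond to shutdown commands. After the hung node was powered off, the other node hung as well, and that node was then rebooted.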
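Analysis that led nowhere
Check the GI alert log
It shows a large backward jump in time.
Next, the crsd log
It shows an outright crash and nothing of value.
The ctssd log was not much use either.
After some checking, the time jump turned out to be most likely a side effect of the virtual machine being restarted and then syncing its clock to the host. It is not the cause of the hang.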
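A backward time jump like this can be located mechanically rather than by eye. The following is a minimal sketch, assuming the log uses the timestamp-on-its-own-line format seen in the ASM alert log quoted later in this post. The function name `find_time_rollbacks` and the one-second tolerance are illustrative choices, not part of any Oracle tool.

```python
import re
from datetime import datetime, timedelta

# Timestamp lines in the alert log look like 2023-08-29T20:22:27.651980+08:00
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+[+-]\d{2}:\d{2}$")


def find_time_rollbacks(lines, tolerance=timedelta(seconds=1)):
    """Return (previous, current) pairs where the log clock moves backwards."""
    rollbacks = []
    previous = None
    for line in lines:
        text = line.strip()
        if not TIMESTAMP.match(text):
            continue  # message line, not a timestamp line
        current = datetime.fromisoformat(text)
        if previous is not None and previous - current > tolerance:
            rollbacks.append((previous, current))
        previous = current
    return rollbacks


# Two consecutive timestamps taken from the ASM alert log in this post
sample = [
    "2023-08-29T20:22:27.651980+08:00",
    "Instance shutdown complete (OS id: 20496)",
    "2023-08-29T20:05:25.578362+08:00",
    "MEMORY_TARGET defaulting to 1128267776.",
]

for before, after in find_time_rollbacks(sample):
    print(f"clock went back by {before - after}: {before} -> {after}")
```

A rollback that lines up with an instance restart, as it does here, points to clock resynchronization rather than to the hang itself.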
Analysis that produced results
Check the ASM instance log
2023-08-29T14:08:32.939390+08:00
skgxpvfynet: mtype: 61 process 12032 failed because of a resource problem in the OS. The OS has most likely run out of buffers (rval: 4)
2023-08-29T14:08:32.995797+08:00
opiodr aborting process unknown ospid (11828) as a result of ORA-603
Errors in file /oracle/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_m000_12032.trc (incident=247819):
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
Incident details in: /oracle/app/grid/diag/asm/+asm/+ASM1/incident/incdir_247819/+ASM1_m000_12032_i247819.trc
2023-08-29T14:08:33.135129+08:00
skgxpvfynet: mtype: 61 process 12030 failed because of a resource problem in the OS. The OS has most likely run out of buffers (rval: 4)
Errors in file /oracle/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_12030.trc (incident=247820):
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
Incident details in: /oracle/app/grid/diag/asm/+asm/+ASM1/incident/incdir_247820/+ASM1_ora_12030_i247820.trc
2023-08-29T14:08:35.419844+08:00
opidrv aborting process M000 ospid (12032) as a result of ORA-603
2023-08-29T14:08:35.699231+08:00
opiodr aborting process unknown ospid (12030) as a result of ORA-603
2023-08-29T14:08:35.868746+08:00
Process m000 died, see its trace file
2023-08-29T15:46:28.224445+08:00
NOTE: cleaning up empty system-created directory '+OCRVOTE/gz11db/OCRBACKUP/backup01.ocr.303.1146123969'
NOTE: cleaning up empty system-created directory '+OCRVOTE/gz11db/OCRBACKUP/backup00.ocr.302.1146138375'
NOTE: cleaning up empty system-created directory '+OCRVOTE/gz11db/OCRBACKUP/40051182.285.1146152783'
2023-08-29T16:10:47.520135+08:00
skgxpvfynet: mtype: 61 process 8178 failed because of a resource problem in the OS. The OS has most likely run out of buffers (rval: 4)
Errors in file /oracle/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_8178.trc (incident=247821):
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
Incident details in: /oracle/app/grid/diag/asm/+asm/+ASM1/incident/incdir_247821/+ASM1_ora_8178_i247821.trc
2023-08-29T16:10:49.918887+08:00
opiodr aborting process unknown ospid (8178) as a result of ORA-603
2023-08-29T19:46:34.829519+08:00
NOTE: cleaning up empty system-created directory '+OCRVOTE/gz11db/OCRBACKUP/backup01.ocr.302.1146138375'
NOTE: cleaning up empty system-created directory '+OCRVOTE/gz11db/OCRBACKUP/backup00.ocr.285.1146152783'
NOTE: cleaning up empty system-created directory '+OCRVOTE/gz11db/OCRBACKUP/40735354.289.1146167189'
2023-08-29T20:21:04.087366+08:00
NOTE: ASM client -MGMTDB:_mgmtdb:gz11db disconnected unexpectedly.
NOTE: check client alert log.
NOTE: cleaned up ASM client -MGMTDB:_mgmtdb:gz11db connection state (reg:1944369513)
2023-08-29T20:21:04.693879+08:00
Dumping diagnostic data in directory=[cdmp_20230829202104], requested by (instance=0, osid=23886), summary=[trace bucket dump request (kfnclDelete0)].
2023-08-29T20:21:05.338195+08:00
NOTE: detected orphaned client id 0x10001.
2023-08-29T20:21:50.697339+08:00
NOTE: ASM client YXPTDB1:YXPTDB:gz11db disconnected unexpectedly.
NOTE: check client alert log.
NOTE: cleaned up ASM client YXPTDB1:YXPTDB:gz11db connection state (reg:1287912358)
2023-08-29T20:21:53.345392+08:00
NOTE: detected orphaned client id 0x10002.
2023-08-29T20:22:08.589123+08:00
NOTE: client exited [11015]
2023-08-29T20:22:09.901462+08:00
NOTE: client +ASM1:+ASM:gz11db no longer has group 2 (OCRVOTE) mounted
NOTE: client +ASM1:+ASM:gz11db no longer has group 1 (DATA) mounted
2023-08-29T20:22:09.972392+08:00
NOTE: ASMB0 process exiting due to ASM instance shutdown (inactive for 1 seconds)
NOTE: ASMB0 clearing idle groups before exit
2023-08-29T20:22:10.123064+08:00
NOTE: client +ASM1:+ASM:gz11db deregistered
2023-08-29T20:22:10.195414+08:00
Shutting down instance (immediate) (OS id: 20496)
Shutting down instance: further logons disabled
Stopping background process MMNL
2023-08-29T20:22:11.333414+08:00
Stopping background process MMON
2023-08-29T20:22:13.335540+08:00
License high water mark = 19
2023-08-29T20:22:14.528890+08:00
SQL> ALTER DISKGROUP ALL DISMOUNT /* asm agent *//* {0:0:6404} */
2023-08-29T20:22:14.641327+08:00
NOTE: cache dismounting (clean) group 1/0xA5E0179B (DATA)
NOTE: messaging CKPT to quiesce pins Unix process pid: 20496, image: oracle@gz11db1 (TNS V1-V3)
2023-08-29T20:22:15.254098+08:00
NOTE: LGWR doing clean dismount of group 1 (DATA) thread 1
NOTE: LGWR closing thread 1 of diskgroup 1 (DATA) at ABA 31.3948
NOTE: LGWR released recovery enqueue for thread 1 group 1 (DATA)
2023-08-29T20:22:15.525844+08:00
kjbdomdet send to inst 2
detach from dom 1, sending detach message to inst 2
2023-08-29T20:22:15.652953+08:00
NOTE: detached from domain 1
2023-08-29T20:22:15.656908+08:00
NOTE: cache dismounted group 1/0xA5E0179B (DATA)
2023-08-29T20:22:15.714886+08:00
GMON dismounting group 1 at 6571 for pid 33, osid 20496
2023-08-29T20:22:15.778396+08:00
NOTE: Disk DATA_0000 in mode 0x7f marked for de-assignment
2023-08-29T20:22:15.813238+08:00
SUCCESS: diskgroup DATA was dismounted
NOTE: cache deleting context for group DATA 1/0xa5e0179b
2023-08-29T20:22:15.820667+08:00
NOTE: cache dismounting (clean) group 2/0xA5F0179C (OCRVOTE)
NOTE: messaging CKPT to quiesce pins Unix process pid: 20496, image: oracle@gz11db1 (TNS V1-V3)
2023-08-29T20:22:15.830477+08:00
NOTE: LGWR doing clean dismount of group 2 (OCRVOTE) thread 1
NOTE: LGWR closing thread 1 of diskgroup 2 (OCRVOTE) at ABA 33.8895
NOTE: LGWR released recovery enqueue for thread 1 group 2 (OCRVOTE)
2023-08-29T20:22:16.030466+08:00
kjbdomdet send to inst 2
detach from dom 2, sending detach message to inst 2
2023-08-29T20:22:16.058295+08:00
NOTE: detached from domain 2
2023-08-29T20:22:16.066375+08:00
NOTE: cache dismounted group 2/0xA5F0179C (OCRVOTE)
2023-08-29T20:22:16.068034+08:00
GMON dismounting group 2 at 6572 for pid 33, osid 20496
2023-08-29T20:22:16.076227+08:00
NOTE: Disk OCRVOTE_0000 in mode 0x7f marked for de-assignment
NOTE: Disk OCRVOTE_0001 in mode 0x7f marked for de-assignment
NOTE: Disk OCRVOTE_0002 in mode 0x7f marked for de-assignment
2023-08-29T20:22:16.086385+08:00
SUCCESS: diskgroup OCRVOTE was dismounted
NOTE: cache deleting context for group OCRVOTE 2/0xa5f0179c
2023-08-29T20:22:16.092465+08:00
SUCCESS: ALTER DISKGROUP ALL DISMOUNT /* asm agent *//* {0:0:6404} */
Shutting down archive processes
Archiving is disabled
2023-08-29T20:22:17.511581+08:00
Shutting down archive processes
Archiving is disabled
2023-08-29T20:22:17.702392+08:00
Stopping background process VKTM
2023-08-29T20:22:24.267747+08:00
freeing rdom 2
freeing rdom 1
freeing rdom 0
2023-08-29T20:22:27.651980+08:00
Instance shutdown complete (OS id: 20496)
2023-08-29T20:05:25.578362+08:00
MEMORY_TARGET defaulting to 1128267776.
WARNING: ASM does not support ipclw. Switching to skgxp
WARNING: ASM does not support ipclw. Switching to skgxp
WARNING: ASM does not support ipclw. Switching to skgxp
ksxp_exafusion_enabled_dcf: ipclw_enabled=0
WARNING: ASM does not support ipclw. Switching to skgxp
WARNING: ASM does not support ipclw. Switching to skgxp
WARNING: ASM does not support ipclw. Switching to skgxp
* instance_number obtained from CSS = 1, checking for the existence of node 0...
* node 0 does not exist. instance_number = 1
Starting ORACLE instance (normal) (OS id: 10339)
2023-08-29T20:05:25.597821+08:00
CLI notifier numLatches:7 maxDescs:2103
2023-08-29T20:05:25.640338+08:00
**********************************************************************
2023-08-29T20:05:25.640902+08:00
Dump of system resources acquired for SHARED GLOBAL AREA (SG
In the ASM log from before the restart, one significant error stands out:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
The ASM instance reports that it ran out of buffer space, so the problem may lie in buffers at the OS level.
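On Linux, status 105 is the errno value ENOBUFS, which is where the "No buffer space available" text in ORA-27301 comes from. To see how often each OS call fails and with which status, the ORA-27300 lines can be tallied. This is a rough sketch: `summarize_os_failures` and the small `LINUX_ERRNO_NAMES` table are hypothetical helpers written for this post, and the table is hard-coded because errno numbering differs between platforms.

```python
import re
from collections import Counter

# Linux-specific errno names (assumption: the log came from a Linux host).
LINUX_ERRNO_NAMES = {105: "ENOBUFS"}

ORA_27300 = re.compile(
    r"ORA-27300: OS system dependent operation:(\w+) failed with status: (\d+)"
)


def summarize_os_failures(lines):
    """Count ORA-27300 occurrences per (operation, status) pair."""
    counts = Counter()
    for line in lines:
        hit = ORA_27300.search(line)
        if hit:
            counts[(hit.group(1), int(hit.group(2)))] += 1
    return counts


# Lines copied from the ASM alert log above
sample = [
    "ORA-27504: IPC error creating OSD context",
    "ORA-27300: OS system dependent operation:sendmsg failed with status: 105",
    "ORA-27301: OS failure message: No buffer space available",
    "ORA-27300: OS system dependent operation:sendmsg failed with status: 105",
]

for (operation, status), count in summarize_os_failures(sample).items():
    name = LINUX_ERRNO_NAMES.get(status, "unknown")
    print(f"{operation} failed with status {status} ({name}): {count} time(s)")
```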
Next, the alert log of the DB instance
2023-08-29T20:06:53.015478+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0)
2023-08-29T20:06:53.015515+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0)
2023-08-29T20:06:53.015616+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0)
2023-08-29T20:06:53.015651+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0)
2023-08-29T20:06:53.015861+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0)
2023-08-29T20:06:53.015911+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0)
2023-08-29T20:06:53.016094+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0)
2023-08-29T20:06:53.016176+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0)
2023-08-29T20:06:53.016311+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0)
2023-08-29T20:06:53.016330+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0)
2023-08-29T20:06:53.016377+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0)
2023-08-29T20:06:53.016444+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0)
2023-08-29T20:06:53.066125+08:00
Started redo scan
2023-08-29T20:06:53.220605+08:00
Completed redo scan
 read 7782 KB redo, 707 data blocks need recovery
2023-08-29T20:06:53.314712+08:00
Started redo application at
 Thread 2: logseq 17827, block 147240, offset 0
2023-08-29T20:06:53.393986+08:00
Recovery of Online Redo Log: Thread 2 Group 3 Seq 17827 Reading mem 0
  Mem# 0: +DATA/YXPTDB/redo03.log
2023-08-29T20:06:53.491915+08:00
Completed redo application of 4.09MB
Scrolling further back, there are also log entries showing a process that failed to start:
Errors in file /oracle/app/oracle/diag/rdbms/yxptdb/YXPTDB1/trace/YXPTDB1_m000_14903.trc (incident=420364):
ORA-00700: soft internal error, arguments: [kfnRConnect2], [0], [0x7F263EE44E48], [], [], [], [], [], [], [], [], []
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
Incident details in: /oracle/app/oracle/diag/rdbms/yxptdb/YXPTDB1/incident/incdir_420364/YXPTDB1_m000_14903_i420364.trc
2023-08-25T19:25:01.488828+08:00
Dumping diagnostic data in directory=[cdmp_20230825192501], requested by (instance=1, osid=14903 (M000)), summary=[incident=420364].
2023-08-25T21:42:03.079640+08:00
Thread 1 advanced to log sequence 10126 (LGWR switch)
  Current log# 2 seq# 10126 mem# 0: +DATA/YXPTDB/redo02.log
The process that crashed here is m000. Because it is not a critical process, the business was not affected at the time.
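Which processes were hit can be read off the trace file names in the "Errors in file" lines. Below is a small sketch that extracts the instance, process name, OS pid and incident number from such lines. It assumes the `<instance>_<process>_<pid>.trc` naming visible in the logs above, and it treats `ora` as a foreground server process and anything else as a background or slave process. That classification is a simplification made for illustration, and `parse_trace_errors` is a hypothetical helper.

```python
import posixpath
import re

ERROR_LINE = re.compile(
    r"Errors in file (?P<path>\S+\.trc)\s+\(incident=(?P<incident>\d+)\)"
)
TRACE_NAME = re.compile(r"^(?P<instance>.+)_(?P<process>[^_]+)_(?P<pid>\d+)\.trc$")


def parse_trace_errors(lines):
    """Extract instance, process, pid and incident id from 'Errors in file' lines."""
    results = []
    for line in lines:
        hit = ERROR_LINE.search(line)
        if not hit:
            continue
        name = TRACE_NAME.match(posixpath.basename(hit.group("path")))
        if not name:
            continue
        process = name.group("process")
        results.append({
            "instance": name.group("instance"),
            "process": process,
            "pid": int(name.group("pid")),
            "incident": int(hit.group("incident")),
            # Simplified assumption: 'ora' marks a foreground server process
            "kind": "foreground" if process == "ora" else "background",
        })
    return results


# Lines copied from the ASM and DB alert logs in this post
sample = [
    "Errors in file /oracle/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_m000_12032.trc (incident=247819):",
    "Errors in file /oracle/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_12030.trc (incident=247820):",
    "Errors in file /oracle/app/oracle/diag/rdbms/yxptdb/YXPTDB1/trace/YXPTDB1_m000_14903.trc (incident=420364):",
]

for entry in parse_trace_errors(sample):
    print(entry)
```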
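Checking against official documentation
A search of MOS (My Oracle Support) turned up Doc ID 2484025.1.
A comparison shows the symptoms are quite similar. The cause is attributed to an internal bug, for which no further explanation could be found.
However, this bug is only reported on UEK4. The fix is to change the MTU of the lo (loopback) interface to 16436.
Our environment is:
NeoKylin v7u6 with kernel 3.10.0-957
The error symptoms match, but the kernel does not, so the document can only serve as a reference. The change is still worth making, and we will wait for follow-up feedback.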
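Before and after making the change, the current loopback MTU can be checked. The sketch below reads it from `/sys/class/net/lo/mtu`, the standard sysfs location on Linux, and compares it with the 16436 value cited above. The helper names are hypothetical, and the `ip link` command in the comment is given only as one common way to apply the change, not as the procedure from the MOS document.

```python
from pathlib import Path

RECOMMENDED_LO_MTU = 16436  # value cited in this post from Doc ID 2484025.1


def parse_mtu(text):
    """Parse the content of a sysfs mtu file into an integer."""
    return int(text.strip())


def needs_change(current_mtu, target=RECOMMENDED_LO_MTU):
    """Return True when the current MTU differs from the target value."""
    return current_mtu != target


def read_lo_mtu(sysfs_root="/sys/class/net"):
    """Read the loopback MTU from sysfs (Linux only)."""
    return parse_mtu((Path(sysfs_root) / "lo" / "mtu").read_text())


if __name__ == "__main__":
    try:
        mtu = read_lo_mtu()
    except OSError as exc:
        print(f"could not read loopback MTU: {exc}")
    else:
        if needs_change(mtu):
            # One common way to apply it (run as root): ip link set dev lo mtu 16436
            print(f"lo MTU is {mtu}, differs from {RECOMMENDED_LO_MTU}")
        else:
            print(f"lo MTU is already {RECOMMENDED_LO_MTU}")
```

A change made with `ip link` does not survive a reboot; making it persistent depends on the distribution's network configuration.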