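Analysis of an abnormal RAC hang (sskgxpsnd2)
The environment is a RAC cluster running Kylin Linux on an x86 virtualization platform. Timeline: at about 20:24 one node was found hung and did not respond to shutdown commands. After the hung node was shut down, the other node hung as well, so that node was then restarted.
Unproductive analysis
Check the GI alert log
It shows a large backward jump in time.
Next, the crsd log
It records an outright crash, with no useful information.
The ctssd log was checked as well, and it was not much help either.
After some verification, the backward time jump turned out to be a side effect of restarting the virtual machine, after which the guest clock synchronized to the host. It is not the cause of the hang.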
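A backward time jump like this is easy to miss when reading a long log by eye. The following is a minimal sketch, not part of the original investigation, that scans log lines for ISO-8601 timestamps and reports places where the clock moves backwards. The function name `find_time_regressions` and the 60-second threshold are illustrative assumptions, and the timestamp pattern matches the format used in the ASM and DB alert logs quoted later in this post (the GI alert log may use a different layout).

```python
import re
from datetime import datetime

# Timestamp format as seen in the ASM/DB alert logs in this post,
# e.g. 2023-08-29T20:22:27.651980+08:00
TS_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{6}[+-]\d{2}:\d{2}")


def find_time_regressions(lines, min_jump_seconds=60.0):
    """Return (previous, current) timestamp pairs where time moved backwards.

    min_jump_seconds is an arbitrary threshold (an assumption) that filters
    out tiny reorderings between lines written at nearly the same moment.
    """
    regressions = []
    previous = None
    for line in lines:
        match = TS_RE.match(line.strip())
        if not match:
            continue  # not a timestamp line
        current = datetime.fromisoformat(match.group(0))
        if previous is not None:
            if (previous - current).total_seconds() >= min_jump_seconds:
                regressions.append((previous, current))
        previous = current
    return regressions
```

Fed the ASM alert log quoted later in this post, a scan like this would flag the point where an entry stamped 20:22:27 is followed by one stamped 20:05:25.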
Productive analysis
Check the ASM instance alert log
2023-08-29T14:08:32.939390+08:00
skgxpvfynet: mtype: 61 process 12032 failed because of a resource problem in the OS. The OS has most likely run out of buffers (rval: 4)
2023-08-29T14:08:32.995797+08:00
opiodr aborting process unknown ospid (11828) as a result of ORA-603
Errors in file /oracle/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_m000_12032.trc (incident=247819):
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
Incident details in: /oracle/app/grid/diag/asm/+asm/+ASM1/incident/incdir_247819/+ASM1_m000_12032_i247819.trc
2023-08-29T14:08:33.135129+08:00
skgxpvfynet: mtype: 61 process 12030 failed because of a resource problem in the OS. The OS has most likely run out of buffers (rval: 4)
Errors in file /oracle/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_12030.trc (incident=247820):
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
Incident details in: /oracle/app/grid/diag/asm/+asm/+ASM1/incident/incdir_247820/+ASM1_ora_12030_i247820.trc
2023-08-29T14:08:35.419844+08:00
opidrv aborting process M000 ospid (12032) as a result of ORA-603
2023-08-29T14:08:35.699231+08:00
opiodr aborting process unknown ospid (12030) as a result of ORA-603
2023-08-29T14:08:35.868746+08:00
Process m000 died, see its trace file
2023-08-29T15:46:28.224445+08:00
NOTE: cleaning up empty system-created directory '+OCRVOTE/gz11db/OCRBACKUP/backup01.ocr.303.1146123969'
NOTE: cleaning up empty system-created directory '+OCRVOTE/gz11db/OCRBACKUP/backup00.ocr.302.1146138375'
NOTE: cleaning up empty system-created directory '+OCRVOTE/gz11db/OCRBACKUP/40051182.285.1146152783'
2023-08-29T16:10:47.520135+08:00
skgxpvfynet: mtype: 61 process 8178 failed because of a resource problem in the OS. The OS has most likely run out of buffers (rval: 4)
Errors in file /oracle/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_8178.trc (incident=247821):
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
Incident details in: /oracle/app/grid/diag/asm/+asm/+ASM1/incident/incdir_247821/+ASM1_ora_8178_i247821.trc
2023-08-29T16:10:49.918887+08:00
opiodr aborting process unknown ospid (8178) as a result of ORA-603
2023-08-29T19:46:34.829519+08:00
NOTE: cleaning up empty system-created directory '+OCRVOTE/gz11db/OCRBACKUP/backup01.ocr.302.1146138375'
NOTE: cleaning up empty system-created directory '+OCRVOTE/gz11db/OCRBACKUP/backup00.ocr.285.1146152783'
NOTE: cleaning up empty system-created directory '+OCRVOTE/gz11db/OCRBACKUP/40735354.289.1146167189'
2023-08-29T20:21:04.087366+08:00
NOTE: ASM client -MGMTDB:_mgmtdb:gz11db disconnected unexpectedly.
NOTE: check client alert log.
NOTE: cleaned up ASM client -MGMTDB:_mgmtdb:gz11db connection state (reg:1944369513)
2023-08-29T20:21:04.693879+08:00
Dumping diagnostic data in directory=[cdmp_20230829202104], requested by (instance=0, osid=23886), summary=[trace bucket dump request (kfnclDelete0)].
2023-08-29T20:21:05.338195+08:00
NOTE: detected orphaned client id 0x10001.
2023-08-29T20:21:50.697339+08:00
NOTE: ASM client YXPTDB1:YXPTDB:gz11db disconnected unexpectedly.
NOTE: check client alert log.
NOTE: cleaned up ASM client YXPTDB1:YXPTDB:gz11db connection state (reg:1287912358)
2023-08-29T20:21:53.345392+08:00
NOTE: detected orphaned client id 0x10002.
2023-08-29T20:22:08.589123+08:00
NOTE: client exited [11015]
2023-08-29T20:22:09.901462+08:00
NOTE: client +ASM1:+ASM:gz11db no longer has group 2 (OCRVOTE) mounted
NOTE: client +ASM1:+ASM:gz11db no longer has group 1 (DATA) mounted
2023-08-29T20:22:09.972392+08:00
NOTE: ASMB0 process exiting due to ASM instance shutdown (inactive for 1 seconds)
NOTE: ASMB0 clearing idle groups before exit
2023-08-29T20:22:10.123064+08:00
NOTE: client +ASM1:+ASM:gz11db deregistered
2023-08-29T20:22:10.195414+08:00
Shutting down instance (immediate) (OS id: 20496)
Shutting down instance: further logons disabled
Stopping background process MMNL
2023-08-29T20:22:11.333414+08:00
Stopping background process MMON
2023-08-29T20:22:13.335540+08:00
License high water mark = 19
2023-08-29T20:22:14.528890+08:00
SQL> ALTER DISKGROUP ALL DISMOUNT /* asm agent *//* {0:0:6404} */
2023-08-29T20:22:14.641327+08:00
NOTE: cache dismounting (clean) group 1/0xA5E0179B (DATA)
NOTE: messaging CKPT to quiesce pins Unix process pid: 20496, image: oracle@gz11db1 (TNS V1-V3)
2023-08-29T20:22:15.254098+08:00
NOTE: LGWR doing clean dismount of group 1 (DATA) thread 1
NOTE: LGWR closing thread 1 of diskgroup 1 (DATA) at ABA 31.3948
NOTE: LGWR released recovery enqueue for thread 1 group 1 (DATA)
2023-08-29T20:22:15.525844+08:00
kjbdomdet send to inst 2
detach from dom 1, sending detach message to inst 2
2023-08-29T20:22:15.652953+08:00
NOTE: detached from domain 1
2023-08-29T20:22:15.656908+08:00
NOTE: cache dismounted group 1/0xA5E0179B (DATA)
2023-08-29T20:22:15.714886+08:00
GMON dismounting group 1 at 6571 for pid 33, osid 20496
2023-08-29T20:22:15.778396+08:00
NOTE: Disk DATA_0000 in mode 0x7f marked for de-assignment
2023-08-29T20:22:15.813238+08:00
SUCCESS: diskgroup DATA was dismounted
NOTE: cache deleting context for group DATA 1/0xa5e0179b
2023-08-29T20:22:15.820667+08:00
NOTE: cache dismounting (clean) group 2/0xA5F0179C (OCRVOTE)
NOTE: messaging CKPT to quiesce pins Unix process pid: 20496, image: oracle@gz11db1 (TNS V1-V3)
2023-08-29T20:22:15.830477+08:00
NOTE: LGWR doing clean dismount of group 2 (OCRVOTE) thread 1
NOTE: LGWR closing thread 1 of diskgroup 2 (OCRVOTE) at ABA 33.8895
NOTE: LGWR released recovery enqueue for thread 1 group 2 (OCRVOTE)
2023-08-29T20:22:16.030466+08:00
kjbdomdet send to inst 2
detach from dom 2, sending detach message to inst 2
2023-08-29T20:22:16.058295+08:00
NOTE: detached from domain 2
2023-08-29T20:22:16.066375+08:00
NOTE: cache dismounted group 2/0xA5F0179C (OCRVOTE)
2023-08-29T20:22:16.068034+08:00
GMON dismounting group 2 at 6572 for pid 33, osid 20496
2023-08-29T20:22:16.076227+08:00
NOTE: Disk OCRVOTE_0000 in mode 0x7f marked for de-assignment
NOTE: Disk OCRVOTE_0001 in mode 0x7f marked for de-assignment
NOTE: Disk OCRVOTE_0002 in mode 0x7f marked for de-assignment
2023-08-29T20:22:16.086385+08:00
SUCCESS: diskgroup OCRVOTE was dismounted
NOTE: cache deleting context for group OCRVOTE 2/0xa5f0179c
2023-08-29T20:22:16.092465+08:00
SUCCESS: ALTER DISKGROUP ALL DISMOUNT /* asm agent *//* {0:0:6404} */
Shutting down archive processes
Archiving is disabled
2023-08-29T20:22:17.511581+08:00
Shutting down archive processes
Archiving is disabled
2023-08-29T20:22:17.702392+08:00
Stopping background process VKTM
2023-08-29T20:22:24.267747+08:00
freeing rdom 2
freeing rdom 1
freeing rdom 0
2023-08-29T20:22:27.651980+08:00
Instance shutdown complete (OS id: 20496)
2023-08-29T20:05:25.578362+08:00
MEMORY_TARGET defaulting to 1128267776.
WARNING: ASM does not support ipclw. Switching to skgxp
WARNING: ASM does not support ipclw. Switching to skgxp
WARNING: ASM does not support ipclw. Switching to skgxp
ksxp_exafusion_enabled_dcf: ipclw_enabled=0
WARNING: ASM does not support ipclw. Switching to skgxp
WARNING: ASM does not support ipclw. Switching to skgxp
WARNING: ASM does not support ipclw. Switching to skgxp
* instance_number obtained from CSS = 1, checking for the existence of node 0...
* node 0 does not exist. instance_number = 1
Starting ORACLE instance (normal) (OS id: 10339)
2023-08-29T20:05:25.597821+08:00
CLI notifier numLatches:7 maxDescs:2103
2023-08-29T20:05:25.640338+08:00
**********************************************************************
2023-08-29T20:05:25.640902+08:00
Dump of system resources acquired for SHARED GLOBAL AREA (SG
The ASM log from before the restart contains one important error:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
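The ASM instance is reporting that no buffer space is available (status 105 is ENOBUFS on Linux), which suggests a problem with buffers at the OS level.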
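To see how often and when the error occurred, the alert log can be scanned for the `sendmsg failed with status: 105` line and each hit paired with the most recent timestamp line before it. This is a minimal sketch under that assumption; the helper name `enobufs_events` is hypothetical.

```python
import re

# Timestamp lines in the alert log start with an ISO-8601 date and time.
TS_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")
# The ORA-27300 line that carries the failing syscall and errno.
ERR_RE = re.compile(r"ORA-27300: .*sendmsg failed with status: 105")


def enobufs_events(lines):
    """Return, for each sendmsg/105 error line, the timestamp that precedes it.

    The timestamp is truncated to whole seconds; None is returned for an
    error line that appears before any timestamp line.
    """
    events = []
    last_ts = None
    for line in lines:
        text = line.strip()
        ts_match = TS_RE.match(text)
        if ts_match:
            last_ts = ts_match.group(0)
        elif ERR_RE.search(text):
            events.append(last_ts)
    return events
```

Applied to the ASM excerpt above, this gives three events, at 14:08:32, 14:08:33 and 16:10:47 on 2023-08-29, all several hours before the hang at around 20:24.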
Next, the alert log of the DB instance
2023-08-29T20:06:53.015478+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0)
2023-08-29T20:06:53.015515+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0)
2023-08-29T20:06:53.015616+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0)
2023-08-29T20:06:53.015651+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0)
2023-08-29T20:06:53.015861+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0)
2023-08-29T20:06:53.015911+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0)
2023-08-29T20:06:53.016094+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0)
2023-08-29T20:06:53.016176+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0)
2023-08-29T20:06:53.016311+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0)
2023-08-29T20:06:53.016330+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0)
2023-08-29T20:06:53.016377+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0)
2023-08-29T20:06:53.016444+08:00
start recovery: pdb 0, passed in flags x5 (domain enable 0)
2023-08-29T20:06:53.066125+08:00
Started redo scan
2023-08-29T20:06:53.220605+08:00
Completed redo scan
 read 7782 KB redo, 707 data blocks need recovery
2023-08-29T20:06:53.314712+08:00
Started redo application at
 Thread 2: logseq 17827, block 147240, offset 0
2023-08-29T20:06:53.393986+08:00
Recovery of Online Redo Log: Thread 2 Group 3 Seq 17827 Reading mem 0
  Mem# 0: +DATA/YXPTDB/redo03.log
2023-08-29T20:06:53.491915+08:00
Completed redo application of 4.09MB
Scrolling further back, there are also entries showing a process that failed to start:
Errors in file /oracle/app/oracle/diag/rdbms/yxptdb/YXPTDB1/trace/YXPTDB1_m000_14903.trc (incident=420364):
ORA-00700: soft internal error, arguments: [kfnRConnect2], [0], [0x7F263EE44E48], [], [], [], [], [], [], [], [], []
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
Incident details in: /oracle/app/oracle/diag/rdbms/yxptdb/YXPTDB1/incident/incdir_420364/YXPTDB1_m000_14903_i420364.trc
2023-08-25T19:25:01.488828+08:00
Dumping diagnostic data in directory=[cdmp_20230825192501], requested by (instance=1, osid=14903 (M000)), summary=[incident=420364].
2023-08-25T21:42:03.079640+08:00
Thread 1 advanced to log sequence 10126 (LGWR switch)
  Current log# 2 seq# 10126 mem# 0: +DATA/YXPTDB/redo02.log
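The process that crashed here is m000. Because that process is not critical, the business was not affected at the time.
Confirmation from official documentation
A search on MOS turned up Doc ID 2484025.1.
The symptoms it describes are quite similar to ours. The cause is given as an internal bug, for which no further description could be found.
However, that bug is reported only on UEK4. The fix is to change the MTU of the lo interface to 16436.
Our environment is
NeoKylin (中标麒麟) v7u6 with kernel 3.10.0-957
The error symptoms match, but the kernel does not, so the note can only serve as a reference. The change is still worth making, and we will wait for follow-up feedback to see whether it helps.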
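Before changing anything, it helps to record the current loopback MTU on each node. The sketch below reads it from sysfs and compares it with the 16436 value cited above. The function names and the `sysfs_root` parameter are illustrative assumptions, and the `/sys/class/net/<interface>/mtu` path is the standard Linux sysfs location rather than something taken from the MOS note.

```python
from pathlib import Path

# Target value cited from the MOS note discussed in this post.
RECOMMENDED_LO_MTU = 16436


def read_mtu(interface="lo", sysfs_root="/sys/class/net"):
    """Read the current MTU of a network interface from Linux sysfs."""
    mtu_file = Path(sysfs_root) / interface / "mtu"
    return int(mtu_file.read_text().strip())


def needs_change(current_mtu, target=RECOMMENDED_LO_MTU):
    """Return True when the interface MTU differs from the target value."""
    return current_mtu != target


# Example usage on a cluster node (Linux only):
#   mtu = read_mtu("lo")
#   print(f"lo mtu={mtu}, change needed: {needs_change(mtu)}")
```

On 3.10-based kernels the loopback default is typically 65536, so the check would most likely report that a change is needed. The runtime change itself can be made with the standard iproute2 command `ip link set dev lo mtu 16436`; how to make it persistent across reboots depends on the distribution and is not covered here.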