ORACLE数据库实例报错ORA-00470: LGWR process terminated with error宕机问题分析报告
- 服务概述
10月21号03:22分,BOSS数据库实例发生异常宕机;工程师及时响应此问题并对此故障原因进行分析及相关建议,详细的故障情况及相关日志、TRACE文件的分析及总结、建议,请参阅本文档。
- hzboss数据库实例宕机分析
4.1 数据库层面日志的分析
4.1.1故障数据库实例alert日志
数据库实例的Alert日志
对故障时间点附近的ALERT日志分析如下:
Wed Oct 21 03:06:22 2024
Thread 1 advanced to log sequence 254206 (LGWR switch)
Current log# 9 seq# 254206 mem# 0: /boss_sysdata/oradata/dtvboss/redo09.log
Wed Oct 21 03:17:10 2024
WARNING: aiowait timed out 1 times --该时间段系统出现aio等待的告警,该告警由_iowait_timeouts隐含参数控制,超过该参数设置的值后会导致系统宕机,这里并未超过系统设置值。但是提示我们当时系统在io调度上出现了问题。
Wed Oct 21 03:22:11 2024
Errors in file /oracle/admin/dtvboss/bdump/dtvboss_arc0_404.trc:
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 367' --22分17秒系统抛错,进程367持有controlfile的enqueue lock时间被其他进程(arc0)等待超时,该enqueue lock持有时间由隐含参数_controlfile_enqueue_timeout控制,默认为900秒
Wed Oct 21 03:22:17 2024
System State dumped to trace file /oracle/admin/dtvboss/bdump/dtvboss_arc0_404.trc
Wed Oct 21 03:22:50 2024
Errors in file /oracle/admin/dtvboss/bdump/dtvboss_lgwr_365.trc:
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 367' --22分50秒,进程367持有的controlfile的enqueue lock时间被其他进程(lgwr)等待超时抛错
Wed Oct 21 03:22:55 2024
Killing enqueue blocker (pid=367) on resource CF-00000000-00000000 --LGWR进程发现异常查杀持有controlfile equeue lock的进程367
by killing session 1648.1
Wed Oct 21 03:22:55 2024
Errors in file /oracle/admin/dtvboss/bdump/dtvboss_lgwr_365.trc:
ORA-07445: exception encountered: core dump [ksuklms()+672] [SIGSEGV] [Address not mapped to object] [0x000000062] [] []
Wed Oct 21 03:23:18 2024
Errors in file /oracle/admin/dtvboss/bdump/dtvboss_pmon_343.trc: --后续一系列报错宕机开始
ORA-00470: LGWR process terminated with error
Wed Oct 21 03:23:18 2024
PMON: terminating instance due to error 470
Wed Oct 21 03:23:18 2024
ORA-470 encountered when generating server alert SMG-3503
Wed Oct 21 03:23:18 2024
Errors in file /oracle/admin/dtvboss/bdump/dtvboss_j006_22335.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-00470: LGWR process terminated with error
Wed Oct 21 03:23:22 2024
Termination issued to instance processes. Waiting for the processes to exit
Wed Oct 21 03:23:28 2024
Instance termination failed to kill one or more processes
4.1.2 故障时trace文件分析
故障时,节点数据库实例的alert日志中提到了如下TRACE文件:dtvboss_arc0_404.trc
分析如下:
/oracle/admin/dtvboss/bdump/dtvboss_arc0_404.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /oracle/product/10.2.0/db_1
System name: SunOS
Node name: hzboss-ora1
Release: 5.10
Version: Generic_144488-12
Machine: sun4u
Instance name: dtvboss
Redo thread mounted by this instance: 1
Oracle process number: 23
Unix process pid: 404, image: oracle@hzboss-ora1 (ARC0)
*** SERVICE NAME:(SYS$BACKGROUND) 2024-10-21 03:22:11.264
*** SESSION ID:(1646.3) 2024-10-21 03:22:11.264
*** 2024-10-21 03:22:11.264
Unable to get enqueue on resource CF-00000000-00000000 (ges mode req=3 held=6)
Possible local blocker ospid=367 sid=1648 sser=1 time_held=37 secs (ges mode req=6 held=4)
DUMP LOCAL BLOCKER: initiate state dump for KILL BLOCKER
possible owner[17.367] on resource CF-00000000-00000000
Dumping process info of pid[17.367] requested by pid[23.404]
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 367'
-------------------------------------------------------------------------------
ENQUEUE [CF] HELD FOR TOO LONG
enqueue holder: 'inst 1, osid 367'
Process 'inst 1, osid 367' is holding an enqueue for maximum allowed time.
The process will be terminated.
Oracle Support Services triaging information: to find the root-cause, look
at the call stack of process 'inst 1, osid 367' located below. Ask the
developer that owns the first NON-service layer in the stack to investigate.
Common service layers are enqueues (ksq), latches (ksl), library cache
pins and locks (kgl), and row cache locks (kqr).
Dumping process 17.367 info:
*** 2024-10-21 03:22:11.300
---------------------------------------------------------------
从这一段信息可以看到,此TRACE文件是ARC0进程的TRACE,ARC0进程的ORACLE SID是23,OS PID是404;
ARC0进程无法获取资源:CF-00000000-00000000 (controlfile enqueue 队列锁),此资源的持有者是pid[17.367]。
ARC0进程的TRACE分析
PROCESS 23:
----------------------------------------
SO: 8a9004f88, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=23, calls cur/top: 8a95c6650/8a95c6650, flag: (2) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 24
last post received-location: ksasnd
last process to post me: b66002018 1 6
last post sent: 0 0 24
last post sent-location: ksasnd
last process posted by me: b66005770 1 2
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: 5eb3388f0
O/S info: user: oracle, term: UNKNOWN, ospid: 404
OSD pid info: Unix process pid: 404, image: oracle@hzboss-ora1 (ARC0)
Dump of memory from 0x00000005F033B1F0 to 0x00000005F033B3F8
5F033B1F0 0000000C 00000000 00000008 AF739C70 [.............s.p]
………………………………
5F033B3E0 00000005 000313A7 00000008 AF739C70 [.............s.p]
5F033B3F0 00000006 000313A7 [........]
(FOB) flags=2 fib=8af713780 incno=0 pending i/o cnt=0
fname=/boss_sysdata/oradata/dtvboss/control03.ctl
fno=2 lblksz=16384 fsiz=1308
(FOB) flags=2 fib=8af7133e0 incno=0 pending i/o cnt=0
fname=/boss_sysdata/oradata/dtvboss/control02.ctl
fno=1 lblksz=16384 fsiz=1308
(FOB) flags=2 fib=8af713040 incno=0 pending i/o cnt=0
fname=/boss_sysdata/oradata/dtvboss/control01.ctl
fno=0 lblksz=16384 fsiz=1308
----------------------------------------
SO: 8a9778500, type: 11, owner: 8a9004f88, flag: INIT/-/-/0x00
(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: 8a9004f88,
event: 21, last message event: 21,
last message waited event: 21, messages read: 0
channel: (5eb52cc20) scumnt mount lock
scope: 1, event: 22, last mesage event: 20,
publishers/subscribers: 0/20,
messages published: 1
----------------------------------------
SO: 8a95b7328, type: 4, owner: 8a9004f88, flag: INIT/-/-/0x00
(session) sid: 1646 trans: 0, creator: 8a9004f88, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0001-0017-0000000D, short-term DID: 0000-0000-00000000
txn branch: 0
oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
service name: SYS$BACKGROUND
last wait for 'enq: CF - contention' blocking sess=0x0 seq=43322 wait_time=2929723 seconds since wait started=36
name|mode=43460004, 0=0, operation=0
Dumping Session Wait History
for 'enq: CF - contention' count=1 wait_time=2929723
name|mode=43460004, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929725
name|mode=43460004, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929726
name|mode=43460004, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929726
name|mode=43460004, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929725
name|mode=43460004, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929723
name|mode=43460004, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929725
name|mode=43460004, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929726
name|mode=43460004, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929724
name|mode=43460004, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929750
name|mode=43460004, 0=0, operation=0
temporary object counter: 0
----------------------------------------
UOL used : 0 locks(used=0, free=0)
KGX Atomic Operation Log 5eaffc848
Mutex 0(0, 0) idn 0 oper NONE
Library Cache uid 1646 efd 0 whr 0 slp 0
KGX Atomic Operation Log 5eaffc890
Mutex 0(0, 0) idn 0 oper NONE
Library Cache uid 1646 efd 0 whr 0 slp 0
KGX Atomic Operation Log 5eaffc8d8
Mutex 0(0, 0) idn 0 oper NONE
Library Cache uid 1646 efd 0 whr 0 slp 0
----------------------------------------
SO: 8afaa4330, type: 41, owner: 8a95b7328, flag: INIT/-/-/0x00
(dummy) nxc=0, nlb=0
----------------------------------------
SO: 8a97783d8, type: 11, owner: 8a9004f88, flag: INIT/-/-/0x00
(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: 8a9004f88,
event: 24, last message event: 27,
last message waited event: 27, messages read: 1
channel: (5eb528a48) system events broadcast channel
scope: 2, event: 2923059, last mesage event: 27,
publishers/subscribers: 0/421,
messages published: 1
----------------------------------------
SO: 8a95c6650, type: 3, owner: 8a9004f88, flag: INIT/-/-/0x00
(call) sess: cur 8a95b7328, rec 8a9591bc8, usr 8a95b7328; depth: 0
----------------------------------------
SO: 8a95d1fb8, type: 5, owner: 8a95c6650, flag: INIT/-/-/0x00
(enqueue) CF-00000000-00000000 DID: 0001-0017-0000000D
lv: 00 79 4a fb 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x2
res: 0x5eb40f608, lock_flag: 0x0
own: 0x8a95b7328, sess: 0x8a95b7328, prv: 0x8a95d1fe8
----------------------------------------
SO: 8a9591bc8, type: 4, owner: 8a95c6650, flag: INIT/-/-/0x00
(session) sid: 1618 trans: 0, creator: 0, flag: (2) -/REC -/-/-/-/-/-
DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000
txn branch: 0
oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
temporary object counter: 0
----------------------------------------
SO: 8a97ad460, type: 16, owner: 8a9004f88, flag: INIT/-/-/0x00
(osp req holder)
此段信息可以看到,ARC0进程等待controlfile enqueue 队列锁。等待事件为enq: CF - contention
SO: 8a95d1fb8, type: 5, owner: 8a95c6650, flag: INIT/-/-/0x00
(enqueue) CF-00000000-00000000 DID: 0001-0017-0000000D
lv: 00 79 4a fb 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x2
res: 0x5eb40f608, lock_flag: 0x0
own: 0x8a95b7328, sess: 0x8a95b7328, prv: 0x8a95d1fe8
这里说明arc0在申请的资源是:res:0x5eb40f608。
而在process 17号的ckpt中可以看到0x5eb40f608资源被ckpt以mode: SSX方式持有持有。
CKPT进程的TRACE分析
PROCESS 17:
----------------------------------------
SO: 8a90037d0, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=17, calls cur/top: 8a95c5df8/8a95c5df8, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 114
last post received-location: kcbbza
last process to post me: 8a9001830 1 6
last post sent: 0 0 9
last post sent-location: ksqrcl
last process posted by me: b66011d18 91 0
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: 5eb3388f0
O/S info: user: oracle, term: UNKNOWN, ospid: 367
OSD pid info: Unix process pid: 367, image: oracle@hzboss-ora1 (CKPT)
Dump of memory from 0x00000005EF308A18 to 0x00000005EF308C20
5EF308A10 0000000C 00000000 [........]
…………………………………………
----------------------------------------
SO: 8a95b9df8, type: 4, owner: 8a90037d0, flag: INIT/-/-/0x00
(session) sid: 1648 trans: 0, creator: 8a90037d0, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0001-0011-00000005, short-term DID: 0000-0000-00000000
txn branch: 0
oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
service name: SYS$BACKGROUND
waiting for 'direct path write' blocking sess=0x0 seq=39369 wait_time=0 seconds since wait started=37
file number=72, first dba=1, block cnt=1
……………………………………
----------------------------------------
SO: 8a9777a30, type: 11, owner: 8a90037d0, flag: INIT/-/-/0x00
(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: 8a90037d0,
event: 14, last message event: 27,
last message waited event: 27, messages read: 1
channel: (5eb528a48) system events broadcast channel
scope: 2, event: 2923059, last mesage event: 27,
publishers/subscribers: 0/421,
messages published: 1
----------------------------------------
SO: 8a95c5df8, type: 3, owner: 8a90037d0, flag: INIT/-/-/0x00
(call) sess: cur 8a95b9df8, rec 8a95b5dc0, usr 8a95b9df8; depth: 0
----------------------------------------
SO: 8a95d1f20, type: 5, owner: 8a95c5df8, flag: INIT/-/-/0x00
(enqueue) CF-00000000-00000004 DID: 0001-0011-00000005
lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x2
res: 0x5eb507858, mode: S, lock_flag: 0x0
own: 0x8a95b9df8, sess: 0x8a95b9df8, proc: 0x8a90037d0, prv: 0x5eb507868
----------------------------------------
SO: 8a95d1cc0, type: 5, owner: 8a95c5df8, flag: INIT/-/-/0x00
(enqueue) CF-00000000-00000000 DID: 0001-0011-00000005
lv: 00 79 4a fb 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x2
res: 0x5eb40f608, mode: SSX, lock_flag: 0x0
own: 0x8a95b9df8, sess: 0x8a95b9df8, proc: 0x8a90037d0, prv: 0x5eb40f618
----------------------------------------
SO: 8a95b5dc0, type: 4, owner: 8a95c5df8, flag: INIT/-/-/0x00
(session) sid: 1645 trans: 0, creator: 0, flag: (2) -/REC -/-/-/-/-/-
DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000
txn branch: 0
oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
temporary object counter: 0
----------------------------------------
SO: 8a95cb468, type: 5, owner: 8a95c5df8, flag: INIT/-/-/0x00
(enqueue) RS-00000019-00000001 DID: 0001-0011-00000005
lv: 20 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x2
res: 0x5eb40f8c8, mode: SS, lock_flag: 0x0
own: 0x8a95b9df8, sess: 0x8a95b9df8, proc: 0x8a90037d0, prv: 0x5eb40f8d8
----------------------------------------
SO: 8a95cb338, type: 5, owner: 8a95c5df8, flag: INIT/-/-/0x00
(enqueue) CF-00000000-00000000 DID: 0000-0011-00000004
lv: 00 79 4a fb 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x2
res: 0x5eb40f608, mode: SS, lock_flag: 0x0
own: 0x8a95b9df8, sess: 0x8a95b9df8, proc: 0x8a90037d0, prv: 0x8a95d1cf0
----------------------------------------
SO: 8a97ad2c8, type: 16, owner: 8a90037d0, flag: INIT/-/-/0x00
(osp req holder)
此处可以看到CKPT进程持有(enqueue) CF-00000000-00000000,持有模式是:res: 0x5eb40f608, mode: SSX, lock_flag: 0x0,可以对应到ARC0进程请求的资源。
LGWR进程的TRACE分析
PROCESS 10:
----------------------------------------
SO: b66002018, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=10, calls cur/top: b665c6de0/b665c6de0, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 24
last post received-location: ksasnd
last process to post me: 8a9000860 1 6
last post sent: 109 0 4
last post sent-location: kslpsr
last process posted by me: 8a902ef90 105 0
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: 5eb3388f0
O/S info: user: oracle, term: UNKNOWN, ospid: 365
OSD pid info: Unix process pid: 365, image: oracle@hzboss-ora1 (LGWR)
Dump of memory from 0x00000005EF308810 to 0x00000005EF308A18
5EF308810 00000015 00000000 0000000B 6C1C1158 [............l..X]
………………
----------------------------------------
SO: b665cc938, type: 5, owner: b66002018, flag: INIT/-/-/0x00
(enqueue) RT-00000001-00000000 DID: 0000-000A-00000005
lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x2
res: 0x5eb40f828, mode: X, lock_flag: 0x0
own: 0xb665bf398, sess: 0xb665bf398, proc: 0xb66002018, prv: 0x5eb40f838
(FOB) flags=2 fib=8af713780 incno=0 pending i/o cnt=0
fname=/boss_sysdata/oradata/dtvboss/control03.ctl
fno=2 lblksz=16384 fsiz=1308
(FOB) flags=2 fib=8af7133e0 incno=0 pending i/o cnt=0
fname=/boss_sysdata/oradata/dtvboss/control02.ctl
fno=1 lblksz=16384 fsiz=1308
(FOB) flags=2 fib=8af713040 incno=0 pending i/o cnt=0
fname=/boss_sysdata/oradata/dtvboss/control01.ctl
fno=0 lblksz=16384 fsiz=1308
----------------------------------------
SO: b66778b58, type: 11, owner: b66002018, flag: INIT/-/-/0x00
(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: b66002018,
event: 12, last message event: 20,
last message waited event: 20, messages read: 1
channel: (5eb52cc20) scumnt mount lock
scope: 1, event: 22, last mesage event: 20,
publishers/subscribers: 0/20,
messages published: 1
----------------------------------------
SO: b665bf398, type: 4, owner: b66002018, flag: INIT/-/-/0x00
(session) sid: 824 trans: 0, creator: b66002018, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0001-000A-00000006, short-term DID: 0000-0000-00000000
txn branch: 0
oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
service name: SYS$BACKGROUND
waiting for 'enq: CF - contention' blocking sess=0x8a95b9df8 seq=61266 wait_time=0 seconds since wait started=0
name|mode=43460005, 0=0, operation=0
Dumping Session Wait History
for 'enq: CF - contention' count=1 wait_time=2929719
name|mode=43460005, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929716
name|mode=43460005, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929716
name|mode=43460005, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929718
name|mode=43460005, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929719
name|mode=43460005, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929719
name|mode=43460005, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929713
name|mode=43460005, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929715
name|mode=43460005, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929722
name|mode=43460005, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929736
name|mode=43460005, 0=0, operation=0
temporary object counter: 0
----------------------------------------
UOL used : 0 locks(used=0, free=0)
KGX Atomic Operation Log 5ebff7cd0
Mutex 0(0, 0) idn 0 oper NONE
Library Cache uid 824 efd 0 whr 0 slp 0
KGX Atomic Operation Log 5ebff7d18
Mutex 0(0, 0) idn 0 oper NONE
Library Cache uid 824 efd 0 whr 0 slp 0
KGX Atomic Operation Log 5ebff7d60
Mutex 0(0, 0) idn 0 oper NONE
Library Cache uid 824 efd 0 whr 0 slp 0
----------------------------------------
SO: b6caa57e0, type: 41, owner: b665bf398, flag: INIT/-/-/0x00
(dummy) nxc=0, nlb=0
----------------------------------------
SO: b66778a48, type: 11, owner: b66002018, flag: INIT/-/-/0x00
(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: b66002018,
event: 13, last message event: 27,
last message waited event: 27, messages read: 1
channel: (5eb528a48) system events broadcast channel
scope: 2, event: 2923059, last mesage event: 27,
publishers/subscribers: 0/421,
messages published: 1
----------------------------------------
SO: b665c6de0, type: 3, owner: b66002018, flag: INIT/-/-/0x00
(call) sess: cur b665bf398, rec 0, usr b665bf398; depth: 0
----------------------------------------
SO: b665cce90, type: 5, owner: b665c6de0, flag: INIT/-/-/0x00
(enqueue) CF-00000000-00000000 DID: 0001-000A-00000006
lv: 00 79 4a fb 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x2
res: 0x5eb40f608, req: SSX, lock_flag: 0x0
own: 0xb665bf398, sess: 0xb665bf398, proc: 0xb66002018, prv: 0x5eb40f628
----------------------------------------
SO: b667ae720, type: 16, owner: b66002018, flag: INIT/-/-/0x00
(osp req holder)
从LGWR进程的TRACE中可以看到:waiting for 'enq: CF - contention' blocking sess=0x8a95b9df8 seq=61266 wait_time=0 seconds since wait started=0
说明LGWR进程也在等待controlfile enqueue 队列锁,等待事件为enq: CF – contention。
SO: b665cce90, type: 5, owner: b665c6de0, flag: INIT/-/-/0x00
(enqueue) CF-00000000-00000000 DID: 0001-000A-00000006
lv: 00 79 4a fb 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x2
res: 0x5eb40f608, req: SSX, lock_flag: 0x0
own: 0xb665bf398, sess: 0xb665bf398, proc: 0xb66002018, prv: 0x5eb40f628
从此段可以看到,LGWR进程请求的CF锁资源是0x5eb40f608,并且是以SSX模式请求;通过CKPT进程的TRACE可以看到其已经own: 0x8a95b9df8 持有此资源res: 0x5eb40f608, mode: SSX, lock_flag: 0x0 SSX模式持有此锁;因此LGWR进程也处于等待此资源的状态。
4.2 主机负载情况分析
从ALERT日志中提到的故障发生时的TRACE文件:
/oracle/admin/dtvboss/bdump/dtvboss_pmon_343.trc
/oracle/admin/dtvboss/bdump/dtvboss_arc0_404.trc
中可以找到故障发生时的主机负载情况:
*** 2024-10-21 03:22:11.300
Dumping diagnostic information for ospid 367:
OS pid = 367
loadavg : 0.44 0.62 1.82
swap info: free_mem = 1435.35M rsv = 59461.24M
alloc = 55941.14M avail = 55124.91 swap_free = 58645.01M
Dumping diagnostic information for oracle@hzboss-ora1:
OS pid = 24887
loadavg : 1.91 1.01 1.84
swap info: free_mem = 7661.14M rsv = 53520.42M
alloc = 50905.74M avail = 62991.89 swap_free = 65606.57M
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
0 S oracle 24887 1 0 39 20 ? 4115427 ? Sep 24 ? 29:24 oracledtvboss (LOCAL=NO)
pstack: cannot examine 24887: no such process
*** 2024-10-21 03:23:53.644
ksuitm_check: OS PID=8690 is still alive
*** 2024-10-21 03:23:53.644
Dumping diagnostic information for oracle@hzboss-ora1 (J002):
OS pid = 8690
loadavg : 1.93 1.02 1.84
swap info: free_mem = 7660.87M rsv = 53524.68M
alloc = 50905.95M avail = 62987.56 swap_free = 65606.30M
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
0 O oracle 8690 1 2 39 20 ? 4121237 00:57:24 ? 29:43 ora_j002_dtvboss
pstack: cannot examine 8690: no such process
*** 2024-10-21 03:23:54.751
ksuitm_check: OS PID=21430 is still alive
*** 2024-10-21 03:23:54.751
Dumping diagnostic information for oracle@hzboss-ora1 (J003):
OS pid = 21430
loadavg : 1.94 1.02 1.84
swap info: free_mem = 7660.27M rsv = 53521.18M
alloc = 50906.49M avail = 62991.00 swap_free = 65605.69M
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
0 O oracle 21430 1 2 39 20 ? 4119652 02:59:44 ? 0:27 ora_j003_dtvboss
pstack: cannot examine 21430: unanticipated system error
*** 2024-10-21 03:23:57.839
ksuitm_check: OS PID=367 is still alive
*** 2024-10-21 03:23:57.840
Dumping diagnostic information for oracle@hzboss-ora1 (CKPT):
OS pid = 367
loadavg : 1.98 1.04 1.84
swap info: free_mem = 7660.46M rsv = 53520.98M
alloc = 50906.02M avail = 62990.88 swap_free = 65605.85M
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
0 S oracle 367 1 0 39 20 ? 4116160 ? Mar 30 ? 498:14 ora_ckpt_dtvboss
pstack: cannot examine 367: no such process
*** 2024-10-21 03:23:58.926
从这些负载信息可以看到,整个系统当时的空闲内存和swap还留有余地,足够当时系统的使用,所以内存不足的原因可以排除。
4.3 数据库版本情况
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /oracle/product/10.2.0/db_1
System name: SunOS
Node name: hzboss-ora1
Release: 5.10
Version: Generic_144488-12
Machine: sun4u
Instance name: dtvboss
Redo thread mounted by this instance: 1
Oracle process number: 2
COMP_ID COMP_NAME VERSION STATUS MODIFIED
---------- ----------------------------------- --------------- -------- -----------------------
SDO Spatial 10.2.0.4.0 VALID 18/FEB/2011:21:05:30
ORDIM Oracle.interMedia 10.2.0.4.0 VALID 18/FEB/2011:21:05:29
AMD OLAP.Catalog 10.2.0.4.0 VALID 18/FEB/2011:21:05:29
EM Oracle.Enterprise.Manager 10.2.0.4.0 VALID 18/FEB/2011:21:04:33
XDB Oracle.XML.Database 10.2.0.4.0 VALID 18/FEB/2011:21:05:29
CONTEXT Oracle.Text 10.2.0.4.0 VALID 18/FEB/2011:21:05:29
EXF Oracle.Expression.Filter 10.2.0.4.0 VALID 18/FEB/2011:21:05:29
RUL Oracle.Rule.Manager 10.2.0.4.0 VALID 18/FEB/2011:21:05:29
OWM Oracle.Workspace.Manager 10.2.0.4.3 VALID 18/FEB/2011:21:05:29
ODM Oracle.Data.Mining 10.2.0.4.0 VALID 18/FEB/2011:21:05:29
CATALOG Oracle.Database.Catalog.Views 10.2.0.4.0 VALID 18/FEB/2011:21:05:29
CATPROC Oracle.Database.Packages.and.Types 10.2.0.4.0 VALID 18/FEB/2011:21:05:29
JAVAVM JServer.JAVA.Virtual.Machine 10.2.0.4.0 VALID 18/FEB/2011:21:05:29
XML Oracle.XDK 10.2.0.4.0 VALID 18/FEB/2011:21:05:29
CATJAVA Oracle.Database.Java.Packages 10.2.0.4.0 VALID 18/FEB/2011:21:05:29
APS OLAP.Analytic.Workspace 10.2.0.4.0 VALID 18/FEB/2011:21:05:29
XOQ Oracle.OLAP.API 10.2.0.4.0 VALID 18/FEB/2011:21:05:29
- 总结与建议
针对此次数据库实例宕机问题,总结及建议如下:
5.1 问题分析汇总
5.1.1 宕机问题总结
综上可以得出结论,由于arc0进程要写日志需要等待控制文件CF锁,lgwr估计要写日志或者切换也在等待控制文件CF锁,而ckpt进程持有controlfile enqueue lock时间过长,arch0等待超过900秒,导致oracle数据库系统判断异常。
此时oracle数据库系统会让lgwr进程kill持有controlfile enqueue lock的进程,相当于lgwr进程杀了ckpt进程,而ckpt进程又为oracle关键进程,连锁反应,引起宕机。
5.1.2 宕机的问题根源判断
在宕机前系统发出告警aiowait timed out 1 times,初步判断系统在当时的io存在一些问题。而当时系统的内存和swap资源都充足没有出现严重问题,trace中的loadavg : 0.44 0.62 1.82则说明了当时的cpu负载低,这里排除了内存和cpu的问题.
初步建议检测当时的redo日志的切换和产生量,当时的数据库服务器的io调用频繁度,因此可以参考下数据库的2点到3点间的awr信息(如果有数据库监控系统那么则可以分析到故障发生时间点的io信息,但是当前环境未部署)。
controlfile equeue lock的超时管理机制中,受_kill_controlfile_enqueue_blocker=false,_controlfile_enqueue_timeout,_kill_enqueue_blocker这3个隐含参数的影响,当前环境中我不建议增加_controlfile_enqueue_timeout,而是建议在通过控制_kill_controlfile_enqueue_blocker,_kill_enqueue_blocker参数来实现该机制的管理,以避免宕机的风险。
控制队列锁管理机制的bug以及避免:
控制文件队列锁有一个相关的匹配Bug 7692631 - 'DATABASE CRASHES WITH ORA-494 AFTER UPGRADE TO 10.2.0.4',这里可以通过以下2个参数规避。
_kill_controlfile_enqueue_blocker值分为true和false,默认为true,如果设置为false则不会去kill超时的进程,此时_kill_controlfile_enqueue_blocker=false参数失效。
_kill_enqueue_blocker值分为别0,1,2,3. 当设置为1的时候,保证后台进程不会被kill,而只会kill前台进程,这样可以避免宕机。
5.2 建议
5.2.1在数据库层上的调整建议
经过沟通当时正在跑批,所以当时的日志切换会比较频繁,因此我们建议对此时的redo log的大小匹配处理进行调整,适当增加日志组以及日志文件大小。控制日志切换保持在20-30分钟一次。
同时建议使用数据库监控工具,以便能在故障时候能及时获取到故障前的监控数据。
在参数调整角度上我们可以将_kill_enqueue_blocker设置为1或者将_kill_controlfile_enqueue_blocker设置为false。
5.2.2在系统层上的调整建议
同时检查系统io是否存在问题,
同时建议部署io层面的监控工具。
相关文章:
ORACLE数据库实例报错ORA-00470: LGWR process terminated with error宕机问题分析报告
服务概述 10月21号03:22分,BOSS数据库实例发生异常宕机;工程师及时响应此问题并对此故障原因进行分析及相关建议,详细的故障情况及相关日志、TRACE文件的分析及总结、建议,请参阅本文档。 hzboss数据库实例宕机分析 4.1 数据库层面日志的分…...

【前端优化】vue2 webpack4项目升级webpack5,大大提升运行速度
记录一下过程 手里有个老项目,vue2webpack4 项目很大,每次运行、运行都要将近10分钟 现在又要往里面写很多东西,再不优化,开发着会更难受,所以决定先将它升级至webpack5 最初失败的尝试 直接在项目里安装了webpack5 但…...

Nginx应用场景详解与配置指南
1. 什么是Nginx? Nginx(发音为"engine-x")是一个高性能的HTTP和反向代理服务器,也是一个IMAP/POP3/SMTP代理服务器。它以高性能、稳定性、丰富的功能集、简单的配置和低资源消耗而闻名。 2. Nginx的主要应用场景 2.1 …...

vue2 切换主题色以及单页面好使方法
今天要新增一个页面要根据不同公司切换不同页面主题色,一点一点来,怎么快速更改 el-pagination 分页组件主题色。 <el-pagination :page-size"pageSize" :pager-count"pageCount"layout"sizes, prev, pager, next, jumper,…...
React学习———CSS Modules(样式模块化)
CSS Modules CSS Modules(样式模块化)是一种用于模块化和局部作用域化CSS样式的技术,让CSS只在当前组件内生效,避免全局样式冲突的技术方案 工作原理 文件命名:通常以.module.css、.module.less、.module.scss等结尾…...
MCP 与 Cloudflare 的结合:网络安全的新高度
MCP 与 Cloudflare 的结合:网络安全的新高度 在数字化时代,网络安全已经不只是某些行业的“专属问题”,而是所有企业、个人都必须面对的核心挑战。从 DDoS 攻击、数据泄露,到身份盗用,每一种网络威胁都可能带来巨大的损失。而这时候,微软 MCP(Microsoft Cloud Platform…...
JavaScript入门【1】概述
1.JavaScript是什么? <font style"color:rgb(38,38,38);">Javascript (简称“JS”)是⼀种直译式脚本语⾔,⼀段脚本其实就是⼀系列指令,计算机通过这些指令来达成⽬标。它⼜是⼀种动态类型的编程语⾔。JS⽤来在⽹…...
PyQt5 的使用
PyQt5 是 Python 里基于 Qt 框架的 GUI 开发工具,能做桌面应用,跨平台(Windows/macOS/Linux 都能用)。你可能想知道:怎么开始用?有哪些核心组件?怎么写界面逻辑?别急,咱们…...

JavaScript【6】事件
1.概述: 在 JavaScript 中,事件(Event)是浏览器或 DOM(文档对象模型)与 JavaScript 代码之间交互的一种机制。它代表了在浏览器环境中发生的特定行为或者动作,比如用户点击鼠标、敲击键盘、页面…...

STM32F10xx 参考手册
6. 什么是寄存器 本章参考资料:《STM32F10xx 参考手册》、《STM32F10xx数据手册》、 学习本章时,配合《STM32F10xx 参考手册》“存储器和总线架构”及“通用I/O(GPIO)”章节一起阅读,效果会更佳,特别是涉及到寄存器说明的部分。…...
使用Docker部署Nacos
sudo systemctl start docker sudo systemctl enable docker docker --version 步骤 2: 拉取 Nacos Docker 镜像 拉取 Nacos 镜像: 你可以从 Docker Hub 上拉取官方的 Nacos 镜像,使用以下命令: docker pull nacos/nacos-server 这会从 …...
深度学习中ONNX格式的模型文件
一、模型部署的核心步骤 模型部署的完整流程通常分为以下阶段,用 “跨国旅行” 类比: 步骤类比解释技术细节1. 训练模型学会一门语言(如中文)用 PyTorch/TensorFlow 训练模型2. 导出为 ONNX翻译成国际通用语言(如英语…...

TIFS2024 | CRFA | 基于关键区域特征攻击提升对抗样本迁移性
Improving Transferability of Adversarial Samples via Critical Region-Oriented Feature-Level Attack 摘要-Abstract引言-Introduction相关工作-Related Work提出的方法-Proposed Method问题分析-Problem Analysis扰动注意力感知加权-Perturbation Attention-Aware Weighti…...

Redis 发布订阅模式深度解析:原理、应用与实践
在现代分布式系统架构中,实时消息传递机制扮演着至关重要的角色。Redis 作为一款高性能的内存数据库,其内置的发布订阅(Pub/Sub)功能提供了一种轻量级、高效的消息通信方案。本文将全面剖析 Redis 发布订阅模式,从其基本概念、工作原理到实际…...
环形缓冲区 ring buffer 概述
环形缓冲区 ring buffer 概述 1. 简介 环形缓冲区(ring buffer),是一种用于表示一个固定尺寸、头尾相连的缓冲区的数据结构,适合缓存数据流。也称作环形缓冲区(circular buffer),环形队列&…...

飞帆控件 post or get it when it has get
我在这里分享两个链接: post_get_it 设计 - 飞帆 有人看出来这个控件是干什么用吗? 控件的配置:...

SQL里where条件的顺序影响索引使用吗?
大家好,我是锋哥。今天分享关于【SQL里where条件的顺序影响索引使用吗?】面试题。希望对大家有帮助; SQL里where条件的顺序影响索引使用吗? 1000道 互联网大厂Java工程师 精选面试题-Java资源分享网 在 SQL 查询中,W…...

SAP学习笔记 - 开发豆知识02 - com.sap.cds.services.cds.CdsService 废止,那么用什么代替呢?
我看很多人代码里面用的都是这个CdsService类,可以自从2.0版本往上这个类就没了啊。 它的代替是什么呢?用的CqnService 那么怎么查的呢? 我也是下载好几个包,解压看,然后发现这里还可以直接看,挺方便的。…...

OpenResty 深度解析:构建高性能 Web 服务的终极方案
引言 openresty是什么?在我个人对它的理解来看相当于嵌入了lua的nginx; 我们在nginx中嵌入lua是为了不需要再重新编译,我们只需要重新修改lua脚本,随后重启即可; 一.lua指令序列 我们分别从初始化阶段,重写/访问阶段,内容阶段,日志…...

什么是路由器环回接口?
路由器环回接口(LoopbackInterface)是网络设备中的一种逻辑虚拟接口,不依赖物理硬件,但在网络配置和管理中具有重要作用。以下是其核心要点: 一、基本特性 1.虚拟性与稳定性 环回接口是纯软件实现的逻辑接口&#x…...
OpenHarmony:开源操作系统重塑产业数字化底座
OpenHarmony:开源操作系统重塑产业数字化底座 引言:当操作系统成为数字公共品 在万物智联时代,操作系统不再是科技巨头的专属领地。华为捐赠的OpenHarmony项目,正以开源协作模式重构操作系统产业格局。这个脱胎于商业版本的开源…...

【MySQL进阶】如何在ubuntu下安装MySQL数据库
前言 🌟🌟本期讲解关于如何在ubuntu环境下安装mysql的详细介绍~~~ 🌈感兴趣的小伙伴看一看小编主页:GGBondlctrl-CSDN博客 🔥 你的点赞就是小编不断更新的最大动力 dz…...

【数据结构】_二叉树
1.二叉树链式结构的实现 1.1 前置说明 在学习二叉树的基本操作前,需先要创建一棵二叉树,然后才能学习其相关的基本操作。由于现在大家对二叉树结构掌握还不够深入,为了降低大家学习成本,此处手动快速创建一棵简单的二叉树&#x…...

给图表组件上点“颜色” —— 我与 CodeBuddy 的合作记录
我正在参加CodeBuddy「首席试玩官」内容创作大赛,本文所使用的 CodeBuddy 免费下载链接:腾讯云代码助手 CodeBuddy - AI 时代的智能编程伙伴 前段时间,我在开发一个 Vue3 项目的时候,碰到了一个小小的挑战:我想做一个可…...

使用 YOLO 结合 PiscTrace 实现股票走势图像识别
在智能投研和金融分析中,自动识别图表中的模式(如 K 线走势、支撑/阻力位、形态结构)成为一种新兴手段。传统的技术分析依赖大量人工判断,而计算机视觉技术的发展,特别是 YOLO 模型在图像识别领域的高效表现࿰…...

OpenCV中的光流估计方法详解
文章目录 一、引言二、核心算法原理1. 光流法基本概念2. 算法实现步骤 三、代码实现详解1. 初始化设置2. 特征点检测3. 光流计算与轨迹绘制 四、实际应用效果五、优化方向六、结语 一、引言 在计算机视觉领域,运动目标跟踪是一个重要的研究方向,广泛应用…...
OpenCL C++ 常见属性与函数
核心对象与属性 对象/属性描述示例cl::Platform表示OpenCL平台cl::Platform::get(&platforms)cl::Device表示计算设备cl::Device::getDefault()cl::Context管理设备、内存和命令队列的上下文cl::Context(contextDevices)cl::CommandQueue命令队列,用于提交命令cl::Command…...
Android核心系统服务:AMS、WMS、PMS 与 system_server 进程解析
1. 引言 在 Android 系统中,ActivityManagerService (AMS)、WindowManagerService (WMS) 和 PackageManagerService (PMS) 是三个最核心的系统服务,它们分别管理着应用的生命周期、窗口显示和应用包管理。 但你是否知道,这些服务并不是独立…...
18.自动化生成知识图谱的多维度质量评估方法论
文章目录 一、结构维度评估1.1 拓扑结构评估1.1.1 基础图论指标1.1.2 层级结构指标 1.2 逻辑一致性评估1.2.1 形式逻辑验证1.2.2 约束满足度 二、语义维度评估2.1 语义一致性评估2.1.1 标签语义分析2.1.2 关系语义评估 2.2 语义表示质量2.2.1 嵌入质量2.2.2 上下文语义评估 三、…...
【行为型之命令模式】游戏开发实战——Unity可撤销系统与高级输入管理的架构秘钥
文章目录 ⌨️ 命令模式(Command Pattern)深度解析一、模式本质与核心价值二、经典UML结构三、Unity实战代码(可撤销的建造系统)1. 定义命令接口与接收者2. 实现具体命令3. 命令管理器(Invoker)4. 客户端使…...