当前位置: 首页 > news >正文

一次奇葩的spin_lock_irq / spin_unlock_irq使用不当导致的系统卡死分析

这是在调试内核block层时遇到的一例奇葩的soft lock锁死问题(内核版本centos 8.3,4.18.0-240),现场如下:

  • [  760.247152] watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/0:1:2635]
  • ……………..
  • [  760.247184] CPU: 0 PID: 2635 Comm: kworker/0:1 Kdump: loaded Tainted: G            E    ---------r-  - 4.18.0 #4
  • [  760.247184] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020
  • [  760.247190] Workqueue: events netstamp_clear
  • [  760.247193] RIP: 0010:smp_call_function_many+0x1ec/0x250
  • [  760.247194] Code: c7 e8 08 fa 74 00 3b 05 96 ee 2b 01 0f 83 99 fe ff ff 48 63 d0 48 8b 0b 48 03 0c d5 60 d8 15 bd 8b 51 18 83 e2 01 74 0a f3 90 <8b> 51 18 83 e2 01 75 f6 eb c7 48 c7 c2 00 4c 41 bd 4c 89 f6 89 df
  • [  760.247195] RSP: 0018:ffffa3b342ac3dd0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
  • [  760.247196] RAX: 0000000000000001 RBX: ffff92adf5c2ae80 RCX: ffff92adf5c704e0
  • [  760.247196] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff92acc7d536c0
  • [  760.247196] RBP: ffffffffbc027610 R08: 000000000002f060 R09: ffffffffbc04fc8a
  • [  760.247197] R10: ffffeec908b18a00 R11: 0000000000000000 R12: 0000000000000000
  • [  760.247197] R13: 0000000000000001 R14: 0000000000000080 R15: 0000000000000001
  • [  760.247198] FS:  0000000000000000(0000) GS:ffff92adf5c00000(0000) knlGS:0000000000000000
  • [  760.247198] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  • [  760.247199] CR2: 00007fa9e40033c8 CR3: 000000001a00a003 CR4: 00000000003606f0
  • [  760.247216] Call Trace:
  • [  760.247221]  ? poke_int3_handler+0xe0/0xe0
  • [  760.247222]  on_each_cpu+0x28/0x60
  • [  760.247223]  text_poke_bp_batch+0x8b/0x160
  • [  760.247225]  arch_jump_label_transform_apply+0x2e/0x50
  • [  760.247226]  static_key_enable_cpuslocked+0x52/0x80
  • [  760.247228]  static_key_enable+0x16/0x20
  • [  760.247229]  process_one_work+0x1a7/0x360
  • [  760.247230]  worker_thread+0x30/0x390
  • [  760.247232]  ? create_worker+0x1a0/0x1a0
  • [  760.247232]  kthread+0x112/0x130
  • [  760.247233]  ? kthread_flush_work_fn+0x10/0x10
  • [  760.247235]  ret_from_fork+0x35/0x40
  • [  760.247237] Kernel panic - not syncing: softlockup: hung tasks
  • [  760.247239] CPU: 0 PID: 2635 Comm: kworker/0:1 Kdump: loaded Tainted: G            EL   ---------r-  - 4.18.0 #4
  • [  760.247239] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020
  • [  760.247241] Workqueue: events netstamp_clear
  • [  760.247241] Call Trace:
  • [  760.247243]  <IRQ>
  • [  760.247245]  dump_stack+0x5c/0x80
  • [  760.247247]  panic+0xe7/0x2a9
  • [  760.247249]  ? __switch_to_asm+0x51/0x70
  • [  760.247250]  watchdog_timer_fn.cold.8+0x85/0x9e
  • [  760.247251]  ? watchdog+0x30/0x30
  • [  760.247253]  __hrtimer_run_queues+0x100/0x280
  • [  760.247255]  hrtimer_interrupt+0x100/0x220
  • [  760.247256]  ? ktime_get+0x36/0xa0
  • [  760.247257]  smp_apic_timer_interrupt+0x6a/0x130
  • [  760.247259]  apic_timer_interrupt+0xf/0x20
  • [  760.247260]  </IRQ>
  • [  760.247262] RIP: 0010:smp_call_function_many+0x1ec/0x250
  • [  760.247263] Code: c7 e8 08 fa 74 00 3b 05 96 ee 2b 01 0f 83 99 fe ff ff 48 63 d0 48 8b 0b 48 03 0c d5 60 d8 15 bd 8b 51 18 83 e2 01 74 0a f3 90 <8b> 51 18 83 e2 01 75 f6 eb c7 48 c7 c2 00 4c 41 bd 4c 89 f6 89 df
  • [  760.247263] RSP: 0018:ffffa3b342ac3dd0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
  • [  760.247264] RAX: 0000000000000001 RBX: ffff92adf5c2ae80 RCX: ffff92adf5c704e0
  • [  760.247265] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff92acc7d536c0
  • [  760.247266] RBP: ffffffffbc027610 R08: 000000000002f060 R09: ffffffffbc04fc8a
  • [  760.247266] R10: ffffeec908b18a00 R11: 0000000000000000 R12: 0000000000000000
  • [  760.247267] R13: 0000000000000001 R14: 0000000000000080 R15: 0000000000000001
  • [  760.247268]  ? poke_int3_handler+0xe0/0xe0
  • [  760.247269]  ? native_send_call_func_ipi+0xda/0x120
  • [  760.247271]  ? poke_int3_handler+0xe0/0xe0
  • [  760.247272]  on_each_cpu+0x28/0x60
  • [  760.247274]  text_poke_bp_batch+0x8b/0x160
  • [  760.247275]  arch_jump_label_transform_apply+0x2e/0x50
  • [  760.247276]  static_key_enable_cpuslocked+0x52/0x80
  • [  760.247277]  static_key_enable+0x16/0x20
  • [  760.247278]  process_one_work+0x1a7/0x360
  • [  760.247279]  worker_thread+0x30/0x390

先启动crash分析:crash /usr/lib/debug/lib/modules/4.18.0-240.el8.x86_64/vmlinux /var/crash/127.0.0.1-2023-02-19-02\:05\:59/vmcore,bt看下卡死时各个进程的栈回溯信息:

  •     DUMPFILE: /var/crash/127.0.0.1-2023-02-19-02:05:59/vmcore  [PARTIAL DUMP]
  •         CPUS: 4
  •         DATE: Sun Feb 19 02:04:55 2023
  •       UPTIME: 00:11:46
  • LOAD AVERAGE: 21.78, 17.84, 9.85
  •        TASKS: 415
  •     NODENAME: localhost.localdomain
  •      RELEASE: 4.18.0
  •      VERSION: #4 SMP Sun Feb 19 00:38:10 PST 2023
  •      MACHINE: x86_64  (3407 Mhz)
  •       MEMORY: 8 GB
  •        PANIC: "Kernel panic - not syncing: softlockup: hung tasks"
  •          PID: 2635
  •      COMMAND: "kworker/0:1"
  •         TASK: ffff92acd6cf17c0  [THREAD_INFO: ffff92acd6cf17c0]
  •          CPU: 0
  •        STATE: TASK_RUNNING (PANIC)
  • crash> bt
  • PID: 2635   TASK: ffff92acd6cf17c0  CPU: 0   COMMAND: "kworker/0:1"
  •  #0 [ffff92adf5c03d48] machine_kexec at ffffffffbc05bf3e
  •  #1 [ffff92adf5c03da0] __crash_kexec at ffffffffbc16072d
  •  #2 [ffff92adf5c03e68] panic at ffffffffbc0b5dc7
  •  #3 [ffff92adf5c03ee8] watchdog_timer_fn.cold.8 at ffffffffbc196dc7
  •  #4 [ffff92adf5c03f18] __hrtimer_run_queues at ffffffffbc1408a0
  •  #5 [ffff92adf5c03f78] hrtimer_interrupt at ffffffffbc141070
  •  #6 [ffff92adf5c03fd8] smp_apic_timer_interrupt at ffffffffbca027da
  •  #7 [ffff92adf5c03ff0] apic_timer_interrupt at ffffffffbca01d6f
  • --- <IRQ stack> ---
  •  #8 [ffffa3b342ac3d28] apic_timer_interrupt at ffffffffbca01d6f
  •     [exception RIP: smp_call_function_many+492]
  •     RIP: ffffffffbc15677c  RSP: ffffa3b342ac3dd0  RFLAGS: 00000202
  •     RAX: 0000000000000001  RBX: ffff92adf5c2ae80  RCX: ffff92adf5c704e0
  •     RDX: 0000000000000001  RSI: 0000000000000000  RDI: ffff92acc7d536c0
  •     RBP: ffffffffbc027610   R8: 000000000002f060   R9: ffffffffbc04fc8a
  •     R10: ffffeec908b18a00  R11: 0000000000000000  R12: 0000000000000000
  •     R13: 0000000000000001  R14: 0000000000000080  R15: 0000000000000001
  •     ORIG_RAX: ffffffffffffff13  CS: 0010  SS: 0018
  •  #9 [ffffa3b342ac3e10] on_each_cpu at ffffffffbc156838
  • #10 [ffffa3b342ac3e30] text_poke_bp_batch at ffffffffbc027fab
  • #11 [ffffa3b342ac3e70] arch_jump_label_transform_apply at ffffffffbc0251be
  • #12 [ffffa3b342ac3e78] static_key_enable_cpuslocked at ffffffffbc226422
  • #13 [ffffa3b342ac3e88] static_key_enable at ffffffffbc226466
  • #14 [ffffa3b342ac3e98] process_one_work at ffffffffbc0d3477
  • #15 [ffffa3b342ac3ed8] worker_thread at ffffffffbc0d3b40
  • #16 [ffffa3b342ac3f10] kthread at ffffffffbc0d9502
  • #17 [ffffa3b342ac3f50] ret_from_fork at ffffffffbca00255
  • crash> bt -a
  • PID: 2635   TASK: ffff92acd6cf17c0  CPU: 0   COMMAND: "kworker/0:1"
  •  #0 [ffff92adf5c03d48] machine_kexec at ffffffffbc05bf3e
  •  #1 [ffff92adf5c03da0] __crash_kexec at ffffffffbc16072d
  •  #2 [ffff92adf5c03e68] panic at ffffffffbc0b5dc7
  •  #3 [ffff92adf5c03ee8] watchdog_timer_fn.cold.8 at ffffffffbc196dc7
  •  #4 [ffff92adf5c03f18] __hrtimer_run_queues at ffffffffbc1408a0
  •  #5 [ffff92adf5c03f78] hrtimer_interrupt at ffffffffbc141070
  •  #6 [ffff92adf5c03fd8] smp_apic_timer_interrupt at ffffffffbca027da
  •  #7 [ffff92adf5c03ff0] apic_timer_interrupt at ffffffffbca01d6f
  • --- <IRQ stack> ---
  •  #8 [ffffa3b342ac3d28] apic_timer_interrupt at ffffffffbca01d6f
  •     [exception RIP: smp_call_function_many+492]
  •     RIP: ffffffffbc15677c  RSP: ffffa3b342ac3dd0  RFLAGS: 00000202
  •     RAX: 0000000000000001  RBX: ffff92adf5c2ae80  RCX: ffff92adf5c704e0
  •     RDX: 0000000000000001  RSI: 0000000000000000  RDI: ffff92acc7d536c0
  •     RBP: ffffffffbc027610   R8: 000000000002f060   R9: ffffffffbc04fc8a
  •     R10: ffffeec908b18a00  R11: 0000000000000000  R12: 0000000000000000
  •     R13: 0000000000000001  R14: 0000000000000080  R15: 0000000000000001
  •     ORIG_RAX: ffffffffffffff13  CS: 0010  SS: 0018
  •  #9 [ffffa3b342ac3e10] on_each_cpu at ffffffffbc156838
  • #10 [ffffa3b342ac3e30] text_poke_bp_batch at ffffffffbc027fab
  • #11 [ffffa3b342ac3e70] arch_jump_label_transform_apply at ffffffffbc0251be
  • #12 [ffffa3b342ac3e78] static_key_enable_cpuslocked at ffffffffbc226422
  • #13 [ffffa3b342ac3e88] static_key_enable at ffffffffbc226466
  • #14 [ffffa3b342ac3e98] process_one_work at ffffffffbc0d3477
  • #15 [ffffa3b342ac3ed8] worker_thread at ffffffffbc0d3b40
  • #16 [ffffa3b342ac3f10] kthread at ffffffffbc0d9502
  • #17 [ffffa3b342ac3f50] ret_from_fork at ffffffffbca00255
  • PID: 2857   TASK: ffff92ad555497c0  CPU: 1   COMMAND: "fio"
  •  #0 [fffffe0000032e50] crash_nmi_callback at ffffffffbc04eee3
  •  #1 [fffffe0000032e58] nmi_handle at ffffffffbc023703
  •  #2 [fffffe0000032eb0] default_do_nmi at ffffffffbc023ade
  •  #3 [fffffe0000032ed0] do_nmi at ffffffffbc023cb8
  •  #4 [fffffe0000032ef0] end_repeat_nmi at ffffffffbca016d4
  •     [exception RIP: native_queued_spin_lock_slowpath+32]
  •     RIP: ffffffffbc111190  RSP: ffff92adf5c43e38  RFLAGS: 00000002
  •     RAX: 0000000000000001  RBX: 0000000000000246  RCX: 0000000000008801
  •     RDX: 0000000000000001  RSI: 0000000000000001  RDI: ffff92adc0fc6400
  •     RBP: ffff92acd0a39800   R8: 0000000000000000   R9: 0000000000000000
  •     R10: 0000000000000068  R11: 0000000000000000  R12: ffff92ad55426910
  •     R13: ffff92adc0fc6400  R14: 000000a494918997  R15: 0000000000008801
  •     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
  • --- <NMI exception stack> ---
  •  #5 [ffff92adf5c43e38] native_queued_spin_lock_slowpath at ffffffffbc111190
  •  #6 [ffff92adf5c43e38] _raw_spin_lock_irqsave at ffffffffbc8cdc12
  •  #7 [ffff92adf5c43e48] bfq_finish_requeue_request at ffffffffc0664555 [bfq]
  •  #8 [ffff92adf5c43ea0] blk_mq_free_request at ffffffffbc4070ca
  •  #9 [ffff92adf5c43ec8] scsi_end_request at ffffffffbc5d0e7a
  • #10 [ffff92adf5c43f00] scsi_io_completion at ffffffffbc5d0fc8
  • #11 [ffff92adf5c43f48] blk_done_softirq at ffffffffbc405ee1
  • #12 [ffff92adf5c43f80] __softirqentry_text_start at ffffffffbcc000e4
  • #13 [ffff92adf5c43fe0] irq_exit at ffffffffbc0bc1d7
  • #14 [ffff92adf5c43ff0] call_function_single_interrupt at ffffffffbca01e0f
  • --- <IRQ stack> ---
  • #15 [ffffa3b34223f728] call_function_single_interrupt at ffffffffbca01e0f
  •     [exception RIP: bfq_pos_tree_add_move+109]
  •     RIP: ffffffffc0667f0d  RSP: ffffa3b34223f7d0  RFLAGS: 00000246
  •     RAX: ffff92adc0fc6200  RBX: ffff92acd0b89000  RCX: ffff92acd6900c38
  •     RDX: 0000000000000000  RSI: ffff92acd967ba58  RDI: ffff92acd0b89038
  •     RBP: ffff92adc0fc6000   R8: 0000000000000000   R9: 0000000000000000
  •     R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000001
  •     R13: ffff92adc0fc6000  R14: ffff92acd20593f0  R15: ffff92acd0b89058
  •     ORIG_RAX: ffffffffffffff04  CS: 0010  SS: 0018
  • #16 [ffffa3b34223f7f8] bfq_remove_request.cold.58 at ffffffffc06680cc [bfq]
  • #17 [ffffa3b34223f878] bfq_finish_requeue_request at ffffffffc0664948 [bfq]
  • #18 [ffffa3b34223f8d0] blk_mq_free_request at ffffffffbc4070ca
  • #19 [ffffa3b34223f8f8] bfq_bio_merge at ffffffffc065da34 [bfq]
  • #20 [ffffa3b34223f948] blk_mq_make_request at ffffffffbc40abb0
  • #21 [ffffa3b34223f9d8] generic_make_request at ffffffffbc3fe85f
  • #22 [ffffa3b34223fa38] submit_bio at ffffffffbc3feadc
  • #23 [ffffa3b34223fa78] do_blockdev_direct_IO at ffffffffbc31ee96
  • #24 [ffffa3b34223fc78] ext4_direct_IO at ffffffffc0734ea6 [ext4]
  • #25 [ffffa3b34223fcf0] generic_file_read_iter at ffffffffbc22da7f
  • #26 [ffffa3b34223fd38] aio_read at ffffffffbc3313a5
  • #27 [ffffa3b34223fe40] io_submit_one at ffffffffbc33165b
  • #28 [ffffa3b34223feb8] __x64_sys_io_submit at ffffffffbc331b82
  • #29 [ffffa3b34223ff38] do_syscall_64 at ffffffffbc00419b
  • #30 [ffffa3b34223ff50] entry_SYSCALL_64_after_hwframe at ffffffffbca000ad
  • PID: 491    TASK: ffff92acd7bf97c0  CPU: 2   COMMAND: "kworker/2:1H"
  •  #0 [fffffe000005de50] crash_nmi_callback at ffffffffbc04eee3
  •  #1 [fffffe000005de58] nmi_handle at ffffffffbc023703
  •  #2 [fffffe000005deb0] default_do_nmi at ffffffffbc023ade
  •  #3 [fffffe000005ded0] do_nmi at ffffffffbc023cb8
  •  #4 [fffffe000005def0] end_repeat_nmi at ffffffffbca016d4
  •     [exception RIP: native_queued_spin_lock_slowpath+32]
  •     RIP: ffffffffbc111190  RSP: ffffa3b3414c3d50  RFLAGS: 00000002
  •     RAX: 0000000000000001  RBX: ffff92acd7f2bc00  RCX: ffff92adc0fc6428
  •     RDX: 0000000000000001  RSI: 0000000000000001  RDI: ffff92adc0fc6400
  •     RBP: ffff92adc0fc6000   R8: 0000000000000000   R9: 0000000000000000
  •     R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
  •     R13: ffff92acd20593f0  R14: ffff92adc0fc6400  R15: ffff92ad55422220
  •     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
  • --- <NMI exception stack> ---
  •  #5 [ffffa3b3414c3d50] native_queued_spin_lock_slowpath at ffffffffbc111190
  •  #6 [ffffa3b3414c3d50] _raw_spin_lock_irq at ffffffffbc8cde43
  •  #7 [ffffa3b3414c3d58] bfq_dispatch_request at ffffffffc0661187 [bfq]
  •  #8 [ffffa3b3414c3db8] blk_mq_do_dispatch_sched at ffffffffbc40f455
  •  #9 [ffffa3b3414c3e10] __blk_mq_sched_dispatch_requests at ffffffffbc40ff89
  • #10 [ffffa3b3414c3e70] blk_mq_sched_dispatch_requests at ffffffffbc410010
  • #11 [ffffa3b3414c3e80] __blk_mq_run_hw_queue at ffffffffbc407691
  • #12 [ffffa3b3414c3e98] process_one_work at ffffffffbc0d3477
  • #13 [ffffa3b3414c3ed8] worker_thread at ffffffffbc0d3b40
  • #14 [ffffa3b3414c3f10] kthread at ffffffffbc0d9502
  • #15 [ffffa3b3414c3f50] ret_from_fork at ffffffffbca00255
  • PID: 2861   TASK: ffff92ad55522f80  CPU: 3   COMMAND: "fio"
  •  #0 [fffffe0000088e50] crash_nmi_callback at ffffffffbc04eee3
  •  #1 [fffffe0000088e58] nmi_handle at ffffffffbc023703
  •  #2 [fffffe0000088eb0] default_do_nmi at ffffffffbc023ade
  •  #3 [fffffe0000088ed0] do_nmi at ffffffffbc023cb8
  •  #4 [fffffe0000088ef0] end_repeat_nmi at ffffffffbca016d4
  •     [exception RIP: native_queued_spin_lock_slowpath+32]
  •     RIP: ffffffffbc111190  RSP: ffff92adf5cc3e38  RFLAGS: 00000002
  •     RAX: 0000000000000001  RBX: 0000000000000246  RCX: 0000000000008801
  •     RDX: 0000000000000001  RSI: 0000000000000001  RDI: ffff92adc0fc6400
  •     RBP: ffff92acd6900c00   R8: 0000000000000000   R9: 0000000000000000
  •     R10: 0000000000000068  R11: 0000000000000000  R12: ffff92acd0b02d10
  •     R13: ffff92adc0fc6400  R14: 000000a49490f1bc  R15: 0000000000008801
  •     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
  • --- <NMI exception stack> ---
  •  #5 [ffff92adf5cc3e38] native_queued_spin_lock_slowpath at ffffffffbc111190
  •  #6 [ffff92adf5cc3e38] _raw_spin_lock_irqsave at ffffffffbc8cdc12
  •  #7 [ffff92adf5cc3e48] bfq_finish_requeue_request at ffffffffc0664555 [bfq]
  •  #8 [ffff92adf5cc3ea0] blk_mq_free_request at ffffffffbc4070ca
  •  #9 [ffff92adf5cc3ec8] scsi_end_request at ffffffffbc5d0e7a
  • #10 [ffff92adf5cc3f00] scsi_io_completion at ffffffffbc5d0fc8
  • #11 [ffff92adf5cc3f48] blk_done_softirq at ffffffffbc405ee1
  • #12 [ffff92adf5cc3f80] __softirqentry_text_start at ffffffffbcc000e4
  • #13 [ffff92adf5cc3fe0] irq_exit at ffffffffbc0bc1d7
  • #14 [ffff92adf5cc3ff0] call_function_single_interrupt at ffffffffbca01e0f
  • --- <IRQ stack> ---
  • #15 [ffffa3b34225f898] call_function_single_interrupt at ffffffffbca01e0f
  •     [exception RIP: bfq_bio_merge+5]
  •     RIP: ffffffffc065d955  RSP: ffffa3b34225f948  RFLAGS: 00000286
  •     RAX: ffffffffc065d950  RBX: ffff92acd20593f0  RCX: 0000000000000000
  •     RDX: ffff92acd967e400  RSI: ffff92ad6e641e00  RDI: ffff92acd7f2bc00
  •     RBP: ffffa3b34225fa30   R8: 0000000000000001   R9: 0000000000000001
  •     R10: 0000000000008000  R11: 0000000000000040  R12: 0000000000000000
  •     R13: ffff92ad6e641e00  R14: 0000000000000001  R15: ffff92ad6e592800
  •     ORIG_RAX: ffffffffffffff04  CS: 0010  SS: 0018
  • #16 [ffffa3b34225f948] blk_mq_make_request at ffffffffbc40abb0
  • #17 [ffffa3b34225f9d8] generic_make_request at ffffffffbc3fe85f
  • #18 [ffffa3b34225fa38] submit_bio at ffffffffbc3feadc
  • #19 [ffffa3b34225fa78] do_blockdev_direct_IO at ffffffffbc31ee96
  • #20 [ffffa3b34225fc78] ext4_direct_IO at ffffffffc0734ea6 [ext4]
  • #21 [ffffa3b34225fcf0] generic_file_read_iter at ffffffffbc22da7f
  • #22 [ffffa3b34225fd38] aio_read at ffffffffbc3313a5
  • #23 [ffffa3b34225fe40] io_submit_one at ffffffffbc33165b
  • #24 [ffffa3b34225feb8] __x64_sys_io_submit at ffffffffbc331b82
  • #25 [ffffa3b34225ff38] do_syscall_64 at ffffffffbc00419b
  • #26 [ffffa3b34225ff50] entry_SYSCALL_64_after_hwframe at ffffffffbca000ad

根据这些信息,发现pid是491和2857这两个进程嫌疑很大。先看下491这个进程的栈回溯:

crash> bt 491

PID: 491    TASK: ffff92acd7bf97c0  CPU: 2   COMMAND: "kworker/2:1H"

 #0 [fffffe000005de50] crash_nmi_callback at ffffffffbc04eee3

 #1 [fffffe000005de58] nmi_handle at ffffffffbc023703

 #2 [fffffe000005deb0] default_do_nmi at ffffffffbc023ade

 #3 [fffffe000005ded0] do_nmi at ffffffffbc023cb8

 #4 [fffffe000005def0] end_repeat_nmi at ffffffffbca016d4

    [exception RIP: native_queued_spin_lock_slowpath+32]

    RIP: ffffffffbc111190  RSP: ffffa3b3414c3d50  RFLAGS: 00000002

    RAX: 0000000000000001  RBX: ffff92acd7f2bc00  RCX: ffff92adc0fc6428

    RDX: 0000000000000001  RSI: 0000000000000001  RDI: ffff92adc0fc6400

    RBP: ffff92adc0fc6000   R8: 0000000000000000   R9: 0000000000000000

    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000

    R13: ffff92acd20593f0  R14: ffff92adc0fc6400  R15: ffff92ad55422220

    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018

--- <NMI exception stack> ---

 #5 [ffffa3b3414c3d50] native_queued_spin_lock_slowpath at ffffffffbc111190

 #6 [ffffa3b3414c3d50] _raw_spin_lock_irq at ffffffffbc8cde43

 #7 [ffffa3b3414c3d58] bfq_dispatch_request at ffffffffc0661187 [bfq]

 #8 [ffffa3b3414c3db8] blk_mq_do_dispatch_sched at ffffffffbc40f455

 #9 [ffffa3b3414c3e10] __blk_mq_sched_dispatch_requests at ffffffffbc40ff89

#10 [ffffa3b3414c3e70] blk_mq_sched_dispatch_requests at ffffffffbc410010

#11 [ffffa3b3414c3e80] __blk_mq_run_hw_queue at ffffffffbc407691

#12 [ffffa3b3414c3e98] process_one_work at ffffffffbc0d3477

#13 [ffffa3b3414c3ed8] worker_thread at ffffffffbc0d3b40

#14 [ffffa3b3414c3f10] kthread at ffffffffbc0d9502

#15 [ffffa3b3414c3f50] ret_from_fork at ffffffffbca00255

2857进程最需要分析看看,这个嫌疑更大

  • PID: 2857   TASK: ffff92ad555497c0  CPU: 1   COMMAND: "fio"
  •  #0 [fffffe0000032e50] crash_nmi_callback at ffffffffbc04eee3
  •  #1 [fffffe0000032e58] nmi_handle at ffffffffbc023703
  •  #2 [fffffe0000032eb0] default_do_nmi at ffffffffbc023ade
  •  #3 [fffffe0000032ed0] do_nmi at ffffffffbc023cb8
  •  #4 [fffffe0000032ef0] end_repeat_nmi at ffffffffbca016d4
  •     [exception RIP: native_queued_spin_lock_slowpath+32]
  •     RIP: ffffffffbc111190  RSP: ffff92adf5c43e38  RFLAGS: 00000002
  •     RAX: 0000000000000001  RBX: 0000000000000246  RCX: 0000000000008801
  •     RDX: 0000000000000001  RSI: 0000000000000001  RDI: ffff92adc0fc6400
  •     RBP: ffff92acd0a39800   R8: 0000000000000000   R9: 0000000000000000
  •     R10: 0000000000000068  R11: 0000000000000000  R12: ffff92ad55426910
  •     R13: ffff92adc0fc6400  R14: 000000a494918997  R15: 0000000000008801
  •     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
  • --- <NMI exception stack> ---
  •  #5 [ffff92adf5c43e38] native_queued_spin_lock_slowpath at ffffffffbc111190
  •  #6 [ffff92adf5c43e38] _raw_spin_lock_irqsave at ffffffffbc8cdc12
  •  //这里获取spin_lock_irq(&bfqd->lock) bfqd->lock 锁失败,之后就大面积出现获取 bfqd->lock 锁失败而卡死
  •  #7 [ffff92adf5c43e48] bfq_finish_requeue_request at ffffffffc0664555 [bfq]
  •  #8 [ffff92adf5c43ea0] blk_mq_free_request at ffffffffbc4070ca
  •  #9 [ffff92adf5c43ec8] scsi_end_request at ffffffffbc5d0e7a
  • #10 [ffff92adf5c43f00] scsi_io_completion at ffffffffbc5d0fc8
  • #11 [ffff92adf5c43f48] blk_done_softirq at ffffffffbc405ee1
  • #12 [ffff92adf5c43f80] __softirqentry_text_start at ffffffffbcc000e4
  • #13 [ffff92adf5c43fe0] irq_exit at ffffffffbc0bc1d7
  • #14 [ffff92adf5c43ff0] call_function_single_interrupt at ffffffffbca01e0f
  • --- <IRQ stack> ---
  • #15 [ffffa3b34223f728] call_function_single_interrupt at ffffffffbca01e0f
  •     //这里竟然产生了中断
  •     [exception RIP: bfq_pos_tree_add_move+109]
  •     RIP: ffffffffc0667f0d  RSP: ffffa3b34223f7d0  RFLAGS: 00000246
  •     RAX: ffff92adc0fc6200  RBX: ffff92acd0b89000  RCX: ffff92acd6900c38
  •     RDX: 0000000000000000  RSI: ffff92acd967ba58  RDI: ffff92acd0b89038
  •     RBP: ffff92adc0fc6000   R8: 0000000000000000   R9: 0000000000000000
  •     R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000001
  •     R13: ffff92adc0fc6000  R14: ffff92acd20593f0  R15: ffff92acd0b89058
  •     ORIG_RAX: ffffffffffffff04  CS: 0010  SS: 0018
  • #16 [ffffa3b34223f7f8] bfq_remove_request.cold.58 at ffffffffc06680cc [bfq]
  • #17 [ffffa3b34223f878] bfq_finish_requeue_request at ffffffffc0664948 [bfq]
  • #18 [ffffa3b34223f8d0] blk_mq_free_request at ffffffffbc4070ca
  •                        //这里 spin_lock_irq(&bfqd->lock) bfqd->lock锁关中断
  • #19 [ffffa3b34223f8f8] bfq_bio_merge at ffffffffc065da34 [bfq]
  • #20 [ffffa3b34223f948] blk_mq_make_request at ffffffffbc40abb0
  • #21 [ffffa3b34223f9d8] generic_make_request at ffffffffbc3fe85f
  • #22 [ffffa3b34223fa38] submit_bio at ffffffffbc3feadc
  • #23 [ffffa3b34223fa78] do_blockdev_direct_IO at ffffffffbc31ee96
  • #24 [ffffa3b34223fc78] ext4_direct_IO at ffffffffc0734ea6 [ext4]
  • #25 [ffffa3b34223fcf0] generic_file_read_iter at ffffffffbc22da7f
  • #26 [ffffa3b34223fd38] aio_read at ffffffffbc3313a5
  • #27 [ffffa3b34223fe40] io_submit_one at ffffffffbc33165b
  • #28 [ffffa3b34223feb8] __x64_sys_io_submit at ffffffffbc331b82

2857号进程执行到bfq_bio_merge()发生的卡死,看下这个函数:

  1. static bool bfq_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
  2. {
  3.     //这里 spin_lock_irq(&bfqd->lock) bfqd->lock锁关中断
  4.     spin_lock_irq(&bfqd->lock);
  5.     ret = blk_mq_sched_try_merge(q, bio, &free);
  6.     if (free)
  7.         blk_mq_free_request(free);
  8.     spin_unlock_irq(&bfqd->lock);
  9. }

先spin_lock_irq(&bfqd->lock) 加bfqd->lock锁并关中断,然后继续执行到 bfq_bio_merge ->blk_mq_free_request->bfq_finish_requeue_request->bfq_remove_request->bfq_pos_tree_add_move函数时,却产生了中断,中断函数里执行bfq_finish_requeue_request->spin_lock_irqsave(&bfqd->lock, flags) 再次获取bfqd->lock锁失败而卡死。这些都发生在cpu1上的线程,先 spin_lock_irq(&bfqd->lock) 获取 bfqd->lock 锁,然后产生中断后又执行 spin_lock_irq(&bfqd->lock)获取bfqd->lock 锁失败。这样肯定会失败呀,因为两次获取同一把锁!

有一个很大的疑问,明明 bfq_bio_merge()里的spin_lock_irq(&bfqd->lock) 是关闭中断的!谁开了中断?导致后续执行bfq_bio_merge ->blk_mq_free_request->bfq_finish_requeue_request->bfq_remove_request->bfq_pos_tree_add_move时,产生了中断,中断函数里执行bfq_finish_requeue_request->spin_lock_irqsave(&bfqd->lock, flags) 再次获取bfqd->lock锁失败而卡死。

排查发现,竟然是 2857号进程先执行bfq_bio_merge->blk_mq_sched_try_merge->attempt_back_merge->attempt_merge->blk_account_io_merge函数意外的开启了中断,如下红色代码:

  1. static void blk_account_io_merge(struct request *req)                                                                                                                                     
  2. {
  3.     if (blk_do_io_stat(req)) {
  4.             struct hd_struct *part;
  5.             part_stat_lock();
  6.             part = req->part;
  7.             part_dec_in_flight(req->q, part, rq_data_dir(req));
  8.             hd_struct_put(part);
  9.             part_stat_unlock();
  10.             if(req->rq_disk && req->rq_disk->process_io.enable && req->p_process_rq_stat){
  11.                 spin_lock_irq(&(req->rq_disk->process_io.process_io_insert_lock));
  12.                 list_del(&req->p_process_rq_stat->process_io_insert);
  13.                 //这里开启了中断!!!!!!!!!!!!!!!!!!!!
  14.                 spin_unlock_irq(&(req->rq_disk->process_io.process_io_insert_lock));
  15.                 kmem_cache_free(req->rq_disk->process_io.process_rq_stat_cachep,req->p_process_rq_stat);
  16.                 atomic_dec(&(req->rq_disk->process_io.rq_in_queue));
  17.                 req->p_process_rq_stat = NULL;
  18.             }
  19.     }
  20. }

竟然是我在 blk_account_io_merge()执行了 spin_unlock_irq 开启了中断!服了,又是自己埋了一个坑然后自己跳进去!再次惊醒!这又是不熟悉上下文导致的!上层bfq_bio_merge()执行 spin_lock_irq(&bfqd->lock) 后,底层调用我的 blk_account_io_merge函数里的spin_unlock_irq,莫名其妙开启了本地中断!导致后续一系列不符合逻辑的获取bfqd->lock 锁失败的奇葩问题。

这是一个很大的教训, spin_lock_irq/ spin_unlock_irq 用起来有风险呀!最保险的是使用 spin_lock_irqsave/spin_unlock_irqrestore。因为spin_unlock_irqrestore 会恢复 spin_lock_irqsave 执行时的cpu中断状态,就不会有这个问题了。而spin_unlock_irq不管关中断前,中断是否关闭或开启,都强制开启中断!这样就会导致中断状态错乱,发生未知问题!比如,本应该是关闭中断的,但spin_unlock_irq却开启了中断,就会出现卡死。

相关文章:

一次奇葩的spin_lock_irq / spin_unlock_irq使用不当导致的系统卡死分析

这是在调试内核block层时遇到的一例奇葩的soft lock锁死问题(内核版本centos 8.3&#xff0c;4.18.0-240)&#xff0c;现场如下&#xff1a; [ 760.247152] watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/0:1:2635]……………..[ 760.247184] CPU: 0 PID: 26…...

公司创建百度百科需要哪些内容?

一个公司或是一个品牌想要让自己更有身份&#xff0c;更有知名度&#xff0c;更有含金量&#xff0c;百度百科词条是必不可少的。通过百度百科展示公司的详细信息&#xff0c;有助于增强用户对公司的信任感&#xff0c;提高企业形象。通过百度百科展示公司的发展历程、领导团队…...

qt中信号槽第五个参数

文章目录 connent函数第五个参数的作用自动连接(Qt::AutoConnection)直接连接(Qt::DirectConnection - 同步)同线程不同线程 队列连接(Qt::QueuedConnection - 异步)同一线程不同线程 锁定队列连接(Qt::BlockingQueuedConnection) connent函数第五个参数的作用 connect(const …...

模式识别与机器学习-SVM(线性支持向量机)

线性支持向量机 线性支持向量机间隔距离学习的对偶算法算法:线性可分支持向量机学习算法线性可分支持向量机例子 谨以此博客作为复习期间的记录 线性支持向量机 在以上四条线中&#xff0c;都可以作为分割平面&#xff0c;误差率也都为0。但是那个分割平面效果更好呢&#xff1…...

【并行计算】GPU,CUDA

一、CUDA层次结构 1.kernel核函数 一个CUDA程序是一个kernel核函数被GPU的多个计算单元并行执行的过程&#xff0c;CUDA给了如下抽象 dim3 threadsPerBlock(4, 3, 1); dim3 numBlocks(3, 2, 1); matrixAdd<<<numBlocks, threadsPerBlock>>>(A, B, C); 2.G…...

计算机网络教案——计算机网络设备章节

第五章 计算机网络设备 一、教学目标: 1. 了解计算机网络的主要设备 2. 了解计算机网络设备的主要原理 3. 掌握计算机网络设备的基本用途 4. 掌握计算机网络设备的使用常识 二、教学重点、难点 计算机网络设备的主要原理 三、技能培训重点、难点 计算机网络设备的使用…...

什么是SLAM中的回环检测,如果没有回环检测会怎样

目录 什么是回环检测 如果没有回环检测 SLAM&#xff08;Simultaneous Localization and Mapping&#xff0c;即同时定位与地图构建&#xff09;是一种使机器人或自动驾驶汽车能够在未知环境中建立地图的同时定位自身位置的技术。回环检测&#xff08;Loop Closure Detectio…...

ubuntu 通过文件设置静态IP、DNS、网关

1. 确定网络接口名称 首先&#xff0c;使用 ip a 命令确定您要配置的网络接口名称。 2. 编辑 Netplan 配置文件 使用文本编辑器&#xff08;如 nano&#xff09;打开或创建 Netplan 配置文件&#xff1a; sudo nano /etc/netplan/01-netcfg.yaml3. 输入 Netplan 配置 在编…...

mapboxgl 中热力图的实现以及给热力图点增加鼠标移上 popup 效果

文章目录 概要效果预览技术思路技术细节小结 概要 本篇文章还是关于最近做到的 mapboxgl 地图展开的。 借鉴官方示例&#xff1a;https://iclient.supermap.io/examples/mapboxgl/editor.html#heatMapLayer 效果预览 技术思路 将接口数据渲染到地图中形成热力图。还需要将热…...

golang并发安全-sync.map

sync.map解决的问题 golang 原生map是存在并发读写的问题&#xff0c;在并发读写时候会抛出异常 func main() {mT : make(map[int]int)g1 : []int{1, 2, 3, 4, 5, 6}g2 : []int{4, 5, 6, 7, 8, 9}go func() {for i : range g1 {mT[i] i}}()go func() {for i : range g2 {mT[…...

开发第一个SpringBoot程序

使用命令创建Maven工程 mvn archetype:generate -DgroupIdorg.sang -DartifactIdchapter01 -DarchetypeArtifactIdmaven-archetype-quickstart -DinteractiveModefalse 参数说明&#xff1a; -DgroupId 组织Id&#xff08;项目包名&#xff09; -DartifactId 项目名称或模块…...

2023年度总结—你是你的年度MVP吗?

这段年度总结其实我之前就想写了&#xff0c;大概就是市赛比完之后18号的样子把&#xff0c;但是因为太懒了就一直拖到了现在哈哈&#xff0c;我思来想去&#xff0c;翻来覆去&#xff0c;彻夜难眠&#xff0c;想了想&#xff0c;还是决定把它写了吧&#xff01;毕竟&#xff0…...

Linux基础知识学习3

vim编辑器 其分为四种模式 1.普通(命令)模式 2.编辑模式 3.底栏模式 4.可视化模式 vim编辑器被称为编辑器之神&#xff0c;而Emacs更是神之编辑器 普通模式&#xff1a; 1.光标移动 ^ 移动到行首 w 跳到下一个单词的开头…...

Leetcode5-在长度2N的数组中找出重复N次的元素(961)

1、题目 给你一个整数数组 nums &#xff0c;该数组具有以下属性&#xff1a; nums.length 2 * n. nums 包含 n 1 个 不同的 元素 nums 中恰有一个元素重复 n 次 找出并返回重复了 n 次的那个元素。 示例 1&#xff1a; 输入&#xff1a;nums [1,2,3,3] 输出&#xff1a…...

openssl的 openssl.cnf配置文件详解

背景&#xff1a;在上一篇文中&#xff0c;提到要写一篇openssl 配置文件详解的&#xff0c;这就来了~~~ find / -name openssl.cnf /etc/pki/tls/openssl.cnf /etc/pki/tls/openssl.cnf&#xff0c;该文件主要设置了证书请求、签名、crl相关的配置。主要相关的伪命令为ca和req…...

SpringBoot集成支付宝,看这一篇就够了。

前 言 在开始集成支付宝支付之前&#xff0c;我们需要准备一个支付宝商家账户&#xff0c;如果是个人开发者&#xff0c;可以通过注册公司或者让有公司资质的单位进行授权&#xff0c;后续在集成相关API的时候需要提供这些信息。 下面我以电脑网页端在线支付为例&#xff0c;介…...

数据结构程序设计——哈希表的应用(2)->哈希表解决冲突的方法

目录 实验须知 代码实现 实验报告 一&#xff1a;问题分析 二、数据结构 1.逻辑结构 2.物理结构 三、算法 &#xff08;一&#xff09;主要算法描述 1.用除留余数法构造哈希函数 2.线性探测再散列法 &#xff08;一&#xff09;主要算法实现代码 四、上机调试 实…...

微信小程序开发系列-07组件

微信小程序开发系列目录 《微信小程序开发系列-01创建一个最小的小程序项目》《微信小程序开发系列-02注册小程序》《微信小程序开发系列-03全局配置中的“window”和“tabBar”》《微信小程序开发系列-04获取用户图像和昵称》《微信小程序开发系列-05登录小程序》《微信小程序…...

JavaScript 中 Set 和 Map 的区别

JavaScript 中的 Set 和 Map 都是用来存储数据的数据结构&#xff0c;它们之间的区别如下&#xff1a; Set 是一组唯一值的集合&#xff0c;而 Map 是一组键值对的集合。Set 中的值是唯一的&#xff0c;不允许重复&#xff1b;Map 中的键是唯一的&#xff0c;值可以重复。Set …...

web前端之JavaScript

MENU JavaScript之设计模式、单例、代理、装饰者、中介者、观察者、发布订阅、策略JavaScript之数组静态方法的实现、reduce、forEach、map、push、every JavaScript之设计模式、单例、代理、装饰者、中介者、观察者、发布订阅、策略 单例模式 概念 保证一个类仅有一个实例&am…...

简易版抽奖活动的设计技术方案

1.前言 本技术方案旨在设计一套完整且可靠的抽奖活动逻辑,确保抽奖活动能够公平、公正、公开地进行,同时满足高并发访问、数据安全存储与高效处理等需求,为用户提供流畅的抽奖体验,助力业务顺利开展。本方案将涵盖抽奖活动的整体架构设计、核心流程逻辑、关键功能实现以及…...

黑马Mybatis

Mybatis 表现层&#xff1a;页面展示 业务层&#xff1a;逻辑处理 持久层&#xff1a;持久数据化保存 在这里插入图片描述 Mybatis快速入门 ![在这里插入图片描述](https://i-blog.csdnimg.cn/direct/6501c2109c4442118ceb6014725e48e4.png //logback.xml <?xml ver…...

基于服务器使用 apt 安装、配置 Nginx

&#x1f9fe; 一、查看可安装的 Nginx 版本 首先&#xff0c;你可以运行以下命令查看可用版本&#xff1a; apt-cache madison nginx-core输出示例&#xff1a; nginx-core | 1.18.0-6ubuntu14.6 | http://archive.ubuntu.com/ubuntu focal-updates/main amd64 Packages ng…...

Python实现prophet 理论及参数优化

文章目录 Prophet理论及模型参数介绍Python代码完整实现prophet 添加外部数据进行模型优化 之前初步学习prophet的时候&#xff0c;写过一篇简单实现&#xff0c;后期随着对该模型的深入研究&#xff0c;本次记录涉及到prophet 的公式以及参数调优&#xff0c;从公式可以更直观…...

06 Deep learning神经网络编程基础 激活函数 --吴恩达

深度学习激活函数详解 一、核心作用 引入非线性:使神经网络可学习复杂模式控制输出范围:如Sigmoid将输出限制在(0,1)梯度传递:影响反向传播的稳定性二、常见类型及数学表达 Sigmoid σ ( x ) = 1 1 +...

Maven 概述、安装、配置、仓库、私服详解

目录 1、Maven 概述 1.1 Maven 的定义 1.2 Maven 解决的问题 1.3 Maven 的核心特性与优势 2、Maven 安装 2.1 下载 Maven 2.2 安装配置 Maven 2.3 测试安装 2.4 修改 Maven 本地仓库的默认路径 3、Maven 配置 3.1 配置本地仓库 3.2 配置 JDK 3.3 IDEA 配置本地 Ma…...

Spring是如何解决Bean的循环依赖:三级缓存机制

1、什么是 Bean 的循环依赖 在 Spring框架中,Bean 的循环依赖是指多个 Bean 之间‌互相持有对方引用‌,形成闭环依赖关系的现象。 多个 Bean 的依赖关系构成环形链路,例如: 双向依赖:Bean A 依赖 Bean B,同时 Bean B 也依赖 Bean A(A↔B)。链条循环: Bean A → Bean…...

NPOI Excel用OLE对象的形式插入文件附件以及插入图片

static void Main(string[] args) {XlsWithObjData();Console.WriteLine("输出完成"); }static void XlsWithObjData() {// 创建工作簿和单元格,只有HSSFWorkbook,XSSFWorkbook不可以HSSFWorkbook workbook new HSSFWorkbook();HSSFSheet sheet (HSSFSheet)workboo…...

上位机开发过程中的设计模式体会(1):工厂方法模式、单例模式和生成器模式

简介 在我的 QT/C 开发工作中&#xff0c;合理运用设计模式极大地提高了代码的可维护性和可扩展性。本文将分享我在实际项目中应用的三种创造型模式&#xff1a;工厂方法模式、单例模式和生成器模式。 1. 工厂模式 (Factory Pattern) 应用场景 在我的 QT 项目中曾经有一个需…...

TCP/IP 网络编程 | 服务端 客户端的封装

设计模式 文章目录 设计模式一、socket.h 接口&#xff08;interface&#xff09;二、socket.cpp 实现&#xff08;implementation&#xff09;三、server.cpp 使用封装&#xff08;main 函数&#xff09;四、client.cpp 使用封装&#xff08;main 函数&#xff09;五、退出方法…...