View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0008451 | Rocky-Linux-9 | kernel | public | 2024-12-03 11:18 | 2024-12-19 07:32 |
Reporter | Steve Rast | Assigned To | |||
Priority | normal | Severity | crash | Reproducibility | random |
Status | new | Resolution | open | ||
Platform | x86_64 | OS | Rocky Linux | OS Version | 9.5 |
Summary | 0008451: nfs-server hard reboot server | ||||
Description | Since upgrading from RockyLinux 9.4 to RockyLinux 9.5 several NFS servers are randomly hard rebooting. I can see this: [251118.198708] perf: interrupt took too long (2524 > 2500), lowering kernel.perf_event_max_sample_rate to 79000 [256533.825978] perf: interrupt took too long (3166 > 3155), lowering kernel.perf_event_max_sample_rate to 63000 [279965.977293] perf: interrupt took too long (3965 > 3957), lowering kernel.perf_event_max_sample_rate to 50000 [326621.176722] ------------[ cut here ]------------ [326621.176728] WARNING: CPU: 18 PID: 3270 at mm/slab_common.c:957 free_large_kmalloc+0x5a/0x80 [326621.176739] Modules linked in: tls binfmt_misc dm_service_time iscsi_tcp libiscsi_tcp libiscsi rpcrdma rdma_cm iw_cm ib_cm ib_core scsi_transport_iscsi nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink vfat fat dm_multipath intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif dm_mod kvm dell_wmi_descriptor sparse_keymap rfkill video iTCO_wdt rapl intel_cstate mxm_wmi mei_me dcdbas mei intel_uncore iTCO_vendor_support ipmi_si joydev acpi_power_meter ipmi_devintf ipmi_msghandler pcspkr lpc_ich nfsd nfs_acl lockd auth_rpcgss grace sunrpc xfs libcrc32c sr_mod sd_mod cdrom t10_pi sg mgag200 uas usb_storage drm_kms_helper ahci libahci drm_shmem_helper crct10dif_pclmul crc32_pclmul drm ixgbe crc32c_intel libata [326621.176795] igb ghash_clmulni_intel megaraid_sas i2c_algo_bit mdio dca wmi fuse [326621.176801] CPU: 18 PID: 3270 Comm: nfsd Kdump: loaded Not tainted 5.14.0-503.14.1.el9_5.x86_64 #1 [326621.176804] Hardware name: Dell Inc. 
PowerEdge R630/02C2CP, BIOS 2.19.0 12/12/2023 [326621.176806] RIP: 0010:free_large_kmalloc+0x5a/0x80 [326621.176811] Code: da 9c 5b fa be 06 00 00 00 48 89 ef e8 af 25 0a 00 80 e7 02 74 01 fb 48 83 c4 08 44 89 e6 48 89 ef 5b 5d 41 5c e9 d6 28 04 00 <0f> 0b 45 31 e4 80 3d 43 0e fc 01 00 ba 00 f0 ff ff 0f 84 fb 9a 90 [326621.176813] RSP: 0018:ffffb6f9092bb968 EFLAGS: 00010246 [326621.176815] RAX: 0017ffffc0001000 RBX: ffffffff8c31e2e0 RCX: ffff94a6c40f9220 [326621.176816] RDX: fffff42b8cb69608 RSI: ffffffff8b058378 RDI: fffff42b8cb69600 [326621.176818] RBP: fffff42b8cb69600 R08: ffffffff8ca06440 R09: ffff94a9afc744b0 [326621.176819] R10: 00000000000003c8 R11: 0000000000000000 R12: ffffffff8b058378 [326621.176820] R13: 0000000000000000 R14: ffff94a68942ae00 R15: ffff94a9e567c000 [326621.176822] FS: 0000000000000000(0000) GS:ffff94a9afc40000(0000) knlGS:0000000000000000 [326621.176824] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [326621.176825] CR2: 00005626630c5140 CR3: 000000032f410001 CR4: 00000000003706f0 [326621.176827] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [326621.176828] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [326621.176829] Call Trace: [326621.176831] <TASK> [326621.176833] ? show_trace_log_lvl+0x1c4/0x2df [326621.176839] ? show_trace_log_lvl+0x1c4/0x2df [326621.176842] ? security_release_secctx+0x28/0x40 [326621.176846] ? free_large_kmalloc+0x5a/0x80 [326621.176849] ? __warn+0x7e/0xd0 [326621.176852] ? free_large_kmalloc+0x5a/0x80 [326621.176855] ? report_bug+0x100/0x140 [326621.176859] ? handle_bug+0x3c/0x70 [326621.176862] ? exc_invalid_op+0x14/0x70 [326621.176864] ? asm_exc_invalid_op+0x16/0x20 [326621.176868] ? lookup_dcache+0x18/0x60 [326621.176872] ? lookup_dcache+0x18/0x60 [326621.176875] ? free_large_kmalloc+0x5a/0x80 [326621.176878] ? lookup_dcache+0x18/0x60 [326621.176880] security_release_secctx+0x28/0x40 [326621.176883] nfsd4_encode_fattr4+0x2cc/0x4f0 [nfsd] [326621.176955] ? 
avc_has_perm_noaudit+0x94/0x110 [326621.176959] ? selinux_inode_permission+0x10e/0x1d0 [326621.176964] ? __d_lookup+0x73/0xb0 [326621.176967] ? d_lookup+0x29/0x50 [326621.176969] ? lookup_dcache+0x18/0x60 [326621.176972] nfsd4_encode_entry4_fattr+0xcd/0x1e0 [nfsd] [326621.177019] nfsd4_encode_entry4+0x100/0x290 [nfsd] [326621.177072] nfsd_buffered_readdir+0x144/0x250 [nfsd] [326621.177114] ? __pfx_nfsd4_encode_entry4+0x10/0x10 [nfsd] [326621.177170] ? __pfx_nfsd_buffered_filldir+0x10/0x10 [nfsd] [326621.177211] ? __pfx_nfsd4_encode_entry4+0x10/0x10 [nfsd] [326621.177255] nfsd_readdir+0xa9/0xe0 [nfsd] [326621.177296] nfsd4_encode_readdir+0xf8/0x1d0 [nfsd] [326621.177341] nfsd4_encode_operation+0xa6/0x2b0 [nfsd] [326621.177386] nfsd4_proc_compound+0x1d0/0x700 [nfsd] [326621.177446] nfsd_dispatch+0xe9/0x220 [nfsd] [326621.177487] svc_process_common+0x2e7/0x650 [sunrpc] [326621.177583] ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd] [326621.177623] svc_process+0x12d/0x170 [sunrpc] [326621.177691] svc_handle_xprt+0x448/0x580 [sunrpc] [326621.177750] svc_recv+0x17a/0x2c0 [sunrpc] [326621.177819] ? __pfx_nfsd+0x10/0x10 [nfsd] [326621.177858] nfsd+0x84/0xb0 [nfsd] [326621.177896] kthread+0xe0/0x100 [326621.177900] ? __pfx_kthread+0x10/0x10 [326621.177904] ret_from_fork+0x2c/0x50 [326621.177919] </TASK> [326621.177920] ---[ end trace 0000000000000000 ]--- [326621.177922] object pointer: 0x00000000e53caba2 [326621.179321] BUG: unable to handle page fault for address: ffff94a86da58000 [326621.179324] #PF: supervisor write access in kernel mode [326621.179327] #PF: error_code(0x0003) - permissions violation [326621.179330] PGD 330801067 P4D 330801067 PUD 100207063 PMD 800000032da000a1 [326621.179337] Oops: 0003 [#1] PREEMPT SMP PTI [326621.179341] CPU: 18 PID: 3270 Comm: nfsd Kdump: loaded Tainted: G W ------- --- 5.14.0-503.14.1.el9_5.x86_64 #1 [326621.179345] Hardware name: Dell Inc. 
PowerEdge R630/02C2CP, BIOS 2.19.0 12/12/2023 [326621.179347] RIP: 0010:svc_process_common+0xe7/0x650 [sunrpc] [326621.179466] Code: 00 00 48 c7 87 80 02 00 00 00 00 00 00 48 29 d0 48 c1 f8 03 c1 e0 0c 89 87 cc 02 00 00 4c 89 e7 e8 ce a9 00 00 48 85 c0 74 02 <89> 18 be 04 00 00 00 4c 89 e7 e8 ba a9 00 00 48 85 c0 74 06 c7 00 [326621.179468] RSP: 0018:ffffb6f9092bbe28 EFLAGS: 00010286 [326621.179470] RAX: ffff94a86da58000 RBX: 000000000bc19c07 RCX: ffff94a86da58000 [326621.179471] RDX: ffff94a9e567c2e8 RSI: 0000000000000004 RDI: ffff94a9e567c238 [326621.179472] RBP: ffff94a9e567c000 R08: ffff94a9e567c1a0 R09: 0000000000000000 [326621.179473] R10: 0000000000000006 R11: 0000000000001000 R12: ffff94a9e567c238 [326621.179474] R13: ffff94a9c23f8f00 R14: ffff94a9c23f8784 R15: ffff94a9e567c000 [326621.179475] FS: 0000000000000000(0000) GS:ffff94a9afc40000(0000) knlGS:0000000000000000 [326621.179477] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [326621.179478] CR2: ffff94a86da58000 CR3: 000000032f410001 CR4: 00000000003706f0 [326621.179479] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [326621.179480] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [326621.179481] Call Trace: [326621.179483] <TASK> [326621.179484] ? show_trace_log_lvl+0x1c4/0x2df [326621.179488] ? show_trace_log_lvl+0x1c4/0x2df [326621.179492] ? svc_process+0x12d/0x170 [sunrpc] [326621.179547] ? __die_body.cold+0x8/0xd [326621.179551] ? page_fault_oops+0x134/0x170 [326621.179554] ? kernelmode_fixup_or_oops+0x84/0x110 [326621.179557] ? exc_page_fault+0xa8/0x150 [326621.179561] ? asm_exc_page_fault+0x22/0x30 [326621.179565] ? svc_process_common+0xe7/0x650 [sunrpc] [326621.179621] ? svc_process_common+0xe2/0x650 [sunrpc] [326621.179678] svc_process+0x12d/0x170 [sunrpc] [326621.179736] svc_handle_xprt+0x448/0x580 [sunrpc] [326621.179796] svc_recv+0x17a/0x2c0 [sunrpc] [326621.179856] ? 
__pfx_nfsd+0x10/0x10 [nfsd] [326621.179896] nfsd+0x84/0xb0 [nfsd] [326621.179936] kthread+0xe0/0x100 [326621.179940] ? __pfx_kthread+0x10/0x10 [326621.179943] ret_from_fork+0x2c/0x50 [326621.179947] </TASK> [326621.179948] Modules linked in: tls binfmt_misc dm_service_time iscsi_tcp libiscsi_tcp libiscsi rpcrdma rdma_cm iw_cm ib_cm ib_core scsi_transport_iscsi nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink vfat fat dm_multipath intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif dm_mod kvm dell_wmi_descriptor sparse_keymap rfkill video iTCO_wdt rapl intel_cstate mxm_wmi mei_me dcdbas mei intel_uncore iTCO_vendor_support ipmi_si joydev acpi_power_meter ipmi_devintf ipmi_msghandler pcspkr lpc_ich nfsd nfs_acl lockd auth_rpcgss grace sunrpc xfs libcrc32c sr_mod sd_mod cdrom t10_pi sg mgag200 uas usb_storage drm_kms_helper ahci libahci drm_shmem_helper crct10dif_pclmul crc32_pclmul drm ixgbe crc32c_intel libata [326621.179989] igb ghash_clmulni_intel megaraid_sas i2c_algo_bit mdio dca wmi fuse [326621.179994] CR2: ffff94a86da58000 I also sometimes see this: kernel:watchdog: BUG: soft lockup - CPU#2 stuck for 160s! [nfsd:5657] The same happens with the latest kernel, 5.14.0-503.15.1.el9_5.x86_64. I use NFSv4 with SSSD integrated with Active Directory. The clients mount via autofs. Everything was working fine until the upgrade to Rocky Linux 9.5. Since then some servers have been rebooting several times per day. It doesn't matter whether the machine boots via UEFI or BIOS, or whether it is a VM or a physical server. My impression is that the more NFS traffic is generated, the faster the servers crash. | ||||
Tags | No tags attached. | ||||
We also have this issue, or one very much like it. We have a busy fileserver that was upgraded from Rocky 9.4 to 9.5 this morning at 0540. It came back up, but at 0610 it stopped responding. The console was unresponsive, so we hard rebooted via the VM controls. Within 30 minutes of that, it had stopped again. We hard rebooted again, and this time chose the previous kernel. Since then (5h+) the VM has been 100% stable, as it was before 9.5. It serves files through both NFSv4 and SMB, using SSSD and AD for authentication. I also get the impression that it's related to NFS traffic. Our logs follow: Dec 3 05:53:41 redacted_hostname: [ 769.252982] ------------[ cut here ]------------ Dec 3 05:53:41 redacted_hostname: [ 769.252987] WARNING: CPU: 0 PID: 1115 at mm/slab_common.c:957 free_large_kmalloc+0x5a/0x80 Dec 3 05:53:41 redacted_hostname: [ 769.253003] Modules linked in: binfmt_misc rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache netfs rpcrdma rdma_cm iw_cm ib_cm ib_core rfkill nft_reject_ipv4 nf_reject_ipv4 nft_reject nft_counter nft_ct nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 vsock_loopback vmw_vsock_virtio_transport_common nf_tables vmw_vsock_vmci_transport vsock nfnetlink vmwgfx intel_rapl_msr vmw_balloon intel_rapl_common drm_ttm_helper ttm vmw_vmci pcspkr drm_kms_helper i2c_piix4 joydev nfsd nfs_acl lockd auth_rpcgss grace sunrpc drm xfs libcrc32c crct10dif_pclmul sd_mod crc32_pclmul ata_generic t10_pi crc32c_intel sg ghash_clmulni_intel ata_piix libata vmxnet3 vmw_pvscsi serio_raw dm_mirror dm_region_hash dm_log dm_mod fuse Dec 3 05:53:41 redacted_hostname: [ 769.253054] CPU: 0 PID: 1115 Comm: nfsd Not tainted 5.14.0-503.15.1.el9_5.x86_64 #1 Dec 3 05:53:41 redacted_hostname: [ 769.253056] Hardware name: VMware, Inc.
VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020 Dec 3 05:53:41 redacted_hostname: [ 769.253057] RIP: 0010:free_large_kmalloc+0x5a/0x80 Dec 3 05:53:41 redacted_hostname: [ 769.253060] Code: da 9c 5b fa be 06 00 00 00 48 89 ef e8 af 25 0a 00 80 e7 02 74 01 fb 48 83 c4 08 44 89 e6 48 89 ef 5b 5d 41 5c e9 d6 28 04 00 <0f> 0b 45 31 e4 80 3d d3 0d fc 01 00 ba 00 f0 ff ff 0f 84 8b 9a 90 Dec 3 05:53:41 redacted_hostname: [ 769.253062] RSP: 0018:ffffafef00d57b28 EFLAGS: 00010246 Dec 3 05:53:41 redacted_hostname: [ 769.253064] RAX: 0017ffffc0000000 RBX: ffffffff8cd1e2e0 RCX: ffff9fce52f7ac68 Dec 3 05:53:41 redacted_hostname: [ 769.253065] RDX: dead000000000100 RSI: ffffffffc096b47c RDI: ffffd9fd06c25ac0 Dec 3 05:53:41 redacted_hostname: [ 769.253065] RBP: ffffd9fd06c25ac0 R08: ffffffff8d4075e0 R09: ffff9fcf35e344b0 Dec 3 05:53:41 redacted_hostname: [ 769.253066] R10: 000000b30f218140 R11: 00000000004c82ea R12: ffffffffc096b47c Dec 3 05:53:41 redacted_hostname: [ 769.253067] R13: 0000000000000000 R14: ffff9fce0a0c4a00 R15: ffff9fce01148000 Dec 3 05:53:41 redacted_hostname: [ 769.253068] FS: 0000000000000000(0000) GS:ffff9fcf35e00000(0000) knlGS:0000000000000000 Dec 3 05:53:41 redacted_hostname: [ 769.253070] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Dec 3 05:53:41 redacted_hostname: [ 769.253071] CR2: 00007efd2f51a654 CR3: 000000010902a002 CR4: 00000000007706f0 Dec 3 05:53:41 redacted_hostname: [ 769.253085] PKRU: 55555554 Dec 3 05:53:41 redacted_hostname: [ 769.253086] Call Trace: Dec 3 05:53:41 redacted_hostname: [ 769.253087] <TASK> Dec 3 05:53:41 redacted_hostname: [ 769.253088] ? srso_alias_return_thunk+0x5/0xfbef5 Dec 3 05:53:41 redacted_hostname: [ 769.253094] ? show_trace_log_lvl+0x26e/0x2df Dec 3 05:53:41 redacted_hostname: [ 769.253101] ? show_trace_log_lvl+0x26e/0x2df Dec 3 05:53:41 redacted_hostname: [ 769.253106] ? security_release_secctx+0x28/0x40 Dec 3 05:53:41 redacted_hostname: [ 769.253110] ? 
free_large_kmalloc+0x5a/0x80 Dec 3 05:53:41 redacted_hostname: [ 769.253113] ? __warn+0x7e/0xd0 Dec 3 05:53:41 redacted_hostname: [ 769.253116] ? free_large_kmalloc+0x5a/0x80 Dec 3 05:53:41 redacted_hostname: [ 769.253119] ? report_bug+0x100/0x140 Dec 3 05:53:41 redacted_hostname: [ 769.253124] ? handle_bug+0x3c/0x70 Dec 3 05:53:41 redacted_hostname: [ 769.253127] ? exc_invalid_op+0x14/0x70 Dec 3 05:53:41 redacted_hostname: [ 769.253130] ? asm_exc_invalid_op+0x16/0x20 Dec 3 05:53:41 redacted_hostname: [ 769.253134] ? _fh_update.part.0.isra.0+0x4c/0x90 [nfsd] Dec 3 05:53:41 redacted_hostname: [ 769.253162] ? _fh_update.part.0.isra.0+0x4c/0x90 [nfsd] Dec 3 05:53:41 redacted_hostname: [ 769.253184] ? free_large_kmalloc+0x5a/0x80 Dec 3 05:53:41 redacted_hostname: [ 769.253188] ? _fh_update.part.0.isra.0+0x4c/0x90 [nfsd] Dec 3 05:53:41 redacted_hostname: [ 769.253206] security_release_secctx+0x28/0x40 Dec 3 05:53:41 redacted_hostname: [ 769.253209] nfsd4_encode_fattr4+0x2cc/0x4f0 [nfsd] Dec 3 05:53:41 redacted_hostname: [ 769.253237] ? srso_alias_return_thunk+0x5/0xfbef5 Dec 3 05:53:41 redacted_hostname: [ 769.253239] ? __kmem_cache_alloc_node+0x18f/0x2e0 Dec 3 05:53:41 redacted_hostname: [ 769.253242] ? security_prepare_creds+0x71/0xa0 Dec 3 05:53:41 redacted_hostname: [ 769.253245] ? security_prepare_creds+0x71/0xa0 Dec 3 05:53:41 redacted_hostname: [ 769.253242] ? security_prepare_creds+0x71/0xa0 Dec 3 05:53:41 redacted_hostname: [ 769.253245] ? security_prepare_creds+0x71/0xa0 Dec 3 05:53:41 redacted_hostname: [ 769.253246] ? srso_alias_return_thunk+0x5/0xfbef5 Dec 3 05:53:41 redacted_hostname: [ 769.253248] ? __kmalloc+0x4b/0x140 Dec 3 05:53:41 redacted_hostname: [ 769.253250] ? srso_alias_return_thunk+0x5/0xfbef5 Dec 3 05:53:41 redacted_hostname: [ 769.253251] ? srso_alias_return_thunk+0x5/0xfbef5 Dec 3 05:53:41 redacted_hostname: [ 769.253253] ? security_prepare_creds+0x47/0xa0 Dec 3 05:53:41 redacted_hostname: [ 769.253255] ? 
srso_alias_return_thunk+0x5/0xfbef5 Dec 3 05:53:41 redacted_hostname: [ 769.253256] ? prepare_creds+0x180/0x270 Dec 3 05:53:41 redacted_hostname: [ 769.253259] ? srso_alias_return_thunk+0x5/0xfbef5 Dec 3 05:53:41 redacted_hostname: [ 769.253261] ? nfsd_setuser+0x110/0x270 [nfsd] Dec 3 05:53:41 redacted_hostname: [ 769.253286] ? srso_alias_return_thunk+0x5/0xfbef5 Dec 3 05:53:41 redacted_hostname: [ 769.253288] ? nfsd_setuser_and_check_port+0x4a/0xc0 [nfsd] Dec 3 05:53:41 redacted_hostname: [ 769.253306] ? _fh_update.part.0.isra.0+0x4c/0x90 [nfsd] Dec 3 05:53:41 redacted_hostname: [ 769.253323] nfsd4_encode_getattr+0x2b/0x40 [nfsd] Dec 3 05:53:41 redacted_hostname: [ 769.253341] nfsd4_encode_operation+0xa6/0x2b0 [nfsd] Dec 3 05:53:41 redacted_hostname: [ 769.253361] nfsd4_proc_compound+0x1d0/0x700 [nfsd] Dec 3 05:53:41 redacted_hostname: [ 769.253384] nfsd_dispatch+0xe9/0x220 [nfsd] Dec 3 05:53:41 redacted_hostname: [ 769.253404] svc_process_common+0x2e7/0x650 [sunrpc] Dec 3 05:53:41 redacted_hostname: [ 769.253435] ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd] Dec 3 05:53:41 redacted_hostname: [ 769.253463] svc_process+0x12d/0x170 [sunrpc] Dec 3 05:53:41 redacted_hostname: [ 769.253491] svc_handle_xprt+0x448/0x580 [sunrpc] Dec 3 05:53:41 redacted_hostname: [ 769.253523] svc_recv+0x17a/0x2c0 [sunrpc] Dec 3 05:53:41 redacted_hostname: [ 769.253552] ? __pfx_nfsd+0x10/0x10 [nfsd] Dec 3 05:53:41 redacted_hostname: [ 769.253577] nfsd+0x84/0xb0 [nfsd] Dec 3 05:53:41 redacted_hostname: [ 769.253600] kthread+0xe0/0x100 Dec 3 05:53:41 redacted_hostname: [ 769.253604] ? __pfx_kthread+0x10/0x10 Dec 3 05:53:41 redacted_hostname: [ 769.253608] ret_from_fork+0x2c/0x50 Dec 3 05:53:41 redacted_hostname: [ 769.253614] </TASK> Dec 3 05:53:41 redacted_hostname: [ 769.253614] ---[ end trace 0000000000000000 ]--- Dec 3 05:53:41 redacted_hostname: [ 769.253616] object pointer: 0x000000008b742c83 |
|
https://nvd.nist.gov/vuln/detail/CVE-2024-46697 feels related |
|
Would either of you have access to, or the ability to generate, a kdump? I'd like to poke around and see if my assumptions here are correct, but I believe this crash is triggered by the 9.5 kernel lacking the fix for CVE-2024-46697 (https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=f58bab6fd4063913bd8321e99874b8239e9ba726). That is, the specified condition occurs and args.context is left holding random junk which, when the kernel attempts to free it, sometimes causes a crash. The effect would probably be more pronounced on busy servers. |
|
In /var/crash I have a lot of kexec-dmesg.log, vmcore, and vmcore-dmesg.txt files. Is that what you would need? |
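For anyone triaging similar dumps, one quick check is whether a saved kernel log shows the three symbols that appear in both traces in this report (free_large_kmalloc, security_release_secctx, nfsd4_encode_fattr4). The following is a minimal shell sketch under that assumption. The function names are hypothetical, and a match only suggests the same signature, not a confirmed root cause.

```shell
#!/usr/bin/env bash

# Succeeds if a dmesg-style text log contains all three symbols shared by
# the two traces posted in this report. (Hypothetical helper name.)
matches_nfsd_secctx_signature() {
    local log="$1" sym
    for sym in free_large_kmalloc security_release_secctx nfsd4_encode_fattr4; do
        grep -q "$sym" "$log" || return 1
    done
}

# Prints every vmcore-dmesg.txt under a crash directory that matches.
# /var/crash and the file name are the ones mentioned in the comment above.
scan_crash_dir() {
    local root="${1:-/var/crash}" f
    [ -d "$root" ] || return 0
    find "$root" -type f -name 'vmcore-dmesg.txt' 2>/dev/null | sort |
        while IFS= read -r f; do
            if matches_nfsd_secctx_signature "$f"; then
                printf '%s\n' "$f"
            fi
        done
}
```

Running `scan_crash_dir` with no argument scans /var/crash. An empty result means none of the saved text logs carry this particular signature.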
|
https://issues.redhat.com/browse/RHEL-69877 Opened this for now |
|
Can't see that RHEL issue, despite having a subscription there. | |
Thanks for picking this up, Neil. My instance doesn't run kdump, and is now back in production (and still stable on the previous kernel) - so can't provide further logs I'm afraid. Steve - me neither with a basic sub. Suspect it's a higher level subscription, possibly developer only. |
|
Just a short note to say that all kernel-related bug reports become private automatically. This cannot be changed by the submitter. | |
Also having this issue. Server rebooting randomly every few hours. High NFS traffic. New install, so the only kernel versions I have are: 5.14.0-503.15.1.el9_5.x86_64 5.14.0-503.14.1.el9_5.x86_64 Both have the same issue. I can send a kdump; just let me know where to send it. Thanks. |
|
@Dale
wget https://dl.rockylinux.org/vault/rocky/9.4/BaseOS/x86_64/os/Packages/k/kernel-5.14.0-427.42.1.el9_4.x86_64.rpm
wget https://dl.rockylinux.org/vault/rocky/9.4/BaseOS/x86_64/os/Packages/k/kernel-core-5.14.0-427.42.1.el9_4.x86_64.rpm
wget https://dl.rockylinux.org/vault/rocky/9.4/BaseOS/x86_64/os/Packages/k/kernel-modules-5.14.0-427.42.1.el9_4.x86_64.rpm
wget https://dl.rockylinux.org/vault/rocky/9.4/BaseOS/x86_64/os/Packages/k/kernel-modules-core-5.14.0-427.42.1.el9_4.x86_64.rpm
dnf install kernel-5.14.0-427.42.1.el9_4.x86_64.rpm kernel-core-5.14.0-427.42.1.el9_4.x86_64.rpm kernel-modules-5.14.0-427.42.1.el9_4.x86_64.rpm kernel-modules-core-5.14.0-427.42.1.el9_4.x86_64.rpm
Try that. I have done that on the servers that were crashing. Since then, no more problems. |
|
@Steve I've tried this already, but with "5.14.0-427.31.1.el9_4.x86_64", since that is the kernel I still have on several other working servers. However, it fails to boot and drops to dracut. I just tried "5.14.0-427.42.1.el9_4.x86_64.rpm" and hit the same issue. Our servers need the deprecated megaraid_sas driver. I ran the command below to check, and the initramfs has the driver:
lsinitrd /boot/initramfs-5.14.0-427.31.1.el9_4.x86_64.img | grep mega
The boot fails saying:
ata1: SATA link down (SStatus 4 SControl 300)
ata2: failed to resume link (SControl 0)
ata2: SATA link down (SStatus 4 SControl 0) |
|
For clarity, that looks to be the previous kernel, from before this bug was apparently introduced. If you're running an established machine (rather than a new one) and have the standard 3-package limit in dnf, then it will still be available in the boot menu and there is no need to download packages. When we hit problems, we just booted back into the previous kernel, which amounts to the same thing, and it has been totally stable. The issue is only with 503.15.1. No newer kernel or fix appears to be available yet. Our available ones:
Assumed working, as previously used:
-rwxr-xr-x. 1 root root 13609800 Oct 16 16:09 vmlinuz-5.14.0-427.40.1.el9_4.x86_64
Works and currently booted:
-rwxr-xr-x. 1 root root 13617992 Oct 31 14:13 vmlinuz-5.14.0-427.42.1.el9_4.x86_64
Latest, unstable, crashes in less than an hour:
-rwxr-xr-x. 1 root root 14461768 Nov 26 17:37 vmlinuz-5.14.0-503.15.1.el9_5.x86_64 |
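On a machine that still has an el9_4 kernel in /boot, picking the fallback can be scripted. This is a minimal sketch, assuming kernel images follow the vmlinuz-<release> naming shown in the listing above. The function name is hypothetical, and version ordering relies on GNU `sort -V`.

```shell
#!/usr/bin/env bash

# Prints the path of the newest el9_4 kernel image in a boot directory,
# or nothing if none is installed. (Hypothetical helper name.)
newest_el9_4_kernel() {
    local boot="${1:-/boot}"
    find "$boot" -maxdepth 1 -name 'vmlinuz-*.el9_4.*' 2>/dev/null |
        sort -V | tail -n 1
}
```

The result could then be passed to grubby to make it the default boot entry, for example `grubby --set-default "$(newest_el9_4_kernel)"`, after checking that the output is not empty. Choosing the entry by hand from the boot menu, as described above, achieves the same thing.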
|
@simon That's all correct. The issue for me is that I did a server cutover to a new machine. The old machine was on Rocky 9.5 but still running the el9_4 kernel, so I wasn't aware of the issue while that server was running. The replacement server OS was installed fresh with Rocky 9.5, so no el9_4 kernels were installed. I did install '5.14.0-427.42.1.el9_4.x86_64' manually with:
dnf install kernel-5.14.0-427.42.1.el9_4.x86_64.rpm kernel-core-5.14.0-427.42.1.el9_4.x86_64.rpm kernel-modules-5.14.0-427.42.1.el9_4.x86_64.rpm kernel-modules-core-5.14.0-427.42.1.el9_4.x86_64.rpm
I also tried copying the kernel, initramfs, and /lib/modules/5.14.0-427.42.1.el9_4.x86_64 from a working server, but I still get dropped to a dracut shell with no hard drive found, despite the megaraid_sas driver existing in the initramfs. |
|
I was able to get our server booted into 5.14.0-427.42.1.el9_4.x86_64. The module at '/lib/modules/5.14.0-427.42.1.el9_4.x86_64/extra/megaraid_sas/megaraid_sas.ko' was missing, and it had also already been removed from our other servers that still have the old 9_4 kernel on them. I was able to get the driver and copy it into:
/lib/modules/5.14.0-427.42.1.el9_4.x86_64/extra/megaraid_sas/megaraid_sas.ko
then rebuild the initramfs:
dracut -f /boot/initramfs-5.14.0-427.42.1.el9_4.x86_64.img 5.14.0-427.42.1.el9_4.x86_64
and this finally booted. Now
lsinitrd /boot/initramfs-5.14.0-427.42.1.el9_4.x86_64.img | grep mega
looks like this:
etc/depmod.d/kmod-megaraid_sas.conf
usr/lib/modules/5.14.0-427.42.1.el9_4.x86_64/kernel/drivers/scsi/megaraid
usr/lib/modules/5.14.0-427.42.1.el9_4.x86_64/kernel/drivers/scsi/megaraid/megaraid_sas.ko.xz
usr/lib/modules/5.14.0-427.42.1.el9_4.x86_64/weak-updates/megaraid_sas
usr/lib/modules/5.14.0-427.42.1.el9_4.x86_64/weak-updates/megaraid_sas/megaraid_sas.ko |
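The verification step above can be wrapped in a small check. This is a rough sketch with hypothetical helper names: it derives the initramfs path from a kernel release and tests whether an lsinitrd listing names a megaraid_sas module file. As the earlier comment shows, a listing that mentions the driver does not guarantee the machine boots, so treat it as a first check only.

```shell
#!/usr/bin/env bash

# Prints the initramfs path used in this thread for a given kernel release.
initramfs_for() {
    printf '/boot/initramfs-%s.img\n' "$1"
}

# Reads an lsinitrd listing on stdin; succeeds if some entry ends in
# megaraid_sas.ko or megaraid_sas.ko.xz (config files alone do not count).
listing_has_megaraid() {
    grep -Eq '(^|/)megaraid_sas\.ko(\.xz)?$'
}
```

Example use: `lsinitrd "$(initramfs_for 5.14.0-427.42.1.el9_4.x86_64)" | listing_has_megaraid && echo "module listed"`.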
|
Red Hat released kernel-5.14.0-503.19.1.el9_5 today. This fixes the nfsd bug. I suppose Rocky's kernel will soon become available. | |
Thanks for the update, Akemi, good to know the solution is on its way. | |
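To check a fleet against what was reported in this thread, a kernel release string can be classified with a short script. This is a sketch under stated assumptions: el9_4 builds (427 series) are treated as predating the problem, 503-series builds below 503.19.1 as affected, and anything at or above 503.19.1 as carrying the fix. The last point assumes later builds keep the fix, which this thread does not confirm. Function names are hypothetical, and the comparison relies on GNU `sort -V`.

```shell
#!/usr/bin/env bash

# Build named in this thread as containing the nfsd fix.
FIXED_BUILD="503.19.1"

# Prints the build part of a release, e.g. "503.15.1" from
# "5.14.0-503.15.1.el9_5.x86_64".
kernel_build() {
    local build="${1#*-}"            # drop the "5.14.0-" prefix
    printf '%s\n' "${build%%.el*}"   # drop the ".el9_5.x86_64" suffix
}

# Prints one of: unknown, pre-9.5, affected, fixed-or-newer.
classify_kernel() {
    local build major lowest
    case "$1" in
        *.el9*) ;;
        *) echo "unknown"; return ;;
    esac
    build="$(kernel_build "$1")"
    major="${build%%.*}"
    if [ "$major" -lt 503 ]; then
        echo "pre-9.5"
        return
    fi
    # The lower of the two versions sorts first under version ordering.
    lowest="$(printf '%s\n%s\n' "$build" "$FIXED_BUILD" | sort -V | head -n 1)"
    if [ "$lowest" = "$FIXED_BUILD" ]; then
        echo "fixed-or-newer"
    else
        echo "affected"
    fi
}

classify_kernel "$(uname -r)"
```

The final line classifies the running kernel. Any other release string, such as one taken from a /boot listing, can be passed the same way.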
Date Modified | Username | Field | Change |
---|---|---|---|
2024-12-03 11:18 | Steve Rast | New Issue | |
2024-12-03 14:51 | Simon Avery | Note Added: 0009011 | |
2024-12-03 14:55 | Neil Hanlon | Note Added: 0009012 | |
2024-12-03 15:20 | Neil Hanlon | Note Added: 0009013 | |
2024-12-03 15:28 | Steve Rast | Note Added: 0009014 | |
2024-12-03 15:39 | Neil Hanlon | Note Added: 0009015 | |
2024-12-04 09:04 | Steve Rast | Note Added: 0009016 | |
2024-12-04 10:32 | Simon Avery | Note Added: 0009018 | |
2024-12-04 17:48 | Akemi Yagi | Note Added: 0009019 | |
2024-12-10 16:59 | Dale Showers | Note Added: 0009046 | |
2024-12-10 17:10 | Steve Rast | Note Added: 0009047 | |
2024-12-10 18:19 | Dale Showers | Note Added: 0009048 | |
2024-12-11 07:53 | Simon Avery | Note Added: 0009076 | |
2024-12-11 15:45 | Dale Showers | Note Added: 0009077 | |
2024-12-11 16:37 | Dale Showers | Note Added: 0009078 | |
2024-12-19 01:26 | Akemi Yagi | Note Added: 0009142 | |
2024-12-19 07:32 | Simon Avery | Note Added: 0009143 |