View Issue Details

ID: 0008451
Project: Rocky-Linux-9
Category: kernel
View Status: public
Last Update: 2024-12-19 07:32
Reporter: Steve Rast
Assigned To:
Priority: normal
Severity: crash
Reproducibility: random
Status: new
Resolution: open
Platform: x86_64
OS: Rocky Linux
OS Version: 9.5
Summary: 0008451: nfs-server hard reboot server
Description: Since upgrading from Rocky Linux 9.4 to Rocky Linux 9.5, several NFS servers have been randomly hard rebooting. I can see this:

[251118.198708] perf: interrupt took too long (2524 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[256533.825978] perf: interrupt took too long (3166 > 3155), lowering kernel.perf_event_max_sample_rate to 63000
[279965.977293] perf: interrupt took too long (3965 > 3957), lowering kernel.perf_event_max_sample_rate to 50000
[326621.176722] ------------[ cut here ]------------
[326621.176728] WARNING: CPU: 18 PID: 3270 at mm/slab_common.c:957 free_large_kmalloc+0x5a/0x80
[326621.176739] Modules linked in: tls binfmt_misc dm_service_time iscsi_tcp libiscsi_tcp libiscsi rpcrdma rdma_cm iw_cm ib_cm ib_core scsi_transport_iscsi nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink vfat fat dm_multipath intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif dm_mod kvm dell_wmi_descriptor sparse_keymap rfkill video iTCO_wdt rapl intel_cstate mxm_wmi mei_me dcdbas mei intel_uncore iTCO_vendor_support ipmi_si joydev acpi_power_meter ipmi_devintf ipmi_msghandler pcspkr lpc_ich nfsd nfs_acl lockd auth_rpcgss grace sunrpc xfs libcrc32c sr_mod sd_mod cdrom t10_pi sg mgag200 uas usb_storage drm_kms_helper ahci libahci drm_shmem_helper crct10dif_pclmul crc32_pclmul drm ixgbe crc32c_intel libata
[326621.176795] igb ghash_clmulni_intel megaraid_sas i2c_algo_bit mdio dca wmi fuse
[326621.176801] CPU: 18 PID: 3270 Comm: nfsd Kdump: loaded Not tainted 5.14.0-503.14.1.el9_5.x86_64 #1
[326621.176804] Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.19.0 12/12/2023
[326621.176806] RIP: 0010:free_large_kmalloc+0x5a/0x80
[326621.176811] Code: da 9c 5b fa be 06 00 00 00 48 89 ef e8 af 25 0a 00 80 e7 02 74 01 fb 48 83 c4 08 44 89 e6 48 89 ef 5b 5d 41 5c e9 d6 28 04 00 <0f> 0b 45 31 e4 80 3d 43 0e fc 01 00 ba 00 f0 ff ff 0f 84 fb 9a 90
[326621.176813] RSP: 0018:ffffb6f9092bb968 EFLAGS: 00010246
[326621.176815] RAX: 0017ffffc0001000 RBX: ffffffff8c31e2e0 RCX: ffff94a6c40f9220
[326621.176816] RDX: fffff42b8cb69608 RSI: ffffffff8b058378 RDI: fffff42b8cb69600
[326621.176818] RBP: fffff42b8cb69600 R08: ffffffff8ca06440 R09: ffff94a9afc744b0
[326621.176819] R10: 00000000000003c8 R11: 0000000000000000 R12: ffffffff8b058378
[326621.176820] R13: 0000000000000000 R14: ffff94a68942ae00 R15: ffff94a9e567c000
[326621.176822] FS: 0000000000000000(0000) GS:ffff94a9afc40000(0000) knlGS:0000000000000000
[326621.176824] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[326621.176825] CR2: 00005626630c5140 CR3: 000000032f410001 CR4: 00000000003706f0
[326621.176827] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[326621.176828] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[326621.176829] Call Trace:
[326621.176831] <TASK>
[326621.176833] ? show_trace_log_lvl+0x1c4/0x2df
[326621.176839] ? show_trace_log_lvl+0x1c4/0x2df
[326621.176842] ? security_release_secctx+0x28/0x40
[326621.176846] ? free_large_kmalloc+0x5a/0x80
[326621.176849] ? __warn+0x7e/0xd0
[326621.176852] ? free_large_kmalloc+0x5a/0x80
[326621.176855] ? report_bug+0x100/0x140
[326621.176859] ? handle_bug+0x3c/0x70
[326621.176862] ? exc_invalid_op+0x14/0x70
[326621.176864] ? asm_exc_invalid_op+0x16/0x20
[326621.176868] ? lookup_dcache+0x18/0x60
[326621.176872] ? lookup_dcache+0x18/0x60
[326621.176875] ? free_large_kmalloc+0x5a/0x80
[326621.176878] ? lookup_dcache+0x18/0x60
[326621.176880] security_release_secctx+0x28/0x40
[326621.176883] nfsd4_encode_fattr4+0x2cc/0x4f0 [nfsd]
[326621.176955] ? avc_has_perm_noaudit+0x94/0x110
[326621.176959] ? selinux_inode_permission+0x10e/0x1d0
[326621.176964] ? __d_lookup+0x73/0xb0
[326621.176967] ? d_lookup+0x29/0x50
[326621.176969] ? lookup_dcache+0x18/0x60
[326621.176972] nfsd4_encode_entry4_fattr+0xcd/0x1e0 [nfsd]
[326621.177019] nfsd4_encode_entry4+0x100/0x290 [nfsd]
[326621.177072] nfsd_buffered_readdir+0x144/0x250 [nfsd]
[326621.177114] ? __pfx_nfsd4_encode_entry4+0x10/0x10 [nfsd]
[326621.177170] ? __pfx_nfsd_buffered_filldir+0x10/0x10 [nfsd]
[326621.177211] ? __pfx_nfsd4_encode_entry4+0x10/0x10 [nfsd]
[326621.177255] nfsd_readdir+0xa9/0xe0 [nfsd]
[326621.177296] nfsd4_encode_readdir+0xf8/0x1d0 [nfsd]
[326621.177341] nfsd4_encode_operation+0xa6/0x2b0 [nfsd]
[326621.177386] nfsd4_proc_compound+0x1d0/0x700 [nfsd]
[326621.177446] nfsd_dispatch+0xe9/0x220 [nfsd]
[326621.177487] svc_process_common+0x2e7/0x650 [sunrpc]
[326621.177583] ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
[326621.177623] svc_process+0x12d/0x170 [sunrpc]
[326621.177691] svc_handle_xprt+0x448/0x580 [sunrpc]
[326621.177750] svc_recv+0x17a/0x2c0 [sunrpc]
[326621.177819] ? __pfx_nfsd+0x10/0x10 [nfsd]
[326621.177858] nfsd+0x84/0xb0 [nfsd]
[326621.177896] kthread+0xe0/0x100
[326621.177900] ? __pfx_kthread+0x10/0x10
[326621.177904] ret_from_fork+0x2c/0x50
[326621.177919] </TASK>
[326621.177920] ---[ end trace 0000000000000000 ]---
[326621.177922] object pointer: 0x00000000e53caba2
[326621.179321] BUG: unable to handle page fault for address: ffff94a86da58000
[326621.179324] #PF: supervisor write access in kernel mode
[326621.179327] #PF: error_code(0x0003) - permissions violation
[326621.179330] PGD 330801067 P4D 330801067 PUD 100207063 PMD 800000032da000a1
[326621.179337] Oops: 0003 [#1] PREEMPT SMP PTI
[326621.179341] CPU: 18 PID: 3270 Comm: nfsd Kdump: loaded Tainted: G W ------- --- 5.14.0-503.14.1.el9_5.x86_64 #1
[326621.179345] Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.19.0 12/12/2023
[326621.179347] RIP: 0010:svc_process_common+0xe7/0x650 [sunrpc]
[326621.179466] Code: 00 00 48 c7 87 80 02 00 00 00 00 00 00 48 29 d0 48 c1 f8 03 c1 e0 0c 89 87 cc 02 00 00 4c 89 e7 e8 ce a9 00 00 48 85 c0 74 02 <89> 18 be 04 00 00 00 4c 89 e7 e8 ba a9 00 00 48 85 c0 74 06 c7 00
[326621.179468] RSP: 0018:ffffb6f9092bbe28 EFLAGS: 00010286
[326621.179470] RAX: ffff94a86da58000 RBX: 000000000bc19c07 RCX: ffff94a86da58000
[326621.179471] RDX: ffff94a9e567c2e8 RSI: 0000000000000004 RDI: ffff94a9e567c238
[326621.179472] RBP: ffff94a9e567c000 R08: ffff94a9e567c1a0 R09: 0000000000000000
[326621.179473] R10: 0000000000000006 R11: 0000000000001000 R12: ffff94a9e567c238
[326621.179474] R13: ffff94a9c23f8f00 R14: ffff94a9c23f8784 R15: ffff94a9e567c000
[326621.179475] FS: 0000000000000000(0000) GS:ffff94a9afc40000(0000) knlGS:0000000000000000
[326621.179477] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[326621.179478] CR2: ffff94a86da58000 CR3: 000000032f410001 CR4: 00000000003706f0
[326621.179479] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[326621.179480] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[326621.179481] Call Trace:
[326621.179483] <TASK>
[326621.179484] ? show_trace_log_lvl+0x1c4/0x2df
[326621.179488] ? show_trace_log_lvl+0x1c4/0x2df
[326621.179492] ? svc_process+0x12d/0x170 [sunrpc]
[326621.179547] ? __die_body.cold+0x8/0xd
[326621.179551] ? page_fault_oops+0x134/0x170
[326621.179554] ? kernelmode_fixup_or_oops+0x84/0x110
[326621.179557] ? exc_page_fault+0xa8/0x150
[326621.179561] ? asm_exc_page_fault+0x22/0x30
[326621.179565] ? svc_process_common+0xe7/0x650 [sunrpc]
[326621.179621] ? svc_process_common+0xe2/0x650 [sunrpc]
[326621.179678] svc_process+0x12d/0x170 [sunrpc]
[326621.179736] svc_handle_xprt+0x448/0x580 [sunrpc]
[326621.179796] svc_recv+0x17a/0x2c0 [sunrpc]
[326621.179856] ? __pfx_nfsd+0x10/0x10 [nfsd]
[326621.179896] nfsd+0x84/0xb0 [nfsd]
[326621.179936] kthread+0xe0/0x100
[326621.179940] ? __pfx_kthread+0x10/0x10
[326621.179943] ret_from_fork+0x2c/0x50
[326621.179947] </TASK>
[326621.179948] Modules linked in: tls binfmt_misc dm_service_time iscsi_tcp libiscsi_tcp libiscsi rpcrdma rdma_cm iw_cm ib_cm ib_core scsi_transport_iscsi nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink vfat fat dm_multipath intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif dm_mod kvm dell_wmi_descriptor sparse_keymap rfkill video iTCO_wdt rapl intel_cstate mxm_wmi mei_me dcdbas mei intel_uncore iTCO_vendor_support ipmi_si joydev acpi_power_meter ipmi_devintf ipmi_msghandler pcspkr lpc_ich nfsd nfs_acl lockd auth_rpcgss grace sunrpc xfs libcrc32c sr_mod sd_mod cdrom t10_pi sg mgag200 uas usb_storage drm_kms_helper ahci libahci drm_shmem_helper crct10dif_pclmul crc32_pclmul drm ixgbe crc32c_intel libata
[326621.179989] igb ghash_clmulni_intel megaraid_sas i2c_algo_bit mdio dca wmi fuse
[326621.179994] CR2: ffff94a86da58000

Also saw this sometimes:

kernel:watchdog: BUG: soft lockup - CPU#2 stuck for 160s! [nfsd:5657]

It also happens with the latest kernel: 5.14.0-503.15.1.el9_5.x86_64

I use NFSv4 with SSSD integrated with Active Directory. The clients are mounting via autofs. Everything was working fine until the upgrade to Rocky Linux 9.5. Since then some servers have been rebooting several times per day. It doesn't matter whether it is UEFI or BIOS boot, or whether it is a VM or a physical server.

I have the impression that the more NFS traffic is generated, the faster the servers crash.
Tags: No tags attached.

Activities

Simon Avery

2024-12-03 14:51

reporter   ~0009011

We also have this issue, or one very much like it.

We have a busy fileserver that was upgraded from Rocky 9.4 to 9.5 this morning at 0540.

It came back up, but at 0610 it stopped responding. The console was unresponsive, so we hard rebooted it via the VM controls.

Within 30 minutes of that, it had stopped again.

We hard rebooted again, and this time chose the previous kernel. Since then (5h+) the VM has been 100% stable, as it was before 9.5.

It serves files through both NFSv4 and SMB, using SSSD and AD for authentication.

I also get the impression that it's related to NFS traffic.

Our logs follow



Dec 3 05:53:41 redacted_hostname: [ 769.252982] ------------[ cut here ]------------
Dec 3 05:53:41 redacted_hostname: [ 769.252987] WARNING: CPU: 0 PID: 1115 at mm/slab_common.c:957 free_large_kmalloc+0x5a/0x80
Dec 3 05:53:41 redacted_hostname: [ 769.253003] Modules linked in: binfmt_misc rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache netfs rpcrdma rdma_cm iw_cm ib_cm ib_core rfkill nft_reject_ipv4 nf_reject_ipv4 nft_reject nft_counter nft_ct nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 vsock_loopback vmw_vsock_virtio_transport_common nf_tables vmw_vsock_vmci_transport vsock nfnetlink vmwgfx intel_rapl_msr vmw_balloon intel_rapl_common drm_ttm_helper ttm vmw_vmci pcspkr drm_kms_helper i2c_piix4 joydev nfsd nfs_acl lockd auth_rpcgss grace sunrpc drm xfs libcrc32c crct10dif_pclmul sd_mod crc32_pclmul ata_generic t10_pi crc32c_intel sg ghash_clmulni_intel ata_piix libata vmxnet3 vmw_pvscsi serio_raw dm_mirror dm_region_hash dm_log dm_mod fuse
Dec 3 05:53:41 redacted_hostname: [ 769.253054] CPU: 0 PID: 1115 Comm: nfsd Not tainted 5.14.0-503.15.1.el9_5.x86_64 #1
Dec 3 05:53:41 redacted_hostname: [ 769.253056] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
Dec 3 05:53:41 redacted_hostname: [ 769.253057] RIP: 0010:free_large_kmalloc+0x5a/0x80
Dec 3 05:53:41 redacted_hostname: [ 769.253060] Code: da 9c 5b fa be 06 00 00 00 48 89 ef e8 af 25 0a 00 80 e7 02 74 01 fb 48 83 c4 08 44 89 e6 48 89 ef 5b 5d 41 5c e9 d6 28 04 00 <0f> 0b 45 31 e4 80 3d d3 0d fc 01 00 ba 00 f0 ff ff 0f 84 8b 9a 90
Dec 3 05:53:41 redacted_hostname: [ 769.253062] RSP: 0018:ffffafef00d57b28 EFLAGS: 00010246
Dec 3 05:53:41 redacted_hostname: [ 769.253064] RAX: 0017ffffc0000000 RBX: ffffffff8cd1e2e0 RCX: ffff9fce52f7ac68
Dec 3 05:53:41 redacted_hostname: [ 769.253065] RDX: dead000000000100 RSI: ffffffffc096b47c RDI: ffffd9fd06c25ac0
Dec 3 05:53:41 redacted_hostname: [ 769.253065] RBP: ffffd9fd06c25ac0 R08: ffffffff8d4075e0 R09: ffff9fcf35e344b0
Dec 3 05:53:41 redacted_hostname: [ 769.253066] R10: 000000b30f218140 R11: 00000000004c82ea R12: ffffffffc096b47c
Dec 3 05:53:41 redacted_hostname: [ 769.253067] R13: 0000000000000000 R14: ffff9fce0a0c4a00 R15: ffff9fce01148000
Dec 3 05:53:41 redacted_hostname: [ 769.253068] FS: 0000000000000000(0000) GS:ffff9fcf35e00000(0000) knlGS:0000000000000000
Dec 3 05:53:41 redacted_hostname: [ 769.253070] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 3 05:53:41 redacted_hostname: [ 769.253071] CR2: 00007efd2f51a654 CR3: 000000010902a002 CR4: 00000000007706f0
Dec 3 05:53:41 redacted_hostname: [ 769.253085] PKRU: 55555554
Dec 3 05:53:41 redacted_hostname: [ 769.253086] Call Trace:
Dec 3 05:53:41 redacted_hostname: [ 769.253087] <TASK>
Dec 3 05:53:41 redacted_hostname: [ 769.253088] ? srso_alias_return_thunk+0x5/0xfbef5
Dec 3 05:53:41 redacted_hostname: [ 769.253094] ? show_trace_log_lvl+0x26e/0x2df
Dec 3 05:53:41 redacted_hostname: [ 769.253101] ? show_trace_log_lvl+0x26e/0x2df
Dec 3 05:53:41 redacted_hostname: [ 769.253106] ? security_release_secctx+0x28/0x40
Dec 3 05:53:41 redacted_hostname: [ 769.253110] ? free_large_kmalloc+0x5a/0x80
Dec 3 05:53:41 redacted_hostname: [ 769.253113] ? __warn+0x7e/0xd0
Dec 3 05:53:41 redacted_hostname: [ 769.253116] ? free_large_kmalloc+0x5a/0x80
Dec 3 05:53:41 redacted_hostname: [ 769.253119] ? report_bug+0x100/0x140
Dec 3 05:53:41 redacted_hostname: [ 769.253124] ? handle_bug+0x3c/0x70
Dec 3 05:53:41 redacted_hostname: [ 769.253127] ? exc_invalid_op+0x14/0x70
Dec 3 05:53:41 redacted_hostname: [ 769.253130] ? asm_exc_invalid_op+0x16/0x20
Dec 3 05:53:41 redacted_hostname: [ 769.253134] ? _fh_update.part.0.isra.0+0x4c/0x90 [nfsd]
Dec 3 05:53:41 redacted_hostname: [ 769.253162] ? _fh_update.part.0.isra.0+0x4c/0x90 [nfsd]
Dec 3 05:53:41 redacted_hostname: [ 769.253184] ? free_large_kmalloc+0x5a/0x80
Dec 3 05:53:41 redacted_hostname: [ 769.253188] ? _fh_update.part.0.isra.0+0x4c/0x90 [nfsd]
Dec 3 05:53:41 redacted_hostname: [ 769.253206] security_release_secctx+0x28/0x40
Dec 3 05:53:41 redacted_hostname: [ 769.253209] nfsd4_encode_fattr4+0x2cc/0x4f0 [nfsd]
Dec 3 05:53:41 redacted_hostname: [ 769.253237] ? srso_alias_return_thunk+0x5/0xfbef5
Dec 3 05:53:41 redacted_hostname: [ 769.253239] ? __kmem_cache_alloc_node+0x18f/0x2e0
Dec 3 05:53:41 redacted_hostname: [ 769.253242] ? security_prepare_creds+0x71/0xa0
Dec 3 05:53:41 redacted_hostname: [ 769.253245] ? security_prepare_creds+0x71/0xa0
Dec 3 05:53:41 redacted_hostname: [ 769.253246] ? srso_alias_return_thunk+0x5/0xfbef5
Dec 3 05:53:41 redacted_hostname: [ 769.253248] ? __kmalloc+0x4b/0x140
Dec 3 05:53:41 redacted_hostname: [ 769.253250] ? srso_alias_return_thunk+0x5/0xfbef5
Dec 3 05:53:41 redacted_hostname: [ 769.253251] ? srso_alias_return_thunk+0x5/0xfbef5
Dec 3 05:53:41 redacted_hostname: [ 769.253253] ? security_prepare_creds+0x47/0xa0
Dec 3 05:53:41 redacted_hostname: [ 769.253255] ? srso_alias_return_thunk+0x5/0xfbef5
Dec 3 05:53:41 redacted_hostname: [ 769.253256] ? prepare_creds+0x180/0x270
Dec 3 05:53:41 redacted_hostname: [ 769.253259] ? srso_alias_return_thunk+0x5/0xfbef5
Dec 3 05:53:41 redacted_hostname: [ 769.253261] ? nfsd_setuser+0x110/0x270 [nfsd]
Dec 3 05:53:41 redacted_hostname: [ 769.253286] ? srso_alias_return_thunk+0x5/0xfbef5
Dec 3 05:53:41 redacted_hostname: [ 769.253288] ? nfsd_setuser_and_check_port+0x4a/0xc0 [nfsd]
Dec 3 05:53:41 redacted_hostname: [ 769.253306] ? _fh_update.part.0.isra.0+0x4c/0x90 [nfsd]
Dec 3 05:53:41 redacted_hostname: [ 769.253323] nfsd4_encode_getattr+0x2b/0x40 [nfsd]
Dec 3 05:53:41 redacted_hostname: [ 769.253341] nfsd4_encode_operation+0xa6/0x2b0 [nfsd]
Dec 3 05:53:41 redacted_hostname: [ 769.253361] nfsd4_proc_compound+0x1d0/0x700 [nfsd]
Dec 3 05:53:41 redacted_hostname: [ 769.253384] nfsd_dispatch+0xe9/0x220 [nfsd]
Dec 3 05:53:41 redacted_hostname: [ 769.253404] svc_process_common+0x2e7/0x650 [sunrpc]
Dec 3 05:53:41 redacted_hostname: [ 769.253435] ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
Dec 3 05:53:41 redacted_hostname: [ 769.253463] svc_process+0x12d/0x170 [sunrpc]
Dec 3 05:53:41 redacted_hostname: [ 769.253491] svc_handle_xprt+0x448/0x580 [sunrpc]
Dec 3 05:53:41 redacted_hostname: [ 769.253523] svc_recv+0x17a/0x2c0 [sunrpc]
Dec 3 05:53:41 redacted_hostname: [ 769.253552] ? __pfx_nfsd+0x10/0x10 [nfsd]
Dec 3 05:53:41 redacted_hostname: [ 769.253577] nfsd+0x84/0xb0 [nfsd]
Dec 3 05:53:41 redacted_hostname: [ 769.253600] kthread+0xe0/0x100
Dec 3 05:53:41 redacted_hostname: [ 769.253604] ? __pfx_kthread+0x10/0x10
Dec 3 05:53:41 redacted_hostname: [ 769.253608] ret_from_fork+0x2c/0x50
Dec 3 05:53:41 redacted_hostname: [ 769.253614] </TASK>
Dec 3 05:53:41 redacted_hostname: [ 769.253614] ---[ end trace 0000000000000000 ]---
Dec 3 05:53:41 redacted_hostname: [ 769.253616] object pointer: 0x000000008b742c83
Neil Hanlon

2024-12-03 14:55

administrator   ~0009012

https://nvd.nist.gov/vuln/detail/CVE-2024-46697

feels related
Neil Hanlon

2024-12-03 15:20

administrator   ~0009013

Would either of you have access or the ability to generate a kdump? I'd like to poke around and see if my assumptions here are correct, but I believe this crash is being triggered by the lack of the fix for CVE-2024-46697 (https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=f58bab6fd4063913bd8321e99874b8239e9ba726) in the 9.5 kernel.

i.e., the condition described there occurs, args.context is left set to random junk, and when the kernel then attempts to free it, it sometimes crashes. The effect would probably be more pronounced on busy servers.
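
A rough way to check whether a given installed kernel build already carries that fix is to grep the package changelog for the CVE ID. This is only a sketch; it assumes the packaged changelog mentions the CVE, which RHEL-rebuild kernels normally do for security fixes:

  uname -r
  rpm -q --changelog "kernel-core-$(uname -r)" | grep -i CVE-2024-46697
  # no output means the fix is not mentioned in this build's changelog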
Steve Rast

2024-12-03 15:28

reporter   ~0009014

In /var/crash I have a lot of kexec-dmesg.log, vmcore and vmcore-dmesg.txt files.

Is that what you would need?
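
Those are the kdump artifacts the crash utility works with. A rough sketch of opening one, assuming kernel-debuginfo matching the crashed kernel (for example the 5.14.0-503.14.1.el9_5.x86_64 build from the trace in the description) is available from the debug repos, and using a placeholder crash directory name:

  dnf install crash
  dnf debuginfo-install kernel-core-5.14.0-503.14.1.el9_5.x86_64   # needs dnf-plugins-core and the debuginfo repos enabled
  crash /usr/lib/debug/lib/modules/5.14.0-503.14.1.el9_5.x86_64/vmlinux /var/crash/<crash-dir>/vmcore
  # inside crash: 'log' prints the dmesg buffer, 'bt' the backtrace of the panicking task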
Neil Hanlon

2024-12-03 15:39

administrator   ~0009015

https://issues.redhat.com/browse/RHEL-69877

Opened this for now
Steve Rast

2024-12-04 09:04

reporter   ~0009016

I can't see that RHEL issue, despite having a subscription there.
Simon Avery

2024-12-04 10:32

reporter   ~0009018

Thanks for picking this up, Neil.

My instance doesn't run kdump and is now back in production (still stable on the previous kernel), so I can't provide further logs, I'm afraid.

Steve - me neither, with a basic sub. I suspect it needs a higher-level subscription, possibly developer only.
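
For reference, if anyone else wants to capture a vmcore for this, a minimal kdump setup on Rocky 9 is roughly the following. This is a sketch assuming local dumping to /var/crash and a reasonably recent kexec-tools (the crashkernel reservation only takes effect after a reboot):

  dnf install kexec-tools
  kdumpctl reset-crashkernel        # sets the recommended crashkernel= boot parameter
  systemctl enable --now kdump
  # reboot once so the crashkernel memory reservation is applied, then verify:
  kdumpctl status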
Akemi Yagi

2024-12-04 17:48

reporter   ~0009019

Just a short note to say that all kernel-related bug reports become private automatically. This cannot be changed by the submitter.
Dale Showers

2024-12-10 16:59

reporter   ~0009046

Also having this issue. Server rebooting randomly every few hours. High nfs traffic.
New install so the only kernel versions I have are:
  5.14.0-503.15.1.el9_5.x86_64
  5.14.0-503.14.1.el9_5.x86_64

Both have the same issue. I can send a kdump; just let me know where to send it.

Thanks.
Steve Rast

2024-12-10 17:10

reporter   ~0009047

@Dale

wget https://dl.rockylinux.org/vault/rocky/9.4/BaseOS/x86_64/os/Packages/k/kernel-5.14.0-427.42.1.el9_4.x86_64.rpm
wget https://dl.rockylinux.org/vault/rocky/9.4/BaseOS/x86_64/os/Packages/k/kernel-core-5.14.0-427.42.1.el9_4.x86_64.rpm
wget https://dl.rockylinux.org/vault/rocky/9.4/BaseOS/x86_64/os/Packages/k/kernel-modules-5.14.0-427.42.1.el9_4.x86_64.rpm
wget https://dl.rockylinux.org/vault/rocky/9.4/BaseOS/x86_64/os/Packages/k/kernel-modules-core-5.14.0-427.42.1.el9_4.x86_64.rpm
dnf install kernel-5.14.0-427.42.1.el9_4.x86_64.rpm kernel-core-5.14.0-427.42.1.el9_4.x86_64.rpm kernel-modules-5.14.0-427.42.1.el9_4.x86_64.rpm kernel-modules-core-5.14.0-427.42.1.el9_4.x86_64.rpm

Try that. I have done that on the servers that were crashing. Since then, no more problems.
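
If the 9.5 kernel remains the default boot entry after installing those packages, the downgraded one can be pinned with grubby. A sketch, using the 427.42.1 version from the commands above; adjust the path to whatever was actually installed:

  grubby --default-kernel                                            # show the current default entry
  grubby --set-default=/boot/vmlinuz-5.14.0-427.42.1.el9_4.x86_64    # pin the el9_4 kernel as default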
Dale Showers

2024-12-10 18:19

reporter   ~0009048

@Steve

I've tried this already, but with "5.14.0-427.31.1.el9_4.x86_64", since that is the kernel I still have on several other working servers.
However, it fails to boot and drops to a dracut shell.
I just tried "5.14.0-427.42.1.el9_4.x86_64" and hit the same issue.

Our servers need the deprecated megaraid_sas driver.
I ran the command below to check, and the initramfs does have the driver:
  lsinitrd /boot/initramfs-5.14.0-427.31.1.el9_4.x86_64.img | grep mega

Fails saying:
ata1: SATA link down (SStatus 4 SControl 300)
ata2: failed to resume link (SControl 0)
ata2: SATA link down (SStatus 4 SControl 0)
Simon Avery

2024-12-11 07:53

reporter   ~0009076

For clarity, that looks to be the previous kernel, before this bug was apparently introduced.

If you're running an established machine (rather than a new one) and have the standard 3-package limit in dnf, then the previous kernel will still be available in the boot menu, so there is no need to download packages (a quick way to check is sketched after the listing below).

When we hit problems, we just booted back into the previous kernel, which amounts to the same thing, and it has been totally stable.

The issue is only with 503.15.1. No newer kernel or fix appears to be available yet.

Our available ones:

Assumed working as previously used:
-rwxr-xr-x. 1 root root 13609800 Oct 16 16:09 vmlinuz-5.14.0-427.40.1.el9_4.x86_64

Works and currently booted.
-rwxr-xr-x. 1 root root 13617992 Oct 31 14:13 vmlinuz-5.14.0-427.42.1.el9_4.x86_64

Latest, unstable and crashes within less than an hour.
-rwxr-xr-x. 1 root root 14461768 Nov 26 17:37 vmlinuz-5.14.0-503.15.1.el9_5.x86_64
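
A quick way to see which kernels are still installed, what the keep-limit is, and which entries are actually in the boot menu (a sketch assuming stock dnf configuration):

  rpm -q kernel-core                              # kernel builds still installed
  grep -i installonly_limit /etc/dnf/dnf.conf     # the "3-package limit" mentioned above
  grubby --info=ALL | grep ^kernel                # entries present in the boot menu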
Dale Showers

2024-12-11 15:45

reporter   ~0009077

@simon

That's all correct.
The issue for me is that I did a server cutover to a new machine.
The old machine was on Rocky 9.5 but still running the el9_4 kernel, so I wasn't aware of the issue while that server was running.
The replacement server's OS was installed fresh with Rocky 9.5, and so it had no el9_4 kernels installed.

I did install '5.14.0-427.42.1.el9_4.x86_64' manually with:
 
dnf install kernel-5.14.0-427.42.1.el9_4.x86_64.rpm kernel-core-5.14.0-427.42.1.el9_4.x86_64.rpm kernel-modules-5.14.0-427.42.1.el9_4.x86_64.rpm kernel-modules-core-5.14.0-427.42.1.el9_4.x86_64.rpm

I also tried copying the kernel, initramfs, and /lib/modules/5.14.0-427.42.1.el9_4.x86_64 from a working server, but I still got dropped to a dracut shell with no hard drive found, despite the megaraid_sas driver existing in the initramfs.
Dale Showers

2024-12-11 16:37

reporter   ~0009078

I was able to get our server booted into 5.14.0-427.42.1.el9_4.x86_64.

The module at '/lib/modules/5.14.0-427.42.1.el9_4.x86_64/extra/megaraid_sas/megaraid_sas.ko' was missing, and it had also already been removed from our other servers that still have the old el9_4 kernel on them.

I was able to get the driver and copy it into:
  /lib/modules/5.14.0-427.42.1.el9_4.x86_64/extra/megaraid_sas/megaraid_sas.ko
then rebuild the initramfs:
 dracut -f /boot/initramfs-5.14.0-427.42.1.el9_4.x86_64.img 5.14.0-427.42.1.el9_4.x86_64
and this finally booted.

Now lsinitrd /boot/initramfs-5.14.0-427.42.1.el9_4.x86_64.img | grep mega looks like this:
  etc/depmod.d/kmod-megaraid_sas.conf
  usr/lib/modules/5.14.0-427.42.1.el9_4.x86_64/kernel/drivers/scsi/megaraid
  usr/lib/modules/5.14.0-427.42.1.el9_4.x86_64/kernel/drivers/scsi/megaraid/megaraid_sas.ko.xz
  usr/lib/modules/5.14.0-427.42.1.el9_4.x86_64/weak-updates/megaraid_sas
  usr/lib/modules/5.14.0-427.42.1.el9_4.x86_64/weak-updates/megaraid_sas/megaraid_sas.ko
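
For anyone repeating this: after dropping a .ko into /lib/modules/<kver>/extra, it is worth refreshing the module dependency data and confirming the target kernel can resolve the driver before rebuilding the initramfs. A sketch, using the kernel version from above:

  depmod -a 5.14.0-427.42.1.el9_4.x86_64                  # regenerate modules.dep/aliases for that kernel
  modinfo -k 5.14.0-427.42.1.el9_4.x86_64 megaraid_sas    # should print the module's path and version
  # then rebuild the initramfs with the dracut -f command above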
Akemi Yagi

2024-12-19 01:26

reporter   ~0009142

Red Hat released kernel-5.14.0-503.19.1.el9_5 today. This fixes the nfsd bug. I suppose Rocky's kernel will soon become available.
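
Once the rebuilt package lands in the Rocky repos, checking for it and updating is roughly (a sketch assuming the standard BaseOS repo):

  dnf --refresh list --showduplicates kernel | grep 503.19.1   # is the fixed build published yet?
  dnf update 'kernel*'                                         # then update and reboot into the new kernel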
Simon Avery

2024-12-19 07:32

reporter   ~0009143

Thanks for the update, Akemi, good to know the solution is on its way.

Issue History

Date Modified Username Field Change
2024-12-03 11:18 Steve Rast New Issue
2024-12-03 14:51 Simon Avery Note Added: 0009011
2024-12-03 14:55 Neil Hanlon Note Added: 0009012
2024-12-03 15:20 Neil Hanlon Note Added: 0009013
2024-12-03 15:28 Steve Rast Note Added: 0009014
2024-12-03 15:39 Neil Hanlon Note Added: 0009015
2024-12-04 09:04 Steve Rast Note Added: 0009016
2024-12-04 10:32 Simon Avery Note Added: 0009018
2024-12-04 17:48 Akemi Yagi Note Added: 0009019
2024-12-10 16:59 Dale Showers Note Added: 0009046
2024-12-10 17:10 Steve Rast Note Added: 0009047
2024-12-10 18:19 Dale Showers Note Added: 0009048
2024-12-11 07:53 Simon Avery Note Added: 0009076
2024-12-11 15:45 Dale Showers Note Added: 0009077
2024-12-11 16:37 Dale Showers Note Added: 0009078
2024-12-19 01:26 Akemi Yagi Note Added: 0009142
2024-12-19 07:32 Simon Avery Note Added: 0009143