View Issue Details

ID: 0000176
Project: Rocky-Linux-8
Category: kernel
View Status: public
Last Update: 2022-08-25 19:20
Reporter: Neil neil
Assigned To: Louis Abel
Priority: high
Severity: block
Reproducibility: sometimes
Status: closed
Resolution: no change required
Platform: x86-64 inter
OS: Rocky Linux release 8.6 (Green O
OS Version: Rocky Linux rele
Summary: 0000176: One physical server in the cluster restarts every day, and there is no related log
Description: A newly purchased batch of servers runs Ceph storage and OpenStack compute services.

One physical machine in the cluster restarts every day. This may look like a hardware problem, but when we contacted hardware support, they told us the restart was caused by a Warm Reset instruction.

There is nothing under /var/crash/, and kdump did not produce any useful logs.
We hope the community can help analyze the problem and suggest a direction for troubleshooting. Thank you very much!
Steps To Reproduce:

Before rebooting:

2022-08-03T19:17:25.529508+08:00 xxxhost ceph-osd[5068]: 2022-08-03T19:17:25.528+0800 7f23e3c1c700 -1 --2- [v2:10.1.1.1:6825/5068,v1:10.1.1.1:6831/5068] >> [v2:10.1.1.3:6817/2056885,v1:10.1.1.3:6861/2056885] conn(0x55eec5782000 0x55ee74bda500 unknown :-1 s=BANNER_CONNECTING pgs=30571 cs=11571 l=0 rev1=1 rx=0 tx=0)._handle_peer_banner peer [v2:10.1.1.3:6817/2056885,v1:10.1.1.3:6861/2056885] is using msgr V1 protocol
2022-08-03T19:17:26.013482+08:00 xxxhost ceph-osd[5086]: 2022-08-03T19:17:26.012+0800 7f8b6e342700 -1 --2- [v2:10.1.1.1:6813/1005086,v1:10.1.1.1:6815/1005086] >> [v2:10.1.1.4:6821/3013756,v1:10.1.1.4:6833/3013756] conn(0x562f3ab58c00 0x562f4a52d700 unknown :-1 s=BANNER_CONNECTING pgs=27487 cs=11571 l=0 rev1=1 rx=0 tx=0)._handle_peer_banner peer [v2:10.1.1.4:6821/3013756,v1:10.1.1.4:6833/3013756] is using msgr V1 protocol
2022-08-03T19:17:40.506488+08:00 xxxhost ceph-osd[5118]: 2022-08-03T19:17:40.505+0800 7fbb2aace700 -1 --2- [v2:10.1.1.1:6814/5118,v1:10.1.1.1:6818/5118] >> [v2:10.1.1.2:6845/1005121,v1:10.1.1.2:6847/1005121] conn(0x563a454a1000 0x563b04a31e00 unknown :-1 s=BANNER_CONNECTING pgs=9017 cs=11566 l=0 rev1=1 rx=0 tx=0)._handle_peer_banner peer [v2:10.1.1.2:6845/1005121,v1:10.1.1.2:6847/1005121] is using msgr V1 protocol
2022-08-03T19:17:40.530498+08:00 xxxhost ceph-osd[5068]: 2022-08-03T19:17:40.529+0800 7f23e3c1c700 -1 --2- [v2:10.1.1.1:6825/5068,v1:10.1.1.1:6831/5068] >> [v2:10.1.1.3:6817/2056885,v1:10.1.1.3:6861/2056885] conn(0x55eec5782000 0x55ee74bda500 unknown :-1 s=BANNER_CONNECTING pgs=30571 cs=11572 l=0 rev1=1 rx=0 tx=0)._handle_peer_banner peer [v2:10.1.1.3:6817/2056885,v1:10.1.1.3:6861/2056885] is using msgr V1 protocol
2022-08-03T19:17:41.014477+08:00 xxxhost ceph-osd[5086]: 2022-08-03T19:17:41.013+0800 7f8b6e342700 -1 --2- [v2:10.1.1.1:6813/1005086,v1:10.1.1.1:6815/1005086] >> [v2:10.1.1.4:6821/3013756,v1:10.1.1.4:6833/3013756] conn(0x562f3ab58c00 0x562f4a52d700 unknown :-1 s=BANNER_CONNECTING pgs=27487 cs=11572 l=0 rev1=1 rx=0 tx=0)._handle_peer_banner peer [v2:10.1.1.4:6821/3013756,v1:10.1.1.4:6833/3013756] is using msgr V1 protocol

After restart:

2022-08-03T19:23:35.955783+08:00 xxxhost kernel: Linux version 4.18.0-372.13.1.el8_6.x86_64 (mockbuild@dal1-prod-builder001.bld.equ.rockylinux.org) (gcc version 8.5.0 20210514 (Red Hat 8.5.0-10) (GCC)) #1 SMP Wed Jun 29 17:21:09 UTC 2022
2022-08-03T19:23:35.955966+08:00 xxxhost kernel: Command line: BOOT_IMAGE=(hd0,gpt2)/vmlinuz-4.18.0-372.13.1.el8_6.x86_64 root=UUID=a863cc8c-1a0f-421a-bce5-0f435eb74954 ro crashkernel=512M@16M

grep -E "error|Error|ERROR|fail|Fail|FAIL" /var/log/messages

2022-08-03T19:23:36.026616+08:00 xxxhost kernel: pci 0000:65:00.0: BAR 6: failed to assign [mem size 0x00100000 pref]
2022-08-03T19:23:36.032640+08:00 xxxhost kernel: ERST: Error Record Serialization Table (ERST) support is initialized.
2022-08-03T19:23:36.463509+08:00 xxxhost kernel: bnxt_en 0000:31:00.0 (unnamed net_device) (uninitialized): PTP initialization failed.
2022-08-03T19:23:36.644505+08:00 xxxhost kernel: bnxt_en 0000:31:00.1 (unnamed net_device) (uninitialized): PTP initialization failed.
2022-08-03T19:23:36.877340+08:00 xxxhost kernel: bnxt_en 0000:b1:00.0 (unnamed net_device) (uninitialized): PTP initialization failed.
2022-08-03T19:23:37.089581+08:00 xxxhost kernel: bnxt_en 0000:b1:00.1 (unnamed net_device) (uninitialized): PTP initialization failed.
2022-08-03T19:23:51.268879+08:00 xxxhost kernel: ACPI Error: No handler for Region [SYSI] (000000007fd51248) [IPMI] (20210604/evregion-135)
2022-08-03T19:23:51.268972+08:00 xxxhost kernel: ACPI Error: Region IPMI (ID=7) has no handler (20210604/exfldio-265)
2022-08-03T19:23:51.284200+08:00 xxxhost kernel: ACPI Error: Aborting method \_SB.PMI0._GHL due to previous error (AE_NOT_EXIST) (20210604/psparse-531)
2022-08-03T19:23:51.291671+08:00 xxxhost kernel: ACPI Error: Aborting method \_SB.PMI0._PMC due to previous error (AE_NOT_EXIST) (20210604/psparse-531)
2022-08-03T19:23:51.291902+08:00 xxxhost kernel: ACPI Error: AE_NOT_EXIST, Evaluating _PMC (20210604/power_meter-759)
2022-08-03T19:23:52.788965+08:00 xxxhost kernel: bnxt_en 0000:31:00.0: bnxt_re: probe error: RoCE is not supported on this device
2022-08-03T19:23:52.789520+08:00 xxxhost kernel: bnxt_en 0000:31:00.1: bnxt_re: probe error: RoCE is not supported on this device
2022-08-03T19:23:52.789657+08:00 xxxhost kernel: bnxt_en 0000:b1:00.0: bnxt_re: probe error: RoCE is not supported on this device
2022-08-03T19:23:52.789742+08:00 xxxhost kernel: bnxt_en 0000:b1:00.1: bnxt_re: probe error: RoCE is not supported on this device
2022-08-03T19:23:56.328783+08:00 xxxhost augenrules[3942]: failure 1
2022-08-03T19:23:56.328783+08:00 xxxhost augenrules[3942]: failure 1
2022-08-03T19:23:56.328783+08:00 xxxhost augenrules[3942]: failure 1
2022-08-03T19:24:13.887961+08:00 xxxhost rsyslogd[6998]: imjournal: fscanf on state file `/var/lib/rsyslog/imjournal.state' failed [v8.2102.0-7.el8_6.1 try https://www.rsyslog.com/e/2027 ]

kdump log:
+ 2022-08-03 19:24:38 /usr/bin/kdumpctl@698: /sbin/kexec -s -d -p '--command-line=BOOT_IMAGE=(hd0,gpt2)/vmlinuz-4.18.0-372.13.1.el8_6.x86_64 ro irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never nokaslr novmcoredd hest_disable disable_cpu_apicid=0 iTCO_wdt.pretimeout=0' --initrd=/boot/initramfs-4.18.0-372.13.1.el8_6.x86_64kdump.img /boot/vmlinuz-4.18.0-372.13.1.el8_6.x86_64
Try gzip decompression.
Try LZMA decompression.
lzma_decompress_file: read on /boot/vmlinuz-4.18.0-372.13.1.el8_6.x86_64 of 65536 bytes failed
+ 2022-08-03 19:24:39 /usr/bin/kdumpctl@702: ret=0
+ 2022-08-03 19:24:39 /usr/bin/kdumpctl@703: set +x



Tags: reboot

Activities

Neil neil

2022-08-04 08:15

reporter   ~0000320

ceph Version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)
openstack version Victoria
openstack-nova-compute-22.2.2-1.el8.noarch
openstack-neutron-common-17.2.1-1.el8.noarch
openstack-nova-common-22.2.2-1.el8.noarch
openstack-neutron-linuxbridge-17.2.1-1.el8.noarch
python3-openstacksdk-0.50.0-1.el8.noarch
Louis Abel

2022-08-04 16:09

administrator   ~0000323

Thank you for the report. We are not able to determine from this bug report if it is an issue with Rocky Linux or hardware due to missing information:

* An SOS report (can be generated by installing sos and running sosreport) or general hardware information
* other relevant logs

Because you are using software that we do not ship, it is difficult for us to help troubleshoot. If you installed openstack and this problem began shortly after, you may need to work with the openstack community to resolve the issue. As you are using a version of openstack that is in extended maintenance, the community may or may not request that you upgrade to a supported version.

Please note that this bug tracker is not meant for general support questions.
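
The information requested in this note could be gathered roughly as follows. This is only a sketch: the helper names `run` and `collect_diagnostics` and the `DRY_RUN` variable are made up for illustration, `sos report` is assumed to be the current spelling of the `sosreport` command mentioned above, and the last two commands are an assumption about which "other relevant logs" would be useful for an unexplained reboot.

```shell
# Print each command before running it; with DRY_RUN=1, only print it.
run() {
    echo "+ $*"
    [ "${DRY_RUN:-0}" = 1 ] || "$@"
}

# Gather the requested information. Run as root on the affected host.
# (Function name is hypothetical, not part of any Rocky Linux tooling.)
collect_diagnostics() {
    run dnf install -y sos        # the sos package provides the sos command
    run sos report --batch        # non-interactive; prints the archive path when done
    run last -x reboot shutdown   # reboot and shutdown records from wtmp
    run journalctl --list-boots   # boots recorded in the journal
}
```

Calling `DRY_RUN=1 collect_diagnostics` first shows what would be executed without changing the system.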
Neil neil

2022-08-10 08:32

reporter   ~0000334

1.jpg (1,015,334 bytes)
2.jpg (1,195,344 bytes)
4.jpg (620,735 bytes)
3.jpg (310,107 bytes)
Neil neil

2022-08-10 08:35

reporter   ~0000335

1-2.jpg (1,417,161 bytes)
4.jpeg (1,152,759 bytes)
3-2.jpg (1,400,459 bytes)
2-2.jpg (1,412,334 bytes)
Neil neil

2022-08-11 01:31

reporter   ~0000336

We finally managed to extract the kdump log. It looks like a hardware compatibility issue. Could the community help analyze what caused it?
Neil neil

2022-08-12 08:16

reporter   ~0000340

Hello. Our kdump environment currently cannot mount the hard drive, so it cannot write the kdump core.
At present, whether or not we enter the kdump shell, all we can do is restart.
Can the community suggest a better way to dump kdump core files to external media?
image-2.png (633,352 bytes)
image.png (280,657 bytes)
Louis Abel

2022-08-13 06:01

administrator   ~0000344

You can use the rescue mode of any Rocky Linux ISO to obtain the kernel dumps. The default location will be /var/crash.
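
As a rough sketch of that suggestion, the helper below copies a crash directory onto external media from the rescue shell. The function name `copy_crash_dumps` is hypothetical, and the paths in the usage note are assumptions: check where the rescue environment reports it has mounted the installed system, and where you mounted the external disk.

```shell
# copy_crash_dumps SRC DEST
# Copy everything under SRC (the crash directory of the installed system)
# into DEST (a directory on external media), preserving file attributes.
copy_crash_dumps() {
    src=$1
    dest=$2
    if [ ! -d "$src" ]; then
        echo "no crash directory at $src" >&2
        return 1
    fi
    mkdir -p "$dest" && cp -a "$src"/. "$dest"/
}
```

For example, if the installed system is mounted under /mnt/sysroot and a USB disk under /mnt/usb (both assumed here), the call would be `copy_crash_dumps /mnt/sysroot/var/crash /mnt/usb/crash`.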
Neil neil

2022-08-24 02:20

reporter   ~0000466

Thank you very much for the community's help. I think we have found the cause. One of our colleagues added a fatal configuration to the Ceph service units, which caused the kernel to crash and restart.
[Unit]
Description=ceph Slice
Documentation=man:systemd.special(7)
Before=slices.target

[Slice]
MemoryAccounting=true
#MemoryLimit=2048M
MemoryMax=4G
CPUAccounting=true
CPUQuota=90%

/usr/lib/systemd/system/ceph-osd*.service
[Service]
Slice=ceph.slice
CPUAffinity=4-21
Nice=-20

After we found and removed this configuration, everything returned to normal. Once again, thank you for your kind help. You have my deepest respect.
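
To audit other hosts for similar overrides, a small helper along these lines could list the active scheduling and resource-control directives in a set of unit files. This is a sketch: the function name `list_resource_directives` is made up, and the directive list simply mirrors the settings quoted in this note. The thread does not say which of these directives triggered the crash, so the output is a starting point for review, not a diagnosis.

```shell
# list_resource_directives FILE...
# Print file:line:text for each active (uncommented) scheduling or
# resource-control directive found in the given unit files.
list_resource_directives() {
    grep -HnE '^[[:space:]]*(Slice|Nice|CPUAffinity|CPUQuota|CPUAccounting|MemoryMax|MemoryLimit|MemoryAccounting)=' "$@"
}
```

A call such as `list_resource_directives /usr/lib/systemd/system/ceph-osd*.service` covers the service files named above. The location of the slice file is not given in the note, so pass its path once you know it. Lines commented out with `#`, like the `#MemoryLimit=2048M` entry, are skipped.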

Issue History

Date Modified Username Field Change
2022-08-04 08:08 Neil neil New Issue
2022-08-04 08:08 Neil neil Tag Attached: reboot
2022-08-04 08:15 Neil neil Note Added: 0000320
2022-08-04 16:09 Louis Abel Note Added: 0000323
2022-08-10 08:32 Neil neil Note Added: 0000334
2022-08-10 08:32 Neil neil File Added: 1.jpg
2022-08-10 08:32 Neil neil File Added: 2.jpg
2022-08-10 08:32 Neil neil File Added: 3.jpg
2022-08-10 08:32 Neil neil File Added: 4.jpg
2022-08-10 08:35 Neil neil Note Added: 0000335
2022-08-10 08:35 Neil neil File Added: 1-2.jpg
2022-08-10 08:35 Neil neil File Added: 2-2.jpg
2022-08-10 08:35 Neil neil File Added: 3-2.jpg
2022-08-10 08:35 Neil neil File Added: 4.jpeg
2022-08-11 01:31 Neil neil Note Added: 0000336
2022-08-11 02:46 Louis Abel Assigned To => Louis Abel
2022-08-11 02:46 Louis Abel Status new => needinfo
2022-08-12 08:16 Neil neil Note Added: 0000340
2022-08-12 08:16 Neil neil File Added: image.png
2022-08-12 08:16 Neil neil File Added: image-2.png
2022-08-13 06:01 Louis Abel Note Added: 0000344
2022-08-24 02:20 Neil neil Note Added: 0000466
2022-08-25 19:20 Louis Abel Status needinfo => closed
2022-08-25 19:20 Louis Abel Resolution open => no change required