View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0000662 | Rocky-Linux-8 | NetworkManager | public | 2022-11-01 16:01 | 2022-11-07 22:12 |
Reporter | Carl Gromatzky | Assigned To | Louis Abel | ||
Priority | normal | Severity | major | Reproducibility | always |
Status | needinfo | Resolution | open | ||
Platform | ESXi 7.0.3, open-vm-tools 11.3.5 | OS | Rocky 8.6 | OS Version | 4.18.0-372.9.1.e |
Summary | 0000662: MAC Addr Corruption VMWare open-vm-tools 11.3.5-1.el8 | ||||
Description | Subject: Rocky 8.6 VMXNET3 Device Enumeration Corruption. Message: Running ESXi 7.0.3 build 20328353 (or build 20395099) on a UCS-C220-M5SX with firmware 4.2(2a) and an Intel Xeon Platinum 8168 2.7GHz processor. The VM is Rocky 8.6 (4.18.0-372.9.1.el8.x86_64), using VMXNET3 network devices. Network device MAC addresses enumerate in the VM OS (Rocky 8.6) just fine, up to 3 network devices, i.e.: eth0 MAC addr = Network Adapter 1 MAC addr; eth1 MAC addr = Network Adapter 2 MAC addr; eth2 MAC addr = Network Adapter 3 MAC addr. Adding a fourth network device causes nmcli to show corrupted MAC address to device mappings. Rebooting causes dmesg errors and network loss (the driver appears to enumerate all vmxnet3 network devices incorrectly upon adding the 4th vmxnet3 device, mapping connection profiles to incorrect devices and causing kernel errors and network connectivity loss). After the fourth adapter is added to the VM, MAC addresses enumerate differently in the OS than the network adapters listed in vCenter, i.e.: eth0 MAC addr = Network Adapter 4 MAC addr; eth1 MAC addr = Network Adapter 1 MAC addr; eth2 MAC addr = Network Adapter 2 MAC addr; eth3 MAC addr = Network Adapter 3 MAC addr. The open-vm-tools version experiencing this issue is 11.3.5-1.el8_6.1 from the @AppStream repo. I also downloaded open-vm-tools versions 12.0.0, 12.0.5, 12.1.0 from the vmware github and rpm'd them into the build. All show the same behavior on this VM. 12.1.0 actually started experiencing the MAC enumeration corruption when adding the third Network Adapter. I opened a case with VMWare. They closed my case, stating that I must open a case with Rocky Linux directly. | ||||
Steps To Reproduce | The VM is given 1 vmxnet3 adapter to start. The network is configured and connectivity is good. The VM is given 1 additional adapter. The network connectivity is good. The VM is given 1 additional adapter. The network connectivity is good. The VM is given 1 additional adapter. The network connectivity fails, with MAC address to nmcli connection mappings corrupted. Generally, the error will not present until the VM is rebooted. OR: The VM is given 4 vmxnet3 adapters to start. The network is configured and connectivity is good. The VM is rebooted and connectivity fails, with MAC address to nmcli connection mappings corrupted. | ||||
Additional Information | The OS is hardened, running in multi-user.target mode. Running a generic Rocky 8.6 build provides the following behavior in the same VMWare environment: after adding 4 vmxnet3 devices, the nmcli device list shows MAC addr enumeration is fine, though the default network device nomenclature is ens[n]. I notice that the first 2 network devices enumerate as ens2[nn] and that the 3rd and 4th devices enumerate as ens1[nn]. In the production, hardened build, the network device nomenclature is eth[n]. | ||||
Tags | No tags attached. | ||||
Unfortunately we simply rebuild the open-vm-tools package as provided by Red Hat, which contains the sources that VMWare tools builds. So there's not much of a way for us to investigate. I disagree with them that you must come to us to resolve this issue when it's their tools and platform. Please reopen a ticket and reference this bug ticket as well as https://bugs.rockylinux.org/view.php?id=200 to open a discussion with their engineering teams to find a solution. |
|
I've updated the VMWare ticket and am awaiting a response from that support team regarding the supportability dispute. Thank you, Louis. | |
I uninstalled open-vm-tools 11.3.5 and confirmed in vCenter. I then installed VMWare Tools 10.3.25, confirmed in vCenter, and experienced the same MAC address corruption issue. The VMWare engineer has been updated. Perhaps they will support the official VMWare Tools. The VMWare engineer is also escalating this as an open-vm-tools issue to VMWare global engineers (seeing if they will pick it up at all). The /sys/class/net/eth[0,1]/address sysfs files show the incorrectly mapped MAC addresses. This seems to be a driver issue. Hoping to get feedback. Will update here. |
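The sysfs check mentioned in the note above can be scripted so the same listing is easy to capture before and after a reboot. The following is only a minimal sketch: the function name `list_interfaces` and the `sysfs_root` parameter are illustrative and not part of any tool mentioned in this ticket. It reads each interface's `address` file and, where a `device` symlink exists, reports the name of the backing device directory (for a vmxnet3 adapter this is expected to be its PCI address).

```python
import os


def list_interfaces(sysfs_root="/sys/class/net"):
    """Return (interface, mac, backing_device) tuples sorted by interface name.

    sysfs_root is a parameter only so the function can be pointed at a test
    directory; on a real system the default is the usual location.
    """
    rows = []
    for name in sorted(os.listdir(sysfs_root)):
        addr_file = os.path.join(sysfs_root, name, "address")
        if not os.path.isfile(addr_file):
            continue  # skip entries that are not interfaces
        with open(addr_file) as fh:
            mac = fh.read().strip()
        # The "device" symlink, when present, resolves to the backing device
        # directory; its basename is the PCI address for PCI NICs.
        dev_link = os.path.join(sysfs_root, name, "device")
        device = ""
        if os.path.exists(dev_link):
            device = os.path.basename(os.path.realpath(dev_link))
        rows.append((name, mac, device))
    return rows


if __name__ == "__main__":
    for name, mac, device in list_interfaces():
        print(f"{name:10s} {mac:17s} {device}")
```

Comparing the output with the Network Adapter list in vCenter should show which adapter each ethN actually landed on.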
|
Thank you for the update. It's starting to sound like a platform issue (esx level). Perhaps ESX is providing the wrong information to the firmware, and the kernel/driver/udev is just following what it sees. Hoping to hear what they find. One of our testers is currently moving their infra, so if vmware doesn't come back right away, he can probably try to see if the issue is reproducible in his environment with both Rocky Linux and RHEL. It would be interesting to see if it's also a problem in RHEL. |
|
That's an interesting note. I researched and tested further, finding the VMWare Firmware/BIOS is indeed the culprit, in that it enumerates Network Adapters on the PCIe bus out of logical sequence. Red Hat has described this as "giving the kernel information that doesn't make any sense" since 2016, so it's a long-running issue. Below is text from the updated email (with additional research references that led to better understanding) that I sent to the VMWare engineer.
---------------------------------
To reiterate the change in scope: I replicated the MAC Address to Interface corruption in the official VMWare Tools package. This is no longer only an open-vm-tools issue. This is a VMWare-specific problem, as noted by several vendors, and it has to do with PCI/PCIe bus enumeration problems of VMWare.
I believe that this VMWare article, which pulls information from Red Hat KBs, may prove the next path for testing: https://docs.vmware.com/en/VMware-Adapter-for-SAP-Landscape-Management/services/Administration-Guide-for-LaMa-Administrators/GUID-0603A3F3-CCB1-42AF-A1E2-6B61979C00CB.html
I need to know more about how the PCIe Bus enumeration and Ethernet Port enumeration in VMWare function in order to create a reasonable udev rule to address VMWare's hardware handling.
Questions:
Is BDF bus assignment consistent and predictable when adding n number of Network Adapters to a VMWare VM?
What is the order of operations and resulting output of Bus ID assignment for n number of Network Adapters?
Are Ethernet port IDs user-definable in the Advanced Configuration Parameters of VM Settings? I see that I can adjust advanced configuration port numbers for interfaces, but I'm not sure whether port number assignments would change bus ID behavior.
How are port numbers derived/calculated?
Would changing a port number persist across a reboot, or would VMWare simply automatically reassign a port number?
Root cause of the issue:
The root cause of the MAC Addr corruption that takes Ethernet interfaces down appears to be that VMWare Tools passes bus IDs to the Rocky Linux kernel out of numerical order (this is a VMWare-specific issue, as noted by several OS vendors). This behavior causes the discrepancy between the interface naming sequence, ethX, and the Ethernet device connection name, ensY. The old ethX naming style is needed in many environments in order to comply with the needs of devops and secops.
Result: Interface eth0 may map to Connection ens192 when Network Adapter 1 is first presented to the OS, then eth0 may map to ens161 on the next boot. This is due to the ID_NET_NAME_PATH nomenclature assigned by udev, as enpMsN, enumerating out of logical sequence, based on BUS ID enumeration from the VMWare Firmware/BIOS passing through to the kernel udev/rules.d naming rules during POST.
i.e. Interface eth0 will be enumerated first by the kernel and mapped to the first-presented virtual PCIe connection, which enumerates based on bus ID.
Generally, Network Adapter 1 maps to enp11s0 reliably.
With 4 network adapters, the BUS ID doesn't matter at first presentation of the Network Adapter (hot-plugging gives Network Adapters 1-4 the appropriate eth0-3 names because they're being hot-plugged in order to the OS by VMWare).
On the next boot, Network Adapter 4 gets presented to the kernel on enp4s0, connection ens161, so udev then assigns eth0 to ens161 because the lowest numbers in the database get paired together.
The MAC Address corruption grows in complexity with the number of VMWare Network Adapters added. If I give the VM 8 Network Adapters, the BUS IDs get injected in no predictable order, which goes back to the questions above. If I can understand how VMWare enumerates the PCIe bus and Ethernet ports, I believe I can create the udev rule to handle ethX naming at scale in el7+ (and other OS kernel) builds and satisfy dev and sec requirements.
Further links leading to above VMWare KB in research of this problem:
https://github.com/Azure/WALinuxAgent/issues/1750
https://github.com/coreos/bugs/issues/2437
https://github.com/systemd/systemd/pull/8458
https://unix.stackexchange.com/questions/611762/how-to-fix-the-conflict-in-the-naming-scheme-for-network-interfaces-use-by-predi
https://www.linuxsysadmins.com/systemd-network-interface-name-ensx-to-eth0/
https://www.redhat.com/en/blog/red-hat-enterprise-linux-73-achieving-persistent-and-consistent-network-interface-naming-vmware-environments
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/networking_guide/ch-consistent_network_device_naming#sec-Naming_Schemes_Hierarchy
https://communities.vmware.com/t5/vSphere-vNetwork-Discussions/how-to-fix-a-virtual-Network-Adapter-to-be-the-first-one-from/m-p/916181
https://kb.vmware.com/s/article/2047927
https://unix.stackexchange.com/questions/134483/why-is-my-ethernet-interface-called-enp0s10-instead-of-eth0
https://wiki.xenproject.org/wiki/Bus:Device.Function_(BDF)_Notation
https://www.freedesktop.org/wiki/Software/systemd/PredictableNetworkInterfaceNames/
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/networking_guide/sec-understanding_the_predictable_network_interface_device_names
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_networking/consistent-network-interface-device-naming_configuring-and-managing-networking
https://access.redhat.com/solutions/2592561 |
|
I meant to attach some files for reference (see attached) netid_badrock.txt (1,771 bytes)
-------------------------------- UDEV udev test /sys/class/net/eth[0-3] ---------------------------------
$>_
ID_NET_NAMING_SCHEME=rhel-8.0
ID_NET_NAME_MAC=enx005056ac64fa
ID_OUI_FROM_DATABASE=VMware, Inc.
ID_NET_NAME_PATH=enp11s0
ID_NET_NAME_SLOT=ens192

ID_NET_NAMING_SCHEME=rhel-8.0
ID_NET_NAME_MAC=enx005056acda97
ID_OUI_FROM_DATABASE=VMware, Inc.
ID_NET_NAME_PATH=enp19s0
ID_NET_NAME_SLOT=ens224

ID_NET_NAMING_SCHEME=rhel-8.0
ID_NET_NAME_MAC=enx005056acd390
ID_OUI_FROM_DATABASE=VMware, Inc.
ID_NET_NAME_PATH=enp27s0
ID_NET_NAME_SLOT=ens256

ID_NET_NAMING_SCHEME=rhel-8.0
ID_NET_NAME_MAC=enx005056ac98bd
ID_OUI_FROM_DATABASE=VMware, Inc.
ID_NET_NAME_PATH=enp4s0
ID_NET_NAME_SLOT=ens161

------------------- after reboot ----------------
ID_NET_NAMING_SCHEME=rhel-8.0
ID_NET_NAME_MAC=enx005056ac98bd
ID_OUI_FROM_DATABASE=VMware, Inc.
ID_NET_NAME_PATH=enp4s0
ID_NET_NAME_SLOT=ens161

ID_NET_NAMING_SCHEME=rhel-8.0
ID_NET_NAME_MAC=enx005056ac64fa
ID_OUI_FROM_DATABASE=VMware, Inc.
ID_NET_NAME_PATH=enp11s0
ID_NET_NAME_SLOT=ens192

ID_NET_NAMING_SCHEME=rhel-8.0
ID_NET_NAME_MAC=enx005056acda97
ID_OUI_FROM_DATABASE=VMware, Inc.
ID_NET_NAME_PATH=enp19s0
ID_NET_NAME_SLOT=ens224

ID_NET_NAMING_SCHEME=rhel-8.0
ID_NET_NAME_MAC=enx005056acd390
ID_OUI_FROM_DATABASE=VMware, Inc.
ID_NET_NAME_PATH=enp27s0
ID_NET_NAME_SLOT=ens256

---------------------------- lspci for Ethernet adapters ---------------------------
04:00.0 Ethernet controller [0200]: VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
0b:00.0 Ethernet controller [0200]: VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
13:00.0 Ethernet controller [0200]: VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
1b:00.0 Ethernet controller [0200]: VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)

udevtest_badrock_afterreboot.txt (1,091 bytes)
This program is for debugging only, it does not run any program
specified by a RUN key. It may show incorrect results, because
some values may be different, or not available at a simulation run.

ACTION=add
DEVPATH=/devices/pci0000:00/0000:00:15.1/0000:04:00.0/net/eth0
ID_BUS=pci
ID_MODEL_FROM_DATABASE=VMXNET3 Ethernet Controller
ID_MODEL_ID=0x07b0
ID_NET_DRIVER=vmxnet3
ID_NET_LINK_FILE=/usr/lib/systemd/network/99-default.link
ID_NET_NAME_MAC=enx005056acf7d9
ID_NET_NAME_PATH=enp4s0
ID_NET_NAME_SLOT=ens161
ID_NET_NAMING_SCHEME=rhel-8.0
ID_OUI_FROM_DATABASE=VMware, Inc.
ID_PATH=pci-0000:04:00.0
ID_PATH_TAG=pci-0000_04_00_0
ID_PCI_CLASS_FROM_DATABASE=Network controller
ID_PCI_SUBCLASS_FROM_DATABASE=Ethernet controller
ID_VENDOR_FROM_DATABASE=VMware
ID_VENDOR_ID=0x15ad
IFINDEX=2
INTERFACE=eth0
SUBSYSTEM=net
SYSTEMD_ALIAS=/sys/subsystem/net/devices/eth0
TAGS=:systemd:
UDEV_BIOSDEVNAME=0
USEC_INITIALIZED=2574611
biosdevname=0
run: '/usr/lib/systemd/systemd-sysctl --prefix=/net/ipv4/conf/eth0 --prefix=/net/ipv4/neigh/eth0 --prefix=/net/ipv6/conf/eth0 --prefix=/net/ipv6/neigh/eth0'

udevtest_badrock.txt (1,096 bytes)
This program is for debugging only, it does not run any program
specified by a RUN key. It may show incorrect results, because
some values may be different, or not available at a simulation run.

ACTION=add
DEVPATH=/devices/pci0000:00/0000:00:16.0/0000:0b:00.0/net/eth0
ID_BUS=pci
ID_MODEL_FROM_DATABASE=VMXNET3 Ethernet Controller
ID_MODEL_ID=0x07b0
ID_NET_DRIVER=vmxnet3
ID_NET_LINK_FILE=/usr/lib/systemd/network/99-default.link
ID_NET_NAME_MAC=enx005056ac64fa
ID_NET_NAME_PATH=enp11s0
ID_NET_NAME_SLOT=ens192
ID_NET_NAMING_SCHEME=rhel-8.0
ID_OUI_FROM_DATABASE=VMware, Inc.
ID_PATH=pci-0000:0b:00.0
ID_PATH_TAG=pci-0000_0b_00_0
ID_PCI_CLASS_FROM_DATABASE=Network controller
ID_PCI_SUBCLASS_FROM_DATABASE=Ethernet controller
ID_VENDOR_FROM_DATABASE=VMware
ID_VENDOR_ID=0x15ad
IFINDEX=10
INTERFACE=eth0
SUBSYSTEM=net
SYSTEMD_ALIAS=/sys/subsystem/net/devices/eth0
TAGS=:systemd:
UDEV_BIOSDEVNAME=0
USEC_INITIALIZED=1279538639
biosdevname=0
run: '/usr/lib/systemd/systemd-sysctl --prefix=/net/ipv4/conf/eth0 --prefix=/net/ipv4/neigh/eth0 --prefix=/net/ipv6/conf/eth0 --prefix=/net/ipv6/neigh/eth0'

udevtest_r8default.txt (1,090 bytes)
This program is for debugging only, it does not run any program
specified by a RUN key. It may show incorrect results, because
some values may be different, or not available at a simulation run.

ACTION=add
DEVPATH=/devices/pci0000:00/0000:00:15.1/0000:04:00.0/net/ens161
ID_BUS=pci
ID_MM_CANDIDATE=1
ID_MODEL_FROM_DATABASE=VMXNET3 Ethernet Controller
ID_MODEL_ID=0x07b0
ID_NET_DRIVER=vmxnet3
ID_NET_LINK_FILE=/usr/lib/systemd/network/99-default.link
ID_NET_NAME_MAC=enx005056ac5de9
ID_NET_NAME_PATH=enp4s0
ID_NET_NAME_SLOT=ens161
ID_NET_NAMING_SCHEME=rhel-8.0
ID_OUI_FROM_DATABASE=VMware, Inc.
ID_PATH=pci-0000:04:00.0
ID_PATH_TAG=pci-0000_04_00_0
ID_PCI_CLASS_FROM_DATABASE=Network controller
ID_PCI_SUBCLASS_FROM_DATABASE=Ethernet controller
ID_VENDOR_FROM_DATABASE=VMware
ID_VENDOR_ID=0x15ad
IFINDEX=2
INTERFACE=ens161
SUBSYSTEM=net
SYSTEMD_ALIAS=/sys/subsystem/net/devices/ens161
TAGS=:systemd:
USEC_INITIALIZED=4634244
run: '/usr/lib/systemd/systemd-sysctl --prefix=/net/ipv4/conf/ens161 --prefix=/net/ipv4/neigh/ens161 --prefix=/net/ipv6/conf/ens161 --prefix=/net/ipv6/neigh/ens161' |
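The renaming described in the root-cause note can be reproduced on paper from the netid_badrock.txt values. This is a rough sketch, not a model of the kernel or udev: it assumes ethN names are handed out in probe order, that hot-added adapters are probed in the order they were added, and that at boot the probe order follows ascending PCI bus number. Those assumptions come from the reporter's explanation and have not been checked against kernel source. The helper names `pci_bus` and `name_in_order` are made up for illustration.

```python
import re

# (ID_NET_NAME_MAC, ID_NET_NAME_PATH) pairs in hot-add order, i.e. Network
# Adapter 1..4, copied from the first half of netid_badrock.txt.
ADAPTERS = [
    ("enx005056ac64fa", "enp11s0"),
    ("enx005056acda97", "enp19s0"),
    ("enx005056acd390", "enp27s0"),
    ("enx005056ac98bd", "enp4s0"),
]


def pci_bus(path_name):
    """Extract the decimal PCI bus number from an enp<bus>s<slot> name."""
    match = re.fullmatch(r"enp(\d+)s(\d+)", path_name)
    if match is None:
        raise ValueError(f"not an enp<bus>s<slot> name: {path_name!r}")
    return int(match.group(1))


def name_in_order(adapters):
    """Assign eth0, eth1, ... to adapters in the order given (assumed probe order)."""
    return {f"eth{i}": mac for i, (mac, _path) in enumerate(adapters)}


# Hot-plug: adapters are presented in the order they were added.
hotplug = name_in_order(ADAPTERS)
# Cold boot (assumption): adapters are presented in ascending bus order.
after_reboot = name_in_order(sorted(ADAPTERS, key=lambda a: pci_bus(a[1])))

if __name__ == "__main__":
    for eth in sorted(hotplug):
        print(f"{eth}: hot-plug {hotplug[eth]}  after reboot {after_reboot[eth]}")
```

Under these assumptions the adapter on bus 4 (enp4s0, Network Adapter 4) moves to eth0 after a reboot and the other three shift down by one, which is the same order the "after reboot" half of netid_badrock.txt shows.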
|
Keeping this string updated with research as discovered: Based on the VMWare article and the Kemp article below, the community has understood the VMWare PCIe bus enumeration and configured a .vmx file that attaches the maximum number of Ethernet devices and hard-codes their sequence to a logical order for the OS in user-space. This seems like a logical way to scale a template-able fix (which is my primary concern, as addressing this issue at scale any time someone wants more than 3 adapters would be inefficient). Any thoughts? (I've ported this comment over to the VMWare engineer assigned. I want to make sure this passes VMWare's eyes and gets the OK regarding any pitfalls or obvious errors.) Kemp article: https://support.kemptechnologies.com/hc/en-us/articles/201978745-When-adding-4-or-more-VMXNET3-NICs-to-a-VLM-in-VMware-the-order-is-incorrect VMWare article: https://kb.vmware.com/s/article/2047927 |
|
I deployed this config in a .vmx for a test vm, worked great

# 001.00101.00000
# PCI Slot=4
# Parent slot 21:
# Bus:Dev.Func = 00h:15h.01h
# Guest OS Order eth0
ethernet0.pciSlotNumber = "1184"

# 010.00101.00000
# PCI Slot=4
# Parent slot 21:
# Bus:Dev.Func = 00h:15h.02h
# Guest OS Order eth1
ethernet1.pciSlotNumber = "2208"

# 011.00101.00000
# PCI Slot=4
# Parent slot 21:
# Bus:Dev.Func = 00h:15h.03h
# Guest OS Order eth2
ethernet2.pciSlotNumber = "3232"

# 001.00110.00000
# PCI Slot=5
# Parent slot 22:
# Bus:Dev.Func = 00h:16h.01h
# Guest OS Order eth3
ethernet3.pciSlotNumber = "1216"

# 010.00110.00000
# PCI Slot=5
# Parent slot 22:
# Bus:Dev.Func = 00h:16h.02h
# Guest OS Order eth4
ethernet4.pciSlotNumber = "2240"

# 011.00110.00000
# PCI Slot=5
# Parent slot 22:
# Bus:Dev.Func = 00h:16h.03h
# Guest OS Order eth5
ethernet5.pciSlotNumber = "3264"

# 001.00111.00000
# PCI Slot=6
# Parent slot 23:
# Bus:Dev.Func = 00h:17h.01h
# Guest OS Order eth6
ethernet6.pciSlotNumber = "1248"

# 010.00111.00000
# PCI Slot=6
# Parent slot 23:
# Bus:Dev.Func = 00h:17h.02h
# Guest OS Order eth7
ethernet7.pciSlotNumber = "2272"

# 001.01000.00000
# PCI Slot=7
# Parent slot 24:
# Bus:Dev.Func = 00h:18h.01h
# Guest OS Order eth8
ethernet8.pciSlotNumber = "1280"

# 010.01000.00000
# PCI Slot=7
# Parent slot 24:
# Bus:Dev.Func = 00h:18h.02h
# Guest OS Order eth9
ethernet9.pciSlotNumber = "2304"

I also found that these .vmx lines need to be removed (they'll be automatically repopulated at next power state change of the vm)
ethernetN.dvs.portId = ...
ethernetN.generatedAddress = ...
ethernetN.addressType = ...
ethernetN.generateAddressOffset = ...
-------------------------------------------------------------------------------------------
If configuring a VM with the GUI, go to Edit Settings > VM Options (tab at the top) > Advanced > Edit Configuration > Add Configuration Params (particularly useful for new VM configuration).
Then add ethernetN.pciSlotNumber with the slot ID noted in the lines above for the n-th interface (where N is the n-th interface in ethernetN).
-------------------------------------------------------------------------------------------
The VMWare engineer is also looking into opening a PR for VMWare-Tools and Rocky 8.6. (My guess is that this is a way to channel this issue into the right VMWare support workflow vs. VMWare assuming this is an OS problem....the long history of this issue at the global level leaves interpretation up in the air, but I'm brave enough to feel hopeful lol) |
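For anyone templating the pciSlotNumber values from the .vmx snippet above, the bit patterns in its comments can be checked with a few lines of Python. This is only a sketch of the arithmetic implied by those comments, not VMWare's documented encoding: it treats each value as three fields (high bits, a 5-bit middle field, a 5-bit low field) and assumes the middle field m corresponds to the parent bridge at device 10h + m, which is inferred from the pairs 00101/15h, 00110/16h, 00111/17h and 01000/18h in the comments. The function names are hypothetical.

```python
def decode_slot(value):
    """Split a pciSlotNumber into (function, bridge_field, device_field)."""
    device_field = value & 0x1F          # low 5 bits (00000 in every entry above)
    bridge_field = (value >> 5) & 0x1F   # middle 5 bits
    function = value >> 10               # remaining high bits
    return function, bridge_field, device_field


def encode_slot(function, bridge_field, device_field=0):
    """Inverse of decode_slot."""
    return (function << 10) | (bridge_field << 5) | device_field


def parent_bdf(value):
    """Bus:Dev.Func of the parent bridge, in the style of the .vmx comments.

    Assumption: middle field m maps to device 10h + m on bus 00h.
    """
    function, bridge_field, _ = decode_slot(value)
    return f"00h:{0x10 + bridge_field:02x}h.{function:02x}h"


# Values from the .vmx snippet, in guest order eth0..eth9.
SLOTS = [1184, 2208, 3232, 1216, 2240, 3264, 1248, 2272, 1280, 2304]

if __name__ == "__main__":
    for n, slot in enumerate(SLOTS):
        print(f'ethernet{n}.pciSlotNumber = "{slot}"  # {parent_bdf(slot)}')
```

With this decoding the ten values come out in ascending (bridge, function) order, which would be consistent with the guest seeing them as eth0 through eth9 on a cold boot; that ordering claim is the reporter's observation, and the decoding itself should be confirmed against the VMWare KB before it is relied on for other slot numbers.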
|
Date Modified | Username | Field | Change |
---|---|---|---|
2022-11-01 16:01 | Carl Gromatzky | New Issue | |
2022-11-01 16:08 | Louis Abel | Assigned To | => Louis Abel |
2022-11-01 16:08 | Louis Abel | Status | new => needinfo |
2022-11-01 16:08 | Louis Abel | Note Added: 0000802 | |
2022-11-01 16:15 | Carl Gromatzky | Note Added: 0000803 | |
2022-11-02 23:44 | Carl Gromatzky | Note Added: 0000826 | |
2022-11-02 23:49 | Louis Abel | Note Added: 0000827 | |
2022-11-04 18:48 | Carl Gromatzky | Note Added: 0000862 | |
2022-11-04 21:29 | Carl Gromatzky | Note Added: 0000863 | |
2022-11-04 21:29 | Carl Gromatzky | File Added: netid_badrock.txt | |
2022-11-04 21:29 | Carl Gromatzky | File Added: udevtest_badrock_afterreboot.txt | |
2022-11-04 21:29 | Carl Gromatzky | File Added: udevtest_badrock.txt | |
2022-11-04 21:29 | Carl Gromatzky | File Added: udevtest_r8default.txt | |
2022-11-05 19:30 | Carl Gromatzky | Note Added: 0000864 | |
2022-11-07 22:12 | Carl Gromatzky | Note Added: 0000866 |