-
Bug
-
Resolution: Done
-
Normal
-
None
-
4.8
-
None
-
Quality / Stability / Reliability
-
False
Description of problem:
After a successful OCP 4.8.32 cluster installation, we perform Day 2 operations that modify the network configuration. The changes enable openvswitch and move the IP configuration from the physical device to openvswitch. When the master nodes are IPLed for the first time after this change, some required services are not available. Specifically, during bootstrapping the openvswitch services (daemon and database) and some NetworkManager virtual mounts fail to start because the /sysroot bind mounts do not reach read-write mode fast enough. The etcd sync writes on the master nodes also interfere with each other and produce “disk is too slow” warnings.
This OCP cluster runs under z/VM, with a disk configuration in which multiple OCP node minidisks share a single EDEV. This can lead to higher disk latency. However, higher latency should not mean that the required services mentioned above (openvswitch and NetworkManager) have to become available within a certain timing window.
If you keep rebooting the nodes, one boot will eventually get fast enough access to the backing EDEV to make it past the timing window.
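To put a number on the disk latency described above, a rough probe of synchronous write latency can help when comparing nodes that share an EDEV. The following is a minimal sketch in Python, not a replacement for a proper benchmark. The function name `measure_fsync_latencies`, the sample count, and the 4 KiB payload are assumptions chosen for illustration.

```python
import os
import tempfile
import time


def measure_fsync_latencies(directory, samples=20, payload=b"x" * 4096):
    """Return per-write fsync latencies (seconds) for a temp file in `directory`.

    Hypothetical helper: the name, sample count, and payload size are
    illustrative assumptions, not taken from the report.
    """
    latencies = []
    with tempfile.NamedTemporaryFile(dir=directory) as handle:
        for _ in range(samples):
            handle.write(payload)
            handle.flush()  # push Python's buffer to the OS
            start = time.perf_counter()
            os.fsync(handle.fileno())  # force the data to the backing device
            latencies.append(time.perf_counter() - start)
    return latencies


if __name__ == "__main__":
    # Probe the default temp directory; on a node, point this at a
    # directory on the disk of interest instead.
    results = sorted(measure_fsync_latencies(tempfile.gettempdir()))
    print(f"median fsync: {results[len(results) // 2] * 1000:.2f} ms")
    print(f"worst fsync:  {results[-1] * 1000:.2f} ms")
```

Running this on several nodes at the same time would give a first impression of how much they contend for the shared device.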
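To confirm whether /sysroot has reached read-write mode at a given point during boot, the mount options can be read from /proc/mounts. The sketch below assumes the standard /proc/mounts field layout (device, mount point, filesystem type, options). The helper names `mount_options` and `is_read_write` are hypothetical.

```python
import os


def mount_options(mounts_text, mount_point):
    """Return the option set of the last mount entry for `mount_point`, or None.

    Later entries shadow earlier ones, so the last match is the active mount.
    """
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            found = set(fields[3].split(","))
    return found


def is_read_write(mounts_text, mount_point):
    """True only if the mount point exists and carries the `rw` option."""
    options = mount_options(mounts_text, mount_point)
    return options is not None and "rw" in options


if __name__ == "__main__":
    path = "/proc/mounts"
    if os.path.exists(path):
        with open(path) as handle:
            print("/sysroot rw:", is_read_write(handle.read(), "/sysroot"))
```

Logging this result just before the openvswitch units start would show whether the failures line up with /sysroot still being read-only.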
Version-Release number of selected component (if applicable):
1. OCP 4.8.32
2. RHCOS 4.8.14
How reproducible:
Consistently reproducible.
Steps to Reproduce:
1. Start with a working OCP 4.8.32 cluster on RHCOS 4.8.14, with the IP configuration on a physical NIC.
2. Perform Day 2 operations to move the IP configuration to openvswitch.
3. Reboot the nodes.
4. The openvswitch daemon and database services, and some NetworkManager virtual mounts, fail to start.
For example, this would be the output from the services that fail at startup:
14:06:50 Starting Open vSwitch Database Unit...
14:06:50 [FAILED] Failed to start Open vSwitch Database Unit.
14:06:50 See 'systemctl status ovsdb-server.service' for details.
14:06:50 [ASSERT] Assertion failed for Open vSwitch Delete Transient
14:06:50 [  OK  ] Stopped Open vSwitch Database Unit.
14:06:50 [FAILED] Failed to start Open vSwitch Database Unit.
14:06:50 See 'systemctl status ovsdb-server.service' for details.
14:06:50 [ASSERT] Assertion failed for Open vSwitch Delete Transient
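Raw console captures from this environment can contain garbled colour escape sequences, for example `ÝÝ0;1;31mFAILEDÝ0m¨` in place of `[FAILED]`, most likely from a code page mismatch on the console (an assumption). The following Python sketch cleans such lines and lists the units that failed to start. The helper names and the patterns are assumptions based only on the excerpt above.

```python
import re

# Remnant of an ANSI colour sequence whose ESC byte was lost, e.g. "Ý0;1;31m".
ESCAPE_REMNANT = re.compile(r"Ý[0-9;]+m")
# systemd console status line for a unit that failed to start.
FAILED_UNIT = re.compile(r"\[FAILED\] Failed to start (.+)\.$")


def clean_console_line(line):
    """Drop colour remnants, then map the mis-encoded brackets back."""
    line = ESCAPE_REMNANT.sub("", line)
    return line.replace("Ý", "[").replace("¨", "]")


def failed_units(lines):
    """Return the distinct unit descriptions that failed, in first-seen order."""
    seen = []
    for raw in lines:
        match = FAILED_UNIT.search(clean_console_line(raw).rstrip())
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


if __name__ == "__main__":
    sample = [
        "14:06:50 Starting Open vSwitch Database Unit...",
        "14:06:50 ÝÝ0;1;31mFAILEDÝ0m¨ Failed to start Open vSwitch Database Unit.",
        "14:06:50 ÝÝ0;32m OK Ý0m¨ Stopped Open vSwitch Database Unit.",
    ]
    for unit in failed_units(sample):
        print(unit)
```

Feeding the whole console log through `failed_units` gives a short list of units to inspect with `systemctl status`.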
Actual results:
The OCP nodes have to be rebooted repeatedly until a boot is fast enough for the services to start.
Expected results:
The disk latency should not interfere with required services being available to complete the installation.
Additional info:
Attached is the journalctl log from one of the failing worker nodes. The entries at timestamp 2:06 local time show the services that fail.