Type: Bug
Resolution: Done
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.8
Component/s: Multi-Arch
Labels:
None

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:
None
Story Points:
None
Severity:
None

Target Version:
None
Release Blocker:
None
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Description of problem:
After a successful OCP 4.8.32 cluster installation, we perform Day 2 operations that modify the network configuration: Open vSwitch is enabled and the IP configuration is moved from the physical device to Open vSwitch. When the master nodes are IPL'd for the first time after this change, some required services are not available. Specifically, during bootstrapping, Open vSwitch (daemon and database) and some NetworkManager virtual mounts fail to start because the /sysroot bind mounts do not reach read-write mode fast enough. The etcd sync writes on the master nodes also interfere with each other and trigger “disk is too slow” warnings.

This OCP cluster runs under z/VM, with a disk configuration in which the minidisks of multiple OCP nodes share one EDEV. This can increase disk latency, but higher latency should not prevent the required services mentioned above (Open vSwitch and NetworkManager) from starting; their startup should not depend on fitting into a narrow timing window.

If the nodes are rebooted enough times, a boot eventually gets a fast enough interval on the backing EDEV to make it past the timing window.
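To check whether a given mount has reached read-write mode on an affected node, the mount table can be inspected directly. The following is a minimal Python sketch, not part of the original report: the function names are hypothetical, and which mount points matter here (for example /sysroot itself or a bind mount derived from it) is an assumption that has to be confirmed against the failing units.

```python
def mount_options(mounts_text, mount_point):
    """Return the option list of the last entry for mount_point, or None.

    mounts_text is the content of /proc/self/mounts, whose fields are:
    device, mount point, filesystem type, options, dump, pass.
    The last matching entry is used because later mounts shadow earlier ones.
    """
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            options = fields[3].split(",")
    return options


def is_read_write(mounts_text, mount_point):
    """True only if mount_point is present and mounted with the rw option."""
    options = mount_options(mounts_text, mount_point)
    return options is not None and "rw" in options
```

On a node, it could be called as `is_read_write(open("/proc/self/mounts").read(), "/sysroot")`; a result of `False` while the Open vSwitch units are starting would be consistent with the timing problem described above, though it does not prove it.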

Version-Release number of selected component (if applicable):
1. OCP 4.8.32
2. RHCOS 4.8.14

How reproducible:
Consistently reproducible.

Steps to Reproduce:
1. Start with a working OCP 4.8.32 cluster on RHCOS 4.8.14 with the IP configuration on a physical NIC.
2. Perform Day 2 operations to move the IP configuration to Open vSwitch.
3. Reboot the nodes.
4. Observe that the Open vSwitch daemon, the Open vSwitch database, and some NetworkManager virtual mounts fail to start.

For example, this would be the output from the services that fail at startup:
14:06:50 Starting Open vSwitch Database Unit...
14:06:50 [FAILED] Failed to start Open vSwitch Database Unit.
14:06:50 See 'systemctl status ovsdb-server.service' for details.
14:06:50 [ASSERT] Assertion failed for Open vSwitch Delete Transient
14:06:50 [  OK  ] Stopped Open vSwitch Database Unit.
14:06:50 [FAILED] Failed to start Open vSwitch Database Unit.
14:06:50 See 'systemctl status ovsdb-server.service' for details.
14:06:50 [ASSERT] Assertion failed for Open vSwitch Delete Transient
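
For triage, the failed units can be extracted from console output like the sample above with a short script. The following is a minimal Python sketch, not part of the original report. It accepts clean `[FAILED]` status tags and also tolerates raw console captures in which the brackets appear as `Ý` and `¨` and the colour escape sequences have lost their ESC byte; that character mapping is an assumption based on a likely code-page mismatch, and the function names are hypothetical.

```python
import re

# Assumption: 'Ý' and '¨' stand in for '[' and ']' after a code-page mismatch,
# and the ESC byte of each ANSI colour sequence may have been dropped.
BRACKETS = str.maketrans({"Ý": "[", "¨": "]"})
COLOUR = re.compile(r"\x1b?\[[0-9;]+m")
FAILED = re.compile(r"\[\s*FAILED\s*\]\s+Failed to start (.+?)\.?$")


def clean_line(line):
    """Restore brackets and drop colour sequences from one console line."""
    return COLOUR.sub("", line.translate(BRACKETS)).rstrip()


def failed_units(lines):
    """Return the distinct unit descriptions reported as FAILED, in order."""
    seen = []
    for line in lines:
        match = FAILED.search(clean_line(line))
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen
```

Run over the excerpt above, `failed_units` reports a single entry, the Open vSwitch Database Unit, even though the failure is logged twice.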

Actual results:
The services fail to start on reboot. The only way forward is to keep rebooting the OCP nodes until a boot is fast enough for the services to start.

Expected results:
The disk latency should not interfere with required services being available to complete the installation.

Additional info:
Attached is the journalctl log from one of the failing worker nodes. The entries at 2:06 PM local time (14:06 in the log) show the services that fail.

external trackers

Red Hat Issue Tracker MULTIARCH-2420

Assignee: Jeremy Poulin

Reporter: Philip Chan (Inactive)

Contributors: None

Architect: None

QA Contact: None

Doc Contact: None

Votes: 0

Watchers: 2

Created: 2022/03/22 9:31 PM

Updated: 2025/07/29 5:47 PM

Resolved: 2022/03/31 7:18 PM
