-
Bug
-
Resolution: Done
-
Minor
-
4.13
-
Moderate
-
None
-
False
-
Description of problem:
The default crio.service unit is configured with Restart=on-abnormal and TimeoutStartSec=0. With this configuration, crio.service (and the units that depend on it) can fail permanently at boot if the first start attempt fails.
For example, a possible concurrency issue between the network configuration scripts in OVN Baremetal IPI clusters can cause network-online.target to be reached while the network is still being configured. Once network-online.target is reached, crio and kubelet try to start, and crio can fail with "Failed to start streaming server: listen tcp 192.168.90.42:10010: bind: cannot assign requested address". Eventually the network configuration converges, and the services can then be started manually.
This example happened in some upgrade CI runs in QE and could leave nodes of a cluster 'paused' and not ready until a system administrator manually restarted them.
As a follow-up to the discussion in Slack, and although the network issue deserves a separate investigation, we propose changing the restart policy of the crio.service unit to improve safety and recoverability against "byzantine" faults such as ordering violations in the systemd dependency graph.
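As a rough illustration of the proposal, the fragment below sketches a systemd drop-in that overrides the restart policy. The drop-in path, the choice of Restart=on-failure, and the RestartSec value are assumptions made for this example, not the change that was actually merged. In systemd, on-abnormal restarts a service after an unclean signal, a timeout, or a watchdog expiry, but not after a non-zero exit code, whereas on-failure also covers the non-zero exit code. Assuming crio exits with a non-zero status when the bind fails, this would let it retry once the address becomes available.

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/crio.service.d/10-restart.conf
# (the path and file name are assumptions for illustration only).
[Service]
# Assumed policy: also restart when the process exits with a non-zero code.
# Restart=always (the policy kubelet already uses) would be another option.
Restart=on-failure
# Assumed delay between attempts, to give the network time to converge.
RestartSec=10
```

After adding such a drop-in, `systemctl daemon-reload` is needed for systemd to pick it up.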
How reproducible:
Not always
Steps to Reproduce:
The following steps refer to the example above:
1. Install a 4.11 IPv4-only BM IPI cluster with OVN and a provisioning network (4.11.0-0.nightly-arm64-2022-11-27-133933).
2. Upgrade to 4.12.0-0.nightly-arm64-2022-11-29-032225.
3. After the rebase of a worker and its reboot, the worker does not reconcile, leaving the MCP degraded.
Manual recovery:
4. Log in to the worker and verify that the network is up and that crio is a failed unit.
5. Start crio manually (you might need to start kubelet as well).
6. The worker node object reconciles and the node becomes ready.
Additional info:
- The issue could be related to a temporary network failure.
- kubelet is already set with the `always` restart policy.
Slack conversation: https://coreos.slack.com/archives/CK1AE4ZCK/p1669727807409689
- clones
-
OCPBUGS-4266 crio.service should use a more safe restart policy to provide recoverability against concurrency issues
-
- Closed
-