[OCPBUGS-7887] [4.12] crio.service should use a more safe restart policy to provide recoverability against concurrency issues - Red Hat Issue Tracker

Type: Bug
Resolution: Done
Priority: Minor
Fix Version/s: 4.12.z
Affects Version/s: 4.13
Component/s: Node / CRI-O
Labels:
- triaged

Severity: Moderate
Regression: None
Blocked: False
Blocked Reason: None
Target Version: 4.12.z
SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

The default crio.service unit is configured with:

Restart=on-abnormal
TimeoutStartSec=0

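With this configuration, crio.service (and its dependents) can fail permanently at boot after the first start attempt: Restart=on-abnormal restarts the service after a signal, timeout, or watchdog event, but not after it exits with a non-zero exit code.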

For example, a possible race between the network configuration scripts in OVN bare-metal IPI clusters can cause network-online.target to be reached while the network is still being configured. Once network-online.target is reached, crio and kubelet try to start, and crio can fail with:

"Failed to start streaming server: listen tcp 192.168.90.42:10010: bind: cannot assign requested address"

Eventually, the network configuration converges, and the services can be started manually. This happened in some upgrade CI runs in QE and could leave nodes of a cluster 'paused' and not ready until a system administrator manually restarted them.

As a follow-up to the discussion in Slack, and although the network issue deserves a separate investigation, we propose changing the restart policy of the crio.service unit to provide more safety and recoverability against "byzantine" faults such as ordering violations in the systemd dependency graph.
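A minimal sketch of what such a change could look like as a systemd drop-in. The drop-in path, the choice of `on-failure`, and the `RestartSec` value are assumptions for illustration only; they are not taken from this issue and are not necessarily the fix that was shipped.

```ini
# Hypothetical drop-in: /etc/systemd/system/crio.service.d/10-restart-policy.conf
[Service]
# Assumption: on-failure also restarts the service after a non-zero exit code,
# which on-abnormal does not cover.
Restart=on-failure
# Assumption: illustrative delay between attempts, giving the network
# configuration time to converge before crio tries to bind again.
RestartSec=10s
```

With a policy along these lines, a crio start that fails because the address is not yet assigned would be retried by systemd instead of leaving the unit failed until someone starts it by hand. `Restart=always`, which kubelet already uses, would be another option.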

How reproducible:

Not always

Steps to Reproduce:

The following steps refer to the example above:

1. Install a 4.11 IPv4-only BM IPI cluster with OVN and a provisioning network (4.11.0-0.nightly-arm64-2022-11-27-133933)
2. Upgrade to 4.12.0-0.nightly-arm64-2022-11-29-032225
3. After a worker is rebased and rebooted, it does not reconcile, leaving the MCP degraded.
(Manual recovery)
4. Log in to the worker and verify that the network is up and crio is a failed unit.
5. Start crio manually (you might need to start kubelet as well).
6. The worker node object reconciles and the node becomes ready.

Additional info:

- The issue could be related to a temporary network failure.
- kubelet is already set with the `always` restart policy

Slack conversation: https://coreos.slack.com/archives/CK1AE4ZCK/p1669727807409689

clones: OCPBUGS-4266 crio.service should use a more safe restart policy to provide recoverability against concurrency issues (Closed)

Assignee: Peter Hunt
Reporter: Alessandro Di Stefano
QA Contact: Sunil Choudhary
Votes: 0
Watchers: 4
Created: 2023/02/22 3:22 PM
Updated: 2023/11/08 11:08 AM
Resolved: 2023/05/23 1:26 PM
