OCPBUGS-30747

MachineConfig containerruntime is recreated if it has been removed and machine-config-controller is restarted


      Description of problem:

          If the generated containerruntime MachineConfig is removed for any reason and the machine-config-controller pod is later recycled, the controller recreates the MachineConfig. A new rendered config is then produced and rolled out, eventually causing the nodes to reboot.

      Version-Release number of selected component (if applicable):

          tested on 4.12

      How reproducible:

          Always, with the steps below. It happened in a customer's VMware environment (IPI) and was reproduced in a Resource Hub cluster (UPI).

      Steps to Reproduce:

      1. Create a ContainerRuntimeConfig (a quick check follows the manifest below):
      ~~~
      apiVersion: machineconfiguration.openshift.io/v1
      kind: ContainerRuntimeConfig
      metadata:
        name: some-containerruntime-config
      spec:
        machineConfigPoolSelector:
          matchLabels:
            pools.operator.machineconfiguration.openshift.io/worker: ''
        containerRuntimeConfig:
          overlaySize: "0"
          pidsLimit: 4096
      ~~~
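      
      A quick check that the ctrcfg was picked up and the generated MachineConfig exists (the manifest file name is illustrative, not from the original report):
      ~~~
      # "some-containerruntime-config.yaml" is an illustrative name for the manifest above
      $ oc apply -f some-containerruntime-config.yaml
      $ oc get mc 99-worker-generated-containerruntime
      ~~~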
      
      2. Remove the generated containerruntime MachineConfig linked to the ContainerRuntimeConfig:
      ~~~
      $ oc delete mc 99-worker-generated-containerruntime
      ~~~
      
      3. After the MachineConfig is removed, the controller does not recreate it, even over a long period (days, weeks, or months); see the check below.
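      
      A quick check confirming the MachineConfig stays absent until the controller restarts (output is illustrative; the exact NotFound wording may vary by oc version):
      ~~~
      $ oc get mc 99-worker-generated-containerruntime
      Error from server (NotFound): machineconfigs.machineconfiguration.openshift.io "99-worker-generated-containerruntime" not found
      ~~~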
      
      4. Recycle the controller pod:
      ~~~
      $ oc delete pod machine-config-controller-6d94bf9cb8-g6n6s  
      ~~~
      
      5. The MachineConfig is recreated, a new rendered config is generated, and the node(s) are rebooted:
      ~~~
      $ oc get mc
      [...]
      99-worker-generated-containerruntime                    01f999aa710dd62d43bee8cf1e2ca6a226c7dce3   3.2.0             1s <<====
      [...]
      
      $ oc get nodes 
      NAME                                         STATUS                     ROLES    AGE   VERSION
      
      worker-0.ocp4.lab.psi.pnq2.redhat.com   Ready,SchedulingDisabled   worker   18h   v1.25.11+1485cc9
      ~~~
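      
      The unexpected rollout can also be watched on the worker pool while it happens (a sketch; the pool name and the status columns shown will differ per cluster):
      ~~~
      $ oc get mcp worker -w
      ~~~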
         

      Actual results:

         The MachineConfig is recreated after the controller pod is recycled, triggering a rendered-config rollout and node reboots.

      Expected results:

        An alert should be fired warning about the missing MachineConfig.

      Additional info:

         These events happened on a large production cluster. The nodes started rebooting seemingly out of the blue, causing serious application outages. The related MachineConfigs had been missing for months, and the customer only realized it when the incident took place.
      
      Suggestion: after applying the ctrcfg (ContainerRuntimeConfig), the cluster could check the finalizers and verify that the referenced MachineConfigs are still present in the cluster; see the sketch after the output below.
      ~~~
      oc get ctrcfg -o yaml
      apiVersion: v1
      items:
      - apiVersion: machineconfiguration.openshift.io/v1
        kind: ContainerRuntimeConfig
        metadata:
          annotations:
            machineconfiguration.openshift.io/mc-name-suffix: ""
          creationTimestamp: "2024-03-06T12:42:16Z"
          finalizers:
          - 99-worker-generated-containerruntime  <<<====
          generation: 2
      
      ~~~
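      
      A minimal shell sketch of that check (purely illustrative, not part of the MCO): it reads every finalizer recorded on the ctrcfg objects and reports any referenced MachineConfig that no longer exists.
      ~~~
      # Illustrative only: compare ctrcfg finalizers against the MachineConfigs present in the cluster
      for mc in $(oc get ctrcfg -o jsonpath='{.items[*].metadata.finalizers[*]}'); do
        oc get mc "$mc" >/dev/null 2>&1 || echo "MachineConfig $mc is referenced by a ctrcfg finalizer but is missing"
      done
      ~~~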

       
