Bug
Resolution: Not a Bug
Major
4.12.z
Critical
Description of problem:
If for some reason a generated containerruntime MachineConfig is removed and the machine-config-controller is later recycled, the cluster recreates the containerruntime MachineConfig; a new rendered config is then produced for the pool, eventually causing the nodes to reboot.
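For context, the generated mc name is recorded as a finalizer on the ctrcfg (see the yaml in "Additional info" below). A minimal way to inspect the objects involved, using the names from the reproduction steps:
~~~
$ oc get ctrcfg some-containerruntime-config -o jsonpath='{.metadata.finalizers[*]}'   # generated mc name, stored as a finalizer
$ oc get mc 99-worker-generated-containerruntime                                       # the generated containerruntime mc
$ oc get mcp worker -o jsonpath='{.spec.configuration.name}'                           # rendered config currently targeted by the pool
~~~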
Version-Release number of selected component (if applicable):
Tested on 4.12.
How reproducible:
See the steps to reproduce. It happened in a customer's VMware environment (IPI) and in Resource Hub (UPI).
Steps to Reproduce:
1. Create a ContainerRuntimeConfig:
~~~
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: some-containerruntime-config
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ''
  containerRuntimeConfig:
    overlaySize: "0"
    pidsLimit: 4096
~~~
2. Remove the containerruntime mc linked to the ContainerRuntimeConfig:
~~~
$ oc delete mc 99-worker-generated-containerruntime
~~~
3. After the mc is removed, the controller does not recreate it, even over a long period of time (days, weeks, months).
4. Recycle the controller pod:
~~~
$ oc delete pod machine-config-controller-6d94bf9cb8-g6n6s
~~~
5. The mc is recreated, a new rendered config is generated, and the node(s) is/are rebooted:
~~~
$ oc get mc
[...]
99-worker-generated-containerruntime   01f999aa710dd62d43bee8cf1e2ca6a226c7dce3   3.2.0   1s   <<====
[...]
$ oc get nodes
NAME                                    STATUS                     ROLES    AGE   VERSION
worker-0.ocp4.lab.psi.pnq2.redhat.com   Ready,SchedulingDisabled   worker   18h   v1.25.11+1485cc9
~~~
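After step 5, a quick way to confirm the rollout is under way is to watch the pool and node status (a minimal sketch; `worker` is the pool targeted by the example ctrcfg above):
~~~
$ oc get mcp worker -w   # UPDATING flips to True as the pool moves to the new rendered config
$ oc get nodes -w        # nodes go Ready,SchedulingDisabled as they are drained and rebooted
~~~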
Actual results:
The mc is recreated after the controller pod is recycled, triggering a new rendered config and node reboots.
Expected results:
An alert should fire warning about the missing mc.
Additional info:
These events happened on a huge production cluster. The nodes started rebooting apparently out of the blue, causing serious application outages. The related mc's had been missing for months, and the customer only realized it when the incident took place.

Suggestion: after applying the ctrcfg (ContainerRuntimeConfig), the cluster could check the finalizers and verify that the mc's they reference are present in the cluster.
~~~
$ oc get ctrcfg -o yaml
apiVersion: v1
items:
- apiVersion: machineconfiguration.openshift.io/v1
  kind: ContainerRuntimeConfig
  metadata:
    annotations:
      machineconfiguration.openshift.io/mc-name-suffix: ""
    creationTimestamp: "2024-03-06T12:42:16Z"
    finalizers:
    - 99-worker-generated-containerruntime   <<<====
    generation: 2
~~~
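Until such a check exists in the product, the comparison can be done manually. The sketch below uses only standard `oc` jsonpath queries: for every ctrcfg, it verifies that each mc named in its finalizers still exists and warns if one is missing.
~~~
# For every ContainerRuntimeConfig, check that the MachineConfigs recorded in
# its finalizers are still present in the cluster.
for name in $(oc get ctrcfg -o jsonpath='{.items[*].metadata.name}'); do
  for mc in $(oc get ctrcfg "$name" -o jsonpath='{.metadata.finalizers[*]}'); do
    oc get mc "$mc" >/dev/null 2>&1 || \
      echo "WARNING: ctrcfg $name references missing MachineConfig $mc"
  done
done
~~~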