OCPBUGS-30747

MachineConfig containerruntime is recreated if it has been removed and machine-config-controller is restarted


      Description of problem:

          If the generated containerruntime MachineConfig is removed for any reason and the machine-config-controller pod is later recycled, the controller recreates the MachineConfig. A new rendered config is then produced and rolled out, eventually causing the nodes to reboot.

      Version-Release number of selected component (if applicable):

          tested on 4.12

      How reproducible:

          Always, with the steps below. It happened in a customer's VMware environment (IPI) and was reproduced in a Resource Hub cluster (UPI).

      Steps to Reproduce:

      1. Create a ContainerRuntimeConfig (a quick check follows the manifest below):
      ~~~
      apiVersion: machineconfiguration.openshift.io/v1
      kind: ContainerRuntimeConfig
      metadata:
        name: some-containerruntime-config
      spec:
        machineConfigPoolSelector:
          matchLabels:
            pools.operator.machineconfiguration.openshift.io/worker: ''
        containerRuntimeConfig:
          overlaySize: "0"
          pidsLimit: 4096
      ~~~
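      
      A quick check that the ctrcfg was picked up and the generated MachineConfig exists (the manifest file name is illustrative, not from the original report):
      ~~~
      # "some-containerruntime-config.yaml" is an illustrative name for the manifest above
      $ oc apply -f some-containerruntime-config.yaml
      $ oc get mc 99-worker-generated-containerruntime
      ~~~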
      
      2. Remove the generated containerruntime MachineConfig linked to the ContainerRuntimeConfig:
      ~~~
      $ oc delete mc 99-worker-generated-containerruntime
      ~~~
      
      3. After the MachineConfig is removed, the controller does not recreate it, even over a long period (days, weeks, or months); see the check below.
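      
      A quick check confirming the MachineConfig stays absent until the controller restarts (output is illustrative; the exact NotFound wording may vary by oc version):
      ~~~
      $ oc get mc 99-worker-generated-containerruntime
      Error from server (NotFound): machineconfigs.machineconfiguration.openshift.io "99-worker-generated-containerruntime" not found
      ~~~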
      
      4. Recycle the controller pod:
      ~~~
      $ oc delete pod machine-config-controller-6d94bf9cb8-g6n6s  
      ~~~
      
      5. The MachineConfig is recreated, a new rendered config is generated, and the node(s) are rebooted:
      ~~~
      $ oc get mc
      [...]
      99-worker-generated-containerruntime                    01f999aa710dd62d43bee8cf1e2ca6a226c7dce3   3.2.0             1s <<====
      [...]
      
      $ oc get nodes 
      NAME                                         STATUS                     ROLES    AGE   VERSION
      
      worker-0.ocp4.lab.psi.pnq2.redhat.com   Ready,SchedulingDisabled   worker   18h   v1.25.11+1485cc9
      ~~~
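      
      The unexpected rollout can also be watched on the worker pool while it happens (a sketch; the pool name and the status columns shown will differ per cluster):
      ~~~
      $ oc get mcp worker -w
      ~~~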
         

      Actual results:

         The MachineConfig is recreated after the controller pod is recycled, triggering a rendered-config rollout and node reboots.

      Expected results:

        An alert should be fired warning about the missing MachineConfig.

      Additional info:

         These events happened on a large production cluster. The nodes started rebooting seemingly out of the blue, causing serious application outages. The related MachineConfigs had been missing for months, and the customer only realized it when the incident took place.
      
      Suggestion: after applying the ctrcfg (ContainerRuntimeConfig), the cluster could check the finalizers and verify that the referenced MachineConfigs are still present in the cluster; see the sketch after the output below.
      ~~~
      oc get ctrcfg -o yaml
      apiVersion: v1
      items:
      - apiVersion: machineconfiguration.openshift.io/v1
        kind: ContainerRuntimeConfig
        metadata:
          annotations:
            machineconfiguration.openshift.io/mc-name-suffix: ""
          creationTimestamp: "2024-03-06T12:42:16Z"
          finalizers:
          - 99-worker-generated-containerruntime  <<<====
          generation: 2
      
      ~~~
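      
      A minimal shell sketch of that check (purely illustrative, not part of the MCO): it reads every finalizer recorded on the ctrcfg objects and reports any referenced MachineConfig that no longer exists.
      ~~~
      # Illustrative only: compare ctrcfg finalizers against the MachineConfigs present in the cluster
      for mc in $(oc get ctrcfg -o jsonpath='{.items[*].metadata.finalizers[*]}'); do
        oc get mc "$mc" >/dev/null 2>&1 || echo "MachineConfig $mc is referenced by a ctrcfg finalizer but is missing"
      done
      ~~~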

       
