OpenShift Bugs / OCPBUGS-48823

machine-config-daemon pods in CrashLoopBackOff in RHOCP4

    • Critical
    • Customer Escalated

      Description of problem:

      Cluster upgrade from 4.14.30 to 4.15.41 is stuck at 97% with the error "Cluster operator machine-config is not available".
      
      machine-config-daemon pods are in CrashLoopBackOff state with exit code 137 on worker as well as master nodes.
      
      All 3 worker nodes show high memory consumption.
      
      Restarting the pods and rebooting the nodes did not help.
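
      For triage, a quick way to check why the containers exit with 137 (a sketch, assuming cluster-admin access and the default k8s-app=machine-config-daemon pod label):
      ~~~
      # List why each machine-config-daemon container last terminated (OOMKilled vs. Error)
      oc -n openshift-machine-config-operator get pods -l k8s-app=machine-config-daemon \
        -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
      ~~~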

      Version-Release number of selected component (if applicable):

      4.14

      How reproducible:

          

      Actual results:

      The upgrade is stuck, and the machine-config-daemon pods running on the worker nodes are in CrashLoopBackOff state.

      Expected results:

      All machine-config-daemon pods should be healthy and the cluster upgrade should complete successfully.

      Additional info:

      Upgrade stuck:
      ~~~
      $ oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.14.30   True        True          4h55m   Unable to apply 4.15.41: the cluster operator machine-config is not available
      ~~~
      
      machine-config CO degraded:
      ~~~
      NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
      machine-config                             4.14.30   False       True          True       12d
      ~~~
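      
      The ClusterOperator and MachineConfigPool conditions usually name the underlying failure; an example of pulling them out (standard oc commands, nothing cluster-specific assumed):
      ~~~
      # Show each machine-config ClusterOperator condition with its message
      oc get co machine-config -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{": "}{.message}{"\n"}{end}'

      # MachineConfigPool status often carries the node/daemon-level error
      oc get mcp
      ~~~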
      
      Pods are stuck with exit code 137:
      ~~~
      NAME                                                              READY   STATUS             RESTARTS   AGE    IP             NODE                                         NOMINATED NODE   READINESS GATES
      machine-config-daemon-fpsx7                                       1/2     CrashLoopBackOff   3127       11d    172.36.5.148   ip-172-36-5-148.eu-west-1.compute.internal   <none>           <none>
      machine-config-daemon-hrtzv                                       1/2     CrashLoopBackOff   3127       11d    172.36.6.236   ip-172-36-6-236.eu-west-1.compute.internal   <none>           <none>
      machine-config-daemon-llwkf                                       1/2     CrashLoopBackOff   2548       9d     172.36.6.80    ip-172-36-6-80.eu-west-1.compute.internal    <none>           <none>
      ~~~
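      
      The previous-container logs and the kubelet's recorded termination reason for one of these pods should show whether this is an OOM kill or a crash (a sketch; the pod name is taken from the listing above):
      ~~~
      # Logs from the last crashed machine-config-daemon container instance
      oc -n openshift-machine-config-operator logs machine-config-daemon-fpsx7 \
        -c machine-config-daemon --previous | tail -n 100

      # Termination reason recorded by the kubelet
      oc -n openshift-machine-config-operator describe pod machine-config-daemon-fpsx7 | grep -A 5 'Last State'
      ~~~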
      
      machine-config-controller-87fc68c7d-xxv2w -c machine-config-controller logs:
      ~~~
      2025-01-19T05:46:56.923950707Z W0119 05:46:56.923763       1 reflector.go:456] github.com/openshift/client-go/config/informers/externalversions/factory.go:101: watch of *v1.Image ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
      2025-01-19T05:46:56.923965564Z W0119 05:46:56.923770       1 reflector.go:456] k8s.io/client-go/informers/factory.go:150: watch of *v1.Node ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
      2025-01-19T05:46:56.923970055Z W0119 05:46:56.923777       1 reflector.go:456] github.com/openshift/client-go/config/informers/externalversions/factory.go:101: watch of *v1.APIServer ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
      2025-01-19T05:46:56.923988142Z W0119 05:46:56.923791       1 reflector.go:456] github.com/openshift/client-go/config/informers/externalversions/factory.go:101: watch of *v1.FeatureGate ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
      2025-01-19T05:46:56.923988142Z W0119 05:46:56.923793       1 reflector.go:456] github.com/openshift/client-go/operator/informers/externalversions/factory.go:101: watch of *v1alpha1.ImageContentSourcePolicy ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
      
      2025-01-19T05:47:59.645252341Z I0119 05:47:59.645211       1 container_runtime_config_controller.go:417] Error syncing image config openshift-config: could not Create/Update MachineConfig: could not find MachineConfig: the server was unable to return a response in the time allotted, but may still be processing the request (get machineconfigs.machineconfiguration.openshift.io 99-worker-generated-registries)
      2025-01-19T05:48:14.822984766Z I0119 05:48:14.822927       1 template_controller.go:418] Error syncing controllerconfig machine-config-controller: failed to sync status for Timeout: request did not complete within requested timeout - context deadline exceeded
      2025-01-19T05:48:18.529294086Z I0119 05:48:18.529246       1 render_controller.go:377] Error syncing machineconfigpool master: Timeout: request did not complete within requested timeout - context deadline exceeded
      2025-01-19T05:48:25.946829953Z I0119 05:48:25.946787       1 render_controller.go:377] Error syncing machineconfigpool master: Operation cannot be fulfilled on machineconfigpools.machineconfiguration.openshift.io "master": the object has been modified; please apply your changes to the latest version and try again
      2025-01-19T05:48:33.954412212Z I0119 05:48:33.954352       1 render_controller.go:377] Error syncing machineconfigpool master: ControllerConfig has not completed: completed(false) running(false) failing(true)
      
      2025-01-20T11:43:45.694798471Z I0120 11:43:45.694732       1 template_controller.go:134] Re-syncing ControllerConfig due to secret pull-secret change
      2025-01-20T12:19:41.837904340Z I0120 12:19:41.837844       1 template_controller.go:134] Re-syncing ControllerConfig due to secret pull-secret change
      ~~~
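      
      The watch failures and timeouts above point at the API server rather than the MCO itself, so control-plane health is worth checking alongside machine-config (standard oc commands, no cluster-specific assumptions):
      ~~~
      # Control-plane operators that would explain "http2: client connection lost" and request timeouts
      oc get co kube-apiserver etcd machine-config

      # API server pods and the masters they run on
      oc -n openshift-kube-apiserver get pods -o wide
      ~~~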
      
      Observed high resource utilization on all 3 worker nodes:
      ~~~
      Node : ip-172-36-5-148.eu-west-1.compute.internal
      MEMORY
        Stats graphed as percent of MemTotal:
          MemUsed    ▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊..  96.8%
      
      Node : ip-172-36-6-236.eu-west-1.compute.internal
      LoadAvg:   [4 CPU] 11.51 (288%), 4.58 (114%), 2.45 (61%)
      MEMORY
        Stats graphed as percent of MemTotal:
          MemUsed    ▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊.  98.5%
      
      Node : ip-172-36-6-80.eu-west-1.compute.internal
      MEMORY
        Stats graphed as percent of MemTotal:
          MemUsed    ▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊.  98.8%
      ~~~
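      
      To see what is consuming the memory on the workers, a hedged example (node name taken from the stats above; requires oc debug access to the nodes):
      ~~~
      # Cluster-level view of node CPU/memory pressure
      oc adm top nodes

      # Largest processes by resident memory on one affected worker
      oc debug node/ip-172-36-6-80.eu-west-1.compute.internal -- chroot /host \
        sh -c "ps aux --sort=-rss | head -n 15"

      # Kernel OOM-killer evidence, if any
      oc debug node/ip-172-36-6-80.eu-west-1.compute.internal -- chroot /host \
        sh -c "dmesg -T | grep -iE 'out of memory|oom' | tail -n 20"
      ~~~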

              rphillip@redhat.com Ryan Phillips
              rhn-support-sdharma Suruchi Dharma