- Bug
- Resolution: Not a Bug
- Critical
- None
- 4.14.z
- Critical
- None
- False
- Customer Escalated
Description of problem:
Cluster upgrade from 4.14.30 to 4.15.41 is stuck at 97% with the error "Cluster operator machine-config is not available". The machine-config-daemon pods are in CrashLoopBackOff with exit code 137 on both worker and master nodes. All 3 worker nodes show high memory consumption. Restarting the pods and rebooting the nodes did not help.
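For reference, exit code 137 corresponds to SIGKILL (128 + 9), which together with the near-100% memory usage on the workers points at OOM kills. Something like the following could confirm that (a sketch; the pod and node names are taken from the output in Additional info below):
~~~
# How did the machine-config-daemon container last terminate?
# 137 = 128 + SIGKILL, typically the kernel OOM killer or an eviction.
oc -n openshift-machine-config-operator get pod machine-config-daemon-fpsx7 \
  -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}'

# Look for OOM-related events anywhere on the cluster
oc get events -A | grep -i oom

# Confirm memory pressure at the node level
oc adm top nodes
oc describe node ip-172-36-5-148.eu-west-1.compute.internal
~~~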
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Actual results:
The upgrade is stuck, and the machine-config-daemon pods running on the worker nodes are stuck in CrashLoopBackOff.
Expected results:
All machine-config-daemon pods should be healthy and the cluster should upgrade successfully.
Additional info:
Upgrade stuck:
~~~
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.30   True        True          4h55m   Unable to apply 4.15.41: the cluster operator machine-config is not available
~~~
machine-config CO degraded:
~~~
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-config   4.14.30   False       True          True       12d
~~~
Pods are stuck with exit code 137:
~~~
NAME                          READY   STATUS             RESTARTS   AGE   IP             NODE                                         NOMINATED NODE   READINESS GATES
machine-config-daemon-fpsx7   1/2     CrashLoopBackOff   3127       11d   172.36.5.148   ip-172-36-5-148.eu-west-1.compute.internal   <none>           <none>
machine-config-daemon-hrtzv   1/2     CrashLoopBackOff   3127       11d   172.36.6.236   ip-172-36-6-236.eu-west-1.compute.internal   <none>           <none>
machine-config-daemon-llwkf   1/2     CrashLoopBackOff   2548       9d    172.36.6.80    ip-172-36-6-80.eu-west-1.compute.internal    <none>           <none>
~~~
machine-config-controller-87fc68c7d-xxv2w -c machine-config-controller logs:
~~~
2025-01-19T05:46:56.923950707Z W0119 05:46:56.923763       1 reflector.go:456] github.com/openshift/client-go/config/informers/externalversions/factory.go:101: watch of *v1.Image ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
2025-01-19T05:46:56.923965564Z W0119 05:46:56.923770       1 reflector.go:456] k8s.io/client-go/informers/factory.go:150: watch of *v1.Node ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
2025-01-19T05:46:56.923970055Z W0119 05:46:56.923777       1 reflector.go:456] github.com/openshift/client-go/config/informers/externalversions/factory.go:101: watch of *v1.APIServer ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
2025-01-19T05:46:56.923988142Z W0119 05:46:56.923791       1 reflector.go:456] github.com/openshift/client-go/config/informers/externalversions/factory.go:101: watch of *v1.FeatureGate ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
2025-01-19T05:46:56.923988142Z W0119 05:46:56.923793       1 reflector.go:456] github.com/openshift/client-go/operator/informers/externalversions/factory.go:101: watch of *v1alpha1.ImageContentSourcePolicy ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
2025-01-19T05:47:59.645252341Z I0119 05:47:59.645211       1 container_runtime_config_controller.go:417] Error syncing image config openshift-config: could not Create/Update MachineConfig: could not find MachineConfig: the server was unable to return a response in the time allotted, but may still be processing the request (get machineconfigs.machineconfiguration.openshift.io 99-worker-generated-registries)
2025-01-19T05:48:14.822984766Z I0119 05:48:14.822927       1 template_controller.go:418] Error syncing controllerconfig machine-config-controller: failed to sync status for Timeout: request did not complete within requested timeout - context deadline exceeded
2025-01-19T05:48:18.529294086Z I0119 05:48:18.529246       1 render_controller.go:377] Error syncing machineconfigpool master: Timeout: request did not complete within requested timeout - context deadline exceeded
2025-01-19T05:48:25.946829953Z I0119 05:48:25.946787       1 render_controller.go:377] Error syncing machineconfigpool master: Operation cannot be fulfilled on machineconfigpools.machineconfiguration.openshift.io "master": the object has been modified; please apply your changes to the latest version and try again
2025-01-19T05:48:33.954412212Z I0119 05:48:33.954352       1 render_controller.go:377] Error syncing machineconfigpool master: ControllerConfig has not completed: completed(false) running(false) failing(true)
2025-01-20T11:43:45.694798471Z I0120 11:43:45.694732       1 template_controller.go:134] Re-syncing ControllerConfig due to secret pull-secret change
2025-01-20T12:19:41.837904340Z I0120 12:19:41.837844       1 template_controller.go:134] Re-syncing ControllerConfig due to secret pull-secret change
~~~
Observed high resource utilization on all 3 worker nodes:
~~~
Node : ip-172-36-5-148.eu-west-1.compute.internal
MEMORY
  Stats graphed as percent of MemTotal:
  MemUsed  ▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊..  96.8%

Node : ip-172-36-6-236.eu-west-1.compute.internal
  LoadAvg: [4 CPU] 11.51 (288%), 4.58 (114%), 2.45 (61%)
MEMORY
  Stats graphed as percent of MemTotal:
  MemUsed  ▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊.  98.5%

Node : ip-172-36-6-80.eu-west-1.compute.internal
MEMORY
  Stats graphed as percent of MemTotal:
  MemUsed  ▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊.  98.8%
~~~
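Given the memory figures above and the lost-watch/timeout errors from the controller, a possible next step is to identify what is actually consuming memory on the affected workers; a sketch (node name taken from above, assuming standard oc/procps behaviour for the sort flags):
~~~
# Top memory consumers by pod across the cluster
oc adm top pods -A --sort-by=memory | head -n 20

# Inspect a worker host directly via a debug pod
oc debug node/ip-172-36-5-148.eu-west-1.compute.internal -- chroot /host \
  ps aux --sort=-%mem | head -n 15
~~~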