OpenShift Bugs / OCPBUGS-48823

machine-config-daemon pods in CrashLoopBackOff in RHOCP4

    • Critical
    • Customer Escalated

      Description of problem:

      Cluster upgrade from 4.14.30 to 4.15.41 is stuck at 97% with the error "Cluster operator machine-config is not available".
      
      machine-config-daemon pods are in CrashLoopBackOff state with exit code 137 on worker as well as master nodes.
      
      All 3 worker nodes show high memory consumption.
      
      Restarting the pods and rebooting the nodes did not help.
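
      For triage, a quick way to check why the containers exit with 137 (a sketch, assuming cluster-admin access and the default k8s-app=machine-config-daemon pod label):
      ~~~
      # List why each machine-config-daemon container last terminated (OOMKilled vs. Error)
      oc -n openshift-machine-config-operator get pods -l k8s-app=machine-config-daemon \
        -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
      ~~~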

      Version-Release number of selected component (if applicable):

      4.14

      How reproducible:

          

      Actual results:

      The upgrade is stuck, and the machine-config-daemon pods running on the worker nodes are in CrashLoopBackOff state.

      Expected results:

      All machine-config-daemon pods should be healthy and the cluster upgrade should complete successfully.

      Additional info:

      Upgrade stuck:
      ~~~
      $ oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.14.30   True        True          4h55m   Unable to apply 4.15.41: the cluster operator machine-config is not available
      ~~~
      
      machine-config CO degraded:
      ~~~
      NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
      machine-config                             4.14.30   False       True          True       12d
      ~~~
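      
      The ClusterOperator and MachineConfigPool conditions usually name the underlying failure; an example of pulling them out (standard oc commands, nothing cluster-specific assumed):
      ~~~
      # Show each machine-config ClusterOperator condition with its message
      oc get co machine-config -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{": "}{.message}{"\n"}{end}'

      # MachineConfigPool status often carries the node/daemon-level error
      oc get mcp
      ~~~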
      
      Pods are stuck with exit code 137:
      ~~~
      NAME                                                              READY   STATUS             RESTARTS   AGE    IP             NODE                                         NOMINATED NODE   READINESS GATES
      machine-config-daemon-fpsx7                                       1/2     CrashLoopBackOff   3127       11d    172.36.5.148   ip-172-36-5-148.eu-west-1.compute.internal   <none>           <none>
      machine-config-daemon-hrtzv                                       1/2     CrashLoopBackOff   3127       11d    172.36.6.236   ip-172-36-6-236.eu-west-1.compute.internal   <none>           <none>
      machine-config-daemon-llwkf                                       1/2     CrashLoopBackOff   2548       9d     172.36.6.80    ip-172-36-6-80.eu-west-1.compute.internal    <none>           <none>
      ~~~
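      
      The previous-container logs and the kubelet's recorded termination reason for one of these pods should show whether this is an OOM kill or a crash (a sketch; the pod name is taken from the listing above):
      ~~~
      # Logs from the last crashed machine-config-daemon container instance
      oc -n openshift-machine-config-operator logs machine-config-daemon-fpsx7 \
        -c machine-config-daemon --previous | tail -n 100

      # Termination reason recorded by the kubelet
      oc -n openshift-machine-config-operator describe pod machine-config-daemon-fpsx7 | grep -A 5 'Last State'
      ~~~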
      
      machine-config-controller-87fc68c7d-xxv2w -c machine-config-controller logs:
      ~~~
      2025-01-19T05:46:56.923950707Z W0119 05:46:56.923763       1 reflector.go:456] github.com/openshift/client-go/config/informers/externalversions/factory.go:101: watch of *v1.Image ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
      2025-01-19T05:46:56.923965564Z W0119 05:46:56.923770       1 reflector.go:456] k8s.io/client-go/informers/factory.go:150: watch of *v1.Node ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
      2025-01-19T05:46:56.923970055Z W0119 05:46:56.923777       1 reflector.go:456] github.com/openshift/client-go/config/informers/externalversions/factory.go:101: watch of *v1.APIServer ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
      2025-01-19T05:46:56.923988142Z W0119 05:46:56.923791       1 reflector.go:456] github.com/openshift/client-go/config/informers/externalversions/factory.go:101: watch of *v1.FeatureGate ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
      2025-01-19T05:46:56.923988142Z W0119 05:46:56.923793       1 reflector.go:456] github.com/openshift/client-go/operator/informers/externalversions/factory.go:101: watch of *v1alpha1.ImageContentSourcePolicy ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
      
      2025-01-19T05:47:59.645252341Z I0119 05:47:59.645211       1 container_runtime_config_controller.go:417] Error syncing image config openshift-config: could not Create/Update MachineConfig: could not find MachineConfig: the server was unable to return a response in the time allotted, but may still be processing the request (get machineconfigs.machineconfiguration.openshift.io 99-worker-generated-registries)
      2025-01-19T05:48:14.822984766Z I0119 05:48:14.822927       1 template_controller.go:418] Error syncing controllerconfig machine-config-controller: failed to sync status for Timeout: request did not complete within requested timeout - context deadline exceeded
      2025-01-19T05:48:18.529294086Z I0119 05:48:18.529246       1 render_controller.go:377] Error syncing machineconfigpool master: Timeout: request did not complete within requested timeout - context deadline exceeded
      2025-01-19T05:48:25.946829953Z I0119 05:48:25.946787       1 render_controller.go:377] Error syncing machineconfigpool master: Operation cannot be fulfilled on machineconfigpools.machineconfiguration.openshift.io "master": the object has been modified; please apply your changes to the latest version and try again
      2025-01-19T05:48:33.954412212Z I0119 05:48:33.954352       1 render_controller.go:377] Error syncing machineconfigpool master: ControllerConfig has not completed: completed(false) running(false) failing(true)
      
      2025-01-20T11:43:45.694798471Z I0120 11:43:45.694732       1 template_controller.go:134] Re-syncing ControllerConfig due to secret pull-secret change
      2025-01-20T12:19:41.837904340Z I0120 12:19:41.837844       1 template_controller.go:134] Re-syncing ControllerConfig due to secret pull-secret change
      ~~~
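      
      The watch failures and timeouts above point at the API server rather than the MCO itself, so control-plane health is worth checking alongside machine-config (standard oc commands, no cluster-specific assumptions):
      ~~~
      # Control-plane operators that would explain "http2: client connection lost" and request timeouts
      oc get co kube-apiserver etcd machine-config

      # API server pods and the masters they run on
      oc -n openshift-kube-apiserver get pods -o wide
      ~~~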
      
      Observed high resource utilization on all 3 worker nodes:
      ~~~
      Node : ip-172-36-5-148.eu-west-1.compute.internal
      MEMORY
        Stats graphed as percent of MemTotal:
          MemUsed    ▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊..  96.8%
      
      Node : ip-172-36-6-236.eu-west-1.compute.internal
      LoadAvg:   [4 CPU] 11.51 (288%), 4.58 (114%), 2.45 (61%)
      MEMORY
        Stats graphed as percent of MemTotal:
          MemUsed    ▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊.  98.5%
      
      Node : ip-172-36-6-80.eu-west-1.compute.internal
      MEMORY
        Stats graphed as percent of MemTotal:
          MemUsed    ▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊.  98.8%
      ~~~
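      
      To see what is consuming the memory on the workers, a hedged example (node name taken from the stats above; requires oc debug access to the nodes):
      ~~~
      # Cluster-level view of node CPU/memory pressure
      oc adm top nodes

      # Largest processes by resident memory on one affected worker
      oc debug node/ip-172-36-6-80.eu-west-1.compute.internal -- chroot /host \
        sh -c "ps aux --sort=-rss | head -n 15"

      # Kernel OOM-killer evidence, if any
      oc debug node/ip-172-36-6-80.eu-west-1.compute.internal -- chroot /host \
        sh -c "dmesg -T | grep -iE 'out of memory|oom' | tail -n 20"
      ~~~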

              rphillip@redhat.com Ryan Phillips
              rhn-support-sdharma Suruchi Dharma