Bug
Resolution: Done-Errata
Critical
4.18
Quality / Stability / Reliability
Description of problem:
Less than 24 hours after a successful deployment, the spoke cluster reports the node as "Not Ready".

# oc get nodes helix82.telcoqe.eng.rdu2.dc.redhat.com -oyaml
[...]
  - lastHeartbeatTime: "2025-02-03T02:59:00Z"
    lastTransitionTime: "2025-02-03T03:00:35Z"
    message: Kubelet stopped posting node status.
    reason: NodeStatusUnknown
    status: Unknown
    type: Ready

Kubelet is running, but reporting certificate errors:

# systemctl status kubelet.service |less
[...]
Feb 06 17:20:47 helix82.telcoqe.eng.rdu2.dc.redhat.com bash[1037739]: I0206 17:20:47.794669 1037739 csi_plugin.go:884] Failed to contact API server when waiting for CSINode publishing: csinodes.storage.k8s.io "helix82.telcoqe.eng.rdu2.dc.redhat.com" is forbidden: User "system:anonymous" cannot get resource "csinodes" in API group "storage.k8s.io" at the cluster scope
Feb 06 17:20:48 helix82.telcoqe.eng.rdu2.dc.redhat.com bash[1037739]: E0206 17:20:48.632171 1037739 transport.go:123] "No valid client certificate is found but the server is not responsive. A restart may be necessary to retrieve new initial credentials." lastCertificateAvailabilityTime="2025-02-03 03:04:55.63120282 +0000 UTC m=+0.061510194" shutdownThreshold="5m0s"
Feb 06 17:20:48 helix82.telcoqe.eng.rdu2.dc.redhat.com bash[1037739]: I0206 17:20:48.795128 1037739 csi_plugin.go:884] Failed to contact API server when waiting for CSINode publishing: csinodes.storage.k8s.io "helix82.telcoqe.eng.rdu2.dc.redhat.com" is forbidden: User "system:anonymous" cannot get resource "csinodes" in API group "storage.k8s.io" at the cluster scope
Feb 06 17:20:49 helix82.telcoqe.eng.rdu2.dc.redhat.com bash[1037739]: E0206 17:20:49.632113 1037739 transport.go:123] "No valid client certificate is found but the server is not responsive. A restart may be necessary to retrieve new initial credentials." lastCertificateAvailabilityTime="2025-02-03 03:04:55.63120282 +0000 UTC m=+0.061510194" shutdownThreshold="5m0s"
Feb 06 17:20:49 helix82.telcoqe.eng.rdu2.dc.redhat.com bash[1037739]: I0206 17:20:49.788454 1037739 csi_plugin.go:884] Failed to contact API server when waiting for CSINode publishing: csinodes.storage.k8s.io "helix82.telcoqe.eng.rdu2.dc.redhat.com" is forbidden: User "system:anonymous" cannot get resource "csinodes" in API group "storage.k8s.io" at the cluster scope
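A quick way to confirm the certificate state (a diagnostic sketch, not part of the original report; it assumes SSH/console access to the affected node and the standard kubelet client certificate location):

On the node:
# openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -subject -dates

With a still-valid kubeconfig (e.g. the one generated at install time), check whether kubelet client CSRs are stuck pending on the spoke:
# oc get csr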
Version-Release number of selected component (if applicable):
Spoke cluster: OCP 4.18.0-rc.7, standard DU profile
Hub: OCP 4.17, GitOps, TALM, ACM 2.12
How reproducible:
Always so far (reproduced 2/2 times)
Steps to Reproduce:
1. Deploy hub
2. Deploy spoke cluster using either IBI or AI workflow
3. Spoke deployment is successful
4. Within 24 hours, the spoke becomes unresponsive (see the check below)
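To timestamp when the node flips to NotReady, a simple watch loop can be left running against the spoke while waiting (sketch only; the kubeconfig path and interval are arbitrary):
# while true; do date; oc --kubeconfig /path/to/spoke-kubeconfig get nodes; sleep 300; done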
Actual results:
Spoke is unresponsive:

NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.0-rc.7   True        False         4d18h   Error while reconciling 4.18.0-rc.7: an unknown error has occurred: MultipleErrors

[root@helix81 machine-config-daemon]# ls *machine*^C
[root@helix81 machine-config-daemon]# oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.18.0-rc.7 False True True 3d17h APIServerDeploymentAvailable: no apiserver.openshift-oauth-apiserver pods available on any node....
config-operator 4.18.0-rc.7 True False False 4d18h
dns 4.18.0-rc.7 False True True 3d17h DNS "default" is unavailable.
etcd 4.18.0-rc.7 True False True 4d18h NodeControllerDegraded: The master nodes not ready: node "helix81.telcoqe.eng.rdu2.dc.redhat.com" not ready since 2025-02-03 01:56:04 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
image-registry 4.18.0-rc.7 False True False 3d17h Available: The registry is removed...
ingress 4.18.0-rc.7 True True True 4d18h The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 0/1 of replicas are available)
kube-apiserver 4.18.0-rc.7 True True True 4d18h NodeControllerDegraded: The master nodes not ready: node "helix81.telcoqe.eng.rdu2.dc.redhat.com" not ready since 2025-02-03 01:56:04 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-controller-manager 4.18.0-rc.7 True False True 4d18h NodeControllerDegraded: The master nodes not ready: node "helix81.telcoqe.eng.rdu2.dc.redhat.com" not ready since 2025-02-03 01:56:04 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-scheduler 4.18.0-rc.7 True False True 4d18h NodeControllerDegraded: The master nodes not ready: node "helix81.telcoqe.eng.rdu2.dc.redhat.com" not ready since 2025-02-03 01:56:04 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-storage-version-migrator 4.18.0-rc.7 False True False 3d17h KubeStorageVersionMigratorAvailable: Waiting for Deployment
machine-approver 4.18.0-rc.7 True False False 4d18h
machine-config 4.18.0-rc.7 True False True 4d18h Failed to resync 4.18.0-rc.7 because: error during waitForDaemonsetRollout: [context deadline exceeded, daemonset machine-config-daemon is not ready. status: (desired: 1, updated: 1, ready: 0, unavailable: 1)]
monitoring 4.18.0-rc.7 False True True 3d17h UpdatingPrometheusOperator: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: context deadline exceeded: got 1 unavailable replicas
network 4.18.0-rc.7 True True False 4d18h DaemonSet "/openshift-multus/multus" is not available (awaiting 1 nodes)...
node-tuning 4.18.0-rc.7 False True True 3d17h DaemonSet "tuned" has no available Pod(s)
openshift-apiserver 4.18.0-rc.7 False False True 3d17h APIServerDeploymentAvailable: no apiserver.openshift-apiserver pods available on any node....
openshift-controller-manager 4.18.0-rc.7 False True False 3d17h Available: no pods available on any node.
operator-lifecycle-manager 4.18.0-rc.7 True False False 4d18h
operator-lifecycle-manager-catalog 4.18.0-rc.7 True False False 4d18h
operator-lifecycle-manager-packageserver 4.18.0-rc.7 False True False 3d17h ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver observed in phase Failed with reason: InstallCheckFailed, message: install timeout
service-ca 4.18.0-rc.7 True True False 4d18h Progressing: ...

Errors from Machine Config Daemon log:

2025-02-02T02:13:37.250742312+00:00 stderr F I0202 02:13:37.250713 65176 daemon.go:1731] state: Done
2025-02-02T02:13:37.257021995+00:00 stderr F I0202 02:13:37.256988 65176 daemon.go:2237] Completing update to target MachineConfig: rendered-master-aa88c68deab3f85b5398c5071ba28fcc
2025-02-02T02:13:47.295959078+00:00 stderr F I0202 02:13:47.295886 65176 update.go:2691] "Update completed for config rendered-master-aa88c68deab3f85b5398c5071ba28fcc and node has been successfully uncordoned"
2025-02-02T02:13:47.314119423+00:00 stderr F I0202 02:13:47.310239 65176 daemon.go:2262] In desired state MachineConfig: rendered-master-aa88c68deab3f85b5398c5071ba28fcc
2025-02-02T02:13:47.368862553+00:00 stderr F I0202 02:13:47.368825 65176 config_drift_monitor.go:246] Config Drift Monitor started
2025-02-02T02:13:47.368932257+00:00 stderr F I0202 02:13:47.368920 65176 update.go:2722] Removing SIGTERM protection
2025-02-02T02:14:03.907018605+00:00 stderr F I0202 02:14:03.906666 65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 32146
2025-02-02T02:14:10.912593055+00:00 stderr F I0202 02:14:10.912534 65176 certificate_writer.go:185] Unable to decode cert into a pem block. Cert is either empty or invalid.
2025-02-02T02:14:10.934982079+00:00 stderr F I0202 02:14:10.934929 65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 33627
2025-02-02T02:14:11.040925975+00:00 stderr F I0202 02:14:11.040697 65176 certificate_writer.go:185] Unable to decode cert into a pem block. Cert is either empty or invalid.
2025-02-02T02:14:11.162126516+00:00 stderr F I0202 02:14:11.161720 65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 33629
2025-02-02T02:14:12.056354588+00:00 stderr F I0202 02:14:12.056238 65176 certificate_writer.go:185] Unable to decode cert into a pem block. Cert is either empty or invalid.
2025-02-02T02:14:12.073828092+00:00 stderr F I0202 02:14:12.073781 65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 33633
2025-02-02T02:14:17.884110055+00:00 stderr F I0202 02:14:17.882639 65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 33718
2025-02-02T02:14:47.370607584+00:00 stderr F I0202 02:14:47.370505 65176 daemon.go:874] Starting health listener on 127.0.0.1:8798
2025-02-02T02:39:32.944763388+00:00 stderr F I0202 02:39:32.937526 65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 33718
2025-02-02T03:06:02.447229983+00:00 stderr F I0202 03:06:02.447084 65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 33718
2025-02-02T03:32:31.994287572+00:00 stderr F I0202 03:32:31.994082 65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 33718
2025-02-02T03:59:01.541019900+00:00 stderr F I0202 03:59:01.540926 65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 33718
2025-02-02T04:25:31.088572765+00:00 stderr F I0202 04:25:31.088436 65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 33718
...
2025-02-03T01:39:35.868913143+00:00 stderr F I0203 01:39:35.868850 65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 192389
2025-02-03T01:44:36.933295742+00:00 stderr F I0203 01:44:36.933215 65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 192389
2025-02-03T01:52:52.004005058+00:00 stderr F I0203 01:52:52.003934 65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 192389
2025-02-03T01:55:49.503609539+00:00 stderr F E0203 01:55:49.503224 65176 writer.go:226] Marking Degraded due to: failed to set annotations on node: unable to update node "&Node{ObjectMeta:{ 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacity:ResourceList{},Allocatable:ResourceList{},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeInfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage{},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,RuntimeHandlers:[]NodeRuntimeHandler{},Features:nil,},}": Unauthorized
2025-02-03T01:55:49.509539979+00:00 stderr F E0203 01:55:49.509474 65176 writer.go:242] Error setting Degraded annotation for node helix81.telcoqe.eng.rdu2.dc.redhat.com: unable to update node "&Node{ObjectMeta:{ 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacity:ResourceList{},Allocatable:ResourceList{},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeInfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage{},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,RuntimeHandlers:[]NodeRuntimeHandler{},Features:nil,},}": Patch "https://api-int.helix81.telcoqe.eng.rdu2.dc.redhat.com:6443/api/v1/nodes/helix81.telcoqe.eng.rdu2.dc.redhat.com": write tcp [2620:52:9:1684:aa3c:a5ff:fe36:363a]:47586->[2620:52:9:1684:aa3c:a5ff:fe36:363a]:6443: use of closed network connection
2025-02-03T01:55:49.509600297+00:00 stderr F E0203 01:55:49.509585 65176 certificate_writer.go:90] Could not update annotation: unable to update node "&Node{ObjectMeta:{ 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacity:ResourceList{},Allocatable:ResourceList{},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeInfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage{},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,RuntimeHandlers:[]NodeRuntimeHandler{},Features:nil,},}": Patch "https://api-int.helix81.telcoqe.eng.rdu2.dc.redhat.com:6443/api/v1/nodes/helix81.telcoqe.eng.rdu2.dc.redhat.com": write tcp [2620:52:9:1684:aa3c:a5ff:fe36:363a]:47586->[2620:52:9:1684:aa3c:a5ff:fe36:363a]:6443: use of closed network connection
2025-02-03T01:55:49.514492031+00:00 stderr F E0203 01:55:49.514452 65176 reflector.go:158] "Unhandled Error" err="k8s.io/client-go/informers/factory.go:160: Failed to watch *v1.Node: the server has asked for the client to provide credentials (get nodes)"
2025-02-03T01:55:49.515658323+00:00 stderr F E0203 01:55:49.515623 65176 daemon.go:1404] Got an error from auxiliary tools: k8s.io/client-go/informers/factory.go:160: Failed to watch *v1.Node: the server has asked for the client to provide credentials (get nodes)
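Not a verified workaround, but following the kubelet hint above ("A restart may be necessary to retrieve new initial credentials"), one recovery attempt is to restart kubelet on the node and then approve any pending kubelet CSRs; the approval step assumes API access with still-valid credentials, which may not exist in this state:

On the node:
# systemctl restart kubelet
From a working kubeconfig:
# oc get csr
# oc adm certificate approve <csr-name>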
Expected results:
Spoke cluster remains healthy
Additional info:
Spokes defined using SiteConfig V1 do not seem to be affected. Unable to run must-gather on spoke cluster. I've attached logs from /var/log/pods to this bug.
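For reference, since must-gather fails, one way to pull the same logs is directly over SSH (hostname and output file below are illustrative):
# ssh core@helix82.telcoqe.eng.rdu2.dc.redhat.com 'sudo tar czf - /var/log/pods' > spoke-var-log-pods.tar.gz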
is cloned by: OCPBUGS-51279 Kubelet unhealthy within 24hrs on spoke cluster with LCA installed (Closed)
relates to: OCPBUGS-50643 Upon generating the seed image the certificates turn expired (New)
links to: RHEA-2024:139462 OpenShift Container Platform 4.18.0 IBU extras update