OpenShift Bugs / OCPBUGS-49972

Kubelet unhealthy within 24hrs on spoke cluster with LCA installed


    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Critical
    • Affects Version: 4.18.0
    • Target Version: 4.18
    • Component: LCA operator
    • Quality / Stability / Reliability

      Description of problem:

          Less than 24 hours after a successful deployment, the spoke cluster reports the node as "Not Ready".
      
      # oc get nodes helix82.telcoqe.eng.rdu2.dc.redhat.com -oyaml
      [...]
        - lastHeartbeatTime: "2025-02-03T02:59:00Z"
          lastTransitionTime: "2025-02-03T03:00:35Z"
          message: Kubelet stopped posting node status.
          reason: NodeStatusUnknown
          status: Unknown
          type: Ready
      
      The kubelet is running, but is reporting certificate errors:
      
      # systemctl status kubelet.service |less
      [...]
      Feb 06 17:20:47 helix82.telcoqe.eng.rdu2.dc.redhat.com bash[1037739]: I0206 17:20:47.794669 1037739 csi_plugin.go:884] Failed to contact API server when waiting for CSINode publishing: csinodes.storage.k8s.io "helix82.telcoqe.eng.rdu2.dc.redhat.com" is forbidden: User "system:anonymous" cannot get resource "csinodes" in API group "storage.k8s.io" at the cluster scope
      Feb 06 17:20:48 helix82.telcoqe.eng.rdu2.dc.redhat.com bash[1037739]: E0206 17:20:48.632171 1037739 transport.go:123] "No valid client certificate is found but the server is not responsive. A restart may be necessary to retrieve new initial credentials." lastCertificateAvailabilityTime="2025-02-03 03:04:55.63120282 +0000 UTC m=+0.061510194" shutdownThreshold="5m0s"
      Feb 06 17:20:48 helix82.telcoqe.eng.rdu2.dc.redhat.com bash[1037739]: I0206 17:20:48.795128 1037739 csi_plugin.go:884] Failed to contact API server when waiting for CSINode publishing: csinodes.storage.k8s.io "helix82.telcoqe.eng.rdu2.dc.redhat.com" is forbidden: User "system:anonymous" cannot get resource "csinodes" in API group "storage.k8s.io" at the cluster scope
      Feb 06 17:20:49 helix82.telcoqe.eng.rdu2.dc.redhat.com bash[1037739]: E0206 17:20:49.632113 1037739 transport.go:123] "No valid client certificate is found but the server is not responsive. A restart may be necessary to retrieve new initial credentials." lastCertificateAvailabilityTime="2025-02-03 03:04:55.63120282 +0000 UTC m=+0.061510194" shutdownThreshold="5m0s"
      Feb 06 17:20:49 helix82.telcoqe.eng.rdu2.dc.redhat.com bash[1037739]: I0206 17:20:49.788454 1037739 csi_plugin.go:884] Failed to contact API server when waiting for CSINode publishing: csinodes.storage.k8s.io "helix82.telcoqe.eng.rdu2.dc.redhat.com" is forbidden: User "system:anonymous" cannot get resource "csinodes" in API group "storage.k8s.io" at the cluster scope
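      
      The transport.go errors above suggest the kubelet client certificate stopped rotating shortly after 2025-02-03 03:04 UTC (the lastCertificateAvailabilityTime). A minimal diagnostic sketch, assuming console/SSH access to the node and an admin kubeconfig that still reaches the SNO API server:
      
      # On the node: check the validity window of the current kubelet client certificate.
      # A "notAfter" date in the past means rotation failed before expiry.
      sudo openssl x509 -noout -subject -dates \
          -in /var/lib/kubelet/pki/kubelet-client-current.pem
      
      # With a working admin kubeconfig: look for stuck or unapproved node CSRs.
      oc get csr --sort-by=.metadata.creationTimestamp | tail -n 10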
      
      

      Version-Release number of selected component (if applicable):

      Spoke: OCP 4.18.0-rc.7, standard DU profile
      Hub: OCP 4.17, GitOps, TALM, ACM 2.12

      How reproducible:

          Always, so far: reproduced 2 out of 2 times.

      Steps to Reproduce:

      1. Deploy the hub.
      2. Deploy the spoke cluster using either the IBI or AI workflow.
      3. The spoke deployment completes successfully.
      4. Within 24 hours, the spoke becomes unresponsive (see the check sketched below the list).
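      
      One way to catch the failure from the hub (useful once the spoke API becomes hard to reach) is to watch ACM's ManagedCluster availability condition; the cluster name "helix82" below is only this lab's example:
      
      # From the hub: this should flip away from "True" once the kubelet stops posting node status.
      oc get managedcluster helix82 \
          -o jsonpath='{.status.conditions[?(@.type=="ManagedClusterConditionAvailable")].status}{"\n"}'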
          

      Actual results:

          The spoke is unresponsive:
      NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.18.0-rc.7   True        False         4d18h   Error while reconciling 4.18.0-rc.7: an unknown error has occurred: MultipleErrors
      [root@helix81 machine-config-daemon]# oc get co
      NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.18.0-rc.7   False       True          True       3d17h   APIServerDeploymentAvailable: no apiserver.openshift-oauth-apiserver pods available on any node....
      config-operator                            4.18.0-rc.7   True        False         False      4d18h   
      dns                                        4.18.0-rc.7   False       True          True       3d17h   DNS "default" is unavailable.
      etcd                                       4.18.0-rc.7   True        False         True       4d18h   NodeControllerDegraded: The master nodes not ready: node "helix81.telcoqe.eng.rdu2.dc.redhat.com" not ready since 2025-02-03 01:56:04 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
      image-registry                             4.18.0-rc.7   False       True          False      3d17h   Available: The registry is removed...
      ingress                                    4.18.0-rc.7   True        True          True       4d18h   The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 0/1 of replicas are available)
      kube-apiserver                             4.18.0-rc.7   True        True          True       4d18h   NodeControllerDegraded: The master nodes not ready: node "helix81.telcoqe.eng.rdu2.dc.redhat.com" not ready since 2025-02-03 01:56:04 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
      kube-controller-manager                    4.18.0-rc.7   True        False         True       4d18h   NodeControllerDegraded: The master nodes not ready: node "helix81.telcoqe.eng.rdu2.dc.redhat.com" not ready since 2025-02-03 01:56:04 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
      kube-scheduler                             4.18.0-rc.7   True        False         True       4d18h   NodeControllerDegraded: The master nodes not ready: node "helix81.telcoqe.eng.rdu2.dc.redhat.com" not ready since 2025-02-03 01:56:04 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
      kube-storage-version-migrator              4.18.0-rc.7   False       True          False      3d17h   KubeStorageVersionMigratorAvailable: Waiting for Deployment
      machine-approver                           4.18.0-rc.7   True        False         False      4d18h   
      machine-config                             4.18.0-rc.7   True        False         True       4d18h   Failed to resync 4.18.0-rc.7 because: error during waitForDaemonsetRollout: [context deadline exceeded, daemonset machine-config-daemon is not ready. status: (desired: 1, updated: 1, ready: 0, unavailable: 1)]
      monitoring                                 4.18.0-rc.7   False       True          True       3d17h   UpdatingPrometheusOperator: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: context deadline exceeded: got 1 unavailable replicas
      network                                    4.18.0-rc.7   True        True          False      4d18h   DaemonSet "/openshift-multus/multus" is not available (awaiting 1 nodes)...
      node-tuning                                4.18.0-rc.7   False       True          True       3d17h   DaemonSet "tuned" has no available Pod(s)
      openshift-apiserver                        4.18.0-rc.7   False       False         True       3d17h   APIServerDeploymentAvailable: no apiserver.openshift-apiserver pods available on any node....
      openshift-controller-manager               4.18.0-rc.7   False       True          False      3d17h   Available: no pods available on any node.
      operator-lifecycle-manager                 4.18.0-rc.7   True        False         False      4d18h   
      operator-lifecycle-manager-catalog         4.18.0-rc.7   True        False         False      4d18h   
      operator-lifecycle-manager-packageserver   4.18.0-rc.7   False       True          False      3d17h   ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver observed in phase Failed with reason: InstallCheckFailed, message: install timeout
      service-ca                                 4.18.0-rc.7   True        True          False      4d18h   Progressing: ...
      
      
      
      Errors from the Machine Config Daemon log:
      2025-02-02T02:13:37.250742312+00:00 stderr F I0202 02:13:37.250713   65176 daemon.go:1731] state: Done
      2025-02-02T02:13:37.257021995+00:00 stderr F I0202 02:13:37.256988   65176 daemon.go:2237] Completing update to target MachineConfig: rendered-master-aa88c68deab3f85b5398c5071ba28fcc
      2025-02-02T02:13:47.295959078+00:00 stderr F I0202 02:13:47.295886   65176 update.go:2691] "Update completed for config rendered-master-aa88c68deab3f85b5398c5071ba28fcc and node has been successfully uncordoned"
      2025-02-02T02:13:47.314119423+00:00 stderr F I0202 02:13:47.310239   65176 daemon.go:2262] In desired state MachineConfig: rendered-master-aa88c68deab3f85b5398c5071ba28fcc
      2025-02-02T02:13:47.368862553+00:00 stderr F I0202 02:13:47.368825   65176 config_drift_monitor.go:246] Config Drift Monitor started
      2025-02-02T02:13:47.368932257+00:00 stderr F I0202 02:13:47.368920   65176 update.go:2722] Removing SIGTERM protection
      2025-02-02T02:14:03.907018605+00:00 stderr F I0202 02:14:03.906666   65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 32146
      2025-02-02T02:14:10.912593055+00:00 stderr F I0202 02:14:10.912534   65176 certificate_writer.go:185] Unable to decode cert into a pem block. Cert is either empty or invalid.
      2025-02-02T02:14:10.934982079+00:00 stderr F I0202 02:14:10.934929   65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 33627
      2025-02-02T02:14:11.040925975+00:00 stderr F I0202 02:14:11.040697   65176 certificate_writer.go:185] Unable to decode cert into a pem block. Cert is either empty or invalid.
      2025-02-02T02:14:11.162126516+00:00 stderr F I0202 02:14:11.161720   65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 33629
      2025-02-02T02:14:12.056354588+00:00 stderr F I0202 02:14:12.056238   65176 certificate_writer.go:185] Unable to decode cert into a pem block. Cert is either empty or invalid.
      2025-02-02T02:14:12.073828092+00:00 stderr F I0202 02:14:12.073781   65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 33633
      2025-02-02T02:14:17.884110055+00:00 stderr F I0202 02:14:17.882639   65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 33718
      2025-02-02T02:14:47.370607584+00:00 stderr F I0202 02:14:47.370505   65176 daemon.go:874] Starting health listener on 127.0.0.1:8798
      2025-02-02T02:39:32.944763388+00:00 stderr F I0202 02:39:32.937526   65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 33718
      2025-02-02T03:06:02.447229983+00:00 stderr F I0202 03:06:02.447084   65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 33718
      2025-02-02T03:32:31.994287572+00:00 stderr F I0202 03:32:31.994082   65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 33718
      2025-02-02T03:59:01.541019900+00:00 stderr F I0202 03:59:01.540926   65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 33718
      2025-02-02T04:25:31.088572765+00:00 stderr F I0202 04:25:31.088436   65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 33718
      
      ...
      2025-02-03T01:39:35.868913143+00:00 stderr F I0203 01:39:35.868850   65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 192389
      2025-02-03T01:44:36.933295742+00:00 stderr F I0203 01:44:36.933215   65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 192389
      2025-02-03T01:52:52.004005058+00:00 stderr F I0203 01:52:52.003934   65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 192389
      2025-02-03T01:55:49.503609539+00:00 stderr F E0203 01:55:49.503224   65176 writer.go:226] Marking Degraded due to: failed to set annotations on node: unable to update node "&Node{ObjectMeta:{      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacity:ResourceList{},Allocatable:ResourceList{},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeInfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage{},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,RuntimeHandlers:[]NodeRuntimeHandler{},Features:nil,},}": Unauthorized
      2025-02-03T01:55:49.509539979+00:00 stderr F E0203 01:55:49.509474   65176 writer.go:242] Error setting Degraded annotation for node helix81.telcoqe.eng.rdu2.dc.redhat.com: unable to update node "&Node{ObjectMeta:{      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacity:ResourceList{},Allocatable:ResourceList{},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeInfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage{},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,RuntimeHandlers:[]NodeRuntimeHandler{},Features:nil,},}": Patch "https://api-int.helix81.telcoqe.eng.rdu2.dc.redhat.com:6443/api/v1/nodes/helix81.telcoqe.eng.rdu2.dc.redhat.com": write tcp [2620:52:9:1684:aa3c:a5ff:fe36:363a]:47586->[2620:52:9:1684:aa3c:a5ff:fe36:363a]:6443: use of closed network connection
      2025-02-03T01:55:49.509600297+00:00 stderr F E0203 01:55:49.509585   65176 certificate_writer.go:90] Could not update annotation: unable to update node "&Node{ObjectMeta:{      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacity:ResourceList{},Allocatable:ResourceList{},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeInfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage{},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,RuntimeHandlers:[]NodeRuntimeHandler{},Features:nil,},}": Patch "https://api-int.helix81.telcoqe.eng.rdu2.dc.redhat.com:6443/api/v1/nodes/helix81.telcoqe.eng.rdu2.dc.redhat.com": write tcp [2620:52:9:1684:aa3c:a5ff:fe36:363a]:47586->[2620:52:9:1684:aa3c:a5ff:fe36:363a]:6443: use of closed network connection
      2025-02-03T01:55:49.514492031+00:00 stderr F E0203 01:55:49.514452   65176 reflector.go:158] "Unhandled Error" err="k8s.io/client-go/informers/factory.go:160: Failed to watch *v1.Node: the server has asked for the client to provide credentials (get nodes)"
      2025-02-03T01:55:49.515658323+00:00 stderr F E0203 01:55:49.515623   65176 daemon.go:1404] Got an error from auxiliary tools: k8s.io/client-go/informers/factory.go:160: Failed to watch *v1.Node: the server has asked for the client to provide credentials (get nodes)
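      
      The repeated "Unable to decode cert into a pem block" messages come from the MCD certificate_writer, which maintains the kubelet client CA bundle on disk. A quick sanity-check sketch on the node (the /etc/kubernetes/kubelet-ca.crt path is the one used on recent OCP releases; adjust if it differs on this build):
      
      # Count certificates in the MCO-managed kubelet CA bundle and list subjects/expiry.
      sudo grep -c 'BEGIN CERTIFICATE' /etc/kubernetes/kubelet-ca.crt
      sudo openssl crl2pkcs7 -nocrl -certfile /etc/kubernetes/kubelet-ca.crt \
          | openssl pkcs7 -print_certs -noout -text | grep -E 'Subject:|Not After'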

       

      Expected results:

          The spoke cluster remains healthy.

      Additional info:

      Spokes defined using SiteConfig V1 do not seem to be affected.
      
      Unable to run must-gather on the spoke cluster. I've attached logs from /var/log/pods to this bug.
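      
      Since must-gather pods cannot be scheduled while the node is NotReady, a rough equivalent is to pull the logs straight off the host over SSH (the hostname below is this lab's spoke; adjust as needed):
      
      # Collect pod logs directly from the node:
      ssh core@helix82.telcoqe.eng.rdu2.dc.redhat.com 'sudo tar czf - /var/log/pods' \
          > spoke-var-log-pods.tgz
      # Grab the kubelet journal as well:
      ssh core@helix82.telcoqe.eng.rdu2.dc.redhat.com 'sudo journalctl -u kubelet --no-pager' \
          > spoke-kubelet.log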

              Assignee: Omer Tuchfeld (otuchfel@redhat.com)
              Reporter: Joshua Clark (josclark@redhat.com)