OpenShift Bugs / OCPBUGS-49972

Kubelet unhealthy within 24hrs on spoke cluster with LCA installed


    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Critical
    • Affects Version: 4.18.0
    • Target Version: 4.18
    • Component: LCA operator
    • Quality / Stability / Reliability

      Description of problem:

          Less than 24 hours after a successful deployment, the spoke cluster reports the node as "Not Ready".
      
      # oc get nodes helix82.telcoqe.eng.rdu2.dc.redhat.com -oyaml
      [...]
        - lastHeartbeatTime: "2025-02-03T02:59:00Z"
          lastTransitionTime: "2025-02-03T03:00:35Z"
          message: Kubelet stopped posting node status.
          reason: NodeStatusUnknown
          status: Unknown
          type: Ready
      
      The kubelet is running, but is reporting certificate errors:
      
      # systemctl status kubelet.service |less
      [...]
      Feb 06 17:20:47 helix82.telcoqe.eng.rdu2.dc.redhat.com bash[1037739]: I0206 17:20:47.794669 1037739 csi_plugin.go:884] Failed to contact API server when waiting for CSINode publishing: csinodes.storage.k8s.io "helix82.telcoqe.eng.rdu2.dc.redhat.com" is forbidden: User "system:anonymous" cannot get resource "csinodes" in API group "storage.k8s.io" at the cluster scope
      Feb 06 17:20:48 helix82.telcoqe.eng.rdu2.dc.redhat.com bash[1037739]: E0206 17:20:48.632171 1037739 transport.go:123] "No valid client certificate is found but the server is not responsive. A restart may be necessary to retrieve new initial credentials." lastCertificateAvailabilityTime="2025-02-03 03:04:55.63120282 +0000 UTC m=+0.061510194" shutdownThreshold="5m0s"
      Feb 06 17:20:48 helix82.telcoqe.eng.rdu2.dc.redhat.com bash[1037739]: I0206 17:20:48.795128 1037739 csi_plugin.go:884] Failed to contact API server when waiting for CSINode publishing: csinodes.storage.k8s.io "helix82.telcoqe.eng.rdu2.dc.redhat.com" is forbidden: User "system:anonymous" cannot get resource "csinodes" in API group "storage.k8s.io" at the cluster scope
      Feb 06 17:20:49 helix82.telcoqe.eng.rdu2.dc.redhat.com bash[1037739]: E0206 17:20:49.632113 1037739 transport.go:123] "No valid client certificate is found but the server is not responsive. A restart may be necessary to retrieve new initial credentials." lastCertificateAvailabilityTime="2025-02-03 03:04:55.63120282 +0000 UTC m=+0.061510194" shutdownThreshold="5m0s"
      Feb 06 17:20:49 helix82.telcoqe.eng.rdu2.dc.redhat.com bash[1037739]: I0206 17:20:49.788454 1037739 csi_plugin.go:884] Failed to contact API server when waiting for CSINode publishing: csinodes.storage.k8s.io "helix82.telcoqe.eng.rdu2.dc.redhat.com" is forbidden: User "system:anonymous" cannot get resource "csinodes" in API group "storage.k8s.io" at the cluster scope
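      
      The transport.go errors above suggest the kubelet client certificate stopped rotating shortly after 2025-02-03 03:04 UTC (the lastCertificateAvailabilityTime). A minimal diagnostic sketch, assuming console/SSH access to the node and an admin kubeconfig that still reaches the SNO API server:
      
      # On the node: check the validity window of the current kubelet client certificate.
      # A "notAfter" date in the past means rotation failed before expiry.
      sudo openssl x509 -noout -subject -dates \
          -in /var/lib/kubelet/pki/kubelet-client-current.pem
      
      # With a working admin kubeconfig: look for stuck or unapproved node CSRs.
      oc get csr --sort-by=.metadata.creationTimestamp | tail -n 10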
      
      

      Version-Release number of selected component (if applicable):

      Spoke: OCP 4.18.0-rc.7, standard DU profile
      Hub: OCP 4.17, GitOps, TALM, ACM 2.12

      How reproducible:

          Always, so far: reproduced 2 out of 2 times.

      Steps to Reproduce:

      1. Deploy the hub.
      2. Deploy the spoke cluster using either the IBI or AI workflow.
      3. The spoke deployment completes successfully.
      4. Within 24 hours, the spoke becomes unresponsive (see the check sketched below the list).
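      
      One way to catch the failure from the hub (useful once the spoke API becomes hard to reach) is to watch ACM's ManagedCluster availability condition; the cluster name "helix82" below is only this lab's example:
      
      # From the hub: this should flip away from "True" once the kubelet stops posting node status.
      oc get managedcluster helix82 \
          -o jsonpath='{.status.conditions[?(@.type=="ManagedClusterConditionAvailable")].status}{"\n"}'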
          

      Actual results:

          The spoke is unresponsive:
      NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.18.0-rc.7   True        False         4d18h   Error while reconciling 4.18.0-rc.7: an unknown error has occurred: MultipleErrors
      [root@helix81 machine-config-daemon]# oc get co
      NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.18.0-rc.7   False       True          True       3d17h   APIServerDeploymentAvailable: no apiserver.openshift-oauth-apiserver pods available on any node....
      config-operator                            4.18.0-rc.7   True        False         False      4d18h   
      dns                                        4.18.0-rc.7   False       True          True       3d17h   DNS "default" is unavailable.
      etcd                                       4.18.0-rc.7   True        False         True       4d18h   NodeControllerDegraded: The master nodes not ready: node "helix81.telcoqe.eng.rdu2.dc.redhat.com" not ready since 2025-02-03 01:56:04 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
      image-registry                             4.18.0-rc.7   False       True          False      3d17h   Available: The registry is removed...
      ingress                                    4.18.0-rc.7   True        True          True       4d18h   The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 0/1 of replicas are available)
      kube-apiserver                             4.18.0-rc.7   True        True          True       4d18h   NodeControllerDegraded: The master nodes not ready: node "helix81.telcoqe.eng.rdu2.dc.redhat.com" not ready since 2025-02-03 01:56:04 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
      kube-controller-manager                    4.18.0-rc.7   True        False         True       4d18h   NodeControllerDegraded: The master nodes not ready: node "helix81.telcoqe.eng.rdu2.dc.redhat.com" not ready since 2025-02-03 01:56:04 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
      kube-scheduler                             4.18.0-rc.7   True        False         True       4d18h   NodeControllerDegraded: The master nodes not ready: node "helix81.telcoqe.eng.rdu2.dc.redhat.com" not ready since 2025-02-03 01:56:04 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
      kube-storage-version-migrator              4.18.0-rc.7   False       True          False      3d17h   KubeStorageVersionMigratorAvailable: Waiting for Deployment
      machine-approver                           4.18.0-rc.7   True        False         False      4d18h   
      machine-config                             4.18.0-rc.7   True        False         True       4d18h   Failed to resync 4.18.0-rc.7 because: error during waitForDaemonsetRollout: [context deadline exceeded, daemonset machine-config-daemon is not ready. status: (desired: 1, updated: 1, ready: 0, unavailable: 1)]
      monitoring                                 4.18.0-rc.7   False       True          True       3d17h   UpdatingPrometheusOperator: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: context deadline exceeded: got 1 unavailable replicas
      network                                    4.18.0-rc.7   True        True          False      4d18h   DaemonSet "/openshift-multus/multus" is not available (awaiting 1 nodes)...
      node-tuning                                4.18.0-rc.7   False       True          True       3d17h   DaemonSet "tuned" has no available Pod(s)
      openshift-apiserver                        4.18.0-rc.7   False       False         True       3d17h   APIServerDeploymentAvailable: no apiserver.openshift-apiserver pods available on any node....
      openshift-controller-manager               4.18.0-rc.7   False       True          False      3d17h   Available: no pods available on any node.
      operator-lifecycle-manager                 4.18.0-rc.7   True        False         False      4d18h   
      operator-lifecycle-manager-catalog         4.18.0-rc.7   True        False         False      4d18h   
      operator-lifecycle-manager-packageserver   4.18.0-rc.7   False       True          False      3d17h   ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver observed in phase Failed with reason: InstallCheckFailed, message: install timeout
      service-ca                                 4.18.0-rc.7   True        True          False      4d18h   Progressing: ...
      
      
      
      Errors from the Machine Config Daemon log:
      2025-02-02T02:13:37.250742312+00:00 stderr F I0202 02:13:37.250713   65176 daemon.go:1731] state: Done
      2025-02-02T02:13:37.257021995+00:00 stderr F I0202 02:13:37.256988   65176 daemon.go:2237] Completing update to target MachineConfig: rendered-master-aa88c68deab3f85b5398c5071ba28fcc
      2025-02-02T02:13:47.295959078+00:00 stderr F I0202 02:13:47.295886   65176 update.go:2691] "Update completed for config rendered-master-aa88c68deab3f85b5398c5071ba28fcc and node has been successfully uncordoned"
      2025-02-02T02:13:47.314119423+00:00 stderr F I0202 02:13:47.310239   65176 daemon.go:2262] In desired state MachineConfig: rendered-master-aa88c68deab3f85b5398c5071ba28fcc
      2025-02-02T02:13:47.368862553+00:00 stderr F I0202 02:13:47.368825   65176 config_drift_monitor.go:246] Config Drift Monitor started
      2025-02-02T02:13:47.368932257+00:00 stderr F I0202 02:13:47.368920   65176 update.go:2722] Removing SIGTERM protection
      2025-02-02T02:14:03.907018605+00:00 stderr F I0202 02:14:03.906666   65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 32146
      2025-02-02T02:14:10.912593055+00:00 stderr F I0202 02:14:10.912534   65176 certificate_writer.go:185] Unable to decode cert into a pem block. Cert is either empty or invalid.
      2025-02-02T02:14:10.934982079+00:00 stderr F I0202 02:14:10.934929   65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 33627
      2025-02-02T02:14:11.040925975+00:00 stderr F I0202 02:14:11.040697   65176 certificate_writer.go:185] Unable to decode cert into a pem block. Cert is either empty or invalid.
      2025-02-02T02:14:11.162126516+00:00 stderr F I0202 02:14:11.161720   65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 33629
      2025-02-02T02:14:12.056354588+00:00 stderr F I0202 02:14:12.056238   65176 certificate_writer.go:185] Unable to decode cert into a pem block. Cert is either empty or invalid.
      2025-02-02T02:14:12.073828092+00:00 stderr F I0202 02:14:12.073781   65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 33633
      2025-02-02T02:14:17.884110055+00:00 stderr F I0202 02:14:17.882639   65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 33718
      2025-02-02T02:14:47.370607584+00:00 stderr F I0202 02:14:47.370505   65176 daemon.go:874] Starting health listener on 127.0.0.1:8798
      2025-02-02T02:39:32.944763388+00:00 stderr F I0202 02:39:32.937526   65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 33718
      2025-02-02T03:06:02.447229983+00:00 stderr F I0202 03:06:02.447084   65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 33718
      2025-02-02T03:32:31.994287572+00:00 stderr F I0202 03:32:31.994082   65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 33718
      2025-02-02T03:59:01.541019900+00:00 stderr F I0202 03:59:01.540926   65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 33718
      2025-02-02T04:25:31.088572765+00:00 stderr F I0202 04:25:31.088436   65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 33718
      
      ...
      2025-02-03T01:39:35.868913143+00:00 stderr F I0203 01:39:35.868850   65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 192389
      2025-02-03T01:44:36.933295742+00:00 stderr F I0203 01:44:36.933215   65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 192389
      2025-02-03T01:52:52.004005058+00:00 stderr F I0203 01:52:52.003934   65176 certificate_writer.go:303] Certificate was synced from controllerconfig resourceVersion 192389
      2025-02-03T01:55:49.503609539+00:00 stderr F E0203 01:55:49.503224   65176 writer.go:226] Marking Degraded due to: failed to set annotations on node: unable to update node "&Node{ObjectMeta:{      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacity:ResourceList{},Allocatable:ResourceList{},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeInfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage{},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,RuntimeHandlers:[]NodeRuntimeHandler{},Features:nil,},}": Unauthorized
      2025-02-03T01:55:49.509539979+00:00 stderr F E0203 01:55:49.509474   65176 writer.go:242] Error setting Degraded annotation for node helix81.telcoqe.eng.rdu2.dc.redhat.com: unable to update node "&Node{ObjectMeta:{      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacity:ResourceList{},Allocatable:ResourceList{},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeInfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage{},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,RuntimeHandlers:[]NodeRuntimeHandler{},Features:nil,},}": Patch "https://api-int.helix81.telcoqe.eng.rdu2.dc.redhat.com:6443/api/v1/nodes/helix81.telcoqe.eng.rdu2.dc.redhat.com": write tcp [2620:52:9:1684:aa3c:a5ff:fe36:363a]:47586->[2620:52:9:1684:aa3c:a5ff:fe36:363a]:6443: use of closed network connection
      2025-02-03T01:55:49.509600297+00:00 stderr F E0203 01:55:49.509585   65176 certificate_writer.go:90] Could not update annotation: unable to update node "&Node{ObjectMeta:{      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacity:ResourceList{},Allocatable:ResourceList{},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeInfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage{},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,RuntimeHandlers:[]NodeRuntimeHandler{},Features:nil,},}": Patch "https://api-int.helix81.telcoqe.eng.rdu2.dc.redhat.com:6443/api/v1/nodes/helix81.telcoqe.eng.rdu2.dc.redhat.com": write tcp [2620:52:9:1684:aa3c:a5ff:fe36:363a]:47586->[2620:52:9:1684:aa3c:a5ff:fe36:363a]:6443: use of closed network connection
      2025-02-03T01:55:49.514492031+00:00 stderr F E0203 01:55:49.514452   65176 reflector.go:158] "Unhandled Error" err="k8s.io/client-go/informers/factory.go:160: Failed to watch *v1.Node: the server has asked for the client to provide credentials (get nodes)"
      2025-02-03T01:55:49.515658323+00:00 stderr F E0203 01:55:49.515623   65176 daemon.go:1404] Got an error from auxiliary tools: k8s.io/client-go/informers/factory.go:160: Failed to watch *v1.Node: the server has asked for the client to provide credentials (get nodes)
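      
      The repeated "Unable to decode cert into a pem block" messages come from the MCD certificate_writer, which maintains the kubelet client CA bundle on disk. A quick sanity-check sketch on the node (the /etc/kubernetes/kubelet-ca.crt path is the one used on recent OCP releases; adjust if it differs on this build):
      
      # Count certificates in the MCO-managed kubelet CA bundle and list subjects/expiry.
      sudo grep -c 'BEGIN CERTIFICATE' /etc/kubernetes/kubelet-ca.crt
      sudo openssl crl2pkcs7 -nocrl -certfile /etc/kubernetes/kubelet-ca.crt \
          | openssl pkcs7 -print_certs -noout -text | grep -E 'Subject:|Not After'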

       

      Expected results:

          The spoke cluster remains healthy.

      Additional info:

      Spokes defined using SiteConfig V1 do not seem to be affected.
      
      Unable to run must-gather on the spoke cluster. I've attached logs from /var/log/pods to this bug.
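      
      Since must-gather pods cannot be scheduled while the node is NotReady, a rough equivalent is to pull the logs straight off the host over SSH (the hostname below is this lab's spoke; adjust as needed):
      
      # Collect pod logs directly from the node:
      ssh core@helix82.telcoqe.eng.rdu2.dc.redhat.com 'sudo tar czf - /var/log/pods' \
          > spoke-var-log-pods.tgz
      # Grab the kubelet journal as well:
      ssh core@helix82.telcoqe.eng.rdu2.dc.redhat.com 'sudo journalctl -u kubelet --no-pager' \
          > spoke-kubelet.log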

              Assignee: Omer Tuchfeld (otuchfel@redhat.com)
              Reporter: Joshua Clark (josclark@redhat.com)