OpenShift Bugs / OCPBUGS-19509

4.13.z Azure Install Failures: Kubelet stopped posting node status

    • Critical
    • No
    • SDN Sprint 242
    • 1
    • Approved
    • False

      Description of problem:

      Install issues for 4.14 and 4.15 where we lose contact with the kubelet on master nodes.

      https://search.ci.openshift.org/?search=Kubelet+stopped+posting+node+status&maxAge=168h&context=1&type=build-log&name=periodic.*4.14.*azure.*sdn&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

      This search shows it is happening on about 35% of Azure SDN 4.14 jobs over at least the past week. There are no OVN hits.
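
      For anyone who wants to re-run or vary the query, here is a minimal sketch that rebuilds the search link from its parameters. The parameter names and values are copied from the link above; the helper name `build_search_url` and the OVN job-name regex are assumptions for illustration only, since the exact OVN query is not recorded in this report.

```python
from urllib.parse import urlencode

# Base URL and parameter values are taken from the search link in this report.
BASE_URL = "https://search.ci.openshift.org/"


def build_search_url(job_name_regex, max_age="168h"):
    """Build a search.ci link for the kubelet heartbeat failure message."""
    params = {
        "search": "Kubelet stopped posting node status",
        "maxAge": max_age,
        "context": "1",
        "type": "build-log",
        "name": job_name_regex,
        "excludeName": "",
        "maxMatches": "5",
        "maxBytes": "20971520",
        "groupBy": "job",
    }
    # safe="*" leaves the regex wildcards unescaped, as in the original link.
    return BASE_URL + "?" + urlencode(params, safe="*")


sdn_url = build_search_url("periodic.*4.14.*azure.*sdn")
# Assumed regex for the OVN comparison; not taken from the report.
ovn_url = build_search_url("periodic.*4.14.*azure.*ovn")
print(sdn_url)
print(ovn_url)
```

      Changing `max_age` or the job-name regex gives the equivalent query for 4.15 or for a longer window.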

      1703590387039342592/artifacts/e2e-azure-sdn-upgrade/gather-extra/artifacts/nodes.json

                          {
                              "lastHeartbeatTime": "2023-09-18T02:33:11Z",
                              "lastTransitionTime": "2023-09-18T02:35:39Z",
                              "message": "Kubelet stopped posting node status.",
                              "reason": "NodeStatusUnknown",
                              "status": "Unknown",
                              "type": "Ready"
                          }
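
      A rough sketch for pulling the not-ready nodes out of a gathered nodes.json. It assumes the file is a standard Kubernetes NodeList (an `items` array with `metadata.name` and `status.conditions`), which is what `oc get nodes -o json` produces. The sample pairs the condition above with master-1 because its `lastTransitionTime` matches the "not ready since" time in the install log; the function names are hypothetical.

```python
import json


def not_ready_nodes(node_list):
    """Return (name, status, reason, lastTransitionTime) for each node
    whose Ready condition is anything other than "True"."""
    results = []
    for node in node_list.get("items", []):
        name = node.get("metadata", {}).get("name", "<unknown>")
        for cond in node.get("status", {}).get("conditions", []):
            if cond.get("type") == "Ready" and cond.get("status") != "True":
                results.append(
                    (name, cond.get("status"), cond.get("reason"),
                     cond.get("lastTransitionTime"))
                )
    return results


def load_node_list(path):
    """Load a gathered nodes.json artifact from disk."""
    with open(path) as f:
        return json.load(f)


# Condition copied from the report; the surrounding NodeList shape is assumed.
sample = {
    "items": [
        {
            "metadata": {"name": "ci-op-82xkimh8-0dd98-9g9wh-master-1"},
            "status": {
                "conditions": [
                    {
                        "lastHeartbeatTime": "2023-09-18T02:33:11Z",
                        "lastTransitionTime": "2023-09-18T02:35:39Z",
                        "message": "Kubelet stopped posting node status.",
                        "reason": "NodeStatusUnknown",
                        "status": "Unknown",
                        "type": "Ready",
                    }
                ]
            },
        }
    ]
}

for entry in not_ready_nodes(sample):
    print(entry)
```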

      The 4.14 job is interesting because it is a minor upgrade from 4.13: the initial install is 4.13, and that is where we see the install failures, with a master node dropping out.

      Focusing on periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-azure-sdn-upgrade/1703590387039342592

      Build log shows

      INFO[2023-09-18T02:03:03Z] Using explicitly provided pull-spec for release initial (registry.ci.openshift.org/ocp/release:4.13.0-0.ci-2023-09-17-050449) 

      ipi-azure-conf shows region centralus (not the single zone westus)

      get ocp version: 4.13
      /output
      Azure region: centralus

      oc_cmds/nodes shows master-1 not ready

      ci-op-82xkimh8-0dd98-9g9wh-master-1                  NotReady   control-plane,master   82m   v1.26.7+c7ee51f   10.0.0.6      <none>        Red Hat Enterprise Linux CoreOS 413.92.202309141211-0 (Plow)  
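
      When scanning many oc_cmds/nodes files, a small sketch like the following can flag the not-ready rows. It assumes the whitespace-separated column order of `oc get nodes -o wide` (NAME, STATUS, ROLES, AGE, VERSION, INTERNAL-IP, ...); the helper names are hypothetical.

```python
def parse_node_row(row):
    """Split one `oc get nodes -o wide` row into its leading columns."""
    fields = row.split()
    return {
        "name": fields[0],
        "status": fields[1],
        "roles": fields[2].split(","),
        "age": fields[3],
        "version": fields[4],
        "internal_ip": fields[5],
    }


def not_ready_names(rows):
    """Return names of nodes whose STATUS column does not include Ready."""
    names = []
    for row in rows:
        node = parse_node_row(row)
        # STATUS can be compound, e.g. "Ready,SchedulingDisabled".
        if "Ready" not in node["status"].split(","):
            names.append(node["name"])
    return names


# Row copied from the oc_cmds/nodes output in this report.
row = (
    "ci-op-82xkimh8-0dd98-9g9wh-master-1                  NotReady   "
    "control-plane,master   82m   v1.26.7+c7ee51f   10.0.0.6      <none>        "
    "Red Hat Enterprise Linux CoreOS 413.92.202309141211-0 (Plow)"
)
print(not_ready_names([row]))
```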

      ci-op-82xkimh8-0dd98-9g9wh-master-1-boot.log shows ignition

      install log shows we have lost contact

      time="2023-09-18T03:15:33Z" level=error msg="Cluster operator kube-apiserver Degraded is True with GuardController_SyncError::NodeController_MasterNodesReady: GuardControllerDegraded: [Missing operand on node ci-op-82xkimh8-0dd98-9g9wh-master-0, Missing operand on node ci-op-82xkimh8-0dd98-9g9wh-master-2]\nNodeControllerDegraded: The master nodes not ready: node \"ci-op-82xkimh8-0dd98-9g9wh-master-1\" not ready since 2023-09-18 02:35:39 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)"
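
      To compare failures across runs, the node name, reason, and not-ready timestamp can be extracted from this installer log line. The sketch below is only an illustration: the regular expressions are written against the single message format shown above and may need adjusting for other operator messages.

```python
import re
from datetime import datetime, timezone

LOG_TIME_RE = re.compile(r'^time="([^"]+)"')
# The node name is wrapped in escaped quotes (\") inside the msg field.
NOT_READY_RE = re.compile(
    r'node \\?"(?P<node>[^"\\]+)\\?" not ready since '
    r'(?P<since>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \+0000 UTC '
    r'because (?P<reason>\w+)'
)


def parse_not_ready(line):
    """Return node, reason, and how long the node had been not ready
    at the time the installer logged the line, or None if no match."""
    logged = LOG_TIME_RE.search(line)
    match = NOT_READY_RE.search(line)
    if not (logged and match):
        return None
    logged_at = datetime.strptime(
        logged.group(1), "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    since = datetime.strptime(
        match.group("since"), "%Y-%m-%d %H:%M:%S"
    ).replace(tzinfo=timezone.utc)
    return {
        "node": match.group("node"),
        "reason": match.group("reason"),
        "since": since,
        "not_ready_for": logged_at - since,
    }


# Log line copied verbatim from the install log quoted in this report.
line = r'''time="2023-09-18T03:15:33Z" level=error msg="Cluster operator kube-apiserver Degraded is True with GuardController_SyncError::NodeController_MasterNodesReady: GuardControllerDegraded: [Missing operand on node ci-op-82xkimh8-0dd98-9g9wh-master-0, Missing operand on node ci-op-82xkimh8-0dd98-9g9wh-master-2]\nNodeControllerDegraded: The master nodes not ready: node \"ci-op-82xkimh8-0dd98-9g9wh-master-1\" not ready since 2023-09-18 02:35:39 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)"'''

result = parse_not_ready(line)
print(result["node"], result["reason"], result["not_ready_for"])
```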

      4.15 4.15.0-0.ci-2023-09-17-172341 and 4.14 4.14.0-0.ci-2023-09-18-020137

      Version-Release number of selected component (if applicable):

       

      How reproducible:

      We are seeing this on a high number of failed payloads for 4.14 and 4.15. Additional recent failures:

      4.14.0-0.ci-2023-09-17-012321
      aggregated-azure-sdn-upgrade-4.14-minor shows failures like "Passed 5 times, failed 0 times, skipped 0 times: we require at least 6 attempts to have a chance at success", indicating that only 5 of the 10 runs were valid.
      Checking install logs shows we have lost master-2

      time="2023-09-17T02:44:22Z" level=error msg="Cluster operator kube-apiserver Degraded is True with GuardController_SyncError::NodeController_MasterNodesReady: GuardControllerDegraded: [Missing operand on node ci-op-crj5cf00-0dd98-p5snd-master-1, Missing operand on node ci-op-crj5cf00-0dd98-p5snd-master-0]\nNodeControllerDegraded: The master nodes not ready: node \"ci-op-crj5cf00-0dd98-p5snd-master-2\" not ready since 2023-09-17 02:01:49 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)"
      

      oc_cmds/nodes also shows master-2 not ready

      4.15.0-0.nightly-2023-09-17-113421: install analysis failed due to azure tech preview; oc_cmds/nodes shows master-1 not ready.

      4.15.0-0.ci-2023-09-17-112341: aggregated-azure-sdn-upgrade-4.15-minor, only 5 of 10 runs are valid; sample oc_cmds/nodes shows master-0 not ready.


            sseethar Surya Seetharaman
            rh-ee-fbabcock Forrest Babcock
            Michael Fiedler Michael Fiedler