OCP Technical Release Team / TRT-1117

Kubelet failing to update node status for short period of time


    • Type: Story
    • Priority: Normal
    • Resolution: Unresolved

      It has been observed that master-0 was unready for 4s during the conformance test in this job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-azure-ovn-upgrade/1674247083348987904

       

      Slack thread for the context: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1688049633796099

       

      The initial aggregated failure is about "clusteroperator/control-plane-machine-set should not change condition/Available": https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-azure-ovn-upgrade-4.14-micro-release-openshift-release-analysis-aggregator/1674247086704431104

       

      But the clusteroperator was changing state because of a master-0 state change. The master node's conditions changed at that time as follows:

       

         conditions:
         - lastHeartbeatTime: "2023-06-29T05:11:28Z"
      -    lastTransitionTime: "2023-06-29T04:35:36Z"
      -    message: kubelet has sufficient memory available
      -    reason: KubeletHasSufficientMemory
      -    status: "False"
      +    lastTransitionTime: "2023-06-29T05:16:49Z"
      +    message: Kubelet stopped posting node status.
      +    reason: NodeStatusUnknown
      +    status: Unknown
           type: MemoryPressure
         - lastHeartbeatTime: "2023-06-29T05:11:28Z"
      -    lastTransitionTime: "2023-06-29T04:35:36Z"
      -    message: kubelet has no disk pressure
      -    reason: KubeletHasNoDiskPressure
      -    status: "False"
      +    lastTransitionTime: "2023-06-29T05:16:49Z"
      +    message: Kubelet stopped posting node status.
      +    reason: NodeStatusUnknown
      +    status: Unknown
           type: DiskPressure
         - lastHeartbeatTime: "2023-06-29T05:11:28Z"
      -    lastTransitionTime: "2023-06-29T04:35:36Z"
      -    message: kubelet has sufficient PID available
      -    reason: KubeletHasSufficientPID
      -    status: "False"
      +    lastTransitionTime: "2023-06-29T05:16:49Z"
      +    message: Kubelet stopped posting node status.
      +    reason: NodeStatusUnknown
      +    status: Unknown
           type: PIDPressure
         - lastHeartbeatTime: "2023-06-29T05:11:28Z"
      -    lastTransitionTime: "2023-06-29T04:35:36Z"
      -    message: kubelet is posting ready status
      -    reason: KubeletReady
      -    status: "True"
      +    lastTransitionTime: "2023-06-29T05:16:49Z"
      +    message: Kubelet stopped posting node status.
      +    reason: NodeStatusUnknown
      +    status: Unknown
           type: Ready
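
      The transition above (reason: NodeStatusUnknown, status: Unknown on every condition) is what the node-lifecycle controller in kube-controller-manager writes when the kubelet's last heartbeat is older than the node monitor grace period (40s by default). A minimal sketch of that decision, assuming the default grace period and using the timestamps from the diff above (this is illustrative, not the actual controller code):

      ```go
      package main

      import (
      	"fmt"
      	"time"
      )

      // nodeMonitorGracePeriod mirrors the kube-controller-manager default
      // for --node-monitor-grace-period (an assumption for this sketch).
      const nodeMonitorGracePeriod = 40 * time.Second

      // markUnknown reports whether a condition whose last heartbeat was at
      // `heartbeat` should be rewritten to status Unknown as of time `now`.
      func markUnknown(heartbeat, now time.Time) bool {
      	return now.Sub(heartbeat) > nodeMonitorGracePeriod
      }

      func main() {
      	// Timestamps taken from the condition diff in this issue.
      	heartbeat, _ := time.Parse(time.RFC3339, "2023-06-29T05:11:28Z")
      	transition, _ := time.Parse(time.RFC3339, "2023-06-29T05:16:49Z")
      	// The gap is about 5m21s, well past the 40s grace period,
      	// so all conditions flip to Unknown.
      	fmt.Println(markUnknown(heartbeat, transition)) // true
      }
      ```

      The ~5 minute gap between lastHeartbeatTime (05:11:28) and lastTransitionTime (05:16:49) is consistent with the kubelet failing to post status for longer than the grace period.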

       

       

      Kubelet is complaining about api access around that time:

      Jun 29 05:16:24.250797 ci-op-cyqgzj4w-ed5cd-ll5md-master-0 kubenswrapper[2336]: E0629 05:16:24.250754 2336 kubelet_node_status.go:567] "Error updating node status, will retry" err="error getting node \"ci-op-cyqgzj4w-ed5cd-ll5md-master-0\": Get \"https://api-int.ci-op-cyqgzj4w-ed5cd.ci2.azure.devcluster.openshift.com:6443/api/v1/nodes/ci-op-cyqgzj4w-ed5cd-ll5md-master-0?resourceVersion=0&timeout=10s\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)"

            Assignee: Unassigned
            Reporter: Ken Zhang (kenzhang@redhat.com)