OpenShift Bugs / OCPBUGS-57219

[4.18] Kubelet metrics endpoints experiencing prolonged outages

    • Quality / Stability / Reliability
    • False
    • Yes
    • Rejected
    • Done
    • Bug Fix
      * Previously, the kubelet stopped reporting metrics if a `stat` call stalled in the kernel (for example, when `stat` was run against a disk on a Network File System (NFS) mount). With this release, the kubelet reports metrics even if a disk is stuck. (link:https://issues.redhat.com/browse/OCPBUGS-57219[OCPBUGS-57219])
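The failure mode in the release note, one stalled `stat` call holding up all metrics, can be illustrated with a minimal sketch. This is not the kubelet's actual fix, which this ticket does not describe. It only shows the general idea of bounding a filesystem call with a timeout so the rest of the collection still completes. The names `call_with_timeout`, `collect_volume_metrics`, and `slow_stat` and the path `/stuck` are hypothetical, and the hang is simulated with a sleep.

```python
import os
import threading
import time


def call_with_timeout(fn, arg, timeout_s):
    """Run fn(arg) in a daemon thread and wait at most timeout_s seconds.

    Returns (ok, value). ok is False if the call timed out or raised OSError.
    A thread that is truly stuck in the kernel cannot be cancelled; it is
    simply abandoned, which is why it is marked as a daemon thread.
    """
    box = {}

    def worker():
        try:
            box["value"] = fn(arg)
        except OSError as exc:
            box["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    if thread.is_alive() or "error" in box:
        return False, None
    return True, box["value"]


def collect_volume_metrics(paths, stat_fn=os.stat, timeout_s=1.0):
    """Collect per-path metrics without letting one stuck path block the rest."""
    metrics = {}
    for path in paths:
        ok, st = call_with_timeout(stat_fn, path, timeout_s)
        metrics[path] = {
            "stat_ok": ok,
            "size_bytes": st.st_size if ok else None,
        }
    return metrics


def slow_stat(path, hang_s=0.5):
    """Hypothetical stand-in for a stat call that stalls on one path."""
    if path == "/stuck":
        time.sleep(hang_s)  # simulate a stalled NFS-backed stat
    return os.stat(path)


if __name__ == "__main__":
    print(collect_volume_metrics([".", "/stuck"], stat_fn=slow_stat, timeout_s=0.1))
```

With this shape, the healthy path still produces a result and the stuck path is reported as failed, instead of the whole collection hanging.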

      Description of problem:

      Component Readiness reveals a potential regression with the following test:

      [sig-node][invariant] alert/TargetDown should not be at or above info in ns/kube-system

      Currently the test details link is showing 3 recent failures similar to the following:

      Jun 12 07:48:09.154 - 58s W namespace/kube-system alert/TargetDown alertstate/firing severity/warning ALERTS{alertname="TargetDown", alertstate="firing", job="kubelet", namespace="kube-system", prometheus="openshift-monitoring/k8s", service="kubelet", severity="warning"}

      This test ran 139 times in the 4 weeks before 4.15 GA and never failed. It has failed 3 out of 31 times in the last week (with all the failures occurring since yesterday).
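To put the pass/fail counts above in perspective, here is a small sketch of a one-sided exact test on the two samples (0 failures in 139 runs versus 3 failures in 31 runs). It assumes the runs are independent and asks how likely it is that all 3 failures would land in the recent 31 runs if failures were spread at random across all 170. The function name `hypergeom_tail` is hypothetical, and this is not necessarily the exact method Component Readiness applies.

```python
from math import comb


def hypergeom_tail(base_fail, base_total, sample_fail, sample_total):
    """One-sided exact test on two pass/fail samples.

    Returns the probability that at least sample_fail of all observed
    failures fall inside the sample runs, if failures were distributed
    at random across base and sample runs combined.
    """
    total = base_total + sample_total
    fails = base_fail + sample_fail
    denom = comb(total, fails)
    p = 0.0
    for k in range(sample_fail, min(fails, sample_total) + 1):
        # k failures in the sample, the remaining failures in the base runs
        p += comb(sample_total, k) * comb(base_total, fails - k) / denom
    return p


# Counts taken from the description: 0/139 before, 3/31 in the last week.
p_value = hypergeom_tail(0, 139, 3, 31)

if __name__ == "__main__":
    print(f"one-sided p-value: {p_value:.4f}")
```

A p-value below 0.01 under these assumptions suggests the recent failures are unlikely to be noise, which is consistent with treating this as a potential regression.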

      Relevant slack thread: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1718200183009099

      It seems that the kubelet /metrics and /metrics/cadvisor endpoints fell over and later recovered.

              aos-node@redhat.com Node Team Bot Account
              kenzhang@redhat.com Ken Zhang
              Min Li