OpenShift Bugs / OCPBUGS-35371

Kubelet metrics endpoints experiencing prolonged outages

      Description of problem:

      Component Readiness reveals a potential regression with the following test:

      [sig-node][invariant] alert/TargetDown should not be at or above info in ns/kube-system

      Currently the test details link is showing 3 recent failures similar to the following:

      Jun 12 07:48:09.154 - 58s W namespace/kube-system alert/TargetDown alertstate/firing severity/warning ALERTS{alertname="TargetDown", alertstate="firing", job="kubelet", namespace="kube-system", prometheus="openshift-monitoring/k8s", service="kubelet", severity="warning"}

      This test ran 139 times in the 4 weeks before 4.15 GA and never failed. It has failed 3 out of 31 times in the last week, and all of those failures have occurred since yesterday.
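
      To put the numbers above in context, one common way to judge whether 3 failures in 31 runs differs meaningfully from 0 failures in 139 runs is a one-sided Fisher's exact test. The sketch below is only an illustration under that assumption; this report does not state how Component Readiness computes its result, and the function name is hypothetical.

```python
from math import comb


def fisher_one_sided(base_fail, base_total, sample_fail, sample_total):
    """Probability of seeing at least `sample_fail` failures in the sample
    if the base and sample periods shared a single failure rate
    (one-sided Fisher's exact test, hypergeometric tail)."""
    total = base_total + sample_total
    fails = base_fail + sample_fail
    upper = min(fails, sample_total)
    # Count the ways to place k of the failures into the sample runs.
    tail = sum(
        comb(fails, k) * comb(total - fails, sample_total - k)
        for k in range(sample_fail, upper + 1)
    )
    return tail / comb(total, sample_total)


# Counts taken from the report: 0 of 139 runs failed before 4.15 GA,
# 3 of 31 runs failed in the last week.
p_value = fisher_one_sided(base_fail=0, base_total=139, sample_fail=3, sample_total=31)
print(f"one-sided p-value: {p_value:.4f}")
```

      With these counts the p-value is roughly 0.0056, so under this test the recent failures would be unlikely to be chance alone.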

      Relevant slack thread: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1718200183009099

      It seems that the kubelet /metrics and /metrics/cadvisor endpoints fell over and later recovered.
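
      As a rough aid for checking which scrape targets were down, the sketch below groups Prometheus `up` samples by job, namespace, and service and reports the share of targets reporting 0. The sample data, instance addresses, and function names are hypothetical, and the input shape assumes the instant-vector result format of the Prometheus HTTP API. The actual TargetDown rule expression and threshold are not given in this report, so none is assumed here.

```python
from collections import defaultdict


def percent_down(samples):
    """Return {(job, namespace, service): percent of targets with up == 0}."""
    total = defaultdict(int)
    down = defaultdict(int)
    for sample in samples:
        labels = sample["metric"]
        key = (labels.get("job"), labels.get("namespace"), labels.get("service"))
        total[key] += 1
        # Prometheus returns sample values as strings, e.g. "0" or "1".
        if float(sample["value"][1]) == 0.0:
            down[key] += 1
    return {key: 100.0 * down[key] / total[key] for key in total}


def make_sample(instance, path, value):
    """Build one hypothetical `up` sample for a kubelet scrape target."""
    return {
        "metric": {
            "job": "kubelet",
            "namespace": "kube-system",
            "service": "kubelet",
            "instance": instance,
            "metrics_path": path,
        },
        "value": [0, value],  # placeholder timestamp, string value
    }


# Hypothetical data: three nodes, two endpoints each; one node is down on both.
samples = [
    make_sample("10.0.0.1:10250", "/metrics", "0"),
    make_sample("10.0.0.1:10250", "/metrics/cadvisor", "0"),
    make_sample("10.0.0.2:10250", "/metrics", "1"),
    make_sample("10.0.0.2:10250", "/metrics/cadvisor", "1"),
    make_sample("10.0.0.3:10250", "/metrics", "1"),
    make_sample("10.0.0.3:10250", "/metrics/cadvisor", "1"),
]
result = percent_down(samples)
print(result)
```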

              aos-node@redhat.com Node Team Bot Account
              kenzhang@redhat.com Ken Zhang
              Sunil Choudhary