OCPBUGS-57290

[4.16] Kubelet metrics endpoints experiencing prolonged outages


    • Quality / Stability / Reliability
    • False
    • Yes
    • Rejected
    • OCP Node Sprint 273 (Green)
    • 1
    • Done
    • Bug Fix
    • Fixes a bug where the kubelet would stop reporting metrics if a stat call stalled in the kernel (for example, when the disk being stat'd is backed by NFS). Now the kubelet reports metrics even when one disk is stuck.

      Description of problem:

      Component Readiness reveals a potential regression with the following test:

      [sig-node][invariant] alert/TargetDown should not be at or above info in ns/kube-system

      Currently the test details link is showing 3 recent failures similar to the following:

      Jun 12 07:48:09.154 - 58s W namespace/kube-system alert/TargetDown alertstate/firing severity/warning ALERTS{alertname="TargetDown", alertstate="firing", job="kubelet", namespace="kube-system", prometheus="openshift-monitoring/k8s", service="kubelet", severity="warning"}

      This test ran 139 times in the 4 weeks before 4.15 GA and never failed. It has failed 3 out of 31 times in the last week, with all of the failures occurring since yesterday.

      Relevant slack thread: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1718200183009099

      It seems that the kubelet's /metrics and /metrics/cadvisor endpoints went down and later recovered.

              mdemaced Maysa De Macedo Souza
              kenzhang@redhat.com Ken Zhang
              Min Li Min Li