Uploaded image for project: 'OCP Technical Release Team'
  1. OCP Technical Release Team
  2. TRT-1721

Add Intervals for TargetDown metrics in 4.16

XMLWordPrintable

    • Icon: Story Story
    • Resolution: Done
    • Icon: Major Major
    • None
    • None
    • None
    • False
    • None
    • False

      Fallout of https://issues.redhat.com/browse/OCPBUGS-35371

      We simply do not have enough visibility into why these kubelet endpoints are going down, outside of a reboot, while kubelet itself stays up.

      A big step would be charting them with the intervals. Add a new monitor test to query prometheus at the end of the run looking for when these targets were down.

      Prom query:

      max by (node, metrics_path) (up{job="kubelet"}) == 0
      

      Then perhaps a test to flake if we see this happen outside of a node reboot. This seems to happen on every gcp-ovn (non-upgrade) job I look at. It does NOT seem to happen on AWS.

            rhn-engineering-dgoodwin Devan Goodwin
            rhn-engineering-dgoodwin Devan Goodwin
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

              Created:
              Updated:
              Resolved: