OCP Technical Release Team / TRT-1830

Investigate sippy metrics loops / regression tracking for bugs


    • Type: Story
    • Resolution: Done
    • Priority: Major

      Something seems off with the metrics loop. We know that component readiness alerts are flapping, and we suspect others are as well. I just fixed an issue where a failure to update regressions for one view (4.18) would stop processing of all the remaining views (4.17). This somehow resulted in no metrics being published; we could observe the missing 4.17 metrics by curling the endpoint:

      curl https://sippy.dptools.openshift.org/metrics | grep component_readiness | grep 4.17 
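The one-off grep above can be turned into a small check that reports every view with no component_readiness lines at all, which makes a dropped view easy to spot when run repeatedly. This is a minimal sketch, not part of Sippy: the function name `missing_views` is hypothetical, the list of views is supplied by whoever runs it, and, like the original command, it matches plain substrings rather than parsing metric labels.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Sippy): read a Prometheus text-format dump
# on stdin and print "missing: <view>" for every view argument that does not
# appear in any line containing "component_readiness".
# Matching is plain substring matching, like `grep component_readiness | grep 4.17`.
missing_views() {
  local metrics view status=0
  metrics=$(grep 'component_readiness')   # keep only component readiness lines
  for view in "$@"; do
    if ! grep -qF -- "$view" <<<"$metrics"; then
      echo "missing: $view"
      status=1                            # non-zero exit if any view is absent
    fi
  done
  return "$status"
}
```

Usage, for example from a cron job or a `while sleep 60` loop: `curl -s https://sippy.dptools.openshift.org/metrics | missing_views 4.17 4.18`. A non-zero exit status means at least one view had no component readiness metrics at that moment, which would help correlate gaps like the 4.17 one with Sippy restarts or loop errors.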

      My understanding was that metrics published on a Prometheus endpoint persist until the process is restarted, so I'm confused how they could suddenly disappear, unless perhaps Sippy is rolling out.

      For example, this morning I saw the console operator alert flap, reporting green for three hours. Prometheus shows the gap in the chart (see attached screenshot). Note that 4.18 did not have a gap, only 4.17. This may again point to an error in the loop preventing us from getting to 4.17 (but I'm still not sure how metrics get wiped out).

      Does regression tracking need to be moved to fetchdata so that it runs single-threaded?

      While we're at it, we also have problems with the liveness/readiness probes: Sippy is still down during pod updates. I think there is a significant delay after a restart before we can serve requests, and that delay is not reflected in the corresponding Kubernetes probe endpoints.
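One way to quantify the suspected startup delay is to poll the new pod from the moment it starts until it actually answers requests, then compare that time with when Kubernetes marks the pod Ready. The sketch below assumes you pass in a URL that exercises a real request path; the function name `wait_until_serving`, the 300-second default timeout and the 5-second per-attempt limit are illustrative choices, not Sippy settings.

```shell
#!/usr/bin/env bash
# Hypothetical diagnostic (not part of Sippy): poll a URL once per second until
# it returns an HTTP success status, then print how long that took.
# The URL is an assumption; point it at an endpoint that does real work.
wait_until_serving() {
  local url=$1 timeout=${2:-300} start elapsed
  start=$SECONDS
  while (( SECONDS - start < timeout )); do
    # -f: treat HTTP errors (>= 400) as failures; --max-time bounds each attempt
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
      elapsed=$(( SECONDS - start ))
      echo "serving after ${elapsed}s"
      return 0
    fi
    sleep 1
  done
  echo "not serving after ${timeout}s" >&2
  return 1
}
```

Run it against the new pod directly (for example through `kubectl port-forward`) rather than the Service, since the Service may still route to old pods during a rollout, and watch `kubectl get pods -w` alongside it. If the pod turns Ready well before this function reports that it is serving, the readiness probe is checking something that becomes healthy earlier than the request path does.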

              rhn-engineering-dgoodwin Devan Goodwin