OCPBUGS-55755 (OpenShift Bugs)

Gather CI metrics to monitor apiserver/etcd health


    • Bug
    • Resolution: Done
    • Major
    • 4.20.0
    • Etcd
    • Quality / Stability / Reliability
    • Important
    • Approved

      As part of the fallout from OCPBUGS-55445 and OCPBUGS-54222, it is clear that we need a better means of monitoring the health of the apiserver and etcd. While we decided to allow the "regression" for the prior two bugs, we need to close this gap to eliminate the ambiguity we saw in that run.

      TRT has a framework whereby origin can run a Prometheus query at the end of the run for you and include the results in a datafile. That datafile is automatically ingested into a BigQuery table and then becomes chartable. In the near future we hope that this could automatically be monitored for regressions in a manner that is fed into component readiness. This happens today in https://github.com/openshift/release/blob/master/ci-operator/step-registry/gather/extra/gather-extra-commands.sh#L361

      WARNING: the framework is minimally used today and may need improvements as we get into things. This will require working closely with TRT.

      The catch is that we can only chart a single value per job run. We cannot work with full time series data as you would graph in Prometheus, so the queries must boil down to useful single values.
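To illustrate the single-value constraint, here is a minimal Python sketch that accepts a Prometheus instant-query response only if it reduces to exactly one sample. The response layout follows the Prometheus HTTP API (`/api/v1/query`). The function name `single_value` and the example payload are hypothetical, and this is not how the TRT framework itself is implemented.

```python
from typing import Any, Dict


def single_value(response: Dict[str, Any]) -> float:
    """Return the one sample in a Prometheus instant-query response.

    Raises ValueError if the query failed or did not reduce to exactly
    one sample, since only one value per job run can be charted.
    """
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    data = response["data"]
    if data["resultType"] == "scalar":
        # Scalars are returned as [timestamp, "value"].
        return float(data["result"][1])
    if data["resultType"] != "vector":
        raise ValueError(f"expected an instant vector or scalar, got {data['resultType']}")
    samples = data["result"]
    if len(samples) != 1:
        raise ValueError(f"expected exactly 1 sample, got {len(samples)}")
    # Each vector sample carries "value": [timestamp, "value-as-string"].
    return float(samples[0]["value"][1])


# Hypothetical response for a query that was aggregated down to one series.
example = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {}, "value": [1700000000.0, "0.0123"]}],
    },
}
print(single_value(example))
```

A query that still returns one series per instance or per verb would be rejected here, which is a hint that it needs an outer aggregation such as `sum(...)` or `max(...)` before it can be charted.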

      This list needs improving and expanding, but here are some initial ideas:

      • various percentiles for etcd disk fsync
      • percentages of requests hitting various HTTP status codes
      • request duration percentiles
      • something capturing counts/percentages of request timeouts
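The ideas above could be turned into single-value PromQL queries along the following lines. This is a sketch under stated assumptions: the helper names, the `2h` window (assumed to cover the whole job run), and the choice of HTTP 504 as a proxy for request timeouts are all illustrative, not settled. The metric names (`etcd_disk_wal_fsync_duration_seconds_bucket`, `apiserver_request_total`, `apiserver_request_duration_seconds_bucket`) are the standard etcd and kube-apiserver ones, but the exact queries would need review with TRT.

```python
def percentile_query(bucket_metric: str, quantile: float, window: str) -> str:
    """Build a PromQL query for one percentile of a histogram over the whole window."""
    return (
        f"histogram_quantile({quantile}, "
        f"sum by (le) (increase({bucket_metric}[{window}])))"
    )


def status_percent_query(code_matcher: str, window: str) -> str:
    """Build a PromQL query for the percentage of apiserver requests matching a status code."""
    total = "apiserver_request_total"
    return (
        f"100 * sum(increase({total}{{{code_matcher}}}[{window}])) "
        f"/ sum(increase({total}[{window}]))"
    )


WINDOW = "2h"  # assumption: long enough to cover the entire job run

# Hypothetical result names mapped to queries that each yield one value.
CANDIDATE_QUERIES = {
    "etcd_wal_fsync_p99_seconds": percentile_query(
        "etcd_disk_wal_fsync_duration_seconds_bucket", 0.99, WINDOW
    ),
    "apiserver_request_duration_p99_seconds": percentile_query(
        "apiserver_request_duration_seconds_bucket", 0.99, WINDOW
    ),
    "apiserver_5xx_percent": status_percent_query('code=~"5.."', WINDOW),
    # Assumption: 504 responses approximate request timeouts.
    "apiserver_timeout_percent": status_percent_query('code="504"', WINDOW),
}

for name, query in CANDIDATE_QUERIES.items():
    print(f"{name}: {query}")
```

Summing by `le` only, or over all labels, is what collapses each query to a single series. Keeping extra labels such as `verb` or `instance` would produce multiple values per run.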

      After discussion with Nick and David, the request here is to have this gap closed in 4.20. I have marked this as a 4.20 release blocker.

              Dean West (dwest@redhat.com)
              Devan Goodwin (rhn-engineering-dgoodwin)
              Ge Liu