OCP Technical Release Team / TRT-2061

Investigate API load from monitortest init


    • Type: Story
    • Resolution: Done
    • Priority: Major

      As fallout from OCPBUGS-50510, we can now get etcdserver timeouts where clients previously would have retried and succeeded. This seems to hit pod sandbox creation specifically, during the window when we're setting up monitortests prior to e2e testing.

      Two goals here:

      Dig into how heavy the load is during the monitortest StartCollection phase, using the kube API audit logs. Per deads, "openshift/cluster-debug-tools has a handy audit command". We should document whatever use of the tool proves useful here in our team drive somewhere.
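
      For reference, a minimal sketch of the kind of per-user/per-verb tally we'd want out of the audit logs. This is not the cluster-debug-tools implementation; it just assumes an audit log in JSON-lines format with audit.k8s.io/v1 Event fields:

      package main

      import (
          "bufio"
          "encoding/json"
          "fmt"
          "os"
          "sort"
      )

      // auditEvent pulls out just the fields we need from an
      // audit.k8s.io/v1 Event JSON line.
      type auditEvent struct {
          Verb string `json:"verb"`
          User struct {
              Username string `json:"username"`
          } `json:"user"`
      }

      func main() {
          if len(os.Args) != 2 {
              fmt.Fprintln(os.Stderr, "usage: audit-tally <audit.log>")
              os.Exit(1)
          }
          f, err := os.Open(os.Args[1])
          if err != nil {
              panic(err)
          }
          defer f.Close()

          counts := map[string]int{}
          sc := bufio.NewScanner(f)
          sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // audit lines can be long
          for sc.Scan() {
              var ev auditEvent
              if err := json.Unmarshal(sc.Bytes(), &ev); err != nil {
                  continue // skip malformed lines
              }
              counts[ev.User.Username+" "+ev.Verb]++
          }

          // Print the top 20 user/verb pairs by request count.
          type row struct {
              key string
              n   int
          }
          rows := make([]row, 0, len(counts))
          for k, n := range counts {
              rows = append(rows, row{k, n})
          }
          sort.Slice(rows, func(i, j int) bool { return rows[i].n > rows[j].n })
          top := 20
          if len(rows) < top {
              top = len(rows)
          }
          for _, r := range rows[:top] {
              fmt.Printf("%6d %s\n", r.n, r.key)
          }
      }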

      Then we likely proceed to slow down the monitortest initialization, perhaps running only a few monitortest setups at a time instead of all of them in parallel at once: https://github.com/openshift/origin/blob/50451ebe907765cbe3a5537ed089eb0045f9e0f6/pkg/monitortestframework/impl.go#L86
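
      One option: a minimal sketch of bounding the fan-out with errgroup.SetLimit. Here monitorTest and its StartCollection signature are simplified stand-ins for the real interface in pkg/monitortestframework, not the actual origin code:

      package main

      import (
          "context"
          "fmt"
          "time"

          "golang.org/x/sync/errgroup"
      )

      // monitorTest is a simplified stand-in for origin's MonitorTest
      // interface; the real StartCollection signature differs.
      type monitorTest struct{ name string }

      func (m monitorTest) StartCollection(ctx context.Context) error {
          time.Sleep(100 * time.Millisecond) // stand-in for API-heavy setup
          fmt.Println("started", m.name)
          return nil
      }

      func startAll(ctx context.Context, tests []monitorTest, limit int) error {
          g, ctx := errgroup.WithContext(ctx)
          g.SetLimit(limit) // at most `limit` StartCollection calls run at once
          for _, t := range tests {
              t := t // capture loop variable (pre-Go 1.22 semantics)
              g.Go(func() error { return t.StartCollection(ctx) })
          }
          return g.Wait()
      }

      func main() {
          tests := []monitorTest{{"etcd"}, {"kubelet"}, {"apiserver"}, {"network"}}
          if err := startAll(context.Background(), tests, 2); err != nil {
              fmt.Println("error:", err)
          }
      }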

      We'll then need to monitor for these errors coming out of that test; this Postgres query should help:

      select r.timestamp, r.url
      from prow_job_run_tests t
      join prow_job_run_test_outputs o on o.prow_job_run_test_id = t.id
      join prow_job_runs r on r.id = t.prow_job_run_id
      join prow_jobs j on j.id = r.prow_job_id
      where t.test_id = 260
        and j.release = '4.18'
        and o.output like '%etcdserver%'
      order by r.timestamp asc;
      

              Assignee: Forrest Babcock (rh-ee-fbabcock)
              Reporter: Devan Goodwin (rhn-engineering-dgoodwin)