Type: Story
Resolution: Done
Priority: Major
Description:
Fallout from OCPBUGS-50510: we can now get etcdserver timeouts where previously clients would retry and succeed. This seems to hit pod sandbox creation specifically, during the window when we're setting up monitortests prior to e2e testing.
Two goals here:
Dig into how heavy the load is during the monitortest StartCollection phase, using the kube-apiserver audit logs. Per deads, "openshift/cluster-debug-tools has a handy audit command". We should document whatever use of the tool proves useful here somewhere in our team drive.
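For the first goal, a minimal sketch of the kind of counting we're after, assuming audit logs in the standard audit.k8s.io/v1 JSON-lines format (e.g. gathered with oc adm node-logs --role=master --path=kube-apiserver/audit.log). This is not the cluster-debug-tools audit command itself, just an illustration of tallying requests by user and verb inside the StartCollection window:

// Rough sketch only, not the cluster-debug-tools audit command: tally
// kube-apiserver audit events by user and verb within a time window
// (e.g. the window where monitortest StartCollection runs).
// Usage: go run countaudit.go -from 2025-02-10T14:00:00Z -to 2025-02-10T14:05:00Z < audit.log
package main

import (
	"bufio"
	"encoding/json"
	"flag"
	"fmt"
	"os"
	"sort"
	"time"
)

// auditEvent pulls out just the fields we need from an audit.k8s.io/v1 Event line.
type auditEvent struct {
	Verb                     string    `json:"verb"`
	Stage                    string    `json:"stage"`
	RequestReceivedTimestamp time.Time `json:"requestReceivedTimestamp"`
	User                     struct {
		Username string `json:"username"`
	} `json:"user"`
}

func main() {
	from := flag.String("from", "", "window start, RFC3339")
	to := flag.String("to", "", "window end, RFC3339")
	flag.Parse()

	start, err1 := time.Parse(time.RFC3339, *from)
	end, err2 := time.Parse(time.RFC3339, *to)
	if err1 != nil || err2 != nil {
		fmt.Fprintln(os.Stderr, "both -from and -to must be RFC3339 timestamps")
		os.Exit(1)
	}

	counts := map[string]int{} // "username verb" -> request count
	scanner := bufio.NewScanner(os.Stdin)
	scanner.Buffer(make([]byte, 1024*1024), 1024*1024) // audit lines can be long

	for scanner.Scan() {
		var ev auditEvent
		if err := json.Unmarshal(scanner.Bytes(), &ev); err != nil {
			continue // skip malformed lines
		}
		if ev.Stage != "ResponseComplete" {
			continue // count each request once, not once per audit stage
		}
		if ev.RequestReceivedTimestamp.Before(start) || ev.RequestReceivedTimestamp.After(end) {
			continue
		}
		counts[ev.User.Username+" "+ev.Verb]++
	}

	// Print the heaviest callers first.
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool { return counts[keys[i]] > counts[keys[j]] })
	for _, k := range keys {
		fmt.Printf("%8d  %s\n", counts[k], k)
	}
}

Whatever the cluster-debug-tools audit command gives us beyond this (by resource, by verb, latency buckets) is what we should capture in the team drive doc.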
Then we likely proceed to slow down monitortest initialization, perhaps running only a few monitortest setups at a time instead of launching all of them in parallel at once: https://github.com/openshift/origin/blob/50451ebe907765cbe3a5537ed089eb0045f9e0f6/pkg/monitortestframework/impl.go#L86
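For the second goal, a minimal sketch of the direction, assuming a simplified interface; monitorTest, StartCollection, and startCollectionBounded below are stand-ins rather than the real origin types, and the actual change would live around the linked line in impl.go:

// Sketch of bounding StartCollection concurrency instead of launching every
// monitortest at once. The interface is simplified; origin's real monitortest
// interface takes more arguments.
package monitortestsketch

import (
	"context"

	"golang.org/x/sync/errgroup"
)

type monitorTest interface {
	StartCollection(ctx context.Context) error
}

// startCollectionBounded runs StartCollection for every registered monitortest,
// but only maxInFlight at a time; errgroup's SetLimit makes Go() block once the
// limit is reached, so setups queue up instead of all firing at once.
func startCollectionBounded(ctx context.Context, tests []monitorTest, maxInFlight int) error {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(maxInFlight) // e.g. 3 at a time instead of one goroutine per monitortest

	for _, mt := range tests {
		mt := mt // capture the loop variable for the goroutine (pre-Go 1.22 semantics)
		g.Go(func() error {
			return mt.StartCollection(ctx)
		})
	}
	return g.Wait() // first error cancels ctx for the remaining setups
}

A plain buffered-channel semaphore would work equally well; the interesting knob is maxInFlight, tuned against the etcdserver error rate we see in the query below.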
We'll then need to monitor for these errors coming out of that test; this Postgres query should help:
select r.timestamp, r.url
from prow_job_run_tests t,
     prow_job_run_test_outputs o,
     prow_job_runs r,
     prow_jobs j
where t.test_id = 260
  and t.id = o.prow_job_run_test_id
  and r.id = t.prow_job_run_id
  and j.id = r.prow_job_id
  and j.release = '4.18'
  and o.output like '%etcdserver%'
order by r.timestamp asc;
is related to: OCPBUGS-50510 etcd timeouts causing failed pod sandbox creation writing network status (Closed)
links to: