Type: Story
Resolution: Done
The related Jira is TRT-529; see this Slack thread for context.
We will target the openshift-config-operator pods first, since that operator is one of the most commonly affected.
In this chart, we see the symptom we're trying to track for openshift-config-operator. Note the reason/ReadinessFailed events with "Client.Timeout exceeded".
In this log, we see:
E0829 12:50:46.153659 1 timeout.go:141] post-timeout activity - time-elapsed: 401.468µs, GET "/healthz" result: <nil>
E0829 12:55:55.138175 1 timeout.go:141] post-timeout activity - time-elapsed: 36.423581ms, GET "/healthz" result: <nil>
E0829 12:57:04.484155 1 timeout.go:141] post-timeout activity - time-elapsed: 301.763968ms, GET "/healthz" result: <nil>
E0829 12:58:13.315812 1 timeout.go:141] post-timeout activity - time-elapsed: 60.883527ms, GET "/healthz" result: <nil>
E0829 13:00:31.233383 1 timeout.go:141] post-timeout activity - time-elapsed: 134.115856ms, GET "/healthz" result: <nil>
E0829 13:02:49.168533 1 timeout.go:141] post-timeout activity - time-elapsed: 974.408µs, GET "/healthz" result: <nil>
E0829 13:02:49.474493 1 timeout.go:141] post-timeout activity - time-elapsed: 305.37752ms, GET "/healthz" result: <nil>
You can see there is a lot of latency in the probe replies: each "post-timeout activity" entry means the /healthz handler was still doing work after the request had already timed out, and "time-elapsed" shows how late that activity was.
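To get a rough sense of how much latency is involved, something like the following could pull the "time-elapsed" values out of a saved pod log. This is only a sketch, assuming the log lines keep the exact format shown above; it is not part of the test itself.

package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"time"
)

// Matches the "post-timeout activity" lines shown above and captures the
// time-elapsed value (e.g. "401.468µs" or "36.423581ms").
var postTimeoutRE = regexp.MustCompile(`post-timeout activity - time-elapsed: (\S+), GET "/healthz"`)

func main() {
	var durations []time.Duration
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		m := postTimeoutRE.FindStringSubmatch(scanner.Text())
		if m == nil {
			continue
		}
		// time.ParseDuration understands the µs/ms suffixes used in the log.
		d, err := time.ParseDuration(m[1])
		if err != nil {
			continue
		}
		durations = append(durations, d)
	}
	if len(durations) == 0 {
		fmt.Println("no post-timeout activity lines found")
		return
	}
	var max, total time.Duration
	for _, d := range durations {
		total += d
		if d > max {
			max = d
		}
	}
	fmt.Printf("post-timeout /healthz replies: count=%d avg=%v max=%v\n",
		len(durations), total/time.Duration(len(durations)), max)
}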
The point of the test is to identify how frequently the problem is happening and in which jobs.
After that, we can take the next steps discussed in the Slack thread linked above, including trying to understand why the /healthz probe is taking so long.
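As a rough illustration of the kind of check the test could perform (not the actual implementation, which would live in the existing test/monitor framework), a client-go sketch could count the ReadinessFailed events mentioning "Client.Timeout exceeded" for openshift-config-operator pods. The namespace and pod-name prefix below are assumptions based on the target operator.

package main

import (
	"context"
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig in the default location; the real test would use
	// the cluster client provided by the test framework.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Assumed namespace for the operator's pods.
	events, err := client.CoreV1().Events("openshift-config-operator").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	count := 0
	for _, ev := range events.Items {
		// Count readiness-probe failures caused by client timeouts, the
		// symptom shown in the chart above.
		if ev.Reason == "ReadinessFailed" &&
			strings.Contains(ev.Message, "Client.Timeout exceeded") &&
			strings.HasPrefix(ev.InvolvedObject.Name, "openshift-config-operator") {
			count++
			fmt.Printf("%s %s: %s\n", ev.LastTimestamp, ev.InvolvedObject.Name, ev.Message)
		}
	}
	fmt.Printf("found %d probe-timeout readiness failures\n", count)
}

Aggregating counts like this per CI job is what would tell us how widespread the problem is.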