OCP Technical Release Team / TRT-566

Create invariant test for finding reason/ReadinessFailed events with "Client.Timeout exceeded"


    • Type: Story
    • Resolution: Done
    • Priority: Undefined

      The related Jira is TRT-529; see also this Slack thread for context.

      We will initially target the openshift-config-operator pods, since the symptom is quite common for them.

      In this chart, we see the symptom we're trying to track for openshift-config-operator. Note the reason/ReadinessFailed events with "Client.Timeout exceeded".
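      A minimal sketch of the core check, assuming the test already has the events for a job run in hand. The Event type and the constant names are hypothetical stand-ins (the real invariant test would use whatever event types the test framework provides); only the namespace, the reason and the "Client.Timeout exceeded" substring come from this issue. The sample messages are illustrative, not taken from a real run.

```go
package main

import (
	"fmt"
	"strings"
)

// Event is a hypothetical, trimmed-down stand-in for the events the
// invariant test would inspect; only the fields this check needs are modeled.
type Event struct {
	Namespace string
	Pod       string
	Reason    string
	Message   string
}

// Hypothetical constants describing the symptom tracked in this issue.
const (
	targetNamespace = "openshift-config-operator"
	targetReason    = "ReadinessFailed"
	timeoutMarker   = "Client.Timeout exceeded"
)

// findReadinessTimeouts returns the events that match the symptom:
// a ReadinessFailed reason in the target namespace whose message
// mentions "Client.Timeout exceeded".
func findReadinessTimeouts(events []Event) []Event {
	var matches []Event
	for _, e := range events {
		if e.Namespace == targetNamespace &&
			e.Reason == targetReason &&
			strings.Contains(e.Message, timeoutMarker) {
			matches = append(matches, e)
		}
	}
	return matches
}

func main() {
	// Illustrative sample events; not taken from a real job run.
	events := []Event{
		{Namespace: "openshift-config-operator", Pod: "openshift-config-operator-abc", Reason: "ReadinessFailed",
			Message: `Get "https://10.0.0.1:8443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)`},
		{Namespace: "openshift-config-operator", Pod: "openshift-config-operator-abc", Reason: "ReadinessFailed",
			Message: "HTTP probe failed with statuscode: 500"},
		{Namespace: "openshift-etcd", Pod: "etcd-0", Reason: "ReadinessFailed",
			Message: "Client.Timeout exceeded while awaiting headers"},
	}
	matches := findReadinessTimeouts(events)
	fmt.Printf("%d matching event(s)\n", len(matches))
	for _, m := range matches {
		fmt.Println(m.Pod)
	}
}
```

      Running this prints one match (openshift-config-operator-abc): the 500 response is a different failure mode, and the etcd event is outside the targeted namespace. Keeping the namespace as a single constant makes it easy to widen the check to other operators later.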

      In this log, we see:

      E0829 12:50:46.153659       1 timeout.go:141] post-timeout activity - time-elapsed: 401.468µs, GET "/healthz" result: <nil>
      E0829 12:55:55.138175       1 timeout.go:141] post-timeout activity - time-elapsed: 36.423581ms, GET "/healthz" result: <nil>
      E0829 12:57:04.484155       1 timeout.go:141] post-timeout activity - time-elapsed: 301.763968ms, GET "/healthz" result: <nil>
      E0829 12:58:13.315812       1 timeout.go:141] post-timeout activity - time-elapsed: 60.883527ms, GET "/healthz" result: <nil>
      E0829 13:00:31.233383       1 timeout.go:141] post-timeout activity - time-elapsed: 134.115856ms, GET "/healthz" result: <nil>
      E0829 13:02:49.168533       1 timeout.go:141] post-timeout activity - time-elapsed: 974.408µs, GET "/healthz" result: <nil>
      E0829 13:02:49.474493       1 timeout.go:141] post-timeout activity - time-elapsed: 305.37752ms, GET "/healthz" result: <nil>
      

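      Each of these entries is logged for a GET "/healthz" request that was still being handled after its timeout had fired, so there is significant latency in the probe replies.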

      The point of the test is to identify how frequently the problem happens and in which jobs.
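      Once the check runs in CI, answering "how often and in which jobs" is a matter of aggregating per-run results. A minimal sketch, assuming each job run yields a count of matching ReadinessFailed / "Client.Timeout exceeded" events; the JobRun and JobStats types, the summarizeByJob function, and the job names in the example are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// JobRun is a hypothetical record of one CI job run and the number of
// matching ReadinessFailed / "Client.Timeout exceeded" events found in it.
type JobRun struct {
	Job     string
	Matches int
}

// JobStats summarizes how often the symptom appeared for one job.
type JobStats struct {
	Job      string
	Runs     int
	Affected int // runs with at least one matching event
}

// Rate returns the fraction of runs in which the symptom appeared.
func (s JobStats) Rate() float64 {
	if s.Runs == 0 {
		return 0
	}
	return float64(s.Affected) / float64(s.Runs)
}

// summarizeByJob groups runs by job name and sorts the result by
// affected rate, highest first (ties broken by job name).
func summarizeByJob(runs []JobRun) []JobStats {
	byJob := map[string]*JobStats{}
	for _, r := range runs {
		s, ok := byJob[r.Job]
		if !ok {
			s = &JobStats{Job: r.Job}
			byJob[r.Job] = s
		}
		s.Runs++
		if r.Matches > 0 {
			s.Affected++
		}
	}
	out := make([]JobStats, 0, len(byJob))
	for _, s := range byJob {
		out = append(out, *s)
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Rate() != out[j].Rate() {
			return out[i].Rate() > out[j].Rate()
		}
		return out[i].Job < out[j].Job
	})
	return out
}

func main() {
	// Illustrative, made-up job names and counts.
	runs := []JobRun{
		{Job: "job-a", Matches: 3},
		{Job: "job-a", Matches: 0},
		{Job: "job-b", Matches: 0},
		{Job: "job-b", Matches: 0},
		{Job: "job-a", Matches: 1},
	}
	for _, s := range summarizeByJob(runs) {
		fmt.Printf("%s: %d/%d runs affected (%.0f%%)\n", s.Job, s.Affected, s.Runs, 100*s.Rate())
	}
}
```

      Counting affected runs rather than raw event totals keeps a single noisy run (job-a's first run has 3 matches) from dominating the ranking.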

      After this, we can take the next steps discussed in the Slack thread mentioned above, including trying to understand why the /healthz probe is taking so long.

            Dennis Periquet (dperique@redhat.com)
