OCPBUGS-3182

[2109965] [4.11] OCI hook low-latency-hooks causing high container creation times under platform CPU load

Release Note Text:
Previously, the rollout times for kube-apiserver on a loaded cluster were slow and sometimes exceeded the 5 minute rollout timeout. With this fix, the rollout times should be shorter and within the 5 minute threshold.

      Description of problem:

      We have been investigating an issue with slow kube-apiserver rollout times. When a new revision is created, the current static pod is deleted and a new one is created to pick up the revision. There is a 5 minute timer on the creation; if this timeout is exceeded, the rollout reverts to the previous revision.
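
      For reference, the per-node rollout progress can be watched from the operator CR and the static pods; a minimal sketch, assuming cluster-admin access (the label selector below is an assumption):

      # Current vs. target revision per node, as reported by the kube-apiserver operator
      oc get kubeapiserver cluster -o jsonpath='{range .status.nodeStatuses[*]}{.nodeName}{" current="}{.currentRevision}{" target="}{.targetRevision}{"\n"}{end}'

      # Watch the static pods being deleted and recreated during the rollout
      oc get pods -n openshift-kube-apiserver -l app=openshift-kube-apiserver -o wide -w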

      The customer has been seeing failed rollouts due to this 5 minute timer being exceeded. There is load on the platform CPUs, with the biggest contributor being exec probe overhead, but there is still significant idle time (~50%).
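
      A quick way to see the platform CPU utilization during the rollout is per-CPU sampling pinned to the reserved set; a sketch, assuming CPUs 0,1 are the platform CPUs (substitute the reserved set from the performance profile):

      # 1 s per-CPU samples restricted to the assumed platform CPUs, run on the node
      mpstat -P 0,1 1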

      While not able to reproduce to the same degree as the customer, I was able to reproduce slow rollout times with a similar platform CPU overhead.

      From the logs, we see slow container creation times.
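
      For context, the creation times can be seen, for example, in the kubelet events and the CRI-O journal; a sketch, assuming access to the namespace and the node journal:

      # Kubelet Created/Started events for the kube-apiserver pod
      oc get events -n openshift-kube-apiserver --sort-by=.lastTimestamp | grep -E 'Created|Started'

      # CRI-O's view of container creation on the node
      journalctl -u crio | grep -i 'created container'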

      I added some instrumentation to the low_latency_hooks.sh script:

      # ...snip...
      pid=$(jq '.pid' /dev/stdin 2>&1)
      [[ $? -eq 0 && -n "${pid}" ]] || { logger "${0}: Failed to extract the pid: ${pid}"; exit 0; }
      logger "Start low_latency_hooks ${pid}"
      # ...snip...
      if [ "${mode}" = "ro" ]; then
          ip netns exec "${ns}" mount -o remount,ro /sys
          [ $? -eq 0 ] || exit 1 # Error out so the pod will not start with a writable /sys
      fi
      logger "Stop low_latency_hooks ${pid}"

      Analysing the logs for the five containers in the kube-apiserver pod, we see that the bulk of the time is spent in the hook:

      Container                       Total create time   Hook time
      insecure-readyz                 35s                 29s
      cert-syncer                     41s                 32s
      cert-regeneration-controller    73s                 54s
      kube-apiserver                  18s                 16s
      check-endpoints                 31s                 31s

      I ran another test where I removed the OCI hook and kept everything else the same; the results were dramatically different.

      Container create times:
      insecure-readyz - 1s
      cert-syncer - 1s
      cert-regeneration-controller - 1s
      kube-apiserver - 1s
      check-endpoints - 5s
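
      For reference, one way to take the hook out of the picture for such a test is to move its hooks.d configuration aside and restart CRI-O on the node; a sketch, where the JSON filename and path are assumptions (check /etc/containers/oci/hooks.d/ and /usr/share/containers/oci/hooks.d/ on the node):

      # On the affected node (e.g. via oc debug node/<node> and chroot /host); filename is assumed
      mv /etc/containers/oci/hooks.d/low-latency-hooks.json /root/
      systemctl restart crio

      # Restore the hook afterwards
      mv /root/low-latency-hooks.json /etc/containers/oci/hooks.d/
      systemctl restart crio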

      I was then able to run the same test in the customer's lab. In some joint testing we did with the customer we originally saw 4-5 minutes for a rollout. Without the hook, in the exact same environment, the total rollout time dropped to <= 2 minutes.

      Version-Release number of selected component (if applicable):
      4.9.37
      The issue is likely present in later releases as well, but has not been timed yet.

      How reproducible:
      100%

      Steps to Reproduce:
      1. Force a kube-apiserver rollout while a platform CPU load representative of the application is running (a sketch of one way to do this follows below).
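
      A sketch of one way to reproduce this, assuming stress-ng is available on the node and that CPUs 0,1 are the platform/reserved CPUs (both are assumptions; use the reserved set from the performance profile):

      # Generate sustained load on the assumed platform/reserved CPUs (run on the node)
      taskset -c 0,1 stress-ng --cpu 2 --timeout 15m &

      # Force a redeployment of the kube-apiserver static pods, i.e. a rollout
      oc patch kubeapiserver cluster --type merge \
        -p '{"spec":{"forceRedeploymentReason":"perf-test-'"$(date +%s)"'"}}'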

      Actual results:
      Slow rollout times sometimes exceeding the timeout

      Expected results:
      The rollout should complete within the timeout window

      Additional info:


            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory, and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHBA-2022:8698

