Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-3182

[2109965] [4.11] oci hook Low-latency-hooks causing high container creation times under platform cpu load

XMLWordPrintable

    • None
    • CNF Compute Sprint 227, CNF Compute Sprint 228
    • 2
    • Proposed
    • False
    • Hide

      None

      Show
      None
    • Hide
      previously, the rollout times for kube-apiserver, on loaded cluster, were slow and sometimes exceeded the 5 min rollout timeout.
      With this fix, the rollout times should be shorter and within the 5 min threshold.
      Show
      previously, the rollout times for kube-apiserver, on loaded cluster, were slow and sometimes exceeded the 5 min rollout timeout. With this fix, the rollout times should be shorter and within the 5 min threshold.

      Description of problem:

      We have been investigating an issue with slow kube-apiserver rollout times. When a new revision is created, the current static pod is deleted and a new one created to pick up the revision. There is a 5 min timer on the creation, if this timeout is exceeded the rollout will revert to the previous revision.

      The customer has been seeing failed rollouts due to this 5 min timer being exceeded. There is load on the platform cpus with the biggest contributor being exec probe overhead, but there is still significant idle ~ 50%.

      While not able to reproduce to the same degree as the customer, I was able to reproduce slow rollout times with a similar platform cpu overhead.

      From the logs, we see slow container creation times.

      I added some instrumentation to the low_latency_hooks.sh script

      snip
      pid=$(jq '.pid' /dev/stdin 2>&1)
      logger "Start low_latency_hooks ${pid}"
      [[ $? -eq 0 && -n "${pid}" ]] || { logger "${0}: Failed to extract the pid: ${pid}"; exit 0; }
      snip
      if [ "${mode}" = "ro" ]; then
      ip netns exec "${ns}" mount -o remount,ro /sys
      [ $? -eq 0 ] || exit 1 # Error out so the pod will not start with a writable /sys
      fi
      logger "Stop low_latency_hooks ${pid}"

      Analysing the logs for the five running containers in the apiserver we see that the bulk of the time is being spent in the hook.

      insecure-readyz
      total container create time: 35s
      hook time: 29s

      cert-syncer
      total container create time: 41s
      hook time: 32s

      cert-regeneration-controller
      total container create time: 73s
      hook time: 54s

      kube-apiserver
      total container create time: 18s
      hook time: 16s

      check-endpoints
      total container create time: 31s
      hook time: 31s

      I ran another test where I removed the oci hook and kept everything else the same, the results were dramatically different.

      Container create times:
      insecure-readyz - 1s
      cert-syncer - 1s
      cert-regeneration-controller - 1s
      kube-apiserver -1s
      check-endpoints - 5s

      I was then able to run the same test in the customers lab. In some joint testing we did with the customer we originally saw 4-5 mins for a rollout. Without the hook in the exact same environment, the total rollout time dropped to <=2 mins.

      Version-Release number of selected component (if applicable):
      4.9.37
      Issue likely in later releases as well, have not timed yet

      How reproducible:
      100%

      Steps to Reproduce:
      1. Force a rollout with a platform cpu load representative of the application
      2.
      3.

      Actual results:
      Slow rollout times sometimes exceeding the timeout

      Expected results:
      Rollout should fit into the timeout window

      Additional info:

            titzhak Talor Itzhak
            titzhak Talor Itzhak
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

              Created:
              Updated:
              Resolved: