OCP Technical Release Team / TRT-1030

Deploy DaemonSet in origin prior to upgrade which delays node reboots


    • Type: Story
    • Resolution: Done
    • Priority: Major

      In https://issues.redhat.com/browse/OCPBUGS-13543 we think we may have found a pattern that explains our long-standing SLB workload disruption of around 8s per run. We discovered it by accident: promtail pods in clusters had a shutdown problem that slowed down nodes rebooting and going NotReady.

      To test this theory we want to:

      • Deploy a DaemonSet in origin before running the upgrade suite. It runs a process on all workers and masters that delays shutdown by ~10s.
      • Get this merged, assuming it proves safe.
      • Open multiple PRs that increase this 10s to 1m, 2m, 3m, and 5m, and run /payload against each. Technically, periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-aws-ovn-upgrade would probably be enough, but we might want to examine other clouds, though AWS is presently the most visible.
      • Look to see whether any of the delays affect disruption to the SLB backend.
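
      A rough sketch of what such a DaemonSet could look like, written as a Python function that builds the manifest for a given delay. Everything named here is an assumption for illustration and not taken from origin: the namespace, the object name, the placeholder image, and the approach of trapping SIGTERM in a shell and sleeping. Whether a delay introduced this way actually holds up the node reboot would still need to be confirmed in a test run.

```python
import json

# Hypothetical names: nothing below comes from the origin repository.
NAMESPACE = "e2e-shutdown-delay"
NAME = "shutdown-delay"
IMAGE = "registry.example.com/tools:latest"  # placeholder image providing /bin/sh


def build_delay_daemonset(delay_seconds: int) -> dict:
    """Return a DaemonSet manifest whose pod waits delay_seconds after SIGTERM."""
    # The shell traps SIGTERM, sleeps for the delay, then exits cleanly.
    script = (
        f"trap 'sleep {delay_seconds}; exit 0' TERM; "
        "while true; do sleep 1; done"
    )
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": NAME, "namespace": NAMESPACE},
        "spec": {
            "selector": {"matchLabels": {"app": NAME}},
            "template": {
                "metadata": {"labels": {"app": NAME}},
                "spec": {
                    # Tolerate every taint so the pod is scheduled on masters too.
                    "tolerations": [{"operator": "Exists"}],
                    # The grace period must exceed the delay, otherwise the
                    # container is killed before the sleep finishes.
                    "terminationGracePeriodSeconds": delay_seconds + 30,
                    "containers": [
                        {
                            "name": "delay",
                            "image": IMAGE,
                            "command": ["/bin/sh", "-c", script],
                        }
                    ],
                },
            },
        },
    }


# Delays from the plan above: 10s first, then 1m, 2m, 3m, and 5m.
DELAYS_SECONDS = [10, 60, 120, 180, 300]

if __name__ == "__main__":
    print(json.dumps(build_delay_daemonset(DELAYS_SECONDS[0]), indent=2))
```

      Keeping the delay as a single parameter means each follow-up PR only has to change one value before running /payload.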

      If we consistently see better disruption results above some delay threshold, we have a case for improving node shutdown.

              rh-ee-fbabcock Forrest Babcock
              rhn-engineering-dgoodwin Devan Goodwin
