Issue type: Story
Resolution: Done
Priority: Major
In https://issues.redhat.com/browse/OCPBUGS-13543 we think we may have found a pattern that explains our long-standing SLB workload disruption of around 8s per run. This was discovered by accident when promtail pods in some clusters had a shutdown problem that slowed down the node rebooting and going NotReady.
To test this theory we want to:
- deploy a DaemonSet in origin, prior to running the upgrade suite, that runs a process delaying shutdown by ~10s on all workers and masters.
- get this merged assuming it proves safe.
- open multiple PRs which increase this 10s to 1m, 2m, 3m, and 5m, and run /payload against each. Technically periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-aws-ovn-upgrade would probably be enough, but we might want to examine other clouds, though AWS is presently the most visible.
- look to see whether any of the delays affects disruption to the SLB backend.
If we find that at some threshold we consistently see better disruption results, we have a case for improving node shutdown.
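A minimal sketch of the kind of DaemonSet described above. All names, the namespace, and the image choice are illustrative assumptions, not the actual origin implementation; the idea is simply a per-node pod that traps SIGTERM and sleeps before exiting, so pod termination (and therefore node shutdown) is delayed by roughly the sleep duration:

```yaml
# Hypothetical shutdown-delay DaemonSet (names and namespace are made up).
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: shutdown-delayer
  namespace: e2e-shutdown-delay
spec:
  selector:
    matchLabels:
      app: shutdown-delayer
  template:
    metadata:
      labels:
        app: shutdown-delayer
    spec:
      tolerations:
      - operator: Exists           # schedule on masters as well as workers
      terminationGracePeriodSeconds: 30   # must exceed the sleep below
      containers:
      - name: delay
        image: registry.access.redhat.com/ubi8/ubi-minimal
        command:
        - /bin/sh
        - -c
        # On SIGTERM, sleep ~10s before exiting; bump the sleep to 1m/2m/etc.
        # in the follow-up PRs to vary the delay.
        - |
          trap 'sleep 10; exit 0' TERM
          while true; do sleep 1; done
```

The delay value is the only knob the follow-up PRs would need to change when testing the 1m through 5m variants.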