Resolution: Done
4.13, 4.14
Bug Fix
Description of problem:
After the CoreOS 9.2 bump, we are seeing disruptions during node upgrade that affect all backends going through kube-apiserver. Our investigation shows that the kube-apiserver is not receiving shutdown signals during node reboot. This can be observed in most, if not all, micro upgrades.

My analysis is based on a comparison between the following two jobs:
pre-9.2 4.13 job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-aws-sdn-upgrade/1634235926655799296
post-9.2 4.14 job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-sdn-upgrade/1635210801117663232

The Spyglass chart clearly shows disruptions in the 4.14 job during node upgrade. The first disruption happens at 10:57:26, one second after the node started rebooting at 10:57:25. The shutdown signal is missing from both the events (see attached) and the kube-apiserver log: https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%7B%22datasource%22:%22Grafana%20Cloud%22,%22queries%22:%5B%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-nightly-4.14-e2e-aws-sdn-upgrade%2F1635210801117663232%5C%22%7D%20%7C%20unpack%20%20%7C%20pod_name%3D%5C%22kube-apiserver-ip-10-0-232-149.us-west-1.compute.internal%5C%22%22,%22refId%22:%22A%22%7D%5D,%22range%22:%7B%22from%22:%221678099663000%22,%22to%22:%221678705170000%22%7D%7D

For comparison, in the pre-9.2 4.13 job the node reboot started at 18:03:45, and the shutdown signal is observed in both the events (see attached) and the kube-apiserver log:
2023-03-10 18:03:46 I0310 18:03:45.807438 15 genericapiserver.go:978] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-kube-apiserver", Name:"kube-apiserver-ip-10-0-220-42.us-west-1.compute.internal", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'ShutdownInitiated' Received signal to terminate, becoming unready, but keeping serving

It is my understanding that a proper shutdown signal is needed for the readyz check to start returning false. Without it, the load balancer does not remove the endpoint before the node goes down, which results in disruption. The same behavior can be observed in post-9.2 4.13 jobs and in 4.14 Azure micro upgrade jobs.
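
To illustrate the mechanism described above, here is a minimal sketch of a process that becomes unready on SIGTERM while continuing to serve. It is not the kube-apiserver implementation; the names ReadinessGate, on_sigterm, readyz, and install are hypothetical, and the 200/500 status codes are assumptions chosen for the example.

```python
import signal
import threading


class ReadinessGate:
    """Tracks whether the process should report ready on a readyz-style check."""

    def __init__(self):
        self._shutting_down = threading.Event()

    def on_sigterm(self, signum=None, frame=None):
        # Signal received: become unready, but keep serving requests.
        self._shutting_down.set()

    def readyz(self):
        # Return (status_code, body) the way a health endpoint might.
        # The concrete codes and bodies are assumptions for this sketch.
        if self._shutting_down.is_set():
            return 500, "shutting down"
        return 200, "ok"


def install(gate):
    # Hypothetical wiring: route SIGTERM to the gate instead of exiting at once.
    # Must be called from the main thread.
    signal.signal(signal.SIGTERM, gate.on_sigterm)
```

If SIGTERM never reaches the process, on_sigterm is never called and readyz keeps reporting ready until the process disappears. A load balancer polling that check then has no advance notice to drain the endpoint, which is consistent with the disruption starting one second after the reboot began.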
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
- blocks
COS-1926 Move RHCOS to RHEL 9.2 in OCP 4.13
- Closed
TRT-898 4.14 AWS SDN Upgrade Disruption
- Closed
OCPBUGS-8710 [4.13] don't enforce PSa in 4.13
- Closed
- links to