Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: None
Affects Version/s: 4.13, 4.14
Component/s: RHCOS
Labels:

Severity: Critical
Regression: No
Release Blocker: Approved
Blocked: False
Blocked Reason: None
Release Note Text: N/A
Release Note Type: Bug Fix
Release Note Status: Done
Blocked by Bugzilla Bug: https://bugzilla.redhat.com/show_bug.cgi?id=2179165
Target Version: 4.13.0

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

After the CoreOS 9.2 bump, we are seeing disruptions during node upgrades that affect all backends going through kube-apiserver. Our investigation shows that the kube-apiserver shutdown signal is missing during node reboot. This can be observed in most, if not all, micro upgrades.

My analysis is based on a comparison of the following two jobs:

pre-9.2 4.13 job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-aws-sdn-upgrade/1634235926655799296

post-9.2 4.14 job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-sdn-upgrade/1635210801117663232

The Spyglass chart clearly shows disruptions in the 4.14 job during node upgrade. The first disruption happens at 10:57:26, one second after node 10.0.232.149 started rebooting at 10:57:25. The shutdown signal is missing from both the events (see attached) and the kube-apiserver log: https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%7B%22datasource%22:%22Grafana%20Cloud%22,%22queries%22:%5B%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-nightly-4.14-e2e-aws-sdn-upgrade%2F1635210801117663232%5C%22%7D%20%7C%20unpack%20%20%7C%20pod_name%3D%5C%22kube-apiserver-ip-10-0-232-149.us-west-1.compute.internal%5C%22%22,%22refId%22:%22A%22%7D%5D,%22range%22:%7B%22from%22:%221678099663000%22,%22to%22:%221678705170000%22%7D%7D


For comparison, in the pre-9.2 4.13 job, the reboot of node 10.0.220.42 started at 18:03:45, and the shutdown signal is observed in both the events (see attached) and the kube-apiserver log.

2023-03-10 18:03:46	
I0310 18:03:45.807438      15 genericapiserver.go:978] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-kube-apiserver", Name:"kube-apiserver-ip-10-0-220-42.us-west-1.compute.internal", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'ShutdownInitiated' Received signal to terminate, becoming unready, but keeping serving 
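To check a downloaded kube-apiserver log for this event, a small filter along the following lines could be used. This is only a sketch: the function name `find_shutdown_events` and the regular expression are illustrative assumptions inferred from the single log line above, not a tool used in this investigation.

```python
import re

# Pattern inferred from the klog line quoted above (an assumption, not a spec):
# severity+date, time, pid, file:line], then an Event(...) message.
KLOG_EVENT = re.compile(
    r"[IWEF]\d{4} (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+\d+ \S+\] "
    r"Event\(.*?Name:\"(?P<pod>[^\"]+)\".*?reason: '(?P<reason>[^']+)'"
)


def find_shutdown_events(lines, reason="ShutdownInitiated"):
    """Return (time, pod) tuples for every event line with the given reason."""
    hits = []
    for line in lines:
        match = KLOG_EVENT.search(line)
        if match and match.group("reason") == reason:
            hits.append((match.group("time"), match.group("pod")))
    return hits
```

An empty result for the rebooting node's kube-apiserver pod would correspond to the missing-signal case seen in the 4.14 job.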


It is my understanding that a proper shutdown signal is needed for the readyz check to return false. Without it, the load balancer fails to remove the endpoint, which results in disruption.
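The expected behaviour can be modelled in a few lines. The following is a simplified sketch, not kube-apiserver code: the class `ReadinessGate` and its methods are hypothetical, and it only illustrates how a handled termination signal flips the readiness answer while the process keeps serving.

```python
import signal


class ReadinessGate:
    """Toy model of a server whose readyz answer depends on SIGTERM delivery."""

    def __init__(self):
        self.shutting_down = False

    def handle_sigterm(self, signum, frame):
        # Mirrors the 'ShutdownInitiated' step: become unready, keep serving.
        self.shutting_down = True

    def readyz(self):
        # A load balancer health check would drop the endpoint once this is False.
        return not self.shutting_down


gate = ReadinessGate()
# Register the handler so a delivered SIGTERM marks the server as unready.
signal.signal(signal.SIGTERM, gate.handle_sigterm)
```

If the signal never reaches the process, as appears to be the case in the post-9.2 runs, `handle_sigterm` never runs and `readyz()` keeps returning `True` until the process goes away.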

The same can be observed in post-9.2 4.13 jobs and in 4.14 Azure micro upgrade jobs.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:


4.13 With Shutdown Signal.png (221 kB, 2023/03/15 5:04 PM)
4.14 Missing Shutdown Signal.png (154 kB, 2023/03/15 5:04 PM)

blocks

COS-1926 Move RHCOS to RHEL 9.2 in OCP 4.13 (Closed)

TRT-898 4.14 AWS SDN Upgrade Disruption (Closed)

OCPBUGS-8710 [4.13] don't enforce PSa in 4.13 (Closed)

links to

https://gitlab.com/redhat/centos-stream/rpms/systemd/-/merge_requests/72

Assignee: Michael Nguyen
Reporter: Ken Zhang
QA Contact: Sunil Choudhary
Votes: 0
Watchers: 14
Created: 2023/03/15 5:03 PM
Updated: 2024/04/29 5:00 PM
Resolved: 2023/05/17 10:43 PM
