OpenShift Bugs / OCPBUGS-13543

Disruption for load balancer service that blocks both 4.14 and 4.13 payloads


Details

    • Critical
    • No
    • Rejected
    • False

    Description

      Recently, both 4.14 and 4.13 payloads have been blocked by this disruption variant: "service-load-balancer-with-pdb-new-connections/service-load-balancer-with-pdb-reused-connections xxxxxx should not be worse".

      The following chart shows the problem: https://grafana-loki.ci.openshift.org/d/ISnBj4LVk/disruption?from=1681226000299&to=1683818000299&var-platform=azure&var-platform=aws&var-platform=gcp&var-percentile=P50&var-backend=service-load-balancer-with-pdb-new-connections&var-backend=service-load-balancer-with-pdb-reused-connections&var-releases=4.13&var-from_releases=4.12&var-networks=ovn&var-topologies=ha&var-architectures=amd64&var-min_job_runs=10&orgId=1

      Our analysis indicates that a change in node lifecycle behavior might be related to this regression. The 4.14 and 4.13 data are very similar, so the detailed description below uses 4.13 data, since there were fewer changes on the 4.13 branch.

      Take 4.13 CI as an example

      Good payload: https://sippy.dptools.openshift.org/sippy-ng/release/4.13/tags/4.13.0-0.ci-2023-05-06-033438

      Problem payload: https://sippy.dptools.openshift.org/sippy-ng/release/4.13/tags/4.13.0-0.ci-2023-05-08-161719

      But we do not see any PRs in the problem payload!

      Analysis of the problem indicates that the reporting of NodeNotReady status seems to be related to the issue. In the following example job runs, if you expand the first interval chart and go to the node-state section, you will see that, for the worker nodes, the yellow bar indicating the node-not-ready window is narrow and lasts only about 20s in the problem job. In comparison, in the good-payload job this bar typically spans about 2 minutes.

      This is a job from the good payload: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-aws-ovn-upgrade/1654691394544996352

      For comparison, here is a job from the problem payload: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-aws-ovn-upgrade/1655609894633476096
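      As a side note, the width of that not-ready window can also be read directly from each node's Ready condition: LastTransitionTime records when the condition last flipped. Below is a minimal client-go sketch for doing that; it is an illustration only and assumes live cluster access through a standard kubeconfig, which is not something the CI artifacts provide.

      package main

      import (
          "context"
          "fmt"

          corev1 "k8s.io/api/core/v1"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/tools/clientcmd"
      )

      func main() {
          // Resolve a kubeconfig the same way kubectl/oc does (KUBECONFIG, ~/.kube/config).
          config, err := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(
              clientcmd.NewDefaultClientConfigLoadingRules(),
              &clientcmd.ConfigOverrides{},
          ).ClientConfig()
          if err != nil {
              panic(err)
          }
          client := kubernetes.NewForConfigOrDie(config)

          nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
          if err != nil {
              panic(err)
          }
          for _, node := range nodes.Items {
              for _, cond := range node.Status.Conditions {
                  if cond.Type == corev1.NodeReady {
                      // LastTransitionTime marks when Ready last flipped; the gap between
                      // the flip to NotReady and the flip back to Ready is the width of
                      // the yellow "node not ready" bar in the interval chart.
                      fmt.Printf("%s Ready=%s lastTransition=%v\n",
                          node.Name, cond.Status, cond.LastTransitionTime)
                  }
              }
          }
      }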

      The CCM (4.14) or the in-tree kube-controller-manager (4.13) removes the instance from the load balancer based on this status.

      For example, from this log: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-aws-ovn-upgrade/1655609894633476096/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods/openshift-kube-controller-manager_kube-controller-manager-ip-10-0-132-208.us-west-2.compute.internal_kube-controller-manager.log

      I0508 18:31:52.138738 1 aws_loadbalancer.go:1483] Instances removed from load-balancer a24bcbae1386d4d6ca8c710cc358322d
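      To make the mechanism concrete, here is a simplified sketch of the node-exclusion pattern behind that log line. It is illustrative only, not the actual aws cloud-provider code, and the function names are hypothetical: nodes whose Ready condition is not True are dropped from the set of load-balancer backends.

      package main

      import (
          "fmt"

          corev1 "k8s.io/api/core/v1"
      )

      // isNodeReady reports whether a node's Ready condition is True.
      func isNodeReady(node *corev1.Node) bool {
          for _, cond := range node.Status.Conditions {
              if cond.Type == corev1.NodeReady {
                  return cond.Status == corev1.ConditionTrue
              }
          }
          return false
      }

      // lbCandidates keeps only the nodes that should remain behind the service
      // load balancer; NotReady nodes are filtered out, which is what produces the
      // "Instances removed from load-balancer" message quoted above.
      func lbCandidates(nodes []*corev1.Node) []*corev1.Node {
          var out []*corev1.Node
          for _, n := range nodes {
              if isNodeReady(n) {
                  out = append(out, n)
              }
          }
          return out
      }

      func main() {
          // Hypothetical example: one Ready worker and one NotReady worker.
          ready := &corev1.Node{}
          ready.Name = "worker-ready"
          ready.Status.Conditions = []corev1.NodeCondition{{Type: corev1.NodeReady, Status: corev1.ConditionTrue}}

          notReady := &corev1.Node{}
          notReady.Name = "worker-not-ready"
          notReady.Status.Conditions = []corev1.NodeCondition{{Type: corev1.NodeReady, Status: corev1.ConditionFalse}}

          for _, n := range lbCandidates([]*corev1.Node{ready, notReady}) {
              fmt.Println("keeping behind the load balancer:", n.Name)
          }
      }

      The ordering is the key point: the instance is only pulled from the load balancer once Ready flips to False, so a late flip leaves the instance in rotation while the node goes down.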

      Also, in the bad payload run, based on the journal logs, the short ~20s "node not ready" window seems to start only after the node is rebooted. In the good payload run, the event starts before the reboot begins (typically more than 1 minute earlier).

      We suspect that because node readiness does not change sooner, the instance is not removed from the load balancer in time, which causes the disruption we are seeing.

      What is odd is that this problem seemed to exist before and was somehow fixed on 4/20/2023, but it then broke again between the example payloads above.

      Please ping TRT if any more information is needed.

      Attachments

        1. image-2023-05-11-15-10-13-466.png
          120 kB
          Forrest Babcock
        2. image-2023-05-11-15-12-02-986.png
          73 kB
          Forrest Babcock
        3. image-2023-05-11-15-18-48-807.png
          68 kB
          Forrest Babcock
        4. screenshot-1.png
          91 kB
          Forrest Babcock


          People

            rh-ee-fbabcock Forrest Babcock
            kenzhang@redhat.com Ken Zhang
            Sunil Choudhary Sunil Choudhary
            Votes: 0
            Watchers: 7
