Type: Bug
Resolution: Done-Errata
Priority: Critical
Fix Version/s: 4.14.0
Affects Version/s: 4.13
Component/s: Networking / openshift-sdn
Labels:
- trt-incident

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Story Points:
None
Severity:
Critical
Regression:
No

Target Backport Versions:
None
Target Version:

4.14.0
Release Blocker:
Approved
Sprint:
SDN Sprint 242
sprint_count:
1

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status:
None
Release Note Type:
Release Note Not Required
Release Note Text:
N/A

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

Installs for 4.14 and 4.15 are failing because we lose contact with the kubelet on master nodes.

https://search.ci.openshift.org/?search=Kubelet+stopped+posting+node+status&maxAge=168h&context=1&type=build-log&name=periodic.*4.14.*azure.*sdn&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

This search shows it is happening on about 35% of Azure SDN 4.14 jobs over at least the past week. There are no OVN hits.

1703590387039342592/artifacts/e2e-azure-sdn-upgrade/gather-extra/artifacts/nodes.json

                    {
                        "lastHeartbeatTime": "2023-09-18T02:33:11Z",
                        "lastTransitionTime": "2023-09-18T02:35:39Z",
                        "message": "Kubelet stopped posting node status.",
                        "reason": "NodeStatusUnknown",
                        "status": "Unknown",
                        "type": "Ready"
                    }
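The Ready condition shown above can be checked for across all nodes in a gathered nodes.json. The following is a minimal sketch, assuming the artifact is a standard Kubernetes NodeList (an object with an `items` array, as produced by `oc get nodes -o json`). The function name `not_ready_nodes` and the node name in `SAMPLE` are hypothetical and used only for illustration; the condition values are copied from the excerpt above.

```python
import json

# Hypothetical sample shaped like a Kubernetes NodeList. The node name is a
# placeholder; the condition fields mirror the excerpt quoted in this ticket.
SAMPLE = json.loads("""
{
  "items": [
    {
      "metadata": {"name": "example-master"},
      "status": {
        "conditions": [
          {
            "lastHeartbeatTime": "2023-09-18T02:33:11Z",
            "lastTransitionTime": "2023-09-18T02:35:39Z",
            "message": "Kubelet stopped posting node status.",
            "reason": "NodeStatusUnknown",
            "status": "Unknown",
            "type": "Ready"
          }
        ]
      }
    }
  ]
}
""")


def not_ready_nodes(node_list):
    """Return (name, status, reason, since) for nodes whose Ready condition is not "True"."""
    results = []
    for node in node_list.get("items", []):
        name = node.get("metadata", {}).get("name", "<unknown>")
        for cond in node.get("status", {}).get("conditions", []):
            if cond.get("type") == "Ready" and cond.get("status") != "True":
                results.append(
                    (name, cond.get("status"), cond.get("reason"),
                     cond.get("lastTransitionTime"))
                )
    return results


for row in not_ready_nodes(SAMPLE):
    print(*row, sep="\t")
```

To run this against a real artifact, load the file with `json.load` and pass the result to `not_ready_nodes` in place of `SAMPLE`.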

The 4.14 case is interesting because the job is a minor upgrade from 4.13, and we see the failure at install time, with a master node dropping out.

Focusing on periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-azure-sdn-upgrade/1703590387039342592

Build log shows

INFO[2023-09-18T02:03:03Z] Using explicitly provided pull-spec for release initial (registry.ci.openshift.org/ocp/release:4.13.0-0.ci-2023-09-17-050449)

ipi-azure-conf shows region centralus (not the single zone westus)

get ocp version: 4.13
/output
Azure region: centralus

oc_cmds/nodes shows master-1 not ready

ci-op-82xkimh8-0dd98-9g9wh-master-1                  NotReady   control-plane,master   82m   v1.26.7+c7ee51f   10.0.0.6      <none>        Red Hat Enterprise Linux CoreOS 413.92.202309141211-0 (Plow)

ci-op-82xkimh8-0dd98-9g9wh-master-1-boot.log shows ignition

install log shows we have lost contact

time="2023-09-18T03:15:33Z" level=error msg="Cluster operator kube-apiserver Degraded is True with GuardController_SyncError::NodeController_MasterNodesReady: GuardControllerDegraded: [Missing operand on node ci-op-82xkimh8-0dd98-9g9wh-master-0, Missing operand on node ci-op-82xkimh8-0dd98-9g9wh-master-2]\nNodeControllerDegraded: The master nodes not ready: node \"ci-op-82xkimh8-0dd98-9g9wh-master-1\" not ready since 2023-09-18 02:35:39 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)"
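When triaging many runs, the not-ready node and the time it dropped out can be pulled from install log lines like the one above. This is a minimal sketch, assuming the `NodeControllerDegraded` message keeps the wording quoted in this ticket and reports times as `+0000 UTC`. The names `NOT_READY` and `parse_not_ready` are hypothetical, and `LINE` is an abridged copy of the log line above.

```python
import re

# Matches the "node ... not ready since ... because ..." fragment. Quotes
# around the node name may be backslash-escaped inside the msg="..." field.
NOT_READY = re.compile(
    r'node \\?"(?P<node>[^"\\]+)\\?" not ready since '
    r'(?P<since>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \+0000 UTC '
    r'because (?P<reason>\w+)'
)


def parse_not_ready(line):
    """Return one dict (node, since, reason) per not-ready node found in a log line."""
    return [m.groupdict() for m in NOT_READY.finditer(line)]


# Abridged from the install log line quoted in this ticket.
LINE = (
    'level=error msg="NodeControllerDegraded: The master nodes not ready: '
    'node \\"ci-op-82xkimh8-0dd98-9g9wh-master-1\\" not ready since '
    '2023-09-18 02:35:39 +0000 UTC because NodeStatusUnknown '
    '(Kubelet stopped posting node status.)"'
)

for hit in parse_not_ready(LINE):
    print(hit["node"], hit["since"], hit["reason"])
```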

Payloads: 4.15.0-0.ci-2023-09-17-172341 (4.15) and 4.14.0-0.ci-2023-09-18-020137 (4.14)

Version-Release number of selected component (if applicable):

How reproducible:

We are seeing this on a high number of failed payloads for 4.14 and 4.15. Additional recent failures:

4.14.0-0.ci-2023-09-17-012321
aggregated-azure-sdn-upgrade-4.14-minor shows failures like "Passed 5 times, failed 0 times, skipped 0 times: we require at least 6 attempts to have a chance at success", indicating that only 5 of the 10 runs were valid.
Checking the install logs shows we have lost master-2:

time="2023-09-17T02:44:22Z" level=error msg="Cluster operator kube-apiserver Degraded is True with GuardController_SyncError::NodeController_MasterNodesReady: GuardControllerDegraded: [Missing operand on node ci-op-crj5cf00-0dd98-p5snd-master-1, Missing operand on node ci-op-crj5cf00-0dd98-p5snd-master-0]\nNodeControllerDegraded: The master nodes not ready: node \"ci-op-crj5cf00-0dd98-p5snd-master-2\" not ready since 2023-09-17 02:01:49 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)"

oc_cmds/nodes also shows master-2 not ready

4.15.0-0.nightly-2023-09-17-113421: install analysis failed due to azure tech preview; oc_cmds/nodes shows master-1 not ready.

4.15.0-0.ci-2023-09-17-112341: in aggregated-azure-sdn-upgrade-4.15-minor only 5 of 10 runs are valid; a sample oc_cmds/nodes shows master-0 not ready.

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

blocks

OCPBUGS-19509 4.13.z Azure Install Failures: Kubelet stopped posting node status

Closed

is blocked by

OCPBUGS-19365 Azure cluster installation failed with sdn plugin

Closed

is cloned by

OCPBUGS-19509 4.13.z Azure Install Failures: Kubelet stopped posting node status

Closed

links to

openshift/machine-config-operator#3928: [release-4.14] OCPBUGS-19344: Ignore invoking nbctl calls if its SDN

RHSA-2023:5006 OpenShift Container Platform 4.14.z security update

Assignee: Surya Seetharaman

Reporter: Forrest Babcock

Need Info From: None

Contributors: None

QA Contact: Huiran Wang

Doc Contact: None

Votes: 0

Watchers: 12

Created: 2023/09/18 3:18 PM

Updated: 2025/07/25 11:51 AM

Resolved: 2023/10/31 1:43 PM
