- Bug
- Resolution: Done-Errata
- Critical
- 4.13
- Critical
- No
- SDN Sprint 242
- 1
- Approved
- False
- N/A
- Release Note Not Required
Description of problem:
Install issues for 4.14 and 4.15 where we lose contact with the kubelet on master nodes.
This search shows it's happening on about 35% of Azure SDN 4.14 jobs over at least the past week. There are no OVN hits.
1703590387039342592/artifacts/e2e-azure-sdn-upgrade/gather-extra/artifacts/nodes.json
{ "lastHeartbeatTime": "2023-09-18T02:33:11Z", "lastTransitionTime": "2023-09-18T02:35:39Z", "message": "Kubelet stopped posting node status.", "reason": "NodeStatusUnknown", "status": "Unknown", "type": "Ready" }
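The NotReady condition above can be pulled out of a gathered nodes.json mechanically. A minimal sketch, assuming only the standard Kubernetes NodeList shape (the function name and path handling are illustrative, not part of the CI tooling):

```python
import json

def not_ready_nodes(nodes_json_path):
    """Return (name, reason, lastTransitionTime) for every node whose
    Ready condition is not "True" in a gathered nodes.json (NodeList)."""
    with open(nodes_json_path) as f:
        nodes = json.load(f)
    bad = []
    for node in nodes.get("items", []):
        for cond in node.get("status", {}).get("conditions", []):
            if cond.get("type") == "Ready" and cond.get("status") != "True":
                bad.append((node["metadata"]["name"],
                            cond.get("reason"),
                            cond.get("lastTransitionTime")))
    return bad
```

Run against the artifact above, this would report master-1 with reason NodeStatusUnknown.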
4.14 is interesting as it is a minor upgrade from 4.13 and we see the install failures with a master node dropping out.
Build log shows
INFO[2023-09-18T02:03:03Z] Using explicitly provided pull-spec for release initial (registry.ci.openshift.org/ocp/release:4.13.0-0.ci-2023-09-17-050449)
ipi-azure-conf shows region centralus (not the single zone westus)
get ocp version: 4.13 /output Azure region: centralus
oc_cmds/nodes shows master-1 not ready
ci-op-82xkimh8-0dd98-9g9wh-master-1 NotReady control-plane,master 82m v1.26.7+c7ee51f 10.0.0.6 <none> Red Hat Enterprise Linux CoreOS 413.92.202309141211-0 (Plow)
ci-op-82xkimh8-0dd98-9g9wh-master-1-boot.log shows ignition
install log shows we have lost contact
time="2023-09-18T03:15:33Z" level=error msg="Cluster operator kube-apiserver Degraded is True with GuardController_SyncError::NodeController_MasterNodesReady: GuardControllerDegraded: [Missing operand on node ci-op-82xkimh8-0dd98-9g9wh-master-0, Missing operand on node ci-op-82xkimh8-0dd98-9g9wh-master-2]\nNodeControllerDegraded: The master nodes not ready: node \"ci-op-82xkimh8-0dd98-9g9wh-master-1\" not ready since 2023-09-18 02:35:39 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)"
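When triaging many of these install logs, the node name and timestamp can be scraped out of the NodeControllerDegraded message. A small sketch; the regex is an assumption based on the messages quoted in this bug, not a guaranteed-stable format:

```python
import re

# Pull out which master the NodeController flagged as not ready, and since
# when, from a kube-apiserver Degraded message in the installer log.
NOT_READY_RE = re.compile(
    r'node "(?P<node>[^"]+)" not ready since (?P<since>[0-9-]+ [0-9:]+)'
)

def extract_not_ready(log_line):
    """Return (node, since) from a NodeControllerDegraded line, or None."""
    m = NOT_READY_RE.search(log_line)
    return (m.group("node"), m.group("since")) if m else None
```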
Version-Release number of selected component (if applicable):
4.15: 4.15.0-0.ci-2023-09-17-172341 and 4.14: 4.14.0-0.ci-2023-09-18-020137
How reproducible:
We are seeing this on a high number of failed payloads for 4.14 and 4.15. Additional recent failures:
4.14.0-0.ci-2023-09-17-012321
aggregated-azure-sdn-upgrade-4.14-minor shows failures like "Passed 5 times, failed 0 times, skipped 0 times: we require at least 6 attempts to have a chance at success", indicating that only 5 of the 10 runs were valid.
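The aggregator's complaint is purely arithmetic: with 10 runs launched but only 5 producing a usable result, it cannot reach its 6-attempt minimum regardless of pass rate. A toy restatement of that check (the threshold of 6 comes from the quoted message; the function is illustrative, not the aggregator's actual code):

```python
def aggregation_viable(passed, failed, skipped, min_attempts=6):
    """An aggregated job can only be judged if enough runs produced a
    usable result; runs that die during install (like the NotReady
    masters here) never count as attempts at all."""
    attempts = passed + failed + skipped
    return attempts >= min_attempts

# The quoted failure: 5 passes, 0 failures, 0 skips out of 10 launched runs.
```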
Checking install logs shows we have lost master-2
time="2023-09-17T02:44:22Z" level=error msg="Cluster operator kube-apiserver Degraded is True with GuardController_SyncError::NodeController_MasterNodesReady: GuardControllerDegraded: [Missing operand on node ci-op-crj5cf00-0dd98-p5snd-master-1, Missing operand on node ci-op-crj5cf00-0dd98-p5snd-master-0]\nNodeControllerDegraded: The master nodes not ready: node \"ci-op-crj5cf00-0dd98-p5snd-master-2\" not ready since 2023-09-17 02:01:49 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)"
oc_cmds/nodes also shows master-2 not ready
4.15.0-0.nightly-2023-09-17-113421 install analysis failed due to azure tech preview; oc_cmds/nodes shows master-1 not ready
4.15.0-0.ci-2023-09-17-112341 aggregated-azure-sdn-upgrade-4.15-minor: only 5 of 10 runs are valid; sample oc_cmds/nodes shows master-0 not ready
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
- blocks: OCPBUGS-19509 4.13.z Azure Install Failures: Kubelet stopped posting node status (Closed)
- is blocked by: OCPBUGS-19365 Azure cluster installation failed with sdn plugin (Closed)
- is cloned by: OCPBUGS-19509 4.13.z Azure Install Failures: Kubelet stopped posting node status (Closed)
- links to: RHSA-2023:5006 OpenShift Container Platform 4.14.z security update