-
Bug
-
Resolution: Done-Errata
-
Critical
-
4.13
-
Critical
-
Yes
-
SDN Sprint 242
-
1
-
Approved
-
False
-
Description of problem:
Azure cluster installation fails with the SDN network plugin.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-09-17-045811, 4.13.0-0.nightly-2023-09-18-210322
How reproducible:
Intermittent: 2 of 5 CI jobs failed.
Steps to Reproduce:
1. Install azure cluster with template aos-4_15/ipi-on-azure/versioned-installer-customer_vpc
Actual results:
Installation failed:

09-19 10:56:47.536 level=info msg=Cluster operator node-tuning Progressing is True with Reconciling: Working towards "4.15.0-0.nightly-2023-09-17-045811"
09-19 10:56:47.536 level=info msg=Cluster operator openshift-apiserver Progressing is True with APIServerDeployment_PodsUpdating: APIServerDeploymentProgressing: deployment/apiserver.openshift-apiserver: 1/3 pods have been updated to the latest generation
09-19 10:56:47.536 level=info msg=Cluster operator openshift-controller-manager Progressing is True with _DesiredStateNotYetAchieved: Progressing: deployment/controller-manager: updated replicas is 1, desired replicas is 3
09-19 10:56:47.536 level=info msg=Progressing: deployment/route-controller-manager: updated replicas is 1, desired replicas is 3
09-19 10:56:47.536 level=info msg=Cluster operator storage Progressing is True with AzureDiskCSIDriverOperatorCR_AzureDiskDriverNodeServiceController_Deploying::AzureFileCSIDriverOperatorCR_AzureFileDriverNodeServiceController_Deploying: AzureDiskCSIDriverOperatorCRProgressing: AzureDiskDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods
09-19 10:56:47.536 level=info msg=AzureFileCSIDriverOperatorCRProgressing: AzureFileDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods
09-19 10:56:47.536 level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
09-19 10:56:47.536 level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
09-19 10:56:47.537 level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
09-19 10:56:47.537 level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation
09-19 10:56:47.537 level=error msg=failed to initialize the cluster: Cluster operators authentication, console, control-plane-machine-set, kube-apiserver, machine-config are not available
09-19 10:56:47.537 [ERROR] Installation failed with error code '6'. Aborting execution.

oc get nodes
NAME STATUS ROLES AGE VERSION
jima41501-c646k-master-0 NotReady control-plane,master 3h35m v1.28.2+fde2a12
jima41501-c646k-master-1 Ready control-plane,master 3h35m v1.28.2+fde2a12
jima41501-c646k-master-2 Ready control-plane,master 3h35m v1.28.2+fde2a12
jima41501-c646k-worker-southcentralus1-x82cb Ready worker 3h22m v1.28.2+fde2a12
jima41501-c646k-worker-southcentralus2-jxbbt Ready worker 3h19m v1.28.2+fde2a12
jima41501-c646k-worker-southcentralus3-s4j6c Ready worker 3h18m v1.28.2+fde2a12

huirwang@huirwang-mac workspace % oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.15.0-0.nightly-2023-09-17-045811 False True True 3h31m WellKnownAvailable: The well-known endpoint is not yet available: kube-apiserver oauth endpoint https://10.0.0.7:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)
baremetal 4.15.0-0.nightly-2023-09-17-045811 True False False 3h30m
cloud-controller-manager 4.15.0-0.nightly-2023-09-17-045811 True False False 3h34m
cloud-credential 4.15.0-0.nightly-2023-09-17-045811 True False False 3h39m
cluster-autoscaler 4.15.0-0.nightly-2023-09-17-045811 True False False 3h30m
config-operator 4.15.0-0.nightly-2023-09-17-045811 True False False 3h31m
console 4.15.0-0.nightly-2023-09-17-045811 False True False 3h20m DeploymentAvailable: 0 replicas available for console deployment...
control-plane-machine-set 4.15.0-0.nightly-2023-09-17-045811 False True False 3h24m Missing 1 available replica(s)
csi-snapshot-controller 4.15.0-0.nightly-2023-09-17-045811 True False False 3h30m
dns 4.15.0-0.nightly-2023-09-17-045811 True True False 3h30m DNS "default" reports Progressing=True: "Have 5 available node-resolver pods, want 6."
etcd 4.15.0-0.nightly-2023-09-17-045811 True True True 3h29m NodeControllerDegraded: The master nodes not ready: node "jima41501-c646k-master-0" not ready since 2023-09-19 02:13:06 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
image-registry 4.15.0-0.nightly-2023-09-17-045811 True True False 3h19m Progressing: The registry is ready...
ingress 4.15.0-0.nightly-2023-09-17-045811 True False False 3h19m
insights 4.15.0-0.nightly-2023-09-17-045811 True False False 3h19m
kube-apiserver 4.15.0-0.nightly-2023-09-17-045811 False True True 3h31m StaticPodsAvailable: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 8
kube-controller-manager 4.15.0-0.nightly-2023-09-17-045811 True True True 3h27m NodeControllerDegraded: The master nodes not ready: node "jima41501-c646k-master-0" not ready since 2023-09-19 02:13:06 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-scheduler 4.15.0-0.nightly-2023-09-17-045811 True True True 3h27m NodeControllerDegraded: The master nodes not ready: node "jima41501-c646k-master-0" not ready since 2023-09-19 02:13:06 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-storage-version-migrator 4.15.0-0.nightly-2023-09-17-045811 True False False 3h30m
machine-api 4.15.0-0.nightly-2023-09-17-045811 True False False 3h17m
machine-approver 4.15.0-0.nightly-2023-09-17-045811 True False False 3h30m
machine-config 4.15.0-0.nightly-2023-09-17-045811 False False True 164m Cluster not available for [{operator 4.15.0-0.nightly-2023-09-17-045811}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [context deadline exceeded, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)]
marketplace 4.15.0-0.nightly-2023-09-17-045811 True False False 3h30m
monitoring 4.15.0-0.nightly-2023-09-17-045811 True False False 3h15m
network 4.15.0-0.nightly-2023-09-17-045811 True True False 3h31m DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)...
node-tuning 4.15.0-0.nightly-2023-09-17-045811 True True False 3h30m Working towards "4.15.0-0.nightly-2023-09-17-045811"
openshift-apiserver 4.15.0-0.nightly-2023-09-17-045811 True True True 3h24m APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()
openshift-controller-manager 4.15.0-0.nightly-2023-09-17-045811 True True False 3h27m Progressing: deployment/controller-manager: updated replicas is 1, desired replicas is 3...
openshift-samples 4.15.0-0.nightly-2023-09-17-045811 True False False 3h23m
operator-lifecycle-manager 4.15.0-0.nightly-2023-09-17-045811 True False False 3h30m
operator-lifecycle-manager-catalog 4.15.0-0.nightly-2023-09-17-045811 True False False 3h30m
operator-lifecycle-manager-packageserver 4.15.0-0.nightly-2023-09-17-045811 True False False 3h25m
service-ca 4.15.0-0.nightly-2023-09-17-045811 True False False 3h31m
storage 4.15.0-0.nightly-2023-09-17-045811 True True False 3h30m AzureDiskCSIDriverOperatorCRProgressing: AzureDiskDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods...

[systemd]
Failed Units: 1
  openshift-azure-routes.service
[core@jima41501-c646k-master-0 ~]$ sudo -i
[systemd]
Failed Units: 1
  openshift-azure-routes.service
[root@jima41501-c646k-master-0 ~]# systemctl status openshift-azure-routes.service
× openshift-azure-routes.service - Work around Azure load balancer hairpin
     Loaded: loaded (/etc/systemd/system/openshift-azure-routes.service; static)
     Active: failed (Result: exit-code) since Tue 2023-09-19 02:10:31 UTC; 3h 23min ago
   Duration: 55ms
TriggeredBy: ● openshift-azure-routes.path
    Process: 13908 ExecStart=/bin/bash /opt/libexec/openshift-azure-routes.sh start (code=exited, status=1/FAILURE)
   Main PID: 13908 (code=exited, status=1/FAILURE)
        CPU: 77ms

Sep 19 02:10:31 jima41501-c646k-master-0 systemd[1]: Started Work around Azure load balancer hairpin.
Sep 19 02:10:31 jima41501-c646k-master-0 openshift-azure-routes[13908]: processing v4 vip 10.0.0.4
Sep 19 02:10:31 jima41501-c646k-master-0 openshift-azure-routes[13908]: /opt/libexec/openshift-azure-routes.sh: line 130: ovnkContaine>
Sep 19 02:10:31 jima41501-c646k-master-0 systemd[1]: openshift-azure-routes.service: Main process exited, code=exited, status=1/FAILURE
Sep 19 02:10:31 jima41501-c646k-master-0 systemd[1]: openshift-azure-routes.service: Failed with result 'exit-code'.

4.13 failed in CI:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-azure-sdn/1703878138968150016/artifacts/e2e-azure-sdn/gather-extra/artifacts/oc_cmds/clusteroperators

NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.13.0-0.nightly-2023-09-18-210322 False True True 55m WellKnownAvailable: The well-known endpoint is not yet available: kube-apiserver oauth endpoint https://10.0.0.6:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)
baremetal 4.13.0-0.nightly-2023-09-18-210322 True False False 54m
cloud-controller-manager 4.13.0-0.nightly-2023-09-18-210322 True False False 56m
cloud-credential 4.13.0-0.nightly-2023-09-18-210322 True False False 58m
cluster-autoscaler 4.13.0-0.nightly-2023-09-18-210322 True False False 53m
config-operator 4.13.0-0.nightly-2023-09-18-210322 True False False 55m
console 4.13.0-0.nightly-2023-09-18-210322 False True False 45m DeploymentAvailable: 0 replicas available for console deployment...
control-plane-machine-set 4.13.0-0.nightly-2023-09-18-210322 False True False 47m Missing 1 available replica(s)
csi-snapshot-controller 4.13.0-0.nightly-2023-09-18-210322 True False False 54m
dns 4.13.0-0.nightly-2023-09-18-210322 True True False 53m DNS "default" reports Progressing=True: "Have 5 available node-resolver pods, want 6."
etcd 4.13.0-0.nightly-2023-09-18-210322 True True True 52m NodeControllerDegraded: The master nodes not ready: node "ci-op-pjxb081y-0c3e0-bxvlr-master-0" not ready since 2023-09-18 21:40:51 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
image-registry 4.13.0-0.nightly-2023-09-18-210322 True True False 45m NodeCADaemonProgressing: The daemon set node-ca is deploying node pods...
ingress 4.13.0-0.nightly-2023-09-18-210322 True False False 44m
insights 4.13.0-0.nightly-2023-09-18-210322 True False False 47m
kube-apiserver 4.13.0-0.nightly-2023-09-18-210322 False True True 53m StaticPodsAvailable: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 10
kube-controller-manager 4.13.0-0.nightly-2023-09-18-210322 True True True 51m NodeControllerDegraded: The master nodes not ready: node "ci-op-pjxb081y-0c3e0-bxvlr-master-0" not ready since 2023-09-18 21:40:51 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-scheduler 4.13.0-0.nightly-2023-09-18-210322 True True True 51m NodeControllerDegraded: The master nodes not ready: node "ci-op-pjxb081y-0c3e0-bxvlr-master-0" not ready since 2023-09-18 21:40:51 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-storage-version-migrator 4.13.0-0.nightly-2023-09-18-210322 True False False 54m
machine-api 4.13.0-0.nightly-2023-09-18-210322 True False False 46m
machine-approver 4.13.0-0.nightly-2023-09-18-210322 True False False 54m
machine-config 4.13.0-0.nightly-2023-09-18-210322 False False True 31m Cluster not available for [{operator 4.13.0-0.nightly-2023-09-18-210322}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)]
marketplace 4.13.0-0.nightly-2023-09-18-210322 True False False 53m
monitoring 4.13.0-0.nightly-2023-09-18-210322 True False False 43m
network 4.13.0-0.nightly-2023-09-18-210322 True True False 55m DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)...
node-tuning 4.13.0-0.nightly-2023-09-18-210322 True True False 53m Working towards "4.13.0-0.nightly-2023-09-18-210322"
openshift-apiserver 4.13.0-0.nightly-2023-09-18-210322 True True True 44m APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver (3 containers are waiting in pending apiserver-66d764fbd6-r2s8d pod)
openshift-controller-manager 4.13.0-0.nightly-2023-09-18-210322 True True False 54m Progressing: deployment/controller-manager: updated replicas is 1, desired replicas is 3...
openshift-samples 4.13.0-0.nightly-2023-09-18-210322 True False False 47m
operator-lifecycle-manager 4.13.0-0.nightly-2023-09-18-210322 True False False 54m
operator-lifecycle-manager-catalog 4.13.0-0.nightly-2023-09-18-210322 True False False 54m
operator-lifecycle-manager-packageserver 4.13.0-0.nightly-2023-09-18-210322 True False False 48m
service-ca 4.13.0-0.nightly-2023-09-18-210322 True False False 55m
storage 4.13.0-0.nightly-2023-09-18-210322 True True False 54m AzureDiskCSIDriverOperatorCRProgressing: AzureDiskDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods...
Expected results:
Installation succeeds
Additional info:
We suspect this is caused by PR https://github.com/openshift/machine-config-operator/pull/3878/files
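For triage across several CI runs, the unavailable operators can be picked out of a saved clusteroperators dump mechanically. The following is only a rough sketch: the function name unavailable_operators and the SAMPLE text are made up for illustration (the rows are copied from the 4.15 output above), and it assumes the dump is whitespace-separated with NAME, VERSION and AVAILABLE as the first three columns.

```python
def unavailable_operators(table_text: str) -> list[str]:
    """Return the names of cluster operators whose AVAILABLE column is 'False'.

    Hypothetical helper: expects the plain-text output of `oc get co`,
    one operator per line, columns separated by whitespace.
    """
    names = []
    for line in table_text.splitlines():
        fields = line.split()
        # Skip the header row and any line that lacks the six fixed
        # columns (NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE).
        if len(fields) < 6 or fields[0] == "NAME":
            continue
        if fields[2] == "False":
            names.append(fields[0])
    return names


# Illustrative input: a few rows taken from the 4.15 run in this report.
SAMPLE = """\
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
baremetal 4.15.0-0.nightly-2023-09-17-045811 True False False 3h30m
console 4.15.0-0.nightly-2023-09-17-045811 False True False 3h20m DeploymentAvailable: 0 replicas available for console deployment...
control-plane-machine-set 4.15.0-0.nightly-2023-09-17-045811 False True False 3h24m Missing 1 available replica(s)
kube-apiserver 4.15.0-0.nightly-2023-09-17-045811 False True True 3h31m StaticPodsAvailable: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 8
"""

print(unavailable_operators(SAMPLE))
```

Run against the full dump, the result should match the operator list in the installer's "failed to initialize the cluster" message, which is a quick way to confirm that two failed jobs hit the same set of operators.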
- blocks
-
OCPBUGS-19344 4.14 & 4.15 Azure Install Failures: Kubelet stopped posting node status
- Closed
- is blocked by
-
SDN-4137 Impact Azure cluster installation failed with sdn plugin
- Closed
- links to
-
RHEA-2023:7198 rpm