OpenShift Bugs / OCPBUGS-19365

Azure cluster installation failed with sdn plugin


    • Critical
    • Yes
    • SDN Sprint 242
    • 1
    • Approved
    • False

      Description of problem:

      Azure cluster installation fails when the SDN network plugin is used.
      
      

      Version-Release number of selected component (if applicable):

      4.15.0-0.nightly-2023-09-17-045811
      4.13.0-0.nightly-2023-09-18-210322 
      
      

      How reproducible:

      Intermittent: 2 of 5 CI jobs failed.
      
      

      Steps to Reproduce:

      1. Install an Azure cluster with the template aos-4_15/ipi-on-azure/versioned-installer-customer_vpc
      
      

      Actual results:

      Installation failed:
      09-19 10:56:47.536  level=info msg=Cluster operator node-tuning Progressing is True with Reconciling: Working towards "4.15.0-0.nightly-2023-09-17-045811"
      09-19 10:56:47.536  level=info msg=Cluster operator openshift-apiserver Progressing is True with APIServerDeployment_PodsUpdating: APIServerDeploymentProgressing: deployment/apiserver.openshift-apiserver: 1/3 pods have been updated to the latest generation
      09-19 10:56:47.536  level=info msg=Cluster operator openshift-controller-manager Progressing is True with _DesiredStateNotYetAchieved: Progressing: deployment/controller-manager: updated replicas is 1, desired replicas is 3
      09-19 10:56:47.536  level=info msg=Progressing: deployment/route-controller-manager: updated replicas is 1, desired replicas is 3
      09-19 10:56:47.536  level=info msg=Cluster operator storage Progressing is True with AzureDiskCSIDriverOperatorCR_AzureDiskDriverNodeServiceController_Deploying::AzureFileCSIDriverOperatorCR_AzureFileDriverNodeServiceController_Deploying: AzureDiskCSIDriverOperatorCRProgressing: AzureDiskDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods
      09-19 10:56:47.536  level=info msg=AzureFileCSIDriverOperatorCRProgressing: AzureFileDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods
      09-19 10:56:47.536  level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
      09-19 10:56:47.536  level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
      09-19 10:56:47.537  level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
      09-19 10:56:47.537  level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation
      09-19 10:56:47.537  level=error msg=failed to initialize the cluster: Cluster operators authentication, console, control-plane-machine-set, kube-apiserver, machine-config are not available
      09-19 10:56:47.537  [ERROR] Installation failed with error code '6'. Aborting execution.
      
      oc get nodes
      NAME                                           STATUS     ROLES                  AGE     VERSION
      jima41501-c646k-master-0                       NotReady   control-plane,master   3h35m   v1.28.2+fde2a12
      jima41501-c646k-master-1                       Ready      control-plane,master   3h35m   v1.28.2+fde2a12
      jima41501-c646k-master-2                       Ready      control-plane,master   3h35m   v1.28.2+fde2a12
      jima41501-c646k-worker-southcentralus1-x82cb   Ready      worker                 3h22m   v1.28.2+fde2a12
      jima41501-c646k-worker-southcentralus2-jxbbt   Ready      worker                 3h19m   v1.28.2+fde2a12
      jima41501-c646k-worker-southcentralus3-s4j6c   Ready      worker                 3h18m   v1.28.2+fde2a12
      huirwang@huirwang-mac workspace % oc get co
      NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.15.0-0.nightly-2023-09-17-045811   False       True          True       3h31m   WellKnownAvailable: The well-known endpoint is not yet available: kube-apiserver oauth endpoint https://10.0.0.7:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)
      baremetal                                  4.15.0-0.nightly-2023-09-17-045811   True        False         False      3h30m   
      cloud-controller-manager                   4.15.0-0.nightly-2023-09-17-045811   True        False         False      3h34m   
      cloud-credential                           4.15.0-0.nightly-2023-09-17-045811   True        False         False      3h39m   
      cluster-autoscaler                         4.15.0-0.nightly-2023-09-17-045811   True        False         False      3h30m   
      config-operator                            4.15.0-0.nightly-2023-09-17-045811   True        False         False      3h31m   
      console                                    4.15.0-0.nightly-2023-09-17-045811   False       True          False      3h20m   DeploymentAvailable: 0 replicas available for console deployment...
      control-plane-machine-set                  4.15.0-0.nightly-2023-09-17-045811   False       True          False      3h24m   Missing 1 available replica(s)
      csi-snapshot-controller                    4.15.0-0.nightly-2023-09-17-045811   True        False         False      3h30m   
      dns                                        4.15.0-0.nightly-2023-09-17-045811   True        True          False      3h30m   DNS "default" reports Progressing=True: "Have 5 available node-resolver pods, want 6."
      etcd                                       4.15.0-0.nightly-2023-09-17-045811   True        True          True       3h29m   NodeControllerDegraded: The master nodes not ready: node "jima41501-c646k-master-0" not ready since 2023-09-19 02:13:06 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
      image-registry                             4.15.0-0.nightly-2023-09-17-045811   True        True          False      3h19m   Progressing: The registry is ready...
      ingress                                    4.15.0-0.nightly-2023-09-17-045811   True        False         False      3h19m   
      insights                                   4.15.0-0.nightly-2023-09-17-045811   True        False         False      3h19m   
      kube-apiserver                             4.15.0-0.nightly-2023-09-17-045811   False       True          True       3h31m   StaticPodsAvailable: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 8
      kube-controller-manager                    4.15.0-0.nightly-2023-09-17-045811   True        True          True       3h27m   NodeControllerDegraded: The master nodes not ready: node "jima41501-c646k-master-0" not ready since 2023-09-19 02:13:06 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
      kube-scheduler                             4.15.0-0.nightly-2023-09-17-045811   True        True          True       3h27m   NodeControllerDegraded: The master nodes not ready: node "jima41501-c646k-master-0" not ready since 2023-09-19 02:13:06 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
      kube-storage-version-migrator              4.15.0-0.nightly-2023-09-17-045811   True        False         False      3h30m   
      machine-api                                4.15.0-0.nightly-2023-09-17-045811   True        False         False      3h17m   
      machine-approver                           4.15.0-0.nightly-2023-09-17-045811   True        False         False      3h30m   
      machine-config                             4.15.0-0.nightly-2023-09-17-045811   False       False         True       164m    Cluster not available for [{operator 4.15.0-0.nightly-2023-09-17-045811}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [context deadline exceeded, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)]
      marketplace                                4.15.0-0.nightly-2023-09-17-045811   True        False         False      3h30m   
      monitoring                                 4.15.0-0.nightly-2023-09-17-045811   True        False         False      3h15m   
      network                                    4.15.0-0.nightly-2023-09-17-045811   True        True          False      3h31m   DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)...
      node-tuning                                4.15.0-0.nightly-2023-09-17-045811   True        True          False      3h30m   Working towards "4.15.0-0.nightly-2023-09-17-045811"
      openshift-apiserver                        4.15.0-0.nightly-2023-09-17-045811   True        True          True       3h24m   APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()
      openshift-controller-manager               4.15.0-0.nightly-2023-09-17-045811   True        True          False      3h27m   Progressing: deployment/controller-manager: updated replicas is 1, desired replicas is 3...
      openshift-samples                          4.15.0-0.nightly-2023-09-17-045811   True        False         False      3h23m   
      operator-lifecycle-manager                 4.15.0-0.nightly-2023-09-17-045811   True        False         False      3h30m   
      operator-lifecycle-manager-catalog         4.15.0-0.nightly-2023-09-17-045811   True        False         False      3h30m   
      operator-lifecycle-manager-packageserver   4.15.0-0.nightly-2023-09-17-045811   True        False         False      3h25m   
      service-ca                                 4.15.0-0.nightly-2023-09-17-045811   True        False         False      3h31m   
      storage                                    4.15.0-0.nightly-2023-09-17-045811   True        True          False      3h30m   AzureDiskCSIDriverOperatorCRProgressing: AzureDiskDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods...
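
The operator table above can be triaged mechanically. The following is a minimal sketch, not part of the original report: the function name `unhealthy_operators` and the `SAMPLE` text are hypothetical, and the sketch assumes the default column order of `oc get co` (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED, SINCE, MESSAGE). It reports operators whose AVAILABLE column is not True or whose DEGRADED column is True.

```python
from typing import Dict, List

# Values the three status columns are expected to hold.
STATUS_VALUES = {"True", "False", "Unknown"}

# Shortened copy of a few rows from the table above (messages abbreviated).
SAMPLE = """\
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.15.0-0.nightly-2023-09-17-045811   False       True          True       3h31m   WellKnownAvailable: The well-known endpoint is not yet available
baremetal        4.15.0-0.nightly-2023-09-17-045811   True        False         False      3h30m
etcd             4.15.0-0.nightly-2023-09-17-045811   True        True          True       3h29m   NodeControllerDegraded: The master nodes not ready
machine-config   4.15.0-0.nightly-2023-09-17-045811   False       False         True       164m    Cluster not available
"""


def unhealthy_operators(table: str) -> Dict[str, List[str]]:
    """Map operator name -> list of problems found in `oc get co` text output.

    Hypothetical helper: assumes the default column layout of `oc get co`.
    """
    result: Dict[str, List[str]] = {}
    for line in table.splitlines():
        # Split into at most 7 fields so the free-text MESSAGE stays in one piece.
        fields = line.split(None, 6)
        if len(fields) < 6:
            continue
        name, _version, available, progressing, degraded = fields[:5]
        # Skip the header row, shell prompts, and anything else that is not a data row.
        if not {available, progressing, degraded} <= STATUS_VALUES:
            continue
        problems = []
        if available != "True":
            problems.append("unavailable")
        if degraded == "True":
            problems.append("degraded")
        if problems:
            result[name] = problems
    return result


if __name__ == "__main__":
    for operator, problems in unhealthy_operators(SAMPLE).items():
        print(f"{operator}: {', '.join(problems)}")
```

Applied to the full 4.15 table, the AVAILABLE=False rows are authentication, console, control-plane-machine-set, kube-apiserver, and machine-config, the same five operators the installer names in its final error message.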
      
      [systemd]
      Failed Units: 1
        openshift-azure-routes.service
      [core@jima41501-c646k-master-0 ~]$ sudo -i
      [systemd]
      Failed Units: 1
        openshift-azure-routes.service
      [root@jima41501-c646k-master-0 ~]# systemctl status openshift-azure-routes.service
      × openshift-azure-routes.service - Work around Azure load balancer hairpin
           Loaded: loaded (/etc/systemd/system/openshift-azure-routes.service; static)
           Active: failed (Result: exit-code) since Tue 2023-09-19 02:10:31 UTC; 3h 23min ago
         Duration: 55ms
      TriggeredBy: ● openshift-azure-routes.path
          Process: 13908 ExecStart=/bin/bash /opt/libexec/openshift-azure-routes.sh start (code=exited, status=1/FAILURE)
         Main PID: 13908 (code=exited, status=1/FAILURE)
              CPU: 77ms
      
      Sep 19 02:10:31 jima41501-c646k-master-0 systemd[1]: Started Work around Azure load balancer hairpin.
      Sep 19 02:10:31 jima41501-c646k-master-0 openshift-azure-routes[13908]: processing v4 vip 10.0.0.4
      Sep 19 02:10:31 jima41501-c646k-master-0 openshift-azure-routes[13908]: /opt/libexec/openshift-azure-routes.sh: line 130: ovnkContaine>
      Sep 19 02:10:31 jima41501-c646k-master-0 systemd[1]: openshift-azure-routes.service: Main process exited, code=exited, status=1/FAILURE
      Sep 19 02:10:31 jima41501-c646k-master-0 systemd[1]: openshift-azure-routes.service: Failed with result 'exit-code'.
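
The script error on line 130 is cut off in the status output above. The trailing `>` looks like the pager's marker for a chopped long line, so the full message is not recoverable from this paste. A minimal sketch, assuming that interpretation: the helper `truncated_lines` and the `SAMPLE_STATUS` text are hypothetical, while `--no-pager`, `--full`, and `-u` are standard `systemctl`/`journalctl` options that would print the line in full when rerun on the node.

```python
from typing import List

# Shortened copy of the journal lines shown above.
SAMPLE_STATUS = """\
Sep 19 02:10:31 jima41501-c646k-master-0 openshift-azure-routes[13908]: processing v4 vip 10.0.0.4
Sep 19 02:10:31 jima41501-c646k-master-0 openshift-azure-routes[13908]: /opt/libexec/openshift-azure-routes.sh: line 130: ovnkContaine>
Sep 19 02:10:31 jima41501-c646k-master-0 systemd[1]: openshift-azure-routes.service: Failed with result 'exit-code'.
"""

# Commands that avoid the pager, so long lines are not chopped.
FULL_LOG_COMMANDS = [
    ["systemctl", "status", "--no-pager", "--full", "openshift-azure-routes.service"],
    ["journalctl", "--no-pager", "-u", "openshift-azure-routes.service"],
]


def truncated_lines(status_output: str, marker: str = ">") -> List[str]:
    """Return lines that end with the assumed chopped-line marker.

    Heuristic only: a complete log line may legitimately end with '>'.
    """
    hits = []
    for line in status_output.splitlines():
        stripped = line.rstrip()
        if stripped.endswith(marker):
            hits.append(stripped)
    return hits


if __name__ == "__main__":
    for line in truncated_lines(SAMPLE_STATUS):
        print("possibly truncated:", line)
    print("rerun on the node with one of:")
    for command in FULL_LOG_COMMANDS:
        print("  " + " ".join(command))
```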
      
      
      4.13 also failed in CI:
      https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-azure-sdn/1703878138968150016/artifacts/e2e-azure-sdn/gather-extra/artifacts/oc_cmds/clusteroperators
      NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.13.0-0.nightly-2023-09-18-210322   False       True          True       55m     WellKnownAvailable: The well-known endpoint is not yet available: kube-apiserver oauth endpoint https://10.0.0.6:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)
      baremetal                                  4.13.0-0.nightly-2023-09-18-210322   True        False         False      54m     
      cloud-controller-manager                   4.13.0-0.nightly-2023-09-18-210322   True        False         False      56m     
      cloud-credential                           4.13.0-0.nightly-2023-09-18-210322   True        False         False      58m     
      cluster-autoscaler                         4.13.0-0.nightly-2023-09-18-210322   True        False         False      53m     
      config-operator                            4.13.0-0.nightly-2023-09-18-210322   True        False         False      55m     
      console                                    4.13.0-0.nightly-2023-09-18-210322   False       True          False      45m     DeploymentAvailable: 0 replicas available for console deployment...
      control-plane-machine-set                  4.13.0-0.nightly-2023-09-18-210322   False       True          False      47m     Missing 1 available replica(s)
      csi-snapshot-controller                    4.13.0-0.nightly-2023-09-18-210322   True        False         False      54m     
      dns                                        4.13.0-0.nightly-2023-09-18-210322   True        True          False      53m     DNS "default" reports Progressing=True: "Have 5 available node-resolver pods, want 6."
      etcd                                       4.13.0-0.nightly-2023-09-18-210322   True        True          True       52m     NodeControllerDegraded: The master nodes not ready: node "ci-op-pjxb081y-0c3e0-bxvlr-master-0" not ready since 2023-09-18 21:40:51 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
      image-registry                             4.13.0-0.nightly-2023-09-18-210322   True        True          False      45m     NodeCADaemonProgressing: The daemon set node-ca is deploying node pods...
      ingress                                    4.13.0-0.nightly-2023-09-18-210322   True        False         False      44m     
      insights                                   4.13.0-0.nightly-2023-09-18-210322   True        False         False      47m     
      kube-apiserver                             4.13.0-0.nightly-2023-09-18-210322   False       True          True       53m     StaticPodsAvailable: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 10
      kube-controller-manager                    4.13.0-0.nightly-2023-09-18-210322   True        True          True       51m     NodeControllerDegraded: The master nodes not ready: node "ci-op-pjxb081y-0c3e0-bxvlr-master-0" not ready since 2023-09-18 21:40:51 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
      kube-scheduler                             4.13.0-0.nightly-2023-09-18-210322   True        True          True       51m     NodeControllerDegraded: The master nodes not ready: node "ci-op-pjxb081y-0c3e0-bxvlr-master-0" not ready since 2023-09-18 21:40:51 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
      kube-storage-version-migrator              4.13.0-0.nightly-2023-09-18-210322   True        False         False      54m     
      machine-api                                4.13.0-0.nightly-2023-09-18-210322   True        False         False      46m     
      machine-approver                           4.13.0-0.nightly-2023-09-18-210322   True        False         False      54m     
      machine-config                             4.13.0-0.nightly-2023-09-18-210322   False       False         True       31m     Cluster not available for [{operator 4.13.0-0.nightly-2023-09-18-210322}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)]
      marketplace                                4.13.0-0.nightly-2023-09-18-210322   True        False         False      53m     
      monitoring                                 4.13.0-0.nightly-2023-09-18-210322   True        False         False      43m     
      network                                    4.13.0-0.nightly-2023-09-18-210322   True        True          False      55m     DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)...
      node-tuning                                4.13.0-0.nightly-2023-09-18-210322   True        True          False      53m     Working towards "4.13.0-0.nightly-2023-09-18-210322"
      openshift-apiserver                        4.13.0-0.nightly-2023-09-18-210322   True        True          True       44m     APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver (3 containers are waiting in pending apiserver-66d764fbd6-r2s8d pod)
      openshift-controller-manager               4.13.0-0.nightly-2023-09-18-210322   True        True          False      54m     Progressing: deployment/controller-manager: updated replicas is 1, desired replicas is 3...
      openshift-samples                          4.13.0-0.nightly-2023-09-18-210322   True        False         False      47m     
      operator-lifecycle-manager                 4.13.0-0.nightly-2023-09-18-210322   True        False         False      54m     
      operator-lifecycle-manager-catalog         4.13.0-0.nightly-2023-09-18-210322   True        False         False      54m     
      operator-lifecycle-manager-packageserver   4.13.0-0.nightly-2023-09-18-210322   True        False         False      48m     
      service-ca                                 4.13.0-0.nightly-2023-09-18-210322   True        False         False      55m     
      storage                                    4.13.0-0.nightly-2023-09-18-210322   True        True          False      54m     AzureDiskCSIDriverOperatorCRProgressing: AzureDiskDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods...
      
      
      

      Expected results:

      
      Installation succeeds
      
      

      Additional info:

      We suspect this is caused by PR https://github.com/openshift/machine-config-operator/pull/3878/files
      
      

            sseethar Surya Seetharaman
            huirwang Huiran Wang
            Huiran Wang Huiran Wang