OpenShift Bugs / OCPBUGS-15654

UPI installation failed when disabling MachineAPI on 4.14

    • Critical
    • No
    • 2
    • OCP VE Sprint 239, OCP VE Sprint 240
    • 2
    • Rejected
    • False
    • 8/8: testing this again now that OCPBUGS-16889 is verified

      Description of problem:

      Installing an IPI SNO cluster with baselineCapabilitySet set to None in install-config.yaml fails at the bootstrap-complete stage.
      
      The node is Ready, but the etcd operator is degraded:
      $ oc get nodes
      NAME                       STATUS   ROLES                         AGE   VERSION
      jima03sno-cgqzt-master-0   Ready    control-plane,master,worker   52m   v1.27.3+ab0b8ee
      
      $ oc get co
      NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.14.0-0.nightly-2023-06-30-131338   False       False         True       49m     APIServicesAvailable: PreconditionNotReady...
      cloud-controller-manager                   4.14.0-0.nightly-2023-06-30-131338   True        False         False      50m     
      cloud-credential                                                                True        False         False      58m     
      config-operator                            4.14.0-0.nightly-2023-06-30-131338   True        False         False      49m     
      dns                                        4.14.0-0.nightly-2023-06-30-131338   True        False         False      48m     
      etcd                                       4.14.0-0.nightly-2023-06-30-131338   False       True          True       48m     StaticPodsAvailable: 0 nodes are active; 1 nodes are at revision 0; 0 nodes have achieved new revision 2
      image-registry                                                                                                               
      ingress                                                                         False       True          True       48m     The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), LoadBalancerReady=False (LoadBalancerPending: The LoadBalancer service is pending)
      kube-apiserver                             4.14.0-0.nightly-2023-06-30-131338   False       True          True       49m     StaticPodsAvailable: 0 nodes are active; 1 nodes are at revision 0; 0 nodes have achieved new revision 2
      kube-controller-manager                    4.14.0-0.nightly-2023-06-30-131338   True        False         False      45m     
      kube-scheduler                             4.14.0-0.nightly-2023-06-30-131338   True        False         False      45m     
      kube-storage-version-migrator              4.14.0-0.nightly-2023-06-30-131338   True        False         False      49m     
      machine-approver                           4.14.0-0.nightly-2023-06-30-131338   True        False         False      48m     
      machine-config                             4.14.0-0.nightly-2023-06-30-131338   True        False         False      48m     
      monitoring                                                                      False       True          True       104s    reconciling Alertmanager Route failed: creating Route object failed: the server could not find the requested resource (post routes.route.openshift.io), reconciling Thanos Querier Route failed: creating Route object failed: the server could not find the requested resource (post routes.route.openshift.io), reconciling Prometheus API Route failed: creating Route object failed: the server could not find the requested resource (post routes.route.openshift.io), client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline
      network                                    4.14.0-0.nightly-2023-06-30-131338   True        False         False      50m     
      openshift-apiserver                        4.14.0-0.nightly-2023-06-30-131338   False       False         True       49m     APIServicesAvailable: PreconditionNotReady
      openshift-controller-manager               4.14.0-0.nightly-2023-06-30-131338   True        False         False      42m     
      operator-lifecycle-manager                 4.14.0-0.nightly-2023-06-30-131338   True        False         False      48m     
      operator-lifecycle-manager-catalog         4.14.0-0.nightly-2023-06-30-131338   True        False         False      48m     
      operator-lifecycle-manager-packageserver                                        False       True          False      48m     ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver observed in phase Failed with reason: InstallCheckFailed, message: install timeout
      service-ca                                 4.14.0-0.nightly-2023-06-30-131338   True        False         False      49m  
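
A table like the one above is easier to triage when only the problem operators are listed. The following is a minimal sketch, not part of the original report: `unhealthy_operators` is a hypothetical helper that reads `oc get co` table output on stdin and prints operators that are not Available, are Degraded, or report no status at all. It assumes the three condition columns appear as the first run of three consecutive True/False tokens, because the VERSION column is empty for some operators.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not an oc feature): filter the text
# table printed by `oc get co` down to operators that need attention.
unhealthy_operators() {
  awk '
    $1 == "NAME" { next }   # skip the header row
    NF == 0 { next }        # skip blank lines
    {
      found = 0
      # Locate AVAILABLE/PROGRESSING/DEGRADED as the first run of three
      # consecutive True/False fields; VERSION may be missing before them.
      for (i = 2; i + 2 <= NF; i++) {
        if ($i ~ /^(True|False)$/ && $(i+1) ~ /^(True|False)$/ && $(i+2) ~ /^(True|False)$/) {
          found = 1
          # Report when Available is False or Degraded is True.
          if ($i == "False" || $(i+2) == "True") print $1
          break
        }
      }
      # Rows with no condition columns at all (e.g. an operator that never reported).
      if (!found) print $1 " (no status reported)"
    }
  '
}
```

Usage on a live cluster would be `oc get co | unhealthy_operators`.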
      
      
      The node also cannot be accessed over SSH; the attempt fails with the following error:
      # ssh -i ~/.ssh/openshift-qe.pem core@10.0.0.6
      The authenticity of host '10.0.0.6 (10.0.0.6)' can't be established.
      ECDSA key fingerprint is SHA256:rCrEiTqPIPuRU84ierPqo0J/UAv4+yiEoLOzlakfvGs.
      Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
      Warning: Permanently added '10.0.0.6' (ECDSA) to the list of known hosts.
      core@10.0.0.6: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
      
      Debugging on the node shows that the SSH public key was not written to the core user's authorized keys; /home/core/.ssh/authorized_keys.d/ignition is empty:
      sh-5.1# ls -ltr /home/core/.ssh/authorized_keys.d/        
      total 0
      -rw-------. 1 core core 0 Jul  3 00:48 ignition
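
The same check can be scripted. This is a minimal sketch under the assumption that it runs in a debug shell on the node; `empty_key_files` is a hypothetical helper that lists zero-length regular files in a directory, which is the symptom shown in the listing above.

```shell
#!/bin/sh
# Hypothetical helper: print every regular file in the given directory
# that has zero length.
empty_key_files() {
  dir=$1
  for f in "$dir"/*; do
    [ -f "$f" ] || continue            # skip non-files and an unmatched glob
    [ -s "$f" ] || printf '%s\n' "$f"  # -s is true only for non-empty files
  done
}
```

Usage on the node would be `empty_key_files /home/core/.ssh/authorized_keys.d`.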
      
      The machine-api operator is disabled, but the openshift-machine-api namespace and a service resource under it still exist:
      $ oc get all -n openshift-machine-api
      NAME                                  TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)            AGE
      service/cluster-autoscaler-operator   ClusterIP   172.30.129.2   <none>        443/TCP,9192/TCP   68m
      
      A similar failure also occurs on a UPI cluster when baselineCapabilitySet: None is specified in install-config.yaml.
      
      The must-gather log is attached.

      Version-Release number of selected component (if applicable):

      4.14.0-0.nightly-2023-06-30-131338

      How reproducible:

      Always, when installing an IPI SNO or UPI cluster with the MachineAPI capability disabled.

      Steps to Reproduce:

      1. Prepare install-config.yaml and set baselineCapabilitySet: None
      2. Install an IPI SNO or UPI cluster
      

      Actual results:

      Installation fails.

      Expected results:

      Installation is successful.

      Additional info:

      Installation succeeds when install-config.yaml sets baselineCapabilitySet: None together with additionalEnabledCapabilities: [MachineAPI].
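
For reference, the two configurations compared in this report differ only in the capabilities stanza of install-config.yaml. The following is a minimal sketch: `capabilities_stanza` is a hypothetical helper that prints that stanza with baselineCapabilitySet: None, plus an additionalEnabledCapabilities list when capability names are passed as arguments. The rest of install-config.yaml is not shown and is assumed to be unchanged.

```shell
#!/bin/sh
# Hypothetical helper: print the capabilities stanza for install-config.yaml.
# With no arguments it prints the failing configuration from this report;
# with arguments it adds them as additionalEnabledCapabilities.
capabilities_stanza() {
  printf 'capabilities:\n'
  printf '  baselineCapabilitySet: None\n'
  if [ "$#" -gt 0 ]; then
    printf '  additionalEnabledCapabilities:\n'
    for cap in "$@"; do
      printf '  - %s\n' "$cap"
    done
  fi
}
```

Calling `capabilities_stanza` reproduces the failing setup, while `capabilities_stanza MachineAPI` prints the stanza reported to install successfully.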

              bzamalut@redhat.com Bulat Zamalutdinov
              jinyunma Jinyun Ma
              Jinyun Ma Jinyun Ma