OpenShift Bugs / OCPBUGS-2892

Recovering from periodic provisioning failure on IBM Cloud VPC


    • Bug
    • Resolution: Done
    • 4.12
    • 4.12, 4.11
    • Documentation

      Description of problem:

      There is a known issue that results in a periodic provisioning failure when installing a cluster to IBM Cloud VPC. The Verification step in "Deploying the cluster" needs to be updated to explain how to recover nodes that fail to provision during the installation process. IBM (Jeff Nowicki) to provide steps to manually recover the cluster.

      Proposed Mitigation Documentation:

      A machine provisioning issue - https://issues.redhat.com/browse/OCPBUGS-1327 - was reported against IBM Cloud VPC. A fix to mitigate the issue has been delivered in this release. This NetworkManager enhancement - https://github.com/dracutdevs/dracut/commit/112f03f9e225a790cbc6378c70773c6af5e7ee34 - in RHEL9 will address the root cause of the provisioning issue and will be included in a future OpenShift release.

      During initial cluster installation, the install may fail because compute machines appear to be stuck in the `Provisioned` phase. If you check the corresponding IBM Cloud VPC virtual server instances, they will nonetheless report a status of `Running`.

      If you encounter this situation, the following steps recover the stuck machine(s) and allow the cluster installation to complete. Perform the recovery actions from the host where the initial installation was run.

      1. Verify that the IBM Cloud VPC private control plane application load balancer (ALB) is active and operating as required.

    • Check the status of your cluster's Private Control Plane ALB and verify that it is `active`.
        # infraID can be found from the Infrastructure resource via:
        $ oc get infrastructure/cluster -ojson | jq -r '.status.infrastructureName'
        
        # verify the ALB status is "active"
        $ ibmcloud is lb <infraID>-kubernetes-api-private --output json | jq -r '.provisioning_status'
        
      • (Optional step) Run these commands from a new machine (VSI) provisioned on the same subnet as one of the failed machines. The machine should also have the same security groups applied to it. Confirm traffic through the Private Control Plane ALB is reaching the Machine Config Server (MCS) with no failures.
        # apiServerInternalURI can be found from the Infrastructure resource via:
        $ oc get infrastructure/cluster -ojson | jq -r '.status.apiServerInternalURI'
        
        # replace the API server port (6443) in apiServerInternalURI with the MCS port (22623);
        # the resulting URL, noted below as mcsURI, has the form:
        # https://api-int.<cluster_id>.<domain_name>:22623
        $ curl --max-time 5 --connect-timeout 5 --retry 10 <mcsURI>/healthz
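
      The two checks above can be combined into a small guard script. A minimal sketch, shown against a sample (hypothetical, abridged) ALB JSON document rather than a live cluster; substitute the real `ibmcloud is lb` and `curl` calls when running it for real:

```shell
# Hypothetical, abridged sample of the output of
# `ibmcloud is lb <infraID>-kubernetes-api-private --output json`;
# replace with the live command when running against a real cluster.
lb_json='{"name": "example-kubernetes-api-private", "provisioning_status": "active"}'

# The ALB is healthy only when provisioning_status is "active".
status=$(printf '%s\n' "$lb_json" | jq -r '.provisioning_status')
if [ "$status" = "active" ]; then
    echo "ALB is active"
else
    echo "ALB not ready: $status" >&2
fi

# With the ALB active, the optional MCS probe from above would follow
# (commented out here, since it needs a VSI on the cluster subnet):
# curl --max-time 5 --connect-timeout 5 --retry 10 "${mcsURI}/healthz"
```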
        

      2. Identify the failed machine(s).

      $ oc get machine -n openshift-machine-api
      NAME                                    PHASE         TYPE       REGION    ZONE        AGE
      example-public-1-x4gpn-master-0         Running       bx2-4x16   us-east   us-east-1   23h
      example-public-1-x4gpn-master-1         Running       bx2-4x16   us-east   us-east-2   23h
      example-public-1-x4gpn-master-2         Running       bx2-4x16   us-east   us-east-3   23h
      example-public-1-x4gpn-worker-1-xqzzm   Running       bx2-4x16   us-east   us-east-1   22h
      example-public-1-x4gpn-worker-2-vg9w6   Provisioned   bx2-4x16   us-east   us-east-2   22h
      example-public-1-x4gpn-worker-3-2f7zd   Provisioned   bx2-4x16   us-east   us-east-3   22h
      

      3. Delete the failed machine(s).

      $ oc delete machine example-public-1-x4gpn-worker-2-vg9w6 -n openshift-machine-api
      $ oc delete machine example-public-1-x4gpn-worker-3-2f7zd -n openshift-machine-api
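
      Rather than copying machine names by hand, the machines stuck in the `Provisioned` phase can be selected and deleted in one pass. A minimal sketch, demonstrated on the sample listing from step 2; the `echo` keeps it a dry run, drop it to actually delete:

```shell
# Sample `oc get machine -n openshift-machine-api --no-headers` output, taken
# from the listing in step 2; in practice pipe the live command output instead.
machines='example-public-1-x4gpn-worker-1-xqzzm   Running       bx2-4x16   us-east   us-east-1   22h
example-public-1-x4gpn-worker-2-vg9w6   Provisioned   bx2-4x16   us-east   us-east-2   22h
example-public-1-x4gpn-worker-3-2f7zd   Provisioned   bx2-4x16   us-east   us-east-3   22h'

# Select the machines stuck in the Provisioned phase (column 2) and build the
# corresponding delete commands.
printf '%s\n' "$machines" | awk '$2 == "Provisioned" { print $1 }' \
  | while read -r name; do
      echo oc delete machine "$name" -n openshift-machine-api
    done
```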
      

      4. Verify the status of the replacement machine(s), allowing 5-10 minutes for them to reach the `Running` phase.

      $ oc get machine -n openshift-machine-api
      NAME                                    PHASE     TYPE       REGION    ZONE        AGE
      example-public-1-x4gpn-master-0         Running   bx2-4x16   us-east   us-east-1   23h
      example-public-1-x4gpn-master-1         Running   bx2-4x16   us-east   us-east-2   23h
      example-public-1-x4gpn-master-2         Running   bx2-4x16   us-east   us-east-3   23h
      example-public-1-x4gpn-worker-1-xqzzm   Running   bx2-4x16   us-east   us-east-1   23h
      example-public-1-x4gpn-worker-2-mnlsz   Running   bx2-4x16   us-east   us-east-2   8m2s
      example-public-1-x4gpn-worker-3-7nz4q   Running   bx2-4x16   us-east   us-east-3   7m24s
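
      The wait can also be scripted as a simple poll. A minimal sketch; `get_machines` is a hypothetical stub standing in for `oc get machine -n openshift-machine-api --no-headers`, and the loop bounds are illustrative:

```shell
# Stub for `oc get machine -n openshift-machine-api --no-headers`; replace the
# function body with the real command when running against a live cluster.
get_machines() {
    printf '%s\n' 'example-public-1-x4gpn-worker-2-mnlsz   Running   bx2-4x16   us-east   us-east-2   8m2s'
}

# Poll until no machine reports the Provisioned phase.
for attempt in 1 2 3; do
    stuck=$(get_machines | awk '$2 == "Provisioned"' | wc -l)
    if [ "$stuck" -eq 0 ]; then
        echo "all machines Running"
        break
    fi
    sleep 60   # replacement machines typically take 5-10 minutes
done
```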
      

      5. Run the installer again to complete the installation. This ensures the cluster's kubeconfig is properly initialized.

      $ ./openshift-install wait-for install-complete
      

              Mike Pytlak (rhn-support-mpytlak)
              May Xu