Bug
Resolution: Done
Undefined
4.12, 4.11
Description of problem:
A known issue causes intermittent machine provisioning failures when installing a cluster on IBM Cloud VPC. The Verification step in "Deploying the cluster" needs to be updated to describe how to recover nodes that fail to provision during installation. IBM (Jeff Nowicki) to provide the steps to manually recover the cluster.
Proposed Mitigation Documentation:
A machine provisioning issue (https://issues.redhat.com/browse/OCPBUGS-1327) was reported against IBM Cloud VPC. A fix that mitigates the issue has been delivered in this release. The root cause will be addressed by a NetworkManager enhancement in RHEL 9 (https://github.com/dracutdevs/dracut/commit/112f03f9e225a790cbc6378c70773c6af5e7ee34), which will be included in a future OpenShift release.
During initial cluster installation, the installation may fail because compute machines appear to be stuck in the `Provisioned` phase. If you check the corresponding IBM Cloud VPC virtual server, its status should be `Running`.
If you encounter this situation, the following steps should recover the stuck machines and allow the cluster installation to complete. Perform the recovery actions from the host where the initial installation was run.
1. Verify that the IBM Cloud VPC private control plane application load balancer (ALB) is active and operating as required.
- Check the status of your cluster's Private Control Plane ALB and verify that it is `active`.
    # infraID can be found from the Infrastructure resource via:
    $ oc get infrastructure/cluster -ojson | jq -r '.status.infrastructureName'
    # verify the ALB status is "active"
    $ ibmcloud is lb <infraID>-kubernetes-api-private --output json | jq -r '.provisioning_status'
- (Optional step) Run these commands from a new machine (VSI) provisioned on the same subnet as one of the failed machines. The machine should also have the same security groups applied to it. Confirm traffic through the Private Control Plane ALB is reaching the Machine Config Server (MCS) with no failures.
    # apiServerInternalURI can be found from the Infrastructure resource via:
    $ oc get infrastructure/cluster -ojson | jq -r '.status.apiServerInternalURI'
    # drop the APIServer port (6443), as the MCS uses port 22623 for traffic (which is shown below), noted as mcsURI:
    # https://api-int.<cluster_id>.<domain_name>:22623
    $ curl --max-time 5 --connect-timeout 5 --retry 10 <mcsURI>/healthz
2. Identify the failed machine(s).
    $ oc get machine -n openshift-machine-api
    NAME                                    PHASE         TYPE       REGION    ZONE        AGE
    example-public-1-x4gpn-master-0         Running       bx2-4x16   us-east   us-east-1   23h
    example-public-1-x4gpn-master-1         Running       bx2-4x16   us-east   us-east-2   23h
    example-public-1-x4gpn-master-2         Running       bx2-4x16   us-east   us-east-3   23h
    example-public-1-x4gpn-worker-1-xqzzm   Running       bx2-4x16   us-east   us-east-1   22h
    example-public-1-x4gpn-worker-2-vg9w6   Provisioned   bx2-4x16   us-east   us-east-2   22h
    example-public-1-x4gpn-worker-3-2f7zd   Provisioned   bx2-4x16   us-east   us-east-3   22h
3. Delete the failed machine(s).
    $ oc delete machine example-public-1-x4gpn-worker-2-vg9w6 -n openshift-machine-api
    $ oc delete machine example-public-1-x4gpn-worker-3-2f7zd -n openshift-machine-api
4. Verify the status of the replacement machine(s). Allow 5-10 minutes for the replacement machine(s) to progress to the `Running` phase.
    $ oc get machine -n openshift-machine-api
    NAME                                    PHASE     TYPE       REGION    ZONE        AGE
    example-public-1-x4gpn-master-0         Running   bx2-4x16   us-east   us-east-1   23h
    example-public-1-x4gpn-master-1         Running   bx2-4x16   us-east   us-east-2   23h
    example-public-1-x4gpn-master-2         Running   bx2-4x16   us-east   us-east-3   23h
    example-public-1-x4gpn-worker-1-xqzzm   Running   bx2-4x16   us-east   us-east-1   23h
    example-public-1-x4gpn-worker-2-mnlsz   Running   bx2-4x16   us-east   us-east-2   8m2s
    example-public-1-x4gpn-worker-3-7nz4q   Running   bx2-4x16   us-east   us-east-3   7m24s
5. Run the installer again to complete the installation. This will ensure the cluster's kubeconfig is properly initialized.
$ ./openshift-install wait-for install-complete
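The manual steps above can be sketched as a few small shell helpers. This is only an illustration under stated assumptions, not part of the delivered fix: the function names (`mcs_uri`, `stuck_machines`, `delete_machines`, `all_running`) and the `OC` override variable are hypothetical, and the helpers assume the default column layout of `oc get machine` shown in the examples (NAME first, PHASE second, one header row).

```shell
#!/usr/bin/env bash
# Hypothetical helpers sketching the recovery steps. All function names and
# the OC override variable are illustrative assumptions, not product tooling.

NAMESPACE="openshift-machine-api"

# Step 1 (optional check): derive the MCS URI from apiServerInternalURI by
# swapping the API server port (6443) for the MCS port (22623).
mcs_uri() {
  printf '%s\n' "$1" | sed 's/:6443$/:22623/'
}

# Step 2: read `oc get machine` output on stdin and print the names of
# machines whose PHASE column is "Provisioned". The header row is skipped.
stuck_machines() {
  awk 'NR > 1 && $2 == "Provisioned" { print $1 }'
}

# Step 3: delete each machine named on stdin (one name per line).
# Set OC="echo oc" to print the commands instead of running them.
delete_machines() {
  while IFS= read -r name; do
    if [ -n "$name" ]; then
      ${OC:-oc} delete machine "$name" -n "$NAMESPACE"
    fi
  done
}

# Step 4: read `oc get machine` output on stdin and succeed only if every
# machine row reports PHASE "Running". Empty input also succeeds, so check
# separately that the `oc` command itself worked.
all_running() {
  awk 'NR > 1 && $2 != "Running" { bad = 1 } END { exit (bad ? 1 : 0) }'
}

# Example usage (dry run first, then for real):
#   oc get machine -n "$NAMESPACE" | stuck_machines | OC="echo oc" delete_machines
#   oc get machine -n "$NAMESPACE" | stuck_machines | delete_machines
#   until oc get machine -n "$NAMESPACE" | all_running; do sleep 30; done
```

Running the dry-run form first shows exactly which `oc delete machine` commands would be issued, which may help avoid deleting a machine that is still progressing normally.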
Is related to: OCPBUGS-1327 [IBMCloud] Worker machines unreachable during initial bring up (Closed)