Type: Bug
Resolution: Done
Priority: Undefined
Fix Version/s: 4.12
Affects Version/s: 4.12, 4.11
Component/s: Documentation
Labels:
- ibmcloud

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:
None
Story Points:
2
Severity:
None
Regression:
None

Target Backport Versions:
None
Target Version:

4.12, 4.11
Release Blocker:
None
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

There is a known issue that causes intermittent provisioning failures when installing a cluster on IBM Cloud VPC. The Verification step in "Deploying the cluster" needs to be updated to describe how to recover nodes that fail to provision during the installation process. IBM (Jeff Nowicki) to provide steps to manually recover the cluster.

Proposed Mitigation Documentation:

A machine provisioning issue - https://issues.redhat.com/browse/OCPBUGS-1327 - was reported against IBM Cloud VPC. A fix to mitigate the issue in this release has been delivered. This NetworkManager enhancement - https://github.com/dracutdevs/dracut/commit/112f03f9e225a790cbc6378c70773c6af5e7ee34 - in RHEL9 will address the root cause of the provisioning issue. It will be included in a future OpenShift release.

During initial cluster installation, the install may fail because compute machines appear to be 'stuck' in a `Provisioned` status. If you also check the status of the corresponding IBM Cloud VPC virtual servers, it should show `Running`.

If you encounter this situation, the following steps should recover the 'stuck' machines and allow the cluster installation to complete. Perform the recovery actions from the host that was used for the initial installation.

1. Verify that the IBM Cloud VPC private control plane application load balancer (ALB) is active and operating as required.

Check the status of your cluster's private control plane ALB and verify that it is `active`.

# infraID can be found from the Infrastructure resource via:
$ oc get infrastructure/cluster -ojson | jq -r '.status.infrastructureName'

# verify the ALB status is "active"
$ ibmcloud is lb <infraID>-kubernetes-api-private --output json | jq -r '.provisioning_status'

(Optional step) Run these commands from a new machine (VSI) provisioned on the same subnet as one of the failed machines. The machine should also have the same security groups applied to it. Confirm traffic through the Private Control Plane ALB is reaching the Machine Config Server (MCS) with no failures.

# apiServerInternalURI can be found from the Infrastructure resource via:
$ oc get infrastructure/cluster -ojson | jq -r '.status.apiServerInternalURI'

# replace the API server port (6443) with the MCS port (22623); the result is referred to as mcsURI, for example:
# https://api-int.<cluster_id>.<domain_name>:22623
$ curl --max-time 5 --connect-timeout 5 --retry 10 <mcsURI>/healthz
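
The port substitution described in the comments above can be scripted rather than edited by hand. The following is a minimal sketch and not part of the original procedure: the function name `to_mcs_uri` is hypothetical, the URI in the example is a placeholder, and the sketch assumes the internal API URI ends in `:6443` as noted above.

```shell
# Hypothetical helper: turn the internal API server URI into the MCS URI
# by replacing the API server port (6443) with the MCS port (22623).
to_mcs_uri() {
  api_uri="$1"
  # Strip a trailing ":6443" if present, then append the MCS port.
  printf '%s:22623\n' "${api_uri%:6443}"
}

# Example with a placeholder URI (not a real cluster):
to_mcs_uri "https://api-int.example.test:6443"

# With a live cluster, the input could come from the command shown earlier:
#   mcsURI=$(to_mcs_uri "$(oc get infrastructure/cluster -ojson | jq -r '.status.apiServerInternalURI')")
```

If the URI does not end in `:6443`, the sketch simply appends `:22623` to whatever was passed in, so the result should be checked before use.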

2. Identify the failed machine(s).

$ oc get machine -n openshift-machine-api
NAME                                    PHASE         TYPE       REGION    ZONE        AGE
example-public-1-x4gpn-master-0         Running       bx2-4x16   us-east   us-east-1   23h
example-public-1-x4gpn-master-1         Running       bx2-4x16   us-east   us-east-2   23h
example-public-1-x4gpn-master-2         Running       bx2-4x16   us-east   us-east-3   23h
example-public-1-x4gpn-worker-1-xqzzm   Running       bx2-4x16   us-east   us-east-1   22h
example-public-1-x4gpn-worker-2-vg9w6   Provisioned   bx2-4x16   us-east   us-east-2   22h
example-public-1-x4gpn-worker-3-2f7zd   Provisioned   bx2-4x16   us-east   us-east-3   22h
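
On clusters with many machines, the `Provisioned` entries in a listing like the one above can be picked out with a small filter. This is a sketch only: the function name `stuck_machines` is hypothetical, and it assumes the default table layout shown above, with the machine name in the first column and a non-empty phase in the second.

```shell
# Hypothetical helper: read `oc get machine` table output on stdin and print
# the names of machines whose PHASE column is "Provisioned".
stuck_machines() {
  # NR > 1 skips the header row; $1 is NAME and $2 is PHASE.
  awk 'NR > 1 && $2 == "Provisioned" { print $1 }'
}

# Example using part of the listing above as sample input:
stuck_machines <<'EOF'
NAME                                    PHASE         TYPE       REGION    ZONE        AGE
example-public-1-x4gpn-master-0         Running       bx2-4x16   us-east   us-east-1   23h
example-public-1-x4gpn-worker-1-xqzzm   Running       bx2-4x16   us-east   us-east-1   22h
example-public-1-x4gpn-worker-2-vg9w6   Provisioned   bx2-4x16   us-east   us-east-2   22h
example-public-1-x4gpn-worker-3-2f7zd   Provisioned   bx2-4x16   us-east   us-east-3   22h
EOF

# With a live cluster:
#   oc get machine -n openshift-machine-api | stuck_machines
```

A machine also passes through `Provisioned` briefly during normal provisioning, so a match is only a candidate; the AGE column and the virtual server status still need to be checked before treating it as failed.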

3. Delete the failed machine(s).

$ oc delete machine example-public-1-x4gpn-worker-2-vg9w6 -n openshift-machine-api
$ oc delete machine example-public-1-x4gpn-worker-3-2f7zd -n openshift-machine-api

4. Verify the status of the replacement machine(s) (allow 5-10 minutes for them to progress to `Running` status).

$ oc get machine -n openshift-machine-api
NAME                                    PHASE     TYPE       REGION    ZONE        AGE
example-public-1-x4gpn-master-0         Running   bx2-4x16   us-east   us-east-1   23h
example-public-1-x4gpn-master-1         Running   bx2-4x16   us-east   us-east-2   23h
example-public-1-x4gpn-master-2         Running   bx2-4x16   us-east   us-east-3   23h
example-public-1-x4gpn-worker-1-xqzzm   Running   bx2-4x16   us-east   us-east-1   23h
example-public-1-x4gpn-worker-2-mnlsz   Running   bx2-4x16   us-east   us-east-2   8m2s
example-public-1-x4gpn-worker-3-7nz4q   Running   bx2-4x16   us-east   us-east-3   7m24s
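
The check against a listing like the one above can also be automated before moving on. The sketch below is an illustration under assumptions, not part of the original procedure: the function name `all_machines_running` is hypothetical, and it relies on the same default column layout (NAME first, PHASE second).

```shell
# Hypothetical helper: read `oc get machine` table output on stdin and
# succeed (exit status 0) only if every listed machine is in the "Running" phase.
all_machines_running() {
  awk 'BEGIN { bad = 0 }
       NR > 1 && $2 != "Running" { bad = 1 }   # any other phase marks a failure
       END { exit bad }'
}

# With a live cluster:
#   if oc get machine -n openshift-machine-api | all_machines_running; then
#     echo "all machines are Running"
#   fi
```

The sketch does not verify that the expected number of machines exists; a listing with only a header row also exits with status 0.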

5. Run the installer again to complete the installation. This will ensure the cluster's kubeconfig is properly initialized.

$ ./openshift-install wait-for install-complete

is related to

OCPBUGS-1327 [IBMCloud] Worker machines unreachable during initial bring up

Closed

links to

OCPBUGS#2892: Added known issue for IBM Cloud VPC

Assignee: Mike Pytlak (Inactive)

Reporter: Mike Pytlak (Inactive)

Need Info From: None

Contributors: None

QA Contact: May Xu

Doc Contact: None

Votes: 0

Watchers: 3

Created: 2022/10/26 7:23 PM

Updated: 2025/07/28 11:43 PM

Resolved: 2022/12/06 8:20 PM
