[OCPBUGS-3518] Increase timeout for bootstrap and install while installing on AWS ARM Baremetal m6g.metal instance types - Red Hat Issue Tracker

Type: Bug
Resolution: Duplicate
Priority: Undefined
Fix Version/s: None
Affects Version/s: 4.12
Component/s: Multi-Arch / ARM
Labels: None
Regression: None
Blocked: False

Description of problem:

See https://issues.redhat.com/browse/OCPBUGS-3367 for more context.

While installing on AWS m6g.metal bare-metal instance types, multiple operators are reported as degraded during install. Although the installer exits with the error below, the cluster eventually comes up fine after a couple of hours and passes a basic health check.

It seems that we need to increase the bootstrap/install timeout for AWS bare-metal instance types as well, for both IPI and UPI, similar to the baremetal-install command. The baremetal-install command currently has a timeout of 60 min, but it does not cover AWS.

We couldn't try the wait-for install-complete command because the cluster did not complete bootstrap before the installer exited, and the installer QE team told me that we do not have a way to update our scripts to specify this timeout during bootstrap/install.
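
As a possible stopgap until the timeout is raised, a wrapper script could re-run the installer's wait-for subcommands after the first timeout. The sketch below is only an illustration: the function name wait_for_cluster, the attempts parameter, and the injectable run argument are hypothetical, and this report does not confirm that retrying is enough on m6g.metal.

```python
import subprocess


def wait_for_cluster(install_dir, attempts=3, run=subprocess.run):
    """Re-run 'openshift-install wait-for' for each stage until it succeeds.

    install_dir is the installer asset directory passed via --dir.
    attempts and run are illustrative parameters, not installer options.
    Returns True if both stages eventually succeed, False otherwise.
    """
    for stage in ("bootstrap-complete", "install-complete"):
        for _ in range(attempts):
            cmd = ["openshift-install", "wait-for", stage, "--dir", install_dir]
            result = run(cmd)
            if result.returncode == 0:
                break  # this stage finished, move on to the next one
        else:
            # Every attempt for this stage timed out or failed.
            return False
    return True
```

Each retry restarts the subcommand's own built-in wait, so the total time allowed is roughly attempts times the installer's timeout for that stage.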

This appears to be easily reproducible: I tried twice and ran into similar issues both times.


Error from install log:
11-07 17:27:37.103  level=info msg=Cluster operator insights SCAAvailable is False with Forbidden: Failed to pull SCA certs from https://api.openshift.com/api/accounts_mgmt/v1/certificates: OCM API https://api.openshift.com/api/accounts_mgmt/v1/certificates returned HTTP 403: {"code":"ACCT-MGMT-11","href":"/api/accounts_mgmt/v1/errors/11","id":"11","kind":"Error","operation_id":"afc3ae85-2ee6-450b-9300-21c92718ade0","reason":"Account with ID 1V6IJfkUxJwJq1N5Z0k0aWL3AhR denied access to perform create on Certificate with HTTP call POST /api/accounts_mgmt/v1/certificates"}
11-07 17:27:37.103  level=info msg=Cluster operator insights UploadDegraded is True with NotAuthorized: Reporting was not allowed: your Red Hat account is not enabled for remote support or your token has expired: UHC services authentication failed
11-07 17:27:37.103  level=info
11-07 17:27:37.103  level=error msg=Cluster operator kube-controller-manager Degraded is True with GarbageCollector_Error: GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp: lookup thanos-querier.openshift-monitoring.svc on 172.30.0.10:53: no such host
11-07 17:27:37.104  level=info msg=Cluster operator machine-api Progressing is True with SyncingResources: Progressing towards operator: 4.12.0-0.nightly-arm64-2022-11-06-054834
11-07 17:27:37.104  level=error msg=Cluster operator machine-api Degraded is True with SyncingFailed: Failed when progressing towards operator: 4.12.0-0.nightly-arm64-2022-11-06-054834 because minimum worker replica count (2) not yet met: current running replicas 1, waiting for [sv-m6g-bm-trial2-ccz68-worker-us-east-2b-jn89g sv-m6g-bm-trial2-ccz68-worker-us-east-2c-c5vsb]
11-07 17:27:37.104  level=error msg=Cluster operator machine-api Available is False with Initializing: Operator is initializing
11-07 17:27:37.104  level=error msg=Cluster operator monitoring Available is False with UpdatingPrometheusOperatorFailed: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: got 1 unavailable replicas
11-07 17:27:37.104  level=error msg=Cluster operator monitoring Degraded is True with UpdatingPrometheusOperatorFailed: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: got 1 unavailable replicas
11-07 17:27:37.104  level=info msg=Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
11-07 17:27:37.104  level=info msg=Cluster operator network ManagementStateDegraded is False with : 
11-07 17:27:37.104  level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
11-07 17:27:37.104  level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
11-07 17:27:37.104  level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
11-07 17:27:37.104  level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation
11-07 17:27:37.104  level=error msg=failed to initialize the cluster: Cluster operators machine-api, monitoring are not available
11-07 17:27:37.105  [ERROR] Installation failed with error code '6'. Aborting execution.
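
To make a log like the one above easier to triage, the error-level operator conditions can be pulled out with a short script. This is a minimal sketch that assumes the "Cluster operator <name> <condition> is <True|False> with <reason>:" line format seen in this log; the names LINE_RE and summarize_operator_conditions are hypothetical and not part of the installer.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   level=error msg=Cluster operator machine-api Available is False with Initializing: ...
LINE_RE = re.compile(
    r"level=(?P<level>\w+) msg=Cluster operator (?P<operator>\S+) "
    r"(?P<condition>\S+) is (?P<status>True|False) with (?P<reason>[^:]*):"
)


def summarize_operator_conditions(log_text):
    """Group error-level operator conditions by operator name.

    Returns a dict mapping operator -> list of (condition, status, reason).
    Lines logged at other levels (for example info) are ignored.
    """
    summary = defaultdict(list)
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match and match.group("level") == "error":
            summary[match.group("operator")].append(
                (match.group("condition"), match.group("status"), match.group("reason"))
            )
    return dict(summary)
```

Run against the excerpt above, this would group the error-level entries under kube-controller-manager, machine-api, and monitoring, which makes it quicker to see which operators were still unavailable when the installer exited.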

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-arm64-2022-11-06-054834

How reproducible:

Create a cluster using m6gd.metal for master nodes and m6g.metal for worker nodes, and note the errors reported when the installation fails.

Steps to Reproduce:

1.
2.
3.

Actual results:

Bootstrapping and installation exit and are reported as a failure.

Expected results:

Bootstrapping and installation should complete successfully with no errors.

Additional info:

log-bundle-20221107182214.tar.gz (9.16 MB, 2022/11/11 1:52 AM)
openshift_install.rtf (544 kB, 2022/11/11 1:52 AM)

Assignee: Unassigned

Reporter: Sharada Vetsa

QA Contact: Pedro Jose Amoedo Martinez

Votes: 0

Watchers: 4

Created: 2022/11/11 1:46 AM

Updated: 2022/11/11 3:34 PM

Resolved: 2022/11/11 3:34 PM
