-
Bug
-
Resolution: Duplicate
-
Undefined
-
None
-
4.12
-
None
-
None
-
False
-
Description of problem:
See https://issues.redhat.com/browse/OCPBUGS-3367 for more reference.
While installing on m6g.metal bare-metal instance types from AWS, multiple operators are reported as degraded during install. Even though the installer exits with the error below, the cluster eventually comes up fine after a couple of hours and passes a basic health check. It seems that we need to increase the bootstrap/install timeout for AWS bare-metal instance types as well, for IPI/UPI, similar to the baremetal-install command. The baremetal-install command currently has a timeout of 60 min, but it does not accommodate AWS. We couldn't try the wait-for install-complete command because the cluster did not complete bootstrap before the installer exited, and I was told by the installer QE team that we do not have a way to update our scripts to specify this time during bootstrap/install.
This appears to be easily reproducible. I tried this twice and ran into similar issues both times.
Error from install log:
11-07 17:27:37.103 level=info msg=Cluster operator insights SCAAvailable is False with Forbidden: Failed to pull SCA certs from https://api.openshift.com/api/accounts_mgmt/v1/certificates: OCM API https://api.openshift.com/api/accounts_mgmt/v1/certificates returned HTTP 403: {"code":"ACCT-MGMT-11","href":"/api/accounts_mgmt/v1/errors/11","id":"11","kind":"Error","operation_id":"afc3ae85-2ee6-450b-9300-21c92718ade0","reason":"Account with ID 1V6IJfkUxJwJq1N5Z0k0aWL3AhR denied access to perform create on Certificate with HTTP call POST /api/accounts_mgmt/v1/certificates"}
11-07 17:27:37.103 level=info msg=Cluster operator insights UploadDegraded is True with NotAuthorized: Reporting was not allowed: your Red Hat account is not enabled for remote support or your token has expired: UHC services authentication failed
11-07 17:27:37.103 level=info
11-07 17:27:37.103 level=error msg=Cluster operator kube-controller-manager Degraded is True with GarbageCollector_Error: GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp: lookup thanos-querier.openshift-monitoring.svc on 172.30.0.10:53: no such host
11-07 17:27:37.104 level=info msg=Cluster operator machine-api Progressing is True with SyncingResources: Progressing towards operator: 4.12.0-0.nightly-arm64-2022-11-06-054834
11-07 17:27:37.104 level=error msg=Cluster operator machine-api Degraded is True with SyncingFailed: Failed when progressing towards operator: 4.12.0-0.nightly-arm64-2022-11-06-054834 because minimum worker replica count (2) not yet met: current running replicas 1, waiting for [sv-m6g-bm-trial2-ccz68-worker-us-east-2b-jn89g sv-m6g-bm-trial2-ccz68-worker-us-east-2c-c5vsb]
11-07 17:27:37.104 level=error msg=Cluster operator machine-api Available is False with Initializing: Operator is initializing
11-07 17:27:37.104 level=error msg=Cluster operator monitoring Available is False with UpdatingPrometheusOperatorFailed: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: got 1 unavailable replicas
11-07 17:27:37.104 level=error msg=Cluster operator monitoring Degraded is True with UpdatingPrometheusOperatorFailed: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: got 1 unavailable replicas
11-07 17:27:37.104 level=info msg=Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
11-07 17:27:37.104 level=info msg=Cluster operator network ManagementStateDegraded is False with :
11-07 17:27:37.104 level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
11-07 17:27:37.104 level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
11-07 17:27:37.104 level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
11-07 17:27:37.104 level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation
11-07 17:27:37.104 level=error msg=failed to initialize the cluster: Cluster operators machine-api, monitoring are not available
11-07 17:27:37.105 [ERROR] Installation failed with error code '6'. Aborting execution.
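For triage, it can help to reduce a log like the one above to the operators that were reported at error level when the installer gave up. The following is a minimal sketch, not part of the installer: the function name `failing_operators` and the regular expression are illustrative assumptions based only on the line format seen in this log.

```python
import re

# Assumed line shape, taken from the log in this report:
#   level=<level> msg=Cluster operator <name> <Condition> is <True|False> with <Reason>: <message>
OPERATOR_ENTRY = re.compile(
    r"level=(?P<level>\w+) msg=Cluster operator (?P<name>\S+) "
    r"(?P<condition>\S+) is (?P<status>True|False) with (?P<reason>[^:]*):"
)


def failing_operators(log_text):
    """Map operator name -> list of (condition, status, reason) for error-level entries."""
    result = {}
    # finditer works whether entries are on separate lines or fused into one line.
    for m in OPERATOR_ENTRY.finditer(log_text):
        if m.group("level") != "error":
            continue
        result.setdefault(m.group("name"), []).append(
            (m.group("condition"), m.group("status"), m.group("reason"))
        )
    return result


# Three entries copied from the log in this report.
sample = (
    "11-07 17:27:37.104 level=error msg=Cluster operator machine-api Available is False "
    "with Initializing: Operator is initializing\n"
    "11-07 17:27:37.104 level=info msg=Cluster operator monitoring Progressing is True "
    "with RollOutInProgress: Rolling out the stack.\n"
    "11-07 17:27:37.104 level=info msg=Cluster operator network ManagementStateDegraded "
    "is False with :\n"
)

if __name__ == "__main__":
    for name, conditions in failing_operators(sample).items():
        print(name, conditions)
```

Run against the full log in this report, a filter of this kind would be expected to single out kube-controller-manager, machine-api and monitoring, which matches the installer's final "machine-api, monitoring are not available" message for the Available condition.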
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-arm64-2022-11-06-054834
How reproducible:
Create a cluster using m6gd.metal for master nodes and m6g.metal for worker nodes, and note the errors reported during install as the installation fails.
Steps to Reproduce:
1. 2. 3.
Actual results:
Bootstrapping and installation exit and are reported as a failure.
Expected results:
Bootstrapping and installation should complete successfully with no errors.
Additional info: