Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-3518

Increase timeout for bootstrap and install while installing on AWS ARM Baremetal m6g.metal instance types

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Duplicate
    • Icon: Undefined Undefined
    • None
    • 4.12
    • Multi-Arch / ARM
    • None
    • False
    • Hide

      None

      Show
      None

      Description of problem:

      https://issues.redhat.com/browse/OCPBUGS-3367 for more reference.

      While installing on m6g.metal baremetal instance types from AWS, multiple operators are reported as degraded during install. It turns out that even though the installer exits with following error, eventually it comes up fine after couple of hours and passes basic health check.
      
      It seems that we need to increase bootstrap/install time out for AWS Baremetal types as well for IPI/UPI similar to baremetal-install command.baremetal-install command currently has timeout of 60 min but it does not accommodate for AWS. 
      
      We couldn't try wait-for install-complete command as the cluster did not complete bootstrap before installer exited and I was told by installer QE team that we do not have a way to update our scripts to specify this time during bootstrap/install.
      
      This appears to be easily reproducible. I tried this twice and ran into similar issues both times. 
      
      
      Error from install log:
      11-07 17:27:37.103  level=info msg=Cluster operator insights SCAAvailable is False with Forbidden: Failed to pull SCA certs from https://api.openshift.com/api/accounts_mgmt/v1/certificates: OCM API https://api.openshift.com/api/accounts_mgmt/v1/certificates returned HTTP 403: {"code":"ACCT-MGMT-11","href":"/api/accounts_mgmt/v1/errors/11","id":"11","kind":"Error","operation_id":"afc3ae85-2ee6-450b-9300-21c92718ade0","reason":"Account with ID 1V6IJfkUxJwJq1N5Z0k0aWL3AhR denied access to perform create on Certificate with HTTP call POST /api/accounts_mgmt/v1/certificates"}
      11-07 17:27:37.103  level=info msg=Cluster operator insights UploadDegraded is True with NotAuthorized: Reporting was not allowed: your Red Hat account is not enabled for remote support or your token has expired: UHC services authentication failed
      11-07 17:27:37.103  level=info
      11-07 17:27:37.103  level=error msg=Cluster operator kube-controller-manager Degraded is True with GarbageCollector_Error: GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp: lookup thanos-querier.openshift-monitoring.svc on 172.30.0.10:53: no such host
      11-07 17:27:37.104  level=info msg=Cluster operator machine-api Progressing is True with SyncingResources: Progressing towards operator: 4.12.0-0.nightly-arm64-2022-11-06-054834
      11-07 17:27:37.104  level=error msg=Cluster operator machine-api Degraded is True with SyncingFailed: Failed when progressing towards operator: 4.12.0-0.nightly-arm64-2022-11-06-054834 because minimum worker replica count (2) not yet met: current running replicas 1, waiting for [sv-m6g-bm-trial2-ccz68-worker-us-east-2b-jn89g sv-m6g-bm-trial2-ccz68-worker-us-east-2c-c5vsb]
      11-07 17:27:37.104  level=error msg=Cluster operator machine-api Available is False with Initializing: Operator is initializing
      11-07 17:27:37.104  level=error msg=Cluster operator monitoring Available is False with UpdatingPrometheusOperatorFailed: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: got 1 unavailable replicas
      11-07 17:27:37.104  level=error msg=Cluster operator monitoring Degraded is True with UpdatingPrometheusOperatorFailed: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: got 1 unavailable replicas
      11-07 17:27:37.104  level=info msg=Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
      11-07 17:27:37.104  level=info msg=Cluster operator network ManagementStateDegraded is False with : 
      11-07 17:27:37.104  level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
      11-07 17:27:37.104  level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
      11-07 17:27:37.104  level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
      11-07 17:27:37.104  level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation
      11-07 17:27:37.104  level=error msg=failed to initialize the cluster: Cluster operators machine-api, monitoring are not available
      11-07 17:27:37.105  [ERROR] Installation failed with error code '6'. Aborting execution.
      
      

      Version-Release number of selected component (if applicable):

      4.12.0-0.nightly-arm64-2022-11-06-054834

      How reproducible:

      Create a cluster using m6gd.metal for master nodes and m6g.metal for worker nodes and notice the errors reported during install as installation fails.

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

      Bootstrapping and installation exits and reported as failure.

      Expected results:

      Bootstrapping and installation should be executed successfully with no errors.

      Additional info:

       

            Unassigned Unassigned
            svetsa@redhat.com Sharada Vetsa
            Pedro Jose Amoedo Martinez Pedro Jose Amoedo Martinez
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

              Created:
              Updated:
              Resolved: