OCPBUGS-10855: Cluster install fails on Dell R740 server

      Description of problem:

      OCP 4.13 (RHEL 9.2 / CentOS Stream 9 based) nightly deployments are not succeeding on Dell R740 servers. The same deployment configuration works fine on OCP 4.12 (RHEL 8.6 based).
      
      Hybrid cluster nodes: 3x virtual masters, 1x virtual worker, 1x baremetal worker.
      
      All nodes except the baremetal worker (cnfdc8-worker-1) were rebooted and booted from the installed operating system. Node worker-1 fails to boot from the RAID controller (see attached screenshot).
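      
      One way to narrow down whether RHCOS was written to the RAID virtual disk the controller actually boots from is to bring the host back up in the agent ISO (or any live environment) and inspect its disks and UEFI boot entries. These are generic diagnostics, not output captured for this report:
      
      # lsblk -o NAME,SIZE,MODEL,SERIAL    # which disk received the installation
      # efibootmgr -v                      # whether a boot entry points at the RAID virtual disk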
      
      Also attached: must-gather and full output of 'openshift-install agent wait-for install-complete'.
      
      # oc get nodes
      NAME              STATUS   ROLES                  AGE     VERSION
      cnfdc8-master-0   Ready    control-plane,master   5h5m    v1.26.2+dc93b13
      cnfdc8-master-1   Ready    control-plane,master   5h34m   v1.26.2+dc93b13
      cnfdc8-master-2   Ready    control-plane,master   5h34m   v1.26.2+dc93b13
      cnfdc8-worker-0   Ready    worker                 5h13m   v1.26.2+dc93b13
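      
      The bare-metal worker never appears in the node list above. Assuming the installer-generated kubeconfig is in use, the following commands (illustrative, not taken from the report) help show whether the host is stuck before or after trying to join the cluster:
      
      # oc get machines -n openshift-machine-api    # phase of the Machine the machine-api operator is waiting for
      # oc get csr                                   # a host that boots but cannot join usually leaves Pending CSRs
      # oc adm certificate approve <csr-name>        # only if Pending CSRs turn out to be the blocker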

      Version-Release number of selected component (if applicable):

      4.13.0-0.nightly-2023-03-23-204038

      How reproducible:

       

      Steps to Reproduce:

      1. 
      2.
      3.
      

      Actual results:

      /root/bin/openshift-install agent wait-for install-complete --log-level debug --dir cnfdc8
      
      [...]
      DEBUG Still waiting for the cluster to initialize: Working towards 4.13.0-0.nightly-2023-03-23-204038: 617 of 840 done (73% complete)
      DEBUG Still waiting for the cluster to initialize: Cluster operators authentication, console, kube-apiserver, machine-api, monitoring are not available
      DEBUG Still waiting for the cluster to initialize: Cluster operators authentication, console, kube-apiserver, machine-api, monitoring are not available
      DEBUG Still waiting for the cluster to initialize: Cluster operators authentication, kube-apiserver, machine-api, monitoring are not available
      DEBUG Still waiting for the cluster to initialize: Cluster operators authentication, machine-api, monitoring are not available
      DEBUG Still waiting for the cluster to initialize: Cluster operators machine-api, monitoring are not available
      ERROR Cluster installation timed out: context deadline exceeded
      INFO Cluster operator baremetal Disabled is False with :
      INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected
      INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected
      INFO Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected
      INFO Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected
      ERROR Cluster operator cluster-autoscaler Degraded is True with MissingDependency: machine-api not ready
      INFO Cluster operator etcd RecentBackup is Unknown with ControllerStarted: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required
      INFO Cluster operator ingress Progressing is True with Reconciling: ingresscontroller "default" is progressing: IngressControllerProgressing: One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 1 of 2 updated replica(s) are available...
      INFO ).
      ERROR Cluster operator ingress Degraded is True with IngressDegraded: The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 1/2 of replicas are available)
      INFO Cluster operator ingress EvaluationConditionsDetected is False with AsExpected:
      INFO Cluster operator insights ClusterTransferAvailable is False with NoClusterTransfer: no available cluster transfer
      INFO Cluster operator insights Disabled is False with AsExpected:
      INFO Cluster operator insights SCAAvailable is True with Updated: SCA certs successfully updated in the etc-pki-entitlement secret
      ERROR Cluster operator kube-controller-manager Degraded is True with GarbageCollector_Error: GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp: lookup thanos-querier.openshift-monitoring.svc on 172.30.0.10:53: no such host
      INFO Cluster operator machine-api Progressing is True with SyncingResources: Progressing towards operator: 4.13.0-0.nightly-2023-03-23-204038
      ERROR Cluster operator machine-api Degraded is True with SyncingFailed: Failed when progressing towards operator: 4.13.0-0.nightly-2023-03-23-204038 because minimum worker replica count (2) not yet met: current running replicas 1, waiting for [cnfdc8-zbkmg-worker-0-44t59]
      INFO Cluster operator machine-api Available is False with Initializing: Operator is initializing
      INFO Cluster operator monitoring Available is False with UpdatingPrometheusOperatorFailed: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: got 1 unavailable replicas
      ERROR Cluster operator monitoring Degraded is True with UpdatingPrometheusOperatorFailed: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: got 1 unavailable replicas
      INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
      INFO Cluster operator network ManagementStateDegraded is False with :
      ERROR Cluster initialization failed because one or more operators are not functioning properly.
      ERROR                           The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
      ERROR                           https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
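      
      All of the failing operators above trace back to the missing second worker: ingress and monitoring cannot schedule their second replica, and machine-api never reaches its minimum worker replica count. A hedged set of follow-up checks, assuming the cluster is still reachable with the installer-generated kubeconfig:
      
      # export KUBECONFIG=cnfdc8/auth/kubeconfig
      # oc get clusteroperators
      # oc describe clusteroperator machine-api
      # oc -n openshift-ingress get pods -o wide
      # oc -n openshift-monitoring get pods -o wide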

      Expected results:

       

      Additional info:

       

        1. agent-config.yaml
          4 kB
        2. failed-install-1.tar.gz
          20.87 MB
        3. install-config.yaml
          2 kB
        4. ocp-install-agent-wait-for.txt
          98 kB
        5. Screenshot from 2023-03-24 11-31-07.png
          128 kB
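
      Since the symptom is the host failing to boot from the RAID controller, the rootDeviceHints entry for cnfdc8-worker-1 in the attached agent-config.yaml is worth double-checking; a wrong or missing hint can leave RHCOS installed on a disk the controller does not boot from. An illustrative way to pull that section out of the attachment:

      # grep -B2 -A5 rootDeviceHints agent-config.yaml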
