  2. OCPBUGS-10342

Installation fails if < 3 workers defined and number of compute replicas not set


      Description of problem:

      This may be something we want to either validate or document. It was initially found at a customer site, but I've also confirmed that it happens with a Compact config with no workers.
      
      They created an agent-config.yaml with 2 worker nodes but did not set the replicas in install-config.yaml, i.e. they did not set:
      compute:
      - hyperthreading: Enabled
        name: worker
        replicas: {{ num_workers }} 
      
      This resulted in an install failure because 3 worker replicas are created by default when the count is not defined:
      https://github.com/openshift/installer/blob/master/pkg/types/defaults/machinepools.go#L11
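
The defaulting behaviour described above can be illustrated with a minimal Python sketch. This is not the installer's code (the installer is written in Go). The function name `effective_worker_replicas` and the dict layout are assumptions chosen to mirror the `compute` section of install-config.yaml; the default of 3 is the value stated in this report.

```python
# Default worker replica count, as stated in this report.
DEFAULT_WORKER_REPLICAS = 3


def effective_worker_replicas(install_config: dict) -> int:
    """Return the worker replica count that applies to an install-config.

    Hypothetical helper: `install_config` is assumed to be the parsed
    install-config.yaml as a plain dict. If the compute section is missing,
    or the worker pool has no replicas set, the default is used.
    """
    for pool in install_config.get("compute") or []:
        if pool.get("name") == "worker" and pool.get("replicas") is not None:
            return int(pool["replicas"])
    return DEFAULT_WORKER_REPLICAS


# No compute section at all: the default applies.
print(effective_worker_replicas({}))
# Replicas set explicitly to match the 2 workers in agent-config.yaml.
print(effective_worker_replicas({"compute": [{"name": "worker", "replicas": 2}]}))
```

Under these assumptions, an install-config.yaml without a compute section yields 3 worker replicas even though agent-config.yaml lists only 2 workers.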
      
      See the attached console screenshot showing that the expected number of hosts doesn't match the actual.
      
      I've also reproduced this with a compact config. The install failed because start-cluster-installation.sh is waiting for 6 hosts.
      
      [core@master-0 ~]$ sudo systemctl status start-cluster-installation.service
      โ— start-cluster-installation.service - Service that starts cluster installation
         Loaded: loaded (/etc/systemd/system/start-cluster-installation.service; enabled; vendor preset: enabled)
         Active: activating (start) since Wed 2023-03-15 14:40:04 UTC; 3min 41s ago
       Main PID: 3365 (start-cluster-i)
          Tasks: 5 (limit: 101736)
         Memory: 1.7M
         CGroup: /system.slice/start-cluster-installation.service
                 โ”œโ”€3365 /bin/bash /usr/local/bin/start-cluster-installation.sh
                 โ”œโ”€5124 /bin/bash /usr/local/bin/start-cluster-installation.sh
                 โ”œโ”€5132 /bin/bash /usr/local/bin/start-cluster-installation.sh
                 โ””โ”€5138 diff /tmp/tmp.vIq1jH9Vf2 /etc/issue.d/90_start-install.issueMar 15 14:42:54 master-0 start-cluster-installation.sh[3365]: Waiting for hosts to become ready for cluster installation...
      Mar 15 14:43:04 master-0 start-cluster-installation.sh[4746]: Hosts known and ready for cluster installation (3/6)
      Mar 15 14:43:04 master-0 start-cluster-installation.sh[3365]: Waiting for hosts to become ready for cluster installation...
      Mar 15 14:43:15 master-0 start-cluster-installation.sh[4980]: Hosts known and ready for cluster installation (3/6)
      Mar 15 14:43:15 master-0 start-cluster-installation.sh[3365]: Waiting for hosts to become ready for cluster installation...
      Mar 15 14:43:25 master-0 start-cluster-installation.sh[5026]: Hosts known and ready for cluster installation (3/6)
      Mar 15 14:43:25 master-0 start-cluster-installation.sh[3365]: Waiting for hosts to become ready for cluster installation...
      Mar 15 14:43:35 master-0 start-cluster-installation.sh[5079]: Hosts known and ready for cluster installation (3/6)
      Mar 15 14:43:35 master-0 start-cluster-installation.sh[3365]: Waiting for hosts to become ready for cluster installation...
      Mar 15 14:43:45 master-0 start-cluster-installation.sh[5124]: Hosts known and ready for cluster installation (3/6)
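
To make the (3/6) in the log above easier to read, here is a small Python sketch that extracts the ready and expected host counts from such a line. It assumes the expected total is the control plane replicas plus the worker replicas, which is consistent with the compact case in this report (3 control plane hosts plus 3 default workers). The helper names are hypothetical.

```python
import re

# Matches the status line printed by start-cluster-installation.sh in the log above.
STATUS_RE = re.compile(
    r"Hosts known and ready for cluster installation \((\d+)/(\d+)\)"
)


def parse_host_status(line: str):
    """Return (ready, expected) from a status line, or None if it does not match."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def expected_hosts(control_plane_replicas: int, worker_replicas: int) -> int:
    """Assumed relationship: total expected hosts is the sum of both replica counts."""
    return control_plane_replicas + worker_replicas


line = (
    "Mar 15 14:43:04 master-0 start-cluster-installation.sh[4746]: "
    "Hosts known and ready for cluster installation (3/6)"
)
ready, expected = parse_host_status(line)
# With 3 control plane hosts and the default of 3 workers, 6 hosts are expected,
# so a compact cluster with only 3 hosts never reaches the target.
print(ready, expected, expected == expected_hosts(3, 3))
```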
      
      Since the compute section in install-config.yaml is optional, we can't assume that it will be present:
      https://github.com/openshift/installer/blob/master/pkg/types/installconfig.go#L126

      Version-Release number of selected component (if applicable):

      4.12

      How reproducible:

       

      Steps to Reproduce:

      1. Remove the compute section from install-config.yaml
      2. Do an install
      3. See the failure
      

      Actual results:

       

      Expected results:

       

      Additional info:

       


            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory (Important: OpenShift Container Platform 4.14.0 bug fix and security update), and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2023:5006


            Biagio Manzari added a comment -

            Great, thanks bfournie@redhat.com

            Fix verified using version 4.14.0-0.nightly-2023-06-20-065807

            Compact cluster with compute replicas removed and 2 workers defined

            Warning message
            https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/40648/rehearse-40648-periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-nightly-agent-baremetal-compact-ipv4-static-p1-f14/1673306316187111424#1:build-log.txt%3A108

            agent-config
            https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/pr-logs/pull/openshift_release/40648/rehearse-40648-periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-nightly-agent-baremetal-compact-ipv4-static-p1-f14/1673306316187111424/artifacts/baremetal-compact-ipv4-static-p1-f14/baremetal-lab-agent-install/artifacts/agent-config.yaml

            install-config
            https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/pr-logs/pull/openshift_release/40648/rehearse-40648-periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-nightly-agent-baremetal-compact-ipv4-static-p1-f14/1673306316187111424/artifacts/baremetal-compact-ipv4-static-p1-f14/baremetal-lab-agent-install/artifacts/install-config.yaml

            Robert Fournier added a comment -

            Note that this fix ONLY adds an additional warning message, it does not fail the ISO generation. If you remove the compute section (so that by default 3 computes are used) and add 2 workers in agent-config.yaml you should get the following warning when generating the ISO

            level=warning msg=The number of hosts configured as workers (2) does not match the worker replicas (3)
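
A check of this kind can be sketched in Python as follows. This is only an illustration of the comparison, not the installer's implementation. The function name is hypothetical, and it assumes each host entry from agent-config.yaml is a dict with a `role` field; the message text is copied from the warning quoted above.

```python
def worker_mismatch_warning(agent_hosts: list, worker_replicas: int):
    """Return a warning string if the worker host count differs from the replicas.

    Hypothetical helper: `agent_hosts` is assumed to be the `hosts` list from
    agent-config.yaml, where each entry may carry role "master" or "worker".
    Returns None when the counts agree. Like the fix described in this comment,
    it only warns; it does not treat the mismatch as an error.
    """
    configured = sum(1 for host in agent_hosts if host.get("role") == "worker")
    if configured == worker_replicas:
        return None
    return (
        f"The number of hosts configured as workers ({configured}) "
        f"does not match the worker replicas ({worker_replicas})"
    )


hosts = [{"role": "master"}] * 3 + [{"role": "worker"}] * 2
# Compute section removed, so the default of 3 worker replicas applies.
print(worker_mismatch_warning(hosts, 3))
```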

            Robert Fournier added a comment -

            bmanzari yes that is the expected result when using SNO. When compute replicas is not set it defaults to 3 which is why you see that failure.

            I'd recommend also testing with a Compact cluster by removing the compute replicas and defining 2 worker nodes in the agent-config.yaml hosts section.

            Biagio Manzari added a comment - - edited

            Using version 4.14.0-0.nightly-2023-06-20-065807

            Removing the compute section from the install-config.yaml file results in the following error:

             
            level=error msg=failed to write asset (Agent Installer ISO) to disk: cannot generate ISO image due to configuration errors
             
            level=fatal msg=failed to fetch Agent Installer ISO: failed to load asset "Install Config": invalid install-config configuration: Compute.Replicas: Required value: Total number of Compute.Replicas must be 0 for none platform. Found 3
             
            Prow job
            https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/40648/rehearse-40648-periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-nightly-agent-baremetal-sno-ipv4-static-p1-f14/1673261389294080000
             
            install-config.yaml file
            https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/pr-logs/pull/openshift_release/40648/rehearse-40648-periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-nightly-agent-baremetal-sno-ipv4-static-p1-f14/1673261389294080000/artifacts/baremetal-sno-ipv4-static-p1-f14/baremetal-lab-agent-install/artifacts/install-config.yaml 
             
            Is this the expected result?


            Zane Bitter added a comment -

            I think we have to keep this behaviour so that the install-config has the same meaning no matter which method you use to install. But we definitely need to document it well. This was still a surprise to me, and I'm pretty sure this is not the first time I have learned of it.

