  1. OpenShift Bugs
  2. OCPBUGS-20084

Several compact clusters failed to install because non-bootstrap nodes never pulled ignition files

    • Quality / Stability / Reliability

      Description of problem:

      While installing clusters of various sizes at scale with ACM/ZTP via the assisted-service, several clusters failed to complete installation, apparently all because the non-bootstrap nodes failed to pull their ignition files.
      
      4 clusters out of 34 were in this state:
      compact-00031
      compact-00034
      compact-00098
      compact-00340
      
      At the conclusion of the test these clusters still appeared to be in InstallationInProgress; the installs ran beyond 12 hours and later moved to the error state.
      
      Example cluster
      # oc get aci -n compact-00034
      NAME            CLUSTER         STATE
      compact-00034   compact-00034   error
      # oc get bmh -n compact-00034
      NAME      STATE          CONSUMER   ONLINE   ERROR   AGE
      vm01307   provisioned               true             39h
      vm01308   provisioned               true             39h
      vm01309   provisioning              true             39h
      
      # ssh core@vm01307
      ssh: connect to host vm01307 port 22: Connection refused
      # ssh core@vm01308
      ssh: connect to host vm01308 port 22: Connection refused 
      # ssh core@vm01309
      ...
      This is the bootstrap node; it will be destroyed when the master is fully up.

      The primary services are release-image.service followed by bootkube.service. To watch their status, run e.g.

        journalctl -b -f -u release-image.service -u bootkube.service
      Last login: Wed Oct  4 13:40:40 2023 from fc00:1004::1
      [core@vm01309 ~]$
      
      
      The most notable sign of a cluster in this condition is that 2 of the 3 control-plane BMHs are in the provisioned state while the last one (the bootstrap node in the example above) is still in the provisioning state. The two provisioned nodes cannot be reached over ssh because ignition has not been applied, so the ssh key for access was never installed.
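
The BMH state pattern described above can be checked mechanically. The following is a minimal sketch, not part of the original report: the helper names `parse_bmh_states` and `looks_stuck` are hypothetical, and it only parses the plain-text table printed by `oc get bmh -n <namespace>`. The pattern is presumably also visible briefly during a healthy install, so it is only a symptom when it persists for many hours, as in the clusters listed here.

```python
from collections import Counter


def parse_bmh_states(table: str) -> dict[str, str]:
    """Map each BareMetalHost name to its STATE from `oc get bmh` text output."""
    states = {}
    for line in table.strip().splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 2 or fields[0] == "NAME":
            continue
        states[fields[0]] = fields[1]
    return states


def looks_stuck(states: dict[str, str]) -> bool:
    """True when a 3-node compact cluster shows 2 provisioned and 1 provisioning host."""
    counts = Counter(states.values())
    return (
        len(states) == 3
        and counts["provisioned"] == 2
        and counts["provisioning"] == 1
    )


# Sample input copied from the compact-00034 output in this report.
sample = """\
NAME      STATE          CONSUMER   ONLINE   ERROR   AGE
vm01307   provisioned               true             39h
vm01308   provisioned               true             39h
vm01309   provisioning              true             39h
"""

states = parse_bmh_states(sample)
print(looks_stuck(states))  # prints True
```

Feeding it the output for each namespace (for example, captured with `oc get bmh -n compact-00031`) gives a quick way to count how many failed clusters match this signature.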
      
      Journal logs and agentclusterinstall logs have been collected from the available machines to assist in debugging this issue.

       

      Version-Release number of selected component (if applicable):

      Hub OCP - 4.14.0-rc.2
      Deployed clusters 4.14.0-rc.2
      ACM - 2.9.0-DOWNSTREAM-2023-09-27-22-12-46

      How reproducible:

      So far, 4 out of 34 failures appear to be this issue.

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

       

      Expected results:

       

      Additional info:

      Mapping of VMs to cluster BMHs:
      # oc get bmh -n compact-00031
      NAME      STATE          CONSUMER   ONLINE   ERROR   AGE
      vm01298   provisioned               true             40h
      vm01299   provisioned               true             40h
      vm01300   provisioning              true             40h
      # oc get bmh -n compact-00034
      NAME      STATE          CONSUMER   ONLINE   ERROR   AGE
      vm01307   provisioned               true             40h
      vm01308   provisioned               true             40h
      vm01309   provisioning              true             40h
      # oc get bmh -n compact-00098
      NAME      STATE          CONSUMER   ONLINE   ERROR   AGE
      vm01499   provisioned               true             40h
      vm01500   provisioning              true             40h
      vm01501   provisioned               true             40h
      # oc get bmh -n compact-00340
      NAME      STATE          CONSUMER   ONLINE   ERROR   AGE
      vm02225   provisioning              true             40h
      vm02226   provisioned               true             40h
      vm02227   provisioned               true             40h
      

        1. compact-00031-aci-events.json
          49 kB
        2. compact-00034-aci-events.json
          50 kB
        3. compact-00098-aci-events.json
          53 kB
        4. compact-00340-aci-events.json
          49 kB
        5. vm01300.journal.log.gz
          10.61 MB
        6. vm02225.journal.log.gz
          10.97 MB
        7. vm01500.journal.log.gz
          10.77 MB
        8. compact-00031-aci-logs.tar
          12.90 MB
        9. compact-00034-aci-logs.tar
          12.53 MB
        10. compact-00098-aci-logs.tar
          12.53 MB
        11. compact-00340-aci-logs.tar
          12.49 MB
        12. vm01309.journal.log-1.gz
          10.79 MB

              itsoiref@redhat.com Igal Tsoiref
              akrzos@redhat.com Alex Krzos
              Lital Alon Lital Alon