  1. OpenShift Bugs
  2. OCPBUGS-55025

Nodelink: clears existing machine status fields while applying nodeRef


    • None
    • CLOUD Sprint 269
    • 1
    • False
    • Release Note Not Required
    • In Progress

      Description of problem:

      nodelink clears existing machine status fields while applying nodeRef

      I found a conflict between the migration/sync controllers, the nodelink controller and the CPMSO.
      The following sequence occurs when a cluster is installed with MachineAPIMigration=true:

      • The installer creates the control plane machines, which get registered with the cluster
      • The installer applies/creates the control plane machine objects in the cluster
      • The migration controller gets notified about a machine event, reconciles, finds that .status.authoritativeAPI is empty and propagates the value from .spec.authoritativeAPI to it [a] (logs here)
        Meanwhile, the CPMSO hasn't started up yet (see logs here), so no CPMSO owner reference is set on the control plane machines
      • The sync controller can now reconcile the machine because .status.authoritativeAPI is set. It goes into reconcileMAPIMachinetoCAPIMachine(), hits the if len(mapiMachine.OwnerReferences) == 0 { case here, and carries on to r.convertMAPIToCAPIMachine(mapiMachine). There it errors because control plane machines have load balancers set (we are unable to convert them at this point), so it applies synchronized=false and synchronizedGeneration=0 (this is crucial) [b]
      • The nodelink controller finally wakes up and tries to set the nodeRef on the control plane machines to the corresponding nodes. But it fails [c] because it attempts an update to the control plane machine's status (to set the nodeRef) without specifying the synchronizedGeneration, which was previously and erroneously set. This goes against the OpenAPI validation we have for the authoritativeAPI <> synchronizedGeneration status fields.
        This prevents the control plane machines from going into the Running state and renders them un-adoptable by the CPMSO, which goes into Available=False, causing the entire cluster initialization to fail, and the installation with it.
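
The failure mode in the last step can be illustrated with a small, self-contained Go sketch. It is not the nodelink controller's code: MachineStatus, validateStatusUpdate, linkClearing and linkPreserving are hypothetical names, and the validation rule is an assumed approximation (fields that were set may not be dropped by a later update) standing in for the real OpenAPI validation. The value "MachineAPI" and the node name "master-0" are illustrative.

```go
package main

import (
	"errors"
	"fmt"
)

// MachineStatus is a simplified, hypothetical stand-in for the Machine status.
// Only the fields named in this report are modelled.
type MachineStatus struct {
	NodeRef                string // name of the linked node, empty if unset
	AuthoritativeAPI       string // propagated from .spec.authoritativeAPI
	SynchronizedGeneration *int64 // nil means "not specified in this update"
}

// validateStatusUpdate is an ASSUMED approximation of the API validation:
// once set, authoritativeAPI and synchronizedGeneration must not be dropped.
func validateStatusUpdate(old, updated MachineStatus) error {
	if old.AuthoritativeAPI != "" && updated.AuthoritativeAPI == "" {
		return errors.New("status.authoritativeAPI may not be removed once set")
	}
	if old.SynchronizedGeneration != nil && updated.SynchronizedGeneration == nil {
		return errors.New("status.synchronizedGeneration may not be removed once set")
	}
	return nil
}

// linkClearing mimics the problematic behaviour: it builds a fresh status
// that carries only the nodeRef and ignores the existing status fields.
func linkClearing(_ MachineStatus, node string) MachineStatus {
	return MachineStatus{NodeRef: node}
}

// linkPreserving copies the existing status and changes only the nodeRef,
// so fields written earlier by other controllers are sent back unchanged.
func linkPreserving(existing MachineStatus, node string) MachineStatus {
	updated := existing
	updated.NodeRef = node
	return updated
}

func main() {
	gen := int64(0) // the value applied by the sync controller in the step above
	current := MachineStatus{AuthoritativeAPI: "MachineAPI", SynchronizedGeneration: &gen}

	fmt.Println("clearing:  ", validateStatusUpdate(current, linkClearing(current, "master-0")))
	fmt.Println("preserving:", validateStatusUpdate(current, linkPreserving(current, "master-0")))
}
```

Under these assumptions, the update that rebuilds the status from scratch is rejected, while the update that starts from the existing status and only changes the nodeRef passes. Whether the actual fix should preserve the fields in nodelink, or avoid setting synchronizedGeneration in the sync controller in the first place, is not decided by this sketch.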

       

      Version-Release number of selected component (if applicable):

      4.19    

      How reproducible:

      Always

      Steps to Reproduce:

          1.
          2.
          3.
          

      Actual results:

          

      Expected results:

          

      Additional info:

          

              Damiano Donati (ddonati@redhat.com)
              Zhaohua Sun