OpenShift Bugs / OCPBUGS-36779

3rd master still not joining to the cluster on ABI


    • Important
    • No
    • False

      Cause: When assisted-installer checks the control plane nodes for readiness and encounters a conflict with a write from assisted-installer-controller, it does not reload fresh data from assisted-service to determine control plane readiness.

      Consequence: A conflict occurs on every retry because the assisted-installer continues to look at old data. If a node is determined to be ready by assisted-installer-controller, assisted-installer does not see it because it still looks at old data.

      Fix: When a conflict occurs, assisted-installer now loads fresh data from assisted-service.

      Result: The refreshed data should show that the node has already been updated, so assisted-installer does not retry the update. The updated data should also indicate the node's current status and whether it is ready.
    • Bug Fix
    • In Progress

      Previously, in OCPBUGS-32105, we fixed a bug where a race between the assisted-installer and the assisted-installer-controller to mark a Node as Joined would result in 30+ minutes of (unlogged) retries by the former if the latter won. This was indistinguishable from the installation process hanging, and it would eventually time out.

      This bug has been fixed, but we were unable to reproduce the circumstances that caused it.

      However, a reproduction by the customer reveals another problem. We now correctly retry checking the control plane nodes for readiness if we encounter a conflict with another write from assisted-installer-controller, but we never reload fresh data from assisted-service. That data would show that the host has already been updated and thus prevent us from trying to update it again. As a result, we continue to get a conflict on every retry. (This is at least now logged, so we can see what is happening.)
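
      The difference between retrying on a stale read and reloading on conflict can be sketched in Go as follows. This is only an illustration under stated assumptions, not the actual assisted-installer code: `fakeService`, `Host`, `ErrConflict`, `markJoinedWithRetry`, and the host ID `master-2` are all hypothetical names, and the real assisted-service API is not modelled.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrConflict is a hypothetical stand-in for the conflict returned when
// another writer has already updated the host.
var ErrConflict = errors.New("conflict: host already updated")

// Host is a hypothetical, simplified snapshot of a host record.
type Host struct {
	ID     string
	Joined bool
}

// fakeService is a hypothetical in-memory stand-in for assisted-service.
type fakeService struct {
	hosts map[string]bool // host ID -> already marked as joined
}

// GetHost returns a fresh snapshot of the host.
func (s *fakeService) GetHost(id string) Host {
	return Host{ID: id, Joined: s.hosts[id]}
}

// MarkJoined rejects the write with ErrConflict if the host is already joined.
func (s *fakeService) MarkJoined(id string) error {
	if s.hosts[id] {
		return ErrConflict
	}
	s.hosts[id] = true
	return nil
}

// markJoinedWithRetry attempts the update starting from the given snapshot.
// With reload=false it keeps using the stale snapshot (the behaviour described
// above); with reload=true it re-reads the host after a conflict.
// It returns the number of write attempts and whether the host is known joined.
func markJoinedWithRetry(svc *fakeService, snapshot Host, maxWrites int, reload bool) (writes int, done bool) {
	for writes < maxWrites {
		if snapshot.Joined {
			return writes, true // nothing left to do
		}
		writes++
		err := svc.MarkJoined(snapshot.ID)
		if err == nil {
			return writes, true
		}
		if errors.Is(err, ErrConflict) && reload {
			snapshot = svc.GetHost(snapshot.ID) // refresh before the next attempt
		}
	}
	return writes, snapshot.Joined
}

func main() {
	for _, reload := range []bool{false, true} {
		svc := &fakeService{hosts: map[string]bool{}}
		snapshot := svc.GetHost("master-2") // read taken before the other write
		_ = svc.MarkJoined("master-2")      // the other writer wins the race
		writes, done := markJoinedWithRetry(svc, snapshot, 5, reload)
		fmt.Printf("reload=%v writes=%d done=%v\n", reload, writes, done)
	}
}
```

      In this sketch the stale variant exhausts all of its attempts without ever seeing that the host is joined, while the reloading variant stops after the first conflict.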

      This also suggests a potential way to reproduce the problem: whenever one control plane node has booted to the point that the assisted-installer-controller is running before the second control plane node has booted to the point that the Node is marked as ready in the k8s API, there is a possibility of a race. There is in fact no need for the write from assisted-installer-controller to land in the narrow window between assisted-installer's read from and its write to the assisted-service API, because assisted-installer is always using a stale read.

            zabitter Zane Bitter
            zabitter Zane Bitter
            Manoj Hans Manoj Hans