OpenShift Bugs / OCPBUGS-18785

sdn-controller should never try to acquire a lease as "localhost.localdomain"


    • Important
    • No
    • SDN Sprint 243, SDN Sprint 244
    • 2
    • False



      Description of problem:

      During a highly escalated situation, we observed the following sequence of events:
      - Due to an unrelated problem, 2 control plane nodes had the hostname "localhost.localdomain" when their respective sdn-controller pods started (that problem is outside the scope of this bug report).
      - As both sdn-controller pods had (and retained) the "localhost.localdomain" hostname, both of them used "localhost.localdomain" as their identity while trying to acquire and renew the controller lease in the openshift-network-controller configmap.
      - This ultimately caused both sdn-controller pods to mistakenly believe that they were the active sdn-controller, so both of them were active at the same time.
      Such a situation might have a number of undesired (and unknown) side effects. In our case, the result was that two nodes were allocated the same hostsubnet, disrupting pod communication between the 2 nodes and with the other nodes.
      What we expect from this bug report: that the sdn-controller never tries to acquire a lease as "localhost.localdomain" during a failure scenario. The ideal solution would be to acquire the lease in a way that avoids collisions (more on this in the comments), but at the very least, sdn-controller should crash-loop rather than start with a lease identity that can collide and wreak havoc.
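
      The expectation above could be sketched roughly as follows in Go. This is a minimal illustration, not the actual sdn-controller code: the function name leaseIdentity, the set of rejected names, and the podUID parameter (any per-pod unique value, for example the pod UID) are all assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// placeholderHostnames lists names that cannot identify a node uniquely.
// The exact set is an assumption made for this sketch.
var placeholderHostnames = map[string]bool{
	"":                      true,
	"localhost":             true,
	"localhost.localdomain": true,
}

// errPlaceholderHostname is returned when the hostname must not be used
// as a lease identity.
var errPlaceholderHostname = errors.New("hostname is a placeholder and cannot be used as a lease identity")

// leaseIdentity returns the holderIdentity to use for the controller lease.
// hostname is the name the process observed at startup; podUID is a
// hypothetical per-pod unique value that makes collisions unlikely.
func leaseIdentity(hostname, podUID string) (string, error) {
	name := strings.ToLower(strings.TrimSpace(hostname))
	if placeholderHostnames[name] {
		// The caller is expected to exit non-zero here, so that the pod
		// crash-loops instead of competing for the lease.
		return "", fmt.Errorf("%w: %q", errPlaceholderHostname, hostname)
	}
	if podUID == "" {
		// No unique suffix available: fall back to the hostname alone.
		return name, nil
	}
	return name + "_" + podUID, nil
}
```

      A caller would compute the identity once at startup and exit with an error if leaseIdentity fails, which gives the crash-loop behavior requested above. Appending a unique suffix covers the "avoid collisions" variant: two pods that see the same valid hostname still end up with different identities.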

      Version-Release number of selected component (if applicable):

      Found on 4.11, but it should be reproducible in 4.13 as well.

      How reproducible:

      Reproducible under error scenarios in which 2 control plane nodes temporarily have the hostname "localhost.localdomain" by mistake.

      Steps to Reproduce:

      1. Start the sdn-controller pods while 2 control plane nodes have the hostname "localhost.localdomain"

      Actual results:

      2 sdn-controller pods acquire the lease with "localhost.localdomain" holderIdentity and become active at the same time.
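
      Why an identical holderIdentity leads to two active controllers can be shown with a deliberately simplified model of identity-based lease renewal. The types and functions below (leaseRecord, candidate, tryAcquireOrRenew, activeCount) are hypothetical names for this sketch and are not the actual leader-election code; the model also ignores details such as optimistic concurrency on the update.

```go
package main

import "time"

// leaseRecord is a simplified stand-in for the leader-election record
// stored in the lock object.
type leaseRecord struct {
	HolderIdentity string
	RenewTime      time.Time
	LeaseDuration  time.Duration
}

// candidate is one process taking part in the election.
type candidate struct {
	Identity string
}

// tryAcquireOrRenew reports whether the candidate considers itself the
// leader after this attempt. A live lease held by a different identity
// blocks the attempt; a lease held by the same identity is treated as
// the candidate's own lease and is simply renewed.
func (c candidate) tryAcquireOrRenew(rec *leaseRecord, now time.Time) bool {
	held := rec.HolderIdentity != ""
	valid := now.Before(rec.RenewTime.Add(rec.LeaseDuration))
	if held && valid && rec.HolderIdentity != c.Identity {
		return false // someone else holds a live lease
	}
	rec.HolderIdentity = c.Identity
	rec.RenewTime = now
	return true
}

// activeCount lets each identity attempt the lease one second apart,
// well within the lease duration, and returns how many candidates end
// up believing they are the leader.
func activeCount(identities []string) int {
	rec := &leaseRecord{LeaseDuration: time.Minute}
	start := time.Unix(1700000000, 0)
	active := 0
	for i, id := range identities {
		now := start.Add(time.Duration(i) * time.Second)
		if (candidate{Identity: id}).tryAcquireOrRenew(rec, now) {
			active++
		}
	}
	return active
}
```

      In this model, two candidates that both use "localhost.localdomain" each see a lease that appears to be their own, so both renew it and both stay active. With distinct identities, the second candidate is blocked until the lease expires.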

      Expected results:

      No sdn-controller pod acquires the lease with "localhost.localdomain" as holderIdentity. Either use unique identities even in a failure scenario, or just crash-loop.

      Additional info:

      Just FYI, the trigger that caused the wrong hostname was investigated in another bug: https://issues.redhat.com/browse/OCPBUGS-11997
      However, this situation may happen under other possible failure scenarios, so it is worth preventing it somehow.

            mkennell@redhat.com Martin Kennelly
            rhn-support-palonsor Pablo Alonso Rodriguez
            Zhanqi Zhao Zhanqi Zhao