Bug
Resolution: Done-Errata
Normal
4.11
None
Description of problem:
During a highly escalated support case, we found the following scenario:

- Due to an unrelated problem (out of scope for this bug report), 2 control plane nodes had the "localhost.localdomain" hostname when their respective sdn-controller pods started.
- Because both sdn-controller pods had (and retained) the "localhost.localdomain" hostname, both of them used "localhost.localdomain" as their identity while acquiring and renewing the controller lease in the openshift-network-controller configmap.
- As a result, both sdn-controller pods mistakenly believed that they were the active sdn-controller, so both of them were active at the same time.

Such a situation can have a number of undesired (and unknown) side effects. In our case, the same hostsubnet was allocated to two nodes, disrupting pod communication between those 2 nodes and with the rest of the cluster.

What we expect from this bug report: sdn-controller should never try to acquire the lease as "localhost.localdomain" during a failure scenario. The ideal solution would be to acquire the lease with an identity that cannot collide (more on this in the comments), but at the very least, sdn-controller should prefer crash-looping over starting with a lease identity that can collide and wreak havoc.
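The guard described above could look like the following. This is only an illustrative sketch, not the actual sdn-controller code (which is written in Go); the function name `lease_identity`, the placeholder list, and the random-suffix scheme are assumptions for illustration:

```python
import secrets

# Hostnames that must never be used as a lease holderIdentity:
# they are placeholders, so two nodes can report them simultaneously.
PLACEHOLDER_HOSTNAMES = {"localhost", "localhost.localdomain"}

def lease_identity(hostname: str) -> str:
    """Return a collision-free holderIdentity for the controller lease.

    Raises ValueError for placeholder hostnames so that the caller can
    crash-loop instead of acquiring a lease it may not actually hold.
    """
    if not hostname or hostname in PLACEHOLDER_HOSTNAMES:
        raise ValueError(f"refusing placeholder hostname: {hostname!r}")
    # A random suffix keeps identities unique even if two pods somehow
    # report the same (valid) hostname.
    return f"{hostname}_{secrets.token_hex(4)}"
```

With this check in place, a pod that comes up as "localhost.localdomain" fails fast and retries after the hostname is fixed, rather than silently colliding on the lease.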
Version-Release number of selected component (if applicable):
Found on 4.11, but it should be reproducible in 4.13 as well.
How reproducible:
Under some error scenarios where 2 control plane nodes temporarily have the "localhost.localdomain" hostname by mistake.
Steps to Reproduce:
1. Have 2 control plane nodes temporarily report the "localhost.localdomain" hostname.
2. Start the sdn-controller pods on those nodes.
3. Check the holderIdentity of the lease in the openshift-network-controller configmap.
Actual results:
2 sdn-controller pods acquire the lease with "localhost.localdomain" holderIdentity and become active at the same time.
Expected results:
No sdn-controller pod should acquire the lease with the "localhost.localdomain" holderIdentity. Either use identities that remain unique even in a failure scenario, or just crash-loop.
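For triage, one way to spot the bad state is to inspect the holderIdentity recorded in the lease. A minimal sketch, assuming the lease is stored in client-go's `control-plane.alpha.kubernetes.io/leader` annotation on the openshift-network-controller configmap (the annotation key comes from client-go's configmap-based leader election, not from this bug report):

```python
import json

def suspect_holder(leader_annotation: str) -> bool:
    """Return True if the recorded lease holder is a placeholder name.

    `leader_annotation` is the JSON value of the
    control-plane.alpha.kubernetes.io/leader annotation on the
    openshift-network-controller configmap.
    """
    holder = json.loads(leader_annotation).get("holderIdentity", "")
    return holder in ("", "localhost", "localhost.localdomain")
```

A lease where this returns True means the active controller cannot be trusted: any pod with the same placeholder hostname would consider itself the holder.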
Additional info:
Just FYI: the trigger that caused the wrong hostname was investigated in this other bug: https://issues.redhat.com/browse/OCPBUGS-11997. However, this situation may arise under other failure scenarios as well, so it is worth preventing it in sdn-controller itself.