Bug
Resolution: Done-Errata
Normal
4.11
None
Description of problem:
During a highly escalated support case, we found the following scenario:

- Due to an unrelated problem (out of scope for this bug report), 2 control plane nodes had the "localhost.localdomain" hostname when their respective sdn-controller pods started.
- Because both sdn-controller pods had (and retained) the "localhost.localdomain" hostname, both of them used "localhost.localdomain" as their identity while acquiring and renewing the controller lease in the openshift-network-controller configmap.
- As a result, both sdn-controller pods mistakenly believed that they were the active sdn-controller, so both of them were active at the same time.

Such a situation can have a number of undesired (and unknown) side effects. In our case, the same hostsubnet was allocated to two nodes, disrupting pod communication between those 2 nodes and with the rest of the cluster.

What we expect from this bug report: sdn-controller should never try to acquire the lease as "localhost.localdomain" during a failure scenario. The ideal solution would be to acquire the lease with an identity that cannot collide (more on this in the comments), but at the very least, sdn-controller should prefer crash-looping over starting with a lease identity that can collide and wreak havoc.
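The guard described above could look like the following. This is only an illustrative sketch, not the actual sdn-controller code (which is written in Go); the function name `lease_identity`, the placeholder list, and the random-suffix scheme are assumptions for illustration:

```python
import secrets

# Hostnames that must never be used as a lease holderIdentity:
# they are placeholders, so two nodes can report them simultaneously.
PLACEHOLDER_HOSTNAMES = {"localhost", "localhost.localdomain"}

def lease_identity(hostname: str) -> str:
    """Return a collision-free holderIdentity for the controller lease.

    Raises ValueError for placeholder hostnames so that the caller can
    crash-loop instead of acquiring a lease it may not actually hold.
    """
    if not hostname or hostname in PLACEHOLDER_HOSTNAMES:
        raise ValueError(f"refusing placeholder hostname: {hostname!r}")
    # A random suffix keeps identities unique even if two pods somehow
    # report the same (valid) hostname.
    return f"{hostname}_{secrets.token_hex(4)}"
```

With this check in place, a pod that comes up as "localhost.localdomain" fails fast and retries after the hostname is fixed, rather than silently colliding on the lease.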
Version-Release number of selected component (if applicable):
Found on 4.11, but it should be reproducible in 4.13 as well.
How reproducible:
Under some error scenarios where 2 control plane nodes temporarily have the "localhost.localdomain" hostname by mistake.
Steps to Reproduce:
1. Have 2 control plane nodes temporarily report the "localhost.localdomain" hostname.
2. Start the sdn-controller pods on those nodes.
3. Check the holderIdentity of the lease in the openshift-network-controller configmap.
Actual results:
2 sdn-controller pods acquire the lease with "localhost.localdomain" holderIdentity and become active at the same time.
Expected results:
No sdn-controller pod should acquire the lease with the "localhost.localdomain" holderIdentity. Either use identities that remain unique even in a failure scenario, or just crash-loop.
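For triage, one way to spot the bad state is to inspect the holderIdentity recorded in the lease. A minimal sketch, assuming the lease is stored in client-go's `control-plane.alpha.kubernetes.io/leader` annotation on the openshift-network-controller configmap (the annotation key comes from client-go's configmap-based leader election, not from this bug report):

```python
import json

def suspect_holder(leader_annotation: str) -> bool:
    """Return True if the recorded lease holder is a placeholder name.

    `leader_annotation` is the JSON value of the
    control-plane.alpha.kubernetes.io/leader annotation on the
    openshift-network-controller configmap.
    """
    holder = json.loads(leader_annotation).get("holderIdentity", "")
    return holder in ("", "localhost", "localhost.localdomain")
```

A lease where this returns True means the active controller cannot be trusted: any pod with the same placeholder hostname would consider itself the holder.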
Additional info:
Just FYI: the trigger that caused the wrong hostname was investigated in this other bug: https://issues.redhat.com/browse/OCPBUGS-11997. However, this situation may arise under other failure scenarios as well, so it is worth preventing it in sdn-controller itself.