Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: 4.17.z
Component/s: Networking / ovn-kubernetes
Labels: None
Severity: Important
Regression: None
Blocked: False
Blocked Reason: None

Description of problem: In two separate labs, after upgrading from 4.16 to 4.17.3, the network ClusterOperator never becomes healthy. The ovnkube-control-plane-* pods are crash looping, and their logs repeat the following error:

I1118 16:20:31.898589 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:29108: connect: connection refused
I1118 16:20:36.066389 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:29108: connect: connection refused
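To gauge how often the proxy error recurs and which local endpoint is being refused, the log lines can be tallied with a short script. This is only a sketch: it assumes the logs use the klog-style header seen above (severity letter, MMDD, time, thread id, file:line), and the names `KLOG_LINE`, `REFUSED`, and `count_refused` are made up for illustration.

```python
import re
from collections import Counter

# klog-style header as seen in the log excerpt: I1118 16:20:31.898589 1 log.go:245] msg
KLOG_LINE = re.compile(
    r"^(?P<sev>[IWEF])(?P<mmdd>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<tid>\d+) (?P<src>[^\]]+)\] (?P<msg>.*)$"
)
# Matches the "connection refused" dial error and captures the target address.
REFUSED = re.compile(r"dial tcp (?P<addr>\S+): connect: connection refused")


def count_refused(lines):
    """Count connection-refused dial errors per target address."""
    counts = Counter()
    for line in lines:
        header = KLOG_LINE.match(line.strip())
        if header is None:
            continue  # not a klog-formatted line; skip it
        refused = REFUSED.search(header.group("msg"))
        if refused:
            counts[refused.group("addr")] += 1
    return counts


if __name__ == "__main__":
    # The two lines quoted in this report.
    sample = [
        "I1118 16:20:31.898589 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:29108: connect: connection refused",
        "I1118 16:20:36.066389 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:29108: connect: connection refused",
    ]
    for addr, n in count_refused(sample).items():
        print(addr, n)
```

Feeding it the full pod log (for example, saved to a file and read line by line) would show whether 127.0.0.1:29108 is the only endpoint being refused.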

Version-Release number of selected component (if applicable): 4.17.3

How reproducible: Reproduced in two separate labs by upgrading from OCP 4.16 to OCP 4.17.3

Steps to Reproduce:

1. Start from a 4.16.x cluster (4.16.15 at least, maybe higher)

2. Upgrade via the Console. The upgrade works fine except that the network ClusterOperator never becomes healthy

Actual results: The network ClusterOperator stays stuck in a degraded state

Expected results: The network ClusterOperator should become healthy after the upgrade
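The difference between the actual and expected results can be checked mechanically from the operator's status conditions. The sketch below assumes the JSON comes from `oc get clusteroperator network -o json` and treats "healthy" as Available=True with Degraded not True; that definition, the function names, and the sample payload are illustrative assumptions, not output captured from the affected clusters.

```python
import json


def operator_conditions(clusteroperator_json):
    """Map each condition type (Available, Progressing, Degraded, ...) to its status string."""
    obj = json.loads(clusteroperator_json)
    conditions = obj.get("status", {}).get("conditions", [])
    return {c["type"]: c["status"] for c in conditions}


def is_healthy(conds):
    """Assumed definition of healthy: Available is True and Degraded is not True."""
    return conds.get("Available") == "True" and conds.get("Degraded") != "True"


if __name__ == "__main__":
    # Hypothetical payload shaped like a ClusterOperator object; not real cluster output.
    sample = json.dumps({
        "kind": "ClusterOperator",
        "metadata": {"name": "network"},
        "status": {"conditions": [
            {"type": "Available", "status": "True"},
            {"type": "Progressing", "status": "True"},
            {"type": "Degraded", "status": "True"},
        ]},
    })
    conds = operator_conditions(sample)
    print(conds, "healthy" if is_healthy(conds) else "not healthy")
```

Running this against the saved JSON before and after a workaround gives a quick yes/no on whether the operator has recovered.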

Additional info: Not sure if it's related, but both clusters have NMState operator installed and bridges defined on the workers on a dedicated NIC that is on the same network as br-ex. In one cluster, I removed the bridge and even removed the secondary NICs from the worker nodes entirely, but the problem persisted.

Furthermore, I have a third cluster that is a fresh OCP4.17 installation (not an upgrade) and I had the same problem. Removing the NMState bridges corrected the issue (in that cluster, the bridges weren't actually necessary, anyway). Removing the bridges in the upgraded clusters does not seem to resolve the issue.

Affected Platforms:

Is it an internal RedHat testing failure

If it is an internal RedHat testing failure:

You may login to the cluster for further investigation:
oc login --insecure-skip-tls-verify -u admin -p redhat https://api.virt.msecaur.vmware.tamlab.rdu2.redhat.com:6443

Assignee: Ben Bennett
Reporter: Matthew Secaur
QA Contact: Anurag Saxena
Votes: 0
Watchers: 5
Created: 2024/11/18 4:28 PM
Updated: 2024/11/20 5:50 PM
