Type: Bug
Resolution: Not a Bug
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.12.z
Component/s: Networking / cloud-network-config-controller
Labels:
- low-confidence
- rosa

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Story Points:
None
Severity:
Moderate
Regression:
No

Target Backport Versions:
None
Target Version:
None
Release Blocker:
None
Sprint:
None

Customer Impact:

Customer Escalated, Customer Facing

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Review Complete:
PX Priority Data:
PX Impact Score:
PX Technical Impact:
PX Impact Range:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

The cloud-network-config-controller (CNCC) removed all the egress IPs from the nodes.

After deleting and recreating the EgressIP objects, only the egress IPs for a single availability zone were restored.

Version-Release number of selected component (if applicable):

Core controller

How reproducible:

Not readily reproducible, but we do have an existing non-PROD and a PROD environment exhibiting this behaviour.

Steps to Reproduce:

    1.  N/A
    2.
    3.

Actual results:

All EgressIPs appeared to be removed from the nodes, and after deleting and re-creating the EgressIP objects only the IPs for one availability zone recovered.

Expected results:

Egress IPs remain permanently on the allocated nodes until the EgressIP object is removed or the node is unavailable.

Additional info:

An excessive number of egressIP/CloudPrivateIPConfig events have been logged by the CNCC controller: approximately 96,000 log entries in the past 18 days.
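To see whether the volume is spread evenly or concentrated on a few egress IPs, the requeue messages can be tallied per CloudPrivateIPConfig name. The following is a minimal sketch, assuming the CNCC log contains lines shaped like the requeue message quoted elsewhere in this report; the function name `tally_requeues` and the regular expression are illustrative, not part of any OpenShift tooling.

```python
import re
from collections import Counter

# Extracts the CloudPrivateIPConfig name (an IP address) from a status URL
# such as .../cloudprivateipconfigs/10.134.17.151/status
CPIC_URL = re.compile(r"/cloudprivateipconfigs/([^/\"\s]+)/status")


def tally_requeues(lines):
    """Count 'context deadline exceeded' messages per CloudPrivateIPConfig.

    Hypothetical helper: it assumes each failure is logged on a single line
    that contains both the status URL and the timeout text.
    """
    counts = Counter()
    for line in lines:
        if "context deadline exceeded" not in line:
            continue
        match = CPIC_URL.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample input: the message quoted in this report, plus an unrelated line.
sample = [
    'Put "https://api-int.uat-rosa.80g0.p1.openshiftapps.com:6443/apis/'
    'cloud.network.openshift.io/v1/cloudprivateipconfigs/10.134.17.151/status": '
    'context deadline exceeded, requeuing in cloud-private-ip-config workqueue',
    "some unrelated log line",
]

for name, count in tally_requeues(sample).most_common():
    print(f"{count:8d}  {name}")
```

For scale, approximately 96,000 entries over 18 days works out to roughly 5,300 entries per day, or about 220 per hour.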

Looking at the OCP audit logs, there do not appear to be any entries corresponding to the API calls in these messages, for example:

Put "https://api-int.uat-rosa.80g0.p1.openshiftapps.com:6443/apis/cloud.network.openshift.io/v1/cloudprivateipconfigs/10.134.17.151/status": context deadline exceeded, requeuing in cloud-private-ip-config workqueue

grep "cloudprivateipconfigs" audit-prod-rosa-x6fzw-2024.02.10T08.00_0800-2024.02.10T08.00_0800.log.txt | wc -l
       0
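Once audit logs covering the same period as the CNCC logs are available, a structured count may say more than a plain grep. The following is a minimal sketch, assuming the audit file holds one JSON audit event per line with the standard Kubernetes audit fields `verb`, `objectRef.resource`, and `responseStatus.code`; the function name `count_audit_events` and the sample events are made up for illustration.

```python
import json
from collections import Counter


def count_audit_events(lines, resource="cloudprivateipconfigs"):
    """Count audit events for one resource, grouped by (verb, status code).

    Hypothetical helper: lines that are not JSON objects are skipped.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON audit event
        if not isinstance(event, dict):
            continue
        ref = event.get("objectRef") or {}
        if ref.get("resource") != resource:
            continue
        status = event.get("responseStatus") or {}
        counts[(event.get("verb"), status.get("code"))] += 1
    return counts


# Synthetic events for illustration only; these are not taken from the case.
sample_events = [
    json.dumps({
        "verb": "update",
        "objectRef": {"resource": "cloudprivateipconfigs", "subresource": "status"},
        "responseStatus": {"code": 200},
    }),
    json.dumps({
        "verb": "get",
        "objectRef": {"resource": "pods"},
        "responseStatus": {"code": 200},
    }),
]

for (verb, code), count in sorted(count_audit_events(sample_events).items()):
    print(verb, code, count)
```

A zero count is not conclusive on its own: audit events are written by the API server, so a request that times out before reaching it would not be expected to leave an entry.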

Note that the dates of the audit logs do not overlap with the current CNCC logs, but the issue did exist back when the audit logs were taken.

I am opening this tonight, but we have also just thought of another reason why this might be occurring.

There is a firewall involved, so we are wondering whether cross-AZ traffic for the ROSA node EC2 instances is routed via the managed firewall. They had a catastrophic firewall failure, which required recovering without a backup.
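One way to test the firewall theory would be to time a plain TCP connection to the internal API endpoint from a node in each availability zone and compare the results. The following is a minimal sketch under that assumption; `probe_tcp` is a hypothetical helper, and it only checks TCP reachability, not TLS or API behaviour.

```python
import socket
import time


def probe_tcp(host, port, timeout=5.0):
    """Attempt one TCP connection and return (succeeded, elapsed_seconds)."""
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts) on failure
        with socket.create_connection((host, port), timeout=timeout):
            ok = True
    except OSError:
        ok = False
    return ok, time.monotonic() - start


# Example call, to be run from a node in each availability zone
# (the host is the api-int endpoint from the error message in this report):
#
#   ok, seconds = probe_tcp("api-int.uat-rosa.80g0.p1.openshiftapps.com", 6443)
#   print(ok, round(seconds, 3))
```

If nodes in one zone connect quickly while nodes in the others time out or are markedly slower, that would be consistent with cross-AZ traffic being affected on its way to the API server.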

Please do not put too much effort into this today, as we will check this with the customer tomorrow, APAC time.

Assignee: Ben Bennett

Reporter: David Squirrell

Need Info From: None

Contributors: None

QA Contact: Jean Chen

Doc Contact: None

Votes: 0

Watchers: 6

Created: 2024/03/19 1:17 PM

Updated: 2025/09/13 1:45 PM

Resolved: 2024/03/25 1:51 PM
