  1. OpenShift Bugs
  2. OCPBUGS-31049

Egress IPs were removed from nodes


    • Moderate
    • No
    • False
    • Customer Escalated, Customer Facing
    • Likely a firewall configuration issue; nodes hosting EgressIPs not reachable

      Description of problem:

      The CNCC (cloud-network-config-controller) removed all the egress IPs from the nodes.
      
      After the EgressIP objects were deleted and recreated, only the egress IPs for a single availability zone were restored.

      Version-Release number of selected component (if applicable):

      Core controller

      How reproducible:

      Not readily reproducible, but an existing non-PROD cluster and a PROD cluster both exhibit this behaviour.

      Steps to Reproduce:

          1.  N/A

      Actual results:

      All egress IPs appeared to be removed from the nodes, and after deleting and re-creating the EgressIP objects only the IPs for one availability zone recovered.
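
      One way to confirm which availability zones currently hold assignments is to group the assigned egress IPs by the zone of their host node. The following is a minimal sketch, not a supported tool. It assumes OVN-Kubernetes EgressIP objects (status entries carrying `node` and `egressIP`), nodes labelled with `topology.kubernetes.io/zone`, and input taken from `oc get egressip -o json` and `oc get nodes -o json`. The function name is hypothetical.

```python
from collections import defaultdict

# Standard well-known zone label; assumed to be set on the ROSA nodes.
ZONE_LABEL = "topology.kubernetes.io/zone"


def egress_ips_by_zone(egressips, nodes):
    """Group assigned egress IPs by the availability zone of their host node.

    Both arguments are parsed JSON documents with an "items" list, e.g. the
    output of `oc get egressip -o json` and `oc get nodes -o json`.
    """
    # Map node name -> zone, falling back to "unknown" if the label is absent.
    zone_of = {
        node["metadata"]["name"]: node["metadata"].get("labels", {}).get(ZONE_LABEL, "unknown")
        for node in nodes.get("items", [])
    }

    by_zone = defaultdict(list)
    for obj in egressips.get("items", []):
        # An EgressIP with no assignment has an empty or missing status.
        for item in (obj.get("status") or {}).get("items") or []:
            ip = item.get("egressIP")
            if ip:
                by_zone[zone_of.get(item.get("node"), "unknown")].append(ip)

    return {zone: sorted(ips) for zone, ips in by_zone.items()}
```

      A zone that is missing from the result has no egress IP assigned on any of its nodes, which would match the single-zone recovery described above.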

      Expected results:

      Egress IPs remain permanently on their allocated nodes until the EgressIP object is removed or the node becomes unavailable.

      Additional info:

      An excessive number of EgressIP/CloudPrivateIPConfig events have been logged by the CNCC controller in the past 18 days: approximately 96,000 log entries.
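
      To see whether these entries are concentrated on a few addresses, the CNCC log can be tallied per CloudPrivateIPConfig. This is a rough sketch that assumes the log lines follow the `Put ".../cloudprivateipconfigs/<ip>/status": context deadline exceeded` form quoted later in this report. The function name is hypothetical.

```python
import re
from collections import Counter

# Captures the CloudPrivateIPConfig name (the IP) from the request URL.
_CPIC_NAME = re.compile(r'/cloudprivateipconfigs/([^/"\s]+)')


def tally_deadline_errors(lines):
    """Count 'context deadline exceeded' log lines per CloudPrivateIPConfig."""
    counts = Counter()
    for line in lines:
        if "context deadline exceeded" not in line:
            continue
        match = _CPIC_NAME.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

      Passing an open log file (an iterable of lines) and calling `.most_common()` on the result lists the most affected addresses first.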
      The OCP audit logs do not appear to contain any entries for the API calls named in these messages, for example:
      
      Put "https://api-int.uat-rosa.80g0.p1.openshiftapps.com:6443/apis/cloud.network.openshift.io/v1/cloudprivateipconfigs/10.134.17.151/status": context deadline exceeded, requeuing in cloud-private-ip-config workqueue
      
      grep "cloudprivateipconfigs" audit-prod-rosa-x6fzw-2024.02.10T08.00_0800-2024.02.10T08.00_0800.log.txt | wc -l
             0
      
      Note that the date range of the audit logs does not overlap with the current CNCC logs, but the issue already existed when the audit logs were taken.
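
      When audit logs from an overlapping period are available, a structured count may say more than a plain grep. The sketch below assumes the audit log is in the standard Kubernetes JSON-lines format (`verb`, `objectRef.resource`, `responseStatus.code`). The function name is hypothetical.

```python
import json
from collections import Counter


def count_cpic_audit_events(lines):
    """Count audit events on cloudprivateipconfigs, keyed by (verb, HTTP code)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if not isinstance(event, dict):
            continue
        resource = (event.get("objectRef") or {}).get("resource")
        if resource != "cloudprivateipconfigs":
            continue
        code = (event.get("responseStatus") or {}).get("code")
        counts[(event.get("verb"), code)] += 1
    return counts
```

      An empty result for a period in which the CNCC was logging timeouts would be consistent with the requests never reaching the API server, although the audit policy in use would need to be checked before drawing that conclusion.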

       

      I am opening this tonight, but we have also just thought of another reason why this might be occurring.
      
      There is a firewall involved, so we are wondering whether cross-AZ traffic for the ROSA node EC2 instances is routed via the managed firewall. They had a catastrophic firewall failure, which required recovering without a backup.
      
      Please do not put too much effort into this today, as we will check this with the customer tomorrow, APAC time.

       

              bbennett@redhat.com Ben Bennett
              rhn-support-dsquirre David Squirrell
              Jean Chen Jean Chen