  1. OpenShift Bugs
  2. OCPBUGS-31049

Egress IPs were removed from nodes

Details

    • Moderate
    • No
    • False
    • Customer Escalated, Customer Facing
    • Likely a firewall configuration issue; nodes hosting the egress IPs not reachable

    Description

      Description of problem:

      The CNCC controller removed all the egress IPs from the nodes.
      
      After deleting and recreating the EgressIP objects, only the egress IPs for a single availability zone were restored.

      Version-Release number of selected component (if applicable):

      Core controller

      How reproducible:

      Not readily reproducible, but an existing non-PROD and an existing PROD environment both exhibit this behaviour.

      Steps to Reproduce:

          1.  N/A
          

      Actual results:

      All egress IPs appeared to be removed from the nodes, and after deleting and re-creating the EgressIP objects only the IPs for one availability zone recovered.

      Expected results:

      Egress IPs remain permanently on the allocated nodes until the EgressIP object is removed or the node becomes unavailable.

      Additional info:

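      An excessive number of EgressIP/CloudPrivateIPConfig events appear in the CNCC controller logs: approximately 96,000 log entries in the past 18 days.
      Looking at the OCP audit logs, there do not appear to be any entries for the API calls these messages refer to. For example, the CNCC log contains the following error, yet a grep of the audit log for "cloudprivateipconfigs" returns zero matches: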
      
      Put "https://api-int.uat-rosa.80g0.p1.openshiftapps.com:6443/apis/cloud.network.openshift.io/v1/cloudprivateipconfigs/10.134.17.151/status": context deadline exceeded, requeuing in cloud-private-ip-config workqueue
      
      grep "cloudprivateipconfigs" audit-prod-rosa-x6fzw-2024.02.10T08.00_0800-2024.02.10T08.00_0800.log.txt | wc -l
             0
      
      Note: the date range of the audit logs does not overlap with the current CNCC logs, but the issue already existed when the audit logs were taken.
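To see whether the requeue errors are concentrated on particular egress IPs, the CNCC log lines can be tallied per CloudPrivateIPConfig name. The following is only a minimal sketch: the regular expression is derived from the single sample line quoted above and may not match other CNCC message variants, and the function name `count_timeouts` is a hypothetical helper, not part of any OpenShift tooling.

```python
import re
from collections import Counter

# Pattern taken from the one sample error line in this report (assumption:
# other timeout messages from the controller follow the same shape).
REQUEUE_RE = re.compile(
    r'/cloudprivateipconfigs/(?P<ip>[0-9a-fA-F.:]+)/status": context deadline exceeded'
)


def count_timeouts(lines):
    """Count 'context deadline exceeded' status updates per CloudPrivateIPConfig name."""
    counts = Counter()
    for line in lines:
        match = REQUEUE_RE.search(line)
        if match:
            counts[match.group("ip")] += 1
    return counts
```

Passing an open log file to `count_timeouts` and printing `most_common()` on the result would list the affected IPs by frequency, which could then be compared against the availability zone of each hosting node.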

       

      I am opening this tonight, but we have also just thought of another reason why this might be occurring.
      
      There is a firewall involved, so we are wondering whether cross-AZ traffic between the ROSA node EC2 instances is routed via the managed firewall. They had a catastrophic firewall failure, which required recovering without a backup.
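One way to test the firewall hypothesis is a plain TCP reachability probe run from a host in each availability zone against the internal API endpoint or a node in another zone. The sketch below uses only the Python standard library; `tcp_reachable` is a hypothetical helper, and the hosts and ports to probe are for the customer environment to supply. A successful TCP connect does not prove the firewall passes all required traffic, only that this one path is open.

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

If probes within one availability zone succeed while cross-AZ probes time out, that would be consistent with traffic being dropped at the managed firewall.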
      
      Please do not put too much effort into this today, as we will check this with the customer tomorrow, APAC time.

       

          People

            bbennett@redhat.com Ben Bennett
            rhn-support-dsquirre David Squirrell
            Jean Chen Jean Chen