OpenShift Bugs: OCPBUGS-70337

OCP4.18.30 - EgressIP readiness flapping causing continuous failover/interrupts

      Description of problem:

      A platform upgrade from 4.17.45 to 4.18.30 completed successfully. The planned step-through to 4.19.z was paused when it was discovered that egressIP traffic is being interrupted, with significant traffic loss.
      
      Observed that the egressIP is continually being reassigned to new peer host nodes and never stabilizes. During each failover, traffic is interrupted as the assignment moves to a neighbor node.
      
      The original configuration was 7 nodes with egress-assignable capacity, using secondary VRF networks (bond1-3045@bond1) to host the target IPs. 2 IPs were allocated to one EgressIP object: egress-qa.
      
      This object has been pared down to 1 egressIP to simplify testing, and it has been observed that all labeled nodes are still cycled through.
      
      Observed that when we remove the labels from all nodes and label only one host, we have variable connectivity upstream: some nodes can make new connections but fail intermittently, while others fail continually and connect only intermittently. No node is "stable".
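The relabeling described above can be done with the standard OVN-Kubernetes egress-assignable node label; a sketch (the node name is a placeholder for one of the customer's workers):

```shell
# Remove the egress-assignable label from every node, then label a single host.
# "worker-0" is a placeholder node name.
oc label node --all k8s.ovn.org/egress-assignable-
oc label node worker-0 k8s.ovn.org/egress-assignable=""

# Watch the assignment; with one labeled node the ASSIGNED NODE column
# should stay pinned rather than cycling.
oc get egressip -w
```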
      
      This is in stark contrast to 4.17, where this behavior was not present.
      
      Customer performed the workaround in KCS https://access.redhat.com/solutions/7125049: removed all egressIPs, performed an ovnkube DB rebuild on all nodes, and re-added the egressIPs. The behavior reappears immediately.
      
      
      `reachabilityTotalTimeoutSeconds: 1` has been set so that the failure condition surfaces frequently; the egressIP continues to bounce between nodes roughly every 4 seconds.
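For reference, this timeout lives on the cluster Network operator config; a minimal sketch of the relevant fragment, assuming the standard OVN-Kubernetes egressIPConfig field path (the value shown is the one used for testing here):

```yaml
# Sketch of the operator config used to shorten the reachability probe timeout
# (object name "cluster"); only the relevant fields are shown.
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  defaultNetwork:
    type: OVNKubernetes
    ovnKubernetesConfig:
      egressIPConfig:
        # A 1s timeout makes reachability failures (and the resulting
        # reassignment) surface quickly for debugging.
        reachabilityTotalTimeoutSeconds: 1
```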

       

      Version-Release number of selected component (if applicable):

      4.18.30

       

      How reproducible:

      • Every time within customer environment (high impact)
      • Working on internal replicator now

      Steps to Reproduce:

      1. Deploy 4.18.30 cluster (baremetal observed, unclear if platform specific)

      2. Create secondary interface using nncp to handle egressIP traffic

      3. Create egressIP object 

      4. label namespace with matching label

      5. Create an httpd test pod to curl from; observe intermittent connectivity failures reaching outbound, and observe the egressIP allocation swapping between nodes.
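A minimal sketch of the EgressIP object and namespace selector from steps 3-4 (the CRD fields are the standard OVN-Kubernetes ones; the IP, label key, and label value are hypothetical placeholders, not the customer's values):

```yaml
# Hypothetical EgressIP object matching the reproduction steps.
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egress-qa
spec:
  egressIPs:
    - 192.0.2.10          # placeholder target IP on the secondary network
  namespaceSelector:
    matchLabels:
      env: egress-qa      # placeholder label applied to the test namespace
```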

      GRPC curls from node to node at port 9 give the expected "connection refused", but this hasn't been exhaustively tested.

      Actual results:

      • EgressIP in a flapping state; pod traffic is interrupted during the continual migrations

      Expected results:

      • EgressIP should be stable/viable on 4.18.30

      Additional info:

      The bond1 VRF interface appears to be flapping on these nodes. It is configured via the nmstate operator and appears stable according to the NNCP object and the nmstate handler pod; however, we see the interface repeatedly being re-upped and its MAC address cycling. I expect this can and does lead to unstable network conditions.
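To confirm the re-up/MAC cycling from the node side, something like the following could be used (standard iproute2, journalctl, and nmstatectl tooling; the interface names are the ones from this report):

```shell
# Stream link-state changes for the VLAN interface in real time;
# repeated DOWN/UP transitions or MAC changes here would match the
# observed flapping.
ip monitor link dev bond1-3045

# NetworkManager's view of the same events.
journalctl -u NetworkManager -f | grep -E 'bond1-3045|bond1'

# Compare the nmstate-applied state against what is actually on the node.
nmstatectl show bond1-3045
```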

       

      It is unclear why this is occurring; see the first comment below for specific data segments and log snips.

              bbennett@redhat.com Ben Bennett
              rhn-support-wrussell Will Russell
              Anurag Saxena