OpenShift Bugs / OCPBUGS-16657

OCP 4.12.22 - Unicast split brain behavior on Keepalived (missing unicast peer entries)

    • Moderate
    • No
    • False

      Description of problem:

      Customer upgraded an IPI cluster from 4.12.21 to 4.12.22.
      After the upgrade, access to the console, auth, and routes on the cluster was degraded.
      
      Keepalived on two nodes was broadcasting ownership of the ingress VIP address. On both affected nodes, the unicast source (unicast_src_ip) and unicast peer (unicast_peer) entries were missing from /etc/keepalived/keepalived.conf.
      
      After manually editing the file to include the source and peer entries and restarting the pods on the affected hosts, the duplicate GARPs stopped, the address collisions ended, and resolution to the console and routes was restored.
      
      VRRP traffic did not appear to be blocked: once the entries were added, each node checked in, recognized that it was part of the cluster for IP failover, and stopped broadcasting. It is odd, however, that these entries were not populated automatically on node boot.
      
      (The customer had previously restarted all nodes in the cluster to mitigate the issue, which implies that the entries were not regenerated on restart.)
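
The check described above (looking for missing unicast source and peer entries in the rendered file) can be sketched in Python. This is a minimal illustration, not the tooling used on the case: the sample configuration, the instance name, the interface name, and the addresses are all hypothetical, and the parser only looks for the `unicast_src_ip` and `unicast_peer` keywords inside each `vrrp_instance` block.

```python
import re

# Hypothetical, sanitized excerpt. Files rendered on real cluster nodes will differ.
SAMPLE_CONF = """
vrrp_instance example_INGRESS {
    state BACKUP
    interface br-ex
    virtual_router_id 10
    unicast_src_ip 192.0.2.11
    unicast_peer {
        192.0.2.12
        192.0.2.13
    }
    virtual_ipaddress {
        192.0.2.100/24
    }
}
"""


def find_instances_missing_unicast(conf_text):
    """Return {instance_name: [missing keywords]} for each vrrp_instance block."""
    missing = {}
    # re.split with a capturing group yields [preamble, name1, body1, name2, body2, ...].
    # Each body runs until the next vrrp_instance or the end of the text.
    parts = re.split(r"^\s*vrrp_instance\s+(\S+)", conf_text, flags=re.MULTILINE)
    for name, body in zip(parts[1::2], parts[2::2]):
        absent = []
        if not re.search(r"^\s*unicast_src_ip\s+\S+", body, re.MULTILINE):
            absent.append("unicast_src_ip")
        peer = re.search(r"^\s*unicast_peer\s*\{([^}]*)\}", body, re.MULTILINE)
        # An empty peer block is treated the same as a missing one.
        if peer is None or not peer.group(1).split():
            absent.append("unicast_peer")
        if absent:
            missing[name] = absent
    return missing


if __name__ == "__main__":
    print(find_instances_missing_unicast(SAMPLE_CONF))
```

On a node, the text passed to `find_instances_missing_unicast` would be the contents of /etc/keepalived/keepalived.conf. An empty result means every instance carries both entries.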

      Version-Release number of selected component (if applicable):

      4.12.22

      How reproducible:

      Observed one time. The issue was ongoing until we manually edited /etc/keepalived/keepalived.conf on the two hosts that were missing the entries (a worker and a storage node).

      Steps to Reproduce:

      1. Observe that the console degrades after multiple calls to the address path.
      2. Using the KCS checks in https://access.redhat.com/solutions/7013445, observe that ARP for the ingress VIP is being sent from the keepalived pods on multiple hosts.
      3. Observe that the secondary hosts claiming the VIP do not have a valid router pod, so packets are dropped or rejected on ports 443/80, which leads to the degraded state.
      4. Confirm that /etc/keepalived/keepalived.conf is missing the unicast peer/src entries on the affected nodes.
      5. Modify the entries to match working peers.
      6. Observe that the issue is resolved.
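
Step 2 above amounts to spotting one IP address being announced by more than one sender. A minimal Python sketch of that check follows. It assumes the (sender MAC, announced IP) pairs have already been extracted from a packet capture by some other means. The function name, MAC addresses, and IP addresses are hypothetical. During a legitimate failover two senders may briefly announce the same VIP, so the observations should come from a short, steady-state window.

```python
from collections import defaultdict


def find_duplicate_vip_owners(observations):
    """Return {ip: sorted MACs} for every IP announced by more than one MAC.

    observations: iterable of (sender_mac, announced_ip) pairs taken from a capture.
    """
    owners = defaultdict(set)
    for mac, ip in observations:
        owners[ip].add(mac.lower())  # normalize case so one sender is counted once
    return {ip: sorted(macs) for ip, macs in owners.items() if len(macs) > 1}


# Hypothetical observations; real values come from the capture on the affected network.
observed = [
    ("52:54:00:aa:00:01", "192.0.2.100"),
    ("52:54:00:aa:00:02", "192.0.2.100"),
    ("52:54:00:AA:00:01", "192.0.2.100"),
    ("52:54:00:aa:00:03", "192.0.2.101"),
]

if __name__ == "__main__":
    for ip, macs in find_duplicate_vip_owners(observed).items():
        print(f"{ip} is announced by {len(macs)} senders: {', '.join(macs)}")
```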

      Actual results:

      Cluster degraded until manual intervention

      Expected results:

      The cluster should populate the unicast peer entries for keepalived during cluster updates without manual intervention.
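
As a rough illustration of the expected end state, the Python sketch below derives the entries each node would carry in unicast mode: its own address as the source, and every other participating node as a peer. This is not how OpenShift renders the file. The function name and addresses are hypothetical, and which nodes participate in the ingress VIP group on a real cluster is decided by the platform.

```python
import ipaddress


def expected_unicast_entries(node_ips):
    """Map each node IP to the unicast_src_ip / unicast_peer values it should carry."""
    # Deduplicate and sort numerically; assumes all addresses share one IP family.
    unique = sorted(set(node_ips), key=ipaddress.ip_address)
    return {
        ip: {
            "unicast_src_ip": ip,
            # Peers are all other participating nodes, never the node itself.
            "unicast_peer": [peer for peer in unique if peer != ip],
        }
        for ip in unique
    }


if __name__ == "__main__":
    # Hypothetical node addresses from the documentation range.
    for ip, entry in expected_unicast_entries(
        ["192.0.2.12", "192.0.2.11", "192.0.2.13"]
    ).items():
        print(ip, entry)
```

Comparing this expected mapping against what is actually present on each node would show which hosts are missing entries, as the two affected nodes were here.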

      Additional info:

      This is a linked stateside support case, which means all uploads will be sanitized before submission (no IP addresses or hostnames). A sosreport from an affected node has been requested for analysis, and a sanitized must-gather is available for review.

              bnemec@redhat.com Benjamin Nemec
              rhn-support-wrussell Will Russell
              Zhanqi Zhao Zhanqi Zhao