OCPBUGS-65649

Frequent VIP movement in openshift-kni-infra caused by keepalived track scripts being killed


      Description of problem:

          Frequent VIP movement in openshift-kni-infra caused by keepalived track scripts being killed

      Version-Release number of selected component (if applicable):

          4.18.26

      How reproducible:

          100% in the customer environment

      Steps to Reproduce:

      The customer is running OCP 4.18.26 on bare metal (IPI/Assisted). The ingress Keepalived appears to be slow to execute its track scripts, or slow to update their status, so each `track script` is frequently killed, which in turn causes VIP movement.
      
      2025-11-12T02:56:22.596696183Z Wed Nov 12 02:56:22 2025: (clu001_INGRESS_0) Entering BACKUP STATE 
      2025-11-12T02:56:22.596747258Z Wed Nov 12 02:56:22 2025: clu001_INGRESS_0: sending gratuitous ARP for 10.220.16.31 
      2025-11-12T02:56:22.596747258Z Wed Nov 12 02:56:22 2025: Sending gratuitous ARP on br-ex for 10.220.16.31 
      2025-11-12T02:56:26.284534037Z Wed Nov 12 02:56:26 2025: (clu001_INGRESS_0) Receive advertisement timeout 
      2025-11-12T02:56:26.285943161Z Wed Nov 12 02:56:26 2025: (clu001_INGRESS_0) Entering MASTER STATE 
      2025-11-12T02:56:26.285971774Z Wed Nov 12 02:56:26 2025: (clu001_INGRESS_0) setting VIPs. 
      2025-11-12T02:56:26.286250884Z Wed Nov 12 02:56:26 2025: (clu001_INGRESS_0) Sending/queueing gratuitous ARPs on br-ex for 10.220.17.1 
      2025-11-12T02:56:26.286276602Z Wed Nov 12 02:56:26 2025: Sending gratuitous ARP on br-ex for 10.220.17.1 
      2025-11-12T02:56:26.286375556Z Wed Nov 12 02:56:26 2025: Sending gratuitous ARP on br-ex for 10.220.17.1 
      2025-11-12T02:56:26.286398148Z Wed Nov 12 02:56:26 2025: Sending gratuitous ARP on br-ex for 10.220.17.1 
      2025-11-12T02:56:26.286423876Z Wed Nov 12 02:56:26 2025: Sending gratuitous ARP on br-ex for 10.220.17.1 
      2025-11-12T02:56:26.286447790Z Wed Nov 12 02:56:26 2025: Sending gratuitous ARP on br-ex for 10.220.17.1 
      2025-11-12T02:56:26.560868788Z Wed Nov 12 02:56:26 2025: Track script chk_ingress_ready is already running, expect idle - skipping run 
      2025-11-12T02:56:26.585841792Z Wed Nov 12 02:56:26 2025: Track script chk_ingress is already running, expect idle - skipping run 
      2025-11-12T02:56:27.560961416Z Wed Nov 12 02:56:27 2025: Track script chk_ingress_ready is being timed out, expect idle - skipping run 
      2025-11-12T02:56:27.585894626Z Wed Nov 12 02:56:27 2025: Track script chk_ingress is being timed out, expect idle - skipping run 
      2025-11-12T02:56:28.560982821Z Wed Nov 12 02:56:28 2025: Track script chk_ingress_ready is being timed out, expect idle - skipping run 
      2025-11-12T02:56:28.585894090Z Wed Nov 12 02:56:28 2025: Track script chk_ingress is being timed out, expect idle - skipping run 
      2025-11-12T02:56:28.985570543Z Wed Nov 12 02:56:28 2025: VRRP_Script(chk_ingress_ready) failed (due to signal 15) 
      2025-11-12T02:56:28.986482432Z Wed Nov 12 02:56:28 2025: (clu001_INGRESS_0) Changing effective priority from 80 to 70 
      2025-11-12T02:56:28.986855065Z Wed Nov 12 02:56:28 2025: VRRP_Script(chk_ingress) failed (due to signal 15) 
      2025-11-12T02:56:28.986872658Z Wed Nov 12 02:56:28 2025: (clu001_INGRESS_0) Entering FAULT STATE 
      2025-11-12T02:56:28.988800889Z Wed Nov 12 02:56:28 2025: (clu001_INGRESS_0) sent 0 priority 
      2025-11-12T02:56:28.988825024Z Wed Nov 12 02:56:28 2025: (clu001_INGRESS_0) removing VIPs. 
      2025-11-12T02:56:29.681662772Z Wed Nov 12 02:56:29 2025: Printing parent data for process(19) on signal 
      2025-11-12T02:56:29.681662772Z Wed Nov 12 02:56:29 2025: Printing VRRP data for process(22) on signal 
      2025-11-12T02:56:30.572348346Z Wed Nov 12 02:56:30 2025: VRRP_Script(chk_ingress_ready) succeeded 
      2025-11-12T02:56:30.572409219Z Wed Nov 12 02:56:30 2025: (clu001_INGRESS_0) Changing effective priority from 70 to 80 
      2025-11-12T02:56:30.597224480Z Wed Nov 12 02:56:30 2025: VRRP_Script(chk_ingress) succeeded 
      2025-11-12T02:56:30.597262571Z Wed Nov 12 02:56:30 2025: (clu001_INGRESS_0) Entering BACKUP STATE 
      2025-11-12T02:56:30.597288489Z Wed Nov 12 02:56:30 2025: clu001_INGRESS_0: sending gratuitous ARP for 10.220.16.31 
      2025-11-12T02:56:30.597312904Z Wed Nov 12 02:56:30 2025: Sending gratuitous ARP on br-ex for 10.220.16.31 
      2025-11-12T02:56:36.302598089Z Wed Nov 12 02:56:36 2025: (clu001_INGRESS_0) Backup received priority 0 advertisement 
      2025-11-12T02:56:36.302889281Z Wed Nov 12 02:56:36 2025: (clu001_INGRESS_0) Backup received priority 0 advertisement 
      2025-11-12T02:56:36.990563133Z Wed Nov 12 02:56:36 2025: (clu001_INGRESS_0) Receive advertisement timeout 
      2025-11-12T02:56:36.991663362Z Wed Nov 12 02:56:36 2025: (clu001_INGRESS_0) Entering MASTER STATE 
      2025-11-12T02:56:36.991703136Z Wed Nov 12 02:56:36 2025: (clu001_INGRESS_0) setting VIPs. 
      2025-11-12T02:56:36.991996502Z Wed Nov 12 02:56:36 2025: (clu001_INGRESS_0) Sending/queueing gratuitous ARPs on br-ex for 10.220.17.1 
      2025-11-12T02:56:36.992031537Z Wed Nov 12 02:56:36 2025: Sending gratuitous ARP on br-ex for 10.220.17.1 
      2025-11-12T02:56:36.992253310Z Wed Nov 12 02:56:36 2025: Sending gratuitous ARP on br-ex for 10.220.17.1 
      2025-11-12T02:56:36.992360600Z Wed Nov 12 02:56:36 2025: Sending gratuitous ARP on br-ex for 10.220.17.1 
      2025-11-12T02:56:36.992422085Z Wed Nov 12 02:56:36 2025: Sending gratuitous ARP on br-ex for 10.220.17.1
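
To make the flapping easier to quantify, the log above can be reduced to its state transitions and script kills. The following is a minimal sketch, not part of the original report: the regular expressions, the `summarize` helper, and the `SAMPLE` excerpt (six lines copied from the log above) are illustrative assumptions about the log format shown here.

```python
import re
from collections import Counter

# Patterns assumed from the log excerpt in this report.
STATE_RE = re.compile(r"\((?P<instance>[^)]+)\) Entering (?P<state>[A-Z]+) STATE")
FAIL_RE = re.compile(
    r"VRRP_Script\((?P<script>[^)]+)\) failed \(due to signal (?P<signal>\d+)\)"
)

# Six lines copied from the log in this report.
SAMPLE = """\
2025-11-12T02:56:26.285943161Z Wed Nov 12 02:56:26 2025: (clu001_INGRESS_0) Entering MASTER STATE
2025-11-12T02:56:28.985570543Z Wed Nov 12 02:56:28 2025: VRRP_Script(chk_ingress_ready) failed (due to signal 15)
2025-11-12T02:56:28.986855065Z Wed Nov 12 02:56:28 2025: VRRP_Script(chk_ingress) failed (due to signal 15)
2025-11-12T02:56:28.986872658Z Wed Nov 12 02:56:28 2025: (clu001_INGRESS_0) Entering FAULT STATE
2025-11-12T02:56:30.597262571Z Wed Nov 12 02:56:30 2025: (clu001_INGRESS_0) Entering BACKUP STATE
2025-11-12T02:56:36.991663362Z Wed Nov 12 02:56:36 2025: (clu001_INGRESS_0) Entering MASTER STATE
"""


def summarize(lines):
    """Return (state transitions, per-script kill counts) for keepalived log lines."""
    states = []         # (timestamp, instance, state) in log order
    killed = Counter()  # (script name, signal number) -> count
    for line in lines:
        match = STATE_RE.search(line)
        if match:
            timestamp = line.split(" ", 1)[0]
            states.append((timestamp, match["instance"], match["state"]))
            continue
        match = FAIL_RE.search(line)
        if match:
            killed[(match["script"], int(match["signal"]))] += 1
    return states, killed


if __name__ == "__main__":
    states, killed = summarize(SAMPLE.splitlines())
    for timestamp, instance, state in states:
        print(timestamp, instance, state)
    for (script, sig), count in killed.items():
        print(f"{script} killed by signal {sig}: {count} time(s)")
```

Running the same helper over a full keepalived container log (for example, `summarize(open(path))` with a path of your choosing) would show how often the instance leaves MASTER and which track script is killed each time.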

       

      Actual results:

          Keepalived reports that the track scripts are still running, but they do not appear to be.
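
The `failed (due to signal 15)` messages in the log are consistent with the check processes being terminated with SIGTERM after exceeding a timeout. The sketch below only illustrates that general mechanism on a POSIX system; it is not Keepalived's implementation, and `run_check`, the 0.5-second timeout, and the sleeping child process are hypothetical stand-ins.

```python
import signal
import subprocess
import sys


def run_check(cmd, timeout_s):
    """Run a check command; terminate it with SIGTERM if it overruns the timeout.

    Returns the child's return code. On POSIX, a child killed by a signal
    has a negative return code equal to minus the signal number.
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.terminate()  # sends SIGTERM (signal 15) on POSIX
        return proc.wait()


if __name__ == "__main__":
    # Hypothetical slow check: a child process that sleeps past the timeout.
    slow_check = [sys.executable, "-c", "import time; time.sleep(5)"]
    rc = run_check(slow_check, timeout_s=0.5)
    print("return code:", rc)
    print("killed by SIGTERM:", rc == -signal.SIGTERM)
```

A check that is healthy but slow is indistinguishable, from the supervisor's side, from one that is hung: both end with signal 15 and are counted as a failure.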

      Expected results:

      If Keepalived gets an accurate script status, the VIPs should stabilize on the nodes, which would in turn stabilize LDAP login for the HCP clusters.

      Additional info:

          The customer is facing intermittent LDAP login issues on all of their HCP clusters. Availability of the management cluster's Keepalived ingress VIP is crucial for LDAP logins to the HCP clusters to always succeed. We observed that the Keepalived health check scripts time out or fail intermittently, and that HCP cluster logins fail at the same time.

      The issue matches what was observed in the following Jira tickets:

      https://issues.redhat.com/browse/OCPBUGS-61384 
      https://issues.redhat.com/browse/OCPBUGS-60021

              bnemec@redhat.com Benjamin Nemec
              rhn-support-dpateriy Divyam Pateriya
              Ross Brattain