-
Bug
-
Resolution: Can't Do
-
Normal
-
None
-
4.18.z
-
None
-
Incidents & Support
-
False
Moderate
Description of problem:
Frequent ingress VIP movement in openshift-kni-infra, caused by keepalived track scripts being killed.
Version-Release number of selected component (if applicable):
4.18.26
How reproducible:
100% in customer env
Steps to Reproduce:
The customer is running OCP 4.18.26 on bare metal (IPI/Assisted). The ingress Keepalived instance appears to be slow either in executing its track scripts or in updating their status. As a result, the `track script` processes are frequently killed, which in turn causes VIP movement.
2025-11-12T02:56:22.596696183Z Wed Nov 12 02:56:22 2025: (clu001_INGRESS_0) Entering BACKUP STATE
2025-11-12T02:56:22.596747258Z Wed Nov 12 02:56:22 2025: clu001_INGRESS_0: sending gratuitous ARP for 10.220.16.31
2025-11-12T02:56:22.596747258Z Wed Nov 12 02:56:22 2025: Sending gratuitous ARP on br-ex for 10.220.16.31
2025-11-12T02:56:26.284534037Z Wed Nov 12 02:56:26 2025: (clu001_INGRESS_0) Receive advertisement timeout
2025-11-12T02:56:26.285943161Z Wed Nov 12 02:56:26 2025: (clu001_INGRESS_0) Entering MASTER STATE
2025-11-12T02:56:26.285971774Z Wed Nov 12 02:56:26 2025: (clu001_INGRESS_0) setting VIPs.
2025-11-12T02:56:26.286250884Z Wed Nov 12 02:56:26 2025: (clu001_INGRESS_0) Sending/queueing gratuitous ARPs on br-ex for 10.220.17.1
2025-11-12T02:56:26.286276602Z Wed Nov 12 02:56:26 2025: Sending gratuitous ARP on br-ex for 10.220.17.1
2025-11-12T02:56:26.286375556Z Wed Nov 12 02:56:26 2025: Sending gratuitous ARP on br-ex for 10.220.17.1
2025-11-12T02:56:26.286398148Z Wed Nov 12 02:56:26 2025: Sending gratuitous ARP on br-ex for 10.220.17.1
2025-11-12T02:56:26.286423876Z Wed Nov 12 02:56:26 2025: Sending gratuitous ARP on br-ex for 10.220.17.1
2025-11-12T02:56:26.286447790Z Wed Nov 12 02:56:26 2025: Sending gratuitous ARP on br-ex for 10.220.17.1
2025-11-12T02:56:26.560868788Z Wed Nov 12 02:56:26 2025: Track script chk_ingress_ready is already running, expect idle - skipping run
2025-11-12T02:56:26.585841792Z Wed Nov 12 02:56:26 2025: Track script chk_ingress is already running, expect idle - skipping run
2025-11-12T02:56:27.560961416Z Wed Nov 12 02:56:27 2025: Track script chk_ingress_ready is being timed out, expect idle - skipping run
2025-11-12T02:56:27.585894626Z Wed Nov 12 02:56:27 2025: Track script chk_ingress is being timed out, expect idle - skipping run
2025-11-12T02:56:28.560982821Z Wed Nov 12 02:56:28 2025: Track script chk_ingress_ready is being timed out, expect idle - skipping run
2025-11-12T02:56:28.585894090Z Wed Nov 12 02:56:28 2025: Track script chk_ingress is being timed out, expect idle - skipping run
2025-11-12T02:56:28.985570543Z Wed Nov 12 02:56:28 2025: VRRP_Script(chk_ingress_ready) failed (due to signal 15)
2025-11-12T02:56:28.986482432Z Wed Nov 12 02:56:28 2025: (clu001_INGRESS_0) Changing effective priority from 80 to 70
2025-11-12T02:56:28.986855065Z Wed Nov 12 02:56:28 2025: VRRP_Script(chk_ingress) failed (due to signal 15)
2025-11-12T02:56:28.986872658Z Wed Nov 12 02:56:28 2025: (clu001_INGRESS_0) Entering FAULT STATE
2025-11-12T02:56:28.988800889Z Wed Nov 12 02:56:28 2025: (clu001_INGRESS_0) sent 0 priority
2025-11-12T02:56:28.988825024Z Wed Nov 12 02:56:28 2025: (clu001_INGRESS_0) removing VIPs.
2025-11-12T02:56:29.681662772Z Wed Nov 12 02:56:29 2025: Printing parent data for process(19) on signal
2025-11-12T02:56:29.681662772Z Wed Nov 12 02:56:29 2025: Printing VRRP data for process(22) on signal
2025-11-12T02:56:30.572348346Z Wed Nov 12 02:56:30 2025: VRRP_Script(chk_ingress_ready) succeeded
2025-11-12T02:56:30.572409219Z Wed Nov 12 02:56:30 2025: (clu001_INGRESS_0) Changing effective priority from 70 to 80
2025-11-12T02:56:30.597224480Z Wed Nov 12 02:56:30 2025: VRRP_Script(chk_ingress) succeeded
2025-11-12T02:56:30.597262571Z Wed Nov 12 02:56:30 2025: (clu001_INGRESS_0) Entering BACKUP STATE
2025-11-12T02:56:30.597288489Z Wed Nov 12 02:56:30 2025: clu001_INGRESS_0: sending gratuitous ARP for 10.220.16.31
2025-11-12T02:56:30.597312904Z Wed Nov 12 02:56:30 2025: Sending gratuitous ARP on br-ex for 10.220.16.31
2025-11-12T02:56:36.302598089Z Wed Nov 12 02:56:36 2025: (clu001_INGRESS_0) Backup received priority 0 advertisement
2025-11-12T02:56:36.302889281Z Wed Nov 12 02:56:36 2025: (clu001_INGRESS_0) Backup received priority 0 advertisement
2025-11-12T02:56:36.990563133Z Wed Nov 12 02:56:36 2025: (clu001_INGRESS_0) Receive advertisement timeout
2025-11-12T02:56:36.991663362Z Wed Nov 12 02:56:36 2025: (clu001_INGRESS_0) Entering MASTER STATE
2025-11-12T02:56:36.991703136Z Wed Nov 12 02:56:36 2025: (clu001_INGRESS_0) setting VIPs.
2025-11-12T02:56:36.991996502Z Wed Nov 12 02:56:36 2025: (clu001_INGRESS_0) Sending/queueing gratuitous ARPs on br-ex for 10.220.17.1
2025-11-12T02:56:36.992031537Z Wed Nov 12 02:56:36 2025: Sending gratuitous ARP on br-ex for 10.220.17.1
2025-11-12T02:56:36.992253310Z Wed Nov 12 02:56:36 2025: Sending gratuitous ARP on br-ex for 10.220.17.1
2025-11-12T02:56:36.992360600Z Wed Nov 12 02:56:36 2025: Sending gratuitous ARP on br-ex for 10.220.17.1
2025-11-12T02:56:36.992422085Z Wed Nov 12 02:56:36 2025: Sending gratuitous ARP on br-ex for 10.220.17.1
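To make the flapping easier to quantify across a longer capture, the log could be summarized with a small script. The following is only a sketch: the function names `summarize` and `master_gaps` are hypothetical, the sample is a handful of entries copied from the log above, and the regular expressions assume the exact message wording shown there. The gap it reports is the time this node spent outside MASTER, which is not necessarily the time the VIP was unreachable cluster-wide.

```python
import re
from datetime import datetime

# A few entries copied from the keepalived log above, deliberately left
# flattened (no newlines) to show the parser does not rely on line breaks.
SAMPLE = (
    "2025-11-12T02:56:26.285943161Z Wed Nov 12 02:56:26 2025: (clu001_INGRESS_0) Entering MASTER STATE "
    "2025-11-12T02:56:28.985570543Z Wed Nov 12 02:56:28 2025: VRRP_Script(chk_ingress_ready) failed (due to signal 15) "
    "2025-11-12T02:56:28.986872658Z Wed Nov 12 02:56:28 2025: (clu001_INGRESS_0) Entering FAULT STATE "
    "2025-11-12T02:56:30.597262571Z Wed Nov 12 02:56:30 2025: (clu001_INGRESS_0) Entering BACKUP STATE "
    "2025-11-12T02:56:36.991663362Z Wed Nov 12 02:56:36 2025: (clu001_INGRESS_0) Entering MASTER STATE"
)

# One entry = container timestamp + everything up to the next timestamp.
ENTRY = re.compile(
    r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.\d+Z\s+(.*?)(?=\s*\d{4}-\d{2}-\d{2}T|\Z)",
    re.S,
)
STATE = re.compile(r"\((\S+)\) Entering (\w+) STATE")
KILLED = re.compile(r"VRRP_Script\((\w+)\) failed \(due to signal (\d+)\)")


def summarize(text):
    """Return (time, kind, value) tuples for state changes and killed scripts."""
    events = []
    for ts, msg in ENTRY.findall(text):
        when = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S")  # second resolution
        state = STATE.search(msg)
        if state:
            events.append((when, "state", state.group(2)))
        killed = KILLED.search(msg)
        if killed:
            events.append((when, "killed", killed.group(1)))
    return events


def master_gaps(events):
    """Seconds between this node leaving MASTER and entering MASTER again."""
    gaps, left_at, in_master = [], None, False
    for when, kind, value in events:
        if kind != "state":
            continue
        if value == "MASTER":
            if left_at is not None:
                gaps.append((when - left_at).total_seconds())
                left_at = None
            in_master = True
        elif in_master:
            left_at, in_master = when, False
    return gaps


events = summarize(SAMPLE)
for when, kind, value in events:
    print(when.time(), kind, value)
print("seconds outside MASTER:", master_gaps(events))
```

Run against a full keepalived container log, the same two functions would give a count of killed track scripts and the duration of each interval in which the node did not hold the VIP.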
Actual results:
Keepalived reports that the track scripts are still running, but they do not appear to be.
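For context on the "failed (due to signal 15)" entries: they are consistent with a check process being terminated with SIGTERM after exceeding its deadline. The sketch below only illustrates that general mechanism on a POSIX system; it is not keepalived's implementation, and `run_check`, the 0.5 second timeout, and the sleeping child process are all hypothetical stand-ins.

```python
import signal
import subprocess
import sys


def run_check(cmd, timeout_s):
    """Run cmd; if it exceeds timeout_s, send SIGTERM and return its exit status."""
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.terminate()  # sends SIGTERM (signal 15) on POSIX
        return proc.wait()


# Hypothetical slow health check: a child that sleeps well past the timeout.
slow_check = [sys.executable, "-c", "import time; time.sleep(30)"]
status = run_check(slow_check, timeout_s=0.5)

# On POSIX, a negative return code -N means the child was killed by signal N.
print("exit status:", status, "| SIGTERM is", int(signal.SIGTERM))
```

A check that is healthy but slow is indistinguishable here from one that is hung, which is why slow script execution alone could be enough to produce the failures seen in the log.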
Expected results:
If keepalived obtains the correct track script status, the VIPs should remain stable on their nodes, which in turn should stabilize LDAP logins to the HCP clusters.
Additional info:
The customer is facing intermittent LDAP login failures on all of their HCP clusters. Availability of the management cluster's Keepalived ingress VIP is crucial for LDAP logins to the HCP clusters to succeed consistently. We observed that the keepalived health check scripts time out or fail intermittently, and that HCP cluster logins fail at the same time.
This issue matches what was observed in the following Jira tickets:
https://issues.redhat.com/browse/OCPBUGS-61384
https://issues.redhat.com/browse/OCPBUGS-60021