Type: Bug
Resolution: Can't Do
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.18.z
Component/s: Networking / On-Prem Host Networking
Labels: None

Activity Type: Incidents & Support
Blocked: False
Blocked Reason: None
Story Points: None
Severity: Moderate
Regression: None

Target Backport Versions: None
Target Version: None
Release Blocker: None
Sprint: None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Review Complete:
PX Impact Score:

Release Note Status: None
Release Note Type: None
Release Note Text: None

Escape Reason: None
Escape Impact: None
Corrective Measures: None
SDLC stage when should've been found: None

Description of problem:

    Frequent VIP movement in openshift-kni-infra, caused by Keepalived track scripts being killed

Version-Release number of selected component (if applicable):

    4.18.26

How reproducible:

    100% in the customer environment

Steps to Reproduce:

The customer is running OCP 4.18.26 on bare metal (IPI/Assisted). The ingress Keepalived appears to execute its track scripts slowly or to update their status slowly, so the `track script` processes are killed frequently, which in turn causes VIP movement.


2025-11-12T02:56:22.596696183Z Wed Nov 12 02:56:22 2025: (clu001_INGRESS_0) Entering BACKUP STATE 
2025-11-12T02:56:22.596747258Z Wed Nov 12 02:56:22 2025: clu001_INGRESS_0: sending gratuitous ARP for 10.220.16.31 
2025-11-12T02:56:22.596747258Z Wed Nov 12 02:56:22 2025: Sending gratuitous ARP on br-ex for 10.220.16.31 
2025-11-12T02:56:26.284534037Z Wed Nov 12 02:56:26 2025: (clu001_INGRESS_0) Receive advertisement timeout 
2025-11-12T02:56:26.285943161Z Wed Nov 12 02:56:26 2025: (clu001_INGRESS_0) Entering MASTER STATE 
2025-11-12T02:56:26.285971774Z Wed Nov 12 02:56:26 2025: (clu001_INGRESS_0) setting VIPs. 
2025-11-12T02:56:26.286250884Z Wed Nov 12 02:56:26 2025: (clu001_INGRESS_0) Sending/queueing gratuitous ARPs on br-ex for 10.220.17.1 
2025-11-12T02:56:26.286276602Z Wed Nov 12 02:56:26 2025: Sending gratuitous ARP on br-ex for 10.220.17.1 
2025-11-12T02:56:26.286375556Z Wed Nov 12 02:56:26 2025: Sending gratuitous ARP on br-ex for 10.220.17.1 
2025-11-12T02:56:26.286398148Z Wed Nov 12 02:56:26 2025: Sending gratuitous ARP on br-ex for 10.220.17.1 
2025-11-12T02:56:26.286423876Z Wed Nov 12 02:56:26 2025: Sending gratuitous ARP on br-ex for 10.220.17.1 
2025-11-12T02:56:26.286447790Z Wed Nov 12 02:56:26 2025: Sending gratuitous ARP on br-ex for 10.220.17.1 
2025-11-12T02:56:26.560868788Z Wed Nov 12 02:56:26 2025: Track script chk_ingress_ready is already running, expect idle - skipping run 
2025-11-12T02:56:26.585841792Z Wed Nov 12 02:56:26 2025: Track script chk_ingress is already running, expect idle - skipping run 
2025-11-12T02:56:27.560961416Z Wed Nov 12 02:56:27 2025: Track script chk_ingress_ready is being timed out, expect idle - skipping run 
2025-11-12T02:56:27.585894626Z Wed Nov 12 02:56:27 2025: Track script chk_ingress is being timed out, expect idle - skipping run 
2025-11-12T02:56:28.560982821Z Wed Nov 12 02:56:28 2025: Track script chk_ingress_ready is being timed out, expect idle - skipping run 
2025-11-12T02:56:28.585894090Z Wed Nov 12 02:56:28 2025: Track script chk_ingress is being timed out, expect idle - skipping run 
2025-11-12T02:56:28.985570543Z Wed Nov 12 02:56:28 2025: VRRP_Script(chk_ingress_ready) failed (due to signal 15) 
2025-11-12T02:56:28.986482432Z Wed Nov 12 02:56:28 2025: (clu001_INGRESS_0) Changing effective priority from 80 to 70 
2025-11-12T02:56:28.986855065Z Wed Nov 12 02:56:28 2025: VRRP_Script(chk_ingress) failed (due to signal 15) 
2025-11-12T02:56:28.986872658Z Wed Nov 12 02:56:28 2025: (clu001_INGRESS_0) Entering FAULT STATE 
2025-11-12T02:56:28.988800889Z Wed Nov 12 02:56:28 2025: (clu001_INGRESS_0) sent 0 priority 
2025-11-12T02:56:28.988825024Z Wed Nov 12 02:56:28 2025: (clu001_INGRESS_0) removing VIPs. 
2025-11-12T02:56:29.681662772Z Wed Nov 12 02:56:29 2025: Printing parent data for process(19) on signal 
2025-11-12T02:56:29.681662772Z Wed Nov 12 02:56:29 2025: Printing VRRP data for process(22) on signal 
2025-11-12T02:56:30.572348346Z Wed Nov 12 02:56:30 2025: VRRP_Script(chk_ingress_ready) succeeded 
2025-11-12T02:56:30.572409219Z Wed Nov 12 02:56:30 2025: (clu001_INGRESS_0) Changing effective priority from 70 to 80 
2025-11-12T02:56:30.597224480Z Wed Nov 12 02:56:30 2025: VRRP_Script(chk_ingress) succeeded 
2025-11-12T02:56:30.597262571Z Wed Nov 12 02:56:30 2025: (clu001_INGRESS_0) Entering BACKUP STATE 
2025-11-12T02:56:30.597288489Z Wed Nov 12 02:56:30 2025: clu001_INGRESS_0: sending gratuitous ARP for 10.220.16.31 
2025-11-12T02:56:30.597312904Z Wed Nov 12 02:56:30 2025: Sending gratuitous ARP on br-ex for 10.220.16.31 
2025-11-12T02:56:36.302598089Z Wed Nov 12 02:56:36 2025: (clu001_INGRESS_0) Backup received priority 0 advertisement 
2025-11-12T02:56:36.302889281Z Wed Nov 12 02:56:36 2025: (clu001_INGRESS_0) Backup received priority 0 advertisement 
2025-11-12T02:56:36.990563133Z Wed Nov 12 02:56:36 2025: (clu001_INGRESS_0) Receive advertisement timeout 
2025-11-12T02:56:36.991663362Z Wed Nov 12 02:56:36 2025: (clu001_INGRESS_0) Entering MASTER STATE 
2025-11-12T02:56:36.991703136Z Wed Nov 12 02:56:36 2025: (clu001_INGRESS_0) setting VIPs. 
2025-11-12T02:56:36.991996502Z Wed Nov 12 02:56:36 2025: (clu001_INGRESS_0) Sending/queueing gratuitous ARPs on br-ex for 10.220.17.1 
2025-11-12T02:56:36.992031537Z Wed Nov 12 02:56:36 2025: Sending gratuitous ARP on br-ex for 10.220.17.1 
2025-11-12T02:56:36.992253310Z Wed Nov 12 02:56:36 2025: Sending gratuitous ARP on br-ex for 10.220.17.1 
2025-11-12T02:56:36.992360600Z Wed Nov 12 02:56:36 2025: Sending gratuitous ARP on br-ex for 10.220.17.1 
2025-11-12T02:56:36.992422085Z Wed Nov 12 02:56:36 2025: Sending gratuitous ARP on br-ex for 10.220.17.1
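To quantify the flapping in an excerpt like the one above, the state transitions and signal-killed track scripts can be tallied with a short script. This is only a sketch: it assumes the exact line layout shown in this ticket (container timestamp with at least six fractional digits, then the Keepalived date prefix), and the names `summarize`, `master_hold_seconds`, and `SAMPLE` are illustrative helpers, not part of Keepalived or OpenShift tooling.

```python
import re
from collections import Counter
from datetime import datetime

# Container timestamp (nanoseconds truncated to microseconds), then the
# Keepalived date prefix, then the message itself.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{6})\d*Z "
    r"\w{3} \w{3} +\d+ [\d:]+ \d{4}: (.*?)\s*$"
)
STATE_RE = re.compile(r"^\(([^)]+)\) Entering (MASTER|BACKUP|FAULT) STATE")
FAIL_RE = re.compile(r"^VRRP_Script\(([^)]+)\) failed \(due to signal (\d+)\)")


def summarize(lines):
    """Return (state transitions, per-script signal-failure counts)."""
    transitions = []      # (timestamp, instance, state)
    failures = Counter()  # (script, signal) -> occurrences
    for raw in lines:
        line = LINE_RE.match(raw)
        if line is None:
            continue  # not a log line in the assumed format
        stamp = datetime.strptime(line.group(1), "%Y-%m-%dT%H:%M:%S.%f")
        message = line.group(2)
        state = STATE_RE.match(message)
        if state:
            transitions.append((stamp, state.group(1), state.group(2)))
            continue
        failed = FAIL_RE.match(message)
        if failed:
            failures[(failed.group(1), int(failed.group(2)))] += 1
    return transitions, failures


def master_hold_seconds(transitions):
    """Seconds each MASTER period lasted before the next state change."""
    holds = []
    for (start, inst, state), (end, next_inst, _) in zip(transitions, transitions[1:]):
        if state == "MASTER" and inst == next_inst:
            holds.append((end - start).total_seconds())
    return holds


# A few lines copied from the excerpt in this ticket.
SAMPLE = [
    "2025-11-12T02:56:26.285943161Z Wed Nov 12 02:56:26 2025: (clu001_INGRESS_0) Entering MASTER STATE",
    "2025-11-12T02:56:28.985570543Z Wed Nov 12 02:56:28 2025: VRRP_Script(chk_ingress_ready) failed (due to signal 15)",
    "2025-11-12T02:56:28.986855065Z Wed Nov 12 02:56:28 2025: VRRP_Script(chk_ingress) failed (due to signal 15)",
    "2025-11-12T02:56:28.986872658Z Wed Nov 12 02:56:28 2025: (clu001_INGRESS_0) Entering FAULT STATE",
    "2025-11-12T02:56:30.597262571Z Wed Nov 12 02:56:30 2025: (clu001_INGRESS_0) Entering BACKUP STATE",
]

if __name__ == "__main__":
    transitions, failures = summarize(SAMPLE)
    print(failures)
    print(master_hold_seconds(transitions))
```

Run against a full Keepalived container log, the hold times show how long each node kept the VIP before a track script was killed, which may help line up VIP movement with the reported login failures.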

Actual results:

    Keepalived reports that the track script is still running, but it does not appear to be.

Expected results:

If Keepalived gets the correct track script status, the VIPs should stay stable on the nodes, which in turn should help stabilize LDAP logins for the HCP clusters.

Additional info:

    The customer is facing intermittent LDAP login failures on all of their HCP clusters. Availability of the management cluster's Keepalived ingress VIP is crucial for LDAP logins to the HCP clusters to always succeed. We observed that the Keepalived health check scripts time out or fail intermittently, and that HCP cluster logins fail at the same time.

The issue matches what was observed in the following Jira tickets:

https://issues.redhat.com/browse/OCPBUGS-61384
https://issues.redhat.com/browse/OCPBUGS-60021

Assignee: Benjamin Nemec
Reporter: Divyam Pateriya
QA Contact: Ross Brattain
Need Info From: None
Votes: 2
Watchers: 20
Created: 2025/11/15 3:44 PM
Updated: 2025/11/25 10:06 AM
Resolved: 2025/11/19 1:01 PM
