-
Bug
-
Resolution: Can't Do
-
Major
-
None
-
4.12
-
Quality / Stability / Reliability
-
False
-
-
None
-
Important
-
None
-
None
-
None
-
Rejected
Description of problem:
- The customer is running OCP 4.12.53 on bare metal (IPI).
- Occasionally, when the integrated keepalived that manages the ingress VIP runs the chk_ingress VRRP script, the script fails due to a timeout:
vrrp_script chk_ingress {
script "/usr/bin/timeout 0.9 /usr/bin/curl -o /dev/null -Lfs http://localhost:1936/healthz/ready"
interval 1
weight 20
rise 3
fall 2
}
2024-12-04T05:38:26.658757185+00:00 stderr F Wed Dec 4 05:38:26 2024: VRRP_Script(chk_ingress) failed (exited with status 124)
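To catch these timeouts outside keepalived, the check can be replayed from the affected node with per-run timing. The following is a minimal sketch, not part of keepalived or the router: it assumes Python 3 and curl are available where it runs, the names `run_check` and `watch` are hypothetical, and the 0.9 s limit is enforced by `subprocess.run` rather than by `/usr/bin/timeout`, so timings will differ slightly from the real check.

```python
import subprocess
import time

# The curl command from the vrrp_script above, without the /usr/bin/timeout wrapper.
CHECK_CMD = ["/usr/bin/curl", "-o", "/dev/null", "-Lfs",
             "http://localhost:1936/healthz/ready"]


def run_check(cmd=CHECK_CMD, limit=0.9):
    """Run cmd once and return (status, elapsed_seconds).

    status is the command's exit code, or 124 if it ran longer than
    `limit` seconds (the same code /usr/bin/timeout reports).
    """
    start = time.monotonic()
    try:
        status = subprocess.run(cmd, timeout=limit).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising.
        status = 124
    return status, time.monotonic() - start


def watch(interval=1.0, slow=0.5):
    """Hypothetical helper: loop forever, printing failed or slow checks."""
    while True:
        status, elapsed = run_check()
        if status != 0 or elapsed > slow:
            stamp = time.strftime("%H:%M:%S")
            print(f"{stamp} status={status} elapsed={elapsed:.3f}s", flush=True)
        # Keep roughly the same 1 s cadence as "interval 1".
        time.sleep(max(0.0, interval - elapsed))
```

Calling `watch()` and leaving it running alongside an strace of the router would let the slow probes be correlated with the stalls by timestamp. The `slow` threshold of 0.5 s is an arbitrary choice for illustration.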
- When this happens, the openshift-router process appears to stall for roughly 1.5s (the strace below shows no activity between 05:38:24.766 and 05:38:26.415):
2998696 05:38:24.766046 close(15) = 0
2998688 05:38:24.766091 <... nanosleep resumed>NULL) = 0
2998696 05:38:24.766107 futex(0xc000588148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
2998688 05:38:24.766118 futex(0x301d998, FUTEX_WAIT_PRIVATE, 0, {tv_sec=4, tv_nsec=155358380} <unfinished ...>
2998692 05:38:26.415706 <... epoll_pwait resumed>[{events=EPOLLIN, data={u32=1197311848, u64=140644196386664}}], 128, 4155, NULL, 3813427388) = 1
2998692 05:38:26.415804 futex(0x301d998, FUTEX_WAKE_PRIVATE, 1) = 1
2998692 05:38:26.415859 accept4(7, {sa_family=AF_INET6, sin6_port=htons(45398), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, [112 => 28], SOCK_CLOEXEC|SOCK_NONBLOCK) = 15
2998692 05:38:26.415933 epoll_ctl(4, EPOLL_CTL_ADD, 15, {events=EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, data={u32=1197309928, u64=140644196384744}} <unfinished ...>
- We checked the metrics and found no throttling. The customer reports that the clusters affected by this issue are mostly idle.
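To find stalls like the one in the strace excerpt above across a longer capture, the timestamps can be scanned for large gaps between consecutive lines. This is a rough sketch that assumes the capture uses the same line format as the excerpt (thread ID followed by an `HH:MM:SS.ffffff` timestamp) and does not cross midnight. The function name `find_gaps` and the 0.9 s default threshold (chosen to match the check timeout) are illustrative.

```python
import re
from datetime import datetime

# Matches "<tid> HH:MM:SS.ffffff <rest of line>".
LINE_RE = re.compile(r"^(\d+)\s+(\d{2}:\d{2}:\d{2}\.\d{6})\s+(.*)$")


def find_gaps(lines, threshold=0.9):
    """Return a list of (gap_seconds, previous_line, next_line) tuples.

    A tuple is emitted for every pair of consecutive timestamped lines
    whose timestamps differ by more than `threshold` seconds.
    Lines that do not match the expected format are skipped.
    """
    gaps = []
    prev_ts = None
    prev_line = None
    for raw in lines:
        line = raw.strip()
        match = LINE_RE.match(line)
        if not match:
            continue
        ts = datetime.strptime(match.group(2), "%H:%M:%S.%f")
        if prev_ts is not None:
            delta = (ts - prev_ts).total_seconds()
            if delta > threshold:
                gaps.append((delta, prev_line, line))
        prev_ts, prev_line = ts, line
    return gaps


def report(path):
    """Print every gap found in the strace capture stored at `path`."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for gap, before, after in find_gaps(handle):
            print(f"{gap:.3f}s gap\n  before: {before}\n  after:  {after}")
```

Run against the excerpt above, this flags a single gap between the `futex` call at 05:38:24.766118 and the `epoll_pwait` return at 05:38:26.415706. A gap in the trace only shows that no traced thread made a syscall in that window, so it does not by itself say whether the process was blocked, descheduled, or busy in user space.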
Version-Release number of selected component (if applicable):
How reproducible:
In customer environment
Steps to Reproduce:
1.
2.
3.
Actual results:
The openshift-router process fails the health check run by the keepalived pods due to a timeout, which causes the ingress VIP to flap.
Expected results:
The openshift-router process should be able to reply to the curl in less than 0.9s during normal operations.
Additional info: