Details
- Type: Bug
- Resolution: Not a Bug
- Priority: Major
- Affects Versions: 4.9, 4.10, 4.11, 4.12, 4.13, 4.14
- Status: Rejected
Description
Description of problem:
During QE's resiliency testing, we found that HAProxy sometimes stops responding to its health checks and the router pod gets restarted. The failure is intermittent: on average it occurs every 5-15 hours, and sometimes not at all.
This slack thread captures evidence of this bug: https://redhat-internal.slack.com/archives/C04U0FP2EHY/p1678985417068499
It happens on HAProxy 2.2.24, 2.6.6, and 2.6.9 at seemingly equal rates.
QE Resiliency test: https://github.com/openshift/svt/tree/master/reliability-v2
Version-Release number of selected component (if applicable):
Any OCP release using HAProxy 2.2 or 2.6: 4.9, 4.10, 4.11, 4.12, 4.13, 4.14
How reproducible:
Rare
Steps to Reproduce:
I don't have a way to reproduce this other than having QE run the resiliency test and waiting up to 18 hours.
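Absent a deterministic reproducer, one way to notice that the failure has happened is to filter for router pods with a non-zero restart count. A minimal sketch (the sample output is taken from this report; on a live cluster the input would come from `oc get pods -n openshift-ingress` instead of the hard-coded variable):

```shell
# Sample 'oc get pods' output as captured in this report; on a live cluster:
#   pods=$(oc get pods -n openshift-ingress)
pods='NAME READY STATUS RESTARTS AGE
router-default-8654d4ff7c-qns7r 1/1 Running 3 24h
router-default-8654d4ff7c-rdq68 1/1 Running 0 24h'

# Print the name and restart count of any pod whose RESTARTS column
# (field 4) is non-zero; NR>1 skips the header row.
echo "$pods" | awk 'NR>1 && $4 > 0 {print $1, $4}'
```

This also works on real output where RESTARTS reads like `3 (3h3m ago)`, since awk still sees the count as field 4.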
Actual results:
Expected results:
Additional info:
The previous logs of the failed haproxy container:

$ oc logs -n openshift-ingress router-default-8654d4ff7c-qns7r --previous
...
I0316 13:24:04.427203 1 router.go:618] template "msg"="router reloaded" "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
I0316 13:24:09.440967 1 router.go:618] template "msg"="router reloaded" "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
I0316 13:24:14.374609 1 router.go:618] template "msg"="router reloaded" "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
I0316 13:24:36.010223 1 router.go:618] template "msg"="router reloaded" "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
I0316 13:24:41.000160 1 router.go:618] template "msg"="router reloaded" "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
I0316 13:24:51.752128 1 healthz.go:261] backend-http check failed: healthz [-]backend-http failed: backend reported failure
I0316 13:24:51.752132 1 healthz.go:261] backend-http check failed: healthz [-]backend-http failed: backend reported failure
I0316 13:25:01.763238 1 healthz.go:261] backend-http check failed: healthz [-]backend-http failed: backend reported failure
I0316 13:25:01.763518 1 healthz.go:261] backend-http check failed: healthz [-]backend-http failed: backend reported failure
I0316 13:25:09.323401 1 router.go:618] template "msg"="router reloaded" "output"=" - Checking http://localhost:80 ...\n - Health check ok : 1 retry attempt(s).\n"
I0316 13:25:11.752341 1 healthz.go:261] backend-http check failed: healthz [-]backend-http failed: backend reported failure
I0316 13:25:11.752720 1 healthz.go:261] backend-http check failed: healthz [-]backend-http failed: backend reported failure
I0316 13:25:12.907302 1 healthz.go:261] backend-http check failed: healthz [-]backend-http failed: backend reported failure
I0316 13:25:14.434428 1 template.go:704] router "msg"="Shutdown requested, waiting 45s for new connections to cease"
I0316 13:25:19.391630 1 router.go:618] template "msg"="router reloaded" "output"=" - Checking http://localhost:80 ...\n - Health check ok : 1 retry attempt(s).\n"
I0316 13:25:19.895977 1 healthz.go:261] process-running check failed: healthz [-]process-running failed: process is terminating

Proof of restarting often:

$ oc_ingress_get_router_pods
NAME                              READY   STATUS    RESTARTS       AGE
router-default-8654d4ff7c-qns7r   1/1     Running   3 (3h3m ago)   24h
router-default-8654d4ff7c-rdq68   1/1     Running   0
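Note the roughly 22-second gap between the 13:24:14 and 13:24:36 reloads just before the healthz failures begin. A quick way to spot such stalls when triaging is to compute the interval between consecutive log timestamps. A small sketch, assuming klog-style `I<mmdd> HH:MM:SS.ffffff` prefixes as in the log above (the embedded log lines are abbreviated samples from this report):

```python
import re
from datetime import datetime

# Abbreviated "router reloaded" lines from the failed container's log.
log = """\
I0316 13:24:04.427203 1 router.go:618] router reloaded
I0316 13:24:09.440967 1 router.go:618] router reloaded
I0316 13:24:14.374609 1 router.go:618] router reloaded
I0316 13:24:36.010223 1 router.go:618] router reloaded
"""

# Parse klog timestamps (Immdd HH:MM:SS.ffffff); the year is not encoded,
# so strptime's default year is used, which is fine for computing deltas.
ts = [datetime.strptime(m.group(1), "%m%d %H:%M:%S.%f")
      for m in re.finditer(r"I(\d{4} \d{2}:\d{2}:\d{2}\.\d+)", log)]

# Gaps between consecutive reload entries, in seconds.
gaps = [(b - a).total_seconds() for a, b in zip(ts, ts[1:])]
print(max(gaps))  # largest gap between consecutive reloads
```

A gap well above the usual ~5 s reload cadence marks the window where HAProxy stopped answering in time.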