OpenShift Bugs / OCPBUGS-10511

Haproxy health checks fail and pod restarts during resiliency tests


Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: 4.13, 4.12, 4.11, 4.10, 4.9, 4.14
    • Component/s: Networking / router
    • Labels: None
    • Severity: Moderate

    Description

      Description of problem:

      During QE's resiliency testing, we found that HAProxy sometimes stops responding to health checks and the router pod gets restarted. The failures are intermittent: on average one occurs every 5-15 hours, and sometimes not at all.
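
      The health checks here are presumably the kubelet liveness/readiness probes on the router deployment. A quick way to see how they are configured (a sketch assuming the default router-default deployment in openshift-ingress, as in the logs below):

      $ oc -n openshift-ingress get deployment/router-default \
          -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}{"\n"}{.spec.template.spec.containers[0].readinessProbe}{"\n"}'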

      This slack thread captures evidence of this bug: https://redhat-internal.slack.com/archives/C04U0FP2EHY/p1678985417068499 

      It happens on HAProxy 2.2.24, 2.6.6, and 2.6.9 with seemingly equal frequency.

      QE Resiliency test: https://github.com/openshift/svt/tree/master/reliability-v2 

      Version-Release number of selected component (if applicable):

      Any OCP release using HAProxy 2.2 or 2.6: 4.9, 4.10, 4.11, 4.12, 4.13, 4.14
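
      To confirm which HAProxy build a given router is actually running, one option (assuming the default router-default deployment) is to ask the binary inside the pod:

      $ oc -n openshift-ingress rsh deployment/router-default haproxy -v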

      How reproducible:

      Rare

      Steps to Reproduce:

      I don't have a way to reproduce this other than having QE start resiliency testing and waiting up to 18 hours. 
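
      While the reliability test is running, one way to catch the event as it happens is to watch the router pods for restart-count changes and to watch the namespace for probe-failure events (Unhealthy / Killing):

      $ oc -n openshift-ingress get pods -w
      $ oc -n openshift-ingress get events -w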

      Actual results:

      HAProxy intermittently stops responding to its health checks and the router pod is restarted.

      Expected results:

      HAProxy continues to answer health checks and the router pod does not restart.

      Additional info:

      Logs of the previous (failed) haproxy container:
      $ oc logs -n openshift-ingress router-default-8654d4ff7c-qns7r --previous 
      ...
      I0316 13:24:04.427203       1 router.go:618] template "msg"="router reloaded" "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
      I0316 13:24:09.440967       1 router.go:618] template "msg"="router reloaded" "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
      I0316 13:24:14.374609       1 router.go:618] template "msg"="router reloaded" "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
      I0316 13:24:36.010223       1 router.go:618] template "msg"="router reloaded" "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
      I0316 13:24:41.000160       1 router.go:618] template "msg"="router reloaded" "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
      I0316 13:24:51.752128       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I0316 13:24:51.752132       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I0316 13:25:01.763238       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I0316 13:25:01.763518       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I0316 13:25:09.323401       1 router.go:618] template "msg"="router reloaded" "output"=" - Checking http://localhost:80 ...\n - Health check ok : 1 retry attempt(s).\n"
      I0316 13:25:11.752341       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I0316 13:25:11.752720       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I0316 13:25:12.907302       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I0316 13:25:14.434428       1 template.go:704] router "msg"="Shutdown requested, waiting 45s for new connections to cease" 
      I0316 13:25:19.391630       1 router.go:618] template "msg"="router reloaded" "output"=" - Checking http://localhost:80 ...\n - Health check ok : 1 retry attempt(s).\n"
      I0316 13:25:19.895977       1 healthz.go:261] process-running check failed: healthz
      [-]process-running failed: process is terminating 
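
      When the backend-http check starts failing, it may help to hit the router's health endpoints directly from inside the pod, to see whether HAProxy itself is wedged or only the healthz handler is slow. This is a sketch that assumes the default probe port 1936 and that curl is available in the router image:

      $ oc -n openshift-ingress rsh deployment/router-default \
          curl -sS -o /dev/null -w '%{http_code}\n' http://localhost:1936/healthz/ready
      $ oc -n openshift-ingress rsh deployment/router-default \
          curl -sS -o /dev/null -w '%{http_code}\n' http://localhost:80/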
      
      Evidence that the router pod restarts frequently:
      $ oc_ingress_get_router_pods 
      NAME                              READY   STATUS    RESTARTS       AGE
      router-default-8654d4ff7c-qns7r   1/1     Running   3 (3h3m ago)   24h
      router-default-8654d4ff7c-rdq68   1/1     Running   0     
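
      Note: oc_ingress_get_router_pods is a local shell alias whose definition is not shown; assuming it simply lists the router pods, a plain oc equivalent would be:

      $ oc -n openshift-ingress get pods -o wide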

      Attachments

        Activity

          People

            mmasters1@redhat.com Miciah Masters
            gspence@redhat.com Grant Spence
            Hongan Li
            Votes: 0
            Watchers: 4
