- Bug
- Resolution: Unresolved
- Critical
- None
- 4.17, 4.18
- None
This is a clone of issue OCPBUGS-43741. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-43719. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-43428. The following is the description of the original issue:
—
As part of TRT investigations of k8s API disruptions, we have discovered that there are times when haproxy considers the underlying apiserver to be down, while from the k8s perspective the apiserver is healthy and functional.
From the customer's perspective, any call to the cluster API endpoint fails during this time. It simply looks like an outage.
A thorough investigation leads us to the following difference between how haproxy determines that the apiserver is alive and how k8s does:
inter 1s fall 2 rise 3
and
readinessProbe:
  httpGet:
    scheme: HTTPS
    port: 6443
    path: readyz
  initialDelaySeconds: 0
  periodSeconds: 5
  timeoutSeconds: 10
  successThreshold: 1
  failureThreshold: 3
We can see that the first check, the one used by haproxy, is much stricter. As a result, haproxy logs the following
2024-10-08T12:37:32.779247039Z [WARNING] (29) : Server masters/master-2 is DOWN, reason: Layer7 wrong status, code: 500, info: "Internal Server Error", check duration: 5ms. 0 active and 0 backup servers left. 154 sessions active, 0 requeued, 0 remaining in queue.
much sooner than k8s would consider the apiserver unhealthy.
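For context, the strict check above is haproxy's server health check. A minimal sketch of how it might appear in the backend configuration is shown below; the backend and server names come from the log message, while the address, the /readyz check path, and the exact set of options are assumptions rather than the actual template:

  backend masters
    # HTTP health check against the apiserver readiness endpoint (assumed path)
    option httpchk GET /readyz HTTP/1.0
    # check every 1s, mark DOWN after 2 failures, UP again after 3 successes
    server master-2 192.0.2.10:6443 check check-ssl verify none inter 1s fall 2 rise 3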
To remediate this issue, it has been agreed that the haproxy checks should be relaxed and aligned with the k8s readiness probe.
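A minimal sketch of what that alignment could look like, assuming the probe's periodSeconds and failureThreshold map directly onto haproxy's inter and fall parameters (the values chosen in the actual fix may differ):

  # before: check every 1s, mark the server DOWN after 2 failed checks
  server master-2 192.0.2.10:6443 check check-ssl verify none inter 1s fall 2 rise 3
  # after: check every 5s, mark the server DOWN after 3 failed checks,
  # mirroring periodSeconds: 5 and failureThreshold: 3 of the readiness probe
  server master-2 192.0.2.10:6443 check check-ssl verify none inter 5s fall 3 rise 3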
- blocks: OCPBUGS-43743 Haproxy timeouts not aligned with k8s healthiness checks (New)
- clones: OCPBUGS-43741 Haproxy timeouts not aligned with k8s healthiness checks (ON_QA)
- is blocked by: OCPBUGS-43741 Haproxy timeouts not aligned with k8s healthiness checks (ON_QA)
- is cloned by: OCPBUGS-43743 Haproxy timeouts not aligned with k8s healthiness checks (New)
- links to