-
Bug
-
Resolution: Done-Errata
-
Critical
-
None
-
4.17, 4.18
This is a clone of issue OCPBUGS-43428. The following is the description of the original issue:
—
As part of TRT investigations of k8s API disruptions, we discovered that haproxy sometimes considers the underlying apiserver to be down even though, from the k8s perspective, the apiserver is healthy and functional.
From the customer perspective, during this time any call to the cluster API endpoint will fail. It simply looks like an outage.
A thorough investigation traced this to a difference between how haproxy and k8s decide whether the apiserver is alive, namely:
inter 1s fall 2 rise 3
and
readinessProbe:
  httpGet:
    scheme: HTTPS
    port: 6443
    path: readyz
  initialDelaySeconds: 0
  periodSeconds: 5
  timeoutSeconds: 10
  successThreshold: 1
  failureThreshold: 3
The first check, which belongs to haproxy, is much stricter. As a result, haproxy reports the following
2024-10-08T12:37:32.779247039Z [WARNING] (29) : Server masters/master-2 is DOWN, reason: Layer7 wrong status, code: 500, info: "Internal Server Error", check duration: 5ms. 0 active and 0 backup servers left. 154 sessions active, 0 requeued, 0 remaining in queue.
much sooner than k8s would consider anything to be wrong.
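The size of the gap can be estimated with a rough calculation. The sketch below assumes that haproxy marks a server down after `fall` consecutive failed checks spaced `inter` apart, and that k8s marks a pod unready after `failureThreshold` consecutive failed probes spaced `periodSeconds` apart. It ignores check duration and timeouts, so the results are approximations for illustration only. The function names are hypothetical and not taken from haproxy or k8s.

```python
def haproxy_seconds_to_down(inter_s: float, fall: int) -> float:
    """Approximate time for haproxy to mark a server DOWN:
    `fall` consecutive failed checks, one every `inter_s` seconds."""
    return inter_s * fall


def k8s_seconds_to_unready(period_s: float, failure_threshold: int) -> float:
    """Approximate time for k8s to mark a pod unready:
    `failure_threshold` consecutive failed probes, one every `period_s` seconds."""
    return period_s * failure_threshold


# Values taken from the two configurations quoted in this issue.
haproxy_down = haproxy_seconds_to_down(inter_s=1, fall=2)
k8s_unready = k8s_seconds_to_unready(period_s=5, failure_threshold=3)

print(f"haproxy marks the server down after about {haproxy_down}s")
print(f"k8s marks the pod unready after about {k8s_unready}s")
```

Under these assumptions haproxy can take the apiserver out of rotation after about 2 seconds of failures, while the k8s readiness probe tolerates about 15 seconds.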
To remediate this issue, it has been agreed that the haproxy checks should be relaxed and aligned with the k8s readiness probe.
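One possible way to read "aligned" is a direct mapping of the probe fields onto the haproxy check options: `periodSeconds` to `inter`, `failureThreshold` to `fall`, and `successThreshold` to `rise`. The sketch below only illustrates that mapping. The `ReadinessProbe` class and `haproxy_check_options` function are hypothetical, and the resulting values are not necessarily the ones shipped in the fix.

```python
from dataclasses import dataclass


@dataclass
class ReadinessProbe:
    """Subset of the k8s readiness probe fields quoted in this issue."""
    period_seconds: int
    success_threshold: int
    failure_threshold: int


def haproxy_check_options(probe: ReadinessProbe) -> str:
    """Build haproxy server check options that mirror the probe.

    Assumed mapping (illustrative only):
      periodSeconds    -> inter
      failureThreshold -> fall
      successThreshold -> rise
    """
    return (
        f"inter {probe.period_seconds}s "
        f"fall {probe.failure_threshold} "
        f"rise {probe.success_threshold}"
    )


probe = ReadinessProbe(period_seconds=5, success_threshold=1, failure_threshold=3)
print(haproxy_check_options(probe))
```

With the probe values from this issue, the mapping yields `inter 5s fall 3 rise 1` in place of the current `inter 1s fall 2 rise 3`.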
- blocks
-
OCPBUGS-43741 Haproxy timeouts not aligned with k8s healthiness checks
- ON_QA
- clones
-
OCPBUGS-43428 Haproxy timeouts not aligned with k8s healthiness checks
- Verified
- is blocked by
-
OCPBUGS-43428 Haproxy timeouts not aligned with k8s healthiness checks
- Verified
- is cloned by
-
OCPBUGS-43741 Haproxy timeouts not aligned with k8s healthiness checks
- ON_QA
- links to
-
RHBA-2024:8981 OpenShift Container Platform 4.17.z bug fix update