As part of TRT investigations into Kubernetes API disruptions, we discovered that there are times when haproxy considers the underlying apiserver to be down, even though from the Kubernetes perspective the apiserver is healthy and functional.
From the customer's perspective, any call to the cluster API endpoint fails during this window. It simply looks like an outage.
A thorough investigation led us to the following difference between how haproxy and Kubernetes decide whether the apiserver is alive. The haproxy health check is configured as:
inter 1s fall 2 rise 3
whereas the k8s readiness probe is:
readinessProbe:
  httpGet:
    scheme: HTTPS
    port: 6443
    path: readyz
  initialDelaySeconds: 0
  periodSeconds: 5
  timeoutSeconds: 10
  successThreshold: 1
  failureThreshold: 3
We can see that the first check, which belongs to haproxy, is much stricter: two failed checks one second apart are enough to mark a server down. As a result, haproxy reports the following
2024-10-08T12:37:32.779247039Z [WARNING] (29) : Server masters/master-2 is DOWN, reason: Layer7 wrong status, code: 500, info: "Internal Server Error", check duration: 5ms. 0 active and 0 backup servers left. 154 sessions active, 0 requeued, 0 remaining in queue.
long before Kubernetes would consider the apiserver unhealthy.
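The gap between the two checks can be quantified as a rough worst-case detection window, i.e. how long a server must fail consecutive checks before it is marked down. This is a minimal sketch of that arithmetic; the helper name and the simplification (ignoring per-check timeouts and jitter) are ours, not part of either configuration:

```python
def time_to_down(interval_s: float, failures: int) -> float:
    """Seconds of consecutive failed checks before a backend is marked down.

    Simplified model: detection time = check interval * failure threshold,
    ignoring per-check timeouts and scheduling jitter.
    """
    return interval_s * failures

# haproxy: inter 1s fall 2
haproxy_window = time_to_down(interval_s=1, failures=2)
# k8s readiness probe: periodSeconds 5, failureThreshold 3
k8s_window = time_to_down(interval_s=5, failures=3)

print(f"haproxy marks a server DOWN after ~{haproxy_window:.0f}s of failures")
print(f"k8s marks the pod unready after ~{k8s_window:.0f}s of failures")
```

Under this model, haproxy pulls a server out of rotation after roughly 2 seconds of failures, while Kubernetes tolerates about 15 seconds, which explains why haproxy reacts to transient apiserver blips that Kubernetes rides out.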
To remediate this issue, it has been agreed that the haproxy checks should be softened and aligned with the k8s readiness probe.
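As a sketch only, a softened haproxy server check that mirrors the readiness probe's periodSeconds 5 / failureThreshold 3 / successThreshold 1 might look like the following; these exact values are an assumption for illustration, not the agreed final configuration:

```
inter 5s fall 3 rise 1
```

With these parameters, haproxy's worst-case detection window (roughly 15 seconds of consecutive failures) matches what Kubernetes already tolerates before marking the apiserver pod unready.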
Issue links:
- blocks: OCPBUGS-43719 Haproxy timeouts not aligned with k8s healthiness checks (Closed)
- is cloned by: OCPBUGS-43719 Haproxy timeouts not aligned with k8s healthiness checks (Closed)
- is depended on by: OCPBUGS-43627 Prevent Drift From Expected Readiness Probe Configuration (New)
- is duplicated by: OCPBUGS-43446 On Prem LB Configuration Doesn't Match API Server Expectation (Closed)
- links to: RHEA-2024:6122 OpenShift Container Platform 4.18.z bug fix update