OCPBUGS-43428

Haproxy timeouts not aligned with k8s health checks


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Affects Version/s: 4.14, 4.15, 4.16, 4.17, 4.18

      As part of TRT investigations into k8s API disruptions, we have discovered that there are times when haproxy considers the underlying apiserver to be down, yet from the k8s perspective the apiserver is healthy and functional.

      From the customer's perspective, any call to the cluster API endpoint fails during this time. It simply looks like an outage.

      A thorough investigation points to the following difference between how haproxy and k8s each decide whether the apiserver is alive:

      inter 1s fall 2 rise 3
      

      and

      readinessProbe:
        httpGet:
          scheme: HTTPS
          port: 6443
          path: readyz
        initialDelaySeconds: 0
        periodSeconds: 5
        timeoutSeconds: 10
        successThreshold: 1
        failureThreshold: 3
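
      For context, check parameters like the haproxy ones above sit on the server lines of a backend definition. The sketch below is a minimal illustration only; the backend name, server names, addresses, and the /readyz check URI are assumptions, not the shipped OpenShift template.

      # illustrative backend only, not the actual OpenShift haproxy configuration
      backend masters
        option httpchk GET /readyz
        # DOWN after 2 failed checks run every 1s, UP again only after 3 successful ones
        server master-0 192.0.2.10:6443 check check-ssl verify none inter 1s fall 2 rise 3
        server master-1 192.0.2.11:6443 check check-ssl verify none inter 1s fall 2 rise 3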
      

      We can see that the haproxy check (inter 1s fall 2 rise 3) is much stricter: haproxy marks a server down after 2 consecutive failed checks run at 1-second intervals, i.e. after roughly 2 seconds, whereas the k8s readiness probe tolerates 3 failures at 5-second intervals, i.e. roughly 15 seconds. As a result, haproxy logs the following

      2024-10-08T12:37:32.779247039Z [WARNING]  (29) : Server masters/master-2 is DOWN, reason: Layer7 wrong status, code: 500, info: "Internal Server Error", check duration: 5ms. 0 active and 0 backup servers left. 154 sessions active, 0 requeued, 0 remaining in queue.
      

      much sooner than k8s would consider anything to be wrong.
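
      As a side note, the endpoint the k8s readiness probe evaluates can be queried directly to confirm what k8s sees; a hypothetical spot check from a control-plane node (the -k flag skips certificate verification) could look like:

      curl -k "https://localhost:6443/readyz?verbose"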

      To remediate this issue, it has been agreed that the haproxy checks should be relaxed and aligned with the k8s readiness probe.
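
      One possible alignment, shown here purely as an illustrative sketch (the exact values are left to the fix), would mirror the probe's periodSeconds=5 and failureThreshold=3:

      inter 5s fall 3 rise 3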

              Assignee: Mat Kowalski (mkowalsk@redhat.com)
              Reporter: Mat Kowalski (mkowalsk@redhat.com)
              QA Contact: Ross Brattain
              Votes: 0
              Watchers: 6
