OpenShift Bugs / OCPBUGS-43719

Haproxy timeouts not aligned with k8s healthiness checks


      This is a clone of issue OCPBUGS-43428. The following is the description of the original issue:

      As part of TRT investigations into k8s API disruptions, we have discovered that there are times when haproxy considers the underlying apiserver to be down, yet from the k8s perspective the apiserver is healthy and functional.

      From the customer's perspective, any call to the cluster API endpoint fails during this time. It simply looks like an outage.

      Further investigation points to the following difference between how haproxy and k8s each decide whether the apiserver is alive, i.e.

      inter 1s fall 2 rise 3
      

      and

          readinessProbe:
            httpGet:
              scheme: HTTPS
              port: 6443
              path: readyz
            initialDelaySeconds: 0
            periodSeconds: 5
            timeoutSeconds: 10
            successThreshold: 1
            failureThreshold: 3
      

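      The first check, which belongs to haproxy, is much stricter: with inter 1s fall 2, a server is marked down after two consecutive failed checks, roughly 2 seconds, whereas the readiness probe tolerates three failures at 5-second intervals, roughly 15 seconds. As a result, haproxy reports the following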

      2024-10-08T12:37:32.779247039Z [WARNING]  (29) : Server masters/master-2 is DOWN, reason: Layer7 wrong status, code: 500, info: "Internal Server Error", check duration: 5ms. 0 active and 0 backup servers left. 154 sessions active, 0 requeued, 0 remaining in queue.
      

      much sooner than k8s would consider anything to be wrong.
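
      To make the gap concrete, here is a minimal sketch that estimates how long each checker needs before it declares the apiserver unhealthy. It assumes every check fails immediately and ignores check timeouts and scheduling jitter, so the figures are rough lower bounds rather than measured values. The class and function names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HaproxyCheck:
    """Hypothetical holder for the haproxy 'inter/fall/rise' check options."""
    inter_s: float  # seconds between health checks
    fall: int       # consecutive failures before the server is marked DOWN
    rise: int       # consecutive successes before the server is marked UP


@dataclass(frozen=True)
class ReadinessProbe:
    """Hypothetical holder for the relevant readinessProbe fields."""
    period_s: float
    failure_threshold: int
    success_threshold: int


def haproxy_seconds_to_down(check: HaproxyCheck) -> float:
    # Simplification: each failed check costs one full interval.
    return check.inter_s * check.fall


def probe_seconds_to_unready(probe: ReadinessProbe) -> float:
    # Simplification: each failed probe costs one full period.
    return probe.period_s * probe.failure_threshold


# Values taken from the two configurations quoted in this issue.
haproxy = HaproxyCheck(inter_s=1.0, fall=2, rise=3)
probe = ReadinessProbe(period_s=5.0, failure_threshold=3, success_threshold=1)

print("haproxy marks DOWN after ~%.0fs" % haproxy_seconds_to_down(haproxy))
print("k8s marks unready after ~%.0fs" % probe_seconds_to_unready(probe))
```

      Under these simplifying assumptions haproxy reacts in about 2 seconds while the readiness probe needs about 15 seconds, which matches the behaviour described above.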

      To remediate this issue, it has been agreed that the haproxy checks should be relaxed and aligned with the k8s readiness probe.
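
      One possible reading of "aligned with the readiness probe" is to reuse the probe's period and thresholds directly as haproxy check options. The sketch below shows that mapping only as an illustration. The function name is hypothetical, and the resulting values are an assumption, not necessarily the ones adopted by the actual fix.

```python
def aligned_check_options(period_seconds: int,
                          failure_threshold: int,
                          success_threshold: int) -> str:
    """Map readinessProbe fields onto haproxy check options.

    Assumed mapping (illustrative only):
      periodSeconds    -> inter
      failureThreshold -> fall
      successThreshold -> rise
    """
    return (f"inter {period_seconds}s "
            f"fall {failure_threshold} "
            f"rise {success_threshold}")


# Readiness probe values quoted in this issue.
print(aligned_check_options(period_seconds=5,
                            failure_threshold=3,
                            success_threshold=1))
```

      With the probe values quoted above, this mapping yields inter 5s fall 3 rise 1, so haproxy would wait about as long as k8s before taking a server out of rotation.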

              bnemec@redhat.com Benjamin Nemec
              openshift-crt-jira-prow OpenShift Prow Bot
              Zhanqi Zhao Zhanqi Zhao