OpenShift Bugs / OCPBUGS-10511

Haproxy health checks fail and pod restarts during resiliency tests


Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: 4.13, 4.12, 4.11, 4.10, 4.9, 4.14
    • Component/s: Networking / router
    • Labels: None
    • Severity: Moderate

    Description

      Description of problem:

      During QE's resiliency testing, we found that HAProxy sometimes stops responding to health checks and the router pod gets restarted. The failures are intermittent: on average one occurs every 5-15 hours, and sometimes not at all.
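
      The health checks here are presumably the kubelet liveness/readiness probes on the router deployment. A quick way to see how they are configured (a sketch assuming the default router-default deployment in openshift-ingress, as in the logs below):

      $ oc -n openshift-ingress get deployment/router-default \
          -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}{"\n"}{.spec.template.spec.containers[0].readinessProbe}{"\n"}'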

      This slack thread captures evidence of this bug: https://redhat-internal.slack.com/archives/C04U0FP2EHY/p1678985417068499 

      It happens on HAProxy 2.2.24, 2.6.6, and 2.6.9 with seemingly equal frequency.

      QE Resiliency test: https://github.com/openshift/svt/tree/master/reliability-v2 

      Version-Release number of selected component (if applicable):

      Any OCP release using HAProxy 2.2 or 2.6: 4.9, 4.10, 4.11, 4.12, 4.13, 4.14
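
      To confirm which HAProxy build a given router is actually running, one option (assuming the default router-default deployment) is to ask the binary inside the pod:

      $ oc -n openshift-ingress rsh deployment/router-default haproxy -v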

      How reproducible:

      Rare

      Steps to Reproduce:

      I don't have a way to reproduce this other than having QE start resiliency testing and waiting up to 18 hours. 
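
      While the reliability test is running, one way to catch the event as it happens is to watch the router pods for restart-count changes and to watch the namespace for probe-failure events (Unhealthy / Killing):

      $ oc -n openshift-ingress get pods -w
      $ oc -n openshift-ingress get events -w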

      Actual results:

      HAProxy intermittently stops responding to its health checks and the router pod is restarted.

      Expected results:

      HAProxy continues to answer health checks and the router pod does not restart.

      Additional info:

      Logs of the previous (failed) haproxy container:
      $ oc logs -n openshift-ingress router-default-8654d4ff7c-qns7r --previous 
      ...
      I0316 13:24:04.427203       1 router.go:618] template "msg"="router reloaded" "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
      I0316 13:24:09.440967       1 router.go:618] template "msg"="router reloaded" "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
      I0316 13:24:14.374609       1 router.go:618] template "msg"="router reloaded" "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
      I0316 13:24:36.010223       1 router.go:618] template "msg"="router reloaded" "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
      I0316 13:24:41.000160       1 router.go:618] template "msg"="router reloaded" "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
      I0316 13:24:51.752128       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I0316 13:24:51.752132       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I0316 13:25:01.763238       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I0316 13:25:01.763518       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I0316 13:25:09.323401       1 router.go:618] template "msg"="router reloaded" "output"=" - Checking http://localhost:80 ...\n - Health check ok : 1 retry attempt(s).\n"
      I0316 13:25:11.752341       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I0316 13:25:11.752720       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I0316 13:25:12.907302       1 healthz.go:261] backend-http check failed: healthz
      [-]backend-http failed: backend reported failure
      I0316 13:25:14.434428       1 template.go:704] router "msg"="Shutdown requested, waiting 45s for new connections to cease" 
      I0316 13:25:19.391630       1 router.go:618] template "msg"="router reloaded" "output"=" - Checking http://localhost:80 ...\n - Health check ok : 1 retry attempt(s).\n"
      I0316 13:25:19.895977       1 healthz.go:261] process-running check failed: healthz
      [-]process-running failed: process is terminating 
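
      When the backend-http check starts failing, it may help to hit the router's health endpoints directly from inside the pod, to see whether HAProxy itself is wedged or only the healthz handler is slow. This is a sketch that assumes the default probe port 1936 and that curl is available in the router image:

      $ oc -n openshift-ingress rsh deployment/router-default \
          curl -sS -o /dev/null -w '%{http_code}\n' http://localhost:1936/healthz/ready
      $ oc -n openshift-ingress rsh deployment/router-default \
          curl -sS -o /dev/null -w '%{http_code}\n' http://localhost:80/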
      
      Evidence that the router pod restarts frequently:
      $ oc_ingress_get_router_pods 
      NAME                              READY   STATUS    RESTARTS       AGE
      router-default-8654d4ff7c-qns7r   1/1     Running   3 (3h3m ago)   24h
      router-default-8654d4ff7c-rdq68   1/1     Running   0     
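
      Note: oc_ingress_get_router_pods is a local shell alias whose definition is not shown; assuming it simply lists the router pods, a plain oc equivalent would be:

      $ oc -n openshift-ingress get pods -o wide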

      Attachments

        Activity

          People

            mmasters1@redhat.com Miciah Masters
            gspence@redhat.com Grant Spence
            Hongan Li
            Votes: 0
            Watchers: 4
