OpenShift Bugs / OCPBUGS-27234

Catalog pod health probes have significant delay, reaching timeout

    • Bug
    • Resolution: Done
    • Major
    • None
    • 4.12.z
    • OLM / Registry
    • None
    • Critical
    • No
    • False

      Description of problem:

          Catalog pods are failing in this cluster, as the probes using `grpc_health_probe -addr=:50051` are taking too long and reaching their timeout.
      
      Note that the probes do *succeed*; they just take too long to complete. The verbose log indicates that

      Version-Release number of selected component (if applicable):

          4.12.27

      How reproducible:

          Ongoing with every catalog pod in this cluster; so far not seen in other clusters

      Steps to Reproduce:

          1. Marketplace operator starts catalog pods
          2. Catalog pod is scheduled and begins Running
          

      Actual results:

          Probes fail, eventually causing the pod to enter CrashLoopBackOff
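
To rule out a misconfigured probe, it can help to read the configured timeouts straight from the pod spec. The following is a minimal sketch, not something taken from this report: `probe_timeouts` is a hypothetical helper name, and the `openshift-marketplace` default namespace is an assumption about where the catalog pods run in this cluster.

```shell
# Hypothetical helper: print timeoutSeconds for each Kubernetes probe type
# on every container of a pod.
# Assumption: catalog pods live in openshift-marketplace unless $2 says otherwise.
probe_timeouts() {
  pod=$1
  ns=${2:-openshift-marketplace}
  for probe in startupProbe livenessProbe readinessProbe; do
    printf '%s: ' "$probe"
    # jsonpath selects the field across all containers in the pod spec
    oc get pod "$pod" -n "$ns" \
      -o jsonpath="{.spec.containers[*].${probe}.timeoutSeconds}"
    printf '\n'
  done
}
```

For example, `probe_timeouts certified-operators-42k78` would print one line per probe type; an empty value suggests that probe type is not defined on the container.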

      Expected results:

          Probes succeed, pod is marked Ready

      Additional info:

          Running the probe manually appears to show it waiting for some time before it completes, which is long enough to reach the timeout:
      $ oc rsh certified-operators-42k78
      sh-4.4$ bash
      bash-4.4$ grpc_health_probe -addr=:50051 -v
      parsed options:
      > addr=:50051 conn_timeout=1s rpc_timeout=1s
      > tls=false
      > alts=false
      > spiffe=false
      establishing connection
      connection established (took 725.504µs)
      time elapsed: connect=725.504µs rpc=839.717µs
      status: SERVING 
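
The `connect=` and `rpc=` figures above are reported by the probe itself and are well under a millisecond, so they would not capture any waiting that happens outside those two phases. One way to quantify the total delay is to measure the wall-clock time of the whole command. The sketch below assumes GNU `date` (for `%N`) is available in the container; `elapsed_ms` is a hypothetical helper, not part of `grpc_health_probe`.

```shell
# Hypothetical helper: run a command, pass its output through, then print
# its exit status and total wall-clock time in milliseconds.
# Assumption: GNU date, which supports %N (nanoseconds).
elapsed_ms() {
  start=$(date +%s%N)          # nanoseconds since the epoch, before the run
  rc=0
  "$@" || rc=$?                # run the command and keep its exit status
  end=$(date +%s%N)            # nanoseconds since the epoch, after the run
  echo "exit=${rc} wall_ms=$(( (end - start) / 1000000 ))"
  return "$rc"
}
```

Inside the pod this could be run as `elapsed_ms grpc_health_probe -addr=:50051 -v`. A `wall_ms` value far larger than the reported `connect` and `rpc` times would suggest that the delay lies outside the connection and the RPC themselves, although this sketch cannot say where.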

            rh-ee-cchantse Catherine Chan-Tse
            rhn-support-jorbell Jordan Bell
            Jia Fan Jia Fan