Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-57118

certified-operators are failing regularly due to startup probe timing out frequently and generating alert for KubePodCrashLooping

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Unresolved
    • Icon: Major Major
    • None
    • 4.16
    • OLM
    • Quality / Stability / Reliability
    • False
    • Hide

      None

      Show
      None
    • None
    • Important
    • None
    • None
    • None
    • Rejected
    • None
    • None
    • None
    • None
    • None

      Description of problem:

      certified-operators part of the marketplace namespace is failing frequently due to startup probe 
      
      
      2h19m       Normal    Created          pod/certified-operators-tkh2b   Created container registry-server
      2h19m       Normal    Started          pod/certified-operators-tkh2b   Started container registry-server
      2h18m       Warning   Unhealthy        pod/certified-operators-tkh2b   Startup probe failed: timeout: failed to connect service ":50051" within 1s
      2h18m       Normal    Killing          pod/certified-operators-tkh2b   Stopping container registry-server
      2h18m       Warning   Unhealthy        pod/certified-operators-tkh2b   Readiness probe errored: rpc error: code = Unknown desc = command error: cannot register an exec PID: container is stopping, stdout: , stderr: , exit code -1
      
          livenessProbe:
            exec:
              command:
              - grpc_health_probe
              - -addr=:50051
            failureThreshold: 3
            initialDelaySeconds: 10
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 5
          name: registry-server
          ports:
          - containerPort: 50051
            name: grpc
            protocol: TCP
          readinessProbe:
            exec:
              command:
              - grpc_health_probe
              - -addr=:50051
            failureThreshold: 3
            initialDelaySeconds: 5
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 5
      ---
      
      Alert:
      
      Labels:
      
      alertname = KubePodCrashLooping
      container = registry-server
      endpoint = https-main
      job = kube-state-metrics
      managed_cluster = e74350bb-dc29-4716-860d-51269c4ef5d0
      namespace = openshift-marketplace
      openshift_io_alert_source = platform
      pod = community-operators-j2g4v
      prometheus = openshift-monitoring/k8s
      reason = CrashLoopBackOffservice = kube-state-metrics
      severity = warninguid = 8d236469-bec3-4423-9662-d0bf7fc14826

      Version-Release number of selected component (if applicable):

          

      How reproducible:

          Always

      Steps to Reproduce:

          1.
          2.
          3.
          

      Actual results:

       certified-operators failing leading alertname = KubePodCrashLooping.

      Expected results:

          certified-operators should not fail and not generate the alert. 

      Additional info:

      The pods redhat-marketplace is also restarting due to probe failure. 

       

              rh-ee-cchantse Catherine Chan-Tse
              rhn-support-shishind Shivam shinde
              None
              None
              Xia Zhao Xia Zhao
              None
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

                Created:
                Updated: