RFE-8027 (OpenShift Request For Enhancement)

Subscription processing issues are hard to debug

    • Feature Request
    • Resolution: Unresolved
    • 4.17
    • OLM
    • Quality / Stability / Reliability

      Description of problem:

      Subscription processing issues are hard to debug    

      Version-Release number of selected component (if applicable):

          

      How reproducible:

      Always. Whenever an issue of this type occurs, it is hard to debug.

      Steps to Reproduce:

          1. Have a slow cluster.
          2. Create a CatalogSource, an OperatorGroup and a Subscription to install an operator
          3. Wait for the OLM Job to time out   

      Actual results:

      1. The hex-string-named pod is gone, so it's not possible at this point to figure out what it was stuck on for 10 minutes
      2. The Conditions in the kubectl describe subscription output are a mess. They are all lumped together, so it's hard to see which field belongs to which condition, and it's not possible to tell which conditions reflect the current state and which are stale.
      3. It is not possible to extend the deadline of the job or to have it retain the pod for inspection.

      Expected results:

      1. Clear, actionable status information about the cause for failure, down to the root cause.
      2. Ability to tweak things to extend deadline, retain pods, etc.

      Additional info:

      In the case I'm facing, the operator index pod looks healthy within 2 minutes of starting, and the CatalogSource status agrees.
      
      Conditions on the subscription resource are... unclear. Stale? How can the sources be all healthy and one of them unreachable at the same time?
      
        Conditions:
          Last Transition Time:  2025-07-24T09:29:28Z
          Message:               all available catalogsources are healthy
          Reason:                AllCatalogSourcesHealthy
          Status:                False
          Type:                  CatalogSourcesUnhealthy
          Message:               error using catalogsource stackrox-operator/stackrox-operator-test-index: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 172.30.4.208:50051: connect: connection refused"
          Reason:                ErrorPreventedResolution
          Status:                True
          Type:                  ResolutionFailed
          Reason:                UnpackingInProgress
          Status:                True
          Type:                  BundleUnpacking
          Message:               bundle unpacking failed. Reason: DeadlineExceeded, and Message: Job was active longer than specified deadline
          Reason:                BundleUnpackFailed
          Status:                True
          Type:                  BundleUnpackFailed
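
To illustrate the kind of per-condition view that would make this output easier to read, here is a minimal Python sketch that prints one line per condition. It assumes the conditions are available as JSON (for example from kubectl get subscription with -o json) and use the usual Kubernetes condition field names (type, status, reason, message, lastTransitionTime). The SAMPLE data is transcribed from the output above, and the grouping of fields into four conditions is my reading of it (each group ends at its Type line), not something the output states explicitly. The function name format_conditions is hypothetical.

```python
# SAMPLE mirrors the describe output above. The split into four conditions
# is an assumption: each group is taken to end at its "Type" line.
SAMPLE = {
    "status": {
        "conditions": [
            {
                "lastTransitionTime": "2025-07-24T09:29:28Z",
                "message": "all available catalogsources are healthy",
                "reason": "AllCatalogSourcesHealthy",
                "status": "False",
                "type": "CatalogSourcesUnhealthy",
            },
            {
                "message": 'error using catalogsource stackrox-operator/stackrox-operator-test-index: '
                           'failed to list bundles: rpc error: code = Unavailable desc = connection error: '
                           'desc = "transport: Error while dialing: dial tcp 172.30.4.208:50051: '
                           'connect: connection refused"',
                "reason": "ErrorPreventedResolution",
                "status": "True",
                "type": "ResolutionFailed",
            },
            {
                "reason": "UnpackingInProgress",
                "status": "True",
                "type": "BundleUnpacking",
            },
            {
                "message": "bundle unpacking failed. Reason: DeadlineExceeded, and Message: "
                           "Job was active longer than specified deadline",
                "reason": "BundleUnpackFailed",
                "status": "True",
                "type": "BundleUnpackFailed",
            },
        ]
    }
}


def format_conditions(subscription: dict) -> list[str]:
    """Return one summary line per condition, followed by its message if any."""
    lines = []
    for cond in subscription.get("status", {}).get("conditions", []):
        lines.append(
            f'{cond.get("type", "?")}={cond.get("status", "?")}'
            f' reason={cond.get("reason", "-")}'
            f' since={cond.get("lastTransitionTime", "-")}'
        )
        message = cond.get("message")
        if message:
            # Indent the message so it visibly belongs to the line above.
            lines.append(f"    {message}")
    return lines


if __name__ == "__main__":
    print("\n".join(format_conditions(SAMPLE)))
```

With the sample above, the missing timestamps on three of the four conditions become obvious (since=-), which is one reason it is hard to tell current conditions from stale ones. To run this against a live cluster, SAMPLE could be replaced by the parsed JSON of the Subscription.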
      
      The job decided to remove the pod after 10 minutes:

        Type     Reason            Age   From            Message
        ----     ------            ----  ----            -------
        Normal   SuccessfulCreate  14m   job-controller  Created pod: 9ac57f1c7f9b5705fa3ec9f16aed4e3b7ed23d28d5729f0bea6aeb146fqhtmd
        Normal   SuccessfulDelete  4m7s  job-controller  Deleted pod: 9ac57f1c7f9b5705fa3ec9f16aed4e3b7ed23d28d5729f0bea6aeb146fqhtmd
        Warning  DeadlineExceeded  4m7s  job-controller  Job was active longer than specified deadline
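
As a quick check of the "10 minutes" figure, the pod lifetime can be estimated from the Age column of the events above. This is a rough sketch: the helper name age_to_seconds is hypothetical, and kubectl rounds ages (14m may hide up to 59 extra seconds), so the result is only approximate.

```python
import re

# Seconds per unit for kubectl-style ages such as "14m" or "4m7s".
_UNITS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def age_to_seconds(age: str) -> int:
    """Convert a kubectl-style age string into seconds."""
    parts = re.findall(r"(\d+)([dhms])", age)
    # Reject input that is not made up entirely of <number><unit> pairs.
    if not parts or "".join(n + u for n, u in parts) != age:
        raise ValueError(f"unrecognised age: {age!r}")
    return sum(int(n) * _UNITS[u] for n, u in parts)


created = age_to_seconds("14m")    # SuccessfulCreate event
deleted = age_to_seconds("4m7s")   # SuccessfulDelete event
lifetime = created - deleted
print(f"pod lived for roughly {lifetime // 60}m{lifetime % 60}s")
```

The estimate comes out just under 10 minutes, consistent with the DeadlineExceeded event once the rounding of the 14m age is taken into account.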

      Actual logs.

      Slack thread.

              rh-ee-cchantse Catherine Chan-Tse
              mowsiany@redhat.com Marcin Owsiany